2021-11-15 13:37:13

by Abhishek Sahu

Subject: [RFC 0/3] vfio/pci: Enable runtime power management support

From: Abhishek Sahu <[email protected]>

Currently, the upstream vfio-pci driver has very limited power
management support. If a vfio-pci device has no user, it is moved into
the D3hot state. Similarly, if runtime power management is enabled for
the vfio-pci device in the guest OS, the device is runtime suspended
(for a Linux guest OS) and the PCI device is put into the D3hot state
(in vfio_pm_config_write()). Using D3cold instead of D3hot would save
the maximum amount of power, but D3cold cannot be reached through
native PCI PM alone: it requires interaction with platform firmware,
which is system-specific. To enter low power states (including
D3cold), the runtime PM framework can be used; it internally interacts
with the PCI core and platform firmware and puts the device into the
lowest possible D-state. This patch series registers the vfio-pci
driver with the runtime PM framework and uses it to move the physical
PCI device into a low power state.
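In driver terms this is the usual runtime PM usage-count pattern. A
minimal sketch of what the series does (illustrative only, not the
exact patch code):

  /* at registration: enable runtime PM and let the idle device suspend */
  pm_runtime_set_autosuspend_delay(&pdev->dev, 1000);
  pm_runtime_use_autosuspend(&pdev->dev);
  pm_runtime_allow(&pdev->dev);
  pm_runtime_put_autosuspend(&pdev->dev);

  /* on open by a user: bring the device back to D0 and keep it there */
  pm_runtime_get_sync(&pdev->dev);

  /* on release: drop the usage count so the PM core can pick the
   * lowest reachable D-state (D3cold where the platform supports it)
   */
  pm_runtime_mark_last_busy(&pdev->dev);
  pm_runtime_put_autosuspend(&pdev->dev);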

The current PM support was added with commit 6eb7018705de ("vfio-pci:
Move idle devices to D3hot power state"), where the following point was
made regarding the D3cold state:

"It's tempting to try to use D3cold, but we have no reason to inhibit
hotplug of idle devices and we might get into a loop of having the
device disappear before we have a chance to try to use it."

With runtime PM, if the user wants to prevent a device from going into
D3cold, then /sys/bus/pci/devices/.../d3cold_allowed can be set to 0
for that particular device, instead of disallowing the D3cold state
for all devices unconditionally.
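For example, to keep D3cold disabled for one particular device (the
BDF below is just a placeholder):

# echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/d3cold_allowed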

Abhishek Sahu (3):
vfio/pci: register vfio-pci driver with runtime PM framework
vfio/pci: virtualize PME related registers bits and initialize to zero
vfio/pci: use runtime PM for vfio-device into low power state

drivers/vfio/pci/vfio_pci.c | 3 +
drivers/vfio/pci/vfio_pci_config.c | 210 ++++++++++++++++++++++++++---
drivers/vfio/pci/vfio_pci_core.c | 166 ++++++++++++-----------
include/linux/vfio_pci_core.h | 7 +-
4 files changed, 288 insertions(+), 98 deletions(-)

--
2.17.1



2021-11-15 13:37:41

by Abhishek Sahu

Subject: [RFC 3/3] vfio/pci: use runtime PM for vfio-device into low power state

From: Abhishek Sahu <[email protected]>

Currently, if runtime power management is enabled for the vfio-pci
device in the guest OS, then the guest OS will write the PCI_PM_CTRL
register. This write request is handled in vfio_pm_config_write(),
which performs the actual PCI_PM_CTRL register write. With this, D3hot
is the lowest power state that can be reached. If the runtime PM
framework is used instead, the D3cold state can be reached, which
helps in saving the maximum amount of power.

This patch uses the runtime PM framework whenever the vfio-pci device
is to be put into a low power state.

1. If runtime PM is enabled, then instead of directly writing the
PCI_PM_CTRL register, decrement the device usage counter whenever the
requested power state is non-D0. The kernel runtime PM framework will
put the PCI device into a low power state once the device usage
counter reaches zero. Similarly, when the power state is changed back
to D0, increment the device usage counter and the kernel runtime PM
framework will bring the PCI device out of the low power state.

2. The guest OS reads the PCI_PM_CTRL register back to confirm the
current power state, so virtual register bits can be used. For this,
before decrementing the usage count, read the actual PCI_PM_CTRL
register, update the power state related bits, and then update the
vconfig bits corresponding to the PCI_PM_CTRL offset. For a
PCI_PM_CTRL register read, return the virtual value if runtime PM is
in use. These vconfig bits are cleared when the power state is changed
back to D0.

3. From the guest OS point of view, the PCI power state is still D3hot
even if the actual PCI device has been put into the D3cold state. In
the D3hot state, the config space can still be read. So, if there is
any request from the guest OS to read or write the config space, the
actual PCI device needs to be moved back to D0. For this, increment
the device usage count before reading/writing the config space and
decrement it again afterwards. There can be back-to-back config
register read/write requests, but since the autosuspend methods are
used here, only the first access wakes up the PCI device; for the
remaining accesses the device is already active.

4. The above-mentioned wake-up is not needed if the register
read/write is handled entirely with virtual bits. To handle this
condition, the actual resume of the device is only done in
vfio_user_config_read()/vfio_user_config_write(). All config register
accesses come through vfio_pci_config_rw(), so in this function use
pm_runtime_get_noresume() so that only the usage count is incremented
without resuming the device. Inside
vfio_user_config_read()/vfio_user_config_write(), use the helper
routines built around pm_runtime_put_noidle() so that the PCI device
won't be suspended in the lower level functions. Then, at the top
level in vfio_pci_config_rw(), use pm_runtime_put_autosuspend(): the
autosuspend timer is started and the device will be suspended again.
If the device is already runtime suspended, this routine returns
early. (A simplified call flow is sketched after point 7 below.)

5. On the host side, D3cold will only be used if the platform supports
it; otherwise some other state will be used. The config space can be
read as long as the device is not in D3cold, so in that case resuming
the PCI device can be skipped. The wrapper function
vfio_pci_config_pm_runtime_get() takes care of this condition and
invokes pm_runtime_resume() only if the current power state is D3cold.

6. For vfio_pci_config_pm_runtime_get()/vfio_pci_config_pm_runtime_put(),
the reference code is taken from
pci_config_pm_runtime_get()/pci_config_pm_runtime_put() and then
modified according to the vfio-pci driver requirements.

7. vfio_pci_set_power_state() will be unused after moving to runtime
PM, so this function can be removed along with the other related
functions and structure fields.
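
For reference, the resulting config space access flow looks roughly
like this (simplified from the patch below; only the runtime-suspended
case is shown):

  vfio_pci_config_rw()
    pm_runtime_get_noresume()             // bump usage count, no wake-up
    vfio_config_do_rw()
      vfio_user_config_read()/vfio_user_config_write()
        vfio_pci_config_pm_runtime_get()  // pm_runtime_resume() only if D3cold
        pci_user_read/write_config_*()
        vfio_pci_config_pm_runtime_put()  // pm_runtime_put_noidle(), no suspend
    pm_runtime_put_autosuspend()          // arm the autosuspend timer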

Signed-off-by: Abhishek Sahu <[email protected]>
---
drivers/vfio/pci/vfio_pci_config.c | 178 ++++++++++++++++++++++++++---
drivers/vfio/pci/vfio_pci_core.c | 64 ++---------
include/linux/vfio_pci_core.h | 3 +-
3 files changed, 174 insertions(+), 71 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index fb3a503a5b99..5576eb4308b4 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -25,6 +25,7 @@
#include <linux/uaccess.h>
#include <linux/vfio.h>
#include <linux/slab.h>
+#include <linux/pm_runtime.h>

#include <linux/vfio_pci_core.h>

@@ -119,12 +120,51 @@ struct perm_bits {
#define NO_WRITE 0
#define ALL_WRITE 0xFFFFFFFFU

-static int vfio_user_config_read(struct pci_dev *pdev, int offset,
+static void vfio_pci_config_pm_runtime_get(struct pci_dev *pdev)
+{
+ struct device *dev = &pdev->dev;
+ struct device *parent = dev->parent;
+
+ if (parent)
+ pm_runtime_get_sync(parent);
+
+ pm_runtime_get_noresume(dev);
+ /*
+ * pdev->current_state is set to PCI_D3cold during suspending,
+ * so wait until suspending completes
+ */
+ pm_runtime_barrier(dev);
+ /*
+ * Only need to resume devices in D3cold, because config
+ * registers are still accessible for devices suspended but
+ * not in D3cold.
+ */
+ if (pdev->current_state == PCI_D3cold)
+ pm_runtime_resume(dev);
+}
+
+static void vfio_pci_config_pm_runtime_put(struct pci_dev *pdev)
+{
+ struct device *dev = &pdev->dev;
+ struct device *parent = dev->parent;
+
+ pm_runtime_mark_last_busy(dev);
+ pm_runtime_put_noidle(dev);
+
+ if (parent)
+ pm_runtime_put_noidle(parent);
+}
+
+static int vfio_user_config_read(struct vfio_pci_core_device *vdev, int offset,
__le32 *val, int count)
{
+ struct pci_dev *pdev = vdev->pdev;
int ret = -EINVAL;
u32 tmp_val = 0;

+ if (vdev->pm_runtime_suspended)
+ vfio_pci_config_pm_runtime_get(pdev);
+
switch (count) {
case 1:
{
@@ -147,15 +187,22 @@ static int vfio_user_config_read(struct pci_dev *pdev, int offset,

*val = cpu_to_le32(tmp_val);

+ if (vdev->pm_runtime_suspended)
+ vfio_pci_config_pm_runtime_put(pdev);
+
return ret;
}

-static int vfio_user_config_write(struct pci_dev *pdev, int offset,
+static int vfio_user_config_write(struct vfio_pci_core_device *vdev, int offset,
__le32 val, int count)
{
+ struct pci_dev *pdev = vdev->pdev;
int ret = -EINVAL;
u32 tmp_val = le32_to_cpu(val);

+ if (vdev->pm_runtime_suspended)
+ vfio_pci_config_pm_runtime_get(pdev);
+
switch (count) {
case 1:
ret = pci_user_write_config_byte(pdev, offset, tmp_val);
@@ -168,6 +215,9 @@ static int vfio_user_config_write(struct pci_dev *pdev, int offset,
break;
}

+ if (vdev->pm_runtime_suspended)
+ vfio_pci_config_pm_runtime_put(pdev);
+
return ret;
}

@@ -183,11 +233,10 @@ static int vfio_default_config_read(struct vfio_pci_core_device *vdev, int pos,

/* Any non-virtualized bits? */
if (cpu_to_le32(~0U >> (32 - (count * 8))) != virt) {
- struct pci_dev *pdev = vdev->pdev;
__le32 phys_val = 0;
int ret;

- ret = vfio_user_config_read(pdev, pos, &phys_val, count);
+ ret = vfio_user_config_read(vdev, pos, &phys_val, count);
if (ret)
return ret;

@@ -224,18 +273,17 @@ static int vfio_default_config_write(struct vfio_pci_core_device *vdev, int pos,

/* Non-virtualzed and writable bits go to hardware */
if (write & ~virt) {
- struct pci_dev *pdev = vdev->pdev;
__le32 phys_val = 0;
int ret;

- ret = vfio_user_config_read(pdev, pos, &phys_val, count);
+ ret = vfio_user_config_read(vdev, pos, &phys_val, count);
if (ret)
return ret;

phys_val &= ~(write & ~virt);
phys_val |= (val & (write & ~virt));

- ret = vfio_user_config_write(pdev, pos, phys_val, count);
+ ret = vfio_user_config_write(vdev, pos, phys_val, count);
if (ret)
return ret;
}
@@ -250,7 +298,7 @@ static int vfio_direct_config_read(struct vfio_pci_core_device *vdev, int pos,
{
int ret;

- ret = vfio_user_config_read(vdev->pdev, pos, val, count);
+ ret = vfio_user_config_read(vdev, pos, val, count);
if (ret)
return ret;

@@ -275,7 +323,7 @@ static int vfio_raw_config_write(struct vfio_pci_core_device *vdev, int pos,
{
int ret;

- ret = vfio_user_config_write(vdev->pdev, pos, val, count);
+ ret = vfio_user_config_write(vdev, pos, val, count);
if (ret)
return ret;

@@ -288,7 +336,7 @@ static int vfio_raw_config_read(struct vfio_pci_core_device *vdev, int pos,
{
int ret;

- ret = vfio_user_config_read(vdev->pdev, pos, val, count);
+ ret = vfio_user_config_read(vdev, pos, val, count);
if (ret)
return ret;

@@ -692,13 +740,86 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
return 0;
}

+static int vfio_perform_runtime_pm(struct vfio_pci_core_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, __le32 val, pci_power_t state)
+{
+ /*
+ * If runtime PM is enabled, then instead of directly writing
+ * PCI_PM_CTRL register, decrement the device usage counter whenever
+ * the power state is non-D0. The kernel runtime PM framework will
+ * itself put the PCI device in the low power state when device usage
+ * counter will become zero. The guest OS will read the PCI_PM_CTRL
+ * register back to confirm the current power state so virtual
+ * register bits can be used. For this, read the actual PCI_PM_CTRL
+ * register, update the power state related bits and then update the
+ * vconfig bits corresponding to PCI_PM_CTRL offset. If the
+ * pm_runtime_suspended status is set, then return the virtual
+ * register value for PCI_PM_CTRL register read. All the bits
+ * in PCI_PM_CTRL are being returned from virtual config, so that
+ * this register read will not wake up the PCI device from suspended
+ * state.
+ *
+ * Once power state will be changed back to D0, then clear the power
+ * state related bits in vconfig. After this, increment the device
+ * usage counter which will internally wake up the PCI device from
+ * suspended state.
+ */
+ if (state != PCI_D0 && !vdev->pm_runtime_suspended) {
+ __le32 virt_val = 0;
+
+ count = vfio_default_config_write(vdev, pos, count, perm,
+ offset, val);
+ if (count < 0)
+ return count;
+
+ vfio_default_config_read(vdev, pos, 4, perm, offset, &virt_val);
+ virt_val &= ~cpu_to_le32(PCI_PM_CTRL_STATE_MASK);
+ virt_val |= (val & cpu_to_le32(PCI_PM_CTRL_STATE_MASK));
+ memcpy(vdev->vconfig + pos, &virt_val, 4);
+ vdev->pm_runtime_suspended = true;
+ pm_runtime_mark_last_busy(&vdev->pdev->dev);
+ pm_runtime_put_autosuspend(&vdev->pdev->dev);
+ return count;
+ }
+
+ if (vdev->pm_runtime_suspended && state == PCI_D0) {
+ vdev->pm_runtime_suspended = false;
+ *(__le16 *)&vdev->vconfig[pos] &=
+ ~cpu_to_le16(PCI_PM_CTRL_STATE_MASK);
+ pm_runtime_get_sync(&vdev->pdev->dev);
+ }
+
+ return vfio_default_config_write(vdev, pos, count, perm, offset, val);
+}
+
+static int vfio_pm_config_read(struct vfio_pci_core_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, __le32 *val)
+{
+ /*
+ * If pm_runtime_suspended status is set, then return the virtual
+ * register value for PCI_PM_CTRL register read.
+ */
+ if (vdev->pm_runtime_suspended &&
+ offset >= PCI_PM_CTRL && offset < (PCI_PM_CTRL + 4)) {
+ memcpy(val, vdev->vconfig + pos, count);
+ return count;
+ }
+
+ return vfio_default_config_read(vdev, pos, count, perm, offset, val);
+}
+
static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
int count, struct perm_bits *perm,
int offset, __le32 val)
{
- count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
- if (count < 0)
- return count;
+ if (offset != PCI_PM_CTRL) {
+ count = vfio_default_config_write(vdev, pos, count, perm,
+ offset, val);
+ if (count < 0)
+ return count;
+ }

if (offset == PCI_PM_CTRL) {
pci_power_t state;
@@ -718,7 +839,8 @@ static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
break;
}

- vfio_pci_set_power_state(vdev, state);
+ count = vfio_perform_runtime_pm(vdev, pos, count, perm,
+ offset, val, state);
}

return count;
@@ -731,6 +853,7 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
return -ENOMEM;

perm->writefn = vfio_pm_config_write;
+ perm->readfn = vfio_pm_config_read;

/*
* We always virtualize the next field so we can remove
@@ -1921,13 +2044,31 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
size_t done = 0;
int ret = 0;
loff_t pos = *ppos;
+ bool runtime_put_required = false;

pos &= VFIO_PCI_OFFSET_MASK;

+ /*
+ * For virtualized bits read/write, the device should not be resumed.
+ * Increment the device usage count alone so that the device won't be
+ * runtime suspended during config read/write.
+ */
+ if (vdev->pm_runtime_suspended) {
+ pm_runtime_get_noresume(&vdev->pdev->dev);
+ runtime_put_required = true;
+ }
+
while (count) {
ret = vfio_config_do_rw(vdev, buf, count, &pos, iswrite);
- if (ret < 0)
+ if (ret < 0) {
+ /*
+ * Decrement the device usage counter corresponding to
+ * previous pm_runtime_get_noresume().
+ */
+ if (runtime_put_required)
+ pm_runtime_put_autosuspend(&vdev->pdev->dev);
return ret;
+ }

count -= ret;
done += ret;
@@ -1937,5 +2078,12 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,

*ppos += done;

+ /*
+ * Decrement the device usage counter corresponding to previous
+ * pm_runtime_get_noresume().
+ */
+ if (runtime_put_required)
+ pm_runtime_put_autosuspend(&vdev->pdev->dev);
+
return done;
}
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 511a52e92b32..79fa86914b6c 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -187,57 +187,6 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
return false;
}

-static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
-{
- struct pci_dev *pdev = vdev->pdev;
- u16 pmcsr;
-
- if (!pdev->pm_cap)
- return;
-
- pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);
-
- vdev->needs_pm_restore = !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET);
-}
-
-/*
- * pci_set_power_state() wrapper handling devices which perform a soft reset on
- * D3->D0 transition. Save state prior to D0/1/2->D3, stash it on the vdev,
- * restore when returned to D0. Saved separately from pci_saved_state for use
- * by PM capability emulation and separately from pci_dev internal saved state
- * to avoid it being overwritten and consumed around other resets.
- */
-int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state)
-{
- struct pci_dev *pdev = vdev->pdev;
- bool needs_restore = false, needs_save = false;
- int ret;
-
- if (vdev->needs_pm_restore) {
- if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
- pci_save_state(pdev);
- needs_save = true;
- }
-
- if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
- needs_restore = true;
- }
-
- ret = pci_set_power_state(pdev, state);
-
- if (!ret) {
- /* D3 might be unsupported via quirk, skip unless in D3 */
- if (needs_save && pdev->current_state >= PCI_D3hot) {
- vdev->pm_save = pci_store_saved_state(pdev);
- } else if (needs_restore) {
- pci_load_and_free_saved_state(pdev, &vdev->pm_save);
- pci_restore_state(pdev);
- }
- }
-
- return ret;
-}
-
int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
{
struct pci_dev *pdev = vdev->pdev;
@@ -323,6 +272,16 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
/* For needs_reset */
lockdep_assert_held(&vdev->vdev.dev_set->lock);

+ /*
+ * The vfio device user can close the device after putting the device
+ * into runtime suspended state so wake up the device first in
+ * this case.
+ */
+ if (vdev->pm_runtime_suspended) {
+ vdev->pm_runtime_suspended = false;
+ pm_runtime_get_sync(&pdev->dev);
+ }
+
/* Stop the device from further DMA */
pci_clear_master(pdev);

@@ -1809,7 +1768,6 @@ void vfio_pci_core_uninit_device(struct vfio_pci_core_device *vdev)
mutex_destroy(&vdev->vma_lock);
vfio_uninit_group_dev(&vdev->vdev);
kfree(vdev->region);
- kfree(vdev->pm_save);
}
EXPORT_SYMBOL_GPL(vfio_pci_core_uninit_device);

@@ -1855,8 +1813,6 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
if (ret)
goto out_vf;

- vfio_pci_probe_power_state(vdev);
-
/*
* pci-core sets the device power state to an unknown value at
* bootup and after being removed from a driver. The only
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index aafe09c9fa64..2b1a556ce73f 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -123,9 +123,8 @@ struct vfio_pci_core_device {
bool has_vga;
bool needs_reset;
bool nointx;
- bool needs_pm_restore;
+ bool pm_runtime_suspended;
struct pci_saved_state *pci_saved_state;
- struct pci_saved_state *pm_save;
int ioeventfds_nr;
struct eventfd_ctx *err_trigger;
struct eventfd_ctx *req_trigger;
--
2.17.1


2021-11-15 13:37:45

by Abhishek Sahu

Subject: [RFC 2/3] vfio/pci: virtualize PME related registers bits and initialize to zero

From: Abhishek Sahu <[email protected]>

If a PME event is generated by the PCI device, it is mostly handled in
the host by the root port PME code. For example, in the case of PCIe,
the PME event is sent to the root port and then a PME interrupt is
generated. This is handled in drivers/pci/pcie/pme.c on the host side,
where pci_check_pme_status() is called and the PME_Status and PME_En
bits are cleared. So the guest OS which is using the vfio-pci device
will never come to know about this PME event.

Handling these PME events inside the guest would require some framework
to forward them to the virtual machine monitor. Until then, we can
virtualize the PME related register bits and initialize these bits to
zero, so the vfio-pci device user will assume that the device is not
capable of asserting the PME# signal from any power state.

Signed-off-by: Abhishek Sahu <[email protected]>
---
drivers/vfio/pci/vfio_pci_config.c | 32 +++++++++++++++++++++++++++++-
1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 6e58b4bf7a60..fb3a503a5b99 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -738,12 +738,27 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
*/
p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);

+ /*
+ * The guests can't process PME events. If any PME event will be
+ * generated, then it will be mostly handled in the host and the
+ * host will clear the PME_STATUS. So virtualize PME_Support bits.
+ * It will be initialized to zero later on.
+ */
+ p_setw(perm, PCI_PM_PMC, PCI_PM_CAP_PME_MASK, NO_WRITE);
+
/*
* Power management is defined *per function*, so we can let
* the user change power state, but we trap and initiate the
* change ourselves, so the state bits are read-only.
+ *
+ * The guest can't process PME from D3cold so virtualize PME_Status
+ * and PME_En bits. It will be initialized to zero later on.
*/
- p_setd(perm, PCI_PM_CTRL, NO_VIRT, ~PCI_PM_CTRL_STATE_MASK);
+ p_setd(perm, PCI_PM_CTRL,
+ PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS,
+ ~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS |
+ PCI_PM_CTRL_STATE_MASK));
+
return 0;
}

@@ -1412,6 +1427,18 @@ static int vfio_ext_cap_len(struct vfio_pci_core_device *vdev, u16 ecap, u16 epo
return 0;
}

+static void vfio_update_pm_vconfig_bytes(struct vfio_pci_core_device *vdev,
+ int offset)
+{
+ /* initialize virtualized PME_Support bits to zero */
+ *(__le16 *)&vdev->vconfig[offset + PCI_PM_PMC] &=
+ ~cpu_to_le16(PCI_PM_CAP_PME_MASK);
+
+ /* initialize virtualized PME_Status and PME_En bits to zero */
+ *(__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL] &=
+ ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);
+}
+
static int vfio_fill_vconfig_bytes(struct vfio_pci_core_device *vdev,
int offset, int size)
{
@@ -1535,6 +1562,9 @@ static int vfio_cap_init(struct vfio_pci_core_device *vdev)
if (ret)
return ret;

+ if (cap == PCI_CAP_ID_PM)
+ vfio_update_pm_vconfig_bytes(vdev, pos);
+
prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
pos = next;
caps++;
--
2.17.1


2021-11-15 13:38:19

by Abhishek Sahu

Subject: [RFC 1/3] vfio/pci: register vfio-pci driver with runtime PM framework

From: Abhishek Sahu <[email protected]>

Currently, the upstream vfio-pci driver has very limited power
management support. If a vfio-pci device has no user, the PCI device
is moved into the D3hot state by writing directly into the PCI PM
registers. D3hot saves some amount of power, but (nearly) zero power
consumption can only be achieved in the D3cold state. D3cold cannot be
reached through native PCI PM alone: it requires interaction with
platform firmware, which is system-specific. To enter low power states
(including D3cold), the runtime PM framework can be used; it
internally interacts with the PCI core and platform firmware and puts
the device into the lowest possible D-state.

This patch registers the vfio-pci driver with the runtime PM framework.

1. The PCI core framework takes care of most of the runtime PM
handling. To enable runtime PM, the PCI driver needs to decrement the
usage count and register runtime suspend/resume routines. For a
vfio-pci based driver these routines can be stubbed, since the
vfio-pci driver does not do the PCI device initialization. All the
config state saving and PCI power management work is done by the PCI
core framework itself inside its runtime suspend/resume callbacks.

2. To prevent frequent runtime suspend/resume cycles, autosuspend
support is used with a default delay of 1 second.

3. Inside pci_reset_bus(), all the devices in the bus/slot are moved
to the D0 state. This state change to D0 can happen directly without
going through the runtime PM framework, so if runtime PM is enabled,
pm_runtime_resume() is called to make the runtime state active. Since
the PCI device power state is already D0, it returns early when it
tries to change the state with pci_set_power_state(). Then
pm_request_autosuspend() can be used, which internally checks the
device usage count and moves the device into the low power state
again.

4. Inside vfio_pci_core_disable(), the device usage count always needs
to be decremented, since it was incremented in vfio_pci_core_enable()
with pm_runtime_get_sync().

5. Since the runtime PM framework provides the same functionality, the
direct writes into the PCI PM config register can be replaced with
runtime PM routines. In addition, the use of runtime PM allows more
power to be saved.

On systems which do not support D3cold,

With the existing implementation:

// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3hot
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D0

With runtime PM:

// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3hot
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D3hot

So, with runtime PM, the upstream bridge or root port also goes into
a lower power state, which is not possible with the existing
implementation.

On systems which support D3cold,

With the existing implementation:

// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3hot
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D0

With runtime PM:

// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3cold
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D3cold

So, with runtime PM, both the PCI device and the upstream bridge go
into the D3cold state.

6. If the 'disable_idle_d3' module parameter is set, runtime PM is
still enabled, but in this case the usage count should not be
decremented.

7. The return value of vfio_pci_dev_set_try_reset() is now unused, so
the function return type can be changed to void.

Signed-off-by: Abhishek Sahu <[email protected]>
---
drivers/vfio/pci/vfio_pci.c | 3 +
drivers/vfio/pci/vfio_pci_core.c | 104 +++++++++++++++++++++++--------
include/linux/vfio_pci_core.h | 4 ++
3 files changed, 84 insertions(+), 27 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index a5ce92beb655..c8695baf3b54 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -193,6 +193,9 @@ static struct pci_driver vfio_pci_driver = {
.remove = vfio_pci_remove,
.sriov_configure = vfio_pci_sriov_configure,
.err_handler = &vfio_pci_core_err_handlers,
+#if defined(CONFIG_PM)
+ .driver.pm = &vfio_pci_core_pm_ops,
+#endif
};

static void __init vfio_pci_fill_ids(void)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index f948e6cd2993..511a52e92b32 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -152,7 +152,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
}

struct vfio_pci_group_info;
-static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
+static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
struct vfio_pci_group_info *groups);

@@ -245,7 +245,8 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
u16 cmd;
u8 msix_pos;

- vfio_pci_set_power_state(vdev, PCI_D0);
+ if (!disable_idle_d3)
+ pm_runtime_get_sync(&pdev->dev);

/* Don't allow our initial saved state to include busmaster */
pci_clear_master(pdev);
@@ -405,8 +406,17 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
out:
pci_disable_device(pdev);

- if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3)
- vfio_pci_set_power_state(vdev, PCI_D3hot);
+ vfio_pci_dev_set_try_reset(vdev->vdev.dev_set);
+
+ /*
+ * The device usage count always needs to decremented which was
+ * incremented in vfio_pci_core_enable() with
+ * pm_runtime_get_sync().
+ */
+ if (!disable_idle_d3) {
+ pm_runtime_mark_last_busy(&pdev->dev);
+ pm_runtime_put_autosuspend(&pdev->dev);
+ }
}
EXPORT_SYMBOL_GPL(vfio_pci_core_disable);

@@ -1847,19 +1857,23 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)

vfio_pci_probe_power_state(vdev);

- if (!disable_idle_d3) {
- /*
- * pci-core sets the device power state to an unknown value at
- * bootup and after being removed from a driver. The only
- * transition it allows from this unknown state is to D0, which
- * typically happens when a driver calls pci_enable_device().
- * We're not ready to enable the device yet, but we do want to
- * be able to get to D3. Therefore first do a D0 transition
- * before going to D3.
- */
- vfio_pci_set_power_state(vdev, PCI_D0);
- vfio_pci_set_power_state(vdev, PCI_D3hot);
- }
+ /*
+ * pci-core sets the device power state to an unknown value at
+ * bootup and after being removed from a driver. The only
+ * transition it allows from this unknown state is to D0, which
+ * typically happens when a driver calls pci_enable_device().
+ * We're not ready to enable the device yet, but we do want to
+ * be able to get to D3. Therefore first do a D0 transition
+ * before enabling runtime PM.
+ */
+ pci_set_power_state(pdev, PCI_D0);
+ pm_runtime_set_autosuspend_delay(&pdev->dev, 1000);
+ pm_runtime_use_autosuspend(&pdev->dev);
+ pm_runtime_mark_last_busy(&pdev->dev);
+ pm_runtime_allow(&pdev->dev);
+
+ if (!disable_idle_d3)
+ pm_runtime_put_autosuspend(&pdev->dev);

ret = vfio_register_group_dev(&vdev->vdev);
if (ret)
@@ -1868,7 +1882,10 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)

out_power:
if (!disable_idle_d3)
- vfio_pci_set_power_state(vdev, PCI_D0);
+ pm_runtime_get_noresume(&pdev->dev);
+
+ pm_runtime_dont_use_autosuspend(&pdev->dev);
+ pm_runtime_forbid(&pdev->dev);
out_vf:
vfio_pci_vf_uninit(vdev);
return ret;
@@ -1887,7 +1904,10 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
vfio_pci_vga_uninit(vdev);

if (!disable_idle_d3)
- vfio_pci_set_power_state(vdev, PCI_D0);
+ pm_runtime_get_noresume(&pdev->dev);
+
+ pm_runtime_dont_use_autosuspend(&pdev->dev);
+ pm_runtime_forbid(&pdev->dev);
}
EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);

@@ -2093,33 +2113,63 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
* - At least one of the affected devices is marked dirty via
* needs_reset (such as by lack of FLR support)
* Then attempt to perform that bus or slot reset.
- * Returns true if the dev_set was reset.
*/
-static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
+static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
{
struct vfio_pci_core_device *cur;
struct pci_dev *pdev;
int ret;

if (!vfio_pci_dev_set_needs_reset(dev_set))
- return false;
+ return;

pdev = vfio_pci_dev_set_resettable(dev_set);
if (!pdev)
- return false;
+ return;

ret = pci_reset_bus(pdev);
if (ret)
- return false;
+ return;

list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
cur->needs_reset = false;
- if (!disable_idle_d3)
- vfio_pci_set_power_state(cur, PCI_D3hot);
+ if (!disable_idle_d3) {
+ /*
+ * Inside pci_reset_bus(), all the devices in bus/slot
+ * will be moved out of D0 state. This state change to
+ * D0 can happen directly without going through the
+ * runtime PM framework. pm_runtime_resume() will
+ * help make the runtime state as active and then
+ * pm_request_autosuspend() can be used which will
+ * internally check for device usage count and will
+ * move the device again into the low power state.
+ */
+ pm_runtime_resume(&pdev->dev);
+ pm_runtime_mark_last_busy(&pdev->dev);
+ pm_request_autosuspend(&pdev->dev);
+ }
}
- return true;
}

+#ifdef CONFIG_PM
+static int vfio_pci_core_runtime_suspend(struct device *dev)
+{
+ return 0;
+}
+
+static int vfio_pci_core_runtime_resume(struct device *dev)
+{
+ return 0;
+}
+
+const struct dev_pm_ops vfio_pci_core_pm_ops = {
+ SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
+ vfio_pci_core_runtime_resume,
+ NULL)
+};
+EXPORT_SYMBOL_GPL(vfio_pci_core_pm_ops);
+#endif
+
void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
bool is_disable_idle_d3)
{
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index ef9a44b6cf5d..aafe09c9fa64 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -231,6 +231,10 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev);
void vfio_pci_core_disable(struct vfio_pci_core_device *vdev);
void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev);

+#ifdef CONFIG_PM
+extern const struct dev_pm_ops vfio_pci_core_pm_ops;
+#endif
+
static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
{
return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
--
2.17.1


2021-11-17 17:53:03

by Alex Williamson

Subject: Re: [RFC 1/3] vfio/pci: register vfio-pci driver with runtime PM framework

On Mon, 15 Nov 2021 19:06:38 +0530
<[email protected]> wrote:

> From: Abhishek Sahu <[email protected]>
>
> Currently, there is very limited power management support
> available in upstream vfio-pci driver. If there is no user of vfio-pci
> device, then the PCI device will be moved into D3Hot state by writing
> directly into PCI PM registers. This D3Hot state help in saving some
> amount of power but we can achieve zero power consumption if we go
> into D3cold state. The D3cold state cannot be possible with Native PCI
> PM. It requires interaction with platform firmware which is
> system-specific. To go into low power states (including D3cold), the
> runtime PM framework can be used which internally interacts with PCI
> and platform firmware and puts the device into the lowest possible
> D-States.
>
> This patch registers vfio-pci driver with the runtime PM framework.
>
> 1. The PCI core framework takes care of most of the runtime PM
> related things. For enabling the runtime PM, the PCI driver needs to
> decrement the usage count and needs to register the runtime
> suspend/resume routines. For vfio-pci based driver, these routines can
> be stubbed since the vfio-pci driver is not doing the PCI device
> initialization. All the config state saving, and PCI power management
> related things will be done by PCI core framework itself inside its
> runtime suspend/resume callbacks.
>
> 2. To prevent frequent runtime, suspend/resume, it uses autosuspend
> support with a default delay of 1 second.
>
> 3. Inside pci_reset_bus(), all the devices in bus/slot will be moved
> out of D0 state. This state change to D0 can happen directly without
> going through the runtime PM framework. So if runtime PM is enabled,
> then pm_runtime_resume() makes the runtime state active. Since the PCI
> device power state is already D0, so it should return early when it
> tries to change the state with pci_set_power_state(). Then
> pm_request_autosuspend() can be used which will internally check for
> device usage count and will move the device again into low power
> state.
>
> 4. Inside vfio_pci_core_disable(), the device usage count always needs
> to decremented which was incremented vfio_pci_core_enable() with
> pm_runtime_get_sync().
>
> 5. Since the runtime PM framework will provide the same functionality,
> so directly writing into PCI PM config register can be replaced with
> use of runtime PM routines. Also, the use of runtime PM can help us in
> more power saving.
>
> In the systems which do not support D3Cold,
>
> With the existing implementation:
>
> // PCI device
> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
> D3hot
> // upstream bridge
> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
> D0
>
> With runtime PM:
>
> // PCI device
> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
> D3hot
> // upstream bridge
> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
> D3hot
>
> So, with runtime PM, the upstream bridge or root port will also go
> into lower power state which is not possible with existing
> implementation.
>
> In the systems which support D3Cold,
>
> // PCI device
> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
> D3hot
> // upstream bridge
> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
> D0
>
> With runtime PM:
>
> // PCI device
> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
> D3cold
> // upstream bridge
> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
> D3cold
>
> So, with runtime PM, both the PCI device and upstream bridge will
> go into D3cold state.
>
> 6. If 'disable_idle_d3' module parameter is set, then also the runtime
> PM will be enabled, but in this case, the usage count should not be
> decremented.
>
> 7. vfio_pci_dev_set_try_reset() return value is unused now, so this
> function return type can be changed to void.
>
> Signed-off-by: Abhishek Sahu <[email protected]>
> ---
> drivers/vfio/pci/vfio_pci.c | 3 +
> drivers/vfio/pci/vfio_pci_core.c | 104 +++++++++++++++++++++++--------
> include/linux/vfio_pci_core.h | 4 ++
> 3 files changed, 84 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index a5ce92beb655..c8695baf3b54 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -193,6 +193,9 @@ static struct pci_driver vfio_pci_driver = {
> .remove = vfio_pci_remove,
> .sriov_configure = vfio_pci_sriov_configure,
> .err_handler = &vfio_pci_core_err_handlers,
> +#if defined(CONFIG_PM)
> + .driver.pm = &vfio_pci_core_pm_ops,
> +#endif
> };
>
> static void __init vfio_pci_fill_ids(void)
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index f948e6cd2993..511a52e92b32 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -152,7 +152,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
> }
>
> struct vfio_pci_group_info;
> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
> static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> struct vfio_pci_group_info *groups);
>
> @@ -245,7 +245,8 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
> u16 cmd;
> u8 msix_pos;
>
> - vfio_pci_set_power_state(vdev, PCI_D0);
> + if (!disable_idle_d3)
> + pm_runtime_get_sync(&pdev->dev);

I'm not a pm_runtime expert, but why are we not using
pm_runtime_resume_and_get() here and aborting the function on error?

I hope some folks more familiar with pm_runtime can also review usage
across this series.

>
> /* Don't allow our initial saved state to include busmaster */
> pci_clear_master(pdev);
> @@ -405,8 +406,17 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
> out:
> pci_disable_device(pdev);
>
> - if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3)
> - vfio_pci_set_power_state(vdev, PCI_D3hot);
> + vfio_pci_dev_set_try_reset(vdev->vdev.dev_set);
> +
> + /*
> + * The device usage count always needs to decremented which was
> + * incremented in vfio_pci_core_enable() with
> + * pm_runtime_get_sync().
> + */

*to be

Maybe:

/*
* Put the pm-runtime usage counter acquired during enable; mark
* last use time to delay autosuspend.
*/

> + if (!disable_idle_d3) {
> + pm_runtime_mark_last_busy(&pdev->dev);
> + pm_runtime_put_autosuspend(&pdev->dev);
> + }
> }
> EXPORT_SYMBOL_GPL(vfio_pci_core_disable);
>
> @@ -1847,19 +1857,23 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>
> vfio_pci_probe_power_state(vdev);
>
> - if (!disable_idle_d3) {
> - /*
> - * pci-core sets the device power state to an unknown value at
> - * bootup and after being removed from a driver. The only
> - * transition it allows from this unknown state is to D0, which
> - * typically happens when a driver calls pci_enable_device().
> - * We're not ready to enable the device yet, but we do want to
> - * be able to get to D3. Therefore first do a D0 transition
> - * before going to D3.
> - */
> - vfio_pci_set_power_state(vdev, PCI_D0);
> - vfio_pci_set_power_state(vdev, PCI_D3hot);
> - }
> + /*
> + * pci-core sets the device power state to an unknown value at
> + * bootup and after being removed from a driver. The only
> + * transition it allows from this unknown state is to D0, which
> + * typically happens when a driver calls pci_enable_device().
> + * We're not ready to enable the device yet, but we do want to
> + * be able to get to D3. Therefore first do a D0 transition
> + * before enabling runtime PM.
> + */
> + pci_set_power_state(pdev, PCI_D0);
> + pm_runtime_set_autosuspend_delay(&pdev->dev, 1000);

Let's #define this 1000 at the top of the file with some rationale for
how we arrived at this heuristic (ie. avoid magic numbers in code).
Something along these lines, perhaps (the name is only a suggestion):
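
/* arbitrary heuristic to avoid frequent runtime suspend/resume cycles */
#define VFIO_PCI_CORE_AUTOSUSPEND_DELAY_MS	1000
...
	pm_runtime_set_autosuspend_delay(&pdev->dev,
					 VFIO_PCI_CORE_AUTOSUSPEND_DELAY_MS);

Thanks,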

Alex

> + pm_runtime_use_autosuspend(&pdev->dev);
> + pm_runtime_mark_last_busy(&pdev->dev);
> + pm_runtime_allow(&pdev->dev);
> +
> + if (!disable_idle_d3)
> + pm_runtime_put_autosuspend(&pdev->dev);
>
> ret = vfio_register_group_dev(&vdev->vdev);
> if (ret)
> @@ -1868,7 +1882,10 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>
> out_power:
> if (!disable_idle_d3)
> - vfio_pci_set_power_state(vdev, PCI_D0);
> + pm_runtime_get_noresume(&pdev->dev);
> +
> + pm_runtime_dont_use_autosuspend(&pdev->dev);
> + pm_runtime_forbid(&pdev->dev);
> out_vf:
> vfio_pci_vf_uninit(vdev);
> return ret;
> @@ -1887,7 +1904,10 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
> vfio_pci_vga_uninit(vdev);
>
> if (!disable_idle_d3)
> - vfio_pci_set_power_state(vdev, PCI_D0);
> + pm_runtime_get_noresume(&pdev->dev);
> +
> + pm_runtime_dont_use_autosuspend(&pdev->dev);
> + pm_runtime_forbid(&pdev->dev);
> }
> EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
>
> @@ -2093,33 +2113,63 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
> * - At least one of the affected devices is marked dirty via
> * needs_reset (such as by lack of FLR support)
> * Then attempt to perform that bus or slot reset.
> - * Returns true if the dev_set was reset.
> */
> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
> {
> struct vfio_pci_core_device *cur;
> struct pci_dev *pdev;
> int ret;
>
> if (!vfio_pci_dev_set_needs_reset(dev_set))
> - return false;
> + return;
>
> pdev = vfio_pci_dev_set_resettable(dev_set);
> if (!pdev)
> - return false;
> + return;
>
> ret = pci_reset_bus(pdev);
> if (ret)
> - return false;
> + return;
>
> list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
> cur->needs_reset = false;
> - if (!disable_idle_d3)
> - vfio_pci_set_power_state(cur, PCI_D3hot);
> + if (!disable_idle_d3) {
> + /*
> + * Inside pci_reset_bus(), all the devices in bus/slot
> + * will be moved out of D0 state. This state change to
> + * D0 can happen directly without going through the
> + * runtime PM framework. pm_runtime_resume() will
> + * help make the runtime state as active and then
> + * pm_request_autosuspend() can be used which will
> + * internally check for device usage count and will
> + * move the device again into the low power state.
> + */
> + pm_runtime_resume(&pdev->dev);
> + pm_runtime_mark_last_busy(&pdev->dev);
> + pm_request_autosuspend(&pdev->dev);
> + }
> }
> - return true;
> }
>
> +#ifdef CONFIG_PM
> +static int vfio_pci_core_runtime_suspend(struct device *dev)
> +{
> + return 0;
> +}
> +
> +static int vfio_pci_core_runtime_resume(struct device *dev)
> +{
> + return 0;
> +}
> +
> +const struct dev_pm_ops vfio_pci_core_pm_ops = {
> + SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
> + vfio_pci_core_runtime_resume,
> + NULL)
> +};
> +EXPORT_SYMBOL_GPL(vfio_pci_core_pm_ops);
> +#endif
> +
> void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
> bool is_disable_idle_d3)
> {
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index ef9a44b6cf5d..aafe09c9fa64 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -231,6 +231,10 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev);
> void vfio_pci_core_disable(struct vfio_pci_core_device *vdev);
> void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev);
>
> +#ifdef CONFIG_PM
> +extern const struct dev_pm_ops vfio_pci_core_pm_ops;
> +#endif
> +
> static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
> {
> return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;


2021-11-17 17:53:18

by Alex Williamson

Subject: Re: [RFC 2/3] vfio/pci: virtualize PME related registers bits and initialize to zero

On Mon, 15 Nov 2021 19:06:39 +0530
<[email protected]> wrote:

> From: Abhishek Sahu <[email protected]>
>
> If any PME event will be generated by PCI, then it will be mostly
> handled in the host by the root port PME code. For example, in the case
> of PCIe, the PME event will be sent to the root port and then the PME
> interrupt will be generated. This will be handled in
> drivers/pci/pcie/pme.c at the host side. Inside this, the
> pci_check_pme_status() will be called where PME_Status and PME_En bits
> will be cleared. So, the guest OS which is using vfio-pci device will
> not come to know about this PME event.
>
> To handle these PME events inside guests, we need some framework so
> that if any PME events will happen, then it needs to be forwarded to
> virtual machine monitor. We can virtualize PME related registers bits
> and initialize these bits to zero so vfio-pci device user will assume
> that it is not capable of asserting the PME# signal from any power state.
>
> Signed-off-by: Abhishek Sahu <[email protected]>
> ---
> drivers/vfio/pci/vfio_pci_config.c | 32 +++++++++++++++++++++++++++++-
> 1 file changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 6e58b4bf7a60..fb3a503a5b99 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -738,12 +738,27 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
> */
> p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
>
> + /*
> + * The guests can't process PME events. If any PME event will be
> + * generated, then it will be mostly handled in the host and the
> + * host will clear the PME_STATUS. So virtualize PME_Support bits.
> + * It will be initialized to zero later on.
> + */
> + p_setw(perm, PCI_PM_PMC, PCI_PM_CAP_PME_MASK, NO_WRITE);
> +
> /*
> * Power management is defined *per function*, so we can let
> * the user change power state, but we trap and initiate the
> * change ourselves, so the state bits are read-only.
> + *
> + * The guest can't process PME from D3cold so virtualize PME_Status
> + * and PME_En bits. It will be initialized to zero later on.
> */
> - p_setd(perm, PCI_PM_CTRL, NO_VIRT, ~PCI_PM_CTRL_STATE_MASK);
> + p_setd(perm, PCI_PM_CTRL,
> + PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS,
> + ~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS |
> + PCI_PM_CTRL_STATE_MASK));
> +
> return 0;
> }
>
> @@ -1412,6 +1427,18 @@ static int vfio_ext_cap_len(struct vfio_pci_core_device *vdev, u16 ecap, u16 epo
> return 0;
> }
>
> +static void vfio_update_pm_vconfig_bytes(struct vfio_pci_core_device *vdev,
> + int offset)
> +{
> + /* initialize virtualized PME_Support bits to zero */
> + *(__le16 *)&vdev->vconfig[offset + PCI_PM_PMC] &=
> + ~cpu_to_le16(PCI_PM_CAP_PME_MASK);
> +
> + /* initialize virtualized PME_Status and PME_En bits to zero */

^ Extra space here and above.


> + *(__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL] &=
> + ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);

Perhaps more readable and consistent with elsewhere as:

__le16 *pmc = (__le16 *)&vdev->vconfig[offset + PCI_PM_PMC];
__le16 *ctrl = (__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL];

/* Clear vconfig PME_Support, PME_Status, and PME_En bits */
*pmc &= ~cpu_to_le16(PCI_PM_CAP_PME_MASK);
*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);

Thanks,
Alex

> +}
> +
> static int vfio_fill_vconfig_bytes(struct vfio_pci_core_device *vdev,
> int offset, int size)
> {
> @@ -1535,6 +1562,9 @@ static int vfio_cap_init(struct vfio_pci_core_device *vdev)
> if (ret)
> return ret;
>
> + if (cap == PCI_CAP_ID_PM)
> + vfio_update_pm_vconfig_bytes(vdev, pos);
> +
> prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
> pos = next;
> caps++;


2021-11-17 17:53:36

by Alex Williamson

Subject: Re: [RFC 3/3] vfio/pci: use runtime PM for vfio-device into low power state

On Mon, 15 Nov 2021 19:06:40 +0530
<[email protected]> wrote:

> From: Abhishek Sahu <[email protected]>
>
> Currently, if the runtime power management is enabled for vfio-pci
> device in guest OS, then guest OS will do the register write for
> PCI_PM_CTRL register. This write request will be handled in
> vfio_pm_config_write() where it will do the actual register write of
> PCI_PM_CTRL register. With this, the maximum D3hot state can be
> achieved for low power. If we can use the runtime PM framework, then
> we can achieve the D3cold state which will help in saving
> maximum power.
>
> This patch uses runtime PM framework whenever vfio-pci device will
> be put in the low power state.
>
> 1. If runtime PM is enabled, then instead of directly writing
> PCI_PM_CTRL register, decrement the device usage counter whenever
> the power state is non-D0. The kernel runtime PM framework will
> itself put the PCI device in low power state when device usage
> counter will become zero. Similarly, when the power state will be
> again changed back to D0, then increment the device usage counter
> and the kernel runtime PM framework will itself bring the PCI device
> out of low power state.
>
> 2. The guest OS will read the PCI_PM_CTRL register back to
> confirm the current power state so virtual register bits can be
> used. For this, before decrementing the usage count, read the actual
> PCI_PM_CTRL register, update the power state related bits, and then
> update the vconfig bits corresponding to PCI_PM_CTRL offset. For
> PCI_PM_CTRL register read, return the virtual value if runtime PM is
> requested. This vconfig bits will be cleared when the power state
> will be changed back to D0.
>
> 3. For the guest OS, the PCI power state will be still D3hot
> even if put the actual PCI device into D3cold state. In the D3hot
> state, the config space can still be read. So, if there is any request
> from guest OS to read the config space, then we need to move the actual
> PCI device again back to D0. For this, increment the device usage
> count before reading/writing the config space and then decrement it
> again after reading/writing the config space. There can be
> back-to-back config register read/write request, but since the auto
> suspend methods are being used here so only first access will
> wake up the PCI device and for the remaining access, the device will
> already be active.
>
> 4. This above-mentioned wake up is not needed if the register
> read/write is done completely with virtual bits. For handling this
> condition, the actual resume of device will only being done in
> vfio_user_config_read()/vfio_user_config_write(). All the config
> register access will come vfio_pci_config_rw(). So, in this
> function, use the pm_runtime_get_noresume() so that only usage count
> will be incremented without resuming the device. Inside,
> vfio_user_config_read()/vfio_user_config_write(), use the routines
> with pm_runtime_put_noidle() so that the PCI device won’t be
> suspended in the lower level functions. Again in the top level
> vfio_pci_config_rw(), use the pm_runtime_put_autosuspend(). Now the
> auto suspend timer will be started and the device will be suspended
> again. If the device is already runtime suspended, then this routine
> will return early.
>
> 5. In the host side D3cold will only be used if the platform has
> support for this, otherwise some other state will be used. The
> config space can be read if the device is not in D3cold state. So in
> this case, we can skip the resuming of PCI device. The wrapper
> function vfio_pci_config_pm_runtime_get() takes care of this
> condition and invoke the pm_runtime_resume() only if current power
> state is D3cold.
>
> 6. For vfio_pci_config_pm_runtime_get()/vfio_
> pci_config_pm_runtime_put(), the reference code is taken from
> pci_config_pm_runtime_get()/pci_config_pm_runtime_put() and then it
> is modified according to vfio-pci driver requirement.
>
> 7. vfio_pci_set_power_state() will be unused after moving to runtime
> PM, so this function can be removed along with other related
> functions and structure fields.


If we're transitioning a device to D3cold rather than D3hot as
requested by userspace, isn't that a user visible change? For
instance, a device may report NoSoftRst- indicating that the device
does not do a soft reset on D3hot->D0 transition. If we're instead
putting the device in D3cold, then a transition back to D0 has very
much undergone a reset. On one hand we should at least virtualize the
NoSoftRst bit to allow the guest to restore the device, but I wonder if
that's really safe. Is a better option to prevent entering D3cold if
the device isn't natively reporting NoSoftRst-?

We're also essentially making a policy decision on behalf of userspace
that favors power saving over latency. Is that universally the correct
trade-off? I can imagine this could be desirable for many use cases,
but if we're going to covertly replace D3hot with D3cold, it seems like
there should be an opt-in. Is co-opting the PM capability for this
even really acceptable or should there be a device ioctl to request
D3cold and plumbing through QEMU such that a VM guest can make informed
choices regarding device power management?

Also if the device is not responsive to config space due to the user
placing it in D3 now, I'd expect there are other ioctl paths that need
to be blocked, maybe even MMIO paths that might be a gap for existing
D3hot support. Thanks,

Alex


2021-11-18 15:22:07

by Abhishek Sahu

Subject: Re: [RFC 3/3] vfio/pci: use runtime PM for vfio-device into low power state

On 11/17/2021 11:23 PM, Alex Williamson wrote:

> On Mon, 15 Nov 2021 19:06:40 +0530
> <[email protected]> wrote:
>
>> From: Abhishek Sahu <[email protected]>
>>
>> Currently, if the runtime power management is enabled for vfio-pci
>> device in guest OS, then guest OS will do the register write for
>> PCI_PM_CTRL register. This write request will be handled in
>> vfio_pm_config_write() where it will do the actual register write of
>> PCI_PM_CTRL register. With this, the maximum D3hot state can be
>> achieved for low power. If we can use the runtime PM framework, then
>> we can achieve the D3cold state which will help in saving
>> maximum power.
>>
>> This patch uses runtime PM framework whenever vfio-pci device will
>> be put in the low power state.
>>
>> 1. If runtime PM is enabled, then instead of directly writing
>> PCI_PM_CTRL register, decrement the device usage counter whenever
>> the power state is non-D0. The kernel runtime PM framework will
>> itself put the PCI device in low power state when device usage
>> counter will become zero. Similarly, when the power state will be
>> again changed back to D0, then increment the device usage counter
>> and the kernel runtime PM framework will itself bring the PCI device
>> out of low power state.
>>
>> 2. The guest OS will read the PCI_PM_CTRL register back to
>> confirm the current power state so virtual register bits can be
>> used. For this, before decrementing the usage count, read the actual
>> PCI_PM_CTRL register, update the power state related bits, and then
>> update the vconfig bits corresponding to PCI_PM_CTRL offset. For
>> PCI_PM_CTRL register read, return the virtual value if runtime PM is
>> requested. This vconfig bits will be cleared when the power state
>> will be changed back to D0.
>>
>> 3. For the guest OS, the PCI power state will be still D3hot
>> even if put the actual PCI device into D3cold state. In the D3hot
>> state, the config space can still be read. So, if there is any request
>> from guest OS to read the config space, then we need to move the actual
>> PCI device again back to D0. For this, increment the device usage
>> count before reading/writing the config space and then decrement it
>> again after reading/writing the config space. There can be
>> back-to-back config register read/write request, but since the auto
>> suspend methods are being used here so only first access will
>> wake up the PCI device and for the remaining access, the device will
>> already be active.
>>
>> 4. This above-mentioned wake up is not needed if the register
>> read/write is done completely with virtual bits. For handling this
>> condition, the actual resume of device will only being done in
>> vfio_user_config_read()/vfio_user_config_write(). All the config
>> register access will come vfio_pci_config_rw(). So, in this
>> function, use the pm_runtime_get_noresume() so that only usage count
>> will be incremented without resuming the device. Inside,
>> vfio_user_config_read()/vfio_user_config_write(), use the routines
>> with pm_runtime_put_noidle() so that the PCI device won’t be
>> suspended in the lower level functions. Again in the top level
>> vfio_pci_config_rw(), use the pm_runtime_put_autosuspend(). Now the
>> auto suspend timer will be started and the device will be suspended
>> again. If the device is already runtime suspended, then this routine
>> will return early.
>>
>> 5. In the host side D3cold will only be used if the platform has
>> support for this, otherwise some other state will be used. The
>> config space can be read if the device is not in D3cold state. So in
>> this case, we can skip the resuming of PCI device. The wrapper
>> function vfio_pci_config_pm_runtime_get() takes care of this
>> condition and invoke the pm_runtime_resume() only if current power
>> state is D3cold.
>>
>> 6. For vfio_pci_config_pm_runtime_get()/vfio_
>> pci_config_pm_runtime_put(), the reference code is taken from
>> pci_config_pm_runtime_get()/pci_config_pm_runtime_put() and then it
>> is modified according to vfio-pci driver requirement.
>>
>> 7. vfio_pci_set_power_state() will be unused after moving to runtime
>> PM, so this function can be removed along with other related
>> functions and structure fields.
>
>

Thanks Alex for checking this series and providing your inputs.

> If we're transitioning a device to D3cold rather than D3hot as
> requested by userspace, isn't that a user visible change?

For most drivers in the Linux kernel, the choice between D3hot and
D3cold is made in the PCI core layer. There, pci_target_state()
determines which D3 state to use: it checks
platform_pci_power_manageable() and then calls
platform_pci_choose_state() to find the target state.
In a VM, the platform_pci_power_manageable() check fails when the
guest is a Linux OS, so the D3hot state is used.
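
For illustration, a simplified sketch of that decision, loosely
modeled on pci_target_state(); the function name is made up and the
platform_pci_*() helpers are internal to the PCI core, so this is
illustrative only, not buildable as an external module:

#include <linux/pci.h>

/*
 * Simplified sketch of how the PCI core picks a D3 target state.
 * Loosely follows pci_target_state() in drivers/pci/pci.c.
 */
static pci_power_t example_pci_target_state(struct pci_dev *dev)
{
        if (platform_pci_power_manageable(dev)) {
                /*
                 * Platform firmware (ACPI) is involved, so D3cold can
                 * be chosen when the platform supports it.
                 */
                return platform_pci_choose_state(dev);
        }

        /* No platform support (e.g. inside a VM): fall back to D3hot */
        return PCI_D3hot;
}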

But there are a few drivers which do not use the PCI framework's
generic power management routines during runtime suspend/system
suspend and instead set the PCI power state directly to D3hot.
Also, the guest can be a non-Linux OS, in which case it is
difficult to know the behavior. So, this change may impact
those cases.

> For instance, a device may report NoSoftRst- indicating that the device
> does not do a soft reset on D3hot->D0 transition. If we're instead
> putting the device in D3cold, then a transition back to D0 has very
> much undergone a reset. On one hand we should at least virtualize the
> NoSoftRst bit to allow the guest to restore the device, but I wonder if
> that's really safe. Is a better option to prevent entering D3cold if
> the device isn't natively reporting NoSoftRst-?
>

You mean NoSoftRst+ instead of NoSoftRst-, as visible in the
lspci output. In the NoSoftRst- case, the device performs a soft
reset on the D3hot->D0 transition. But won't this case be handled
internally in drivers/pci/pci-driver.c? For both system suspend and
runtime suspend, we check the pci_dev->state_saved flag and do
pci_save_state() irrespective of the NoSoftRst bit. In the
NoSoftRst- case, pci_restore_bars() will be called in
pci_raw_set_power_state(), which reinitializes the device for the
D3hot/D3cold->D0 transition. Once the device is initialized in the
host, it should work for the guest without being re-initialized
again on the guest side. I am not sure if my understanding is
correct.
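
For reference, a rough sketch of the suspend/resume ordering I am
describing; this is illustrative only, not the exact code in
drivers/pci/pci-driver.c, and the function names are made up:

#include <linux/pci.h>

static int example_runtime_suspend(struct pci_dev *pdev)
{
        /* Done irrespective of the NoSoftRst bit */
        if (!pdev->state_saved)
                pci_save_state(pdev);

        return pci_set_power_state(pdev, PCI_D3hot);
}

static int example_runtime_resume(struct pci_dev *pdev)
{
        /*
         * For a NoSoftRst- device, the D3hot->D0 transition inside
         * pci_set_power_state() ends up restoring the BARs
         * (pci_restore_bars()), and pci_restore_state() then restores
         * the config space saved at suspend time.
         */
        pci_set_power_state(pdev, PCI_D0);
        pci_restore_state(pdev);
        return 0;
}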

> We're also essentially making a policy decision on behalf of userspace
> that favors power saving over latency. Is that universally the correct
> trade-off?

For most drivers, D3hot should not be favored over D3cold for
latency reasons. On the Linux kernel side, I see that the PCI
framework tries to use the D3cold state if the platform and device
support it. But it is correct that covertly replacing D3hot with
D3cold may be a concern for some drivers.

> I can imagine this could be desirable for many use cases,
> but if we're going to covertly replace D3hot with D3cold, it seems like
> there should be an opt-in. Is co-opting the PM capability for this
> even really acceptable or should there be a device ioctl to request
> D3cold and plumbing through QEMU such that a VM guest can make informed
> choices regarding device power management?
>

Adding an ioctl is also an option, but in that case this support
needs to be added in all hypervisors and the user must pass this
information explicitly for each device. Another option could be a
module parameter to explicitly enable D3cold support. If the module
parameter is not set, then we can call pci_d3cold_disable() and, in
that case, runtime PM will not use the D3cold state; a rough sketch
of this option follows.
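
Something like this, where the parameter name "enable_d3cold" is
hypothetical and not part of the posted series:

#include <linux/module.h>
#include <linux/pci.h>

static bool enable_d3cold;      /* hypothetical parameter name */
module_param(enable_d3cold, bool, 0444);
MODULE_PARM_DESC(enable_d3cold,
                 "Allow runtime PM to use D3cold for idle vfio-pci devices");

static void example_setup_d3cold(struct pci_dev *pdev)
{
        /* Without the opt-in, keep runtime PM limited to D3hot */
        if (!enable_d3cold)
                pci_d3cold_disable(pdev);
}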

Also, I was checking whether we can pass this information through
some virtualized register bit which would be defined only for
passing the information between guest and host. On the guest side,
if the target state is decided with pci_target_state(), then D3cold
vs. D3hot should not matter for the driver running in the guest; in
that case it depends upon platform support, and we can set this
virtualized bit to 1. But if the driver either sets the D3hot state
explicitly or has called pci_d3cold_disable() or a similar API
available in the guest OS, then we set this bit to 0 and the D3cold
state can be disabled on the host side. But I don't know if it is
possible to use some non-PCI-defined virtualized register bit.

I am not sure what the best option is for making the choice
regarding D3cold, but if we can have some option by which this can
be done without user involvement, then it will benefit a lot of
cases. Currently, D3cold is supported on only a few
desktops/servers, but in the future we will see it on most
platforms.

> Also if the device is not responsive to config space due to the user
> placing it in D3 now, I'd expect there are other ioctl paths that need
> to be blocked, maybe even MMIO paths that might be a gap for existing
> D3hot support. Thanks,
>
> Alex
>

I was under the assumption that most of the ioctl code will be
called by the hypervisor before the guest OS boots, and during that
time the device will always be in D0. But if we have paths where an
ioctl can be called after the device has been suspended by the
guest OS, then can we use the runtime PM get/put APIs there as
well?
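
Something along these lines is what I had in mind; this is only a
sketch and the handler name is hypothetical:

#include <linux/pm_runtime.h>
#include <linux/vfio_pci_core.h>

/* Hypothetical ioctl handler, shown only to illustrate the pattern */
static long vfio_pci_example_ioctl(struct vfio_pci_core_device *vdev,
                                   unsigned int cmd, unsigned long arg)
{
        struct device *dev = &vdev->pdev->dev;
        int ret;

        /* Bring the device to D0 (and fail the ioctl if that fails) */
        ret = pm_runtime_resume_and_get(dev);
        if (ret < 0)
                return ret;

        /* ... do the real ioctl work while the device is active ... */
        ret = 0;

        /* Let the device drop back to low power after the autosuspend delay */
        pm_runtime_mark_last_busy(dev);
        pm_runtime_put_autosuspend(dev);
        return ret;
}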

Thanks,
Abhishek


2021-11-18 15:24:42

by Abhishek Sahu

[permalink] [raw]
Subject: Re: [RFC 2/3] vfio/pci: virtualize PME related registers bits and initialize to zero

On 11/17/2021 11:23 PM, Alex Williamson wrote:
> On Mon, 15 Nov 2021 19:06:39 +0530
> <[email protected]> wrote:
>
>> From: Abhishek Sahu <[email protected]>
>>
>> If any PME event will be generated by PCI, then it will be mostly
>> handled in the host by the root port PME code. For example, in the case
>> of PCIe, the PME event will be sent to the root port and then the PME
>> interrupt will be generated. This will be handled in
>> drivers/pci/pcie/pme.c at the host side. Inside this, the
>> pci_check_pme_status() will be called where PME_Status and PME_En bits
>> will be cleared. So, the guest OS which is using vfio-pci device will
>> not come to know about this PME event.
>>
>> To handle these PME events inside guests, we need some framework so
>> that if any PME events will happen, then it needs to be forwarded to
>> virtual machine monitor. We can virtualize PME related registers bits
>> and initialize these bits to zero so vfio-pci device user will assume
>> that it is not capable of asserting the PME# signal from any power state.
>>
>> Signed-off-by: Abhishek Sahu <[email protected]>
>> ---
>> drivers/vfio/pci/vfio_pci_config.c | 32 +++++++++++++++++++++++++++++-
>> 1 file changed, 31 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
>> index 6e58b4bf7a60..fb3a503a5b99 100644
>> --- a/drivers/vfio/pci/vfio_pci_config.c
>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>> @@ -738,12 +738,27 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
>> */
>> p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
>>
>> + /*
>> + * The guests can't process PME events. If any PME event will be
>> + * generated, then it will be mostly handled in the host and the
>> + * host will clear the PME_STATUS. So virtualize PME_Support bits.
>> + * It will be initialized to zero later on.
>> + */
>> + p_setw(perm, PCI_PM_PMC, PCI_PM_CAP_PME_MASK, NO_WRITE);
>> +
>> /*
>> * Power management is defined *per function*, so we can let
>> * the user change power state, but we trap and initiate the
>> * change ourselves, so the state bits are read-only.
>> + *
>> + * The guest can't process PME from D3cold so virtualize PME_Status
>> + * and PME_En bits. It will be initialized to zero later on.
>> */
>> - p_setd(perm, PCI_PM_CTRL, NO_VIRT, ~PCI_PM_CTRL_STATE_MASK);
>> + p_setd(perm, PCI_PM_CTRL,
>> + PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS,
>> + ~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS |
>> + PCI_PM_CTRL_STATE_MASK));
>> +
>> return 0;
>> }
>>
>> @@ -1412,6 +1427,18 @@ static int vfio_ext_cap_len(struct vfio_pci_core_device *vdev, u16 ecap, u16 epo
>> return 0;
>> }
>>
>> +static void vfio_update_pm_vconfig_bytes(struct vfio_pci_core_device *vdev,
>> + int offset)
>> +{
>> + /* initialize virtualized PME_Support bits to zero */
>> + *(__le16 *)&vdev->vconfig[offset + PCI_PM_PMC] &=
>> + ~cpu_to_le16(PCI_PM_CAP_PME_MASK);
>> +
>> + /* initialize virtualized PME_Status and PME_En bits to zero */
>
> ^ Extra space here and above.
>
>
>> + *(__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL] &=
>> + ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);
>
> Perhaps more readable and consistent with elsewhere as:
>
> __le16 *pmc = (__le16 *)&vdev->vconfig[offset + PCI_PM_PMC];
> __le16 *ctrl = (__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL];
>
> /* Clear vconfig PME_Support, PME_Status, and PME_En bits */
> *pmc &= ~cpu_to_le16(PCI_PM_CAP_PME_MASK);
> *ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);
>
> Thanks,
> Alex
>

Thanks Alex. I will fix this.

Regards,
Abhishek

>> +}
>> +
>> static int vfio_fill_vconfig_bytes(struct vfio_pci_core_device *vdev,
>> int offset, int size)
>> {
>> @@ -1535,6 +1562,9 @@ static int vfio_cap_init(struct vfio_pci_core_device *vdev)
>> if (ret)
>> return ret;
>>
>> + if (cap == PCI_CAP_ID_PM)
>> + vfio_update_pm_vconfig_bytes(vdev, pos);
>> +
>> prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
>> pos = next;
>> caps++;
>


2021-11-18 15:37:31

by Abhishek Sahu

[permalink] [raw]
Subject: Re: [RFC 1/3] vfio/pci: register vfio-pci driver with runtime PM framework

On 11/17/2021 11:22 PM, Alex Williamson wrote:
> On Mon, 15 Nov 2021 19:06:38 +0530
> <[email protected]> wrote:
>
>> From: Abhishek Sahu <[email protected]>
>>
>> Currently, there is very limited power management support
>> available in upstream vfio-pci driver. If there is no user of vfio-pci
>> device, then the PCI device will be moved into D3Hot state by writing
>> directly into PCI PM registers. This D3Hot state help in saving some
>> amount of power but we can achieve zero power consumption if we go
>> into D3cold state. The D3cold state cannot be possible with Native PCI
>> PM. It requires interaction with platform firmware which is
>> system-specific. To go into low power states (including D3cold), the
>> runtime PM framework can be used which internally interacts with PCI
>> and platform firmware and puts the device into the lowest possible
>> D-States.
>>
>> This patch registers vfio-pci driver with the runtime PM framework.
>>
>> 1. The PCI core framework takes care of most of the runtime PM
>> related things. For enabling the runtime PM, the PCI driver needs to
>> decrement the usage count and needs to register the runtime
>> suspend/resume routines. For vfio-pci based driver, these routines can
>> be stubbed since the vfio-pci driver is not doing the PCI device
>> initialization. All the config state saving, and PCI power management
>> related things will be done by PCI core framework itself inside its
>> runtime suspend/resume callbacks.
>>
>> 2. To prevent frequent runtime, suspend/resume, it uses autosuspend
>> support with a default delay of 1 second.
>>
>> 3. Inside pci_reset_bus(), all the devices in bus/slot will be moved
>> out of D0 state. This state change to D0 can happen directly without
>> going through the runtime PM framework. So if runtime PM is enabled,
>> then pm_runtime_resume() makes the runtime state active. Since the PCI
>> device power state is already D0, so it should return early when it
>> tries to change the state with pci_set_power_state(). Then
>> pm_request_autosuspend() can be used which will internally check for
>> device usage count and will move the device again into low power
>> state.
>>
>> 4. Inside vfio_pci_core_disable(), the device usage count always needs
>> to decremented which was incremented vfio_pci_core_enable() with
>> pm_runtime_get_sync().
>>
>> 5. Since the runtime PM framework will provide the same functionality,
>> so directly writing into PCI PM config register can be replaced with
>> use of runtime PM routines. Also, the use of runtime PM can help us in
>> more power saving.
>>
>> In the systems which do not support D3Cold,
>>
>> With the existing implementation:
>>
>> // PCI device
>> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>> D3hot
>> // upstream bridge
>> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>> D0
>>
>> With runtime PM:
>>
>> // PCI device
>> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>> D3hot
>> // upstream bridge
>> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>> D3hot
>>
>> So, with runtime PM, the upstream bridge or root port will also go
>> into lower power state which is not possible with existing
>> implementation.
>>
>> In the systems which support D3Cold,
>>
>> // PCI device
>> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>> D3hot
>> // upstream bridge
>> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>> D0
>>
>> With runtime PM:
>>
>> // PCI device
>> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>> D3cold
>> // upstream bridge
>> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>> D3cold
>>
>> So, with runtime PM, both the PCI device and upstream bridge will
>> go into D3cold state.
>>
>> 6. If 'disable_idle_d3' module parameter is set, then also the runtime
>> PM will be enabled, but in this case, the usage count should not be
>> decremented.
>>
>> 7. vfio_pci_dev_set_try_reset() return value is unused now, so this
>> function return type can be changed to void.
>>
>> Signed-off-by: Abhishek Sahu <[email protected]>
>> ---
>> drivers/vfio/pci/vfio_pci.c | 3 +
>> drivers/vfio/pci/vfio_pci_core.c | 104 +++++++++++++++++++++++--------
>> include/linux/vfio_pci_core.h | 4 ++
>> 3 files changed, 84 insertions(+), 27 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index a5ce92beb655..c8695baf3b54 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -193,6 +193,9 @@ static struct pci_driver vfio_pci_driver = {
>> .remove = vfio_pci_remove,
>> .sriov_configure = vfio_pci_sriov_configure,
>> .err_handler = &vfio_pci_core_err_handlers,
>> +#if defined(CONFIG_PM)
>> + .driver.pm = &vfio_pci_core_pm_ops,
>> +#endif
>> };
>>
>> static void __init vfio_pci_fill_ids(void)
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index f948e6cd2993..511a52e92b32 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -152,7 +152,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>> }
>>
>> struct vfio_pci_group_info;
>> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>> static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>> struct vfio_pci_group_info *groups);
>>
>> @@ -245,7 +245,8 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>> u16 cmd;
>> u8 msix_pos;
>>
>> - vfio_pci_set_power_state(vdev, PCI_D0);
>> + if (!disable_idle_d3)
>> + pm_runtime_get_sync(&pdev->dev);
>
> I'm not a pm_runtime expert, but why are we not using
> pm_runtime_resume_and_get() here and aborting the function on error?
>

Thanks for pointing this out.
We should use pm_runtime_resume_and_get() and abort the function
in case of error. I will check the other places also and see if
we can use a similar API.
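
i.e., the relevant part of vfio_pci_core_enable() would become
roughly the following (just a sketch of the intended change, not the
final patch):

        if (!disable_idle_d3) {
                /* Resume the device and abort enable on failure */
                ret = pm_runtime_resume_and_get(&pdev->dev);
                if (ret < 0)
                        return ret;
        }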

> I hope some folks more familiar with pm_runtime can also review usage
> across this series.
>

Yes. The runtime PM API usage requires a thorough review.

>>
>> /* Don't allow our initial saved state to include busmaster */
>> pci_clear_master(pdev);
>> @@ -405,8 +406,17 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>> out:
>> pci_disable_device(pdev);
>>
>> - if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3)
>> - vfio_pci_set_power_state(vdev, PCI_D3hot);
>> + vfio_pci_dev_set_try_reset(vdev->vdev.dev_set);
>> +
>> + /*
>> + * The device usage count always needs to decremented which was
>> + * incremented in vfio_pci_core_enable() with
>> + * pm_runtime_get_sync().
>> + */
>
> *to be
>
> Maybe:
>
> /*
> * Put the pm-runtime usage counter acquired during enable; mark
> * last use time to delay autosuspend.
> */
>

I will fix this.

>> + if (!disable_idle_d3) {
>> + pm_runtime_mark_last_busy(&pdev->dev);
>> + pm_runtime_put_autosuspend(&pdev->dev);
>> + }
>> }
>> EXPORT_SYMBOL_GPL(vfio_pci_core_disable);
>>
>> @@ -1847,19 +1857,23 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>>
>> vfio_pci_probe_power_state(vdev);
>>
>> - if (!disable_idle_d3) {
>> - /*
>> - * pci-core sets the device power state to an unknown value at
>> - * bootup and after being removed from a driver. The only
>> - * transition it allows from this unknown state is to D0, which
>> - * typically happens when a driver calls pci_enable_device().
>> - * We're not ready to enable the device yet, but we do want to
>> - * be able to get to D3. Therefore first do a D0 transition
>> - * before going to D3.
>> - */
>> - vfio_pci_set_power_state(vdev, PCI_D0);
>> - vfio_pci_set_power_state(vdev, PCI_D3hot);
>> - }
>> + /*
>> + * pci-core sets the device power state to an unknown value at
>> + * bootup and after being removed from a driver. The only
>> + * transition it allows from this unknown state is to D0, which
>> + * typically happens when a driver calls pci_enable_device().
>> + * We're not ready to enable the device yet, but we do want to
>> + * be able to get to D3. Therefore first do a D0 transition
>> + * before enabling runtime PM.
>> + */
>> + pci_set_power_state(pdev, PCI_D0);
>> + pm_runtime_set_autosuspend_delay(&pdev->dev, 1000);
>
> Let's #define this 1000 at the top of the file with some rationale how
> we arrived at this heuristic (ie. avoid magic numbers in code). Thanks,
>
> Alex
>

Sure. This autosuspend delay was mainly for the back-to-back config
read/write case. I will move it to the top of the file with a proper
rationale.
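
Something like this, perhaps (the macro name and comment wording are
just a suggestion):

/*
 * Keep the device active for 1s after the last access so that
 * back-to-back config space reads/writes only trigger one resume.
 */
#define VFIO_PCI_AUTOSUSPEND_DELAY_MS	1000

and then in vfio_pci_core_register_device():

        pm_runtime_set_autosuspend_delay(&pdev->dev,
                                         VFIO_PCI_AUTOSUSPEND_DELAY_MS);
        pm_runtime_use_autosuspend(&pdev->dev);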

Thanks,
Abhishek

>> + pm_runtime_use_autosuspend(&pdev->dev);
>> + pm_runtime_mark_last_busy(&pdev->dev);
>> + pm_runtime_allow(&pdev->dev);
>> +
>> + if (!disable_idle_d3)
>> + pm_runtime_put_autosuspend(&pdev->dev);
>>
>> ret = vfio_register_group_dev(&vdev->vdev);
>> if (ret)
>> @@ -1868,7 +1882,10 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>>
>> out_power:
>> if (!disable_idle_d3)
>> - vfio_pci_set_power_state(vdev, PCI_D0);
>> + pm_runtime_get_noresume(&pdev->dev);
>> +
>> + pm_runtime_dont_use_autosuspend(&pdev->dev);
>> + pm_runtime_forbid(&pdev->dev);
>> out_vf:
>> vfio_pci_vf_uninit(vdev);
>> return ret;
>> @@ -1887,7 +1904,10 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
>> vfio_pci_vga_uninit(vdev);
>>
>> if (!disable_idle_d3)
>> - vfio_pci_set_power_state(vdev, PCI_D0);
>> + pm_runtime_get_noresume(&pdev->dev);
>> +
>> + pm_runtime_dont_use_autosuspend(&pdev->dev);
>> + pm_runtime_forbid(&pdev->dev);
>> }
>> EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
>>
>> @@ -2093,33 +2113,63 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
>> * - At least one of the affected devices is marked dirty via
>> * needs_reset (such as by lack of FLR support)
>> * Then attempt to perform that bus or slot reset.
>> - * Returns true if the dev_set was reset.
>> */
>> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>> {
>> struct vfio_pci_core_device *cur;
>> struct pci_dev *pdev;
>> int ret;
>>
>> if (!vfio_pci_dev_set_needs_reset(dev_set))
>> - return false;
>> + return;
>>
>> pdev = vfio_pci_dev_set_resettable(dev_set);
>> if (!pdev)
>> - return false;
>> + return;
>>
>> ret = pci_reset_bus(pdev);
>> if (ret)
>> - return false;
>> + return;
>>
>> list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
>> cur->needs_reset = false;
>> - if (!disable_idle_d3)
>> - vfio_pci_set_power_state(cur, PCI_D3hot);
>> + if (!disable_idle_d3) {
>> + /*
>> + * Inside pci_reset_bus(), all the devices in bus/slot
>> + * will be moved out of D0 state. This state change to
>> + * D0 can happen directly without going through the
>> + * runtime PM framework. pm_runtime_resume() will
>> + * help make the runtime state as active and then
>> + * pm_request_autosuspend() can be used which will
>> + * internally check for device usage count and will
>> + * move the device again into the low power state.
>> + */
>> + pm_runtime_resume(&pdev->dev);
>> + pm_runtime_mark_last_busy(&pdev->dev);
>> + pm_request_autosuspend(&pdev->dev);
>> + }
>> }
>> - return true;
>> }
>>
>> +#ifdef CONFIG_PM
>> +static int vfio_pci_core_runtime_suspend(struct device *dev)
>> +{
>> + return 0;
>> +}
>> +
>> +static int vfio_pci_core_runtime_resume(struct device *dev)
>> +{
>> + return 0;
>> +}
>> +
>> +const struct dev_pm_ops vfio_pci_core_pm_ops = {
>> + SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
>> + vfio_pci_core_runtime_resume,
>> + NULL)
>> +};
>> +EXPORT_SYMBOL_GPL(vfio_pci_core_pm_ops);
>> +#endif
>> +
>> void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
>> bool is_disable_idle_d3)
>> {
>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>> index ef9a44b6cf5d..aafe09c9fa64 100644
>> --- a/include/linux/vfio_pci_core.h
>> +++ b/include/linux/vfio_pci_core.h
>> @@ -231,6 +231,10 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev);
>> void vfio_pci_core_disable(struct vfio_pci_core_device *vdev);
>> void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev);
>>
>> +#ifdef CONFIG_PM
>> +extern const struct dev_pm_ops vfio_pci_core_pm_ops;
>> +#endif
>> +
>> static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>> {
>> return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>


2021-11-18 21:09:19

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC 3/3] vfio/pci: use runtime PM for vfio-device into low power state

On Thu, 18 Nov 2021 20:51:41 +0530
Abhishek Sahu <[email protected]> wrote:
> On 11/17/2021 11:23 PM, Alex Williamson wrote:
> Thanks Alex for checking this series and providing your inputs.
>
> > If we're transitioning a device to D3cold rather than D3hot as
> > requested by userspace, isn't that a user visible change?
>
> For most of the driver, in linux kernel, the D3hot vs D3cold
> state will be decided at PCI core layer. In the PCI core layer,
> pci_target_state() determines which D3 state to choose. It checks
> for platform_pci_power_manageable() and then it calls
> platform_pci_choose_state() to find the target state.
> In VM, the platform_pci_power_manageable() check will fail if the
> guest is linux OS. So, it uses, D3hot state.

Right, but my statement is really more that the device PM registers
cannot be used to put the device into D3cold, so the write of the PM
register that we're trapping was the user/guest's intention to put the
device into D3hot. We therefore need to be careful about differences
in the resulting device state when it comes out of D3cold vs D3hot.

> But there are few drivers which does not use the PCI framework
> generic power related routines during runtime suspend/system suspend
> and set the PCI power state directly with D3hot.

Current vfio-pci being one of those ;)

> Also, the guest can be non-Linux OS also and, in that case,
> it will be difficult to know the behavior. So, it may impact
> these cases.

That's what I'm worried about.

> > For instance, a device may report NoSoftRst- indicating that the device
> > does not do a soft reset on D3hot->D0 transition. If we're instead
> > putting the device in D3cold, then a transition back to D0 has very
> > much undergone a reset. On one hand we should at least virtualize the
> > NoSoftRst bit to allow the guest to restore the device, but I wonder if
> > that's really safe. Is a better option to prevent entering D3cold if
> > the device isn't natively reporting NoSoftRst-?
> >
>
> You mean to say NoSoftRst+ instead of NoSoftRst- as visible in

Oops yes. The concern is if the user/guest is not expecting a soft
reset when using D3hot, but we transparently promote D3hot to D3cold
which always implies a device reset.

> the lspci output. For NoSoftRst- case, we do a soft reset on
> D3hot->D0 transition. But, will this case not be handled internally
> in drivers/pci/pci-driver.c ? For both system suspend and runtime suspend,
> we check for pci_dev->state_saved flag and do pci_save_state()
> irrespective of NoSoftRst bit. For NoSoftRst- case, pci_restore_bars()
> will be called in pci_raw_set_power_state() which will reinitialize device
> for D3hot/D3cold-> D0 case. Once the device is initialized in the host,
> then for guest, it should work without re-initializing again in the
> guest side. I am not sure, if my understanding is correct.

The soft reset is not limited to the state that the PCI subsystem can
save and restore. Device specific state that the user/guest may
legitimately expect to be retained may be reset as well.

[PCIe v5 5.3.1.4]
Functional context is required to be maintained by Functions in
the D3 hot state if the No_Soft_Reset field in the PMCSR is Set.

Unfortunately I don't see a specific definition of "functional
context", but I interpret that to include device specific state. For
example, if a GPU contains specific frame buffer data and reports
NoSoftRst+, wouldn't it be reasonable to expect that framebuffer data
to be retained on D3hot->D0 transition?

> > We're also essentially making a policy decision on behalf of
> > userspace that favors power saving over latency. Is that
> > universally the correct trade-off?
>
> For most drivers, the D3hot vs D3cold should not be favored due
> to latency reasons. In the linux kernel side, I am seeing, the
> PCI framework try to use D3cold state if platform and device
> supports that. But its correct that covertly replacing D3hot with
> D3cold may be concern for some drivers.
>
> > I can imagine this could be desirable for many use cases,
> > but if we're going to covertly replace D3hot with D3cold, it seems
> > like there should be an opt-in. Is co-opting the PM capability for
> > this even really acceptable or should there be a device ioctl to
> > request D3cold and plumbing through QEMU such that a VM guest can
> > make informed choices regarding device power management?
> >
>
> Making IOCTL is also an option but that case, this support needs to
> be added in all hypervisors and user must pass this information
> explicitly for each device. Another option could be to use
> module parameter to explicitly enable D3cold support. If module
> parameter is not set, then we can call pci_d3cold_disable() and
> in that case, runtime PM should not use D3cold state.
>
> Also, I was checking we can pass this information though some
> virtualized register bit which will be only defined for passing
> the information between guest and host. In the guest side if the
> target state is being decided with pci_target_state(), then
> the D3cold vs D3hot should not matter for the driver running
> in the guest side and in that case, it depends upon platform support.
> We can set this virtualize bit to 1. But, if driver is either
> setting D3hot state explicitly or has called pci_d3cold_disable() or
> similar API available in the guest OS, then set this bit to 0 and
> in that case, the D3cold state can be disabled in the host side.
> But don't know if is possible to use some non PCI defined
> virtualized register bit.

If you're suggesting a device config space register, that's troublesome
because we can't guarantee that simply because a range of config space
isn't within a capability that it doesn't have some device specific
purpose. However, we could certainly implement virtual registers in
the hypervisor that implement the ACPI support that an OS would use on
bare metal to implement D3cold. Those could trigger this ioctl through
the vfio device.

> I am not sure what should be best option to make choice
> regarding d3cold but if we can have some option by which this
> can be done without involvement of user, then it will benefit
> for lot of cases. Currently, the D3cold is supported only in
> very few desktops/servers but in future, we will see on
> most of the platforms.

I tend to see it as an interesting hack to promote D3hot to D3cold, and
potentially very useful. However, we're also introducing potentially
unexpected device behavior, so I think it would probably need to be an
opt-in. Possibly if the device reports NoSoftRst- we could use it by
default, but even virtualizing the NoSoftRst suggests that there's an
expectation that the guest driver has that support available.

> > Also if the device is not responsive to config space due to the user
> > placing it in D3 now, I'd expect there are other ioctl paths that
> > need to be blocked, maybe even MMIO paths that might be a gap for
> > existing D3hot support. Thanks,
>
> I was in assumption that most of IOCTL code will be called by the
> hypervisor before guest OS boot and during that time, the device
> will be always in D0. But, if we have paths where IOCTL can be
> called when the device has been suspended by guest OS, then can we
> use runtime_get/put API’s there also ?

It's more a matter of preventing user actions that can cause harm
rather than expecting certain operations only in specific states. We
could choose to either resume the device for those operations or fail
the operation. We should probably also leverage the memory-disable
support to fault mmap access to MMIO when the device is in D3* as well.
Thanks,

Alex


2021-11-19 15:45:55

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC 3/3] vfio/pci: use runtime PM for vfio-device into low power state

On Thu, 18 Nov 2021 14:09:13 -0700
Alex Williamson <[email protected]> wrote:

> On Thu, 18 Nov 2021 20:51:41 +0530
> Abhishek Sahu <[email protected]> wrote:
> > On 11/17/2021 11:23 PM, Alex Williamson wrote:
> > Thanks Alex for checking this series and providing your inputs.
> >
> > > If we're transitioning a device to D3cold rather than D3hot as
> > > requested by userspace, isn't that a user visible change?
> >
> > For most of the driver, in linux kernel, the D3hot vs D3cold
> > state will be decided at PCI core layer. In the PCI core layer,
> > pci_target_state() determines which D3 state to choose. It checks
> > for platform_pci_power_manageable() and then it calls
> > platform_pci_choose_state() to find the target state.
> > In VM, the platform_pci_power_manageable() check will fail if the
> > guest is linux OS. So, it uses, D3hot state.
>
> Right, but my statement is really more that the device PM registers
> cannot be used to put the device into D3cold, so the write of the PM
> register that we're trapping was the user/guest's intention to put the
> device into D3hot. We therefore need to be careful about differences
> in the resulting device state when it comes out of D3cold vs D3hot.
>
> > But there are few drivers which does not use the PCI framework
> > generic power related routines during runtime suspend/system suspend
> > and set the PCI power state directly with D3hot.
>
> Current vfio-pci being one of those ;)
>
> > Also, the guest can be non-Linux OS also and, in that case,
> > it will be difficult to know the behavior. So, it may impact
> > these cases.
>
> That's what I'm worried about.
>
> > > For instance, a device may report NoSoftRst- indicating that the device
> > > does not do a soft reset on D3hot->D0 transition. If we're instead
> > > putting the device in D3cold, then a transition back to D0 has very
> > > much undergone a reset. On one hand we should at least virtualize the
> > > NoSoftRst bit to allow the guest to restore the device, but I wonder if
> > > that's really safe. Is a better option to prevent entering D3cold if
> > > the device isn't natively reporting NoSoftRst-?
> > >
> >
> > You mean to say NoSoftRst+ instead of NoSoftRst- as visible in
>
> Oops yes. The concern is if the user/guest is not expecting a soft
> reset when using D3hot, but we transparently promote D3hot to D3cold
> which will always implies a device reset.
>
> > the lspci output. For NoSoftRst- case, we do a soft reset on
> > D3hot->D0 transition. But, will this case not be handled internally
> > in drivers/pci/pci-driver.c ? For both system suspend and runtime suspend,
> > we check for pci_dev->state_saved flag and do pci_save_state()
> > irrespective of NoSoftRst bit. For NoSoftRst- case, pci_restore_bars()
> > will be called in pci_raw_set_power_state() which will reinitialize device
> > for D3hot/D3cold-> D0 case. Once the device is initialized in the host,
> > then for guest, it should work without re-initializing again in the
> > guest side. I am not sure, if my understanding is correct.
>
> The soft reset is not limited to the state that the PCI subsystem can
> save and restore. Device specific state that the user/guest may
> legitimately expect to be retained may be reset as well.
>
> [PCIe v5 5.3.1.4]
> Functional context is required to be maintained by Functions in
> the D3 hot state if the No_Soft_Reset field in the PMCSR is Set.
>
> Unfortunately I don't see a specific definition of "functional
> context", but I interpret that to include device specific state. For
> example, if a GPU contains specific frame buffer data and reports
> NoSoftRst+, wouldn't it be reasonable to expect that framebuffer data
> to be retained on D3hot->D0 transition?
>
> > > We're also essentially making a policy decision on behalf of
> > > userspace that favors power saving over latency. Is that
> > > universally the correct trade-off?
> >
> > For most drivers, the D3hot vs D3cold should not be favored due
> > to latency reasons. In the linux kernel side, I am seeing, the
> > PCI framework try to use D3cold state if platform and device
> > supports that. But its correct that covertly replacing D3hot with
> > D3cold may be concern for some drivers.
> >
> > > I can imagine this could be desirable for many use cases,
> > > but if we're going to covertly replace D3hot with D3cold, it seems
> > > like there should be an opt-in. Is co-opting the PM capability for
> > > this even really acceptable or should there be a device ioctl to
> > > request D3cold and plumbing through QEMU such that a VM guest can
> > > make informed choices regarding device power management?
> > >
> >
> > Making IOCTL is also an option but that case, this support needs to
> > be added in all hypervisors and user must pass this information
> > explicitly for each device. Another option could be to use
> > module parameter to explicitly enable D3cold support. If module
> > parameter is not set, then we can call pci_d3cold_disable() and
> > in that case, runtime PM should not use D3cold state.
> >
> > Also, I was checking we can pass this information though some
> > virtualized register bit which will be only defined for passing
> > the information between guest and host. In the guest side if the
> > target state is being decided with pci_target_state(), then
> > the D3cold vs D3hot should not matter for the driver running
> > in the guest side and in that case, it depends upon platform support.
> > We can set this virtualize bit to 1. But, if driver is either
> > setting D3hot state explicitly or has called pci_d3cold_disable() or
> > similar API available in the guest OS, then set this bit to 0 and
> > in that case, the D3cold state can be disabled in the host side.
> > But don't know if is possible to use some non PCI defined
> > virtualized register bit.
>
> If you're suggesting a device config space register, that's troublesome
> because we can't guarantee that simply because a range of config space
> isn't within a capability that it doesn't have some device specific
> purpose. However, we could certainly implement virtual registers in
> the hypervisor that implement the ACPI support that an OS would use on
> bare metal to implement D3cold. Those could trigger this ioctl through
> the vfio device.
>
> > I am not sure what should be best option to make choice
> > regarding d3cold but if we can have some option by which this
> > can be done without involvement of user, then it will benefit
> > for lot of cases. Currently, the D3cold is supported only in
> > very few desktops/servers but in future, we will see on
> > most of the platforms.
>
> I tend to see it as an interesting hack to promote D3hot to D3cold, and
> potentially very useful. However, we're also introducing potentially
> unexpected device behavior, so I think it would probably need to be an
> opt-in. Possibly if the device reports NoSoftRst- we could use it by
> default, but even virtualizing the NoSoftRst suggests that there's an
> expectation that the guest driver has that support available.
>
> > > Also if the device is not responsive to config space due to the user
> > > placing it in D3 now, I'd expect there are other ioctl paths that
> > > need to be blocked, maybe even MMIO paths that might be a gap for
> > > existing D3hot support. Thanks,
> >
> > I was in assumption that most of IOCTL code will be called by the
> > hypervisor before guest OS boot and during that time, the device
> > will be always in D0. But, if we have paths where IOCTL can be
> > called when the device has been suspended by guest OS, then can we
> > use runtime_get/put API’s there also ?
>
> It's more a matter of preventing user actions that can cause harm
> rather than expecting certain operations only in specific states. We
> could chose to either resume the device for those operations or fail
> the operation. We should probably also leverage the memory-disable
> support to fault mmap access to MMIO when the device is in D3* as well.

It also occurred to me last night that a guest triggering D3hot via the
PM registers must be a synchronous power state change, we can't use
auto-suspend. This is necessary for nested assignment where the guest
might use a D3hot->D0 power state transition with NoSoftRst- devices in
order to perform a reset of the device. With auto-suspend, the guest
would return the device to D0 before the physical device ever timed out
to enter a D3 state. Thanks,

Alex


2021-11-23 07:27:25

by Abhishek Sahu

[permalink] [raw]
Subject: Re: [RFC 3/3] vfio/pci: use runtime PM for vfio-device into low power state

On 11/19/2021 9:15 PM, Alex Williamson wrote:
> On Thu, 18 Nov 2021 14:09:13 -0700
> Alex Williamson <[email protected]> wrote:
>
>> On Thu, 18 Nov 2021 20:51:41 +0530
>> Abhishek Sahu <[email protected]> wrote:
>>> On 11/17/2021 11:23 PM, Alex Williamson wrote:
>>> Thanks Alex for checking this series and providing your inputs.
>>>
>>>> If we're transitioning a device to D3cold rather than D3hot as
>>>> requested by userspace, isn't that a user visible change?
>>>
>>> For most of the driver, in linux kernel, the D3hot vs D3cold
>>> state will be decided at PCI core layer. In the PCI core layer,
>>> pci_target_state() determines which D3 state to choose. It checks
>>> for platform_pci_power_manageable() and then it calls
>>> platform_pci_choose_state() to find the target state.
>>> In VM, the platform_pci_power_manageable() check will fail if the
>>> guest is linux OS. So, it uses, D3hot state.
>>
>> Right, but my statement is really more that the device PM registers
>> cannot be used to put the device into D3cold, so the write of the PM
>> register that we're trapping was the user/guest's intention to put the
>> device into D3hot. We therefore need to be careful about differences
>> in the resulting device state when it comes out of D3cold vs D3hot.
>>>>> But there are few drivers which does not use the PCI framework
>>> generic power related routines during runtime suspend/system suspend
>>> and set the PCI power state directly with D3hot.
>>
>> Current vfio-pci being one of those ;)
>>
>>> Also, the guest can be non-Linux OS also and, in that case,
>>> it will be difficult to know the behavior. So, it may impact
>>> these cases.
>>
>> That's what I'm worried about.
>>
>>>> For instance, a device may report NoSoftRst- indicating that the device
>>>> does not do a soft reset on D3hot->D0 transition. If we're instead
>>>> putting the device in D3cold, then a transition back to D0 has very
>>>> much undergone a reset. On one hand we should at least virtualize the
>>>> NoSoftRst bit to allow the guest to restore the device, but I wonder if
>>>> that's really safe. Is a better option to prevent entering D3cold if
>>>> the device isn't natively reporting NoSoftRst-?
>>>>
>>>
>>> You mean to say NoSoftRst+ instead of NoSoftRst- as visible in
>>
>> Oops yes. The concern is if the user/guest is not expecting a soft
>> reset when using D3hot, but we transparently promote D3hot to D3cold
>> which will always implies a device reset.
>>
>>> the lspci output. For NoSoftRst- case, we do a soft reset on
>>> D3hot->D0 transition. But, will this case not be handled internally
>>> in drivers/pci/pci-driver.c ? For both system suspend and runtime suspend,
>>> we check for pci_dev->state_saved flag and do pci_save_state()
>>> irrespective of NoSoftRst bit. For NoSoftRst- case, pci_restore_bars()
>>> will be called in pci_raw_set_power_state() which will reinitialize device
>>> for D3hot/D3cold-> D0 case. Once the device is initialized in the host,
>>> then for guest, it should work without re-initializing again in the
>>> guest side. I am not sure, if my understanding is correct.
>>
>> The soft reset is not limited to the state that the PCI subsystem can
>> save and restore. Device specific state that the user/guest may
>> legitimately expect to be retained may be reset as well.
>>
>> [PCIe v5 5.3.1.4]
>> Functional context is required to be maintained by Functions in
>> the D3 hot state if the No_Soft_Reset field in the PMCSR is Set.
>>
>> Unfortunately I don't see a specific definition of "functional
>> context", but I interpret that to include device specific state. For
>> example, if a GPU contains specific frame buffer data and reports
>> NoSoftRst+, wouldn't it be reasonable to expect that framebuffer data
>> to be retained on D3hot->D0 transition?
>>

Thanks Alex for your inputs.

I got your point. Yes. That can happen.

>>>> We're also essentially making a policy decision on behalf of
>>>> userspace that favors power saving over latency. Is that
>>>> universally the correct trade-off?
>>>
>>> For most drivers, the D3hot vs D3cold should not be favored due
>>> to latency reasons. In the linux kernel side, I am seeing, the
>>> PCI framework try to use D3cold state if platform and device
>>> supports that. But its correct that covertly replacing D3hot with
>>> D3cold may be concern for some drivers.
>>>
>>>> I can imagine this could be desirable for many use cases,
>>>> but if we're going to covertly replace D3hot with D3cold, it seems
>>>> like there should be an opt-in. Is co-opting the PM capability for
>>>> this even really acceptable or should there be a device ioctl to
>>>> request D3cold and plumbing through QEMU such that a VM guest can
>>>> make informed choices regarding device power management?
>>>>
>>>
>>> Making IOCTL is also an option but that case, this support needs to
>>> be added in all hypervisors and user must pass this information
>>> explicitly for each device. Another option could be to use
>>> module parameter to explicitly enable D3cold support. If module
>>> parameter is not set, then we can call pci_d3cold_disable() and
>>> in that case, runtime PM should not use D3cold state.
>>>
>>> Also, I was checking we can pass this information though some
>>> virtualized register bit which will be only defined for passing
>>> the information between guest and host. In the guest side if the
>>> target state is being decided with pci_target_state(), then
>>> the D3cold vs D3hot should not matter for the driver running
>>> in the guest side and in that case, it depends upon platform support.
>>> We can set this virtualize bit to 1. But, if driver is either
>>> setting D3hot state explicitly or has called pci_d3cold_disable() or
>>> similar API available in the guest OS, then set this bit to 0 and
>>> in that case, the D3cold state can be disabled in the host side.
>>> But don't know if is possible to use some non PCI defined
>>> virtualized register bit.
>>
>> If you're suggesting a device config space register, that's troublesome
>> because we can't guarantee that simply because a range of config space
>> isn't within a capability that it doesn't have some device specific
>> purpose. However, we could certainly implement virtual registers in
>> the hypervisor that implement the ACPI support that an OS would use on
>> bare metal to implement D3cold. Those could trigger this ioctl through
>> the vfio device.
>>

Yes, I was suggesting some bits in the PM_CTRL register.
We can identify bits which are currently unused, like bits 22 and 23:

23:22 Undefined - these bits were defined in previous specifications.
They should be ignored by software.

and virtualize these bits for passing D3cold-related information
from guest to host. These bits were defined for the PCI-to-PCI
bridge earlier. But this would be a hack and would cause issues
if the PCI specification starts using these bits in future
revisions. Also, it would need changes in the other OSes. So, it
won't be a good idea.

>>> I am not sure what should be best option to make choice
>>> regarding d3cold but if we can have some option by which this
>>> can be done without involvement of user, then it will benefit
>>> for lot of cases. Currently, the D3cold is supported only in
>>> very few desktops/servers but in future, we will see on
>>> most of the platforms.
>>
>> I tend to see it as an interesting hack to promote D3hot to D3cold, and
>> potentially very useful. However, we're also introducing potentially
>> unexpected device behavior, so I think it would probably need to be an
>> opt-in. Possibly if the device reports NoSoftRst- we could use it by
>> default, but even virtualizing the NoSoftRst suggests that there's an
>> expectation that the guest driver has that support available.
>>

Sure. Based on the discussion, the best option is to provide an
ioctl to the user for the transition to D3cold. The hypervisor can
implement the required ACPI power resources for D3hot and D0, and
once the guest OS calls these power resource methods, it can invoke
the ioctl into the host-side vfio-pci driver.

The D0/D1/D2/D3hot transitions can happen in the existing way, by
directly writing into the PM config register, and the ioctl will
handle the transition from D3hot to D3cold via the runtime PM
framework, roughly as sketched below.
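
The host side of such an ioctl could reuse the runtime PM usage
counter as in this series; the function names here are placeholders,
and the actual ioctl/UAPI definition still needs to be worked out:

#include <linux/pm_runtime.h>
#include <linux/vfio_pci_core.h>

static int vfio_pci_example_enter_d3cold(struct vfio_pci_core_device *vdev)
{
        /*
         * Drop the runtime PM usage count so the framework can move
         * the device (and its upstream bridge) to the lowest state
         * the platform supports, D3cold where available.
         */
        pm_runtime_put(&vdev->pdev->dev);
        return 0;
}

static int vfio_pci_example_exit_d3cold(struct vfio_pci_core_device *vdev)
{
        /* Take the usage count back and resume the device to D0 */
        return pm_runtime_resume_and_get(&vdev->pdev->dev);
}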

I will do more analysis on this by making the code changes and
will update.

>>>> Also if the device is not responsive to config space due to the user
>>>> placing it in D3 now, I'd expect there are other ioctl paths that
>>>> need to be blocked, maybe even MMIO paths that might be a gap for
>>>> existing D3hot support. Thanks,
>>>
>>> I was in assumption that most of IOCTL code will be called by the
>>> hypervisor before guest OS boot and during that time, the device
>>> will be always in D0. But, if we have paths where IOCTL can be
>>> called when the device has been suspended by guest OS, then can we
>>> use runtime_get/put API’s there also ?
>>
>> It's more a matter of preventing user actions that can cause harm
>> rather than expecting certain operations only in specific states. We
>> could chose to either resume the device for those operations or fail
>> the operation. We should probably also leverage the memory-disable
>> support to fault mmap access to MMIO when the device is in D3* as well.
>

Sure. I can explore this as well.

> It also occurred to me last night that a guest triggering D3hot via the
> PM registers must be a synchronous power state change, we can't use
> auto-suspend. This is necessary for nested assignment where the guest
> might use a D3hot->D0 power state transition with NoSoftRst- devices in
> order to perform a reset of the device. With auto-suspend, the guest
> would return the device to D0 before the physical device ever timed out
> to enter a D3 state. Thanks,
>
> Alex
>

Okay. I think if we can use the ioctl-based way to trigger D3cold,
then autosuspend should not be required. Also, in that case, the
transition to D3hot can happen in the existing way, by writing
directly into the PCI PM register. I will validate this use case
after making the changes for the ioctl-based design. I will make
the changes in QEMU locally to validate them.

Thanks,
Abhishek