2022-05-18 11:21:02

by Abhishek Sahu

[permalink] [raw]
Subject: [PATCH v5 0/4] vfio/pci: power management changes

Currently, there is very limited power management support available
in the upstream vfio-pci driver. If there is no user of vfio-pci device,
then it will be moved into D3Hot state. Similarly, if we enable the
runtime power management for vfio-pci device in the guest OS, then the
device is being runtime suspended (for linux guest OS) and the PCI
device will be put into D3hot state (in function
vfio_pm_config_write()). If the D3cold state can be used instead of
D3hot, then it will help in saving maximum power. The D3cold state can't
be possible with native PCI PM. It requires interaction with platform
firmware which is system-specific. To go into low power states
(including D3cold), the runtime PM framework can be used which
internally interacts with PCI and platform firmware and puts the device
into the lowest possible D-States.

This patch series registers the vfio-pci driver with runtime
PM framework and uses the same for moving the physical PCI
device to go into the low power state for unused idle devices.
There will be separate patch series that will add the support
for using runtime PM framework for used idle devices.

The current PM support was added with commit 6eb7018705de ("vfio-pci:
Move idle devices to D3hot power state") where the following point was
mentioned regarding D3cold state.

"It's tempting to try to use D3cold, but we have no reason to inhibit
hotplug of idle devices and we might get into a loop of having the
device disappear before we have a chance to try to use it."

With the runtime PM, if the user want to prevent going into D3cold then
/sys/bus/pci/devices/.../d3cold_allowed can be set to 0 for the
devices where the above functionality is required instead of
disallowing the D3cold state for all the cases.

The BAR access needs to be disabled if device is in D3hot state.
Also, there should not be any config access if device is in D3cold
state. For SR-IOV, the PF power state should be higher than VF's power
state.

* Changes in v5

- Rebased over https://github.com/awilliam/linux-vfio/tree/next.
- Renamed vfio_pci_lock_and_set_power_state() to
vfio_lock_and_set_power_state() and made it static.
- Inside vfio_pci_core_sriov_configure(), protected setting of
power state and sriov enablement with 'memory_lock'.
- Removed CONFIG_PM macro use since it is not needed with current
code.

* Changes in v4
(https://lore.kernel.org/lkml/[email protected])

- Rebased over https://github.com/awilliam/linux-vfio/tree/next.
- Split the patch series into 2 parts. This part contains the patches
for using runtime PM for unused idle device.
- Used the 'pdev->current_state' for checking if the device in D3 state.
- Adds the check in __vfio_pci_memory_enabled() function itself instead
of adding power state check at each caller.
- Make vfio_pci_lock_and_set_power_state() global since it is needed
in different files.
- Used vfio_pci_lock_and_set_power_state() instead of
vfio_pci_set_power_state() before pci_enable_sriov().
- Inside vfio_pci_core_sriov_configure(), handled both the cases
(the device is in low power state with and without user).
- Used list_for_each_entry_continue_reverse() in
vfio_pci_dev_set_pm_runtime_get().

* Changes in v3
(https://lore.kernel.org/lkml/[email protected])

- Rebased patches on v5.18-rc3.
- Marked this series as PATCH instead of RFC.
- Addressed the review comments given in v2.
- Removed the limitation to keep device in D0 state if there is any
access from host side. This is specific to NVIDIA use case and
will be handled separately.
- Used the existing DEVICE_FEATURE IOCTL itself instead of adding new
IOCTL for power management.
- Removed all custom code related with power management in runtime
suspend/resume callbacks and IOCTL handling. Now, the callbacks
contain code related with INTx handling and few other stuffs and
all the PCI state and platform PM handling will be done by PCI core
functions itself.
- Add the support of wake-up in main vfio layer itself since now we have
more vfio/pci based drivers.
- Instead of assigning the 'struct dev_pm_ops' in individual parent
driver, now the vfio_pci_core tself assigns the 'struct dev_pm_ops'.
- Added handling of power management around SR-IOV handling.
- Moved the setting of drvdata in a separate patch.
- Masked INTx before during runtime suspended state.
- Changed the order of patches so that Fix related things are at beginning
of this patch series.
- Removed storing the power state locally and used one new boolean to
track the d3 (D3cold and D3hot) power state
- Removed check for IO access in D3 power state.
- Used another helper function vfio_lock_and_set_power_state() instead
of touching vfio_pci_set_power_state().
- Considered the fixes made in
https://lore.kernel.org/lkml/[email protected]
and updated the patches accordingly.

* Changes in v2
(https://lore.kernel.org/lkml/[email protected])

- Rebased patches on v5.17-rc1.
- Included the patch to handle BAR access in D3cold.
- Included the patch to fix memory leak.
- Made a separate IOCTL that can be used to change the power state from
D3hot to D3cold and D3cold to D0.
- Addressed the review comments given in v1.

* v1
https://lore.kernel.org/lkml/[email protected]/

Abhishek Sahu (4):
vfio/pci: Invalidate mmaps and block the access in D3hot power state
vfio/pci: Change the PF power state to D0 before enabling VFs
vfio/pci: Virtualize PME related registers bits and initialize to zero
vfio/pci: Move the unused device into low power state with runtime PM

drivers/vfio/pci/vfio_pci_config.c | 56 ++++++++-
drivers/vfio/pci/vfio_pci_core.c | 178 ++++++++++++++++++++---------
2 files changed, 178 insertions(+), 56 deletions(-)

--
2.17.1



2022-05-18 11:21:05

by Abhishek Sahu

[permalink] [raw]
Subject: [PATCH v5 4/4] vfio/pci: Move the unused device into low power state with runtime PM

Currently, there is very limited power management support
available in the upstream vfio_pci_core based drivers. If there
are no users of the device, then the PCI device will be moved into
D3hot state by writing directly into PCI PM registers. This D3hot
state help in saving power but we can achieve zero power consumption
if we go into the D3cold state. The D3cold state cannot be possible
with native PCI PM. It requires interaction with platform firmware
which is system-specific. To go into low power states (including D3cold),
the runtime PM framework can be used which internally interacts with PCI
and platform firmware and puts the device into the lowest possible
D-States.

This patch registers vfio_pci_core based drivers with the
runtime PM framework.

1. The PCI core framework takes care of most of the runtime PM
related things. For enabling the runtime PM, the PCI driver needs to
decrement the usage count and needs to provide 'struct dev_pm_ops'
at least. The runtime suspend/resume callbacks are optional and needed
only if we need to do any extra handling. Now there are multiple
vfio_pci_core based drivers. Instead of assigning the
'struct dev_pm_ops' in individual parent driver, the vfio_pci_core
itself assigns the 'struct dev_pm_ops'. There are other drivers where
the 'struct dev_pm_ops' is being assigned inside core layer
(For example, wlcore_probe() and some sound based driver, etc.).

2. This patch provides the stub implementation of 'struct dev_pm_ops'.
The subsequent patch will provide the runtime suspend/resume
callbacks. All the config state saving, and PCI power management
related things will be done by PCI core framework itself inside its
runtime suspend/resume callbacks (pci_pm_runtime_suspend() and
pci_pm_runtime_resume()).

3. Inside pci_reset_bus(), all the devices in dev_set needs to be
runtime resumed. vfio_pci_dev_set_pm_runtime_get() will take
care of the runtime resume and its error handling.

4. Inside vfio_pci_core_disable(), the device usage count always needs
to be decremented which was incremented in vfio_pci_core_enable().

5. Since the runtime PM framework will provide the same functionality,
so directly writing into PCI PM config register can be replaced with
the use of runtime PM routines. Also, the use of runtime PM can help
us in more power saving.

In the systems which do not support D3cold,

With the existing implementation:

// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3hot
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D0

With runtime PM:

// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3hot
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D3hot

So, with runtime PM, the upstream bridge or root port will also go
into lower power state which is not possible with existing
implementation.

In the systems which support D3cold,

// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3hot
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D0

With runtime PM:

// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3cold
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D3cold

So, with runtime PM, both the PCI device and upstream bridge will
go into D3cold state.

6. If 'disable_idle_d3' module parameter is set, then also the runtime
PM will be enabled, but in this case, the usage count should not be
decremented.

7. vfio_pci_dev_set_try_reset() return value is unused now, so this
function return type can be changed to void.

8. Use the runtime PM API's in vfio_pci_core_sriov_configure().
The device can be in low power state either with runtime
power management (when there is no user) or PCI_PM_CTRL register
write by the user. In both the cases, the PF should be moved to
D0 state. For preventing any runtime usage mismatch, pci_num_vf()
has been called explicitly during disable.

Signed-off-by: Abhishek Sahu <[email protected]>
---
drivers/vfio/pci/vfio_pci_core.c | 170 ++++++++++++++++++++-----------
1 file changed, 113 insertions(+), 57 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 9489ceea8875..a0d69ddaf90d 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -156,7 +156,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
}

struct vfio_pci_group_info;
-static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
+static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
struct vfio_pci_group_info *groups);

@@ -259,6 +259,17 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
return ret;
}

+/*
+ * The dev_pm_ops needs to be provided to make pci-driver runtime PM working,
+ * so use structure without any callbacks.
+ *
+ * The pci-driver core runtime PM routines always save the device state
+ * before going into suspended state. If the device is going into low power
+ * state with only with runtime PM ops, then no explicit handling is needed
+ * for the devices which have NoSoftRst-.
+ */
+static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
+
int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
{
struct pci_dev *pdev = vdev->pdev;
@@ -266,21 +277,23 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
u16 cmd;
u8 msix_pos;

- vfio_pci_set_power_state(vdev, PCI_D0);
+ if (!disable_idle_d3) {
+ ret = pm_runtime_resume_and_get(&pdev->dev);
+ if (ret < 0)
+ return ret;
+ }

/* Don't allow our initial saved state to include busmaster */
pci_clear_master(pdev);

ret = pci_enable_device(pdev);
if (ret)
- return ret;
+ goto out_power;

/* If reset fails because of the device lock, fail this path entirely */
ret = pci_try_reset_function(pdev);
- if (ret == -EAGAIN) {
- pci_disable_device(pdev);
- return ret;
- }
+ if (ret == -EAGAIN)
+ goto out_disable_device;

vdev->reset_works = !ret;
pci_save_state(pdev);
@@ -304,12 +317,8 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
}

ret = vfio_config_init(vdev);
- if (ret) {
- kfree(vdev->pci_saved_state);
- vdev->pci_saved_state = NULL;
- pci_disable_device(pdev);
- return ret;
- }
+ if (ret)
+ goto out_free_state;

msix_pos = pdev->msix_cap;
if (msix_pos) {
@@ -330,6 +339,16 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)


return 0;
+
+out_free_state:
+ kfree(vdev->pci_saved_state);
+ vdev->pci_saved_state = NULL;
+out_disable_device:
+ pci_disable_device(pdev);
+out_power:
+ if (!disable_idle_d3)
+ pm_runtime_put(&pdev->dev);
+ return ret;
}
EXPORT_SYMBOL_GPL(vfio_pci_core_enable);

@@ -437,8 +456,11 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
out:
pci_disable_device(pdev);

- if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3)
- vfio_pci_set_power_state(vdev, PCI_D3hot);
+ vfio_pci_dev_set_try_reset(vdev->vdev.dev_set);
+
+ /* Put the pm-runtime usage counter acquired during enable */
+ if (!disable_idle_d3)
+ pm_runtime_put(&pdev->dev);
}
EXPORT_SYMBOL_GPL(vfio_pci_core_disable);

@@ -1823,10 +1845,11 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_uninit_device);
int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
{
struct pci_dev *pdev = vdev->pdev;
+ struct device *dev = &pdev->dev;
int ret;

/* Drivers must set the vfio_pci_core_device to their drvdata */
- if (WARN_ON(vdev != dev_get_drvdata(&vdev->pdev->dev)))
+ if (WARN_ON(vdev != dev_get_drvdata(dev)))
return -EINVAL;

if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
@@ -1868,19 +1891,21 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)

vfio_pci_probe_power_state(vdev);

- if (!disable_idle_d3) {
- /*
- * pci-core sets the device power state to an unknown value at
- * bootup and after being removed from a driver. The only
- * transition it allows from this unknown state is to D0, which
- * typically happens when a driver calls pci_enable_device().
- * We're not ready to enable the device yet, but we do want to
- * be able to get to D3. Therefore first do a D0 transition
- * before going to D3.
- */
- vfio_pci_set_power_state(vdev, PCI_D0);
- vfio_pci_set_power_state(vdev, PCI_D3hot);
- }
+ /*
+ * pci-core sets the device power state to an unknown value at
+ * bootup and after being removed from a driver. The only
+ * transition it allows from this unknown state is to D0, which
+ * typically happens when a driver calls pci_enable_device().
+ * We're not ready to enable the device yet, but we do want to
+ * be able to get to D3. Therefore first do a D0 transition
+ * before enabling runtime PM.
+ */
+ vfio_pci_set_power_state(vdev, PCI_D0);
+
+ dev->driver->pm = &vfio_pci_core_pm_ops;
+ pm_runtime_allow(dev);
+ if (!disable_idle_d3)
+ pm_runtime_put(dev);

ret = vfio_register_group_dev(&vdev->vdev);
if (ret)
@@ -1889,7 +1914,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)

out_power:
if (!disable_idle_d3)
- vfio_pci_set_power_state(vdev, PCI_D0);
+ pm_runtime_get_noresume(dev);
+
+ pm_runtime_forbid(dev);
out_vf:
vfio_pci_vf_uninit(vdev);
return ret;
@@ -1906,7 +1933,9 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
vfio_pci_vga_uninit(vdev);

if (!disable_idle_d3)
- vfio_pci_set_power_state(vdev, PCI_D0);
+ pm_runtime_get_noresume(&vdev->pdev->dev);
+
+ pm_runtime_forbid(&vdev->pdev->dev);
}
EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);

@@ -1951,22 +1980,33 @@ int vfio_pci_core_sriov_configure(struct vfio_pci_core_device *vdev,

/*
* The PF power state should always be higher than the VF power
- * state. If PF is in the low power state, then change the
- * power state to D0 first before enabling SR-IOV.
- * Also, this function can be called at any time, and userspace
- * PCI_PM_CTRL write can race against this code path,
+ * state. The PF can be in low power state either with runtime
+ * power management (when there is no user) or PCI_PM_CTRL
+ * register write by the user. If PF is in the low power state,
+ * then change the power state to D0 first before enabling
+ * SR-IOV. Also, this function can be called at any time, and
+ * userspace PCI_PM_CTRL write can race against this code path,
* so protect the same with 'memory_lock'.
*/
+ ret = pm_runtime_resume_and_get(&pdev->dev);
+ if (ret)
+ goto out_del;
+
down_write(&vdev->memory_lock);
vfio_pci_set_power_state(vdev, PCI_D0);
ret = pci_enable_sriov(pdev, nr_virtfn);
up_write(&vdev->memory_lock);
- if (ret)
+ if (ret) {
+ pm_runtime_put(&pdev->dev);
goto out_del;
+ }
return nr_virtfn;
}

- pci_disable_sriov(pdev);
+ if (pci_num_vf(pdev)) {
+ pci_disable_sriov(pdev);
+ pm_runtime_put(&pdev->dev);
+ }

out_del:
mutex_lock(&vfio_pci_sriov_pfs_mutex);
@@ -2041,6 +2081,27 @@ vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set)
return pdev;
}

+static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
+{
+ struct vfio_pci_core_device *cur;
+ int ret;
+
+ list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
+ ret = pm_runtime_resume_and_get(&cur->pdev->dev);
+ if (ret)
+ goto unwind;
+ }
+
+ return 0;
+
+unwind:
+ list_for_each_entry_continue_reverse(cur, &dev_set->device_list,
+ vdev.dev_set_list)
+ pm_runtime_put(&cur->pdev->dev);
+
+ return ret;
+}
+
/*
* We need to get memory_lock for each device, but devices can share mmap_lock,
* therefore we need to zap and hold the vma_lock for each device, and only then
@@ -2147,43 +2208,38 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
* - At least one of the affected devices is marked dirty via
* needs_reset (such as by lack of FLR support)
* Then attempt to perform that bus or slot reset.
- * Returns true if the dev_set was reset.
*/
-static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
+static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
{
struct vfio_pci_core_device *cur;
struct pci_dev *pdev;
- int ret;
+ bool reset_done = false;

if (!vfio_pci_dev_set_needs_reset(dev_set))
- return false;
+ return;

pdev = vfio_pci_dev_set_resettable(dev_set);
if (!pdev)
- return false;
+ return;

/*
- * The pci_reset_bus() will reset all the devices in the bus.
- * The power state can be non-D0 for some of the devices in the bus.
- * For these devices, the pci_reset_bus() will internally set
- * the power state to D0 without vfio driver involvement.
- * For the devices which have NoSoftRst-, the reset function can
- * cause the PCI config space reset without restoring the original
- * state (saved locally in 'vdev->pm_save').
+ * Some of the devices in the bus can be in the runtime suspended
+ * state. Increment the usage count for all the devices in the dev_set
+ * before reset and decrement the same after reset.
*/
- list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
- vfio_pci_set_power_state(cur, PCI_D0);
+ if (!disable_idle_d3 && vfio_pci_dev_set_pm_runtime_get(dev_set))
+ return;

- ret = pci_reset_bus(pdev);
- if (ret)
- return false;
+ if (!pci_reset_bus(pdev))
+ reset_done = true;

list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
- cur->needs_reset = false;
+ if (reset_done)
+ cur->needs_reset = false;
+
if (!disable_idle_d3)
- vfio_pci_set_power_state(cur, PCI_D3hot);
+ pm_runtime_put(&cur->pdev->dev);
}
- return true;
}

void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
--
2.17.1


2022-05-18 17:55:14

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v5 0/4] vfio/pci: power management changes

On Wed, 18 May 2022 16:46:08 +0530
Abhishek Sahu <[email protected]> wrote:

> Currently, there is very limited power management support available
> in the upstream vfio-pci driver. If there is no user of vfio-pci device,
> then it will be moved into D3Hot state. Similarly, if we enable the
> runtime power management for vfio-pci device in the guest OS, then the
> device is being runtime suspended (for linux guest OS) and the PCI
> device will be put into D3hot state (in function
> vfio_pm_config_write()). If the D3cold state can be used instead of
> D3hot, then it will help in saving maximum power. The D3cold state can't
> be possible with native PCI PM. It requires interaction with platform
> firmware which is system-specific. To go into low power states
> (including D3cold), the runtime PM framework can be used which
> internally interacts with PCI and platform firmware and puts the device
> into the lowest possible D-States.
>
> This patch series registers the vfio-pci driver with runtime
> PM framework and uses the same for moving the physical PCI
> device to go into the low power state for unused idle devices.
> There will be separate patch series that will add the support
> for using runtime PM framework for used idle devices.
>
> The current PM support was added with commit 6eb7018705de ("vfio-pci:
> Move idle devices to D3hot power state") where the following point was
> mentioned regarding D3cold state.
>
> "It's tempting to try to use D3cold, but we have no reason to inhibit
> hotplug of idle devices and we might get into a loop of having the
> device disappear before we have a chance to try to use it."
>
> With the runtime PM, if the user want to prevent going into D3cold then
> /sys/bus/pci/devices/.../d3cold_allowed can be set to 0 for the
> devices where the above functionality is required instead of
> disallowing the D3cold state for all the cases.
>
> The BAR access needs to be disabled if device is in D3hot state.
> Also, there should not be any config access if device is in D3cold
> state. For SR-IOV, the PF power state should be higher than VF's power
> state.
>
> * Changes in v5
>
> - Rebased over https://github.com/awilliam/linux-vfio/tree/next.
> - Renamed vfio_pci_lock_and_set_power_state() to
> vfio_lock_and_set_power_state() and made it static.
> - Inside vfio_pci_core_sriov_configure(), protected setting of
> power state and sriov enablement with 'memory_lock'.
> - Removed CONFIG_PM macro use since it is not needed with current
> code.

Applied to vfio next branch for v5.19. Thanks!

Alex

> * Changes in v4
> (https://lore.kernel.org/lkml/[email protected])
>
> - Rebased over https://github.com/awilliam/linux-vfio/tree/next.
> - Split the patch series into 2 parts. This part contains the patches
> for using runtime PM for unused idle device.
> - Used the 'pdev->current_state' for checking if the device in D3 state.
> - Adds the check in __vfio_pci_memory_enabled() function itself instead
> of adding power state check at each caller.
> - Make vfio_pci_lock_and_set_power_state() global since it is needed
> in different files.
> - Used vfio_pci_lock_and_set_power_state() instead of
> vfio_pci_set_power_state() before pci_enable_sriov().
> - Inside vfio_pci_core_sriov_configure(), handled both the cases
> (the device is in low power state with and without user).
> - Used list_for_each_entry_continue_reverse() in
> vfio_pci_dev_set_pm_runtime_get().
>
> * Changes in v3
> (https://lore.kernel.org/lkml/[email protected])
>
> - Rebased patches on v5.18-rc3.
> - Marked this series as PATCH instead of RFC.
> - Addressed the review comments given in v2.
> - Removed the limitation to keep device in D0 state if there is any
> access from host side. This is specific to NVIDIA use case and
> will be handled separately.
> - Used the existing DEVICE_FEATURE IOCTL itself instead of adding new
> IOCTL for power management.
> - Removed all custom code related with power management in runtime
> suspend/resume callbacks and IOCTL handling. Now, the callbacks
> contain code related with INTx handling and few other stuffs and
> all the PCI state and platform PM handling will be done by PCI core
> functions itself.
> - Add the support of wake-up in main vfio layer itself since now we have
> more vfio/pci based drivers.
> - Instead of assigning the 'struct dev_pm_ops' in individual parent
> driver, now the vfio_pci_core tself assigns the 'struct dev_pm_ops'.
> - Added handling of power management around SR-IOV handling.
> - Moved the setting of drvdata in a separate patch.
> - Masked INTx before during runtime suspended state.
> - Changed the order of patches so that Fix related things are at beginning
> of this patch series.
> - Removed storing the power state locally and used one new boolean to
> track the d3 (D3cold and D3hot) power state
> - Removed check for IO access in D3 power state.
> - Used another helper function vfio_lock_and_set_power_state() instead
> of touching vfio_pci_set_power_state().
> - Considered the fixes made in
> https://lore.kernel.org/lkml/[email protected]
> and updated the patches accordingly.
>
> * Changes in v2
> (https://lore.kernel.org/lkml/[email protected])
>
> - Rebased patches on v5.17-rc1.
> - Included the patch to handle BAR access in D3cold.
> - Included the patch to fix memory leak.
> - Made a separate IOCTL that can be used to change the power state from
> D3hot to D3cold and D3cold to D0.
> - Addressed the review comments given in v1.
>
> * v1
> https://lore.kernel.org/lkml/[email protected]/
>
> Abhishek Sahu (4):
> vfio/pci: Invalidate mmaps and block the access in D3hot power state
> vfio/pci: Change the PF power state to D0 before enabling VFs
> vfio/pci: Virtualize PME related registers bits and initialize to zero
> vfio/pci: Move the unused device into low power state with runtime PM
>
> drivers/vfio/pci/vfio_pci_config.c | 56 ++++++++-
> drivers/vfio/pci/vfio_pci_core.c | 178 ++++++++++++++++++++---------
> 2 files changed, 178 insertions(+), 56 deletions(-)
>


2022-05-19 21:42:36

by Abhishek Sahu

[permalink] [raw]
Subject: Re: [PATCH v5 0/4] vfio/pci: power management changes

On 5/18/2022 11:21 PM, Alex Williamson wrote:
> On Wed, 18 May 2022 16:46:08 +0530
> Abhishek Sahu <[email protected]> wrote:
>
>> Currently, there is very limited power management support available
>> in the upstream vfio-pci driver. If there is no user of vfio-pci device,
>> then it will be moved into D3Hot state. Similarly, if we enable the
>> runtime power management for vfio-pci device in the guest OS, then the
>> device is being runtime suspended (for linux guest OS) and the PCI
>> device will be put into D3hot state (in function
>> vfio_pm_config_write()). If the D3cold state can be used instead of
>> D3hot, then it will help in saving maximum power. The D3cold state can't
>> be possible with native PCI PM. It requires interaction with platform
>> firmware which is system-specific. To go into low power states
>> (including D3cold), the runtime PM framework can be used which
>> internally interacts with PCI and platform firmware and puts the device
>> into the lowest possible D-States.
>>
>> This patch series registers the vfio-pci driver with runtime
>> PM framework and uses the same for moving the physical PCI
>> device to go into the low power state for unused idle devices.
>> There will be separate patch series that will add the support
>> for using runtime PM framework for used idle devices.
>>
>> The current PM support was added with commit 6eb7018705de ("vfio-pci:
>> Move idle devices to D3hot power state") where the following point was
>> mentioned regarding D3cold state.
>>
>> "It's tempting to try to use D3cold, but we have no reason to inhibit
>> hotplug of idle devices and we might get into a loop of having the
>> device disappear before we have a chance to try to use it."
>>
>> With the runtime PM, if the user want to prevent going into D3cold then
>> /sys/bus/pci/devices/.../d3cold_allowed can be set to 0 for the
>> devices where the above functionality is required instead of
>> disallowing the D3cold state for all the cases.
>>
>> The BAR access needs to be disabled if device is in D3hot state.
>> Also, there should not be any config access if device is in D3cold
>> state. For SR-IOV, the PF power state should be higher than VF's power
>> state.
>>
>> * Changes in v5
>>
>> - Rebased over https://github.com/awilliam/linux-vfio/tree/next.
>> - Renamed vfio_pci_lock_and_set_power_state() to
>> vfio_lock_and_set_power_state() and made it static.
>> - Inside vfio_pci_core_sriov_configure(), protected setting of
>> power state and sriov enablement with 'memory_lock'.
>> - Removed CONFIG_PM macro use since it is not needed with current
>> code.
>
> Applied to vfio next branch for v5.19. Thanks!
>
> Alex
>

Thanks Alex for your thorough review and support in getting
this series merged. I will start exploring for the second part
and will find out a generic way to support all the use cases.

Regards,
Abhishek

>> * Changes in v4
>> (https://lore.kernel.org/lkml/[email protected])
>>
>> - Rebased over https://github.com/awilliam/linux-vfio/tree/next.
>> - Split the patch series into 2 parts. This part contains the patches
>> for using runtime PM for unused idle device.
>> - Used the 'pdev->current_state' for checking if the device in D3 state.
>> - Adds the check in __vfio_pci_memory_enabled() function itself instead
>> of adding power state check at each caller.
>> - Make vfio_pci_lock_and_set_power_state() global since it is needed
>> in different files.
>> - Used vfio_pci_lock_and_set_power_state() instead of
>> vfio_pci_set_power_state() before pci_enable_sriov().
>> - Inside vfio_pci_core_sriov_configure(), handled both the cases
>> (the device is in low power state with and without user).
>> - Used list_for_each_entry_continue_reverse() in
>> vfio_pci_dev_set_pm_runtime_get().
>>
>> * Changes in v3
>> (https://lore.kernel.org/lkml/[email protected])
>>
>> - Rebased patches on v5.18-rc3.
>> - Marked this series as PATCH instead of RFC.
>> - Addressed the review comments given in v2.
>> - Removed the limitation to keep device in D0 state if there is any
>> access from host side. This is specific to NVIDIA use case and
>> will be handled separately.
>> - Used the existing DEVICE_FEATURE IOCTL itself instead of adding new
>> IOCTL for power management.
>> - Removed all custom code related with power management in runtime
>> suspend/resume callbacks and IOCTL handling. Now, the callbacks
>> contain code related with INTx handling and few other stuffs and
>> all the PCI state and platform PM handling will be done by PCI core
>> functions itself.
>> - Add the support of wake-up in main vfio layer itself since now we have
>> more vfio/pci based drivers.
>> - Instead of assigning the 'struct dev_pm_ops' in individual parent
>> driver, now the vfio_pci_core tself assigns the 'struct dev_pm_ops'.
>> - Added handling of power management around SR-IOV handling.
>> - Moved the setting of drvdata in a separate patch.
>> - Masked INTx before during runtime suspended state.
>> - Changed the order of patches so that Fix related things are at beginning
>> of this patch series.
>> - Removed storing the power state locally and used one new boolean to
>> track the d3 (D3cold and D3hot) power state
>> - Removed check for IO access in D3 power state.
>> - Used another helper function vfio_lock_and_set_power_state() instead
>> of touching vfio_pci_set_power_state().
>> - Considered the fixes made in
>> https://lore.kernel.org/lkml/[email protected]
>> and updated the patches accordingly.
>>
>> * Changes in v2
>> (https://lore.kernel.org/lkml/[email protected])
>>
>> - Rebased patches on v5.17-rc1.
>> - Included the patch to handle BAR access in D3cold.
>> - Included the patch to fix memory leak.
>> - Made a separate IOCTL that can be used to change the power state from
>> D3hot to D3cold and D3cold to D0.
>> - Addressed the review comments given in v1.
>>
>> * v1
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> Abhishek Sahu (4):
>> vfio/pci: Invalidate mmaps and block the access in D3hot power state
>> vfio/pci: Change the PF power state to D0 before enabling VFs
>> vfio/pci: Virtualize PME related registers bits and initialize to zero
>> vfio/pci: Move the unused device into low power state with runtime PM
>>
>> drivers/vfio/pci/vfio_pci_config.c | 56 ++++++++-
>> drivers/vfio/pci/vfio_pci_core.c | 178 ++++++++++++++++++++---------
>> 2 files changed, 178 insertions(+), 56 deletions(-)
>>
>