2022-08-29 12:35:54

by Abhishek Sahu

Subject: [PATCH v7 0/5] vfio/pci: power management changes

This is part 2 of the vfio-pci driver power management support.
Part 1 of this patch series added D3cold support for the case where
there is no user of the VFIO device, and it has already been merged
into the mainline kernel. If runtime power management is enabled for
a vfio-pci device in the guest OS, the guest runtime suspends the
device (for a Linux guest OS) and the PCI device is put into the
D3hot state (in function vfio_pm_config_write()). Using the D3cold
state instead of D3hot would save the most power. D3cold cannot be
reached with native PCI PM; it requires interaction with platform
firmware, which is system-specific. To enter low power states
(including D3cold), the runtime PM framework can be used: it
internally interacts with PCI and platform firmware and puts the
device into the lowest possible D-state.

This patch series adds support for engaging runtime power management
at the user's request. Since the D3cold state can't be reached by
writing the standard PCI PM config registers, new device features have
been added to the DEVICE_FEATURE IOCTL to handle low power entry and
exit. For a PCI device, this low power state will be D3cold
(if the platform supports the D3cold state). Hypervisors can implement
virtual ACPI methods to integrate with the guest OS.
For example, if the PCI device's ACPI node in a Linux guest has
_PR3 and _PR0 power resources with _ON/_OFF methods, then the guest
calls _OFF during the D3cold transition and _ON during the D0
transition. The hypervisor can trap these virtual ACPI calls and then
issue the corresponding low power IOCTL.

The entry device feature has two variants, which mainly differ in how
the device returns to low power after a wake-up.
If there is any access to the VFIO device on the host side, the
device will be moved out of the low power state without involving the
user's guest driver. Some devices (for example NVIDIA VGA or
3D controller) require the user's guest driver involvement for
each low-power entry. In the first variant, the host can move the
device into low power without any guest driver involvement, while
in the second variant, the host sends a notification to the user
through an eventfd and the user's guest driver then needs to move the
device into low power. The hypervisor can implement virtual PME
support to notify the guest OS. Please refer to
https://lore.kernel.org/lkml/[email protected]/
where this virtual PME was initially implemented in the vfio-pci driver
itself; it was later decided that the hypervisor can implement
this.
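
As a rough illustration of the flow described above, a hypervisor's choice
between the three device features could be sketched as follows. This is only
a sketch under stated assumptions: the enum names, feature_for_acpi_call() and
the needs_guest_reentry flag are hypothetical and not part of this series;
only the feature numbers 3, 4 and 5 are taken from patch 1/5.

```c
#include <stdbool.h>

/* Feature numbers as defined in patch 1/5 (names shortened for the sketch). */
enum low_power_feature {
	LP_ENTRY = 3,             /* VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY */
	LP_ENTRY_WITH_WAKEUP = 4, /* VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP */
	LP_EXIT = 5,              /* VFIO_DEVICE_FEATURE_LOW_POWER_EXIT */
};

/* Hypothetical: the virtual ACPI power resource call trapped from the guest. */
enum acpi_pr_call { ACPI_PR_OFF, ACPI_PR_ON };

/*
 * Pick the device feature to issue for a trapped _OFF/_ON call.
 * needs_guest_reentry is a hypothetical per-device policy flag: true for
 * devices whose guest driver must take part in every low power entry.
 */
static enum low_power_feature
feature_for_acpi_call(enum acpi_pr_call call, bool needs_guest_reentry)
{
	if (call == ACPI_PR_ON)
		return LP_EXIT;

	return needs_guest_reentry ? LP_ENTRY_WITH_WAKEUP : LP_ENTRY;
}
```

The policy of which devices need the second variant is left to userspace
here, in line with the v5 change that removed policy code from the kernel.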

* Changes in v7

- Rebased patches over the following patch series
https://lore.kernel.org/all/[email protected]
https://lore.kernel.org/all/[email protected]

- Since is_intx() is now a static function, open-coded it
(s/is_intx()/vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX) and
updated the commit message accordingly.
- Replaced 'void __user *arg' with
'struct vfio_device_low_power_entry_with_wakeup __user *arg'
- Added new device features in sorted order in
vfio_pci_core_ioctl_feature().

* Changes in v6
(https://lore.kernel.org/lkml/[email protected])

- Rebased patches on v6.0-rc1.
- Updated uAPI documentation.
- Fixed return value checking for pm_runtime_resume_and_get().
- Refactored code around low power exit to make it cleaner.
- Updated commit message and comments in a few places.

* Changes in v5
(https://lore.kernel.org/all/[email protected])

- Rebased patches on https://github.com/awilliam/linux-vfio/tree/next.
- Implemented 3 separate device features for the low power entry and exit.
- Dropped virtual PME patch.
- Removed the special handling for power management related device feature
and now all the ioctls will be wrapped under pm_runtime_get/put.
- Refactored code around low power entry and exit.
- Removed all the policy-related code; the same can be implemented in
userspace.
- Renamed 'intx_masked' to 'masked_changed' and updated function comment.
- Changed the order of patches.

* Changes in v4
(https://lore.kernel.org/lkml/[email protected])

- Rebased patches on v5.19-rc4.
- Added virtual PME support.
- Used flags for low power entry and exit instead of explicit variable.
- Added support for keeping NVIDIA display related controllers in the active
state if there is any activity on the host side.
- Added a flag that can be set by the user to keep the device in the active
state if there is any activity on the host side.
- Split the D3cold patch into smaller patches.
- Kept the runtime PM usage count incremented for all the IOCTL
(except power management IOCTL) and all the PCI region access.
- Masked the runtime errors behind -EIO.
- Refactored logic in runtime suspend/resume routine and for power
management device feature IOCTL.
- Added a helper function for pm_runtime_put() as well in
drivers/vfio/vfio.c, using 'struct vfio_device' as the
function parameter.
- Removed the requirement to move the device into D3hot before calling
low power entry.
- Renamed power management related new members in the structure.
- Used 'pm_runtime_engaged' check in __vfio_pci_memory_enabled().

* Changes in v3
(https://lore.kernel.org/lkml/[email protected])

- Rebased patches on v5.18-rc3.
- Marked this series as PATCH instead of RFC.
- Addressed the review comments given in v2.
- Removed the limitation to keep device in D0 state if there is any
access from host side. This is specific to NVIDIA use case and
will be handled separately.
- Used the existing DEVICE_FEATURE IOCTL itself instead of adding new
IOCTL for power management.
- Removed all custom code related to power management in the runtime
suspend/resume callbacks and IOCTL handling. Now, the callbacks
contain code related to INTx handling and a few other things, and
all the PCI state and platform PM handling is done by the PCI core
functions themselves.
- Added wake-up support in the main vfio layer itself since we now have
more vfio/pci based drivers.
- Instead of assigning the 'struct dev_pm_ops' in each individual parent
driver, vfio_pci_core itself now assigns the 'struct dev_pm_ops'.
- Added handling of power management around SR-IOV handling.
- Moved the setting of drvdata in a separate patch.
- Masked INTx during the runtime suspended state.
- Changed the order of patches so that the fixes are at the beginning
of this patch series.
- Removed storing the power state locally and used one new boolean to
track the D3 (D3cold and D3hot) power state.
- Removed check for IO access in D3 power state.
- Used another helper function vfio_lock_and_set_power_state() instead
of touching vfio_pci_set_power_state().
- Considered the fixes made in
https://lore.kernel.org/lkml/[email protected]
and updated the patches accordingly.

* Changes in v2
(https://lore.kernel.org/lkml/[email protected])

- Rebased patches on v5.17-rc1.
- Included the patch to handle BAR access in D3cold.
- Included the patch to fix memory leak.
- Made a separate IOCTL that can be used to change the power state from
D3hot to D3cold and D3cold to D0.
- Addressed the review comments given in v1.

* v1
(https://lore.kernel.org/lkml/[email protected])

Abhishek Sahu (5):
vfio: Add the device features for the low power entry and exit
vfio: Increment the runtime PM usage count during IOCTL call
vfio/pci: Mask INTx during runtime suspend
vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY/EXIT
vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP

drivers/vfio/pci/vfio_pci_core.c | 248 ++++++++++++++++++++++++++++--
drivers/vfio/pci/vfio_pci_intrs.c | 6 +-
drivers/vfio/pci/vfio_pci_priv.h | 2 +-
drivers/vfio/vfio_main.c | 52 ++++++-
include/linux/vfio_pci_core.h | 3 +
include/uapi/linux/vfio.h | 56 +++++++
6 files changed, 350 insertions(+), 17 deletions(-)

--
2.17.1


2022-08-29 12:39:43

by Abhishek Sahu

Subject: [PATCH v7 1/5] vfio: Add the device features for the low power entry and exit

This patch adds the following new device features for the low
power entry and exit in the header file. The implementation for the
same will be added in the subsequent patches.

- VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
- VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
- VFIO_DEVICE_FEATURE_LOW_POWER_EXIT

For vfio-pci based devices, with the standard PCI PM registers,
all power states cannot be achieved. The platform-based power management
needs to be involved to go into the lowest power state. For doing low
power entry and exit with platform-based power management,
these device features can be used.

The entry device feature has two variants. These two variants are mainly
to support the different behaviour for the low power entry.
If there is any access for the VFIO device on the host side, then the
device will be moved out of the low power state without the user's
guest driver involvement. Some devices (for example NVIDIA VGA or
3D controller) require the user's guest driver involvement for
each low-power entry. In the first variant, the host can return the
device to low power automatically. The device will continue to
attempt to reach low power until the low power exit feature is called.
In the second variant, if the device exits low power due to an access,
the host kernel will signal the user via the provided eventfd and will
not return the device to low power without a subsequent call to one of
the low power entry features. A call to the low power exit feature is
optional if the user provided eventfd is signaled.

These device features only support VFIO_DEVICE_FEATURE_SET and
VFIO_DEVICE_FEATURE_PROBE operations.
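
To make the SET and PROBE operations above more concrete, here is a minimal
userspace sketch of how the request buffer for the wake-up variant might be
filled in. It is an illustration, not the series' own code: the struct and
macro names below are local mirrors chosen for the sketch (real code would
include <linux/vfio.h> and use the uAPI names), and build_wakeup_request() is
a hypothetical helper. The flag bit positions follow the existing
VFIO_DEVICE_FEATURE uAPI.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local mirror of struct vfio_device_feature (header + payload). */
struct feature_hdr {
	uint32_t argsz;
	uint32_t flags;
	uint8_t data[];
};

/* Local mirror of struct vfio_device_low_power_entry_with_wakeup. */
struct wakeup_payload {
	int32_t wakeup_eventfd;
	uint32_t reserved;
};

#define FEATURE_SET       (1u << 17) /* mirrors VFIO_DEVICE_FEATURE_SET */
#define FEATURE_PROBE     (1u << 18) /* mirrors VFIO_DEVICE_FEATURE_PROBE */
#define WAKEUP_FEATURE_ID 4u         /* ..._LOW_POWER_ENTRY_WITH_WAKEUP */

/*
 * Allocate and fill a request for the wake-up variant. With probe_only set,
 * the PROBE flag is added to SET so the kernel only reports whether SET is
 * supported. Returns NULL on allocation failure; the caller frees the buffer.
 */
static struct feature_hdr *build_wakeup_request(int wakeup_fd, int probe_only)
{
	size_t size = sizeof(struct feature_hdr) + sizeof(struct wakeup_payload);
	struct wakeup_payload payload = { .wakeup_eventfd = wakeup_fd, .reserved = 0 };
	struct feature_hdr *req = calloc(1, size);

	if (!req)
		return NULL;

	req->argsz = (uint32_t)size;
	req->flags = WAKEUP_FEATURE_ID | FEATURE_SET;
	if (probe_only)
		req->flags |= FEATURE_PROBE;
	memcpy(req->data, &payload, sizeof(payload));

	return req;
}
```

A caller would then pass the buffer to the VFIO_DEVICE_FEATURE ioctl on the
device file descriptor, probing first and issuing the plain SET only if the
probe succeeds.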

Signed-off-by: Abhishek Sahu <[email protected]>
---
include/uapi/linux/vfio.h | 56 +++++++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 733a1cddde30..76a173f973de 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -986,6 +986,62 @@ enum vfio_device_mig_state {
VFIO_DEVICE_STATE_RUNNING_P2P = 5,
};

+/*
+ * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
+ * state with the platform-based power management. Device use of lower power
+ * states depends on factors managed by the runtime power management core,
+ * including system level support and coordinating support among dependent
+ * devices. Enabling device low power entry does not guarantee lower power
+ * usage by the device, nor is a mechanism provided through this feature to
+ * know the current power state of the device. If any device access happens
+ * (either from the host or through the vfio uAPI) when the device is in the
+ * low power state, then the host will move the device out of the low power
+ * state as necessary prior to the access. Once the access is completed, the
+ * device may re-enter the low power state. For single shot low power support
+ * with wake-up notification, see
+ * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below. Access to mmap'd
+ * device regions is disabled on LOW_POWER_ENTRY and may only be resumed after
+ * calling LOW_POWER_EXIT.
+ */
+#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
+
+/*
+ * This device feature has the same behavior as
+ * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
+ * provides an eventfd for wake-up notification. When the device moves out of
+ * the low power state for the wake-up, the host will not allow the device to
+ * re-enter a low power state without a subsequent user call to one of the low
+ * power entry device feature IOCTLs. Access to mmap'd device regions is
+ * disabled on LOW_POWER_ENTRY_WITH_WAKEUP and may only be resumed after the
+ * low power exit. The low power exit can happen either through LOW_POWER_EXIT
+ * or through any other access (where the wake-up notification has been
+ * generated). The access to mmap'd device regions will not trigger low power
+ * exit.
+ *
+ * The notification through the provided eventfd will be generated only when
+ * the device has entered and is resumed from a low power state after
+ * calling this device feature IOCTL. A device that has not entered low power
+ * state, as managed through the runtime power management core, will not
+ * generate a notification through the provided eventfd on access. Calling the
+ * LOW_POWER_EXIT feature is optional in the case where notification has been
+ * signaled on the provided eventfd that a resume from low power has occurred.
+ */
+struct vfio_device_low_power_entry_with_wakeup {
+	__s32 wakeup_eventfd;
+	__u32 reserved;
+};
+
+#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states as
+ * previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
+ * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.
+ * This device feature IOCTL may itself generate a wakeup eventfd notification
+ * in the latter case if the device had previously entered a low power state.
+ */
+#define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5
+
/* -------- API for Type1 VFIO IOMMU -------- */

/**
--
2.17.1
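
The wake-up notification described in the uAPI comments above arrives on an
ordinary eventfd, so the userspace side could look roughly like the sketch
below. Both helper names are hypothetical, and the choice of a non-blocking
eventfd is an assumption of the sketch; the series itself only requires that
the user supply an eventfd.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

/*
 * Create the eventfd that would be passed in wakeup_eventfd.
 * Non-blocking so it can be polled or drained without stalling.
 */
static int create_wakeup_eventfd(void)
{
	return eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
}

/*
 * Drain pending wake-up signals. Returns the number of signals accumulated
 * since the last read, 0 if none are pending, or -1 on any other error.
 * A non-zero result means the device left low power and, for the wake-up
 * variant, will stay out of it until a low power entry feature is called
 * again.
 */
static int64_t drain_wakeup_eventfd(int fd)
{
	uint64_t count;
	ssize_t n = read(fd, &count, sizeof(count));

	if (n == (ssize_t)sizeof(count))
		return (int64_t)count;
	if (n < 0 && errno == EAGAIN)
		return 0;
	return -1;
}
```

In a hypervisor, a non-zero return would be the point at which a virtual PME
or similar event is raised so the guest driver can re-enter low power.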

2022-09-02 08:10:14

by Abhishek Sahu

Subject: Re: [PATCH v7 0/5] vfio/pci: power management changes

> * Changes in v7
>
> - Rebased patches over the following patch series
> https://lore.kernel.org/all/[email protected]
> https://lore.kernel.org/all/[email protected]
>

This patch series also applies cleanly on top of Jason's updated patch
series.

https://lore.kernel.org/all/[email protected]/

Thanks,
Abhishek

> - Since is_intx() is now static function, so open coded the same
> (s/is_intx()/vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX) and
> updated the commit message for the same.
> - Replaced 'void __user *arg' with
> 'struct vfio_device_low_power_entry_with_wakeup __user *arg'
> - Added new device features in sorted order in
> vfio_pci_core_ioctl_feature().

2022-09-02 19:51:25

by Alex Williamson

Subject: Re: [PATCH v7 0/5] vfio/pci: power management changes

On Mon, 29 Aug 2022 17:18:45 +0530
Abhishek Sahu <[email protected]> wrote:

> This is part 2 for the vfio-pci driver power management support.
> Part 1 of this patch series was related to adding D3cold support
> when there is no user of the VFIO device and has already merged in the
> mainline kernel. If we enable the runtime power management for
> vfio-pci device in the guest OS, then the device is being runtime
> suspended (for linux guest OS) and the PCI device will be put into
> D3hot state (in function vfio_pm_config_write()). If the D3cold
> state can be used instead of D3hot, then it will help in saving
> maximum power. The D3cold state can't be possible with native
> PCI PM. It requires interaction with platform firmware which is
> system-specific. To go into low power states (Including D3cold),
> the runtime PM framework can be used which internally interacts
> with PCI and platform firmware and puts the device into the
> lowest possible D-States.
>
> This patch series adds the support to engage runtime power management
> initiated by the user. Since D3cold state can't be achieved by writing
> PCI standard PM config registers, so new device features have been
> added in DEVICE_FEATURE IOCTL for low power entry and exit related
> handling. For the PCI device, this low power state will be D3cold
> (if the platform supports the D3cold state). The hypervisors can implement
> virtual ACPI methods to make the integration with guest OS.
> For example, in guest Linux OS if PCI device ACPI node has
> _PR3 and _PR0 power resources with _ON/_OFF method, then guest
> Linux OS makes the _OFF call during D3cold transition and
> then _ON during D0 transition. The hypervisor can tap these virtual
> ACPI calls and then do the low power related IOCTL.
>
> The entry device feature has two variants. These two variants are mainly
> to support the different behaviour for the low power entry.
> If there is any access for the VFIO device on the host side, then the
> device will be moved out of the low power state without the user's
> guest driver involvement. Some devices (for example NVIDIA VGA or
> 3D controller) require the user's guest driver involvement for
> each low-power entry. In the first variant, the host can move the
> device into low power without any guest driver involvement while
> in the second variant, the host will send a notification to user
> through eventfd and then user guest driver needs to move the device
> into low power. The hypervisor can implement the virtual PME
> support to notify the guest OS. Please refer
> https://lore.kernel.org/lkml/[email protected]/
> where initially this virtual PME was implemented in the vfio-pci driver
> itself, but later-on, it has been decided that hypervisor can implement
> this.
>
> * Changes in v7

Applied to vfio next branch for v6.1. Thanks,

Alex

2022-09-05 05:55:05

by Abhishek Sahu

Subject: Re: [PATCH v7 0/5] vfio/pci: power management changes

On 9/3/2022 12:12 AM, Alex Williamson wrote:
> On Mon, 29 Aug 2022 17:18:45 +0530
> Abhishek Sahu <[email protected]> wrote:
>
>> This is part 2 for the vfio-pci driver power management support.
>> Part 1 of this patch series was related to adding D3cold support
>> when there is no user of the VFIO device and has already merged in the
>> mainline kernel. If we enable the runtime power management for
>> vfio-pci device in the guest OS, then the device is being runtime
>> suspended (for linux guest OS) and the PCI device will be put into
>> D3hot state (in function vfio_pm_config_write()). If the D3cold
>> state can be used instead of D3hot, then it will help in saving
>> maximum power. The D3cold state can't be possible with native
>> PCI PM. It requires interaction with platform firmware which is
>> system-specific. To go into low power states (Including D3cold),
>> the runtime PM framework can be used which internally interacts
>> with PCI and platform firmware and puts the device into the
>> lowest possible D-States.
>>
>> This patch series adds the support to engage runtime power management
>> initiated by the user. Since D3cold state can't be achieved by writing
>> PCI standard PM config registers, so new device features have been
>> added in DEVICE_FEATURE IOCTL for low power entry and exit related
>> handling. For the PCI device, this low power state will be D3cold
>> (if the platform supports the D3cold state). The hypervisors can implement
>> virtual ACPI methods to make the integration with guest OS.
>> For example, in guest Linux OS if PCI device ACPI node has
>> _PR3 and _PR0 power resources with _ON/_OFF method, then guest
>> Linux OS makes the _OFF call during D3cold transition and
>> then _ON during D0 transition. The hypervisor can tap these virtual
>> ACPI calls and then do the low power related IOCTL.
>>
>> The entry device feature has two variants. These two variants are mainly
>> to support the different behaviour for the low power entry.
>> If there is any access for the VFIO device on the host side, then the
>> device will be moved out of the low power state without the user's
>> guest driver involvement. Some devices (for example NVIDIA VGA or
>> 3D controller) require the user's guest driver involvement for
>> each low-power entry. In the first variant, the host can move the
>> device into low power without any guest driver involvement while
>> in the second variant, the host will send a notification to user
>> through eventfd and then user guest driver needs to move the device
>> into low power. The hypervisor can implement the virtual PME
>> support to notify the guest OS. Please refer
>> https://lore.kernel.org/lkml/[email protected]/
>> where initially this virtual PME was implemented in the vfio-pci driver
>> itself, but later-on, it has been decided that hypervisor can implement
>> this.
>>
>> * Changes in v7
>
> Applied to vfio next branch for v6.1. Thanks,
>
> Alex
>

Thanks Alex for your guidance and support.

Regards,
Abhishek