2019-12-05 03:33:33

by Yan Zhao

Subject: [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

For SR-IOV devices, VFs are passed through to the guest directly without host
driver mediation. However, when migrating VMs with passed-through VFs,
dynamic host mediation is required to (1) get device state and (2) track
dirty pages. Since device state, as well as other critical information
required for dirty page tracking of VFs, is usually retrieved from the PF,
it is handy to provide an extension in the PF driver to centrally control
VF migration.

Therefore, in order to (1) pass through VFs at normal times, (2) dynamically
trap VF BARs for dirty page tracking and (3) centralize retrieval of
critical VF state and VF control into one driver, we propose to introduce
mediate ops on top of the current vfio-pci device driver.


                                     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
                                    |                                  |
 __________    register mediate ops |  ___________       ___________  |
|          |<-------------------------|    VF     |     |           | |
| vfio-pci |                        | |  mediate  |     | PF driver | |
|__________|------------------------->|  driver   |     |___________| |
     |           open(pdev)         |  -----------            |       |
     |                              |_ _ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _|
    \|/                                                      \|/
------------                                            ------------
|    VF    |                                            |    PF    |
------------                                            ------------


The VF mediate driver can be a standalone driver that does not bind to
any device (as in the demo code in patches 5-6), or it can be a built-in
extension of the PF driver (as in patches 7-9).

Rather than binding directly to the VF, the VF mediate driver registers a
mediate ops with vfio-pci at driver init. vfio-pci maintains a list of such
mediate ops.
(Note that a VF mediate driver can register its mediate ops with vfio-pci
before vfio-pci binds to any device, and a single VF mediate driver can
mediate multiple devices.)
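
As a rough illustration, a standalone mediate driver only needs something
like the following to register itself (a minimal sketch; the my_vf_* names
are hypothetical, and a real driver would also fill in the optional
.get_region_info/.rw/.mmap callbacks as the samples in patches 5-9 do):

#include <linux/module.h>
#include <linux/pci.h>
#include <linux/vfio.h>

static int my_vf_open(struct pci_dev *pdev, u64 *caps, u32 *handle)
{
        *handle = 0;    /* demo only: accept every device, no per-open state */
        return 0;
}

static void my_vf_release(int handle)
{
}

static struct vfio_pci_mediate_ops my_vf_mediate_ops = {
        .name    = "my_vf",
        .open    = my_vf_open,
        .release = my_vf_release,
};

static int __init my_vf_mediate_init(void)
{
        /* no device binding needed; just make the ops known to vfio-pci */
        return vfio_pci_register_mediate_ops(&my_vf_mediate_ops);
}

static void __exit my_vf_mediate_exit(void)
{
        vfio_pci_unregister_mediate_ops(&my_vf_mediate_ops);
}

module_init(my_vf_mediate_init);
module_exit(my_vf_mediate_exit);
MODULE_LICENSE("GPL v2");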

When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
list and calls each vfio_pci_mediate_ops->open() with the pdev of the opening
device as a parameter.
The VF mediate driver should return success or failure depending on whether
it supports the pdev or not, e.g. by comparing the devfns of the VFs it
supports with the devfn of the passed-in pdev.
Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it stops
querying other mediate ops and binds the opening device to this mediate
ops using the returned mediate handle.

Subsequent vfio-pci ops (the VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on
the VF are then intercepted into the VF mediate driver as
vfio_pci_mediate_ops->get_region_info(),
vfio_pci_mediate_ops->rw and
vfio_pci_mediate_ops->mmap, and can be customized there.
vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap additionally return
'pt' to indicate whether vfio-pci should still pass the access through to
hardware.
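
For example, a mediate driver that wants to trap only the first page of BAR0
while letting everything else go straight to hardware could implement ->rw
roughly like this (a sketch; the VFIO_PCI_OFFSET_* macros are private to
vfio-pci and are copied here the same way the sample drivers in patches 5-9
copy them, and the zero-filling read is demo-only behavior):

#include <linux/mm.h>
#include <linux/uaccess.h>
#include <linux/vfio.h>

/* helper macros copied from vfio-pci, as the samples do */
#define VFIO_PCI_OFFSET_SHIFT          40
#define VFIO_PCI_OFFSET_TO_INDEX(off)  ((off) >> VFIO_PCI_OFFSET_SHIFT)
#define VFIO_PCI_OFFSET_MASK           (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)

static ssize_t my_vf_rw(int handle, char __user *buf, size_t count,
                        loff_t *ppos, bool iswrite, bool *pt)
{
        unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
        u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;

        if (index == VFIO_PCI_BAR0_REGION_INDEX && pos < PAGE_SIZE) {
                /* trapped: emulate/track the access here, vfio-pci
                 * must not touch the hardware for it
                 */
                *pt = false;
                if (!iswrite && clear_user(buf, count))
                        return -EFAULT; /* demo only: reads return zeros */
                return count;
        }

        /* everything else: let vfio-pci pass the access through to hw */
        *pt = true;
        return 0;
}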

When vfio-pci closes the VF, it calls vfio_pci_mediate_ops->release()
with the mediate handle as a parameter.

The mediate handle returned from vfio_pci_mediate_ops->open() lets the VF
mediate driver differentiate two open VFs with the same device id and
vendor id.

When the VF mediate driver exits, it unregisters its mediate ops from
vfio-pci.


In this patchset, we enable vfio-pci to provide 3 things:
(1) call mediate ops to allow a vendor driver to customize the default
region info/rw/mmap of a region;
(2) provide a migration region to support migration;
(3) provide a dynamic trap bar info region to allow a vendor driver to
control trap/untrap of device PCI BARs.

The vfio-pci + mediate ops way differs from the mdev way in that
(1) the mdev way needs to create a 1:1 mdev device on top of each VF, and a
device-specific mdev parent driver is bound to the VF directly.
(2) the vfio-pci + mediate ops way does not create mdev devices, and the VF
mediate driver does not bind to VFs; instead, vfio-pci binds to VFs.

The reasons why we do not choose to write an mdev parent driver are:
(1) VFs are directly passed through almost all of the time. Binding directly
to vfio-pci lets most of the code be shared/reused. If we wrote a
vendor-specific mdev parent driver, most of the code (like the passthrough
style of rw/mmap) would still need to be copied from the vfio-pci driver,
which is duplicated and tedious work.
(2) For features like dynamically trapping/untrapping PCI BARs, if they live
in vfio-pci they are available to most people without repeated code copying
and re-testing.
(3) With a 1:1 mdev driver that passes VFs through most of the time, people
have to decide whether to bind a VF to vfio-pci or to the mdev parent driver
before a real migration need arises. If vfio-pci is bound initially, there
is no chance to do live migration when the need shows up later.

In this patchset,
- patches 1-4 enable vfio-pci to call mediate ops registered by a vendor
driver to mediate/customize region info/rw/mmap.

- patches 5-6 provide a standalone sample driver that registers a mediate ops
for Intel Graphics Devices. It does not bind to IGDs directly but decides
which devices it supports via its pciidlist. It also demonstrates how to
dynamically trap a device's PCI BARs. (By adding more pciids to its
pciidlist, this sample driver is not necessarily limited to supporting
IGDs.)

- patches 7-9 provide a sample in the i40e driver, which supports the Intel(R)
Ethernet Controller XL710 family of devices. It supports VF pre-copy live
migration on Intel's 710 SR-IOV. (We commented out the real implementation
of the dirty page tracking and device state retrieval parts to focus on
demonstrating the framework. We will send them out in future versions.)

patch 7 registers/unregisters the VF mediate ops when the PF driver
probes/removes. It identifies the VFs it supports via
vfio_pci_mediate_ops->open(pdev).

patch 8 reports the device cap VFIO_PCI_DEVICE_CAP_MIGRATION and
provides a sample implementation of the migration region.
The QEMU part of vfio migration is based on v8
https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
We did not base it on the more recent v9 because we think there are still
open issues in the dirty page tracking part of that series.

patch 9 reports the device cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
provides an example of how to trap part of BAR0 when migration starts
and pass this part of BAR0 through again when migration fails.

Yan Zhao (9):
vfio/pci: introduce mediate ops to intercept vfio-pci ops
vfio/pci: test existence before calling region->ops
vfio/pci: register a default migration region
vfio-pci: register default dynamic-trap-bar-info region
samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
i40e/vf_migration: register mediate_ops to vfio-pci
i40e/vf_migration: mediate migration region
i40e/vf_migration: support dynamic trap of bar0

drivers/net/ethernet/intel/Kconfig | 2 +-
drivers/net/ethernet/intel/i40e/Makefile | 3 +-
drivers/net/ethernet/intel/i40e/i40e.h | 2 +
drivers/net/ethernet/intel/i40e/i40e_main.c | 3 +
.../ethernet/intel/i40e/i40e_vf_migration.c | 626 ++++++++++++++++++
.../ethernet/intel/i40e/i40e_vf_migration.h | 78 +++
drivers/vfio/pci/vfio_pci.c | 189 +++++-
drivers/vfio/pci/vfio_pci_private.h | 2 +
include/linux/vfio.h | 18 +
include/uapi/linux/vfio.h | 160 +++++
samples/Kconfig | 6 +
samples/Makefile | 1 +
samples/vfio-pci/Makefile | 2 +
samples/vfio-pci/igd_dt.c | 367 ++++++++++
14 files changed, 1455 insertions(+), 4 deletions(-)
create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
create mode 100644 samples/vfio-pci/Makefile
create mode 100644 samples/vfio-pci/igd_dt.c

--
2.17.1


2019-12-05 03:34:45

by Yan Zhao

Subject: [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

When vfio-pci is bound to a physical device, almost all the hardware
resources are passed through.
Sometimes the vendor driver of this physical device may want to mediate some
hardware resource accesses for a short period of time, e.g. for dirty page
tracking during live migration.

Here we introduce mediate ops in vfio-pci for this purpose.

A vendor driver can register a mediate ops with vfio-pci.
Rather than binding directly to the passed-through device, the
vendor driver is now either a module that does not bind to any device or
a module that binds to another device.
E.g. when passing through a VF device that is bound to the vfio-pci module,
the PF driver that binds to the PF device can register with vfio-pci to
mediate the VF's regions, hence supporting VF live migration.

The sequence goes like this:
1. The vendor driver registers its vfio_pci_mediate_ops with the vfio-pci
driver.

2. vfio-pci maintains a list of those registered vfio_pci_mediate_ops.

3. Whenever vfio-pci opens a device, it searches the list and calls
vfio_pci_mediate_ops->open() to check whether a vendor driver supports
mediating this device (a minimal vendor-side open() sketch follows this
sequence).
Upon a successful return from vfio_pci_mediate_ops->open(), vfio-pci stops
searching the list and stores a mediate handle that represents this open
in the vendor driver.
(So if multiple vendor drivers support mediating a device through
vfio_pci_mediate_ops, only one wins, depending on their registration
order.)

4. Whenever a VFIO_DEVICE_GET_REGION_INFO ioctl is received in the vfio-pci
ops, it chains into vfio_pci_mediate_ops->get_region_info(), so that the
vendor driver is able to override a region's default flags and caps,
e.g. adding a sparse mmap cap to pass through only sub-regions of a whole
region.

5. vfio_pci_rw()/vfio_pci_mmap() first calls into
vfio_pci_mediate_ops->rw()/vfio_pci_mediate_ops->mmap().
If pt=true is returned, vfio_pci_rw()/vfio_pci_mmap() further passes this
read/write/mmap through to the physical device; otherwise it just returns
without touching the physical device.

6. When vfio-pci closes a device, vfio_pci_release() chains into
vfio_pci_mediate_ops->release() to close the reference in the vendor driver.

7. The vendor driver unregisters its vfio_pci_mediate_ops when the driver
exits.
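
As an illustration of step 3, a vendor driver's ->open() might look roughly
like the following (a sketch; the device id is only an example value, the
caps bit shown is the migration cap introduced later in this series, and
the real devfn matching for i40e VFs is in patch 7):

#include <linux/pci.h>
#include <linux/vfio.h>

static const struct pci_device_id my_vf_ids[] = {
        { PCI_DEVICE(0x8086, 0x154c) }, /* example: a VF device id we mediate */
        {}
};

static int my_vf_open(struct pci_dev *pdev, u64 *caps, u32 *handle)
{
        /* not one of ours: let vfio-pci try the next mediate ops */
        if (!pci_match_id(my_vf_ids, pdev))
                return -ENODEV;

        /* a real driver would allocate per-open state here and return
         * its index, so ->rw/->mmap/->release can find it later
         */
        *handle = 0;

        /* advertise what we mediate (cap defined in a later patch) */
        *caps |= VFIO_PCI_DEVICE_CAP_MIGRATION;

        return 0;
}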

Cc: Kevin Tian <[email protected]>

Signed-off-by: Yan Zhao <[email protected]>
---
drivers/vfio/pci/vfio_pci.c | 146 ++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_private.h | 2 +
include/linux/vfio.h | 16 +++
3 files changed, 164 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 02206162eaa9..55080ff29495 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -54,6 +54,14 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
MODULE_PARM_DESC(disable_idle_d3,
"Disable using the PCI D3 low power state for idle, unused devices");

+static LIST_HEAD(mediate_ops_list);
+static DEFINE_MUTEX(mediate_ops_list_lock);
+struct vfio_pci_mediate_ops_list_entry {
+ struct vfio_pci_mediate_ops *ops;
+ int refcnt;
+ struct list_head next;
+};
+
static inline bool vfio_vga_disabled(void)
{
#ifdef CONFIG_VFIO_PCI_VGA
@@ -472,6 +480,10 @@ static void vfio_pci_release(void *device_data)
if (!(--vdev->refcnt)) {
vfio_spapr_pci_eeh_release(vdev->pdev);
vfio_pci_disable(vdev);
+ if (vdev->mediate_ops && vdev->mediate_ops->release) {
+ vdev->mediate_ops->release(vdev->mediate_handle);
+ vdev->mediate_ops = NULL;
+ }
}

mutex_unlock(&vdev->reflck->lock);
@@ -483,6 +495,7 @@ static int vfio_pci_open(void *device_data)
{
struct vfio_pci_device *vdev = device_data;
int ret = 0;
+ struct vfio_pci_mediate_ops_list_entry *mentry;

if (!try_module_get(THIS_MODULE))
return -ENODEV;
@@ -495,6 +508,30 @@ static int vfio_pci_open(void *device_data)
goto error;

vfio_spapr_pci_eeh_open(vdev->pdev);
+ mutex_lock(&mediate_ops_list_lock);
+ list_for_each_entry(mentry, &mediate_ops_list, next) {
+ u64 caps;
+ u32 handle;
+
+ memset(&caps, 0, sizeof(caps));
+ ret = mentry->ops->open(vdev->pdev, &caps, &handle);
+ if (!ret) {
+ vdev->mediate_ops = mentry->ops;
+ vdev->mediate_handle = handle;
+
+ pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for %x:%x\n",
+ vdev->mediate_ops->name, caps,
+ handle, vdev->pdev->vendor,
+ vdev->pdev->device);
+ /*
+ * only find the first matching mediate_ops,
+ * and add its refcnt
+ */
+ mentry->refcnt++;
+ break;
+ }
+ }
+ mutex_unlock(&mediate_ops_list_lock);
}
vdev->refcnt++;
error:
@@ -736,6 +773,14 @@ static long vfio_pci_ioctl(void *device_data,
info.size = pdev->cfg_size;
info.flags = VFIO_REGION_INFO_FLAG_READ |
VFIO_REGION_INFO_FLAG_WRITE;
+
+ if (vdev->mediate_ops &&
+ vdev->mediate_ops->get_region_info) {
+ vdev->mediate_ops->get_region_info(
+ vdev->mediate_handle,
+ &info, &caps, NULL);
+ }
+
break;
case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
@@ -756,6 +801,13 @@ static long vfio_pci_ioctl(void *device_data,
}
}

+ if (vdev->mediate_ops &&
+ vdev->mediate_ops->get_region_info) {
+ vdev->mediate_ops->get_region_info(
+ vdev->mediate_handle,
+ &info, &caps, NULL);
+ }
+
break;
case VFIO_PCI_ROM_REGION_INDEX:
{
@@ -794,6 +846,14 @@ static long vfio_pci_ioctl(void *device_data,
}

pci_write_config_word(pdev, PCI_COMMAND, orig_cmd);
+
+ if (vdev->mediate_ops &&
+ vdev->mediate_ops->get_region_info) {
+ vdev->mediate_ops->get_region_info(
+ vdev->mediate_handle,
+ &info, &caps, NULL);
+ }
+
break;
}
case VFIO_PCI_VGA_REGION_INDEX:
@@ -805,6 +865,13 @@ static long vfio_pci_ioctl(void *device_data,
info.flags = VFIO_REGION_INFO_FLAG_READ |
VFIO_REGION_INFO_FLAG_WRITE;

+ if (vdev->mediate_ops &&
+ vdev->mediate_ops->get_region_info) {
+ vdev->mediate_ops->get_region_info(
+ vdev->mediate_handle,
+ &info, &caps, NULL);
+ }
+
break;
default:
{
@@ -839,6 +906,13 @@ static long vfio_pci_ioctl(void *device_data,
if (ret)
return ret;
}
+
+ if (vdev->mediate_ops &&
+ vdev->mediate_ops->get_region_info) {
+ vdev->mediate_ops->get_region_info(
+ vdev->mediate_handle,
+ &info, &caps, &cap_type);
+ }
}
}

@@ -1151,6 +1225,16 @@ static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
return -EINVAL;

+ if (vdev->mediate_ops && vdev->mediate_ops->rw) {
+ int ret;
+ bool pt = true;
+
+ ret = vdev->mediate_ops->rw(vdev->mediate_handle,
+ buf, count, ppos, iswrite, &pt);
+ if (!pt)
+ return ret;
+ }
+
switch (index) {
case VFIO_PCI_CONFIG_REGION_INDEX:
return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
@@ -1200,6 +1284,15 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
u64 phys_len, req_len, pgoff, req_start;
int ret;

+ if (vdev->mediate_ops && vdev->mediate_ops->mmap) {
+ int ret;
+ bool pt = true;
+
+ ret = vdev->mediate_ops->mmap(vdev->mediate_handle, vma, &pt);
+ if (!pt)
+ return ret;
+ }
+
index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);

if (vma->vm_end < vma->vm_start)
@@ -1629,8 +1722,17 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)

static void __exit vfio_pci_cleanup(void)
{
+ struct vfio_pci_mediate_ops_list_entry *mentry, *n;
+
pci_unregister_driver(&vfio_pci_driver);
vfio_pci_uninit_perm_bits();
+
+ mutex_lock(&mediate_ops_list_lock);
+ list_for_each_entry_safe(mentry, n, &mediate_ops_list, next) {
+ list_del(&mentry->next);
+ kfree(mentry);
+ }
+ mutex_unlock(&mediate_ops_list_lock);
}

static void __init vfio_pci_fill_ids(void)
@@ -1697,6 +1799,50 @@ static int __init vfio_pci_init(void)
return ret;
}

+int vfio_pci_register_mediate_ops(struct vfio_pci_mediate_ops *ops)
+{
+ struct vfio_pci_mediate_ops_list_entry *mentry;
+
+ mutex_lock(&mediate_ops_list_lock);
+ mentry = kzalloc(sizeof(*mentry), GFP_KERNEL);
+ if (!mentry) {
+ mutex_unlock(&mediate_ops_list_lock);
+ return -ENOMEM;
+ }
+
+ mentry->ops = ops;
+ mentry->refcnt = 0;
+ list_add(&mentry->next, &mediate_ops_list);
+
+ pr_info("registered dm ops %s\n", ops->name);
+ mutex_unlock(&mediate_ops_list_lock);
+
+ return 0;
+}
+EXPORT_SYMBOL(vfio_pci_register_mediate_ops);
+
+void vfio_pci_unregister_mediate_ops(struct vfio_pci_mediate_ops *ops)
+{
+ struct vfio_pci_mediate_ops_list_entry *mentry, *n;
+
+ mutex_lock(&mediate_ops_list_lock);
+ list_for_each_entry_safe(mentry, n, &mediate_ops_list, next) {
+ if (mentry->ops != ops)
+ continue;
+
+ mentry->refcnt--;
+ if (!mentry->refcnt) {
+ list_del(&mentry->next);
+ kfree(mentry);
+ } else
+ pr_err("vfio_pci unregister mediate ops %s error\n",
+ mentry->ops->name);
+ }
+ mutex_unlock(&mediate_ops_list_lock);
+
+}
+EXPORT_SYMBOL(vfio_pci_unregister_mediate_ops);
+
module_init(vfio_pci_init);
module_exit(vfio_pci_cleanup);

diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index ee6ee91718a4..bad4a254360e 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -122,6 +122,8 @@ struct vfio_pci_device {
struct list_head dummy_resources_list;
struct mutex ioeventfds_lock;
struct list_head ioeventfds_list;
+ struct vfio_pci_mediate_ops *mediate_ops;
+ u32 mediate_handle;
};

#define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index e42a711a2800..0265e779acd1 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -195,4 +195,20 @@ extern int vfio_virqfd_enable(void *opaque,
void *data, struct virqfd **pvirqfd, int fd);
extern void vfio_virqfd_disable(struct virqfd **pvirqfd);

+struct vfio_pci_mediate_ops {
+ char *name;
+ int (*open)(struct pci_dev *pdev, u64 *caps, u32 *handle);
+ void (*release)(int handle);
+ void (*get_region_info)(int handle,
+ struct vfio_region_info *info,
+ struct vfio_info_cap *caps,
+ struct vfio_region_info_cap_type *cap_type);
+ ssize_t (*rw)(int handle, char __user *buf,
+ size_t count, loff_t *ppos, bool iswrite, bool *pt);
+ int (*mmap)(int handle, struct vm_area_struct *vma, bool *pt);
+
+};
+extern int vfio_pci_register_mediate_ops(struct vfio_pci_mediate_ops *ops);
+extern void vfio_pci_unregister_mediate_ops(struct vfio_pci_mediate_ops *ops);
+
#endif /* VFIO_H */
--
2.17.1

2019-12-05 03:35:26

by Yan Zhao

Subject: [RFC PATCH 2/9] vfio/pci: test existence before calling region->ops

For regions registered through vfio_pci_register_dev_region(),
first check that region->ops is not null before calling into it.

Since the next two patches register dev regions with a null region->ops
by default on behalf of the vendor driver, we need to check here to
prevent a null pointer access if the vendor driver forgets to handle
those dev regions.

Cc: Kevin Tian <[email protected]>

Signed-off-by: Yan Zhao <[email protected]>
---
drivers/vfio/pci/vfio_pci.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 55080ff29495..f3730252ee82 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -398,8 +398,12 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)

vdev->virq_disabled = false;

- for (i = 0; i < vdev->num_regions; i++)
+ for (i = 0; i < vdev->num_regions; i++) {
+ if (!vdev->region[i].ops || !vdev->region[i].ops->release)
+ continue;
+
vdev->region[i].ops->release(vdev, &vdev->region[i]);
+ }

vdev->num_regions = 0;
kfree(vdev->region);
@@ -900,7 +904,8 @@ static long vfio_pci_ioctl(void *device_data,
if (ret)
return ret;

- if (vdev->region[i].ops->add_capability) {
+ if (vdev->region[i].ops &&
+ vdev->region[i].ops->add_capability) {
ret = vdev->region[i].ops->add_capability(vdev,
&vdev->region[i], &caps);
if (ret)
@@ -1251,6 +1256,9 @@ static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
default:
index -= VFIO_PCI_NUM_REGIONS;
+ if (!vdev->region[index].ops || !vdev->region[index].ops->rw)
+ return -EINVAL;
+
return vdev->region[index].ops->rw(vdev, buf,
count, ppos, iswrite);
}
--
2.17.1

2019-12-05 03:36:02

by Yan Zhao

Subject: [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

The dynamic trap bar info region is a channel for QEMU and the vendor driver
to communicate dynamic trap info. It is of type
VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.

This region has two fields: dt_fd and trap.
When QEMU detects a device region of this type, it creates an
eventfd and writes its eventfd id to the dt_fd field.
When the vendor driver signals this eventfd, QEMU reads the trap field of
this info region.
- If trap is true, QEMU searches the device's PCI BAR
regions and disables all the sparse-mmapped subregions (if the sparse-mmapped
subregion is disablable).
- If trap is false, QEMU re-enables those subregions.

A typical usage is (see the sketch after these steps):
1. The vendor driver first cuts its BAR 0 into several sections, all in a
sparse mmap array, so that initially all of BAR 0 is passed through.
2. The vendor driver marks part of the BAR 0 sections as disablable.
3. When migration starts, the vendor driver signals dt_fd and sets trap to
true to notify QEMU to disable the BAR 0 sections whose disablable flag is
on.
4. QEMU disables those BAR 0 sections, which lets the vendor driver trap
accesses to BAR 0 registers and makes dirty page tracking possible.
5. On migration failure, the vendor driver signals dt_fd to QEMU again.
QEMU reads the trap field of this info region, which is now false, and
re-passes the whole BAR 0 region through.
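
On the user-space side, the flow above could be handled roughly like this
(a hedged sketch, not actual QEMU code; it assumes the uapi header from this
patch, device_fd is the VFIO device fd, and dt_off is the offset of the
dynamic-trap info region obtained from VFIO_DEVICE_GET_REGION_INFO):

#include <stdint.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <linux/vfio.h>

#define DT_FIELD(f) offsetof(struct vfio_device_dt_bar_info_region, f)

static void handle_dt_signals(int device_fd, off_t dt_off)
{
        uint32_t dt_fd, trap;
        uint64_t cnt;
        int efd = eventfd(0, 0);

        /* tell the vendor driver which eventfd to signal */
        dt_fd = efd;
        pwrite(device_fd, &dt_fd, sizeof(dt_fd), dt_off + DT_FIELD(dt_fd));

        for (;;) {
                /* wait until the vendor driver signals the eventfd */
                read(efd, &cnt, sizeof(cnt));
                pread(device_fd, &trap, sizeof(trap), dt_off + DT_FIELD(trap));
                if (trap) {
                        /* disable (munmap) the "disablable" sparse mmap
                         * areas of the PCI BARs so accesses get trapped
                         */
                } else {
                        /* re-mmap those areas so they are passed
                         * through to hardware again
                         */
                }
        }
}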

The vendor driver specifies whether it supports the dynamic-trap-bar-info
region through the cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
vfio_pci_mediate_ops->open().

If vfio-pci detects this cap, it creates a default
dynamic_trap_bar_info region on behalf of the vendor driver with region len=0
and region->ops=null.
The vendor driver should override this region's len, flags, rw and mmap in
its vfio_pci_mediate_ops.

Cc: Kevin Tian <[email protected]>

Signed-off-by: Yan Zhao <[email protected]>
---
drivers/vfio/pci/vfio_pci.c | 16 ++++++++++++++++
include/linux/vfio.h | 3 ++-
include/uapi/linux/vfio.h | 11 +++++++++++
3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 059660328be2..62b811ca43e4 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -127,6 +127,19 @@ void init_migration_region(struct vfio_pci_device *vdev)
NULL);
}

+/**
+ * register a region to hold info for dynamically trap bar regions
+ */
+void init_dynamic_trap_bar_info_region(struct vfio_pci_device *vdev)
+{
+ vfio_pci_register_dev_region(vdev,
+ VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO,
+ VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO,
+ NULL, 0,
+ VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
+ NULL);
+}
+
static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
{
struct resource *res;
@@ -538,6 +551,9 @@ static int vfio_pci_open(void *device_data)
if (caps & VFIO_PCI_DEVICE_CAP_MIGRATION)
init_migration_region(vdev);

+ if (caps & VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR)
+ init_dynamic_trap_bar_info_region(vdev);
+
pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for %x:%x\n",
vdev->mediate_ops->name, caps,
handle, vdev->pdev->vendor,
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index cddea8e9dcb2..cf8ecf687bee 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -197,7 +197,8 @@ extern void vfio_virqfd_disable(struct virqfd **pvirqfd);

struct vfio_pci_mediate_ops {
char *name;
-#define VFIO_PCI_DEVICE_CAP_MIGRATION (0x01)
+#define VFIO_PCI_DEVICE_CAP_MIGRATION (0x01)
+#define VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR (0x02)
int (*open)(struct pci_dev *pdev, u64 *caps, u32 *handle);
void (*release)(int handle);
void (*get_region_info)(int handle,
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index caf8845a67a6..74a2d0b57741 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -258,6 +258,9 @@ struct vfio_region_info {
struct vfio_region_sparse_mmap_area {
__u64 offset; /* Offset of mmap'able area within region */
__u64 size; /* Size of mmap'able area */
+ __u32 disablable; /* whether this mmap'able area is able to
+ * be dynamically disabled
+ */
};

struct vfio_region_info_cap_sparse_mmap {
@@ -454,6 +457,14 @@ struct vfio_device_migration_info {
#define VFIO_DEVICE_DIRTY_PFNS_ALL (~0ULL)
} __attribute__((packed));

+/* Region type and sub-type to hold info to dynamically trap bars */
+#define VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO (4)
+#define VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO (1)
+
+struct vfio_device_dt_bar_info_region {
+ __u32 dt_fd; /* fd of eventfd to notify qemu trap/untrap bars*/
+ __u32 trap; /* trap/untrap bar regions */
+};

/* sub-types for VFIO_REGION_TYPE_PCI_* */

--
2.17.1

2019-12-05 03:36:32

by Yan Zhao

Subject: [RFC PATCH 5/9] samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD

This is a sample driver that uses mediate ops for passthrough IGDs.

This sample driver does not bind directly to the IGD device but defines which
IGD devices it supports via a pciidlist.

It registers its vfio_pci_mediate_ops with vfio-pci when the driver loads.

When vfio_pci->open() calls vfio_pci_mediate_ops->open(), this driver checks
the vendor id and device id of the pdev passed in. If they match an entry in
pciidlist, success is returned; otherwise, failure is returned.

After a successful vfio_pci_mediate_ops->open(), vfio-pci further calls the
.get_region_info/.rw/.mmap interfaces with the mediate handle for each
region, and region accesses therefore get mediated/customized.

When vfio-pci->release() is called on the IGD, it first calls
vfio_pci_mediate_ops->release() with the mediate handle to close the
opened IGD device instance in this sample driver.

This sample driver unregisters its vfio_pci_mediate_ops when the driver
exits.

Cc: Kevin Tian <[email protected]>

Signed-off-by: Yan Zhao <[email protected]>
---
samples/Kconfig | 6 ++
samples/Makefile | 1 +
samples/vfio-pci/Makefile | 2 +
samples/vfio-pci/igd_dt.c | 191 ++++++++++++++++++++++++++++++++++++++
4 files changed, 200 insertions(+)
create mode 100644 samples/vfio-pci/Makefile
create mode 100644 samples/vfio-pci/igd_dt.c

diff --git a/samples/Kconfig b/samples/Kconfig
index c8dacb4dda80..2da42a725c03 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -169,4 +169,10 @@ config SAMPLE_VFS
as mount API and statx(). Note that this is restricted to the x86
arch whilst it accesses system calls that aren't yet in all arches.

+config SAMPLE_VFIO_PCI_IGD_DT
+ tristate "Build example driver to dynamicaly trap a passthroughed device bound to VFIO-PCI -- loadable modules only"
+ depends on VFIO_PCI && m
+ help
+ Build a sample driver to show how to dynamically trap a passed-through device that is bound to VFIO-PCI
+
endif # SAMPLES
diff --git a/samples/Makefile b/samples/Makefile
index 7d6e4ca28d69..f0f422e7dd11 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -18,5 +18,6 @@ subdir-$(CONFIG_SAMPLE_SECCOMP) += seccomp
obj-$(CONFIG_SAMPLE_TRACE_EVENTS) += trace_events/
obj-$(CONFIG_SAMPLE_TRACE_PRINTK) += trace_printk/
obj-$(CONFIG_VIDEO_PCI_SKELETON) += v4l/
+obj-$(CONFIG_SAMPLE_VFIO_PCI_IGD_DT) += vfio-pci/
obj-y += vfio-mdev/
subdir-$(CONFIG_SAMPLE_VFS) += vfs
diff --git a/samples/vfio-pci/Makefile b/samples/vfio-pci/Makefile
new file mode 100644
index 000000000000..4b8acc145d65
--- /dev/null
+++ b/samples/vfio-pci/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_SAMPLE_VFIO_PCI_IGD_DT) += igd_dt.o
diff --git a/samples/vfio-pci/igd_dt.c b/samples/vfio-pci/igd_dt.c
new file mode 100644
index 000000000000..857e8d01b0d1
--- /dev/null
+++ b/samples/vfio-pci/igd_dt.c
@@ -0,0 +1,191 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Dynamically trap an IGD device that is bound to the vfio-pci device driver
+ * Copyright(c) 2019 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/vfio.h>
+#include <linux/sysfs.h>
+#include <linux/file.h>
+#include <linux/pci.h>
+#include <linux/eventfd.h>
+
+#define VERSION_STRING "0.1"
+#define DRIVER_AUTHOR "Intel Corporation"
+
+/* helper macros copied from vfio-pci */
+#define VFIO_PCI_OFFSET_SHIFT 40
+#define VFIO_PCI_OFFSET_TO_INDEX(off) ((off) >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+/* This driver supports opening at most 256 devices */
+#define MAX_OPEN_DEVICE 256
+
+/*
+ * Below are the pciids of two IGD devices supported by this driver.
+ * It is only for demo purposes.
+ * You can add more device ids to this list to support any PCI device
+ * whose PCI BARs you want to dynamically trap.
+ */
+static const struct pci_device_id pciidlist[] = {
+ {0x8086, 0x5927, ~0, ~0, 0x30000, 0xff0000, 0},
+ {0x8086, 0x193b, ~0, ~0, 0x30000, 0xff0000, 0},
+};
+
+static long igd_device_bits[MAX_OPEN_DEVICE/BITS_PER_LONG + 1];
+static DEFINE_MUTEX(device_bit_lock);
+
+struct igd_dt_device {
+ __u32 vendor;
+ __u32 device;
+ __u32 handle;
+};
+
+static struct igd_dt_device *igd_device_array[MAX_OPEN_DEVICE];
+
+int igd_dt_open(struct pci_dev *pdev, u64 *caps, u32 *mediate_handle)
+{
+ int supported_dev_cnt = sizeof(pciidlist)/sizeof(struct pci_device_id);
+ int i, ret = 0;
+ struct igd_dt_device *igd_device;
+ int handle;
+
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
+ for (i = 0; i < supported_dev_cnt; i++) {
+ if (pciidlist[i].vendor == pdev->vendor &&
+ pciidlist[i].device == pdev->device)
+ goto support;
+ }
+
+ module_put(THIS_MODULE);
+ return -ENODEV;
+
+support:
+ mutex_lock(&device_bit_lock);
+ handle = find_next_zero_bit(igd_device_bits, MAX_OPEN_DEVICE, 0);
+ if (handle >= MAX_OPEN_DEVICE) {
+ ret = -EBUSY;
+ goto error;
+ }
+
+ igd_device = kzalloc(sizeof(*igd_device), GFP_KERNEL);
+
+ if (!igd_device) {
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ igd_device->vendor = pdev->vendor;
+ igd_device->device = pdev->device;
+ igd_device->handle = handle;
+ igd_device_array[handle] = igd_device;
+ set_bit(handle, igd_device_bits);
+
+ pr_info("%s open device %x %x, handle=%x\n", __func__,
+ pdev->vendor, pdev->device, handle);
+
+ *mediate_handle = handle;
+
+error:
+ mutex_unlock(&device_bit_lock);
+ if (ret < 0)
+ module_put(THIS_MODULE);
+ return ret;
+}
+
+void igd_dt_release(int handle)
+{
+ struct igd_dt_device *igd_device;
+
+ mutex_lock(&device_bit_lock);
+
+ if (handle >= MAX_OPEN_DEVICE || !igd_device_array[handle] ||
+ !test_bit(handle, igd_device_bits)) {
+ pr_err("handle mismatch, please check interaction with vfio-pci module\n");
+ mutex_unlock(&device_bit_lock);
+ return;
+ }
+
+ igd_device = igd_device_array[handle];
+ igd_device_array[handle] = NULL;
+ clear_bit(handle, igd_device_bits);
+ mutex_unlock(&device_bit_lock);
+
+ pr_info("release: handle=%d, igd_device VID DID =%x %x\n",
+ handle, igd_device->vendor, igd_device->device);
+
+
+ kfree(igd_device);
+ module_put(THIS_MODULE);
+
+}
+
+static void igd_dt_get_region_info(int handle,
+ struct vfio_region_info *info,
+ struct vfio_info_cap *caps,
+ struct vfio_region_info_cap_type *cap_type)
+{
+}
+
+static ssize_t igd_dt_rw(int handle, char __user *buf,
+ size_t count, loff_t *ppos,
+ bool iswrite, bool *pt)
+{
+ *pt = true;
+
+ return 0;
+}
+
+static int igd_dt_mmap(int handle, struct vm_area_struct *vma, bool *pt)
+{
+ *pt = true;
+
+ return 0;
+}
+
+
+static struct vfio_pci_mediate_ops igd_dt_ops = {
+ .name = "IGD dt",
+ .open = igd_dt_open,
+ .release = igd_dt_release,
+ .get_region_info = igd_dt_get_region_info,
+ .rw = igd_dt_rw,
+ .mmap = igd_dt_mmap,
+};
+
+
+static int __init igd_dt_init(void)
+{
+ int ret = 0;
+
+ pr_info("igd_dt: %s\n", __func__);
+
+ memset(igd_device_bits, 0, sizeof(igd_device_bits));
+ memset(igd_device_array, 0, sizeof(igd_device_array));
+ vfio_pci_register_mediate_ops(&igd_dt_ops);
+ return ret;
+}
+
+static void __exit igd_dt_exit(void)
+{
+ pr_info("igd_dt: Unloaded!\n");
+ vfio_pci_unregister_mediate_ops(&igd_dt_ops);
+}
+
+module_init(igd_dt_init)
+module_exit(igd_dt_exit)
+
+MODULE_LICENSE("GPL v2");
+MODULE_INFO(supported, "Sample driver that dynamically traps a passed-through IGD bound to vfio-pci");
+MODULE_VERSION(VERSION_STRING);
+MODULE_AUTHOR(DRIVER_AUTHOR);
--
2.17.1

2019-12-05 03:36:42

by Yan Zhao

Subject: [RFC PATCH 3/9] vfio/pci: register a default migration region

The vendor driver specifies whether it supports a migration region through
the cap VFIO_PCI_DEVICE_CAP_MIGRATION in vfio_pci_mediate_ops->open().

If vfio-pci detects this cap, it creates a default migration region on
behalf of the vendor driver with region len=0 and region->ops=null.
The vendor driver should override this region's len, flags, rw and mmap in
its vfio_pci_mediate_ops.

This migration region definition is aligned to QEMU vfio migration code v8:
(https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html)
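
For reference, the pre-copy save loop this region is designed for
(documented in the uapi comments below) looks roughly like this from user
space (a hedged sketch; device_fd is the VFIO device fd and region_off is
the offset of the migration region obtained from
VFIO_DEVICE_GET_REGION_INFO):

#include <stdint.h>
#include <stddef.h>
#include <unistd.h>
#include <linux/vfio.h>

#define MIG_FIELD(f) offsetof(struct vfio_device_migration_info, f)

static void save_device_precopy(int device_fd, off_t region_off)
{
        uint64_t pending, data_offset, data_size;

        pread(device_fd, &pending, sizeof(pending),
              region_off + MIG_FIELD(pending_bytes));
        while (pending > 0) {
                /* ask the vendor driver where the next chunk lives */
                pread(device_fd, &data_offset, sizeof(data_offset),
                      region_off + MIG_FIELD(data_offset));
                pread(device_fd, &data_size, sizeof(data_size),
                      region_off + MIG_FIELD(data_size));

                /* read data_size bytes of opaque device state at
                 * region_off + data_offset and append them to the
                 * migration stream (omitted here)
                 */

                pread(device_fd, &pending, sizeof(pending),
                      region_off + MIG_FIELD(pending_bytes));
        }
}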

Cc: Kevin Tian <[email protected]>

Signed-off-by: Yan Zhao <[email protected]>
---
drivers/vfio/pci/vfio_pci.c | 15 ++++
include/linux/vfio.h | 1 +
include/uapi/linux/vfio.h | 149 ++++++++++++++++++++++++++++++++++++
3 files changed, 165 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index f3730252ee82..059660328be2 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -115,6 +115,18 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
}

+/**
+ * init a region to hold migration ctl & data
+ */
+void init_migration_region(struct vfio_pci_device *vdev)
+{
+ vfio_pci_register_dev_region(vdev, VFIO_REGION_TYPE_MIGRATION,
+ VFIO_REGION_SUBTYPE_MIGRATION,
+ NULL, 0,
+ VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
+ NULL);
+}
+
static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
{
struct resource *res;
@@ -523,6 +535,9 @@ static int vfio_pci_open(void *device_data)
vdev->mediate_ops = mentry->ops;
vdev->mediate_handle = handle;

+ if (caps & VFIO_PCI_DEVICE_CAP_MIGRATION)
+ init_migration_region(vdev);
+
pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for %x:%x\n",
vdev->mediate_ops->name, caps,
handle, vdev->pdev->vendor,
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0265e779acd1..cddea8e9dcb2 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -197,6 +197,7 @@ extern void vfio_virqfd_disable(struct virqfd **pvirqfd);

struct vfio_pci_mediate_ops {
char *name;
+#define VFIO_PCI_DEVICE_CAP_MIGRATION (0x01)
int (*open)(struct pci_dev *pdev, u64 *caps, u32 *handle);
void (*release)(int handle);
void (*get_region_info)(int handle,
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9e843a147ead..caf8845a67a6 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -306,6 +306,155 @@ struct vfio_region_info_cap_type {
#define VFIO_REGION_TYPE_GFX (1)
#define VFIO_REGION_TYPE_CCW (2)

+/* Migration region type and sub-type */
+#define VFIO_REGION_TYPE_MIGRATION (3)
+#define VFIO_REGION_SUBTYPE_MIGRATION (1)
+
+/**
+ * Structure vfio_device_migration_info is placed at 0th offset of
+ * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
+ * information. Field accesses from this structure are only supported at their
+ * native width and alignment, otherwise the result is undefined and vendor
+ * drivers should return an error.
+ *
+ * device_state: (read/write)
+ * Indicates to the vendor driver the state the VFIO device should be
+ * transitioned to. If the device state transition fails, a write to this
+ * field returns an error.
+ * It consists of 3 bits:
+ * - If bit 0 is set, it indicates the _RUNNING state. When it is reset, that
+ * indicates the _STOPPED state. When the device is changed to _STOPPED, the
+ * driver should stop the device before write() returns.
+ * - If bit 1 set, indicates _SAVING state.
+ * - If bit 2 set, indicates _RESUMING state.
+ * Bits 3 - 31 are reserved for future use. User should perform
+ * read-modify-write operation on this field.
+ * _SAVING and _RESUMING bits set at the same time is invalid state.
+ *
+ * pending bytes: (read only)
+ * Number of pending bytes yet to be migrated from vendor driver
+ *
+ * data_offset: (read only)
+ * User application should read data_offset in migration region from where
+ * user application should read device data during _SAVING state or write
+ * device data during _RESUMING state or read dirty pages bitmap. See below
+ * for detail of sequence to be followed.
+ *
+ * data_size: (read/write)
+ * User application should read data_size to get size of data copied in
+ * migration region during _SAVING state and write size of data copied in
+ * migration region during _RESUMING state.
+ *
+ * start_pfn: (write only)
+ * Start address pfn to get bitmap of dirty pages from vendor driver during
+ * _SAVING state.
+ *
+ * page_size: (write only)
+ * User application should write the page_size of pfn.
+ *
+ * total_pfns: (write only)
+ * Total pfn count from start_pfn for which dirty bitmap is requested.
+ *
+ * copied_pfns: (read only)
+ * pfn count for which dirty bitmap is copied to migration region.
+ * Vendor driver should copy the bitmap with bits set only for pages to be
+ * marked dirty in migration region.
+ * - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if none of the
+ * pages are dirty in requested range or rest of the range.
+ * - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark all
+ * pages dirty in the given range or rest of the range.
+ * - Vendor driver should return pfn count for which bitmap is written in
+ * the region.
+ *
+ * Migration region looks like:
+ * ------------------------------------------------------------------
+ * |vfio_device_migration_info| data section |
+ * | | /////////////////////////////// |
+ * ------------------------------------------------------------------
+ * ^ ^ ^
+ * offset 0-trapped part data_offset data_size
+ *
+ * The data section always follows the vfio_device_migration_info structure
+ * in the region, so data_offset will always be non-0. The offset from which
+ * data is copied is decided by the kernel driver; the data section can be
+ * trapped, mapped or partitioned, depending on how the kernel driver defines
+ * the data section. A data section partition can be defined as mapped by the
+ * sparse mmap capability.
+ * If mmapped, data_offset should be page aligned, whereas the initial section
+ * which contains the vfio_device_migration_info structure might not end at a
+ * page-aligned offset.
+ * Data_offset can be same or different for device data and dirty pages bitmap.
+ * Vendor driver should decide whether to partition data section and how to
+ * partition the data section. Vendor driver should return data_offset
+ * accordingly.
+ *
+ * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
+ * and for _SAVING device state or stop-and-copy phase:
+ * a. read pending_bytes. If pending_bytes > 0, go through below steps.
+ * b. read data_offset, indicates kernel driver to write data to staging buffer.
+ * c. read data_size, amount of data in bytes written by vendor driver in
+ * migration region.
+ * d. read data_size bytes of data from data_offset in the migration region.
+ * e. process data.
+ * f. Loop through a to e.
+ *
+ * To copy system memory content during migration, vendor driver should be able
+ * to report system memory pages which are dirtied by that driver. For such
+ * dirty page reporting, user application should query for a range of GFNs
+ * relative to device address space (IOVA), then vendor driver should provide
+ * the bitmap of pages from this range which are dirtied by him through
+ * migration region where each bit represents a page and bit set to 1 represents
+ * that the page is dirty.
+ * User space application should take care of copying content of system memory
+ * for those pages.
+ *
+ * Steps to get dirty page bitmap:
+ * a. write start_pfn, page_size and total_pfns.
+ * b. read copied_pfns. Vendor driver should take one of the below action:
+ * - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if driver
+ * doesn't have any page to report dirty in given range or rest of the
+ * range. Exit the loop.
+ * - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark all
+ * pages dirty for given range or rest of the range. User space
+ * application mark all pages in the range as dirty and exit the loop.
+ * - Vendor driver should return copied_pfns and provide bitmap for
+ * copied_pfn in migration region.
+ * c. read data_offset, where vendor driver has written bitmap.
+ * d. read bitmap from the migration region from data_offset.
+ * e. Iterate through steps a to d while (total copied_pfns < total_pfns)
+ *
+ * Sequence to be followed while _RESUMING device state:
+ * While data for this device is available, repeat below steps:
+ * a. read data_offset from where user application should write data.
+ * b. write data of data_size to migration region from data_offset.
+ * c. write data_size which indicates vendor driver that data is written in
+ * staging buffer.
+ *
+ * For user application, data is opaque. User should write data in the same
+ * order as received.
+ */
+
+struct vfio_device_migration_info {
+ __u32 device_state; /* VFIO device state */
+#define VFIO_DEVICE_STATE_RUNNING (1 << 0)
+#define VFIO_DEVICE_STATE_SAVING (1 << 1)
+#define VFIO_DEVICE_STATE_RESUMING (1 << 2)
+#define VFIO_DEVICE_STATE_MASK (VFIO_DEVICE_STATE_RUNNING | \
+ VFIO_DEVICE_STATE_SAVING | \
+ VFIO_DEVICE_STATE_RESUMING)
+#define VFIO_DEVICE_STATE_INVALID (VFIO_DEVICE_STATE_SAVING | \
+ VFIO_DEVICE_STATE_RESUMING)
+ __u32 reserved;
+ __u64 pending_bytes;
+ __u64 data_offset;
+ __u64 data_size;
+ __u64 start_pfn;
+ __u64 page_size;
+ __u64 total_pfns;
+ __u64 copied_pfns;
+#define VFIO_DEVICE_DIRTY_PFNS_NONE (0)
+#define VFIO_DEVICE_DIRTY_PFNS_ALL (~0ULL)
+} __attribute__((packed));
+
+
/* sub-types for VFIO_REGION_TYPE_PCI_* */

/* 8086 vendor PCI sub-types */
--
2.17.1

2019-12-05 03:36:47

by Yan Zhao

Subject: [RFC PATCH 7/9] i40e/vf_migration: register mediate_ops to vfio-pci

Register vfio_pci_mediate_ops with vfio-pci when i40e binds to the PF, to
support mediating the VFs' vfio-pci ops.
Unregister vfio_pci_mediate_ops when i40e unbinds from the PF.

vfio_pci_mediate_ops->open() returns success if the devfn of the device
passed in matches the devfn of one of the PF's VFs.

Cc: Shaopeng He <[email protected]>

Signed-off-by: Yan Zhao <[email protected]>
---
drivers/net/ethernet/intel/Kconfig | 2 +-
drivers/net/ethernet/intel/i40e/Makefile | 3 +-
drivers/net/ethernet/intel/i40e/i40e.h | 2 +
drivers/net/ethernet/intel/i40e/i40e_main.c | 3 +
.../ethernet/intel/i40e/i40e_vf_migration.c | 169 ++++++++++++++++++
.../ethernet/intel/i40e/i40e_vf_migration.h | 52 ++++++
6 files changed, 229 insertions(+), 2 deletions(-)
create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h

diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
index 154e2e818ec6..b5c7fdf55380 100644
--- a/drivers/net/ethernet/intel/Kconfig
+++ b/drivers/net/ethernet/intel/Kconfig
@@ -240,7 +240,7 @@ config IXGBEVF_IPSEC
config I40E
tristate "Intel(R) Ethernet Controller XL710 Family support"
imply PTP_1588_CLOCK
- depends on PCI
+ depends on PCI && VFIO_PCI
---help---
This driver supports Intel(R) Ethernet Controller XL710 Family of
devices. For more information on how to identify your adapter, go
diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
index 2f21b3e89fd0..ae7a6a23dba9 100644
--- a/drivers/net/ethernet/intel/i40e/Makefile
+++ b/drivers/net/ethernet/intel/i40e/Makefile
@@ -24,6 +24,7 @@ i40e-objs := i40e_main.o \
i40e_ddp.o \
i40e_client.o \
i40e_virtchnl_pf.o \
- i40e_xsk.o
+ i40e_xsk.o \
+ i40e_vf_migration.o

i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 2af9f6308f84..0141c94b835f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -1162,4 +1162,6 @@ int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
struct i40e_cloud_filter *filter,
bool add);
+int i40e_vf_migration_register(void);
+void i40e_vf_migration_unregister(void);
#endif /* _I40E_H_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 6031223eafab..92d1c3fdc808 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -15274,6 +15274,7 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
/* print a string summarizing features */
i40e_print_features(pf);

+ i40e_vf_migration_register();
return 0;

/* Unwind what we've done if something failed in the setup */
@@ -15320,6 +15321,8 @@ static void i40e_remove(struct pci_dev *pdev)
i40e_status ret_code;
int i;

+ i40e_vf_migration_unregister();
+
i40e_dbg_pf_exit(pf);

i40e_ptp_stop(pf);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
new file mode 100644
index 000000000000..b2d913459600
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
@@ -0,0 +1,169 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2013 - 2019 Intel Corporation. */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/vfio.h>
+#include <linux/pci.h>
+#include <linux/eventfd.h>
+
+#include "i40e.h"
+#include "i40e_vf_migration.h"
+
+static long open_device_bits[MAX_OPEN_DEVICE / BITS_PER_LONG + 1];
+static DEFINE_MUTEX(device_bit_lock);
+static struct i40e_vf_migration *i40e_vf_dev_array[MAX_OPEN_DEVICE];
+
+int i40e_vf_migration_open(struct pci_dev *pdev, u64 *caps, u32 *dm_handle)
+{
+ int i, ret = 0;
+ struct i40e_vf_migration *i40e_vf_dev = NULL;
+ int handle;
+ struct pci_dev *pf_dev, *vf_dev;
+ struct i40e_pf *pf;
+ struct i40e_vf *vf;
+ unsigned int vf_devfn, devfn;
+ int vf_id = -1;
+
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
+ pf_dev = pdev->physfn;
+ pf = pci_get_drvdata(pf_dev);
+ vf_dev = pdev;
+ vf_devfn = vf_dev->devfn;
+
+ for (i = 0; i < pci_num_vf(pf_dev); i++) {
+ devfn = (pf_dev->devfn + pf_dev->sriov->offset +
+ pf_dev->sriov->stride * i) & 0xff;
+ if (devfn == vf_devfn) {
+ vf_id = i;
+ break;
+ }
+ }
+
+ if (vf_id == -1) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ mutex_lock(&device_bit_lock);
+ handle = find_next_zero_bit(open_device_bits, MAX_OPEN_DEVICE, 0);
+ if (handle >= MAX_OPEN_DEVICE) {
+ ret = -EBUSY;
+ goto error;
+ }
+
+ i40e_vf_dev = kzalloc(sizeof(*i40e_vf_dev), GFP_KERNEL);
+
+ if (!i40e_vf_dev) {
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ i40e_vf_dev->vf_id = vf_id;
+ i40e_vf_dev->vf_vendor = pdev->vendor;
+ i40e_vf_dev->vf_device = pdev->device;
+ i40e_vf_dev->pf_dev = pf_dev;
+ i40e_vf_dev->vf_dev = vf_dev;
+ i40e_vf_dev->handle = handle;
+
+ pr_info("%s: device %x %x, vf id %d, handle=%x\n",
+ __func__, pdev->vendor, pdev->device, vf_id, handle);
+
+ i40e_vf_dev_array[handle] = i40e_vf_dev;
+ set_bit(handle, open_device_bits);
+ vf = &pf->vf[vf_id];
+ *dm_handle = handle;
+error:
+ mutex_unlock(&device_bit_lock);
+
+ if (ret < 0) {
+ module_put(THIS_MODULE);
+ kfree(i40e_vf_dev);
+ }
+
+out:
+ return ret;
+}
+
+void i40e_vf_migration_release(int handle)
+{
+ struct i40e_vf_migration *i40e_vf_dev;
+
+ mutex_lock(&device_bit_lock);
+
+ if (handle >= MAX_OPEN_DEVICE ||
+ !i40e_vf_dev_array[handle] ||
+ !test_bit(handle, open_device_bits)) {
+ pr_err("handle mismatch, please check interaction with vfio-pci module\n");
+ mutex_unlock(&device_bit_lock);
+ return;
+ }
+
+ i40e_vf_dev = i40e_vf_dev_array[handle];
+ i40e_vf_dev_array[handle] = NULL;
+
+ clear_bit(handle, open_device_bits);
+ mutex_unlock(&device_bit_lock);
+
+ pr_info("%s: handle=%d, i40e_vf_dev VID DID =%x %x, vf id=%d\n",
+ __func__, handle,
+ i40e_vf_dev->vf_vendor, i40e_vf_dev->vf_device,
+ i40e_vf_dev->vf_id);
+
+ kfree(i40e_vf_dev);
+ module_put(THIS_MODULE);
+}
+
+static void
+i40e_vf_migration_get_region_info(int handle,
+ struct vfio_region_info *info,
+ struct vfio_info_cap *caps,
+ struct vfio_region_info_cap_type *cap_type)
+{
+}
+
+static ssize_t i40e_vf_migration_rw(int handle, char __user *buf,
+ size_t count, loff_t *ppos,
+ bool iswrite, bool *pt)
+{
+ *pt = true;
+
+ return 0;
+}
+
+static int i40e_vf_migration_mmap(int handle, struct vm_area_struct *vma,
+ bool *pt)
+{
+ *pt = true;
+ return 0;
+}
+
+static struct vfio_pci_mediate_ops i40e_vf_migration_ops = {
+ .name = "i40e_vf",
+ .open = i40e_vf_migration_open,
+ .release = i40e_vf_migration_release,
+ .get_region_info = i40e_vf_migration_get_region_info,
+ .rw = i40e_vf_migration_rw,
+ .mmap = i40e_vf_migration_mmap,
+};
+
+int i40e_vf_migration_register(void)
+{
+ int ret = 0;
+
+ pr_info("%s\n", __func__);
+
+ memset(open_device_bits, 0, sizeof(open_device_bits));
+ memset(i40e_vf_dev_array, 0, sizeof(i40e_vf_dev_array));
+ vfio_pci_register_mediate_ops(&i40e_vf_migration_ops);
+
+ return ret;
+}
+
+void i40e_vf_migration_unregister(void)
+{
+ pr_info("%s\n", __func__);
+ vfio_pci_unregister_mediate_ops(&i40e_vf_migration_ops);
+}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
new file mode 100644
index 000000000000..b195399b6788
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2013 - 2019 Intel Corporation. */
+
+#ifndef I40E_MIG_H
+#define I40E_MIG_H
+
+#include <linux/pci.h>
+#include <linux/vfio.h>
+#include <linux/mdev.h>
+
+#include "i40e.h"
+#include "i40e_txrx.h"
+
+#define MAX_OPEN_DEVICE 1024
+
+/* Single Root I/O Virtualization */
+struct pci_sriov {
+ int pos; /* Capability position */
+ int nres; /* Number of resources */
+ u32 cap; /* SR-IOV Capabilities */
+ u16 ctrl; /* SR-IOV Control */
+ u16 total_VFs; /* Total VFs associated with the PF */
+ u16 initial_VFs; /* Initial VFs associated with the PF */
+ u16 num_VFs; /* Number of VFs available */
+ u16 offset; /* First VF Routing ID offset */
+ u16 stride; /* Following VF stride */
+ u16 vf_device; /* VF device ID */
+ u32 pgsz; /* Page size for BAR alignment */
+ u8 link; /* Function Dependency Link */
+ u8 max_VF_buses; /* Max buses consumed by VFs */
+ u16 driver_max_VFs; /* Max num VFs driver supports */
+ struct pci_dev *dev; /* Lowest numbered PF */
+ struct pci_dev *self; /* This PF */
+ u32 cfg_size; /* VF config space size */
+ u32 class; /* VF device */
+ u8 hdr_type; /* VF header type */
+ u16 subsystem_vendor; /* VF subsystem vendor */
+ u16 subsystem_device; /* VF subsystem device */
+ resource_size_t barsz[PCI_SRIOV_NUM_BARS]; /* VF BAR size */
+ bool drivers_autoprobe; /* Auto probing of VFs by driver */
+};
+
+struct i40e_vf_migration {
+ __u32 vf_vendor;
+ __u32 vf_device;
+ __u32 handle;
+ struct pci_dev *pf_dev;
+ struct pci_dev *vf_dev;
+ int vf_id;
+};
+#endif /* I40E_MIG_H */
+
--
2.17.1

2019-12-05 03:36:59

by Yan Zhao

Subject: [RFC PATCH 8/9] i40e/vf_migration: mediate migration region

In vfio_pci_mediate_ops->get_region_info(), the migration region's len and
flags are overridden and its region index is saved.

vfio_pci_mediate_ops->rw() and vfio_pci_mediate_ops->mmap() override the
default rw/mmap for the migration region.

This is only a sample implementation of i40e VF migration to demonstrate
what VF migration code will look like. The actual dirty page tracking and
device state retrieval code will be sent in the future. Currently only
comments are used as placeholders.

It's based on QEMU vfio migration code v8:
(https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html).

Cc: Shaopeng He <[email protected]>

Signed-off-by: Yan Zhao <[email protected]>
---
.../ethernet/intel/i40e/i40e_vf_migration.c | 335 +++++++++++++++++-
.../ethernet/intel/i40e/i40e_vf_migration.h | 14 +
2 files changed, 345 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
index b2d913459600..5bb509fed66e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
@@ -14,6 +14,55 @@ static long open_device_bits[MAX_OPEN_DEVICE / BITS_PER_LONG + 1];
static DEFINE_MUTEX(device_bit_lock);
static struct i40e_vf_migration *i40e_vf_dev_array[MAX_OPEN_DEVICE];

+static bool is_handle_valid(int handle)
+{
+ mutex_lock(&device_bit_lock);
+
+ if (handle >= MAX_OPEN_DEVICE || !i40e_vf_dev_array[handle] ||
+ !test_bit(handle, open_device_bits)) {
+ pr_err("%s: handle mismatch, please check interaction with vfio-pci module\n",
+ __func__);
+ mutex_unlock(&device_bit_lock);
+ return false;
+ }
+ mutex_unlock(&device_bit_lock);
+ return true;
+}
+
+static size_t set_device_state(struct i40e_vf_migration *i40e_vf_dev, u32 state)
+{
+ int ret = 0;
+ struct vfio_device_migration_info *mig_ctl = i40e_vf_dev->mig_ctl;
+
+ if (state == mig_ctl->device_state)
+ return ret;
+
+ switch (state) {
+ case VFIO_DEVICE_STATE_RUNNING:
+ break;
+ case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
+ // alloc dirty page tracking resources and
+ // do the first round dirty page scanning
+ break;
+ case VFIO_DEVICE_STATE_SAVING:
+ // do the last round of dirty page scanning
+ break;
+ case ~VFIO_DEVICE_STATE_MASK & VFIO_DEVICE_STATE_MASK:
+ // release dirty page tracking resources
+ //if (mig_ctl->device_state == VFIO_DEVICE_STATE_SAVING)
+ // i40e_release_scan_resources(i40e_vf_dev);
+ break;
+ case VFIO_DEVICE_STATE_RESUMING:
+ break;
+ default:
+ ret = -EFAULT;
+ }
+
+ mig_ctl->device_state = state;
+
+ return ret;
+}
+
int i40e_vf_migration_open(struct pci_dev *pdev, u64 *caps, u32 *dm_handle)
{
int i, ret = 0;
@@ -24,6 +73,8 @@ int i40e_vf_migration_open(struct pci_dev *pdev, u64 *caps, u32 *dm_handle)
struct i40e_vf *vf;
unsigned int vf_devfn, devfn;
int vf_id = -1;
+ struct vfio_device_migration_info *mig_ctl = NULL;
+ void *dirty_bitmap_base = NULL;

if (!try_module_get(THIS_MODULE))
return -ENODEV;
@@ -68,18 +119,41 @@ int i40e_vf_migration_open(struct pci_dev *pdev, u64 *caps, u32 *dm_handle)
i40e_vf_dev->vf_dev = vf_dev;
i40e_vf_dev->handle = handle;

- pr_info("%s: device %x %x, vf id %d, handle=%x\n",
- __func__, pdev->vendor, pdev->device, vf_id, handle);
+ mig_ctl = kzalloc(sizeof(*mig_ctl), GFP_KERNEL);
+ if (!mig_ctl) {
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ dirty_bitmap_base = vmalloc_user(MIGRATION_DIRTY_BITMAP_SIZE);
+ if (!dirty_bitmap_base) {
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ i40e_vf_dev->dirty_bitmap = dirty_bitmap_base;
+ i40e_vf_dev->mig_ctl = mig_ctl;
+ i40e_vf_dev->migration_region_size = DIRTY_BITMAP_OFFSET +
+ MIGRATION_DIRTY_BITMAP_SIZE;
+ i40e_vf_dev->migration_region_index = -1;
+
+ vf = &pf->vf[vf_id];

i40e_vf_dev_array[handle] = i40e_vf_dev;
set_bit(handle, open_device_bits);
- vf = &pf->vf[vf_id];
*dm_handle = handle;
+
+ *caps |= VFIO_PCI_DEVICE_CAP_MIGRATION;
+
+ pr_info("%s: device %x %x, vf id %d, handle=%x\n",
+ __func__, pdev->vendor, pdev->device, vf_id, handle);
error:
mutex_unlock(&device_bit_lock);

if (ret < 0) {
module_put(THIS_MODULE);
+ kfree(mig_ctl);
+ vfree(dirty_bitmap_base);
kfree(i40e_vf_dev);
}

@@ -112,32 +186,285 @@ void i40e_vf_migration_release(int handle)
i40e_vf_dev->vf_vendor, i40e_vf_dev->vf_device,
i40e_vf_dev->vf_id);

+ kfree(i40e_vf_dev->mig_ctl);
+ vfree(i40e_vf_dev->dirty_bitmap);
kfree(i40e_vf_dev);
+
module_put(THIS_MODULE);
}

+static void migration_region_sparse_mmap_cap(struct vfio_info_cap *caps)
+{
+ struct vfio_region_info_cap_sparse_mmap *sparse;
+ size_t size;
+ int nr_areas = 1;
+
+ size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
+
+ sparse = kzalloc(size, GFP_KERNEL);
+ if (!sparse)
+ return;
+
+ sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+ sparse->header.version = 1;
+ sparse->nr_areas = nr_areas;
+
+ sparse->areas[0].offset = DIRTY_BITMAP_OFFSET;
+ sparse->areas[0].size = MIGRATION_DIRTY_BITMAP_SIZE;
+
+ vfio_info_add_capability(caps, &sparse->header, size);
+ kfree(sparse);
+}
+
static void
i40e_vf_migration_get_region_info(int handle,
struct vfio_region_info *info,
struct vfio_info_cap *caps,
struct vfio_region_info_cap_type *cap_type)
{
+ if (!is_handle_valid(handle))
+ return;
+
+ switch (info->index) {
+ case VFIO_PCI_BAR0_REGION_INDEX:
+ info->flags = VFIO_REGION_INFO_FLAG_READ |
+ VFIO_REGION_INFO_FLAG_WRITE;
+
+ break;
+ case VFIO_PCI_BAR1_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+ case VFIO_PCI_CONFIG_REGION_INDEX:
+ case VFIO_PCI_ROM_REGION_INDEX:
+ case VFIO_PCI_VGA_REGION_INDEX:
+ break;
+ default:
+ if (cap_type->type == VFIO_REGION_TYPE_MIGRATION &&
+ cap_type->subtype == VFIO_REGION_SUBTYPE_MIGRATION) {
+ struct i40e_vf_migration *i40e_vf_dev;
+
+ i40e_vf_dev = i40e_vf_dev_array[handle];
+ i40e_vf_dev->migration_region_index = info->index;
+ info->size = i40e_vf_dev->migration_region_size;
+
+ info->flags = VFIO_REGION_INFO_FLAG_CAPS |
+ VFIO_REGION_INFO_FLAG_READ |
+ VFIO_REGION_INFO_FLAG_WRITE |
+ VFIO_REGION_INFO_FLAG_MMAP;
+ migration_region_sparse_mmap_cap(caps);
+ }
+ }
+}
+
+static
+ssize_t i40e_vf_migration_region_rw(struct i40e_vf_migration *i40e_vf_dev,
+ char __user *buf, size_t count,
+ loff_t *ppos, bool iswrite, bool *pt)
+{
+#define VDM_OFFSET(x) offsetof(struct vfio_device_migration_info, x)
+ struct vfio_device_migration_info *mig_ctl = i40e_vf_dev->mig_ctl;
+ u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ ssize_t ret = 0;
+
+ *pt = false;
+ switch (pos) {
+ case VDM_OFFSET(device_state):
+ if (count != sizeof(mig_ctl->device_state))
+ return -EINVAL;
+
+ if (iswrite) {
+ u32 device_state;
+
+ if (copy_from_user(&device_state, buf, count))
+ return -EFAULT;
+
+ set_device_state(i40e_vf_dev, device_state);
+ ret = count;
+ } else {
+ ret = -EFAULT;
+ }
+ break;
+
+ case VDM_OFFSET(reserved):
+ ret = -EFAULT;
+ break;
+
+ case VDM_OFFSET(pending_bytes):
+ if (count != sizeof(mig_ctl->pending_bytes))
+ return -EINVAL;
+
+ if (iswrite) {
+ ret = -EFAULT;
+ } else {
+ u64 p_bytes = 0;
+
+ ret = copy_to_user(buf, &p_bytes, count) ?
+ -EFAULT : count;
+ }
+ break;
+
+ case VDM_OFFSET(data_offset):
+ if (count != sizeof(mig_ctl->data_offset))
+ return -EINVAL;
+
+ if (iswrite) {
+ ret = -EFAULT;
+ } else {
+ u64 d_off = DIRTY_BITMAP_OFFSET;
+ /* always return dirty bitmap offset
+ * here as we don't support device
+ * internal dirty data
+ * and our pending_bytes always return 0
+ */
+ ret = copy_to_user(buf, &d_off, count) ?
+ -EFAULT : count;
+ }
+ break;
+
+ case VDM_OFFSET(data_size):
+ if (count != sizeof(mig_ctl->data_size))
+ return -EINVAL;
+
+ if (iswrite)
+ ret = copy_from_user(&mig_ctl->data_size, buf,
+ count) ? -EFAULT : count;
+ else
+ ret = copy_to_user(buf, &mig_ctl->data_size,
+ count) ? -EFAULT : count;
+ break;
+
+ case VDM_OFFSET(start_pfn):
+ if (count != sizeof(mig_ctl->start_pfn))
+ return -EINVAL;
+
+ if (iswrite)
+ ret = copy_from_user(&mig_ctl->start_pfn, buf,
+ count) ? -EFAULT : count;
+ else
+ ret = -EFAULT;
+ break;
+
+ case VDM_OFFSET(page_size):
+ if (count != sizeof(mig_ctl->page_size))
+ return -EINVAL;
+
+ if (iswrite)
+ ret = copy_from_user(&mig_ctl->page_size, buf,
+ count) ? -EFAULT : count;
+ else
+ ret = -EFAULT;
+ break;
+
+ case VDM_OFFSET(total_pfns):
+ if (count != sizeof(mig_ctl->total_pfns))
+ return -EINVAL;
+
+ if (iswrite) {
+ if (copy_from_user(&mig_ctl->total_pfns, buf, count))
+ return -EFAULT;
+
+ //calc dirty page bitmap
+ ret = count;
+ } else {
+ ret = -EFAULT;
+ }
+ break;
+
+ case VDM_OFFSET(copied_pfns):
+ if (count != sizeof(mig_ctl->copied_pfns))
+ return -EINVAL;
+
+ if (iswrite)
+ ret = -EFAULT;
+ else
+ ret = copy_to_user(buf, &mig_ctl->copied_pfns,
+ count) ? -EFAULT : count;
+ break;
+
+ case DIRTY_BITMAP_OFFSET:
+ if (count > MIGRATION_DIRTY_BITMAP_SIZE)
+ return -EINVAL;
+
+ if (iswrite)
+ ret = -EFAULT;
+ else
+ ret = copy_to_user(buf, i40e_vf_dev->dirty_bitmap,
+ count) ? -EFAULT : count;
+ break;
+ default:
+ ret = -EFAULT;
+ break;
+ }
+ return ret;
}

static ssize_t i40e_vf_migration_rw(int handle, char __user *buf,
size_t count, loff_t *ppos,
bool iswrite, bool *pt)
{
+ unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+ struct i40e_vf_migration *i40e_vf_dev;
+
*pt = true;

+ if (!is_handle_valid(handle))
+ return 0;
+
+ i40e_vf_dev = i40e_vf_dev_array[handle];
+
+ switch (index) {
+ case VFIO_PCI_BAR0_REGION_INDEX:
+ // scan dirty pages
+ break;
+ case VFIO_PCI_BAR1_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+ case VFIO_PCI_CONFIG_REGION_INDEX:
+ case VFIO_PCI_ROM_REGION_INDEX:
+ case VFIO_PCI_VGA_REGION_INDEX:
+ break;
+ default:
+ if (index == i40e_vf_dev->migration_region_index) {
+ return i40e_vf_migration_region_rw(i40e_vf_dev, buf,
+ count, ppos, iswrite, pt);
+ }
+ }
return 0;
}

static int i40e_vf_migration_mmap(int handle, struct vm_area_struct *vma,
bool *pt)
{
+ unsigned int index;
+ struct i40e_vf_migration *i40e_vf_dev;
+ unsigned long pgoff = 0;
+ void *base;
+
+ index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+
*pt = true;
- return 0;
+ if (!is_handle_valid(handle))
+ return -EINVAL;
+
+ i40e_vf_dev = i40e_vf_dev_array[handle];
+
+ if (index != i40e_vf_dev->migration_region_index)
+ return 0;
+
+ *pt = false;
+ base = i40e_vf_dev->dirty_bitmap;
+
+ if (vma->vm_end < vma->vm_start)
+ return -EINVAL;
+
+ if (!(vma->vm_flags & VM_SHARED))
+ return -EINVAL;
+
+ pgoff = vma->vm_pgoff &
+ ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+
+ if (pgoff != DIRTY_BITMAP_OFFSET / PAGE_SIZE)
+ return -EINVAL;
+
+ pr_info("%s, handle=%d, vf_id=%d, pgoff %lx\n", __func__,
+ handle, i40e_vf_dev->vf_id, pgoff);
+ return remap_vmalloc_range(vma, base, 0);
}

static struct vfio_pci_mediate_ops i40e_vf_migration_ops = {
diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
index b195399b6788..b31b500b3cd6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
@@ -11,8 +11,16 @@
#include "i40e.h"
#include "i40e_txrx.h"

+/* helper macros copied from vfio-pci */
+#define VFIO_PCI_OFFSET_SHIFT 40
+#define VFIO_PCI_OFFSET_TO_INDEX(off) ((off) >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
#define MAX_OPEN_DEVICE 1024

+#define DIRTY_BITMAP_OFFSET \
+ PAGE_ALIGN(sizeof(struct vfio_device_migration_info))
+#define MIGRATION_DIRTY_BITMAP_SIZE (64 * 1024UL)
+
/* Single Root I/O Virtualization */
struct pci_sriov {
int pos; /* Capability position */
@@ -47,6 +55,12 @@ struct i40e_vf_migration {
struct pci_dev *pf_dev;
struct pci_dev *vf_dev;
int vf_id;
+
+ __u64 migration_region_index;
+ __u64 migration_region_size;
+
+ struct vfio_device_migration_info *mig_ctl;
+ void *dirty_bitmap;
};
#endif /* I40E_MIG_H */

--
2.17.1
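
For reference, below is a minimal userspace sketch of how a VMM could drive
the migration region implemented above during one pre-copy round. It is
illustrative only and not part of the posted series: it assumes
<linux/vfio.h> with this series applied, a device fd opened through VFIO,
and a previously discovered region offset; most error handling is omitted.

/*
 * Illustrative sketch only: one pre-copy round against the migration
 * region above.  `region_off` is the file offset of the
 * VFIO_REGION_TYPE_MIGRATION / VFIO_REGION_SUBTYPE_MIGRATION region.
 */
#include <linux/vfio.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

#define MIG_OFF(base, f) \
	((base) + offsetof(struct vfio_device_migration_info, f))

static ssize_t precopy_round(int device_fd, uint64_t region_off,
			     uint64_t start_pfn, uint64_t page_size,
			     uint64_t total_pfns,
			     void *bitmap, size_t bitmap_size)
{
	uint32_t state = VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING;
	uint64_t data_offset, copied_pfns;

	/* enter pre-copy: device keeps running while dirty pages are tracked */
	if (pwrite(device_fd, &state, sizeof(state),
		   MIG_OFF(region_off, device_state)) != sizeof(state))
		return -1;

	/* describe the pfn range a dirty bitmap is wanted for */
	pwrite(device_fd, &start_pfn, sizeof(start_pfn),
	       MIG_OFF(region_off, start_pfn));
	pwrite(device_fd, &page_size, sizeof(page_size),
	       MIG_OFF(region_off, page_size));
	/* writing total_pfns triggers the driver's dirty page calculation */
	pwrite(device_fd, &total_pfns, sizeof(total_pfns),
	       MIG_OFF(region_off, total_pfns));

	/* how many pfns the driver reported, and where the bitmap starts */
	pread(device_fd, &copied_pfns, sizeof(copied_pfns),
	      MIG_OFF(region_off, copied_pfns));
	pread(device_fd, &data_offset, sizeof(data_offset),
	      MIG_OFF(region_off, data_offset));

	/* the bitmap can be read here, or mmapped per the sparse mmap cap */
	return pread(device_fd, bitmap, bitmap_size, region_off + data_offset);
}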

2019-12-05 03:37:26

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 9/9] i40e/vf_migration: support dynamic trap of bar0

Mediate the dynamic_trap_info region to dynamically trap bar0.

bar0 is sparsely mmapped into 5 sub-regions, of which only two need to be
dynamically trapped.
By mediating the dynamic_trap_info region and telling QEMU this
information, the two sub-regions of bar0 can be trapped when migration
starts and put back into passthrough mode if migration fails.

Cc: Shaopeng He <[email protected]>

Signed-off-by: Yan Zhao <[email protected]>
---
.../ethernet/intel/i40e/i40e_vf_migration.c | 140 +++++++++++++++++-
.../ethernet/intel/i40e/i40e_vf_migration.h | 12 ++
2 files changed, 147 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
index 5bb509fed66e..0b9d5be85049 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
@@ -29,6 +29,21 @@ static bool is_handle_valid(int handle)
return true;
}

+static
+void i40e_vf_migration_dynamic_trap_bar(struct i40e_vf_migration *i40e_vf_dev)
+{
+ if (i40e_vf_dev->dt_trigger)
+ eventfd_signal(i40e_vf_dev->dt_trigger, 1);
+}
+
+static void i40e_vf_trap_bar0(struct i40e_vf_migration *i40e_vf_dev, bool trap)
+{
+ if (i40e_vf_dev->trap_bar0 != trap) {
+ i40e_vf_dev->trap_bar0 = trap;
+ i40e_vf_migration_dynamic_trap_bar(i40e_vf_dev);
+ }
+}
+
static size_t set_device_state(struct i40e_vf_migration *i40e_vf_dev, u32 state)
{
int ret = 0;
@@ -39,8 +54,10 @@ static size_t set_device_state(struct i40e_vf_migration *i40e_vf_dev, u32 state)

switch (state) {
case VFIO_DEVICE_STATE_RUNNING:
+ i40e_vf_trap_bar0(i40e_vf_dev, false);
break;
case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
+ i40e_vf_trap_bar0(i40e_vf_dev, true);
// alloc dirty page tracking resources and
// do the first round dirty page scanning
break;
@@ -137,16 +154,22 @@ int i40e_vf_migration_open(struct pci_dev *pdev, u64 *caps, u32 *dm_handle)
MIGRATION_DIRTY_BITMAP_SIZE;
i40e_vf_dev->migration_region_index = -1;

+ i40e_vf_dev->dt_region_index = -1;
+ i40e_vf_dev->trap_bar0 = false;
+
vf = &pf->vf[vf_id];

i40e_vf_dev_array[handle] = i40e_vf_dev;
set_bit(handle, open_device_bits);
+
*dm_handle = handle;

*caps |= VFIO_PCI_DEVICE_CAP_MIGRATION;
+ *caps |= VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR;

pr_info("%s: device %x %x, vf id %d, handle=%x\n",
__func__, pdev->vendor, pdev->device, vf_id, handle);
+
error:
mutex_unlock(&device_bit_lock);

@@ -188,6 +211,10 @@ void i40e_vf_migration_release(int handle)

kfree(i40e_vf_dev->mig_ctl);
vfree(i40e_vf_dev->dirty_bitmap);
+
+ if (i40e_vf_dev->dt_trigger)
+ eventfd_ctx_put(i40e_vf_dev->dt_trigger);
+
kfree(i40e_vf_dev);

module_put(THIS_MODULE);
@@ -216,6 +243,47 @@ static void migration_region_sparse_mmap_cap(struct vfio_info_cap *caps)
kfree(sparse);
}

+static void bar0_sparse_mmap_cap(struct vfio_region_info *info,
+ struct vfio_info_cap *caps)
+{
+ struct vfio_region_info_cap_sparse_mmap *sparse;
+ size_t size;
+ int nr_areas = 5;
+
+ size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
+
+ sparse = kzalloc(size, GFP_KERNEL);
+ if (!sparse)
+ return;
+
+ sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+ sparse->header.version = 1;
+ sparse->nr_areas = nr_areas;
+
+ sparse->areas[0].offset = 0;
+ sparse->areas[0].size = IAVF_VF_TAIL_START;
+ sparse->areas[0].disablable = 0; /* always passthrough */
+
+ sparse->areas[1].offset = IAVF_VF_TAIL_START;
+ sparse->areas[1].size = PAGE_SIZE;
+ sparse->areas[1].disablable = 1; /* able to be dynamically disabled (trapped) */
+
+ sparse->areas[2].offset = IAVF_VF_TAIL_START + PAGE_SIZE;
+ sparse->areas[2].size = IAVF_VF_ARQH1 - sparse->areas[2].offset;
+ sparse->areas[2].disablable = 0; /* always passthrough */
+
+ sparse->areas[3].offset = IAVF_VF_ARQT1;
+ sparse->areas[3].size = PAGE_SIZE;
+ sparse->areas[3].disablable = 1; /* able to be dynamically disabled (trapped) */
+
+ sparse->areas[4].offset = IAVF_VF_ARQT1 + PAGE_SIZE;
+ sparse->areas[4].size = info->size - sparse->areas[4].offset;
+ sparse->areas[4].disablable = 0; /* always passthrough */
+
+ vfio_info_add_capability(caps, &sparse->header, size);
+ kfree(sparse);
+}
+
static void
i40e_vf_migration_get_region_info(int handle,
struct vfio_region_info *info,
@@ -227,9 +295,8 @@ i40e_vf_migration_get_region_info(int handle,

switch (info->index) {
case VFIO_PCI_BAR0_REGION_INDEX:
- info->flags = VFIO_REGION_INFO_FLAG_READ |
- VFIO_REGION_INFO_FLAG_WRITE;
-
+ info->flags |= VFIO_REGION_INFO_FLAG_MMAP;
+ bar0_sparse_mmap_cap(info, caps);
break;
case VFIO_PCI_BAR1_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
case VFIO_PCI_CONFIG_REGION_INDEX:
@@ -237,7 +304,17 @@
case VFIO_PCI_VGA_REGION_INDEX:
break;
default:
- if (cap_type->type == VFIO_REGION_TYPE_MIGRATION &&
+ if (cap_type->type ==
+ VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO &&
+ cap_type->subtype ==
+ VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO) {
+ struct i40e_vf_migration *i40e_vf_dev;
+
+ i40e_vf_dev = i40e_vf_dev_array[handle];
+ i40e_vf_dev->dt_region_index = info->index;
+ info->size =
+ sizeof(struct vfio_device_dt_bar_info_region);
+ } else if (cap_type->type == VFIO_REGION_TYPE_MIGRATION &&
cap_type->subtype == VFIO_REGION_SUBTYPE_MIGRATION) {
struct i40e_vf_migration *i40e_vf_dev;

@@ -254,6 +334,53 @@ i40e_vf_migration_get_region_info(int handle,
}
}

+static ssize_t i40e_vf_dt_region_rw(struct i40e_vf_migration *i40e_vf_dev,
+ char __user *buf, size_t count,
+ loff_t *ppos, bool iswrite, bool *pt)
+{
+#define DT_REGION_OFFSET(x) offsetof(struct vfio_device_dt_bar_info_region, x)
+ u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ ssize_t ret = 0;
+
+ *pt = false;
+ switch (pos) {
+ case DT_REGION_OFFSET(dt_fd):
+ if (iswrite) {
+ u32 dt_fd;
+ struct eventfd_ctx *trigger;
+
+ if (copy_from_user(&dt_fd, buf, sizeof(dt_fd)))
+ return -EFAULT;
+
+ trigger = eventfd_ctx_fdget(dt_fd);
+ if (IS_ERR(trigger)) {
+ pr_err("i40e_vf_migration_rw, dt trigger fd set error\n");
+ return -EINVAL;
+ }
+ i40e_vf_dev->dt_trigger = trigger;
+ ret = sizeof(dt_fd);
+ } else {
+ ret = -EFAULT;
+ }
+ break;
+
+ case DT_REGION_OFFSET(trap):
+ if (iswrite)
+ ret = copy_from_user(&i40e_vf_dev->trap_bar0,
+ buf, count) ? -EFAULT : count;
+ else
+ ret = copy_to_user(buf,
+ &i40e_vf_dev->trap_bar0,
+ sizeof(u32)) ?
+ -EFAULT : sizeof(u32);
+ break;
+ default:
+ ret = -EFAULT;
+ break;
+ }
+ return ret;
+}
+
static
ssize_t i40e_vf_migration_region_rw(struct i40e_vf_migration *i40e_vf_dev,
char __user *buf, size_t count,
@@ -420,7 +547,10 @@ static ssize_t i40e_vf_migration_rw(int handle, char __user *buf,
case VFIO_PCI_VGA_REGION_INDEX:
break;
default:
- if (index == i40e_vf_dev->migration_region_index) {
+ if (index == i40e_vf_dev->dt_region_index) {
+ return i40e_vf_dt_region_rw(i40e_vf_dev, buf, count,
+ ppos, iswrite, pt);
+ } else if (index == i40e_vf_dev->migration_region_index) {
return i40e_vf_migration_region_rw(i40e_vf_dev, buf,
count, ppos, iswrite, pt);
}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
index b31b500b3cd6..dfad4cc7e46f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
@@ -21,6 +21,14 @@
PAGE_ALIGN(sizeof(struct vfio_device_migration_info))
#define MIGRATION_DIRTY_BITMAP_SIZE (64 * 1024UL)

+#define IAVF_VF_ARQBAH1 0x00006000 /* Reset: EMPR */
+#define IAVF_VF_ARQBAL1 0x00006C00 /* Reset: EMPR */
+#define IAVF_VF_ARQH1 0x00007400 /* Reset: EMPR */
+#define IAVF_VF_ARQT1 0x00007000 /* Reset: EMPR */
+#define IAVF_VF_ARQLEN1 0x00008000 /* Reset: EMPR */
+#define IAVF_VF_TAIL_START 0x00002000 /* Start of tail register region */
+#define IAVF_VF_TAIL_END 0x00002400 /* End of tail register region */
+
/* Single Root I/O Virtualization */
struct pci_sriov {
int pos; /* Capability position */
@@ -56,6 +64,10 @@ struct i40e_vf_migration {
struct pci_dev *vf_dev;
int vf_id;

+ __u64 dt_region_index;
+ struct eventfd_ctx *dt_trigger;
+ bool trap_bar0;
+
__u64 migration_region_index;
__u64 migration_region_size;

--
2.17.1

2019-12-05 03:37:39

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 6/9] sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0

This sample driver first reports device cap
VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR, so that the vfio-pci driver
creates a dynamic-trap-bar-info region for it
(of type VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and
subtype VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO).

Then in igd_dt_get_region_info(), this sample driver customizes the
size of the dynamic-trap-bar-info region.
It also customizes the BAR 0 region to be sparse mmapped (only the
subregion starting at BAR0_DYNAMIC_TRAP_OFFSET of size
BAR0_DYNAMIC_TRAP_SIZE is passthrough) and marks this sparse mmapped
subregion as disablable.

When QEMU detects the dynamic trap bar info region, it creates an
eventfd and writes its fd into the 'dt_fd' field of this region.

When a BAR0 register below BAR0_DYNAMIC_TRAP_OFFSET is trapped, this
sample driver signals the eventfd to notify QEMU to read the 'trap'
field of the dynamic trap bar info region and switch the previously
passthrough subregion into trapped mode.
After accesses to registers between BAR0_DYNAMIC_TRAP_OFFSET and
BAR0_DYNAMIC_TRAP_OFFSET + BAR0_DYNAMIC_TRAP_SIZE are trapped, this
sample driver notifies QEMU via the eventfd to passthrough this
subregion again.
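
For illustration only (this is not the actual QEMU implementation), a rough
sketch of the userspace side of this handshake is shown below. It assumes
the struct vfio_device_dt_bar_info_region layout introduced earlier in this
series, a device fd opened through VFIO, and a previously discovered region
offset dt_region_off; names and error handling are illustrative.

/* Illustrative only: one round of the dynamic-trap handshake, user side. */
#include <linux/vfio.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define DT_OFF(base, f) \
	((base) + offsetof(struct vfio_device_dt_bar_info_region, f))

static int handle_dynamic_trap(int device_fd, uint64_t dt_region_off)
{
	struct pollfd pfd;
	uint32_t dt_fd, trap;
	uint64_t cnt;
	int efd;

	/* hand the vendor driver an eventfd it can signal on trap/untrap */
	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;
	dt_fd = efd;
	if (pwrite(device_fd, &dt_fd, sizeof(dt_fd),
		   DT_OFF(dt_region_off, dt_fd)) != sizeof(dt_fd))
		return -1;

	/* wait for the vendor driver to request a change */
	pfd.fd = efd;
	pfd.events = POLLIN;
	if (poll(&pfd, 1, -1) <= 0)
		return -1;
	read(efd, &cnt, sizeof(cnt));	/* drain the eventfd counter */

	/* read back whether the disablable sparse areas should be trapped */
	if (pread(device_fd, &trap, sizeof(trap),
		  DT_OFF(dt_region_off, trap)) != sizeof(trap))
		return -1;

	if (trap) {
		/* munmap the disablable sparse-mmap areas: accesses now trap */
	} else {
		/* re-mmap them: accesses go straight to hardware again */
	}
	return trap;
}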

Cc: Kevin Tian <[email protected]>

Signed-off-by: Yan Zhao <[email protected]>
---
samples/vfio-pci/igd_dt.c | 176 ++++++++++++++++++++++++++++++++++++++
1 file changed, 176 insertions(+)

diff --git a/samples/vfio-pci/igd_dt.c b/samples/vfio-pci/igd_dt.c
index 857e8d01b0d1..58ef110917f1 100644
--- a/samples/vfio-pci/igd_dt.c
+++ b/samples/vfio-pci/igd_dt.c
@@ -29,6 +29,9 @@
/* This driver supports to open max 256 device devices */
#define MAX_OPEN_DEVICE 256

+#define BAR0_DYNAMIC_TRAP_OFFSET (32*1024)
+#define BAR0_DYNAMIC_TRAP_SIZE (32*1024)
+
/*
* below are pciids of two IGD devices supported in this driver
* It is only for demo purpose.
@@ -47,10 +50,30 @@ struct igd_dt_device {
__u32 vendor;
__u32 device;
__u32 handle;
+
+ __u64 dt_region_index;
+ struct eventfd_ctx *dt_trigger;
+ bool is_highend_trapped;
+ bool is_trap_triggered;
};

static struct igd_dt_device *igd_device_array[MAX_OPEN_DEVICE];

+static bool is_handle_valid(int handle)
+{
+ mutex_lock(&device_bit_lock);
+
+ if (handle >= MAX_OPEN_DEVICE || !igd_device_array[handle] ||
+ !test_bit(handle, igd_device_bits)) {
+ pr_err("%s: handle mismatch, please check interaction with vfio-pci module\n",
+ __func__);
+ mutex_unlock(&device_bit_lock);
+ return false;
+ }
+ mutex_unlock(&device_bit_lock);
+ return true;
+}
+
int igd_dt_open(struct pci_dev *pdev, u64 *caps, u32 *mediate_handle)
{
int supported_dev_cnt = sizeof(pciidlist)/sizeof(struct pci_device_id);
@@ -88,6 +111,7 @@ int igd_dt_open(struct pci_dev *pdev, u64 *caps, u32 *mediate_handle)
igd_device->vendor = pdev->vendor;
igd_device->device = pdev->device;
igd_device->handle = handle;
+ igd_device->dt_region_index = -1;
igd_device_array[handle] = igd_device;
set_bit(handle, igd_device_bits);

@@ -95,6 +119,7 @@ int igd_dt_open(struct pci_dev *pdev, u64 *caps, u32 *mediate_handle)
pdev->vendor, pdev->device, handle);

*mediate_handle = handle;
+ *caps |= VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR;

error:
mutex_unlock(&device_bit_lock);
@@ -135,14 +160,165 @@ static void igd_dt_get_region_info(int handle,
struct vfio_info_cap *caps,
struct vfio_region_info_cap_type *cap_type)
{
+ struct vfio_region_info_cap_sparse_mmap *sparse;
+ size_t size;
+ int nr_areas, ret;
+
+ if (!is_handle_valid(handle))
+ return;
+
+ switch (info->index) {
+ case VFIO_PCI_BAR0_REGION_INDEX:
+ info->flags |= VFIO_REGION_INFO_FLAG_MMAP;
+ nr_areas = 1;
+
+ size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
+
+ sparse = kzalloc(size, GFP_KERNEL);
+ if (!sparse)
+ return;
+
+ sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+ sparse->header.version = 1;
+ sparse->nr_areas = nr_areas;
+
+ sparse->areas[0].offset = BAR0_DYNAMIC_TRAP_OFFSET;
+ sparse->areas[0].size = BAR0_DYNAMIC_TRAP_SIZE;
+ sparse->areas[0].disablable = 1; /* able to be disabled */
+
+ ret = vfio_info_add_capability(caps, &sparse->header,
+ size);
+ kfree(sparse);
+ break;
+ case VFIO_PCI_BAR1_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+ case VFIO_PCI_CONFIG_REGION_INDEX:
+ case VFIO_PCI_ROM_REGION_INDEX:
+ case VFIO_PCI_VGA_REGION_INDEX:
+ break;
+ default:
+ if ((cap_type->type ==
+ VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO) &&
+ (cap_type->subtype ==
+ VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO)){
+ struct igd_dt_device *igd_device;
+
+ igd_device = igd_device_array[handle];
+ igd_device->dt_region_index = info->index;
+ info->size =
+ sizeof(struct vfio_device_dt_bar_info_region);
+ }
+ }
+}
+
+static
+void igd_dt_set_bar_mmap_enabled(struct igd_dt_device *igd_device,
+ bool enabled)
+{
+ bool disable_bar = !enabled;
+
+ if (igd_device->is_highend_trapped == disable_bar)
+ return;
+
+ igd_device->is_highend_trapped = disable_bar;
+
+ if (igd_device->dt_trigger)
+ eventfd_signal(igd_device->dt_trigger, 1);
+}
+
+static ssize_t igd_dt_dt_region_rw(struct igd_dt_device *igd_device,
+ char __user *buf, size_t count,
+ loff_t *ppos, bool iswrite, bool *pt)
+{
+#define DT_REGION_OFFSET(x) offsetof(struct vfio_device_dt_bar_info_region, x)
+ u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+ *pt = false;
+ switch (pos) {
+ case DT_REGION_OFFSET(dt_fd):
+ if (iswrite) {
+ u32 dt_fd;
+ struct eventfd_ctx *trigger;
+
+ if (copy_from_user(&dt_fd, buf,
+ sizeof(dt_fd)))
+ return -EFAULT;
+
+ trigger = eventfd_ctx_fdget(dt_fd);
+ pr_info("igd_dt_rw, dt trigger fd %d\n",
+ dt_fd);
+ if (IS_ERR(trigger)) {
+ pr_err("igd_dt_rw, dt trigger fd set error\n");
+ return -EINVAL;
+ }
+ igd_device->dt_trigger = trigger;
+ return sizeof(dt_fd);
+ } else
+ return -EFAULT;
+ case DT_REGION_OFFSET(trap):
+ if (iswrite)
+ return -EFAULT;
+ else
+ return copy_to_user(buf,
+ &igd_device->is_highend_trapped,
+ sizeof(u32)) ?
+ -EFAULT : count;
+ break;
+ default:
+ return -EFAULT;
+ }
}

static ssize_t igd_dt_rw(int handle, char __user *buf,
size_t count, loff_t *ppos,
bool iswrite, bool *pt)
{
+ unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+ struct igd_dt_device *igd_device;
+ u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
*pt = true;

+ if (!is_handle_valid(handle))
+ return -EFAULT;
+
+ igd_device = igd_device_array[handle];
+
+ switch (index) {
+ case VFIO_PCI_BAR0_REGION_INDEX:
+ /*
+ * disable the passthrough subregion when an access
+ * to the low-end registers is trapped
+ */
+ if (pos < BAR0_DYNAMIC_TRAP_OFFSET &&
+ !igd_device->is_trap_triggered) {
+ pr_info("igd_dt bar 0 lowend rw trapped, trap highend\n");
+ igd_device->is_trap_triggered = true;
+ igd_dt_set_bar_mmap_enabled(igd_device, false);
+ }
+
+ /*
+ * re-enable the passthrough subregion when an access
+ * to the high-end registers is trapped
+ */
+ if (pos >= BAR0_DYNAMIC_TRAP_OFFSET &&
+ pos <= (BAR0_DYNAMIC_TRAP_OFFSET +
+ BAR0_DYNAMIC_TRAP_SIZE)) {
+ pr_info("igd_dt bar 0 higher end rw trapped, pt higher end\n");
+ igd_dt_set_bar_mmap_enabled(igd_device, true);
+ }
+
+ break;
+ case VFIO_PCI_BAR1_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+ case VFIO_PCI_CONFIG_REGION_INDEX:
+ case VFIO_PCI_ROM_REGION_INDEX:
+ case VFIO_PCI_VGA_REGION_INDEX:
+ break;
+ default:
+ if (index == igd_device->dt_region_index)
+ return igd_dt_dt_region_rw(igd_device, buf,
+ count, ppos, iswrite, pt);
+ }
+
return 0;
}

--
2.17.1

2019-12-05 06:35:46

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

Hi:

On 2019/12/5 11:24 AM, Yan Zhao wrote:
> For SRIOV devices, VFs are passthroughed into guest directly without host
> driver mediation. However, when VMs migrating with passthroughed VFs,
> dynamic host mediation is required to (1) get device states, (2) get
> dirty pages. Since device states as well as other critical information
> required for dirty page tracking for VFs are usually retrieved from PFs,
> it is handy to provide an extension in PF driver to centralizingly control
> VFs' migration.
>
> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> dynamically trap VFs' bars for dirty page tracking and


A silly question, what's the reason for doing this, is this a must for
dirty page tracking?


> (3) centralizing
> VF critical states retrieving and VF controls into one driver, we propose
> to introduce mediate ops on top of current vfio-pci device driver.
>
>
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>
> __________ register mediate ops| ___________ ___________ |
> | |<-----------------------| VF | | |
> | vfio-pci | | | mediate | | PF driver | |
> |__________|----------------------->| driver | |___________|
> | open(pdev) | ----------- | |
> | |
> | |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
> \|/ \|/
> ----------- ------------
> | VF | | PF |
> ----------- ------------
>
>
> VF mediate driver could be a standalone driver that does not bind to
> any devices (as in demo code in patches 5-6) or it could be a built-in
> extension of PF driver (as in patches 7-9) .
>
> Rather than directly bind to VF, VF mediate driver register a mediate
> ops into vfio-pci in driver init. vfio-pci maintains a list of such
> mediate ops.
> (Note that: VF mediate driver can register mediate ops into vfio-pci
> before vfio-pci binding to any devices. And VF mediate driver can
> support mediating multiple devices.)
>
> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
> device as a parameter.
> VF mediate driver should return success or failure depending on it
> supports the pdev or not.
> E.g. VF mediate driver would compare its supported VF devfn with the
> devfn of the passed-in pdev.
> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
> stop querying other mediate ops and bind the opening device with this
> mediate ops using the returned mediate handle.
>
> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
> VF will be intercepted into VF mediate driver as
> vfio_pci_mediate_ops->get_region_info(),
> vfio_pci_mediate_ops->rw,
> vfio_pci_mediate_ops->mmap, and get customized.
> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
> further return 'pt' to indicate whether vfio-pci should further
> passthrough data to hw.
>
> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
> with a mediate handle as parameter.
>
> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
> mediate driver be able to differentiate two opening VFs of the same device
> id and vendor id.
>
> When VF mediate driver exits, it unregisters its mediate ops from
> vfio-pci.
>
>
> In this patchset, we enable vfio-pci to provide 3 things:
> (1) calling mediate ops to allow vendor driver customizing default
> region info/rw/mmap of a region.
> (2) provide a migration region to support migration


What's the benefit of introducing a region? It looks to me that we don't
expect the region to be accessed directly from the guest. Could we simply
extend the device fd ioctls for doing such things?


> (3) provide a dynamic trap bar info region to allow vendor driver
> control trap/untrap of device pci bars
>
> This vfio-pci + mediate ops way differs from mdev way in that
> (1) medv way needs to create a 1:1 mdev device on top of one VF, device
> specific mdev parent driver is bound to VF directly.
> (2) vfio-pci + mediate ops way does not create mdev devices and VF
> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
>
> The reason why we don't choose the way of writing mdev parent driver is
> that
> (1) VFs are almost all the time directly passthroughed. Directly binding
> to vfio-pci can make most of the code shared/reused.


Can we split out the common parts from vfio-pci?


> If we write a
> vendor specific mdev parent driver, most of the code (like passthrough
> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
> actually a duplicated and tedious work.


The mediate ops look quite similar to what vfio-mdev did. And it looks
to me that we need to consider live migration for mdev as well. In that
case, do we still expect mediate ops through VFIO directly?


> (2) For features like dynamically trap/untrap pci bars, if they are in
> vfio-pci, they can be available to most people without repeated code
> copying and re-testing.
> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
> it runs into a real migration need. However, if vfio-pci is bound
> initially, they have no chance to do live migration when there's a need
> later.


We can teach the management layer to do this.

Thanks


>
> In this patchset,
> - patches 1-4 enable vfio-pci to call mediate ops registered by vendor
> driver to mediate/customize region info/rw/mmap.
>
> - patches 5-6 provide a standalone sample driver to register a mediate ops
> for Intel Graphics Devices. It does not bind to IGDs directly but decides
> what devices it supports via its pciidlist. It also demonstrates how to
> dynamic trap a device's PCI bars. (by adding more pciids in its
> pciidlist, this sample driver actually is not necessarily limited to
> support IGDs)
>
> - patch 7-9 provide a sample on i40e driver that supports Intel(R)
> Ethernet Controller XL710 Family of devices. It supports VF precopy live
> migration on Intel's 710 SRIOV. (but we commented out the real
> implementation of dirty page tracking and device state retrieving part
> to focus on demonstrating framework part. Will send out them in future
> versions)
>
> patch 7 registers/unregisters VF mediate ops when PF driver
> probes/removes. It specifies its supporting VFs via
> vfio_pci_mediate_ops->open(pdev)
>
> patch 8 reports device cap of VFIO_PCI_DEVICE_CAP_MIGRATION and
> provides a sample implementation of migration region.
> The QEMU part of vfio migration is based on v8
> https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
> We do not based on recent v9 because we think there are still opens in
> dirty page track part in that series.
>
> patch 9 reports device cap of VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
> provides an example on how to trap part of bar0 when migration starts
> and passthrough this part of bar0 again when migration fails.
>
> Yan Zhao (9):
> vfio/pci: introduce mediate ops to intercept vfio-pci ops
> vfio/pci: test existence before calling region->ops
> vfio/pci: register a default migration region
> vfio-pci: register default dynamic-trap-bar-info region
> samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
> sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
> i40e/vf_migration: register mediate_ops to vfio-pci
> i40e/vf_migration: mediate migration region
> i40e/vf_migration: support dynamic trap of bar0
>
> drivers/net/ethernet/intel/Kconfig | 2 +-
> drivers/net/ethernet/intel/i40e/Makefile | 3 +-
> drivers/net/ethernet/intel/i40e/i40e.h | 2 +
> drivers/net/ethernet/intel/i40e/i40e_main.c | 3 +
> .../ethernet/intel/i40e/i40e_vf_migration.c | 626 ++++++++++++++++++
> .../ethernet/intel/i40e/i40e_vf_migration.h | 78 +++
> drivers/vfio/pci/vfio_pci.c | 189 +++++-
> drivers/vfio/pci/vfio_pci_private.h | 2 +
> include/linux/vfio.h | 18 +
> include/uapi/linux/vfio.h | 160 +++++
> samples/Kconfig | 6 +
> samples/Makefile | 1 +
> samples/vfio-pci/Makefile | 2 +
> samples/vfio-pci/igd_dt.c | 367 ++++++++++
> 14 files changed, 1455 insertions(+), 4 deletions(-)
> create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
> create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
> create mode 100644 samples/vfio-pci/Makefile
> create mode 100644 samples/vfio-pci/igd_dt.c
>

2019-12-05 09:00:08

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
> Hi:
>
> On 2019/12/5 11:24 AM, Yan Zhao wrote:
> > For SRIOV devices, VFs are passthroughed into guest directly without host
> > driver mediation. However, when VMs migrating with passthroughed VFs,
> > dynamic host mediation is required to (1) get device states, (2) get
> > dirty pages. Since device states as well as other critical information
> > required for dirty page tracking for VFs are usually retrieved from PFs,
> > it is handy to provide an extension in PF driver to centralizingly control
> > VFs' migration.
> >
> > Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> > dynamically trap VFs' bars for dirty page tracking and
>
>
> A silly question, what's the reason for doing this, is this a must for dirty
> page tracking?
>
For performance considerations. VFs' bars should be passthrough at normal
times and only enter the trap state when needed.

>
> > (3) centralizing
> > VF critical states retrieving and VF controls into one driver, we propose
> > to introduce mediate ops on top of current vfio-pci device driver.
> >
> >
> > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> > __________ register mediate ops| ___________ ___________ |
> > | |<-----------------------| VF | | |
> > | vfio-pci | | | mediate | | PF driver | |
> > |__________|----------------------->| driver | |___________|
> > | open(pdev) | ----------- | |
> > | |
> > | |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
> > \|/ \|/
> > ----------- ------------
> > | VF | | PF |
> > ----------- ------------
> >
> >
> > VF mediate driver could be a standalone driver that does not bind to
> > any devices (as in demo code in patches 5-6) or it could be a built-in
> > extension of PF driver (as in patches 7-9) .
> >
> > Rather than directly bind to VF, VF mediate driver register a mediate
> > ops into vfio-pci in driver init. vfio-pci maintains a list of such
> > mediate ops.
> > (Note that: VF mediate driver can register mediate ops into vfio-pci
> > before vfio-pci binding to any devices. And VF mediate driver can
> > support mediating multiple devices.)
> >
> > When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
> > list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
> > device as a parameter.
> > VF mediate driver should return success or failure depending on it
> > supports the pdev or not.
> > E.g. VF mediate driver would compare its supported VF devfn with the
> > devfn of the passed-in pdev.
> > Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
> > stop querying other mediate ops and bind the opening device with this
> > mediate ops using the returned mediate handle.
> >
> > Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
> > VF will be intercepted into VF mediate driver as
> > vfio_pci_mediate_ops->get_region_info(),
> > vfio_pci_mediate_ops->rw,
> > vfio_pci_mediate_ops->mmap, and get customized.
> > For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
> > further return 'pt' to indicate whether vfio-pci should further
> > passthrough data to hw.
> >
> > when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
> > with a mediate handle as parameter.
> >
> > The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
> > mediate driver be able to differentiate two opening VFs of the same device
> > id and vendor id.
> >
> > When VF mediate driver exits, it unregisters its mediate ops from
> > vfio-pci.
> >
> >
> > In this patchset, we enable vfio-pci to provide 3 things:
> > (1) calling mediate ops to allow vendor driver customizing default
> > region info/rw/mmap of a region.
> > (2) provide a migration region to support migration
>
>
> What's the benefit of introducing a region? It looks to me we don't expect
> the region to be accessed directly from guest. Could we simply extend device
> fd ioctl for doing such things?
>
You may take a look at the mdev live migration discussions in
https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html

or the previous discussion at
https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
which has a kernel side implementation https://patchwork.freedesktop.org/series/56876/

Generally speaking, the QEMU part of live migration is consistent for
the vfio-pci + mediate ops way and the mdev way. The region is only a
channel for QEMU and the kernel to communicate information without
introducing new IOCTLs.
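
For illustration (this is not code from either series): userspace can locate
such a vendor region with nothing beyond the existing
VFIO_DEVICE_GET_REGION_INFO capability chain, roughly as sketched below;
from then on, the channel is plain pread()/pwrite() at the region's offset.

/* Illustrative sketch: find a region by vendor type/subtype via the
 * existing region-info capability chain; no new ioctl is required.
 */
#include <linux/vfio.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

static struct vfio_region_info *query_region(int device_fd, uint32_t index)
{
	uint32_t argsz = 4096;	/* large enough for the capability chain */
	struct vfio_region_info *info = calloc(1, argsz);

	if (!info)
		return NULL;
	info->argsz = argsz;
	info->index = index;
	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, info)) {
		free(info);
		return NULL;
	}
	return info;
}

static bool region_is(struct vfio_region_info *info,
		      uint32_t type, uint32_t subtype)
{
	uint32_t off;

	if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS))
		return false;

	/* walk the capability chain looking for a type capability */
	for (off = info->cap_offset; off; ) {
		struct vfio_info_cap_header *hdr =
			(struct vfio_info_cap_header *)((char *)info + off);

		if (hdr->id == VFIO_REGION_INFO_CAP_TYPE) {
			struct vfio_region_info_cap_type *ct = (void *)hdr;

			return ct->type == type && ct->subtype == subtype;
		}
		off = hdr->next;
	}
	return false;
}

A VMM would loop query_region() over the device's region indexes and match
the migration (or dynamic-trap) type/subtype with region_is().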


>
> > (3) provide a dynamic trap bar info region to allow vendor driver
> > control trap/untrap of device pci bars
> >
> > This vfio-pci + mediate ops way differs from mdev way in that
> > (1) medv way needs to create a 1:1 mdev device on top of one VF, device
> > specific mdev parent driver is bound to VF directly.
> > (2) vfio-pci + mediate ops way does not create mdev devices and VF
> > mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
> >
> > The reason why we don't choose the way of writing mdev parent driver is
> > that
> > (1) VFs are almost all the time directly passthroughed. Directly binding
> > to vfio-pci can make most of the code shared/reused.
>
>
> Can we split out the common parts from vfio-pci?
>
That's very attractive, but one cannot implement a vfio-pci variant without
exporting everything in it as the common part :)
>
> > If we write a
> > vendor specific mdev parent driver, most of the code (like passthrough
> > style of rw/mmap) still needs to be copied from vfio-pci driver, which is
> > actually a duplicated and tedious work.
>
>
> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
> me we need to consider live migration for mdev as well. In that case, do we
> still expect mediate ops through VFIO directly?
>
>
> > (2) For features like dynamically trap/untrap pci bars, if they are in
> > vfio-pci, they can be available to most people without repeated code
> > copying and re-testing.
> > (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
> > have to decide whether to bind VFs to vfio-pci or mdev parent driver before
> > it runs into a real migration need. However, if vfio-pci is bound
> > initially, they have no chance to do live migration when there's a need
> > later.
>
>
> We can teach management layer to do this.
>
No, that's not possible, as vfio-pci by default has no migration region, and
dirty page tracking needs the vendor's mediation, at least for most
passthrough devices today.

Thanks
Yan

> Thanks
>
>
> >
> > In this patchset,
> > - patches 1-4 enable vfio-pci to call mediate ops registered by vendor
> > driver to mediate/customize region info/rw/mmap.
> >
> > - patches 5-6 provide a standalone sample driver to register a mediate ops
> > for Intel Graphics Devices. It does not bind to IGDs directly but decides
> > what devices it supports via its pciidlist. It also demonstrates how to
> > dynamic trap a device's PCI bars. (by adding more pciids in its
> > pciidlist, this sample driver actually is not necessarily limited to
> > support IGDs)
> >
> > - patch 7-9 provide a sample on i40e driver that supports Intel(R)
> > Ethernet Controller XL710 Family of devices. It supports VF precopy live
> > migration on Intel's 710 SRIOV. (but we commented out the real
> > implementation of dirty page tracking and device state retrieving part
> > to focus on demonstrating framework part. Will send out them in future
> > versions)
> > patch 7 registers/unregisters VF mediate ops when PF driver
> > probes/removes. It specifies its supporting VFs via
> > vfio_pci_mediate_ops->open(pdev)
> >
> > patch 8 reports device cap of VFIO_PCI_DEVICE_CAP_MIGRATION and
> > provides a sample implementation of migration region.
> > The QEMU part of vfio migration is based on v8
> > https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
> > We do not based on recent v9 because we think there are still opens in
> > dirty page track part in that series.
> >
> > patch 9 reports device cap of VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
> > provides an example on how to trap part of bar0 when migration starts
> > and passthrough this part of bar0 again when migration fails.
> >
> > Yan Zhao (9):
> > vfio/pci: introduce mediate ops to intercept vfio-pci ops
> > vfio/pci: test existence before calling region->ops
> > vfio/pci: register a default migration region
> > vfio-pci: register default dynamic-trap-bar-info region
> > samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
> > sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
> > i40e/vf_migration: register mediate_ops to vfio-pci
> > i40e/vf_migration: mediate migration region
> > i40e/vf_migration: support dynamic trap of bar0
> >
> > drivers/net/ethernet/intel/Kconfig | 2 +-
> > drivers/net/ethernet/intel/i40e/Makefile | 3 +-
> > drivers/net/ethernet/intel/i40e/i40e.h | 2 +
> > drivers/net/ethernet/intel/i40e/i40e_main.c | 3 +
> > .../ethernet/intel/i40e/i40e_vf_migration.c | 626 ++++++++++++++++++
> > .../ethernet/intel/i40e/i40e_vf_migration.h | 78 +++
> > drivers/vfio/pci/vfio_pci.c | 189 +++++-
> > drivers/vfio/pci/vfio_pci_private.h | 2 +
> > include/linux/vfio.h | 18 +
> > include/uapi/linux/vfio.h | 160 +++++
> > samples/Kconfig | 6 +
> > samples/Makefile | 1 +
> > samples/vfio-pci/Makefile | 2 +
> > samples/vfio-pci/igd_dt.c | 367 ++++++++++
> > 14 files changed, 1455 insertions(+), 4 deletions(-)
> > create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
> > create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
> > create mode 100644 samples/vfio-pci/Makefile
> > create mode 100644 samples/vfio-pci/igd_dt.c
> >
>

2019-12-05 13:09:18

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] Introduce mediate ops in vfio-pci


On 2019/12/5 4:51 PM, Yan Zhao wrote:
> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
>> Hi:
>>
>> On 2019/12/5 11:24 AM, Yan Zhao wrote:
>>> For SRIOV devices, VFs are passthroughed into guest directly without host
>>> driver mediation. However, when VMs migrating with passthroughed VFs,
>>> dynamic host mediation is required to (1) get device states, (2) get
>>> dirty pages. Since device states as well as other critical information
>>> required for dirty page tracking for VFs are usually retrieved from PFs,
>>> it is handy to provide an extension in PF driver to centralizingly control
>>> VFs' migration.
>>>
>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
>>> dynamically trap VFs' bars for dirty page tracking and
>>
>> A silly question, what's the reason for doing this, is this a must for dirty
>> page tracking?
>>
> For performance consideration. VFs' bars should be passthoughed at
> normal time and only enter into trap state on need.


Right, but how does this matter for the case of dirty page tracking?


>
>>> (3) centralizing
>>> VF critical states retrieving and VF controls into one driver, we propose
>>> to introduce mediate ops on top of current vfio-pci device driver.
>>>
>>>
>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>> __________ register mediate ops| ___________ ___________ |
>>> | |<-----------------------| VF | | |
>>> | vfio-pci | | | mediate | | PF driver | |
>>> |__________|----------------------->| driver | |___________|
>>> | open(pdev) | ----------- | |
>>> | |
>>> | |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
>>> \|/ \|/
>>> ----------- ------------
>>> | VF | | PF |
>>> ----------- ------------
>>>
>>>
>>> VF mediate driver could be a standalone driver that does not bind to
>>> any devices (as in demo code in patches 5-6) or it could be a built-in
>>> extension of PF driver (as in patches 7-9) .
>>>
>>> Rather than directly bind to VF, VF mediate driver register a mediate
>>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
>>> mediate ops.
>>> (Note that: VF mediate driver can register mediate ops into vfio-pci
>>> before vfio-pci binding to any devices. And VF mediate driver can
>>> support mediating multiple devices.)
>>>
>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
>>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
>>> device as a parameter.
>>> VF mediate driver should return success or failure depending on it
>>> supports the pdev or not.
>>> E.g. VF mediate driver would compare its supported VF devfn with the
>>> devfn of the passed-in pdev.
>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
>>> stop querying other mediate ops and bind the opening device with this
>>> mediate ops using the returned mediate handle.
>>>
>>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
>>> VF will be intercepted into VF mediate driver as
>>> vfio_pci_mediate_ops->get_region_info(),
>>> vfio_pci_mediate_ops->rw,
>>> vfio_pci_mediate_ops->mmap, and get customized.
>>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
>>> further return 'pt' to indicate whether vfio-pci should further
>>> passthrough data to hw.
>>>
>>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
>>> with a mediate handle as parameter.
>>>
>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
>>> mediate driver be able to differentiate two opening VFs of the same device
>>> id and vendor id.
>>>
>>> When VF mediate driver exits, it unregisters its mediate ops from
>>> vfio-pci.
>>>
>>>
>>> In this patchset, we enable vfio-pci to provide 3 things:
>>> (1) calling mediate ops to allow vendor driver customizing default
>>> region info/rw/mmap of a region.
>>> (2) provide a migration region to support migration
>>
>> What's the benefit of introducing a region? It looks to me we don't expect
>> the region to be accessed directly from guest. Could we simply extend device
>> fd ioctl for doing such things?
>>
> You may take a look on mdev live migration discussions in
> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>
> or previous discussion at
> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
> which has kernel side implemetation https://patchwork.freedesktop.org/series/56876/
>
> generaly speaking, qemu part of live migration is consistent for
> vfio-pci + mediate ops way or mdev way.


So in mdev, do you still have a mediate driver? Or do you expect the parent
to implement the region?


> The region is only a channel for
> QEMU and kernel to communicate information without introducing IOCTLs.


Well, at least you introduce a new type of region in the uapi. So this does
not answer why a region is better than an ioctl. If the region will only be
used by QEMU, using an ioctl is much easier and more straightforward.


>
>
>>> (3) provide a dynamic trap bar info region to allow vendor driver
>>> control trap/untrap of device pci bars
>>>
>>> This vfio-pci + mediate ops way differs from mdev way in that
>>> (1) medv way needs to create a 1:1 mdev device on top of one VF, device
>>> specific mdev parent driver is bound to VF directly.
>>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
>>>
>>> The reason why we don't choose the way of writing mdev parent driver is
>>> that
>>> (1) VFs are almost all the time directly passthroughed. Directly binding
>>> to vfio-pci can make most of the code shared/reused.
>>
>> Can we split out the common parts from vfio-pci?
>>
> That's very attractive. but one cannot implement a vfio-pci except
> export everything in it as common part :)


Well, I think it should not be hard to do that. E.g. you can route it
back like:

vfio -> vfio_mdev -> parent -> vfio_pci


>>> If we write a
>>> vendor specific mdev parent driver, most of the code (like passthrough
>>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
>>> actually a duplicated and tedious work.
>>
>> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
>> me we need to consider live migration for mdev as well. In that case, do we
>> still expect mediate ops through VFIO directly?
>>
>>
>>> (2) For features like dynamically trap/untrap pci bars, if they are in
>>> vfio-pci, they can be available to most people without repeated code
>>> copying and re-testing.
>>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
>>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
>>> it runs into a real migration need. However, if vfio-pci is bound
>>> initially, they have no chance to do live migration when there's a need
>>> later.
>>
>> We can teach management layer to do this.
>>
> No. not possible as vfio-pci by default has no migration region and
> dirty page tracking needs vendor's mediation at least for most
> passthrough devices now.


I'm not quite sure I follow here, but in this case, just teach them to use
the driver that has migration support?

Thanks


>
> Thanks
> Yn
>
>> Thanks
>>
>>
>>> In this patchset,
>>> - patches 1-4 enable vfio-pci to call mediate ops registered by vendor
>>> driver to mediate/customize region info/rw/mmap.
>>>
>>> - patches 5-6 provide a standalone sample driver to register a mediate ops
>>> for Intel Graphics Devices. It does not bind to IGDs directly but decides
>>> what devices it supports via its pciidlist. It also demonstrates how to
>>> dynamic trap a device's PCI bars. (by adding more pciids in its
>>> pciidlist, this sample driver actually is not necessarily limited to
>>> support IGDs)
>>>
>>> - patch 7-9 provide a sample on i40e driver that supports Intel(R)
>>> Ethernet Controller XL710 Family of devices. It supports VF precopy live
>>> migration on Intel's 710 SRIOV. (but we commented out the real
>>> implementation of dirty page tracking and device state retrieving part
>>> to focus on demonstrating framework part. Will send out them in future
>>> versions)
>>> patch 7 registers/unregisters VF mediate ops when PF driver
>>> probes/removes. It specifies its supporting VFs via
>>> vfio_pci_mediate_ops->open(pdev)
>>>
>>> patch 8 reports device cap of VFIO_PCI_DEVICE_CAP_MIGRATION and
>>> provides a sample implementation of migration region.
>>> The QEMU part of vfio migration is based on v8
>>> https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
>>> We do not based on recent v9 because we think there are still opens in
>>> dirty page track part in that series.
>>>
>>> patch 9 reports device cap of VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
>>> provides an example on how to trap part of bar0 when migration starts
>>> and passthrough this part of bar0 again when migration fails.
>>>
>>> Yan Zhao (9):
>>> vfio/pci: introduce mediate ops to intercept vfio-pci ops
>>> vfio/pci: test existence before calling region->ops
>>> vfio/pci: register a default migration region
>>> vfio-pci: register default dynamic-trap-bar-info region
>>> samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
>>> sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
>>> i40e/vf_migration: register mediate_ops to vfio-pci
>>> i40e/vf_migration: mediate migration region
>>> i40e/vf_migration: support dynamic trap of bar0
>>>
>>> drivers/net/ethernet/intel/Kconfig | 2 +-
>>> drivers/net/ethernet/intel/i40e/Makefile | 3 +-
>>> drivers/net/ethernet/intel/i40e/i40e.h | 2 +
>>> drivers/net/ethernet/intel/i40e/i40e_main.c | 3 +
>>> .../ethernet/intel/i40e/i40e_vf_migration.c | 626 ++++++++++++++++++
>>> .../ethernet/intel/i40e/i40e_vf_migration.h | 78 +++
>>> drivers/vfio/pci/vfio_pci.c | 189 +++++-
>>> drivers/vfio/pci/vfio_pci_private.h | 2 +
>>> include/linux/vfio.h | 18 +
>>> include/uapi/linux/vfio.h | 160 +++++
>>> samples/Kconfig | 6 +
>>> samples/Makefile | 1 +
>>> samples/vfio-pci/Makefile | 2 +
>>> samples/vfio-pci/igd_dt.c | 367 ++++++++++
>>> 14 files changed, 1455 insertions(+), 4 deletions(-)
>>> create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
>>> create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
>>> create mode 100644 samples/vfio-pci/Makefile
>>> create mode 100644 samples/vfio-pci/igd_dt.c
>>>

2019-12-05 23:56:23

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 3/9] vfio/pci: register a default migration region

On Wed, 4 Dec 2019 22:26:38 -0500
Yan Zhao <[email protected]> wrote:

> Vendor driver specifies when to support a migration region through cap
> VFIO_PCI_DEVICE_CAP_MIGRATION in vfio_pci_mediate_ops->open().
>
> If vfio-pci detects this cap, it creates a default migration region on
> behalf of vendor driver with region len=0 and region->ops=null.
> Vendor driver should override this region's len, flags, rw, mmap in
> its vfio_pci_mediate_ops.
>
> This migration region definition is aligned to QEMU vfio migration code v8:
> (https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html)
>
> Cc: Kevin Tian <[email protected]>
>
> Signed-off-by: Yan Zhao <[email protected]>
> ---
> drivers/vfio/pci/vfio_pci.c | 15 ++++
> include/linux/vfio.h | 1 +
> include/uapi/linux/vfio.h | 149 ++++++++++++++++++++++++++++++++++++
> 3 files changed, 165 insertions(+)
>
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index f3730252ee82..059660328be2 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -115,6 +115,18 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
> return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
> }
>
> +/**
> + * init a region to hold migration ctl & data
> + */
> +void init_migration_region(struct vfio_pci_device *vdev)
> +{
> + vfio_pci_register_dev_region(vdev, VFIO_REGION_TYPE_MIGRATION,
> + VFIO_REGION_SUBTYPE_MIGRATION,
> + NULL, 0,
> + VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
> + NULL);
> +}
> +
> static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> {
> struct resource *res;
> @@ -523,6 +535,9 @@ static int vfio_pci_open(void *device_data)
> vdev->mediate_ops = mentry->ops;
> vdev->mediate_handle = handle;
>
> + if (caps & VFIO_PCI_DEVICE_CAP_MIGRATION)
> + init_migration_region(vdev);

No. We're not going to add a cap flag for every region the mediation
driver wants to add. The mediation driver should have the ability to
add regions and irqs to the device itself. Thanks,

Alex

> +
> pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for %x:%x\n",
> vdev->mediate_ops->name, caps,
> handle, vdev->pdev->vendor,

2019-12-05 23:57:44

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

On Wed, 4 Dec 2019 22:26:50 -0500
Yan Zhao <[email protected]> wrote:

> Dynamic trap bar info region is a channel for QEMU and vendor driver to
> communicate dynamic trap info. It is of type
> VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
>
> This region has two fields: dt_fd and trap.
> When QEMU detects a device regions of this type, it will create an
> eventfd and write its eventfd id to dt_fd field.
> When vendor driver signals this eventfd, QEMU reads the trap field of this
> info region.
> - If trap is true, QEMU would search the device's PCI BAR
> regions and disable all the sparse mmaped subregions (if the sparse
> mmaped subregion is disablable).
> - If trap is false, QEMU would re-enable those subregions.
>
> A typical usage is
> 1. vendor driver first cuts its bar 0 into several sections, all in a
> sparse mmap array. So initially, all of its bar 0 is passthroughed.
> 2. vendor driver specifies part of bar 0 sections to be disablable.
> 3. when migration starts, vendor driver signals dt_fd and sets trap to true
> to notify QEMU disabling the bar 0 sections of disablable flags on.
> 4. QEMU disables those bar 0 sections and hence lets vendor driver be able
> to trap accesses to bar 0 registers and make dirty page tracking possible.
> 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> QEMU reads the trap field of this info region, which is false, and QEMU
> puts the whole bar 0 region back into passthrough mode.
>
> Vendor driver specifies whether it supports dynamic-trap-bar-info region
> through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> vfio_pci_mediate_ops->open().
>
> If vfio-pci detects this cap, it will create a default
> dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> and region->ops=null.
> Vendor driver should override this region's len, flags, rw, mmap in its
> vfio_pci_mediate_ops.

TBH, I don't like this interface at all. Userspace doesn't pass data
to the kernel via INFO ioctls. We have a SET_IRQS ioctl for
configuring user signaling with eventfds. I think we only need to
define an IRQ type that tells the user to re-evaluate the sparse mmap
information for a region. The user would enumerate the device IRQs via
GET_IRQ_INFO, find one of this type where the IRQ info would also
indicate which region(s) should be re-evaluated on signaling. The user
would enable that signaling via SET_IRQS and simply re-evaluate the
sparse mmap capability for the associated regions when signaled.
Thanks,

Alex
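
A rough userspace sketch of that flow, for illustration only: the IRQ index
used below and its "which regions to re-evaluate" semantics are hypothetical,
while VFIO_DEVICE_GET_IRQ_INFO / VFIO_DEVICE_SET_IRQS and their structures
are the existing vfio uapi.

/*
 * Illustrative sketch only: wire up a (hypothetical) "re-evaluate sparse
 * mmap" IRQ with an eventfd.  Only the ioctls and structures used here
 * exist today; the IRQ index semantics do not.
 */
#include <linux/vfio.h>
#include <stdint.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>

static int arm_remap_notifier(int device_fd, unsigned int remap_irq_index)
{
	struct vfio_irq_info info = { .argsz = sizeof(info) };
	struct {
		struct vfio_irq_set set;
		int32_t eventfd;
	} trigger;
	int efd;

	/* a real uapi would also report which region(s) this IRQ refers to */
	info.index = remap_irq_index;
	if (ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &info) || !info.count)
		return -1;

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;

	/*
	 * When this eventfd fires, userspace re-reads the region's sparse
	 * mmap capability and remaps/unmaps the affected areas accordingly.
	 */
	memset(&trigger, 0, sizeof(trigger));
	trigger.set.argsz = sizeof(trigger);
	trigger.set.flags = VFIO_IRQ_SET_DATA_EVENTFD |
			    VFIO_IRQ_SET_ACTION_TRIGGER;
	trigger.set.index = remap_irq_index;
	trigger.set.start = 0;
	trigger.set.count = 1;
	trigger.eventfd = efd;

	return ioctl(device_fd, VFIO_DEVICE_SET_IRQS, &trigger.set);
}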

>
> Cc: Kevin Tian <[email protected]>
>
> Signed-off-by: Yan Zhao <[email protected]>
> ---
> drivers/vfio/pci/vfio_pci.c | 16 ++++++++++++++++
> include/linux/vfio.h | 3 ++-
> include/uapi/linux/vfio.h | 11 +++++++++++
> 3 files changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 059660328be2..62b811ca43e4 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -127,6 +127,19 @@ void init_migration_region(struct vfio_pci_device *vdev)
> NULL);
> }
>
> +/**
> + * register a region to hold info for dynamically trap bar regions
> + */
> +void init_dynamic_trap_bar_info_region(struct vfio_pci_device *vdev)
> +{
> + vfio_pci_register_dev_region(vdev,
> + VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO,
> + VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO,
> + NULL, 0,
> + VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
> + NULL);
> +}
> +
> static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> {
> struct resource *res;
> @@ -538,6 +551,9 @@ static int vfio_pci_open(void *device_data)
> if (caps & VFIO_PCI_DEVICE_CAP_MIGRATION)
> init_migration_region(vdev);
>
> + if (caps & VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR)
> + init_dynamic_trap_bar_info_region(vdev);
> +
> pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for %x:%x\n",
> vdev->mediate_ops->name, caps,
> handle, vdev->pdev->vendor,
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index cddea8e9dcb2..cf8ecf687bee 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -197,7 +197,8 @@ extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
>
> struct vfio_pci_mediate_ops {
> char *name;
> -#define VFIO_PCI_DEVICE_CAP_MIGRATION (0x01)
> +#define VFIO_PCI_DEVICE_CAP_MIGRATION (0x01)
> +#define VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR (0x02)
> int (*open)(struct pci_dev *pdev, u64 *caps, u32 *handle);
> void (*release)(int handle);
> void (*get_region_info)(int handle,
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index caf8845a67a6..74a2d0b57741 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -258,6 +258,9 @@ struct vfio_region_info {
> struct vfio_region_sparse_mmap_area {
> __u64 offset; /* Offset of mmap'able area within region */
> __u64 size; /* Size of mmap'able area */
> + __u32 disablable; /* whether this mmap'able are able to
> + * be dynamically disabled
> + */
> };
>
> struct vfio_region_info_cap_sparse_mmap {
> @@ -454,6 +457,14 @@ struct vfio_device_migration_info {
> #define VFIO_DEVICE_DIRTY_PFNS_ALL (~0ULL)
> } __attribute__((packed));
>
> +/* Region type and sub-type to hold info to dynamically trap bars */
> +#define VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO (4)
> +#define VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO (1)
> +
> +struct vfio_device_dt_bar_info_region {
> + __u32 dt_fd; /* fd of eventfd to notify qemu trap/untrap bars*/
> + __u32 trap; /* trap/untrap bar regions */
> +};
>
> /* sub-types for VFIO_REGION_TYPE_PCI_* */
>

2019-12-05 23:58:11

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

On Wed, 4 Dec 2019 22:25:36 -0500
Yan Zhao <[email protected]> wrote:

> when vfio-pci is bound to a physical device, almost all the hardware
> resources are passthroughed.
> Sometimes, vendor driver of this physical device may want to mediate some
> hardware resource access for a short period of time, e.g. dirty page
> tracking during live migration.
>
> Here we introduce mediate ops in vfio-pci for this purpose.
>
> Vendor driver can register a mediate ops to vfio-pci.
> But rather than directly bind to the passthroughed device, the
> vendor driver is now either a module that does not bind to any device or
> a module binds to other device.
> E.g. when passing through a VF device that is bound to vfio-pci modules,
> PF driver that binds to PF device can register to vfio-pci to mediate
> VF's regions, hence supporting VF live migration.
>
> The sequence goes like this:
> 1. Vendor driver register its vfio_pci_mediate_ops to vfio-pci driver
>
> 2. vfio-pci maintains a list of those registered vfio_pci_mediate_ops
>
> 3. Whenever vfio-pci opens a device, it searches the list and call
> vfio_pci_mediate_ops->open() to check whether a vendor driver supports
> mediating this device.
> Upon a success return value of from vfio_pci_mediate_ops->open(),
> vfio-pci will stop list searching and store a mediate handle to
> represent this open into vendor driver.
> (so if multiple vendor drivers support mediating a device through
> vfio_pci_mediate_ops, only one will win, depending on their registering
> sequence)
>
> 4. Whenever a VFIO_DEVICE_GET_REGION_INFO ioctl is received in vfio-pci
> ops, it will chain into vfio_pci_mediate_ops->get_region_info(), so that
> vendor driver is able to override a region's default flags and caps,
> e.g. adding a sparse mmap cap to passthrough only sub-regions of a whole
> region.
>
> 5. vfio_pci_rw()/vfio_pci_mmap() first calls into
> vfio_pci_mediate_ops->rw()/vfio_pci_mediate_ops->mmaps().
> if pt=true is returned, vfio_pci_rw()/vfio_pci_mmap() will further
> passthrough this read/write/mmap to physical device, otherwise it just
> returns without touching the physical device.
>
> 6. When vfio-pci closes a device, vfio_pci_release() chains into
> vfio_pci_mediate_ops->release() to close the reference in vendor driver.
>
> 7. Vendor driver unregister its vfio_pci_mediate_ops when driver exits
>
> Cc: Kevin Tian <[email protected]>
>
> Signed-off-by: Yan Zhao <[email protected]>
> ---
> drivers/vfio/pci/vfio_pci.c | 146 ++++++++++++++++++++++++++++
> drivers/vfio/pci/vfio_pci_private.h | 2 +
> include/linux/vfio.h | 16 +++
> 3 files changed, 164 insertions(+)
>
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 02206162eaa9..55080ff29495 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -54,6 +54,14 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> MODULE_PARM_DESC(disable_idle_d3,
> "Disable using the PCI D3 low power state for idle, unused devices");
>
> +static LIST_HEAD(mediate_ops_list);
> +static DEFINE_MUTEX(mediate_ops_list_lock);
> +struct vfio_pci_mediate_ops_list_entry {
> + struct vfio_pci_mediate_ops *ops;
> + int refcnt;
> + struct list_head next;
> +};
> +
> static inline bool vfio_vga_disabled(void)
> {
> #ifdef CONFIG_VFIO_PCI_VGA
> @@ -472,6 +480,10 @@ static void vfio_pci_release(void *device_data)
> if (!(--vdev->refcnt)) {
> vfio_spapr_pci_eeh_release(vdev->pdev);
> vfio_pci_disable(vdev);
> + if (vdev->mediate_ops && vdev->mediate_ops->release) {
> + vdev->mediate_ops->release(vdev->mediate_handle);
> + vdev->mediate_ops = NULL;
> + }
> }
>
> mutex_unlock(&vdev->reflck->lock);
> @@ -483,6 +495,7 @@ static int vfio_pci_open(void *device_data)
> {
> struct vfio_pci_device *vdev = device_data;
> int ret = 0;
> + struct vfio_pci_mediate_ops_list_entry *mentry;
>
> if (!try_module_get(THIS_MODULE))
> return -ENODEV;
> @@ -495,6 +508,30 @@ static int vfio_pci_open(void *device_data)
> goto error;
>
> vfio_spapr_pci_eeh_open(vdev->pdev);
> + mutex_lock(&mediate_ops_list_lock);
> + list_for_each_entry(mentry, &mediate_ops_list, next) {
> + u64 caps;
> + u32 handle;

Wouldn't it seem likely that the ops provider might use this handle as
a pointer, so we'd want it to be an opaque void*?

> +
> + memset(&caps, 0, sizeof(caps));

@caps has no purpose here, add it if/when we do something with it.
It's also a standard type, why are we memset'ing it rather than just
=0??

> + ret = mentry->ops->open(vdev->pdev, &caps, &handle);
> + if (!ret) {
> + vdev->mediate_ops = mentry->ops;
> + vdev->mediate_handle = handle;
> +
> + pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for %x:%x\n",
> + vdev->mediate_ops->name, caps,
> + handle, vdev->pdev->vendor,
> + vdev->pdev->device);

Generally not advisable to make user accessible printks.

> + /*
> + * only find the first matching mediate_ops,
> + * and add its refcnt
> + */
> + mentry->refcnt++;
> + break;
> + }
> + }
> + mutex_unlock(&mediate_ops_list_lock);
> }
> vdev->refcnt++;
> error:
> @@ -736,6 +773,14 @@ static long vfio_pci_ioctl(void *device_data,
> info.size = pdev->cfg_size;
> info.flags = VFIO_REGION_INFO_FLAG_READ |
> VFIO_REGION_INFO_FLAG_WRITE;
> +
> + if (vdev->mediate_ops &&
> + vdev->mediate_ops->get_region_info) {
> + vdev->mediate_ops->get_region_info(
> + vdev->mediate_handle,
> + &info, &caps, NULL);
> + }

These would be a lot cleaner if we could just call a helper function:

void vfio_pci_region_info_mediation_hook(vdev, info, caps, etc...)
{
if (vdev->mediate_ops &&
    vdev->mediate_ops->get_region_info)
vdev->mediate_ops->get_region_info(vdev->mediate_handle,
&info, &caps, NULL);
}

I'm not thrilled with all these hooks, but not open coding every one of
them might help.

> +
> break;
> case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> @@ -756,6 +801,13 @@ static long vfio_pci_ioctl(void *device_data,
> }
> }
>
> + if (vdev->mediate_ops &&
> + vdev->mediate_ops->get_region_info) {
> + vdev->mediate_ops->get_region_info(
> + vdev->mediate_handle,
> + &info, &caps, NULL);
> + }
> +
> break;
> case VFIO_PCI_ROM_REGION_INDEX:
> {
> @@ -794,6 +846,14 @@ static long vfio_pci_ioctl(void *device_data,
> }
>
> pci_write_config_word(pdev, PCI_COMMAND, orig_cmd);
> +
> + if (vdev->mediate_ops &&
> + vdev->mediate_ops->get_region_info) {
> + vdev->mediate_ops->get_region_info(
> + vdev->mediate_handle,
> + &info, &caps, NULL);
> + }
> +
> break;
> }
> case VFIO_PCI_VGA_REGION_INDEX:
> @@ -805,6 +865,13 @@ static long vfio_pci_ioctl(void *device_data,
> info.flags = VFIO_REGION_INFO_FLAG_READ |
> VFIO_REGION_INFO_FLAG_WRITE;
>
> + if (vdev->mediate_ops &&
> + vdev->mediate_ops->get_region_info) {
> + vdev->mediate_ops->get_region_info(
> + vdev->mediate_handle,
> + &info, &caps, NULL);
> + }
> +
> break;
> default:
> {
> @@ -839,6 +906,13 @@ static long vfio_pci_ioctl(void *device_data,
> if (ret)
> return ret;
> }
> +
> + if (vdev->mediate_ops &&
> + vdev->mediate_ops->get_region_info) {
> + vdev->mediate_ops->get_region_info(
> + vdev->mediate_handle,
> + &info, &caps, &cap_type);
> + }
> }
> }
>
> @@ -1151,6 +1225,16 @@ static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
> if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
> return -EINVAL;
>
> + if (vdev->mediate_ops && vdev->mediate_ops->rw) {
> + int ret;
> + bool pt = true;
> +
> + ret = vdev->mediate_ops->rw(vdev->mediate_handle,
> + buf, count, ppos, iswrite, &pt);
> + if (!pt)
> + return ret;
> + }
> +
> switch (index) {
> case VFIO_PCI_CONFIG_REGION_INDEX:
> return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> @@ -1200,6 +1284,15 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> u64 phys_len, req_len, pgoff, req_start;
> int ret;
>
> + if (vdev->mediate_ops && vdev->mediate_ops->mmap) {
> + int ret;
> + bool pt = true;
> +
> + ret = vdev->mediate_ops->mmap(vdev->mediate_handle, vma, &pt);
> + if (!pt)
> + return ret;
> + }

There must be a better way to do all these. Do we really want to call
into ops for every rw or mmap, have the vendor code decode a region,
and maybe or maybe not have it handle it? It's pretty ugly. Do we
need the mediation provider to be able to dynamically setup the ops per
region and export the default handlers out for them to call?

> +
> index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
>
> if (vma->vm_end < vma->vm_start)
> @@ -1629,8 +1722,17 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
>
> static void __exit vfio_pci_cleanup(void)
> {
> + struct vfio_pci_mediate_ops_list_entry *mentry, *n;
> +
> pci_unregister_driver(&vfio_pci_driver);
> vfio_pci_uninit_perm_bits();
> +
> + mutex_lock(&mediate_ops_list_lock);
> + list_for_each_entry_safe(mentry, n, &mediate_ops_list, next) {
> + list_del(&mentry->next);
> + kfree(mentry);
> + }
> + mutex_unlock(&mediate_ops_list_lock);

Is it even possible to unload vfio-pci while there are mediation
drivers registered? I don't think the module interactions are well
thought out here, ex. do you really want i40e to have build and runtime
dependencies on vfio-pci? I don't think so.

> }
>
> static void __init vfio_pci_fill_ids(void)
> @@ -1697,6 +1799,50 @@ static int __init vfio_pci_init(void)
> return ret;
> }
>
> +int vfio_pci_register_mediate_ops(struct vfio_pci_mediate_ops *ops)
> +{
> + struct vfio_pci_mediate_ops_list_entry *mentry;
> +
> + mutex_lock(&mediate_ops_list_lock);
> + mentry = kzalloc(sizeof(*mentry), GFP_KERNEL);
> + if (!mentry) {
> + mutex_unlock(&mediate_ops_list_lock);
> + return -ENOMEM;
> + }
> +
> + mentry->ops = ops;
> + mentry->refcnt = 0;

It's kZalloc'd, this is unnecessary.

> + list_add(&mentry->next, &mediate_ops_list);

Check for duplicates?

> +
> + pr_info("registered dm ops %s\n", ops->name);
> + mutex_unlock(&mediate_ops_list_lock);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(vfio_pci_register_mediate_ops);
> +
> +void vfio_pci_unregister_mediate_ops(struct vfio_pci_mediate_ops *ops)
> +{
> + struct vfio_pci_mediate_ops_list_entry *mentry, *n;
> +
> + mutex_lock(&mediate_ops_list_lock);
> + list_for_each_entry_safe(mentry, n, &mediate_ops_list, next) {
> + if (mentry->ops != ops)
> + continue;
> +
> + mentry->refcnt--;

Whose reference is this removing?

> + if (!mentry->refcnt) {
> + list_del(&mentry->next);
> + kfree(mentry);
> + } else
> + pr_err("vfio_pci unregister mediate ops %s error\n",
> + mentry->ops->name);

This is bad, we should hold a reference to the module providing these
ops for each use of it such that the module cannot be removed while
it's in use. Otherwise we enter a very bad state here, and it's
trivially reachable by an admin removing the module while it's in use.
Thanks,

Alex

> + }
> + mutex_unlock(&mediate_ops_list_lock);
> +
> +}
> +EXPORT_SYMBOL(vfio_pci_unregister_mediate_ops);
> +
> module_init(vfio_pci_init);
> module_exit(vfio_pci_cleanup);
>
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index ee6ee91718a4..bad4a254360e 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -122,6 +122,8 @@ struct vfio_pci_device {
> struct list_head dummy_resources_list;
> struct mutex ioeventfds_lock;
> struct list_head ioeventfds_list;
> + struct vfio_pci_mediate_ops *mediate_ops;
> + u32 mediate_handle;
> };
>
> #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index e42a711a2800..0265e779acd1 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -195,4 +195,20 @@ extern int vfio_virqfd_enable(void *opaque,
> void *data, struct virqfd **pvirqfd, int fd);
> extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
>
> +struct vfio_pci_mediate_ops {
> + char *name;
> + int (*open)(struct pci_dev *pdev, u64 *caps, u32 *handle);
> + void (*release)(int handle);
> + void (*get_region_info)(int handle,
> + struct vfio_region_info *info,
> + struct vfio_info_cap *caps,
> + struct vfio_region_info_cap_type *cap_type);
> + ssize_t (*rw)(int handle, char __user *buf,
> + size_t count, loff_t *ppos, bool iswrite, bool *pt);
> + int (*mmap)(int handle, struct vm_area_struct *vma, bool *pt);
> +
> +};
> +extern int vfio_pci_register_mediate_ops(struct vfio_pci_mediate_ops *ops);
> +extern void vfio_pci_unregister_mediate_ops(struct vfio_pci_mediate_ops *ops);
> +
> #endif /* VFIO_H */

2019-12-06 06:00:28

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 3/9] vfio/pci: register a default migration region

On Fri, Dec 06, 2019 at 07:55:15AM +0800, Alex Williamson wrote:
> On Wed, 4 Dec 2019 22:26:38 -0500
> Yan Zhao <[email protected]> wrote:
>
> > Vendor driver specifies when to support a migration region through cap
> > VFIO_PCI_DEVICE_CAP_MIGRATION in vfio_pci_mediate_ops->open().
> >
> > If vfio-pci detects this cap, it creates a default migration region on
> > behalf of vendor driver with region len=0 and region->ops=null.
> > Vendor driver should override this region's len, flags, rw, mmap in
> > its vfio_pci_mediate_ops.
> >
> > This migration region definition is aligned to QEMU vfio migration code v8:
> > (https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html)
> >
> > Cc: Kevin Tian <[email protected]>
> >
> > Signed-off-by: Yan Zhao <[email protected]>
> > ---
> > drivers/vfio/pci/vfio_pci.c | 15 ++++
> > include/linux/vfio.h | 1 +
> > include/uapi/linux/vfio.h | 149 ++++++++++++++++++++++++++++++++++++
> > 3 files changed, 165 insertions(+)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index f3730252ee82..059660328be2 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -115,6 +115,18 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
> > return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
> > }
> >
> > +/**
> > + * init a region to hold migration ctl & data
> > + */
> > +void init_migration_region(struct vfio_pci_device *vdev)
> > +{
> > + vfio_pci_register_dev_region(vdev, VFIO_REGION_TYPE_MIGRATION,
> > + VFIO_REGION_SUBTYPE_MIGRATION,
> > + NULL, 0,
> > + VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
> > + NULL);
> > +}
> > +
> > static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> > {
> > struct resource *res;
> > @@ -523,6 +535,9 @@ static int vfio_pci_open(void *device_data)
> > vdev->mediate_ops = mentry->ops;
> > vdev->mediate_handle = handle;
> >
> > + if (caps & VFIO_PCI_DEVICE_CAP_MIGRATION)
> > + init_migration_region(vdev);
>
> No. We're not going to add a cap flag for every region the mediation
> driver wants to add. The mediation driver should have the ability to
> add regions and irqs to the device itself. Thanks,
>
> Alex
>
ok. got it. will do it.

Thanks
Yan

> > +
> > pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for %x:%x\n",
> > vdev->mediate_ops->name, caps,
> > handle, vdev->pdev->vendor,
>

2019-12-06 06:13:34

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> On Wed, 4 Dec 2019 22:26:50 -0500
> Yan Zhao <[email protected]> wrote:
>
> > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > communicate dynamic trap info. It is of type
> > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> >
> > This region has two fields: dt_fd and trap.
> > When QEMU detects a device regions of this type, it will create an
> > eventfd and write its eventfd id to dt_fd field.
> > When vendor driver signals this eventfd, QEMU reads trap field of this
> > info region.
> > - If trap is true, QEMU would search the device's PCI BAR
> > regions and disable all the sparse mmaped subregions (if the sparse
> > mmaped subregion is disablable).
> > - If trap is false, QEMU would re-enable those subregions.
> >
> > A typical usage is
> > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > sparse mmap array. So initially, all of its bar 0 is passthroughed.
> > 2. vendor driver specifies part of bar 0 sections to be disablable.
> > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > to trap access of bar 0 registers and make dirty page tracking possible.
> > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > QEMU reads trap field of this info region which is false and QEMU
> > re-passthrough the whole bar 0 region.
> >
> > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > vfio_pci_mediate_ops->open().
> >
> > If vfio-pci detects this cap, it will create a default
> > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > and region->ops=null.
> > Vendor driver should override this region's len, flags, rw, mmap in its
> > vfio_pci_mediate_ops.
>
> TBH, I don't like this interface at all. Userspace doesn't pass data
> to the kernel via INFO ioctls. We have a SET_IRQS ioctl for
> configuring user signaling with eventfds. I think we only need to
> define an IRQ type that tells the user to re-evaluate the sparse mmap
> information for a region. The user would enumerate the device IRQs via
> GET_IRQ_INFO, find one of this type where the IRQ info would also
> indicate which region(s) should be re-evaluated on signaling. The user
> would enable that signaling via SET_IRQS and simply re-evaluate the
ok. I'll try to switch to this way. Thanks for this suggestion.

> sparse mmap capability for the associated regions when signaled.

Do you like the "disablable" flag of sparse mmap?
I think it's a lightweight way for the user to switch the mmap state of a whole
region; otherwise, going through a complete flow of GET_REGION_INFO and
re-setting up the region might be too heavy.

Thanks
Yan

> Thanks,
>
> Alex
>




> >
> > Cc: Kevin Tian <[email protected]>
> >
> > Signed-off-by: Yan Zhao <[email protected]>
> > ---
> > drivers/vfio/pci/vfio_pci.c | 16 ++++++++++++++++
> > include/linux/vfio.h | 3 ++-
> > include/uapi/linux/vfio.h | 11 +++++++++++
> > 3 files changed, 29 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 059660328be2..62b811ca43e4 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -127,6 +127,19 @@ void init_migration_region(struct vfio_pci_device *vdev)
> > NULL);
> > }
> >
> > +/**
> > + * register a region to hold info for dynamically trap bar regions
> > + */
> > +void init_dynamic_trap_bar_info_region(struct vfio_pci_device *vdev)
> > +{
> > + vfio_pci_register_dev_region(vdev,
> > + VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO,
> > + VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO,
> > + NULL, 0,
> > + VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
> > + NULL);
> > +}
> > +
> > static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> > {
> > struct resource *res;
> > @@ -538,6 +551,9 @@ static int vfio_pci_open(void *device_data)
> > if (caps & VFIO_PCI_DEVICE_CAP_MIGRATION)
> > init_migration_region(vdev);
> >
> > + if (caps & VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR)
> > + init_dynamic_trap_bar_info_region(vdev);
> > +
> > pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for %x:%x\n",
> > vdev->mediate_ops->name, caps,
> > handle, vdev->pdev->vendor,
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > index cddea8e9dcb2..cf8ecf687bee 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -197,7 +197,8 @@ extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
> >
> > struct vfio_pci_mediate_ops {
> > char *name;
> > -#define VFIO_PCI_DEVICE_CAP_MIGRATION (0x01)
> > +#define VFIO_PCI_DEVICE_CAP_MIGRATION (0x01)
> > +#define VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR (0x02)
> > int (*open)(struct pci_dev *pdev, u64 *caps, u32 *handle);
> > void (*release)(int handle);
> > void (*get_region_info)(int handle,
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index caf8845a67a6..74a2d0b57741 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -258,6 +258,9 @@ struct vfio_region_info {
> > struct vfio_region_sparse_mmap_area {
> > __u64 offset; /* Offset of mmap'able area within region */
> > __u64 size; /* Size of mmap'able area */
> > + __u32 disablable; /* whether this mmap'able are able to
> > + * be dynamically disabled
> > + */
> > };
> >
> > struct vfio_region_info_cap_sparse_mmap {
> > @@ -454,6 +457,14 @@ struct vfio_device_migration_info {
> > #define VFIO_DEVICE_DIRTY_PFNS_ALL (~0ULL)
> > } __attribute__((packed));
> >
> > +/* Region type and sub-type to hold info to dynamically trap bars */
> > +#define VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO (4)
> > +#define VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO (1)
> > +
> > +struct vfio_device_dt_bar_info_region {
> > + __u32 dt_fd; /* fd of eventfd to notify qemu trap/untrap bars*/
> > + __u32 trap; /* trap/untrap bar regions */
> > +};
> >
> > /* sub-types for VFIO_REGION_TYPE_PCI_* */
> >
>

2019-12-06 08:07:52

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

On Fri, Dec 06, 2019 at 07:55:19AM +0800, Alex Williamson wrote:
> On Wed, 4 Dec 2019 22:25:36 -0500
> Yan Zhao <[email protected]> wrote:
>
> > when vfio-pci is bound to a physical device, almost all the hardware
> > resources are passthroughed.
> > Sometimes, vendor driver of this physical device may want to mediate some
> > hardware resource access for a short period of time, e.g. dirty page
> > tracking during live migration.
> >
> > Here we introduce mediate ops in vfio-pci for this purpose.
> >
> > Vendor driver can register a mediate ops to vfio-pci.
> > But rather than directly bind to the passthroughed device, the
> > vendor driver is now either a module that does not bind to any device or
> > a module binds to other device.
> > E.g. when passing through a VF device that is bound to vfio-pci modules,
> > PF driver that binds to PF device can register to vfio-pci to mediate
> > VF's regions, hence supporting VF live migration.
> >
> > The sequence goes like this:
> > 1. Vendor driver register its vfio_pci_mediate_ops to vfio-pci driver
> >
> > 2. vfio-pci maintains a list of those registered vfio_pci_mediate_ops
> >
> > 3. Whenever vfio-pci opens a device, it searches the list and call
> > vfio_pci_mediate_ops->open() to check whether a vendor driver supports
> > mediating this device.
> > Upon a success return value of from vfio_pci_mediate_ops->open(),
> > vfio-pci will stop list searching and store a mediate handle to
> > represent this open into vendor driver.
> > (so if multiple vendor drivers support mediating a device through
> > vfio_pci_mediate_ops, only one will win, depending on their registering
> > sequence)
> >
> > 4. Whenever a VFIO_DEVICE_GET_REGION_INFO ioctl is received in vfio-pci
> > ops, it will chain into vfio_pci_mediate_ops->get_region_info(), so that
> > vendor driver is able to override a region's default flags and caps,
> > e.g. adding a sparse mmap cap to passthrough only sub-regions of a whole
> > region.
> >
> > 5. vfio_pci_rw()/vfio_pci_mmap() first calls into
> > vfio_pci_mediate_ops->rw()/vfio_pci_mediate_ops->mmaps().
> > if pt=true is returned, vfio_pci_rw()/vfio_pci_mmap() will further
> > passthrough this read/write/mmap to physical device, otherwise it just
> > returns without touching the physical device.
> >
> > 6. When vfio-pci closes a device, vfio_pci_release() chains into
> > vfio_pci_mediate_ops->release() to close the reference in vendor driver.
> >
> > 7. Vendor driver unregister its vfio_pci_mediate_ops when driver exits
> >
> > Cc: Kevin Tian <[email protected]>
> >
> > Signed-off-by: Yan Zhao <[email protected]>
> > ---
> > drivers/vfio/pci/vfio_pci.c | 146 ++++++++++++++++++++++++++++
> > drivers/vfio/pci/vfio_pci_private.h | 2 +
> > include/linux/vfio.h | 16 +++
> > 3 files changed, 164 insertions(+)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 02206162eaa9..55080ff29495 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -54,6 +54,14 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> > MODULE_PARM_DESC(disable_idle_d3,
> > "Disable using the PCI D3 low power state for idle, unused devices");
> >
> > +static LIST_HEAD(mediate_ops_list);
> > +static DEFINE_MUTEX(mediate_ops_list_lock);
> > +struct vfio_pci_mediate_ops_list_entry {
> > + struct vfio_pci_mediate_ops *ops;
> > + int refcnt;
> > + struct list_head next;
> > +};
> > +
> > static inline bool vfio_vga_disabled(void)
> > {
> > #ifdef CONFIG_VFIO_PCI_VGA
> > @@ -472,6 +480,10 @@ static void vfio_pci_release(void *device_data)
> > if (!(--vdev->refcnt)) {
> > vfio_spapr_pci_eeh_release(vdev->pdev);
> > vfio_pci_disable(vdev);
> > + if (vdev->mediate_ops && vdev->mediate_ops->release) {
> > + vdev->mediate_ops->release(vdev->mediate_handle);
> > + vdev->mediate_ops = NULL;
> > + }
> > }
> >
> > mutex_unlock(&vdev->reflck->lock);
> > @@ -483,6 +495,7 @@ static int vfio_pci_open(void *device_data)
> > {
> > struct vfio_pci_device *vdev = device_data;
> > int ret = 0;
> > + struct vfio_pci_mediate_ops_list_entry *mentry;
> >
> > if (!try_module_get(THIS_MODULE))
> > return -ENODEV;
> > @@ -495,6 +508,30 @@ static int vfio_pci_open(void *device_data)
> > goto error;
> >
> > vfio_spapr_pci_eeh_open(vdev->pdev);
> > + mutex_lock(&mediate_ops_list_lock);
> > + list_for_each_entry(mentry, &mediate_ops_list, next) {
> > + u64 caps;
> > + u32 handle;
>
> Wouldn't it seem likely that the ops provider might use this handle as
> a pointer, so we'd want it to be an opaque void*?
>
yes, you are right, handle as a pointer is much better. will change it.
Thanks :)
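For instance, the ops could then look like this with an opaque handle (just a
sketch of the adjusted prototypes, not the final interface):

struct vfio_pci_mediate_ops {
	char *name;
	/* handle becomes an opaque cookie owned by the mediate driver */
	int	(*open)(struct pci_dev *pdev, u64 *caps, void **handle);
	void	(*release)(void *handle);
	void	(*get_region_info)(void *handle,
				   struct vfio_region_info *info,
				   struct vfio_info_cap *caps,
				   struct vfio_region_info_cap_type *cap_type);
	ssize_t	(*rw)(void *handle, char __user *buf, size_t count,
		      loff_t *ppos, bool iswrite, bool *pt);
	int	(*mmap)(void *handle, struct vm_area_struct *vma, bool *pt);
};

vdev->mediate_handle would then become a void * as well.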

> > +
> > + memset(&caps, 0, sizeof(caps));
>
> @caps has no purpose here, add it if/when we do something with it.
> It's also a standard type, why are we memset'ing it rather than just
> =0??
>
> > + ret = mentry->ops->open(vdev->pdev, &caps, &handle);
> > + if (!ret) {
> > + vdev->mediate_ops = mentry->ops;
> > + vdev->mediate_handle = handle;
> > +
> > + pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for %x:%x\n",
> > + vdev->mediate_ops->name, caps,
> > + handle, vdev->pdev->vendor,
> > + vdev->pdev->device);
>
> Generally not advisable to make user accessible printks.
>
ok.

> > + /*
> > + * only find the first matching mediate_ops,
> > + * and add its refcnt
> > + */
> > + mentry->refcnt++;
> > + break;
> > + }
> > + }
> > + mutex_unlock(&mediate_ops_list_lock);
> > }
> > vdev->refcnt++;
> > error:
> > @@ -736,6 +773,14 @@ static long vfio_pci_ioctl(void *device_data,
> > info.size = pdev->cfg_size;
> > info.flags = VFIO_REGION_INFO_FLAG_READ |
> > VFIO_REGION_INFO_FLAG_WRITE;
> > +
> > + if (vdev->mediate_ops &&
> > + vdev->mediate_ops->get_region_info) {
> > + vdev->mediate_ops->get_region_info(
> > + vdev->mediate_handle,
> > + &info, &caps, NULL);
> > + }
>
> These would be a lot cleaner if we could just call a helper function:
>
> void vfio_pci_region_info_mediation_hook(vdev, info, caps, etc...)
> {
> if (vdev->mediate_ops &&
>     vdev->mediate_ops->get_region_info)
> vdev->mediate_ops->get_region_info(vdev->mediate_handle,
> &info, &caps, NULL);
> }
>
> I'm not thrilled with all these hooks, but not open coding every one of
> them might help.

ok. got it.
>
> > +
> > break;
> > case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> > info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> > @@ -756,6 +801,13 @@ static long vfio_pci_ioctl(void *device_data,
> > }
> > }
> >
> > + if (vdev->mediate_ops &&
> > + vdev->mediate_ops->get_region_info) {
> > + vdev->mediate_ops->get_region_info(
> > + vdev->mediate_handle,
> > + &info, &caps, NULL);
> > + }
> > +
> > break;
> > case VFIO_PCI_ROM_REGION_INDEX:
> > {
> > @@ -794,6 +846,14 @@ static long vfio_pci_ioctl(void *device_data,
> > }
> >
> > pci_write_config_word(pdev, PCI_COMMAND, orig_cmd);
> > +
> > + if (vdev->mediate_ops &&
> > + vdev->mediate_ops->get_region_info) {
> > + vdev->mediate_ops->get_region_info(
> > + vdev->mediate_handle,
> > + &info, &caps, NULL);
> > + }
> > +
> > break;
> > }
> > case VFIO_PCI_VGA_REGION_INDEX:
> > @@ -805,6 +865,13 @@ static long vfio_pci_ioctl(void *device_data,
> > info.flags = VFIO_REGION_INFO_FLAG_READ |
> > VFIO_REGION_INFO_FLAG_WRITE;
> >
> > + if (vdev->mediate_ops &&
> > + vdev->mediate_ops->get_region_info) {
> > + vdev->mediate_ops->get_region_info(
> > + vdev->mediate_handle,
> > + &info, &caps, NULL);
> > + }
> > +
> > break;
> > default:
> > {
> > @@ -839,6 +906,13 @@ static long vfio_pci_ioctl(void *device_data,
> > if (ret)
> > return ret;
> > }
> > +
> > + if (vdev->mediate_ops &&
> > + vdev->mediate_ops->get_region_info) {
> > + vdev->mediate_ops->get_region_info(
> > + vdev->mediate_handle,
> > + &info, &caps, &cap_type);
> > + }
> > }
> > }
> >
> > @@ -1151,6 +1225,16 @@ static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
> > if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
> > return -EINVAL;
> >
> > + if (vdev->mediate_ops && vdev->mediate_ops->rw) {
> > + int ret;
> > + bool pt = true;
> > +
> > + ret = vdev->mediate_ops->rw(vdev->mediate_handle,
> > + buf, count, ppos, iswrite, &pt);
> > + if (!pt)
> > + return ret;
> > + }
> > +
> > switch (index) {
> > case VFIO_PCI_CONFIG_REGION_INDEX:
> > return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> > @@ -1200,6 +1284,15 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> > u64 phys_len, req_len, pgoff, req_start;
> > int ret;
> >
> > + if (vdev->mediate_ops && vdev->mediate_ops->mmap) {
> > + int ret;
> > + bool pt = true;
> > +
> > + ret = vdev->mediate_ops->mmap(vdev->mediate_handle, vma, &pt);
> > + if (!pt)
> > + return ret;
> > + }
>
> There must be a better way to do all these. Do we really want to call
> into ops for every rw or mmap, have the vendor code decode a region,
> and maybe or maybe not have it handle it? It's pretty ugly. Do we

Do you think the flow below is good?
1. mediate_ops->open() returns:
(1) region[], indexed by region index: if a mediate driver supports mediating
region[i], then region[i].ops->get_region_info, region[i].ops->rw, or
region[i].ops->mmap is non-null.
(2) irq_info[], indexed by irq index: if a mediate driver supports mediating
irq_info[i], then irq_info[i].ops->get_irq_info or irq_info[i].ops->set_irq_info
is non-null.

Then, vfio_pci_rw()/vfio_pci_mmap()/vfio_pci_ioctl() only call into those
non-null hooks (a rough sketch is below).
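A rough sketch of what open() could hand back under that flow (all type and
field names here are made up for illustration):

struct vfio_pci_mediate_region_ops {
	void	(*get_region_info)(void *handle, struct vfio_region_info *info,
				   struct vfio_info_cap *caps,
				   struct vfio_region_info_cap_type *cap_type);
	ssize_t	(*rw)(void *handle, char __user *buf, size_t count,
		      loff_t *ppos, bool iswrite);
	int	(*mmap)(void *handle, struct vm_area_struct *vma);
};

struct vfio_pci_mediate_irq_ops {
	int	(*get_irq_info)(void *handle, struct vfio_irq_info *info);
	int	(*set_irq_info)(void *handle, struct vfio_irq_set *irqs);
};

/*
 * open() fills arrays indexed by region/irq index; a NULL slot (or a NULL
 * hook within a slot) means "use the vfio-pci default handling", so
 * vfio_pci_rw()/vfio_pci_mmap()/vfio_pci_ioctl() only dispatch to hooks
 * that actually exist.
 */
int (*open)(struct pci_dev *pdev,
	    struct vfio_pci_mediate_region_ops **region_ops, int *nregions,
	    struct vfio_pci_mediate_irq_ops **irq_ops, int *nirqs,
	    void **handle);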

> need the mediation provider to be able to dynamically setup the ops per
May I confirm that you are not suggesting registering mediate ops dynamically
after vfio-pci has already opened a device?

> region and export the default handlers out for them to call?
>
Could we still keep checking the return value of the hooks rather than
exporting default handlers? Otherwise at least vfio_pci_default_ioctl(),
vfio_pci_default_rw(), and vfio_pci_default_mmap() would need to be exported.

> > +
> > index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> >
> > if (vma->vm_end < vma->vm_start)
> > @@ -1629,8 +1722,17 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
> >
> > static void __exit vfio_pci_cleanup(void)
> > {
> > + struct vfio_pci_mediate_ops_list_entry *mentry, *n;
> > +
> > pci_unregister_driver(&vfio_pci_driver);
> > vfio_pci_uninit_perm_bits();
> > +
> > + mutex_lock(&mediate_ops_list_lock);
> > + list_for_each_entry_safe(mentry, n, &mediate_ops_list, next) {
> > + list_del(&mentry->next);
> > + kfree(mentry);
> > + }
> > + mutex_unlock(&mediate_ops_list_lock);
>
> Is it even possible to unload vfio-pci while there are mediation
> drivers registered? I don't think the module interactions are well
> thought out here, ex. do you really want i40e to have build and runtime
> dependencies on vfio-pci? I don't think so.
>
Currently, yes, i40e has a build dependency on vfio-pci.
It's like this: if i40e decides to support SRIOV and compiles in VF-related
code that depends on vfio-pci, it will also have a build dependency
on vfio-pci. Isn't that natural?

> > }
> >
> > static void __init vfio_pci_fill_ids(void)
> > @@ -1697,6 +1799,50 @@ static int __init vfio_pci_init(void)
> > return ret;
> > }
> >
> > +int vfio_pci_register_mediate_ops(struct vfio_pci_mediate_ops *ops)
> > +{
> > + struct vfio_pci_mediate_ops_list_entry *mentry;
> > +
> > + mutex_lock(&mediate_ops_list_lock);
> > + mentry = kzalloc(sizeof(*mentry), GFP_KERNEL);
> > + if (!mentry) {
> > + mutex_unlock(&mediate_ops_list_lock);
> > + return -ENOMEM;
> > + }
> > +
> > + mentry->ops = ops;
> > + mentry->refcnt = 0;
>
> It's kZalloc'd, this is unnecessary.
>
right :)
> > + list_add(&mentry->next, &mediate_ops_list);
>
> Check for duplicates?
>
ok. will do it.
> > +
> > + pr_info("registered dm ops %s\n", ops->name);
> > + mutex_unlock(&mediate_ops_list_lock);
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL(vfio_pci_register_mediate_ops);
> > +
> > +void vfio_pci_unregister_mediate_ops(struct vfio_pci_mediate_ops *ops)
> > +{
> > + struct vfio_pci_mediate_ops_list_entry *mentry, *n;
> > +
> > + mutex_lock(&mediate_ops_list_lock);
> > + list_for_each_entry_safe(mentry, n, &mediate_ops_list, next) {
> > + if (mentry->ops != ops)
> > + continue;
> > +
> > + mentry->refcnt--;
>
> Whose reference is this removing?
>
I intended to prevent the mediate driver from unregistering its mediate ops
while there are still open devices using it:
after a successful mediate_ops->open(), mentry->refcnt++;
after calling mediate_ops->release(), mentry->refcnt--.

(It seems that in this RFC I missed the mentry->refcnt-- after calling
mediate_ops->release(); see the sketch below.)
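E.g., vfio_pci_release() could drop that per-registration count when it calls
mediate_ops->release() (sketch only; the helper name is made up):

/* called from vfio_pci_release() right after mediate_ops->release() */
static void vfio_pci_mediate_put(struct vfio_pci_mediate_ops *ops)
{
	struct vfio_pci_mediate_ops_list_entry *mentry;

	mutex_lock(&mediate_ops_list_lock);
	list_for_each_entry(mentry, &mediate_ops_list, next) {
		if (mentry->ops == ops) {
			mentry->refcnt--;	/* pairs with ++ in open */
			break;
		}
	}
	mutex_unlock(&mediate_ops_list_lock);
}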


> > + if (!mentry->refcnt) {
> > + list_del(&mentry->next);
> > + kfree(mentry);
> > + } else
> > + pr_err("vfio_pci unregister mediate ops %s error\n",
> > + mentry->ops->name);
>
> This is bad, we should hold a reference to the module providing these
> ops for each use of it such that the module cannot be removed while
> it's in use. Otherwise we enter a very bad state here, and it's
> trivially reachable by an admin removing the module while it's in use.
The mediate driver is supposed to take a reference on its own module on a
successful mediate_ops->open(), and drop it in mediate_ops->release()
(see the sketch below), so it can't be accidentally removed.
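Something along these lines in the mediate driver (sketch; the i40e_vf_*
names and helpers are made up here):

static int i40e_vf_mediate_open(struct pci_dev *pdev, u64 *caps, u32 *handle)
{
	if (!i40e_vf_is_supported(pdev))	/* hypothetical devfn check */
		return -ENODEV;

	/* pin the mediate driver's module for the lifetime of this open */
	if (!try_module_get(THIS_MODULE))
		return -ENODEV;

	*handle = i40e_vf_alloc_handle(pdev);	/* hypothetical */
	return 0;
}

static void i40e_vf_mediate_release(int handle)
{
	i40e_vf_free_handle(handle);		/* hypothetical */
	module_put(THIS_MODULE);
}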

Thanks

Yan
> Thanks,
>
> Alex
>
> > + }
> > + mutex_unlock(&mediate_ops_list_lock);
> > +
> > +}
> > +EXPORT_SYMBOL(vfio_pci_unregister_mediate_ops);
> > +
> > module_init(vfio_pci_init);
> > module_exit(vfio_pci_cleanup);
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> > index ee6ee91718a4..bad4a254360e 100644
> > --- a/drivers/vfio/pci/vfio_pci_private.h
> > +++ b/drivers/vfio/pci/vfio_pci_private.h
> > @@ -122,6 +122,8 @@ struct vfio_pci_device {
> > struct list_head dummy_resources_list;
> > struct mutex ioeventfds_lock;
> > struct list_head ioeventfds_list;
> > + struct vfio_pci_mediate_ops *mediate_ops;
> > + u32 mediate_handle;
> > };
> >
> > #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > index e42a711a2800..0265e779acd1 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -195,4 +195,20 @@ extern int vfio_virqfd_enable(void *opaque,
> > void *data, struct virqfd **pvirqfd, int fd);
> > extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
> >
> > +struct vfio_pci_mediate_ops {
> > + char *name;
> > + int (*open)(struct pci_dev *pdev, u64 *caps, u32 *handle);
> > + void (*release)(int handle);
> > + void (*get_region_info)(int handle,
> > + struct vfio_region_info *info,
> > + struct vfio_info_cap *caps,
> > + struct vfio_region_info_cap_type *cap_type);
> > + ssize_t (*rw)(int handle, char __user *buf,
> > + size_t count, loff_t *ppos, bool iswrite, bool *pt);
> > + int (*mmap)(int handle, struct vm_area_struct *vma, bool *pt);
> > +
> > +};
> > +extern int vfio_pci_register_mediate_ops(struct vfio_pci_mediate_ops *ops);
> > +extern void vfio_pci_unregister_mediate_ops(struct vfio_pci_mediate_ops *ops);
> > +
> > #endif /* VFIO_H */
>

2019-12-06 08:31:47

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
>
> On 2019/12/5 4:51 PM, Yan Zhao wrote:
> > On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
> >> Hi:
> >>
> >> On 2019/12/5 11:24 AM, Yan Zhao wrote:
> >>> For SRIOV devices, VFs are passthroughed into guest directly without host
> >>> driver mediation. However, when VMs migrating with passthroughed VFs,
> >>> dynamic host mediation is required to (1) get device states, (2) get
> >>> dirty pages. Since device states as well as other critical information
> >>> required for dirty page tracking for VFs are usually retrieved from PFs,
> >>> it is handy to provide an extension in PF driver to centralizingly control
> >>> VFs' migration.
> >>>
> >>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> >>> dynamically trap VFs' bars for dirty page tracking and
> >>
> >> A silly question, what's the reason for doing this, is this a must for dirty
> >> page tracking?
> >>
> > For performance considerations. VFs' bars should be passthroughed at
> > normal times and only enter the trap state when needed.
>
>
> Right, but how does this matter for the case of dirty page tracking?
>
Take a NIC as an example: to track its VF's dirty pages, a software approach
is required that traps every write of the ring tail register, which resides
in BAR0 (a rough sketch is below). There's still no IOMMU dirty bit available.
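For illustration, once bar 0 is trapped, the mediate driver's ->rw hook could
flag pages dirty on each tail bump, roughly as below (a pure sketch; the
tail-offset check and the dirty-marking helper are placeholders):

static ssize_t i40e_vf_mediate_rw(int handle, char __user *buf, size_t count,
				  loff_t *ppos, bool iswrite, bool *pt)
{
	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
	u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;

	*pt = true;	/* by default let vfio-pci pass the access through */

	if (index != VFIO_PCI_BAR0_REGION_INDEX || !iswrite)
		return 0;

	if (i40e_vf_is_tx_tail_reg(offset)) {	/* placeholder offset check */
		/*
		 * A tail write means the guest posted new descriptors;
		 * record the pages backing those descriptors as dirty
		 * before the write reaches hardware.
		 */
		i40e_vf_mark_ring_pages_dirty(handle, offset);	/* placeholder */
	}

	return 0;	/* *pt stays true, hardware still gets the write */
}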
>
> >
> >>> (3) centralizing
> >>> VF critical states retrieving and VF controls into one driver, we propose
> >>> to introduce mediate ops on top of current vfio-pci device driver.
> >>>
> >>>
> >>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>> __________ register mediate ops| ___________ ___________ |
> >>> | |<-----------------------| VF | | |
> >>> | vfio-pci | | | mediate | | PF driver | |
> >>> |__________|----------------------->| driver | |___________|
> >>> | open(pdev) | ----------- | |
> >>> | |
> >>> | |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
> >>> \|/ \|/
> >>> ----------- ------------
> >>> | VF | | PF |
> >>> ----------- ------------
> >>>
> >>>
> >>> VF mediate driver could be a standalone driver that does not bind to
> >>> any devices (as in demo code in patches 5-6) or it could be a built-in
> >>> extension of PF driver (as in patches 7-9) .
> >>>
> >>> Rather than directly bind to VF, VF mediate driver register a mediate
> >>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
> >>> mediate ops.
> >>> (Note that: VF mediate driver can register mediate ops into vfio-pci
> >>> before vfio-pci binding to any devices. And VF mediate driver can
> >>> support mediating multiple devices.)
> >>>
> >>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
> >>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
> >>> device as a parameter.
> >>> VF mediate driver should return success or failure depending on it
> >>> supports the pdev or not.
> >>> E.g. VF mediate driver would compare its supported VF devfn with the
> >>> devfn of the passed-in pdev.
> >>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
> >>> stop querying other mediate ops and bind the opening device with this
> >>> mediate ops using the returned mediate handle.
> >>>
> >>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
> >>> VF will be intercepted into VF mediate driver as
> >>> vfio_pci_mediate_ops->get_region_info(),
> >>> vfio_pci_mediate_ops->rw,
> >>> vfio_pci_mediate_ops->mmap, and get customized.
> >>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
> >>> further return 'pt' to indicate whether vfio-pci should further
> >>> passthrough data to hw.
> >>>
> >>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
> >>> with a mediate handle as parameter.
> >>>
> >>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
> >>> mediate driver be able to differentiate two opening VFs of the same device
> >>> id and vendor id.
> >>>
> >>> When VF mediate driver exits, it unregisters its mediate ops from
> >>> vfio-pci.
> >>>
> >>>
> >>> In this patchset, we enable vfio-pci to provide 3 things:
> >>> (1) calling mediate ops to allow vendor driver customizing default
> >>> region info/rw/mmap of a region.
> >>> (2) provide a migration region to support migration
> >>
> >> What's the benefit of introducing a region? It looks to me we don't expect
> >> the region to be accessed directly from guest. Could we simply extend device
> >> fd ioctl for doing such things?
> >>
> > You may take a look on mdev live migration discussions in
> > https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
> >
> > or previous discussion at
> > https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
> > which has kernel side implemetation https://patchwork.freedesktop.org/series/56876/
> >
> > generaly speaking, qemu part of live migration is consistent for
> > vfio-pci + mediate ops way or mdev way.
>
>
> So in mdev, do you still have a mediate driver? Or you expect the parent
> to implement the region?
>
No, currently it's only for vfio-pci.
The mdev parent driver is free to customize its regions and hence does not
require these mediate ops hooks.

>
> > The region is only a channel for
> > QEMU and kernel to communicate information without introducing IOCTLs.
>
>
> Well, at least you introduce new type of region in uapi. So this does
> not answer why region is better than ioctl. If the region will only be
> used by qemu, using ioctl is much more easier and straightforward.
>
It's not introduced by me :)
mdev live migration is actually using this approach; I'm just keeping
compatible with that uapi.

From my own perspective, my answer is that a region is more flexible
compared to an ioctl: the vendor driver can freely define the size and mmap
cap of its data subregion. Also, there are already too many ioctls in vfio.
>
> >
> >
> >>> (3) provide a dynamic trap bar info region to allow vendor driver
> >>> control trap/untrap of device pci bars
> >>>
> >>> This vfio-pci + mediate ops way differs from mdev way in that
> >>> (1) medv way needs to create a 1:1 mdev device on top of one VF, device
> >>> specific mdev parent driver is bound to VF directly.
> >>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
> >>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
> >>>
> >>> The reason why we don't choose the way of writing mdev parent driver is
> >>> that
> >>> (1) VFs are almost all the time directly passthroughed. Directly binding
> >>> to vfio-pci can make most of the code shared/reused.
> >>
> >> Can we split out the common parts from vfio-pci?
> >>
> > That's very attractive. but one cannot implement a vfio-pci except
> > export everything in it as common part :)
>
>
> Well, I think there should be not hard to do that. E..g you can route it
> back to like:
>
> vfio -> vfio_mdev -> parent -> vfio_pci
>
It's desired for us to have the mediate driver bind to the PF device,
so that once a VF device is created, only the PF driver and vfio-pci are
required, just the same as what needs to be done for a normal VF passthrough.
Otherwise, a separate parent driver binding to the VF is required.
Also, this parent driver has many drawbacks, as I mention in the
cover letter.
>
> >>> If we write a
> >>> vendor specific mdev parent driver, most of the code (like passthrough
> >>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
> >>> actually a duplicated and tedious work.
> >>
> >> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
> >> me we need to consider live migration for mdev as well. In that case, do we
> >> still expect mediate ops through VFIO directly?
> >>
> >>
> >>> (2) For features like dynamically trap/untrap pci bars, if they are in
> >>> vfio-pci, they can be available to most people without repeated code
> >>> copying and re-testing.
> >>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
> >>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
> >>> it runs into a real migration need. However, if vfio-pci is bound
> >>> initially, they have no chance to do live migration when there's a need
> >>> later.
> >>
> >> We can teach management layer to do this.
> >>
> > No. not possible as vfio-pci by default has no migration region and
> > dirty page tracking needs vendor's mediation at least for most
> > passthrough devices now.
>
>
> I'm not quite sure I get it here, but in this case, just teach them to use
> the driver that has migration support?
>
That's a way, but as more and more passthrough devices have the demand and
capability to do migration, will vfio-pci still be used in the future?

Thanks
Yan

> Thanks
>
>
> >
> > Thanks
> > Yn
> >
> >> Thanks
> >>
> >>
> >>> In this patchset,
> >>> - patches 1-4 enable vfio-pci to call mediate ops registered by vendor
> >>> driver to mediate/customize region info/rw/mmap.
> >>>
> >>> - patches 5-6 provide a standalone sample driver to register a mediate ops
> >>> for Intel Graphics Devices. It does not bind to IGDs directly but decides
> >>> what devices it supports via its pciidlist. It also demonstrates how to
> >>> dynamic trap a device's PCI bars. (by adding more pciids in its
> >>> pciidlist, this sample driver actually is not necessarily limited to
> >>> support IGDs)
> >>>
> >>> - patch 7-9 provide a sample on i40e driver that supports Intel(R)
> >>> Ethernet Controller XL710 Family of devices. It supports VF precopy live
> >>> migration on Intel's 710 SRIOV. (but we commented out the real
> >>> implementation of dirty page tracking and device state retrieving part
> >>> to focus on demonstrating framework part. Will send out them in future
> >>> versions)
> >>> patch 7 registers/unregisters VF mediate ops when PF driver
> >>> probes/removes. It specifies its supporting VFs via
> >>> vfio_pci_mediate_ops->open(pdev)
> >>>
> >>> patch 8 reports device cap of VFIO_PCI_DEVICE_CAP_MIGRATION and
> >>> provides a sample implementation of migration region.
> >>> The QEMU part of vfio migration is based on v8
> >>> https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
> >>> We do not based on recent v9 because we think there are still opens in
> >>> dirty page track part in that series.
> >>>
> >>> patch 9 reports device cap of VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
> >>> provides an example on how to trap part of bar0 when migration starts
> >>> and passthrough this part of bar0 again when migration fails.
> >>>
> >>> Yan Zhao (9):
> >>> vfio/pci: introduce mediate ops to intercept vfio-pci ops
> >>> vfio/pci: test existence before calling region->ops
> >>> vfio/pci: register a default migration region
> >>> vfio-pci: register default dynamic-trap-bar-info region
> >>> samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
> >>> sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
> >>> i40e/vf_migration: register mediate_ops to vfio-pci
> >>> i40e/vf_migration: mediate migration region
> >>> i40e/vf_migration: support dynamic trap of bar0
> >>>
> >>> drivers/net/ethernet/intel/Kconfig | 2 +-
> >>> drivers/net/ethernet/intel/i40e/Makefile | 3 +-
> >>> drivers/net/ethernet/intel/i40e/i40e.h | 2 +
> >>> drivers/net/ethernet/intel/i40e/i40e_main.c | 3 +
> >>> .../ethernet/intel/i40e/i40e_vf_migration.c | 626 ++++++++++++++++++
> >>> .../ethernet/intel/i40e/i40e_vf_migration.h | 78 +++
> >>> drivers/vfio/pci/vfio_pci.c | 189 +++++-
> >>> drivers/vfio/pci/vfio_pci_private.h | 2 +
> >>> include/linux/vfio.h | 18 +
> >>> include/uapi/linux/vfio.h | 160 +++++
> >>> samples/Kconfig | 6 +
> >>> samples/Makefile | 1 +
> >>> samples/vfio-pci/Makefile | 2 +
> >>> samples/vfio-pci/igd_dt.c | 367 ++++++++++
> >>> 14 files changed, 1455 insertions(+), 4 deletions(-)
> >>> create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
> >>> create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
> >>> create mode 100644 samples/vfio-pci/Makefile
> >>> create mode 100644 samples/vfio-pci/igd_dt.c
> >>>
>

2019-12-06 09:41:10

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] Introduce mediate ops in vfio-pci


On 2019/12/6 4:22 PM, Yan Zhao wrote:
> On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
>> On 2019/12/5 4:51 PM, Yan Zhao wrote:
>>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
>>>> Hi:
>>>>
>>>> On 2019/12/5 11:24 AM, Yan Zhao wrote:
>>>>> For SRIOV devices, VFs are passthroughed into guest directly without host
>>>>> driver mediation. However, when VMs migrating with passthroughed VFs,
>>>>> dynamic host mediation is required to (1) get device states, (2) get
>>>>> dirty pages. Since device states as well as other critical information
>>>>> required for dirty page tracking for VFs are usually retrieved from PFs,
>>>>> it is handy to provide an extension in PF driver to centralizingly control
>>>>> VFs' migration.
>>>>>
>>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
>>>>> dynamically trap VFs' bars for dirty page tracking and
>>>> A silly question, what's the reason for doing this, is this a must for dirty
>>>> page tracking?
>>>>
>>> For performance considerations. VFs' bars should be passthroughed at
>>> normal times and only enter the trap state when needed.
>>
>> Right, but how does this matter for the case of dirty page tracking?
>>
> Take NIC as an example, to trap its VF dirty pages, software way is
> required to trap every write of ring tail that resides in BAR0.


Interesting, but it looks like we need to:
- decode the instruction
- mediate all accesses to BAR0
All of which seems a great burden for the VF driver. I wonder whether or
not doing interrupt relay and tracking the head is better in this case.


> There's
> still no IOMMU Dirty bit available.
>>>>> (3) centralizing
>>>>> VF critical states retrieving and VF controls into one driver, we propose
>>>>> to introduce mediate ops on top of current vfio-pci device driver.
>>>>>
>>>>>
>>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>>>> __________ register mediate ops| ___________ ___________ |
>>>>> | |<-----------------------| VF | | |
>>>>> | vfio-pci | | | mediate | | PF driver | |
>>>>> |__________|----------------------->| driver | |___________|
>>>>> | open(pdev) | ----------- | |
>>>>> | |
>>>>> | |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
>>>>> \|/ \|/
>>>>> ----------- ------------
>>>>> | VF | | PF |
>>>>> ----------- ------------
>>>>>
>>>>>
>>>>> VF mediate driver could be a standalone driver that does not bind to
>>>>> any devices (as in demo code in patches 5-6) or it could be a built-in
>>>>> extension of PF driver (as in patches 7-9) .
>>>>>
>>>>> Rather than directly bind to VF, VF mediate driver register a mediate
>>>>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
>>>>> mediate ops.
>>>>> (Note that: VF mediate driver can register mediate ops into vfio-pci
>>>>> before vfio-pci binding to any devices. And VF mediate driver can
>>>>> support mediating multiple devices.)
>>>>>
>>>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
>>>>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
>>>>> device as a parameter.
>>>>> VF mediate driver should return success or failure depending on it
>>>>> supports the pdev or not.
>>>>> E.g. VF mediate driver would compare its supported VF devfn with the
>>>>> devfn of the passed-in pdev.
>>>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
>>>>> stop querying other mediate ops and bind the opening device with this
>>>>> mediate ops using the returned mediate handle.
>>>>>
>>>>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
>>>>> VF will be intercepted into VF mediate driver as
>>>>> vfio_pci_mediate_ops->get_region_info(),
>>>>> vfio_pci_mediate_ops->rw,
>>>>> vfio_pci_mediate_ops->mmap, and get customized.
>>>>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
>>>>> further return 'pt' to indicate whether vfio-pci should further
>>>>> passthrough data to hw.
>>>>>
>>>>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
>>>>> with a mediate handle as parameter.
>>>>>
>>>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
>>>>> mediate driver be able to differentiate two opening VFs of the same device
>>>>> id and vendor id.
>>>>>
>>>>> When VF mediate driver exits, it unregisters its mediate ops from
>>>>> vfio-pci.
>>>>>
>>>>>
>>>>> In this patchset, we enable vfio-pci to provide 3 things:
>>>>> (1) calling mediate ops to allow vendor driver customizing default
>>>>> region info/rw/mmap of a region.
>>>>> (2) provide a migration region to support migration
>>>> What's the benefit of introducing a region? It looks to me we don't expect
>>>> the region to be accessed directly from guest. Could we simply extend device
>>>> fd ioctl for doing such things?
>>>>
>>> You may take a look on mdev live migration discussions in
>>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>>>
>>> or previous discussion at
>>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
>>> which has kernel side implemetation https://patchwork.freedesktop.org/series/56876/
>>>
>>> generaly speaking, qemu part of live migration is consistent for
>>> vfio-pci + mediate ops way or mdev way.
>>
>> So in mdev, do you still have a mediate driver? Or you expect the parent
>> to implement the region?
>>
> No, currently it's only for vfio-pci.

And specific to PCI.

> mdev parent driver is free to customize its regions and hence does not
> require these mediate ops hooks.
>
>>> The region is only a channel for
>>> QEMU and kernel to communicate information without introducing IOCTLs.
>>
>> Well, at least you introduce new type of region in uapi. So this does
>> not answer why region is better than ioctl. If the region will only be
>> used by qemu, using ioctl is much more easier and straightforward.
>>
> It's not introduced by me :)
> mdev live migration is actually using this way, I'm just keeping
> compatible to the uapi.


I meant, e.g., VFIO_REGION_TYPE_MIGRATION.


>
> From my own perspective, my answer is that a region is more flexible
> compared to ioctl. vendor driver can freely define the size,
>

Probably not since it's an ABI I think.

> mmap cap of
> its data subregion.
>

It doesn't help much unless it can be mapped into the guest (which I don't
think is the case here).

> Also, there're already too many ioctls in vfio.

Probably not :) We have a bunch of subsystems that have many more
ioctls than VFIO (e.g. DRM).

>>>
>>>>> (3) provide a dynamic trap bar info region to allow vendor driver
>>>>> control trap/untrap of device pci bars
>>>>>
>>>>> This vfio-pci + mediate ops way differs from mdev way in that
>>>>> (1) medv way needs to create a 1:1 mdev device on top of one VF, device
>>>>> specific mdev parent driver is bound to VF directly.
>>>>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
>>>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
>>>>>
>>>>> The reason why we don't choose the way of writing mdev parent driver is
>>>>> that
>>>>> (1) VFs are almost all the time directly passthroughed. Directly binding
>>>>> to vfio-pci can make most of the code shared/reused.
>>>> Can we split out the common parts from vfio-pci?
>>>>
>>> That's very attractive. but one cannot implement a vfio-pci except
>>> export everything in it as common part :)
>>
>> Well, I think there should be not hard to do that. E..g you can route it
>> back to like:
>>
>> vfio -> vfio_mdev -> parent -> vfio_pci
>>
> it's desired for us to have mediate driver binding to PF device.
> so once a VF device is created, only PF driver and vfio-pci are
> required. Just the same as what needs to be done for a normal VF passthrough.
> otherwise, a separate parent driver binding to VF is required.
> Also, this parent driver has many drawbacks as I mentions in this
> cover-letter.

Well, as discussed, there's no need to duplicate the code; the bar trick should
still work. The main issues I see with this proposal are:

1) It is PCI specific; other buses may need something similar
2) It duplicates functionality with mdev, and mdev can do even more


>>>>> If we write a
>>>>> vendor specific mdev parent driver, most of the code (like passthrough
>>>>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
>>>>> actually a duplicated and tedious work.
>>>> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
>>>> me we need to consider live migration for mdev as well. In that case, do we
>>>> still expect mediate ops through VFIO directly?
>>>>
>>>>
>>>>> (2) For features like dynamically trap/untrap pci bars, if they are in
>>>>> vfio-pci, they can be available to most people without repeated code
>>>>> copying and re-testing.
>>>>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
>>>>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
>>>>> it runs into a real migration need. However, if vfio-pci is bound
>>>>> initially, they have no chance to do live migration when there's a need
>>>>> later.
>>>> We can teach management layer to do this.
>>>>
>>> No. not possible as vfio-pci by default has no migration region and
>>> dirty page tracking needs vendor's mediation at least for most
>>> passthrough devices now.
>>
>> I'm not quite sure I get here but in this case, just tech them to use
>> the driver that has migration support?
>>
> That's a way, but as more and more passthrough devices have demands and
> caps to do migration, will vfio-pci be used in future any more ?


This should not be a problem:
- If we introduce a common mdev for vfio-pci, we can just always bind that
driver
- The most straightforward way to support dirty page tracking is to do it in
the IOMMU instead of through device-specific operations.

Thanks

>
> Thanks
> Yan
>
>> Thanks
>>
>>
>>> Thanks
>>> Yn
>>>
>>>> Thanks
>>>>
>>>>
>>>>> In this patchset,
>>>>> - patches 1-4 enable vfio-pci to call mediate ops registered by vendor
>>>>> driver to mediate/customize region info/rw/mmap.
>>>>>
>>>>> - patches 5-6 provide a standalone sample driver to register a mediate ops
>>>>> for Intel Graphics Devices. It does not bind to IGDs directly but decides
>>>>> what devices it supports via its pciidlist. It also demonstrates how to
>>>>> dynamic trap a device's PCI bars. (by adding more pciids in its
>>>>> pciidlist, this sample driver actually is not necessarily limited to
>>>>> support IGDs)
>>>>>
>>>>> - patch 7-9 provide a sample on i40e driver that supports Intel(R)
>>>>> Ethernet Controller XL710 Family of devices. It supports VF precopy live
>>>>> migration on Intel's 710 SRIOV. (but we commented out the real
>>>>> implementation of dirty page tracking and device state retrieving part
>>>>> to focus on demonstrating framework part. Will send out them in future
>>>>> versions)
>>>>> patch 7 registers/unregisters VF mediate ops when PF driver
>>>>> probes/removes. It specifies its supporting VFs via
>>>>> vfio_pci_mediate_ops->open(pdev)
>>>>>
>>>>> patch 8 reports device cap of VFIO_PCI_DEVICE_CAP_MIGRATION and
>>>>> provides a sample implementation of migration region.
>>>>> The QEMU part of vfio migration is based on v8
>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
>>>>> We do not based on recent v9 because we think there are still opens in
>>>>> dirty page track part in that series.
>>>>>
>>>>> patch 9 reports device cap of VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
>>>>> provides an example on how to trap part of bar0 when migration starts
>>>>> and passthrough this part of bar0 again when migration fails.
>>>>>
>>>>> Yan Zhao (9):
>>>>> vfio/pci: introduce mediate ops to intercept vfio-pci ops
>>>>> vfio/pci: test existence before calling region->ops
>>>>> vfio/pci: register a default migration region
>>>>> vfio-pci: register default dynamic-trap-bar-info region
>>>>> samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
>>>>> sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
>>>>> i40e/vf_migration: register mediate_ops to vfio-pci
>>>>> i40e/vf_migration: mediate migration region
>>>>> i40e/vf_migration: support dynamic trap of bar0
>>>>>
>>>>> drivers/net/ethernet/intel/Kconfig | 2 +-
>>>>> drivers/net/ethernet/intel/i40e/Makefile | 3 +-
>>>>> drivers/net/ethernet/intel/i40e/i40e.h | 2 +
>>>>> drivers/net/ethernet/intel/i40e/i40e_main.c | 3 +
>>>>> .../ethernet/intel/i40e/i40e_vf_migration.c | 626 ++++++++++++++++++
>>>>> .../ethernet/intel/i40e/i40e_vf_migration.h | 78 +++
>>>>> drivers/vfio/pci/vfio_pci.c | 189 +++++-
>>>>> drivers/vfio/pci/vfio_pci_private.h | 2 +
>>>>> include/linux/vfio.h | 18 +
>>>>> include/uapi/linux/vfio.h | 160 +++++
>>>>> samples/Kconfig | 6 +
>>>>> samples/Makefile | 1 +
>>>>> samples/vfio-pci/Makefile | 2 +
>>>>> samples/vfio-pci/igd_dt.c | 367 ++++++++++
>>>>> 14 files changed, 1455 insertions(+), 4 deletions(-)
>>>>> create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
>>>>> create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
>>>>> create mode 100644 samples/vfio-pci/Makefile
>>>>> create mode 100644 samples/vfio-pci/igd_dt.c
>>>>>

2019-12-06 12:58:53

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

On Fri, Dec 06, 2019 at 05:40:02PM +0800, Jason Wang wrote:
>
> > On 2019/12/6 4:22 PM, Yan Zhao wrote:
> > On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
> > >> On 2019/12/5 4:51 PM, Yan Zhao wrote:
> >>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
> >>>> Hi:
> >>>>
> > >>>> On 2019/12/5 11:24 AM, Yan Zhao wrote:
> >>>>> For SRIOV devices, VFs are passthroughed into guest directly without host
> >>>>> driver mediation. However, when VMs migrating with passthroughed VFs,
> >>>>> dynamic host mediation is required to (1) get device states, (2) get
> >>>>> dirty pages. Since device states as well as other critical information
> >>>>> required for dirty page tracking for VFs are usually retrieved from PFs,
> >>>>> it is handy to provide an extension in PF driver to centralizingly control
> >>>>> VFs' migration.
> >>>>>
> >>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> >>>>> dynamically trap VFs' bars for dirty page tracking and
> >>>> A silly question, what's the reason for doing this, is this a must for dirty
> >>>> page tracking?
> >>>>
> >>> For performance consideration. VFs' bars should be passthoughed at
> >>> normal time and only enter into trap state on need.
> >>
> >> Right, but how does this matter for the case of dirty page tracking?
> >>
> > Take NIC as an example, to trap its VF dirty pages, software way is
> > required to trap every write of ring tail that resides in BAR0.
>
>
> Interesting, but it looks like we need:
> - decode the instruction
> - mediate all access to BAR0
> All of which seems a great burden for the VF driver. I wonder whether or
> not doing interrupt relay and tracking head is better in this case.
>
Hi Jason,

I'm not familiar with the approach you mentioned. Could you elaborate?
>
> > There's
> > still no IOMMU Dirty bit available.
> >>>>> (3) centralizing
> >>>>> VF critical states retrieving and VF controls into one driver, we propose
> >>>>> to introduce mediate ops on top of current vfio-pci device driver.
> >>>>>
> >>>>>
> >>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>>>> __________ register mediate ops| ___________ ___________ |
> >>>>> | |<-----------------------| VF | | |
> >>>>> | vfio-pci | | | mediate | | PF driver | |
> >>>>> |__________|----------------------->| driver | |___________|
> >>>>> | open(pdev) | ----------- | |
> >>>>> | |
> >>>>> | |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
> >>>>> \|/ \|/
> >>>>> ----------- ------------
> >>>>> | VF | | PF |
> >>>>> ----------- ------------
> >>>>>
> >>>>>
> >>>>> VF mediate driver could be a standalone driver that does not bind to
> >>>>> any devices (as in demo code in patches 5-6) or it could be a built-in
> >>>>> extension of PF driver (as in patches 7-9) .
> >>>>>
> >>>>> Rather than directly bind to VF, VF mediate driver register a mediate
> >>>>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
> >>>>> mediate ops.
> >>>>> (Note that: VF mediate driver can register mediate ops into vfio-pci
> >>>>> before vfio-pci binding to any devices. And VF mediate driver can
> >>>>> support mediating multiple devices.)
> >>>>>
> >>>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
> >>>>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
> >>>>> device as a parameter.
> >>>>> VF mediate driver should return success or failure depending on it
> >>>>> supports the pdev or not.
> >>>>> E.g. VF mediate driver would compare its supported VF devfn with the
> >>>>> devfn of the passed-in pdev.
> >>>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
> >>>>> stop querying other mediate ops and bind the opening device with this
> >>>>> mediate ops using the returned mediate handle.
> >>>>>
> >>>>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
> >>>>> VF will be intercepted into VF mediate driver as
> >>>>> vfio_pci_mediate_ops->get_region_info(),
> >>>>> vfio_pci_mediate_ops->rw,
> >>>>> vfio_pci_mediate_ops->mmap, and get customized.
> >>>>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
> >>>>> further return 'pt' to indicate whether vfio-pci should further
> >>>>> passthrough data to hw.
> >>>>>
> >>>>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
> >>>>> with a mediate handle as parameter.
> >>>>>
> >>>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
> >>>>> mediate driver be able to differentiate two opening VFs of the same device
> >>>>> id and vendor id.
> >>>>>
> >>>>> When VF mediate driver exits, it unregisters its mediate ops from
> >>>>> vfio-pci.
> >>>>>
> >>>>>
> >>>>> In this patchset, we enable vfio-pci to provide 3 things:
> >>>>> (1) calling mediate ops to allow vendor driver customizing default
> >>>>> region info/rw/mmap of a region.
> >>>>> (2) provide a migration region to support migration
> >>>> What's the benefit of introducing a region? It looks to me we don't expect
> >>>> the region to be accessed directly from guest. Could we simply extend device
> >>>> fd ioctl for doing such things?
> >>>>
> >>> You may take a look on mdev live migration discussions in
> >>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
> >>>
> >>> or previous discussion at
> >>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
> >>> which has kernel side implemetation https://patchwork.freedesktop.org/series/56876/
> >>>
> >>> generaly speaking, qemu part of live migration is consistent for
> >>> vfio-pci + mediate ops way or mdev way.
> >>
> >> So in mdev, do you still have a mediate driver? Or you expect the parent
> >> to implement the region?
> >>
> > No, currently it's only for vfio-pci.
>
> And specific to PCI.
>
> > mdev parent driver is free to customize its regions and hence does not
> > requires this mediate ops hooks.
> >
> >>> The region is only a channel for
> >>> QEMU and kernel to communicate information without introducing IOCTLs.
> >>
> >> Well, at least you introduce new type of region in uapi. So this does
> >> not answer why region is better than ioctl. If the region will only be
> >> used by qemu, using ioctl is much more easier and straightforward.
> >>
> > It's not introduced by me :)
> > mdev live migration is actually using this way, I'm just keeping
> > compatible to the uapi.
>
>
> I meant e.g VFIO_REGION_TYPE_MIGRATION.
>
Here's the history of vfio live migration:
https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg05564.html
https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html
https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html

If you have any concerns about this region-based approach, feel free to comment on the
latest v9 patchset:
https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html

The patchset here will always stay compatible with it.
>
> >
> > From my own perspective, my answer is that a region is more flexible
> > compared to ioctl. vendor driver can freely define the size,
> >
>
> Probably not since it's an ABI I think.
>
That's why I need to define VFIO_REGION_TYPE_MIGRATION here in this
patchset, as it's not upstreamed yet.
Maybe I should make it a prerequisite patch to indicate that it is not
introduced by this patchset.

> > mmap cap of
> > its data subregion.
> >
>
> It doesn't help much unless it can be mapped into guest (which I don't
> think it was the case here).
>
It's accessed by host QEMU, the same way a Linux application accesses mmapped
memory. The mmap here is to reduce memory copies between kernel and user space.
It does not need to be mapped into the guest.
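
To make that concrete, here is a minimal sketch of how a host user could map
such a data subregion through the device fd. The region index is only a
placeholder; this series does not fix a specific index here.

#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

/*
 * Sketch only: map a vendor-provided data subregion through the vfio
 * device fd so migration data can be read in place.  'index' stands in
 * for whatever region index the migration data ends up at.
 */
static void *map_migration_data(int device_fd, unsigned int index, size_t *len)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = index,
	};
	void *data;

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
		return NULL;
	if (!(info.flags & VFIO_REGION_INFO_FLAG_MMAP))
		return NULL;	/* fall back to pread()/pwrite() on the fd */

	data = mmap(NULL, info.size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    device_fd, info.offset);
	if (data == MAP_FAILED)
		return NULL;

	*len = info.size;
	return data;	/* accessed directly, no extra kernel-to-user copy */
}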

> > Also, there're already too many ioctls in vfio.
>
> Probably not :) We had a brunch of  subsystems that have much more
> ioctls than VFIO. (e.g DRM)
>

> >>>
> >>>>> (3) provide a dynamic trap bar info region to allow vendor driver
> >>>>> control trap/untrap of device pci bars
> >>>>>
> >>>>> This vfio-pci + mediate ops way differs from mdev way in that
> >>>>> (1) medv way needs to create a 1:1 mdev device on top of one VF, device
> >>>>> specific mdev parent driver is bound to VF directly.
> >>>>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
> >>>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
> >>>>>
> >>>>> The reason why we don't choose the way of writing mdev parent driver is
> >>>>> that
> >>>>> (1) VFs are almost all the time directly passthroughed. Directly binding
> >>>>> to vfio-pci can make most of the code shared/reused.
> >>>> Can we split out the common parts from vfio-pci?
> >>>>
> >>> That's very attractive. but one cannot implement a vfio-pci except
> >>> export everything in it as common part :)
> >>
> >> Well, I think there should be not hard to do that. E..g you can route it
> >> back to like:
> >>
> >> vfio -> vfio_mdev -> parent -> vfio_pci
> >>
> > it's desired for us to have mediate driver binding to PF device.
> > so once a VF device is created, only PF driver and vfio-pci are
> > required. Just the same as what needs to be done for a normal VF passthrough.
> > otherwise, a separate parent driver binding to VF is required.
> > Also, this parent driver has many drawbacks as I mentions in this
> > cover-letter.
>
> Well, as discussed, no need to duplicate the code, bar trick should
> still work. The main issues I saw with this proposal is:
>
> 1) PCI specific, other bus may need something similar
vfio-pci is only for PCI of course.

> 2) Function duplicated with mdev and mdev can do even more
>
Could you elaborate on how mdev can solve the problems above?
>
> >>>>> If we write a
> >>>>> vendor specific mdev parent driver, most of the code (like passthrough
> >>>>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
> >>>>> actually a duplicated and tedious work.
> >>>> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
> >>>> me we need to consider live migration for mdev as well. In that case, do we
> >>>> still expect mediate ops through VFIO directly?
> >>>>
> >>>>
> >>>>> (2) For features like dynamically trap/untrap pci bars, if they are in
> >>>>> vfio-pci, they can be available to most people without repeated code
> >>>>> copying and re-testing.
> >>>>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
> >>>>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
> >>>>> it runs into a real migration need. However, if vfio-pci is bound
> >>>>> initially, they have no chance to do live migration when there's a need
> >>>>> later.
> >>>> We can teach management layer to do this.
> >>>>
> >>> No. not possible as vfio-pci by default has no migration region and
> >>> dirty page tracking needs vendor's mediation at least for most
> >>> passthrough devices now.
> >>
> >> I'm not quite sure I get here but in this case, just tech them to use
> >> the driver that has migration support?
> >>
> > That's a way, but as more and more passthrough devices have demands and
> > caps to do migration, will vfio-pci be used in future any more ?
>
>
> This should not be a problem:
> - If we introduce a common mdev for vfio-pci, we can just bind that
> driver always
What is a common mdev for vfio-pci? A common mdev parent driver that has
the same implementation as vfio-pci?

There's actually already a solution that creates exactly one mdev on top
of each passthrough device and makes the mdev share the same iommu group
with it. We've made an implementation of it as well; here's a
sample one made by Yi at https://patchwork.kernel.org/cover/11134695/.

But, as I said, it's desirable to re-use vfio-pci directly for SRIOV,
which is straightforward :)

> - The most straightforward way to support dirty page tracking is done by
> IOMMU instead of device specific operations.
>
No such IOMMU exists yet. And all kinds of platforms should be covered, right?

Thanks
Yan

> Thanks
>
> >
> > Thanks
> > Yan
> >
> >> Thanks
> >>
> >>
> >>> Thanks
> >>> Yn
> >>>
> >>>> Thanks
> >>>>
> >>>>
> >>>>> In this patchset,
> >>>>> - patches 1-4 enable vfio-pci to call mediate ops registered by vendor
> >>>>> driver to mediate/customize region info/rw/mmap.
> >>>>>
> >>>>> - patches 5-6 provide a standalone sample driver to register a mediate ops
> >>>>> for Intel Graphics Devices. It does not bind to IGDs directly but decides
> >>>>> what devices it supports via its pciidlist. It also demonstrates how to
> >>>>> dynamic trap a device's PCI bars. (by adding more pciids in its
> >>>>> pciidlist, this sample driver actually is not necessarily limited to
> >>>>> support IGDs)
> >>>>>
> >>>>> - patch 7-9 provide a sample on i40e driver that supports Intel(R)
> >>>>> Ethernet Controller XL710 Family of devices. It supports VF precopy live
> >>>>> migration on Intel's 710 SRIOV. (but we commented out the real
> >>>>> implementation of dirty page tracking and device state retrieving part
> >>>>> to focus on demonstrating framework part. Will send out them in future
> >>>>> versions)
> >>>>> patch 7 registers/unregisters VF mediate ops when PF driver
> >>>>> probes/removes. It specifies its supporting VFs via
> >>>>> vfio_pci_mediate_ops->open(pdev)
> >>>>>
> >>>>> patch 8 reports device cap of VFIO_PCI_DEVICE_CAP_MIGRATION and
> >>>>> provides a sample implementation of migration region.
> >>>>> The QEMU part of vfio migration is based on v8
> >>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
> >>>>> We do not based on recent v9 because we think there are still opens in
> >>>>> dirty page track part in that series.
> >>>>>
> >>>>> patch 9 reports device cap of VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
> >>>>> provides an example on how to trap part of bar0 when migration starts
> >>>>> and passthrough this part of bar0 again when migration fails.
> >>>>>
> >>>>> Yan Zhao (9):
> >>>>> vfio/pci: introduce mediate ops to intercept vfio-pci ops
> >>>>> vfio/pci: test existence before calling region->ops
> >>>>> vfio/pci: register a default migration region
> >>>>> vfio-pci: register default dynamic-trap-bar-info region
> >>>>> samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
> >>>>> sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
> >>>>> i40e/vf_migration: register mediate_ops to vfio-pci
> >>>>> i40e/vf_migration: mediate migration region
> >>>>> i40e/vf_migration: support dynamic trap of bar0
> >>>>>
> >>>>> drivers/net/ethernet/intel/Kconfig | 2 +-
> >>>>> drivers/net/ethernet/intel/i40e/Makefile | 3 +-
> >>>>> drivers/net/ethernet/intel/i40e/i40e.h | 2 +
> >>>>> drivers/net/ethernet/intel/i40e/i40e_main.c | 3 +
> >>>>> .../ethernet/intel/i40e/i40e_vf_migration.c | 626 ++++++++++++++++++
> >>>>> .../ethernet/intel/i40e/i40e_vf_migration.h | 78 +++
> >>>>> drivers/vfio/pci/vfio_pci.c | 189 +++++-
> >>>>> drivers/vfio/pci/vfio_pci_private.h | 2 +
> >>>>> include/linux/vfio.h | 18 +
> >>>>> include/uapi/linux/vfio.h | 160 +++++
> >>>>> samples/Kconfig | 6 +
> >>>>> samples/Makefile | 1 +
> >>>>> samples/vfio-pci/Makefile | 2 +
> >>>>> samples/vfio-pci/igd_dt.c | 367 ++++++++++
> >>>>> 14 files changed, 1455 insertions(+), 4 deletions(-)
> >>>>> create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
> >>>>> create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
> >>>>> create mode 100644 samples/vfio-pci/Makefile
> >>>>> create mode 100644 samples/vfio-pci/igd_dt.c
> >>>>>
>

2019-12-06 15:21:36

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

On Fri, 6 Dec 2019 01:04:07 -0500
Yan Zhao <[email protected]> wrote:

> On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> > On Wed, 4 Dec 2019 22:26:50 -0500
> > Yan Zhao <[email protected]> wrote:
> >
> > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > communicate dynamic trap info. It is of type
> > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > >
> > > This region has two fields: dt_fd and trap.
> > > When QEMU detects a device region of this type, it will create an
> > > eventfd and write its eventfd id to the dt_fd field.
> > > When the vendor driver signals this eventfd, QEMU reads the trap field of this
> > > info region.
> > > - If trap is true, QEMU would search the device's PCI BAR
> > > regions and disable all the sparse mmaped subregions (if the sparse
> > > mmaped subregion is disablable).
> > > - If trap is false, QEMU would re-enable those subregions.
> > >
> > > A typical usage is
> > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > sparse mmap array. So initially, all of its bar 0 is passthroughed.
> > > 2. vendor driver specifies part of bar 0 sections to be disablable.
> > > 3. when migration starts, vendor driver signals dt_fd and sets trap to true
> > > to notify QEMU to disable the bar 0 sections whose disablable flag is on.
> > > 4. QEMU disables those bar 0 sections and hence lets vendor driver
> > > trap accesses of bar 0 registers, making dirty page tracking possible.
> > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > QEMU reads the trap field of this info region, which is now false, and
> > > re-passes through the whole bar 0 region.
> > >
> > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > vfio_pci_mediate_ops->open().
> > >
> > > If vfio-pci detects this cap, it will create a default
> > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > and region->ops=null.
> > > Vendor driver should override this region's len, flags, rw, mmap in its
> > > vfio_pci_mediate_ops.
> >
> > TBH, I don't like this interface at all. Userspace doesn't pass data
> > to the kernel via INFO ioctls. We have a SET_IRQS ioctl for
> > configuring user signaling with eventfds. I think we only need to
> > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > information for a region. The user would enumerate the device IRQs via
> > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > indicate which region(s) should be re-evaluated on signaling. The user
> > would enable that signaling via SET_IRQS and simply re-evaluate the
> ok. I'll try to switch to this way. Thanks for this suggestion.
>
> > sparse mmap capability for the associated regions when signaled.
>
> Do you like the "disablable" flag of sparse mmap ?
> I think it's a lightweight way for user to switch mmap state of a whole region,
> otherwise going through a complete flow of GET_REGION_INFO and re-setup
> region might be too heavy.

No, I don't like the disable-able flag. At what frequency do we expect
regions to change? It seems like we'd only change when switching into
and out of the _SAVING state, which is rare. It seems easy for
userspace, at least QEMU, to drop the entire mmap configuration and
re-read it. Another concern here is how do we synchronize the event?
Are we assuming that this event would occur when a user switches to
_SAVING mode on the device? That operation is synchronous: the device
must be in saving mode after the write to device state completes, but
it seems like this might be trying to add an asynchronous dependency.
Will the write to device_state only complete once the user handles the
eventfd? How would the kernel know when the mmap re-evaluation is
complete? It seems like there are gaps here that the vendor driver
could miss traps required for migration because the user hasn't
completed the mmap transition yet. Thanks,

Alex
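
A minimal sketch of the userspace side of the IRQ-based alternative discussed
in this mail, assuming a hypothetical "remap" IRQ index for the proposed new
IRQ type (no such type exists in the uAPI today):

#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Sketch only: arm an eventfd on a hypothetical "re-evaluate sparse mmap"
 * IRQ.  When the fd fires, userspace drops its mmaps and re-issues
 * VFIO_DEVICE_GET_REGION_INFO to pick up the new sparse mmap layout.
 */
static int arm_remap_notifier(int device_fd, unsigned int remap_irq_index)
{
	size_t argsz = sizeof(struct vfio_irq_set) + sizeof(int32_t);
	struct vfio_irq_set *set = calloc(1, argsz);
	int efd = eventfd(0, 0);
	int ret = -1;

	if (!set || efd < 0) {
		if (efd >= 0)
			close(efd);
		goto out;
	}

	set->argsz = argsz;
	set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	set->index = remap_irq_index;	/* hypothetical new IRQ index */
	set->start = 0;
	set->count = 1;
	*(int32_t *)set->data = efd;

	if (!ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set))
		ret = efd;		/* poll this fd for remap requests */
	else
		close(efd);
out:
	free(set);
	return ret;
}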

2019-12-06 17:45:08

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

On Fri, 6 Dec 2019 17:40:02 +0800
Jason Wang <[email protected]> wrote:

> On 2019/12/6 4:22 PM, Yan Zhao wrote:
> > On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
> >> On 2019/12/5 4:51 PM, Yan Zhao wrote:
> >>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
> >>>> Hi:
> >>>>
> >>>> On 2019/12/5 11:24 AM, Yan Zhao wrote:
> >>>>> For SRIOV devices, VFs are passthroughed into guest directly without host
> >>>>> driver mediation. However, when VMs migrating with passthroughed VFs,
> >>>>> dynamic host mediation is required to (1) get device states, (2) get
> >>>>> dirty pages. Since device states as well as other critical information
> >>>>> required for dirty page tracking for VFs are usually retrieved from PFs,
> >>>>> it is handy to provide an extension in PF driver to centralizingly control
> >>>>> VFs' migration.
> >>>>>
> >>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> >>>>> dynamically trap VFs' bars for dirty page tracking and
> >>>> A silly question, what's the reason for doing this, is this a must for dirty
> >>>> page tracking?
> >>>>
> >>> For performance consideration. VFs' bars should be passthoughed at
> >>> normal time and only enter into trap state on need.
> >>
> >> Right, but how does this matter for the case of dirty page tracking?
> >>
> > Take NIC as an example, to trap its VF dirty pages, software way is
> > required to trap every write of ring tail that resides in BAR0.
>
>
> Interesting, but it looks like we need:
> - decode the instruction
> - mediate all access to BAR0
> All of which seems a great burden for the VF driver. I wonder whether or
> not doing interrupt relay and tracking head is better in this case.

This sounds like a NIC-specific solution; I believe the goal here is to
allow any device type to implement a partial mediation solution, in
this case to sufficiently track the device while in the migration
saving state.

> > There's
> > still no IOMMU Dirty bit available.
> >>>>> (3) centralizing
> >>>>> VF critical states retrieving and VF controls into one driver, we propose
> >>>>> to introduce mediate ops on top of current vfio-pci device driver.
> >>>>>
> >>>>>
> >>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>>>> __________ register mediate ops| ___________ ___________ |
> >>>>> | |<-----------------------| VF | | |
> >>>>> | vfio-pci | | | mediate | | PF driver | |
> >>>>> |__________|----------------------->| driver | |___________|
> >>>>> | open(pdev) | ----------- | |
> >>>>> | |
> >>>>> | |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
> >>>>> \|/ \|/
> >>>>> ----------- ------------
> >>>>> | VF | | PF |
> >>>>> ----------- ------------
> >>>>>
> >>>>>
> >>>>> VF mediate driver could be a standalone driver that does not bind to
> >>>>> any devices (as in demo code in patches 5-6) or it could be a built-in
> >>>>> extension of PF driver (as in patches 7-9) .
> >>>>>
> >>>>> Rather than directly bind to VF, VF mediate driver register a mediate
> >>>>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
> >>>>> mediate ops.
> >>>>> (Note that: VF mediate driver can register mediate ops into vfio-pci
> >>>>> before vfio-pci binding to any devices. And VF mediate driver can
> >>>>> support mediating multiple devices.)
> >>>>>
> >>>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
> >>>>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
> >>>>> device as a parameter.
> >>>>> VF mediate driver should return success or failure depending on it
> >>>>> supports the pdev or not.
> >>>>> E.g. VF mediate driver would compare its supported VF devfn with the
> >>>>> devfn of the passed-in pdev.
> >>>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
> >>>>> stop querying other mediate ops and bind the opening device with this
> >>>>> mediate ops using the returned mediate handle.
> >>>>>
> >>>>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
> >>>>> VF will be intercepted into VF mediate driver as
> >>>>> vfio_pci_mediate_ops->get_region_info(),
> >>>>> vfio_pci_mediate_ops->rw,
> >>>>> vfio_pci_mediate_ops->mmap, and get customized.
> >>>>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
> >>>>> further return 'pt' to indicate whether vfio-pci should further
> >>>>> passthrough data to hw.
> >>>>>
> >>>>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
> >>>>> with a mediate handle as parameter.
> >>>>>
> >>>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
> >>>>> mediate driver be able to differentiate two opening VFs of the same device
> >>>>> id and vendor id.
> >>>>>
> >>>>> When VF mediate driver exits, it unregisters its mediate ops from
> >>>>> vfio-pci.
> >>>>>
> >>>>>
> >>>>> In this patchset, we enable vfio-pci to provide 3 things:
> >>>>> (1) calling mediate ops to allow vendor driver customizing default
> >>>>> region info/rw/mmap of a region.
> >>>>> (2) provide a migration region to support migration
> >>>> What's the benefit of introducing a region? It looks to me we don't expect
> >>>> the region to be accessed directly from guest. Could we simply extend device
> >>>> fd ioctl for doing such things?
> >>>>
> >>> You may take a look on mdev live migration discussions in
> >>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
> >>>
> >>> or previous discussion at
> >>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
> >>> which has kernel side implemetation https://patchwork.freedesktop.org/series/56876/
> >>>
> >>> generaly speaking, qemu part of live migration is consistent for
> >>> vfio-pci + mediate ops way or mdev way.
> >>
> >> So in mdev, do you still have a mediate driver? Or you expect the parent
> >> to implement the region?
> >>
> > No, currently it's only for vfio-pci.
>
> And specific to PCI.

What's PCI specific? The implementation, yes, it's done in the
vfio bus driver here, but all device access is performed by the bus
driver. I'm not sure how we could introduce the intercept at the
vfio-core level, but I'm open to suggestions.

> > mdev parent driver is free to customize its regions and hence does not
> > requires this mediate ops hooks.
> >
> >>> The region is only a channel for
> >>> QEMU and kernel to communicate information without introducing IOCTLs.
> >>
> >> Well, at least you introduce new type of region in uapi. So this does
> >> not answer why region is better than ioctl. If the region will only be
> >> used by qemu, using ioctl is much more easier and straightforward.
> >>
> > It's not introduced by me :)
> > mdev live migration is actually using this way, I'm just keeping
> > compatible to the uapi.
>
>
> I meant e.g VFIO_REGION_TYPE_MIGRATION.
>
>
> >
> > From my own perspective, my answer is that a region is more flexible
> > compared to ioctl. vendor driver can freely define the size,
> >
>
> Probably not since it's an ABI I think.

I think Kirti's thread proposing the migration interface is a better
place for this discussion; I believe Yan has already linked to it. In
general we prefer to be frugal in our introduction of new ioctls,
especially when we have existing mechanisms via regions to support the
interactions. The interface is designed to be flexible to the vendor
driver's needs, partially thanks to it being a region.

> > mmap cap of
> > its data subregion.
> >
>
> It doesn't help much unless it can be mapped into guest (which I don't
> think it was the case here).
>
> > Also, there're already too many ioctls in vfio.
>
> Probably not :) We had a brunch of  subsystems that have much more
> ioctls than VFIO. (e.g DRM)

And this is a good thing? We can more easily deprecate and revise
region support than we can take back ioctls that have been previously
used. I generally don't like the "let's create a new ioctl for that"
approach versus trying to fit something within the existing
architecture and convention.

> >>>>> (3) provide a dynamic trap bar info region to allow vendor driver
> >>>>> control trap/untrap of device pci bars
> >>>>>
> >>>>> This vfio-pci + mediate ops way differs from mdev way in that
> >>>>> (1) medv way needs to create a 1:1 mdev device on top of one VF, device
> >>>>> specific mdev parent driver is bound to VF directly.
> >>>>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
> >>>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
> >>>>>
> >>>>> The reason why we don't choose the way of writing mdev parent driver is
> >>>>> that
> >>>>> (1) VFs are almost all the time directly passthroughed. Directly binding
> >>>>> to vfio-pci can make most of the code shared/reused.
> >>>> Can we split out the common parts from vfio-pci?
> >>>>
> >>> That's very attractive. but one cannot implement a vfio-pci except
> >>> export everything in it as common part :)
> >>
> >> Well, I think there should be not hard to do that. E..g you can route it
> >> back to like:
> >>
> >> vfio -> vfio_mdev -> parent -> vfio_pci
> >>
> > it's desired for us to have mediate driver binding to PF device.
> > so once a VF device is created, only PF driver and vfio-pci are
> > required. Just the same as what needs to be done for a normal VF passthrough.
> > otherwise, a separate parent driver binding to VF is required.
> > Also, this parent driver has many drawbacks as I mentions in this
> > cover-letter.
>
> Well, as discussed, no need to duplicate the code, bar trick should
> still work. The main issues I saw with this proposal is:
>
> 1) PCI specific, other bus may need something similar

Propose how it could be implemented higher in the vfio stack to make it
device agnostic.

> 2) Function duplicated with mdev and mdev can do even more

mdev also comes with a device lifecycle interface that doesn't really
make sense when a driver is only trying to partially mediate a single
physical device rather than multiplex a physical device into virtual
devices. mdev would also require vendor drivers to re-implement
much of vfio-pci for the direct access mechanisms. Also, do we really
want users or management tools to decide between binding a device to
vfio-pci or a separate mdev driver to get this functionality? We've
already been burnt trying to use mdev beyond its scope.

> >>>>> If we write a
> >>>>> vendor specific mdev parent driver, most of the code (like passthrough
> >>>>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
> >>>>> actually a duplicated and tedious work.
> >>>> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
> >>>> me we need to consider live migration for mdev as well. In that case, do we
> >>>> still expect mediate ops through VFIO directly?
> >>>>
> >>>>
> >>>>> (2) For features like dynamically trap/untrap pci bars, if they are in
> >>>>> vfio-pci, they can be available to most people without repeated code
> >>>>> copying and re-testing.
> >>>>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
> >>>>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
> >>>>> it runs into a real migration need. However, if vfio-pci is bound
> >>>>> initially, they have no chance to do live migration when there's a need
> >>>>> later.
> >>>> We can teach management layer to do this.
> >>>>
> >>> No. not possible as vfio-pci by default has no migration region and
> >>> dirty page tracking needs vendor's mediation at least for most
> >>> passthrough devices now.
> >>
> >> I'm not quite sure I get here but in this case, just tech them to use
> >> the driver that has migration support?
> >>
> > That's a way, but as more and more passthrough devices have demands and
> > caps to do migration, will vfio-pci be used in future any more ?
>
>
> This should not be a problem:
> - If we introduce a common mdev for vfio-pci, we can just bind that
> driver always

There's too much of mdev that doesn't make sense for this usage model;
this is why Yi's proposed generic mdev PCI wrapper is only a sample
driver. I think we do not want to introduce user confusion regarding
which driver to use and there are outstanding non-singleton group
issues with mdev that don't seem worthwhile to resolve.

> - The most straightforward way to support dirty page tracking is done by
> IOMMU instead of device specific operations.

Of course, but it doesn't exist yet. We're attempting to design the
dirty page tracking in a way that's mostly transparent for current mdev
drivers, provides generic support for IOMMU-based dirty tracking,
and is extensible to the inevitability of vendor driver tracking. Thanks,

Alex

2019-12-06 21:23:17

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

On Fri, 6 Dec 2019 02:56:55 -0500
Yan Zhao <[email protected]> wrote:

> On Fri, Dec 06, 2019 at 07:55:19AM +0800, Alex Williamson wrote:
> > On Wed, 4 Dec 2019 22:25:36 -0500
> > Yan Zhao <[email protected]> wrote:
> >
> > > when vfio-pci is bound to a physical device, almost all the hardware
> > > resources are passthroughed.
> > > Sometimes, vendor driver of this physcial device may want to mediate some
> > > hardware resource access for a short period of time, e.g. dirty page
> > > tracking during live migration.
> > >
> > > Here we introduce mediate ops in vfio-pci for this purpose.
> > >
> > > Vendor driver can register a mediate ops to vfio-pci.
> > > But rather than directly bind to the passthroughed device, the
> > > vendor driver is now either a module that does not bind to any device or
> > > a module binds to other device.
> > > E.g. when passing through a VF device that is bound to vfio-pci modules,
> > > PF driver that binds to PF device can register to vfio-pci to mediate
> > > VF's regions, hence supporting VF live migration.
> > >
> > > The sequence goes like this:
> > > 1. Vendor driver register its vfio_pci_mediate_ops to vfio-pci driver
> > >
> > > 2. vfio-pci maintains a list of those registered vfio_pci_mediate_ops
> > >
> > > 3. Whenever vfio-pci opens a device, it searches the list and call
> > > vfio_pci_mediate_ops->open() to check whether a vendor driver supports
> > > mediating this device.
> > > Upon a success return value of from vfio_pci_mediate_ops->open(),
> > > vfio-pci will stop list searching and store a mediate handle to
> > > represent this open into vendor driver.
> > > (so if multiple vendor drivers support mediating a device through
> > > vfio_pci_mediate_ops, only one will win, depending on their registering
> > > sequence)
> > >
> > > 4. Whenever a VFIO_DEVICE_GET_REGION_INFO ioctl is received in vfio-pci
> > > ops, it will chain into vfio_pci_mediate_ops->get_region_info(), so that
> > > vendor driver is able to override a region's default flags and caps,
> > > e.g. adding a sparse mmap cap to passthrough only sub-regions of a whole
> > > region.
> > >
> > > 5. vfio_pci_rw()/vfio_pci_mmap() first calls into
> > > vfio_pci_mediate_ops->rw()/vfio_pci_mediate_ops->mmaps().
> > > if pt=true is rteturned, vfio_pci_rw()/vfio_pci_mmap() will further
> > > passthrough this read/write/mmap to physical device, otherwise it just
> > > returns without touch physical device.
> > >
> > > 6. When vfio-pci closes a device, vfio_pci_release() chains into
> > > vfio_pci_mediate_ops->release() to close the reference in vendor driver.
> > >
> > > 7. Vendor driver unregister its vfio_pci_mediate_ops when driver exits
> > >
> > > Cc: Kevin Tian <[email protected]>
> > >
> > > Signed-off-by: Yan Zhao <[email protected]>
> > > ---
> > > drivers/vfio/pci/vfio_pci.c | 146 ++++++++++++++++++++++++++++
> > > drivers/vfio/pci/vfio_pci_private.h | 2 +
> > > include/linux/vfio.h | 16 +++
> > > 3 files changed, 164 insertions(+)
> > >
> > > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > > index 02206162eaa9..55080ff29495 100644
> > > --- a/drivers/vfio/pci/vfio_pci.c
> > > +++ b/drivers/vfio/pci/vfio_pci.c
> > > @@ -54,6 +54,14 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> > > MODULE_PARM_DESC(disable_idle_d3,
> > > "Disable using the PCI D3 low power state for idle, unused devices");
> > >
> > > +static LIST_HEAD(mediate_ops_list);
> > > +static DEFINE_MUTEX(mediate_ops_list_lock);
> > > +struct vfio_pci_mediate_ops_list_entry {
> > > + struct vfio_pci_mediate_ops *ops;
> > > + int refcnt;
> > > + struct list_head next;
> > > +};
> > > +
> > > static inline bool vfio_vga_disabled(void)
> > > {
> > > #ifdef CONFIG_VFIO_PCI_VGA
> > > @@ -472,6 +480,10 @@ static void vfio_pci_release(void *device_data)
> > > if (!(--vdev->refcnt)) {
> > > vfio_spapr_pci_eeh_release(vdev->pdev);
> > > vfio_pci_disable(vdev);
> > > + if (vdev->mediate_ops && vdev->mediate_ops->release) {
> > > + vdev->mediate_ops->release(vdev->mediate_handle);
> > > + vdev->mediate_ops = NULL;
> > > + }
> > > }
> > >
> > > mutex_unlock(&vdev->reflck->lock);
> > > @@ -483,6 +495,7 @@ static int vfio_pci_open(void *device_data)
> > > {
> > > struct vfio_pci_device *vdev = device_data;
> > > int ret = 0;
> > > + struct vfio_pci_mediate_ops_list_entry *mentry;
> > >
> > > if (!try_module_get(THIS_MODULE))
> > > return -ENODEV;
> > > @@ -495,6 +508,30 @@ static int vfio_pci_open(void *device_data)
> > > goto error;
> > >
> > > vfio_spapr_pci_eeh_open(vdev->pdev);
> > > + mutex_lock(&mediate_ops_list_lock);
> > > + list_for_each_entry(mentry, &mediate_ops_list, next) {
> > > + u64 caps;
> > > + u32 handle;
> >
> > Wouldn't it seem likely that the ops provider might use this handle as
> > a pointer, so we'd want it to be an opaque void*?
> >
> yes, you are right, handle as a pointer is much better. will change it.
> Thanks :)
>
> > > +
> > > + memset(&caps, 0, sizeof(caps));
> >
> > @caps has no purpose here, add it if/when we do something with it.
> > It's also a standard type, why are we memset'ing it rather than just
> > =0??
> >
> > > + ret = mentry->ops->open(vdev->pdev, &caps, &handle);
> > > + if (!ret) {
> > > + vdev->mediate_ops = mentry->ops;
> > > + vdev->mediate_handle = handle;
> > > +
> > > + pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for %x:%x\n",
> > > + vdev->mediate_ops->name, caps,
> > > + handle, vdev->pdev->vendor,
> > > + vdev->pdev->device);
> >
> > Generally not advisable to make user accessible printks.
> >
> ok.
>
> > > + /*
> > > + * only find the first matching mediate_ops,
> > > + * and add its refcnt
> > > + */
> > > + mentry->refcnt++;
> > > + break;
> > > + }
> > > + }
> > > + mutex_unlock(&mediate_ops_list_lock);
> > > }
> > > vdev->refcnt++;
> > > error:
> > > @@ -736,6 +773,14 @@ static long vfio_pci_ioctl(void *device_data,
> > > info.size = pdev->cfg_size;
> > > info.flags = VFIO_REGION_INFO_FLAG_READ |
> > > VFIO_REGION_INFO_FLAG_WRITE;
> > > +
> > > + if (vdev->mediate_ops &&
> > > + vdev->mediate_ops->get_region_info) {
> > > + vdev->mediate_ops->get_region_info(
> > > + vdev->mediate_handle,
> > > + &info, &caps, NULL);
> > > + }
> >
> > These would be a lot cleaner if we could just call a helper function:
> >
> > void vfio_pci_region_info_mediation_hook(vdev, info, caps, etc...)
> > {
> > if (vdev->mediate_ops
> > vdev->mediate_ops->get_region_info)
> > vdev->mediate_ops->get_region_info(vdev->mediate_handle,
> > &info, &caps, NULL);
> > }
> >
> > I'm not thrilled with all these hooks, but not open coding every one of
> > them might help.
>
> ok. got it.
> >
> > > +
> > > break;
> > > case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> > > info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> > > @@ -756,6 +801,13 @@ static long vfio_pci_ioctl(void *device_data,
> > > }
> > > }
> > >
> > > + if (vdev->mediate_ops &&
> > > + vdev->mediate_ops->get_region_info) {
> > > + vdev->mediate_ops->get_region_info(
> > > + vdev->mediate_handle,
> > > + &info, &caps, NULL);
> > > + }
> > > +
> > > break;
> > > case VFIO_PCI_ROM_REGION_INDEX:
> > > {
> > > @@ -794,6 +846,14 @@ static long vfio_pci_ioctl(void *device_data,
> > > }
> > >
> > > pci_write_config_word(pdev, PCI_COMMAND, orig_cmd);
> > > +
> > > + if (vdev->mediate_ops &&
> > > + vdev->mediate_ops->get_region_info) {
> > > + vdev->mediate_ops->get_region_info(
> > > + vdev->mediate_handle,
> > > + &info, &caps, NULL);
> > > + }
> > > +
> > > break;
> > > }
> > > case VFIO_PCI_VGA_REGION_INDEX:
> > > @@ -805,6 +865,13 @@ static long vfio_pci_ioctl(void *device_data,
> > > info.flags = VFIO_REGION_INFO_FLAG_READ |
> > > VFIO_REGION_INFO_FLAG_WRITE;
> > >
> > > + if (vdev->mediate_ops &&
> > > + vdev->mediate_ops->get_region_info) {
> > > + vdev->mediate_ops->get_region_info(
> > > + vdev->mediate_handle,
> > > + &info, &caps, NULL);
> > > + }
> > > +
> > > break;
> > > default:
> > > {
> > > @@ -839,6 +906,13 @@ static long vfio_pci_ioctl(void *device_data,
> > > if (ret)
> > > return ret;
> > > }
> > > +
> > > + if (vdev->mediate_ops &&
> > > + vdev->mediate_ops->get_region_info) {
> > > + vdev->mediate_ops->get_region_info(
> > > + vdev->mediate_handle,
> > > + &info, &caps, &cap_type);
> > > + }
> > > }
> > > }
> > >
> > > @@ -1151,6 +1225,16 @@ static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
> > > if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
> > > return -EINVAL;
> > >
> > > + if (vdev->mediate_ops && vdev->mediate_ops->rw) {
> > > + int ret;
> > > + bool pt = true;
> > > +
> > > + ret = vdev->mediate_ops->rw(vdev->mediate_handle,
> > > + buf, count, ppos, iswrite, &pt);
> > > + if (!pt)
> > > + return ret;
> > > + }
> > > +
> > > switch (index) {
> > > case VFIO_PCI_CONFIG_REGION_INDEX:
> > > return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> > > @@ -1200,6 +1284,15 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> > > u64 phys_len, req_len, pgoff, req_start;
> > > int ret;
> > >
> > > + if (vdev->mediate_ops && vdev->mediate_ops->mmap) {
> > > + int ret;
> > > + bool pt = true;
> > > +
> > > + ret = vdev->mediate_ops->mmap(vdev->mediate_handle, vma, &pt);
> > > + if (!pt)
> > > + return ret;
> > > + }
> >
> > There must be a better way to do all these. Do we really want to call
> > into ops for every rw or mmap, have the vendor code decode a region,
> > and maybe or maybe not have it handle it? It's pretty ugly. Do we
>
> do you think below flow is good ?
> 1. in mediate_ops->open(), return
> (1) region[] indexed by region index, if a mediate driver supports mediating
> region[i], region[i].ops->get_region_info, regions[i].ops->rw, or
> regions[i].ops->mmap is not null.
> (2) irq_info[] indexed by irq index, if a mediate driver supports mediating
> irq_info[i], irq_info[i].ops->get_irq_info or irq_info[i].ops->set_irq_info
> is not null.
>
> Then, vfio_pci_rw/vfio_pci_mmap/vfio_pci_ioctl only call into those
> non-null hooks.

Or would it be better to always call into the hooks and allow the vendor
driver to selectively replace the hooks for the regions it
wants to mediate? For example, region[i].ops->rw could by default point
to vfio_pci_default_rw() and the mediation driver would have a
mechanism to replace that with its own vendorABC_vfio_pci_rw(). We
could export vfio_pci_default_rw() such that the vendor driver would be
responsible for calling it as necessary.
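
A rough sketch of that shape. Everything prefixed vendorABC_ is a placeholder,
the region_ops array is an assumed addition to struct vfio_pci_device, and
vfio_pci_default_rw() is assumed to be exported as suggested:

/* Sketch only, against the series' vfio_pci_private.h types. */
#include "vfio_pci_private.h"

struct vfio_pci_region_ops {
	ssize_t (*rw)(struct vfio_pci_device *vdev, char __user *buf,
		      size_t count, loff_t *ppos, bool iswrite);
	/* ->mmap and ->get_region_info could follow the same pattern */
};

/* Vendor override: mediate the interesting registers, defer the rest. */
static ssize_t vendorABC_vfio_pci_rw(struct vfio_pci_device *vdev,
				     char __user *buf, size_t count,
				     loff_t *ppos, bool iswrite)
{
	if (vendorABC_wants_to_trap(*ppos, count, iswrite))
		return vendorABC_emulate_rw(vdev, buf, count, ppos, iswrite);

	/* the exported default handler does the normal pass-through access */
	return vfio_pci_default_rw(vdev, buf, count, ppos, iswrite);
}

/* At open time the vendor driver swaps in its hook for BAR0 only;
 * every other region keeps the default. */
static void vendorABC_open(struct vfio_pci_device *vdev)
{
	vdev->region_ops[VFIO_PCI_BAR0_REGION_INDEX].rw = vendorABC_vfio_pci_rw;
}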

> > need the mediation provider to be able to dynamically setup the ops per
> May I confirm that you are not saying dynamic registering mediate ops
> after vfio-pci already opened a device, right?

I'm not necessarily excluding or advocating for that.

> > region and export the default handlers out for them to call?
> >
> could we still keep checking return value of the hooks rather than
> export default handlers? Otherwise at least vfio_pci_default_ioctl(),
> vfio_pci_default_rw(), and vfio_pci_default_mmap() need to be exported.

The ugliness of vfio-pci having all these vendor branches is what I'm
trying to avoid, so I really am not a fan of the idea or mechanism that
the vfio-pci core code is directly involving a mediation driver and
handling the return for every entry point.

> > > +
> > > index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> > >
> > > if (vma->vm_end < vma->vm_start)
> > > @@ -1629,8 +1722,17 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
> > >
> > > static void __exit vfio_pci_cleanup(void)
> > > {
> > > + struct vfio_pci_mediate_ops_list_entry *mentry, *n;
> > > +
> > > pci_unregister_driver(&vfio_pci_driver);
> > > vfio_pci_uninit_perm_bits();
> > > +
> > > + mutex_lock(&mediate_ops_list_lock);
> > > + list_for_each_entry_safe(mentry, n, &mediate_ops_list, next) {
> > > + list_del(&mentry->next);
> > > + kfree(mentry);
> > > + }
> > > + mutex_unlock(&mediate_ops_list_lock);
> >
> > Is it even possible to unload vfio-pci while there are mediation
> > drivers registered? I don't think the module interactions are well
> > thought out here, ex. do you really want i40e to have build and runtime
> > dependencies on vfio-pci? I don't think so.
> >
> Currently, yes, i40e has build dependency on vfio-pci.
> It's like this, if i40e decides to support SRIOV and compiles in vf
> related code who depends on vfio-pci, it will also have build dependency
> on vfio-pci. isn't it natural?

No, this is not natural. There are certainly i40e VF use cases that
have no interest in vfio and having dependencies between the two
modules is unacceptable. I think you probably want to modularize the
i40e vfio support code and then perhaps register a table in vfio-pci
so that the vfio-pci code can perform a module request when using a
compatible device. Just an idea; there might be better options. I
will not accept a solution that requires unloading the i40e driver in
order to unload the vfio-pci driver. It's inconvenient with just one
NIC driver, imagine how poorly that scales.
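
A rough sketch of that table idea. The table, the helper and the module name
below are illustrative only, and 0x154c is assumed to be an XL710 VF device
ID; the point is that vfio-pci keeps no build-time dependency and pulls in a
mediation module on demand:

#include <linux/kmod.h>
#include <linux/pci.h>

/* Sketch only: a match table vfio-pci could consult at open time. */
static const struct vfio_pci_mediator_id {
	unsigned short vendor;
	unsigned short device;
	const char *module;
} vfio_pci_mediator_table[] = {
	{ PCI_VENDOR_ID_INTEL, 0x154c, "i40e_vf_migration" },
	{}
};

static void vfio_pci_request_mediator(struct pci_dev *pdev)
{
	const struct vfio_pci_mediator_id *id;

	for (id = vfio_pci_mediator_table; id->module; id++) {
		if (pdev->vendor == id->vendor &&
		    pdev->device == id->device) {
			/* loads the module if present; unloading vfio-pci
			 * never has to wait for the NIC driver */
			request_module(id->module);
			break;
		}
	}
}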

> > > }
> > >
> > > static void __init vfio_pci_fill_ids(void)
> > > @@ -1697,6 +1799,50 @@ static int __init vfio_pci_init(void)
> > > return ret;
> > > }
> > >
> > > +int vfio_pci_register_mediate_ops(struct vfio_pci_mediate_ops *ops)
> > > +{
> > > + struct vfio_pci_mediate_ops_list_entry *mentry;
> > > +
> > > + mutex_lock(&mediate_ops_list_lock);
> > > + mentry = kzalloc(sizeof(*mentry), GFP_KERNEL);
> > > + if (!mentry) {
> > > + mutex_unlock(&mediate_ops_list_lock);
> > > + return -ENOMEM;
> > > + }
> > > +
> > > + mentry->ops = ops;
> > > + mentry->refcnt = 0;
> >
> > It's kZalloc'd, this is unnecessary.
> >
> right :)
> > > + list_add(&mentry->next, &mediate_ops_list);
> >
> > Check for duplicates?
> >
> ok. will do it.
> > > +
> > > + pr_info("registered dm ops %s\n", ops->name);
> > > + mutex_unlock(&mediate_ops_list_lock);
> > > +
> > > + return 0;
> > > +}
> > > +EXPORT_SYMBOL(vfio_pci_register_mediate_ops);
> > > +
> > > +void vfio_pci_unregister_mediate_ops(struct vfio_pci_mediate_ops *ops)
> > > +{
> > > + struct vfio_pci_mediate_ops_list_entry *mentry, *n;
> > > +
> > > + mutex_lock(&mediate_ops_list_lock);
> > > + list_for_each_entry_safe(mentry, n, &mediate_ops_list, next) {
> > > + if (mentry->ops != ops)
> > > + continue;
> > > +
> > > + mentry->refcnt--;
> >
> > Whose reference is this removing?
> >
> I intended to prevent mediate driver from calling unregister mediate ops
> while there're still opened devices in it.
> after a successful mediate_ops->open(), mentry->refcnt++.
> after calling mediate_ops->release(). mentry->refcnt--.
>
> (seems in this RFC, I missed a mentry->refcnt-- after calling
> mediate_ops->release())
>
>
> > > + if (!mentry->refcnt) {
> > > + list_del(&mentry->next);
> > > + kfree(mentry);
> > > + } else
> > > + pr_err("vfio_pci unregister mediate ops %s error\n",
> > > + mentry->ops->name);
> >
> > This is bad, we should hold a reference to the module providing these
> > ops for each use of it such that the module cannot be removed while
> > it's in use. Otherwise we enter a very bad state here and it's
> > trivially accessible by an admin remove the module while in use.
> mediate driver is supposed to ref its own module on a success
> mediate_ops->open(), and deref its own module on mediate_ops->release().
> so, it can't be accidentally removed.

Where was that semantic expressed in this series? We should create
interfaces that are hard to use incorrectly. It is far too easy for a
vendor driver to overlook such a requirement, which means fixing the
same bugs repeatedly for each vendor. It needs to be improved. Thanks,

Alex

2019-12-06 23:14:17

by Eric Blake

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

On 12/4/19 9:25 PM, Yan Zhao wrote:
> when vfio-pci is bound to a physical device, almost all the hardware
> resources are passthroughed.

The intent is obvious, but it sounds awkward to a native speaker.
s/passthroughed/passed through/

> Sometimes, vendor driver of this physcial device may want to mediate some

physical

> hardware resource access for a short period of time, e.g. dirty page
> tracking during live migration.
>
> Here we introduce mediate ops in vfio-pci for this purpose.
>
> Vendor driver can register a mediate ops to vfio-pci.
> But rather than directly bind to the passthroughed device, the

passed-through

--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org

2019-12-09 03:28:02

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

Sorry about that. I'll pay attention to them next time and thank you for
pointing them out :)

On Sat, Dec 07, 2019 at 07:13:30AM +0800, Eric Blake wrote:
> On 12/4/19 9:25 PM, Yan Zhao wrote:
> > when vfio-pci is bound to a physical device, almost all the hardware
> > resources are passthroughed.
>
> The intent is obvious, but it sounds awkward to a native speaker.
> s/passthroughed/passed through/
>
> > Sometimes, vendor driver of this physcial device may want to mediate some
>
> physical
>
> > hardware resource access for a short period of time, e.g. dirty page
> > tracking during live migration.
> >
> > Here we introduce mediate ops in vfio-pci for this purpose.
> >
> > Vendor driver can register a mediate ops to vfio-pci.
> > But rather than directly bind to the passthroughed device, the
>
> passed-through
>
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc. +1-919-301-3226
> Virtualization: qemu.org | libvirt.org
>

2019-12-09 03:52:34

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

On Sat, Dec 07, 2019 at 05:22:26AM +0800, Alex Williamson wrote:
> On Fri, 6 Dec 2019 02:56:55 -0500
> Yan Zhao <[email protected]> wrote:
>
> > On Fri, Dec 06, 2019 at 07:55:19AM +0800, Alex Williamson wrote:
> > > On Wed, 4 Dec 2019 22:25:36 -0500
> > > Yan Zhao <[email protected]> wrote:
> > >
> > > > when vfio-pci is bound to a physical device, almost all the hardware
> > > > resources are passthroughed.
> > > > Sometimes, vendor driver of this physcial device may want to mediate some
> > > > hardware resource access for a short period of time, e.g. dirty page
> > > > tracking during live migration.
> > > >
> > > > Here we introduce mediate ops in vfio-pci for this purpose.
> > > >
> > > > Vendor driver can register a mediate ops to vfio-pci.
> > > > But rather than directly bind to the passthroughed device, the
> > > > vendor driver is now either a module that does not bind to any device or
> > > > a module binds to other device.
> > > > E.g. when passing through a VF device that is bound to vfio-pci modules,
> > > > PF driver that binds to PF device can register to vfio-pci to mediate
> > > > VF's regions, hence supporting VF live migration.
> > > >
> > > > The sequence goes like this:
> > > > 1. Vendor driver register its vfio_pci_mediate_ops to vfio-pci driver
> > > >
> > > > 2. vfio-pci maintains a list of those registered vfio_pci_mediate_ops
> > > >
> > > > 3. Whenever vfio-pci opens a device, it searches the list and call
> > > > vfio_pci_mediate_ops->open() to check whether a vendor driver supports
> > > > mediating this device.
> > > > Upon a success return value of from vfio_pci_mediate_ops->open(),
> > > > vfio-pci will stop list searching and store a mediate handle to
> > > > represent this open into vendor driver.
> > > > (so if multiple vendor drivers support mediating a device through
> > > > vfio_pci_mediate_ops, only one will win, depending on their registering
> > > > sequence)
> > > >
> > > > 4. Whenever a VFIO_DEVICE_GET_REGION_INFO ioctl is received in vfio-pci
> > > > ops, it will chain into vfio_pci_mediate_ops->get_region_info(), so that
> > > > vendor driver is able to override a region's default flags and caps,
> > > > e.g. adding a sparse mmap cap to passthrough only sub-regions of a whole
> > > > region.
> > > >
> > > > 5. vfio_pci_rw()/vfio_pci_mmap() first calls into
> > > > vfio_pci_mediate_ops->rw()/vfio_pci_mediate_ops->mmaps().
> > > > if pt=true is rteturned, vfio_pci_rw()/vfio_pci_mmap() will further
> > > > passthrough this read/write/mmap to physical device, otherwise it just
> > > > returns without touch physical device.
> > > >
> > > > 6. When vfio-pci closes a device, vfio_pci_release() chains into
> > > > vfio_pci_mediate_ops->release() to close the reference in vendor driver.
> > > >
> > > > 7. Vendor driver unregister its vfio_pci_mediate_ops when driver exits
> > > >
> > > > Cc: Kevin Tian <[email protected]>
> > > >
> > > > Signed-off-by: Yan Zhao <[email protected]>
> > > > ---
> > > > drivers/vfio/pci/vfio_pci.c | 146 ++++++++++++++++++++++++++++
> > > > drivers/vfio/pci/vfio_pci_private.h | 2 +
> > > > include/linux/vfio.h | 16 +++
> > > > 3 files changed, 164 insertions(+)
> > > >
> > > > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > > > index 02206162eaa9..55080ff29495 100644
> > > > --- a/drivers/vfio/pci/vfio_pci.c
> > > > +++ b/drivers/vfio/pci/vfio_pci.c
> > > > @@ -54,6 +54,14 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> > > > MODULE_PARM_DESC(disable_idle_d3,
> > > > "Disable using the PCI D3 low power state for idle, unused devices");
> > > >
> > > > +static LIST_HEAD(mediate_ops_list);
> > > > +static DEFINE_MUTEX(mediate_ops_list_lock);
> > > > +struct vfio_pci_mediate_ops_list_entry {
> > > > + struct vfio_pci_mediate_ops *ops;
> > > > + int refcnt;
> > > > + struct list_head next;
> > > > +};
> > > > +
> > > > static inline bool vfio_vga_disabled(void)
> > > > {
> > > > #ifdef CONFIG_VFIO_PCI_VGA
> > > > @@ -472,6 +480,10 @@ static void vfio_pci_release(void *device_data)
> > > > if (!(--vdev->refcnt)) {
> > > > vfio_spapr_pci_eeh_release(vdev->pdev);
> > > > vfio_pci_disable(vdev);
> > > > + if (vdev->mediate_ops && vdev->mediate_ops->release) {
> > > > + vdev->mediate_ops->release(vdev->mediate_handle);
> > > > + vdev->mediate_ops = NULL;
> > > > + }
> > > > }
> > > >
> > > > mutex_unlock(&vdev->reflck->lock);
> > > > @@ -483,6 +495,7 @@ static int vfio_pci_open(void *device_data)
> > > > {
> > > > struct vfio_pci_device *vdev = device_data;
> > > > int ret = 0;
> > > > + struct vfio_pci_mediate_ops_list_entry *mentry;
> > > >
> > > > if (!try_module_get(THIS_MODULE))
> > > > return -ENODEV;
> > > > @@ -495,6 +508,30 @@ static int vfio_pci_open(void *device_data)
> > > > goto error;
> > > >
> > > > vfio_spapr_pci_eeh_open(vdev->pdev);
> > > > + mutex_lock(&mediate_ops_list_lock);
> > > > + list_for_each_entry(mentry, &mediate_ops_list, next) {
> > > > + u64 caps;
> > > > + u32 handle;
> > >
> > > Wouldn't it seem likely that the ops provider might use this handle as
> > > a pointer, so we'd want it to be an opaque void*?
> > >
> > yes, you are right, handle as a pointer is much better. will change it.
> > Thanks :)
> >
> > > > +
> > > > + memset(&caps, 0, sizeof(caps));
> > >
> > > @caps has no purpose here, add it if/when we do something with it.
> > > It's also a standard type, why are we memset'ing it rather than just
> > > =0??
> > >
> > > > + ret = mentry->ops->open(vdev->pdev, &caps, &handle);
> > > > + if (!ret) {
> > > > + vdev->mediate_ops = mentry->ops;
> > > > + vdev->mediate_handle = handle;
> > > > +
> > > > + pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for %x:%x\n",
> > > > + vdev->mediate_ops->name, caps,
> > > > + handle, vdev->pdev->vendor,
> > > > + vdev->pdev->device);
> > >
> > > Generally not advisable to make user accessible printks.
> > >
> > ok.
> >
> > > > + /*
> > > > + * only find the first matching mediate_ops,
> > > > + * and add its refcnt
> > > > + */
> > > > + mentry->refcnt++;
> > > > + break;
> > > > + }
> > > > + }
> > > > + mutex_unlock(&mediate_ops_list_lock);
> > > > }
> > > > vdev->refcnt++;
> > > > error:
> > > > @@ -736,6 +773,14 @@ static long vfio_pci_ioctl(void *device_data,
> > > > info.size = pdev->cfg_size;
> > > > info.flags = VFIO_REGION_INFO_FLAG_READ |
> > > > VFIO_REGION_INFO_FLAG_WRITE;
> > > > +
> > > > + if (vdev->mediate_ops &&
> > > > + vdev->mediate_ops->get_region_info) {
> > > > + vdev->mediate_ops->get_region_info(
> > > > + vdev->mediate_handle,
> > > > + &info, &caps, NULL);
> > > > + }
> > >
> > > These would be a lot cleaner if we could just call a helper function:
> > >
> > > void vfio_pci_region_info_mediation_hook(vdev, info, caps, etc...)
> > > {
> > > if (vdev->mediate_ops
> > > vdev->mediate_ops->get_region_info)
> > > vdev->mediate_ops->get_region_info(vdev->mediate_handle,
> > > &info, &caps, NULL);
> > > }
> > >
> > > I'm not thrilled with all these hooks, but not open coding every one of
> > > them might help.
> >
> > ok. got it.
> > >
> > > > +
> > > > break;
> > > > case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> > > > info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> > > > @@ -756,6 +801,13 @@ static long vfio_pci_ioctl(void *device_data,
> > > > }
> > > > }
> > > >
> > > > + if (vdev->mediate_ops &&
> > > > + vdev->mediate_ops->get_region_info) {
> > > > + vdev->mediate_ops->get_region_info(
> > > > + vdev->mediate_handle,
> > > > + &info, &caps, NULL);
> > > > + }
> > > > +
> > > > break;
> > > > case VFIO_PCI_ROM_REGION_INDEX:
> > > > {
> > > > @@ -794,6 +846,14 @@ static long vfio_pci_ioctl(void *device_data,
> > > > }
> > > >
> > > > pci_write_config_word(pdev, PCI_COMMAND, orig_cmd);
> > > > +
> > > > + if (vdev->mediate_ops &&
> > > > + vdev->mediate_ops->get_region_info) {
> > > > + vdev->mediate_ops->get_region_info(
> > > > + vdev->mediate_handle,
> > > > + &info, &caps, NULL);
> > > > + }
> > > > +
> > > > break;
> > > > }
> > > > case VFIO_PCI_VGA_REGION_INDEX:
> > > > @@ -805,6 +865,13 @@ static long vfio_pci_ioctl(void *device_data,
> > > > info.flags = VFIO_REGION_INFO_FLAG_READ |
> > > > VFIO_REGION_INFO_FLAG_WRITE;
> > > >
> > > > + if (vdev->mediate_ops &&
> > > > + vdev->mediate_ops->get_region_info) {
> > > > + vdev->mediate_ops->get_region_info(
> > > > + vdev->mediate_handle,
> > > > + &info, &caps, NULL);
> > > > + }
> > > > +
> > > > break;
> > > > default:
> > > > {
> > > > @@ -839,6 +906,13 @@ static long vfio_pci_ioctl(void *device_data,
> > > > if (ret)
> > > > return ret;
> > > > }
> > > > +
> > > > + if (vdev->mediate_ops &&
> > > > + vdev->mediate_ops->get_region_info) {
> > > > + vdev->mediate_ops->get_region_info(
> > > > + vdev->mediate_handle,
> > > > + &info, &caps, &cap_type);
> > > > + }
> > > > }
> > > > }
> > > >
> > > > @@ -1151,6 +1225,16 @@ static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
> > > > if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
> > > > return -EINVAL;
> > > >
> > > > + if (vdev->mediate_ops && vdev->mediate_ops->rw) {
> > > > + int ret;
> > > > + bool pt = true;
> > > > +
> > > > + ret = vdev->mediate_ops->rw(vdev->mediate_handle,
> > > > + buf, count, ppos, iswrite, &pt);
> > > > + if (!pt)
> > > > + return ret;
> > > > + }
> > > > +
> > > > switch (index) {
> > > > case VFIO_PCI_CONFIG_REGION_INDEX:
> > > > return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> > > > @@ -1200,6 +1284,15 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> > > > u64 phys_len, req_len, pgoff, req_start;
> > > > int ret;
> > > >
> > > > + if (vdev->mediate_ops && vdev->mediate_ops->mmap) {
> > > > + int ret;
> > > > + bool pt = true;
> > > > +
> > > > + ret = vdev->mediate_ops->mmap(vdev->mediate_handle, vma, &pt);
> > > > + if (!pt)
> > > > + return ret;
> > > > + }
> > >
> > > There must be a better way to do all these. Do we really want to call
> > > into ops for every rw or mmap, have the vendor code decode a region,
> > > and maybe or maybe not have it handle it? It's pretty ugly. Do we
> >
> > do you think below flow is good ?
> > 1. in mediate_ops->open(), return
> > (1) region[] indexed by region index, if a mediate driver supports mediating
> > region[i], region[i].ops->get_region_info, regions[i].ops->rw, or
> > regions[i].ops->mmap is not null.
> > (2) irq_info[] indexed by irq index, if a mediate driver supports mediating
> > irq_info[i], irq_info[i].ops->get_irq_info or irq_info[i].ops->set_irq_info
> > is not null.
> >
> > Then, vfio_pci_rw/vfio_pci_mmap/vfio_pci_ioctl only call into those
> > non-null hooks.
>
> Or would it be better to always call into the hooks and the vendor
> driver is allowed to selectively replace the hooks for regions they
> want to mediate. For example, region[i].ops->rw could by default point
> to vfio_pci_default_rw() and the mediation driver would have a
> mechanism to replace that with its own vendorABC_vfio_pci_rw(). We
> could export vfio_pci_default_rw() such that the vendor driver would be
> responsible for calling it as necessary.
>
good idea :)
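
A rough sketch of that shape (the vendorABC_* names are hypothetical, and
vfio_pci_default_rw() would have to be exported for this to work):

	/* per-region ops, initialized by vfio-pci to the defaults */
	struct vfio_pci_region_ops {
		ssize_t (*rw)(struct vfio_pci_device *vdev, char __user *buf,
			      size_t count, loff_t *ppos, bool iswrite);
		/* get_region_info, mmap, ... */
	};

	/* vendor driver replaces ->rw only for the regions it mediates */
	static ssize_t vendorABC_vfio_pci_bar0_rw(struct vfio_pci_device *vdev,
						  char __user *buf, size_t count,
						  loff_t *ppos, bool iswrite)
	{
		if (vendorABC_offset_is_mediated(*ppos, count))
			return vendorABC_emulate_rw(vdev, buf, count,
						    ppos, iswrite);
		/* everything else goes through the exported default */
		return vfio_pci_default_rw(vdev, buf, count, ppos, iswrite);
	}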

> > > need the mediation provider to be able to dynamically setup the ops per
> > May I confirm that you are not saying dynamic registering mediate ops
> > after vfio-pci already opened a device, right?
>
> I'm not necessarily excluding or advocating for that.
>
ok. got it.

> > > region and export the default handlers out for them to call?
> > >
> > could we still keep checking return value of the hooks rather than
> > export default handlers? Otherwise at least vfio_pci_default_ioctl(),
> > vfio_pci_default_rw(), and vfio_pci_default_mmap() need to be exported.
>
> The ugliness of vfio-pci having all these vendor branches is what I'm
> trying to avoid, so I really am not a fan of the idea or mechanism that
> the vfio-pci core code is directly involving a mediation driver and
> handling the return for every entry point.
>
I see :)
> > > > +
> > > > index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> > > >
> > > > if (vma->vm_end < vma->vm_start)
> > > > @@ -1629,8 +1722,17 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
> > > >
> > > > static void __exit vfio_pci_cleanup(void)
> > > > {
> > > > + struct vfio_pci_mediate_ops_list_entry *mentry, *n;
> > > > +
> > > > pci_unregister_driver(&vfio_pci_driver);
> > > > vfio_pci_uninit_perm_bits();
> > > > +
> > > > + mutex_lock(&mediate_ops_list_lock);
> > > > + list_for_each_entry_safe(mentry, n, &mediate_ops_list, next) {
> > > > + list_del(&mentry->next);
> > > > + kfree(mentry);
> > > > + }
> > > > + mutex_unlock(&mediate_ops_list_lock);
> > >
> > > Is it even possible to unload vfio-pci while there are mediation
> > > drivers registered? I don't think the module interactions are well
> > > thought out here, ex. do you really want i40e to have build and runtime
> > > dependencies on vfio-pci? I don't think so.
> > >
> > Currently, yes, i40e has build dependency on vfio-pci.
> > It's like this, if i40e decides to support SRIOV and compiles in vf
> > related code who depends on vfio-pci, it will also have build dependency
> > on vfio-pci. isn't it natural?
>
> No, this is not natural. There are certainly i40e VF use cases that
> have no interest in vfio and having dependencies between the two
> modules is unacceptable. I think you probably want to modularize the
> i40e vfio support code and then perhaps register a table in vfio-pci
> that the vfio-pci code can perform a module request when using a
> compatible device. Just and idea, there might be better options. I
> will not accept a solution that requires unloading the i40e driver in
> order to unload the vfio-pci driver. It's inconvenient with just one
> NIC driver, imagine how poorly that scales.
>
what about this way:
the mediate driver registers a module notifier, and every time vfio_pci
is loaded it registers its mediate ops with vfio_pci
(just like in the sample code below).
This way vfio-pci is free to unload, and this registration only gives
vfio-pci the name of the module to request.
After that,
in vfio_pci_open(), vfio-pci requests the mediate driver (or puts the
mediate driver if the mediate driver does not support mediating the
device);
in vfio_pci_release(), vfio-pci puts the mediate driver.

static void register_mediate_ops(void)
{
	int (*func)(struct vfio_pci_mediate_ops *ops) = NULL;

	/* look up the symbol only if vfio_pci is currently loaded */
	func = symbol_get(vfio_pci_register_mediate_ops);

	if (func) {
		func(&igd_dt_ops);
		symbol_put(vfio_pci_register_mediate_ops);
	}
}

static int igd_module_notify(struct notifier_block *self,
			     unsigned long val, void *data)
{
	struct module *mod = data;
	int ret = 0;

	switch (val) {
	case MODULE_STATE_LIVE:
		/* re-register whenever vfio_pci is (re)loaded */
		if (!strcmp(mod->name, "vfio_pci"))
			register_mediate_ops();
		break;
	case MODULE_STATE_GOING:
		break;
	default:
		break;
	}
	return ret;
}

static struct notifier_block igd_module_nb = {
	.notifier_call = igd_module_notify,
	.priority = 0,
};

static int __init igd_dt_init(void)
{
	...
	/* cover the case where vfio_pci is already loaded */
	register_mediate_ops();
	register_module_notifier(&igd_module_nb);
	...
	return 0;
}


> > > > }
> > > >
> > > > static void __init vfio_pci_fill_ids(void)
> > > > @@ -1697,6 +1799,50 @@ static int __init vfio_pci_init(void)
> > > > return ret;
> > > > }
> > > >
> > > > +int vfio_pci_register_mediate_ops(struct vfio_pci_mediate_ops *ops)
> > > > +{
> > > > + struct vfio_pci_mediate_ops_list_entry *mentry;
> > > > +
> > > > + mutex_lock(&mediate_ops_list_lock);
> > > > + mentry = kzalloc(sizeof(*mentry), GFP_KERNEL);
> > > > + if (!mentry) {
> > > > + mutex_unlock(&mediate_ops_list_lock);
> > > > + return -ENOMEM;
> > > > + }
> > > > +
> > > > + mentry->ops = ops;
> > > > + mentry->refcnt = 0;
> > >
> > > It's kZalloc'd, this is unnecessary.
> > >
> > right :)
> > > > + list_add(&mentry->next, &mediate_ops_list);
> > >
> > > Check for duplicates?
> > >
> > ok. will do it.
> > > > +
> > > > + pr_info("registered dm ops %s\n", ops->name);
> > > > + mutex_unlock(&mediate_ops_list_lock);
> > > > +
> > > > + return 0;
> > > > +}
> > > > +EXPORT_SYMBOL(vfio_pci_register_mediate_ops);
> > > > +
> > > > +void vfio_pci_unregister_mediate_ops(struct vfio_pci_mediate_ops *ops)
> > > > +{
> > > > + struct vfio_pci_mediate_ops_list_entry *mentry, *n;
> > > > +
> > > > + mutex_lock(&mediate_ops_list_lock);
> > > > + list_for_each_entry_safe(mentry, n, &mediate_ops_list, next) {
> > > > + if (mentry->ops != ops)
> > > > + continue;
> > > > +
> > > > + mentry->refcnt--;
> > >
> > > Whose reference is this removing?
> > >
> > I intended to prevent mediate driver from calling unregister mediate ops
> > while there're still opened devices in it.
> > after a successful mediate_ops->open(), mentry->refcnt++.
> > after calling mediate_ops->release(). mentry->refcnt--.
> >
> > (seems in this RFC, I missed a mentry->refcnt-- after calling
> > mediate_ops->release())
> >
> >
> > > > + if (!mentry->refcnt) {
> > > > + list_del(&mentry->next);
> > > > + kfree(mentry);
> > > > + } else
> > > > + pr_err("vfio_pci unregister mediate ops %s error\n",
> > > > + mentry->ops->name);
> > >
> > > This is bad, we should hold a reference to the module providing these
> > > ops for each use of it such that the module cannot be removed while
> > > it's in use. Otherwise we enter a very bad state here and it's
> > > trivially accessible by an admin remove the module while in use.
> > mediate driver is supposed to ref its own module on a success
> > mediate_ops->open(), and deref its own module on mediate_ops->release().
> > so, it can't be accidentally removed.
>
> Where was that semantic expressed in this series? We should create
> interfaces that are hard to use incorrectly. It is far too easy for a
> vendor driver to overlook such a requirement, which means fixing the
> same bugs repeatedly for each vendor. It needs to be improved. Thanks,

right. will improve it.

Thanks
Yan

2019-12-09 06:37:13

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:
> On Fri, 6 Dec 2019 01:04:07 -0500
> Yan Zhao <[email protected]> wrote:
>
> > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> > > On Wed, 4 Dec 2019 22:26:50 -0500
> > > Yan Zhao <[email protected]> wrote:
> > >
> > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > communicate dynamic trap info. It is of type
> > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > >
> > > > This region has two fields: dt_fd and trap.
> > > > When QEMU detects a device regions of this type, it will create an
> > > > eventfd and write its eventfd id to dt_fd field.
> > > > When vendor drivre signals this eventfd, QEMU reads trap field of this
> > > > info region.
> > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > mmaped subregion is disablable).
> > > > - If trap is false, QEMU would re-enable those subregions.
> > > >
> > > > A typical usage is
> > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > sparse mmap array. So initally, all its bar 0 are passthroughed.
> > > > 2. vendor driver specifys part of bar 0 sections to be disablable.
> > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > QEMU reads trap field of this info region which is false and QEMU
> > > > re-passthrough the whole bar 0 region.
> > > >
> > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > vfio_pci_mediate_ops->open().
> > > >
> > > > If vfio-pci detects this cap, it will create a default
> > > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > > and region->ops=null.
> > > > Vvendor driver should override this region's len, flags, rw, mmap in its
> > > > vfio_pci_mediate_ops.
> > >
> > > TBH, I don't like this interface at all. Userspace doesn't pass data
> > > to the kernel via INFO ioctls. We have a SET_IRQS ioctl for
> > > configuring user signaling with eventfds. I think we only need to
> > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > information for a region. The user would enumerate the device IRQs via
> > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > indicate which region(s) should be re-evaluated on signaling. The user
> > > would enable that signaling via SET_IRQS and simply re-evaluate the
> > ok. I'll try to switch to this way. Thanks for this suggestion.
> >
> > > sparse mmap capability for the associated regions when signaled.
> >
> > Do you like the "disablable" flag of sparse mmap ?
> > I think it's a lightweight way for user to switch mmap state of a whole region,
> > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > region might be too heavy.
>
> No, I don't like the disable-able flag. At what frequency do we expect
> regions to change? It seems like we'd only change when switching into
> and out of the _SAVING state, which is rare. It seems easy for
> userspace, at least QEMU, to drop the entire mmap configuration and
ok. I'll try this way.

> re-read it. Another concern here is how do we synchronize the event?
> Are we assuming that this event would occur when a user switch to
> _SAVING mode on the device? That operation is synchronous, the device
> must be in saving mode after the write to device state completes, but
> it seems like this might be trying to add an asynchronous dependency.
> Will the write to device_state only complete once the user handles the
> eventfd? How would the kernel know when the mmap re-evaluation is
> complete. It seems like there are gaps here that the vendor driver
> could miss traps required for migration because the user hasn't
> completed the mmap transition yet. Thanks,
>
> Alex

yes, this asynchronous event notification can cause the vendor driver to
miss traps. But the window is supposed to be very short, which is also a
reason we want the re-evaluation to be lightweight. E.g. if it can finish
before the first iteration, it's still safe.

But I agree, the timing is not guaranteed, so it's best for the kernel
to wait for the mmap re-evaluation to complete.

migration_thread
|->qemu_savevm_state_setup
| |->ram_save_setup
| | |->migration_bitmap_sync
| | |->kvm_log_sync
| | |->vfio_log_sync
| |
| |->vfio_save_setup
| |->set_device_state(_SAVING)
|
|->qemu_savevm_state_pending
| |->ram_save_pending
| | |->migration_bitmap_sync
| | |->kvm_log_sync
| | |->vfio_log_sync
| |->vfio_save_pending
|
|->qemu_savevm_state_iterate
| |->ram_save_iterate //send pages
| |->vfio_save_iterate
...


Actually, we previously let QEMU trigger the re-evaluation when migration starts.
The reason we now wish the kernel to trigger the mmap re-evaluation is that
there are two other possible use cases:
(1) keep passing through devices when migration starts and track dirty pages
using the hardware IOMMU. Then, when migration is about to complete, stop the
device and start trapping PCI BARs for software emulation. (We made some
changes to let the device stop ahead of the vCPUs.)
(2) performance optimization. There's an example in GVT (the mdev case):
PCI BARs are passed through on vGPU initialization and are mmapped to a host
dummy buffer. Then, after initialization is done, trapping of the vGPUs' PCI
BARs begins and normal host mediation starts. The initial pass-through can save
about 1,000,000 MMIO traps.

Thanks
Yan



2019-12-09 21:17:19

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

On Mon, 9 Dec 2019 01:22:12 -0500
Yan Zhao <[email protected]> wrote:

> On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:
> > On Fri, 6 Dec 2019 01:04:07 -0500
> > Yan Zhao <[email protected]> wrote:
> >
> > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> > > > On Wed, 4 Dec 2019 22:26:50 -0500
> > > > Yan Zhao <[email protected]> wrote:
> > > >
> > > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > > communicate dynamic trap info. It is of type
> > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > >
> > > > > This region has two fields: dt_fd and trap.
> > > > > When QEMU detects a device regions of this type, it will create an
> > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > When vendor drivre signals this eventfd, QEMU reads trap field of this
> > > > > info region.
> > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > > mmaped subregion is disablable).
> > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > >
> > > > > A typical usage is
> > > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > > sparse mmap array. So initally, all its bar 0 are passthroughed.
> > > > > 2. vendor driver specifys part of bar 0 sections to be disablable.
> > > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > re-passthrough the whole bar 0 region.
> > > > >
> > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > vfio_pci_mediate_ops->open().
> > > > >
> > > > > If vfio-pci detects this cap, it will create a default
> > > > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > > > and region->ops=null.
> > > > > Vvendor driver should override this region's len, flags, rw, mmap in its
> > > > > vfio_pci_mediate_ops.
> > > >
> > > > TBH, I don't like this interface at all. Userspace doesn't pass data
> > > > to the kernel via INFO ioctls. We have a SET_IRQS ioctl for
> > > > configuring user signaling with eventfds. I think we only need to
> > > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > > information for a region. The user would enumerate the device IRQs via
> > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > indicate which region(s) should be re-evaluated on signaling. The user
> > > > would enable that signaling via SET_IRQS and simply re-evaluate the
> > > ok. I'll try to switch to this way. Thanks for this suggestion.
> > >
> > > > sparse mmap capability for the associated regions when signaled.
> > >
> > > Do you like the "disablable" flag of sparse mmap ?
> > > I think it's a lightweight way for user to switch mmap state of a whole region,
> > > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > > region might be too heavy.
> >
> > No, I don't like the disable-able flag. At what frequency do we expect
> > regions to change? It seems like we'd only change when switching into
> > and out of the _SAVING state, which is rare. It seems easy for
> > userspace, at least QEMU, to drop the entire mmap configuration and
> ok. I'll try this way.
>
> > re-read it. Another concern here is how do we synchronize the event?
> > Are we assuming that this event would occur when a user switch to
> > _SAVING mode on the device? That operation is synchronous, the device
> > must be in saving mode after the write to device state completes, but
> > it seems like this might be trying to add an asynchronous dependency.
> > Will the write to device_state only complete once the user handles the
> > eventfd? How would the kernel know when the mmap re-evaluation is
> > complete. It seems like there are gaps here that the vendor driver
> > could miss traps required for migration because the user hasn't
> > completed the mmap transition yet. Thanks,
> >
> > Alex
>
> yes, this asynchronous event notification will cause vendor driver miss
> traps. But it's supposed to be of very short period time. That's also a
> reason for us to wish the re-evaluation to be lightweight. E.g. if it's
> able to be finished before the first iterate, it's still safe.

Making the re-evaluation lightweight cannot solve the race, it only
masks it.

> But I agree, the timing is not guaranteed, and so it's best for kernel
> to wait for mmap re-evaluation to complete.
>
> migration_thread
> |->qemu_savevm_state_setup
> | |->ram_save_setup
> | | |->migration_bitmap_sync
> | | |->kvm_log_sync
> | | |->vfio_log_sync
> | |
> | |->vfio_save_setup
> | |->set_device_state(_SAVING)
> |
> |->qemu_savevm_state_pending
> | |->ram_save_pending
> | | |->migration_bitmap_sync
> | | |->kvm_log_sync
> | | |->vfio_log_sync
> | |->vfio_save_pending
> |
> |->qemu_savevm_state_iterate
> | |->ram_save_iterate //send pages
> | |->vfio_save_iterate
> ...
>
>
> Actually, we previously let qemu trigger the re-evaluation when migration starts.
> And now the reason for we to wish kernel to trigger the mmap re-evaluation is that
> there're other two possible use cases:
> (1) keep passing through devices when migration starts and track dirty pages
> using hardware IOMMU. Then when migration is about to complete, stop the
> device and start trap PCI BARs for software emulation. (we made some
> changes to let device stop ahead of vcpu )

How is that possible? I/O devices need to continue to work until the
vCPUs stop, otherwise a vCPU can get blocked on the device. Maybe QEMU
should assume all mmaps on a vfio device should be dropped after we pass
some point of the migration process.

If there are a fixed set of mmap settings for a region and discrete
conditions under which they become active (ex. switch device to SAVING
mode) then QEMU could choose the right mapping itself and we wouldn't
need to worry about this asynchronous signaling problem, it would just
be defined as part of the protocol userspace needs to use.
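
Sketched below purely for illustration (the helpers are hypothetical; the
migration uAPI itself was still under discussion at this point): userspace
drops its mmaps first and only then switches the device state, so no
asynchronous signal from the kernel is needed:

	/* hypothetical QEMU-side ordering, not real QEMU code */
	static int vfio_enter_saving(VFIOPCIDevice *vdev)
	{
		int ret;

		/* 1. drop the mmap'd fast path; BAR accesses now trap */
		vfio_teardown_bar_mmaps(vdev);

		/* 2. only then move the device into _SAVING, so the vendor
		 *    driver sees a well-ordered transition */
		ret = vfio_set_device_state_saving(vdev);
		if (ret)
			vfio_restore_bar_mmaps(vdev);
		return ret;
	}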

> (2) performance optimization. There's an example in GVT (mdev case):
> PCI BARs are passed through on vGPU initialization and are mmaped to a host
> dummy buffer. Then after initialization done, start trap of PCI BARs of
> vGPUs and start normal host mediation. The initial pass-through can save
> 1000000 times of mmio trap.

Much of this discussion has me worried that many assumptions are being
made about the user and device interaction. Backwards compatible
behavior is required. If a mdev device presents an initial sparse mmap
capability for this acceleration, how do you support an existing
userspace that doesn't understand the new dynamic mmap semantics and
continues to try to operate with the initial sparse mmap? Doesn't this
introduce another example of the raciness of the device trying to
switch mmaps? Seems that if QEMU doesn't handle the eventfd with
sufficient timeliness the switch back to trap behavior could miss an
important transaction. This also seems like an optimization targeted
at VMs running for only a short time, where it's not obvious to me that
GVT-g overlaps those sorts of use cases. How much initialization time
is actually being saved with such a hack? Thanks,

Alex

2019-12-10 00:04:59

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

On Sun, 8 Dec 2019 22:42:25 -0500
Yan Zhao <[email protected]> wrote:

> On Sat, Dec 07, 2019 at 05:22:26AM +0800, Alex Williamson wrote:
> > On Fri, 6 Dec 2019 02:56:55 -0500
> > Yan Zhao <[email protected]> wrote:
> >
> > > On Fri, Dec 06, 2019 at 07:55:19AM +0800, Alex Williamson wrote:
> > > > On Wed, 4 Dec 2019 22:25:36 -0500
> > > > Yan Zhao <[email protected]> wrote:
> > > >
> > > > > when vfio-pci is bound to a physical device, almost all the hardware
> > > > > resources are passthroughed.
> > > > > Sometimes, vendor driver of this physcial device may want to mediate some
> > > > > hardware resource access for a short period of time, e.g. dirty page
> > > > > tracking during live migration.
> > > > >
> > > > > Here we introduce mediate ops in vfio-pci for this purpose.
> > > > >
> > > > > Vendor driver can register a mediate ops to vfio-pci.
> > > > > But rather than directly bind to the passthroughed device, the
> > > > > vendor driver is now either a module that does not bind to any device or
> > > > > a module binds to other device.
> > > > > E.g. when passing through a VF device that is bound to vfio-pci modules,
> > > > > PF driver that binds to PF device can register to vfio-pci to mediate
> > > > > VF's regions, hence supporting VF live migration.
> > > > >
> > > > > The sequence goes like this:
> > > > > 1. Vendor driver register its vfio_pci_mediate_ops to vfio-pci driver
> > > > >
> > > > > 2. vfio-pci maintains a list of those registered vfio_pci_mediate_ops
> > > > >
> > > > > 3. Whenever vfio-pci opens a device, it searches the list and call
> > > > > vfio_pci_mediate_ops->open() to check whether a vendor driver supports
> > > > > mediating this device.
> > > > > Upon a success return value of from vfio_pci_mediate_ops->open(),
> > > > > vfio-pci will stop list searching and store a mediate handle to
> > > > > represent this open into vendor driver.
> > > > > (so if multiple vendor drivers support mediating a device through
> > > > > vfio_pci_mediate_ops, only one will win, depending on their registering
> > > > > sequence)
> > > > >
> > > > > 4. Whenever a VFIO_DEVICE_GET_REGION_INFO ioctl is received in vfio-pci
> > > > > ops, it will chain into vfio_pci_mediate_ops->get_region_info(), so that
> > > > > vendor driver is able to override a region's default flags and caps,
> > > > > e.g. adding a sparse mmap cap to passthrough only sub-regions of a whole
> > > > > region.
> > > > >
> > > > > 5. vfio_pci_rw()/vfio_pci_mmap() first calls into
> > > > > vfio_pci_mediate_ops->rw()/vfio_pci_mediate_ops->mmaps().
> > > > > if pt=true is rteturned, vfio_pci_rw()/vfio_pci_mmap() will further
> > > > > passthrough this read/write/mmap to physical device, otherwise it just
> > > > > returns without touch physical device.
> > > > >
> > > > > 6. When vfio-pci closes a device, vfio_pci_release() chains into
> > > > > vfio_pci_mediate_ops->release() to close the reference in vendor driver.
> > > > >
> > > > > 7. Vendor driver unregister its vfio_pci_mediate_ops when driver exits
> > > > >
> > > > > Cc: Kevin Tian <[email protected]>
> > > > >
> > > > > Signed-off-by: Yan Zhao <[email protected]>
> > > > > ---
> > > > > drivers/vfio/pci/vfio_pci.c | 146 ++++++++++++++++++++++++++++
> > > > > drivers/vfio/pci/vfio_pci_private.h | 2 +
> > > > > include/linux/vfio.h | 16 +++
> > > > > 3 files changed, 164 insertions(+)
> > > > >
> > > > > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > > > > index 02206162eaa9..55080ff29495 100644
> > > > > --- a/drivers/vfio/pci/vfio_pci.c
> > > > > +++ b/drivers/vfio/pci/vfio_pci.c
> > > > > @@ -54,6 +54,14 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> > > > > MODULE_PARM_DESC(disable_idle_d3,
> > > > > "Disable using the PCI D3 low power state for idle, unused devices");
> > > > >
> > > > > +static LIST_HEAD(mediate_ops_list);
> > > > > +static DEFINE_MUTEX(mediate_ops_list_lock);
> > > > > +struct vfio_pci_mediate_ops_list_entry {
> > > > > + struct vfio_pci_mediate_ops *ops;
> > > > > + int refcnt;
> > > > > + struct list_head next;
> > > > > +};
> > > > > +
> > > > > static inline bool vfio_vga_disabled(void)
> > > > > {
> > > > > #ifdef CONFIG_VFIO_PCI_VGA
> > > > > @@ -472,6 +480,10 @@ static void vfio_pci_release(void *device_data)
> > > > > if (!(--vdev->refcnt)) {
> > > > > vfio_spapr_pci_eeh_release(vdev->pdev);
> > > > > vfio_pci_disable(vdev);
> > > > > + if (vdev->mediate_ops && vdev->mediate_ops->release) {
> > > > > + vdev->mediate_ops->release(vdev->mediate_handle);
> > > > > + vdev->mediate_ops = NULL;
> > > > > + }
> > > > > }
> > > > >
> > > > > mutex_unlock(&vdev->reflck->lock);
> > > > > @@ -483,6 +495,7 @@ static int vfio_pci_open(void *device_data)
> > > > > {
> > > > > struct vfio_pci_device *vdev = device_data;
> > > > > int ret = 0;
> > > > > + struct vfio_pci_mediate_ops_list_entry *mentry;
> > > > >
> > > > > if (!try_module_get(THIS_MODULE))
> > > > > return -ENODEV;
> > > > > @@ -495,6 +508,30 @@ static int vfio_pci_open(void *device_data)
> > > > > goto error;
> > > > >
> > > > > vfio_spapr_pci_eeh_open(vdev->pdev);
> > > > > + mutex_lock(&mediate_ops_list_lock);
> > > > > + list_for_each_entry(mentry, &mediate_ops_list, next) {
> > > > > + u64 caps;
> > > > > + u32 handle;
> > > >
> > > > Wouldn't it seem likely that the ops provider might use this handle as
> > > > a pointer, so we'd want it to be an opaque void*?
> > > >
> > > yes, you are right, handle as a pointer is much better. will change it.
> > > Thanks :)
> > >
> > > > > +
> > > > > + memset(&caps, 0, sizeof(caps));
> > > >
> > > > @caps has no purpose here, add it if/when we do something with it.
> > > > It's also a standard type, why are we memset'ing it rather than just
> > > > =0??
> > > >
> > > > > + ret = mentry->ops->open(vdev->pdev, &caps, &handle);
> > > > > + if (!ret) {
> > > > > + vdev->mediate_ops = mentry->ops;
> > > > > + vdev->mediate_handle = handle;
> > > > > +
> > > > > + pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for %x:%x\n",
> > > > > + vdev->mediate_ops->name, caps,
> > > > > + handle, vdev->pdev->vendor,
> > > > > + vdev->pdev->device);
> > > >
> > > > Generally not advisable to make user accessible printks.
> > > >
> > > ok.
> > >
> > > > > + /*
> > > > > + * only find the first matching mediate_ops,
> > > > > + * and add its refcnt
> > > > > + */
> > > > > + mentry->refcnt++;
> > > > > + break;
> > > > > + }
> > > > > + }
> > > > > + mutex_unlock(&mediate_ops_list_lock);
> > > > > }
> > > > > vdev->refcnt++;
> > > > > error:
> > > > > @@ -736,6 +773,14 @@ static long vfio_pci_ioctl(void *device_data,
> > > > > info.size = pdev->cfg_size;
> > > > > info.flags = VFIO_REGION_INFO_FLAG_READ |
> > > > > VFIO_REGION_INFO_FLAG_WRITE;
> > > > > +
> > > > > + if (vdev->mediate_ops &&
> > > > > + vdev->mediate_ops->get_region_info) {
> > > > > + vdev->mediate_ops->get_region_info(
> > > > > + vdev->mediate_handle,
> > > > > + &info, &caps, NULL);
> > > > > + }
> > > >
> > > > These would be a lot cleaner if we could just call a helper function:
> > > >
> > > > void vfio_pci_region_info_mediation_hook(vdev, info, caps, etc...)
> > > > {
> > > > if (vdev->mediate_ops
> > > > vdev->mediate_ops->get_region_info)
> > > > vdev->mediate_ops->get_region_info(vdev->mediate_handle,
> > > > &info, &caps, NULL);
> > > > }
> > > >
> > > > I'm not thrilled with all these hooks, but not open coding every one of
> > > > them might help.
> > >
> > > ok. got it.
> > > >
> > > > > +
> > > > > break;
> > > > > case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> > > > > info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> > > > > @@ -756,6 +801,13 @@ static long vfio_pci_ioctl(void *device_data,
> > > > > }
> > > > > }
> > > > >
> > > > > + if (vdev->mediate_ops &&
> > > > > + vdev->mediate_ops->get_region_info) {
> > > > > + vdev->mediate_ops->get_region_info(
> > > > > + vdev->mediate_handle,
> > > > > + &info, &caps, NULL);
> > > > > + }
> > > > > +
> > > > > break;
> > > > > case VFIO_PCI_ROM_REGION_INDEX:
> > > > > {
> > > > > @@ -794,6 +846,14 @@ static long vfio_pci_ioctl(void *device_data,
> > > > > }
> > > > >
> > > > > pci_write_config_word(pdev, PCI_COMMAND, orig_cmd);
> > > > > +
> > > > > + if (vdev->mediate_ops &&
> > > > > + vdev->mediate_ops->get_region_info) {
> > > > > + vdev->mediate_ops->get_region_info(
> > > > > + vdev->mediate_handle,
> > > > > + &info, &caps, NULL);
> > > > > + }
> > > > > +
> > > > > break;
> > > > > }
> > > > > case VFIO_PCI_VGA_REGION_INDEX:
> > > > > @@ -805,6 +865,13 @@ static long vfio_pci_ioctl(void *device_data,
> > > > > info.flags = VFIO_REGION_INFO_FLAG_READ |
> > > > > VFIO_REGION_INFO_FLAG_WRITE;
> > > > >
> > > > > + if (vdev->mediate_ops &&
> > > > > + vdev->mediate_ops->get_region_info) {
> > > > > + vdev->mediate_ops->get_region_info(
> > > > > + vdev->mediate_handle,
> > > > > + &info, &caps, NULL);
> > > > > + }
> > > > > +
> > > > > break;
> > > > > default:
> > > > > {
> > > > > @@ -839,6 +906,13 @@ static long vfio_pci_ioctl(void *device_data,
> > > > > if (ret)
> > > > > return ret;
> > > > > }
> > > > > +
> > > > > + if (vdev->mediate_ops &&
> > > > > + vdev->mediate_ops->get_region_info) {
> > > > > + vdev->mediate_ops->get_region_info(
> > > > > + vdev->mediate_handle,
> > > > > + &info, &caps, &cap_type);
> > > > > + }
> > > > > }
> > > > > }
> > > > >
> > > > > @@ -1151,6 +1225,16 @@ static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
> > > > > if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
> > > > > return -EINVAL;
> > > > >
> > > > > + if (vdev->mediate_ops && vdev->mediate_ops->rw) {
> > > > > + int ret;
> > > > > + bool pt = true;
> > > > > +
> > > > > + ret = vdev->mediate_ops->rw(vdev->mediate_handle,
> > > > > + buf, count, ppos, iswrite, &pt);
> > > > > + if (!pt)
> > > > > + return ret;
> > > > > + }
> > > > > +
> > > > > switch (index) {
> > > > > case VFIO_PCI_CONFIG_REGION_INDEX:
> > > > > return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> > > > > @@ -1200,6 +1284,15 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> > > > > u64 phys_len, req_len, pgoff, req_start;
> > > > > int ret;
> > > > >
> > > > > + if (vdev->mediate_ops && vdev->mediate_ops->mmap) {
> > > > > + int ret;
> > > > > + bool pt = true;
> > > > > +
> > > > > + ret = vdev->mediate_ops->mmap(vdev->mediate_handle, vma, &pt);
> > > > > + if (!pt)
> > > > > + return ret;
> > > > > + }
> > > >
> > > > There must be a better way to do all these. Do we really want to call
> > > > into ops for every rw or mmap, have the vendor code decode a region,
> > > > and maybe or maybe not have it handle it? It's pretty ugly. Do we
> > >
> > > do you think below flow is good ?
> > > 1. in mediate_ops->open(), return
> > > (1) region[] indexed by region index, if a mediate driver supports mediating
> > > region[i], region[i].ops->get_region_info, regions[i].ops->rw, or
> > > regions[i].ops->mmap is not null.
> > > (2) irq_info[] indexed by irq index, if a mediate driver supports mediating
> > > irq_info[i], irq_info[i].ops->get_irq_info or irq_info[i].ops->set_irq_info
> > > is not null.
> > >
> > > Then, vfio_pci_rw/vfio_pci_mmap/vfio_pci_ioctl only call into those
> > > non-null hooks.
> >
> > Or would it be better to always call into the hooks and the vendor
> > driver is allowed to selectively replace the hooks for regions they
> > want to mediate. For example, region[i].ops->rw could by default point
> > to vfio_pci_default_rw() and the mediation driver would have a
> > mechanism to replace that with its own vendorABC_vfio_pci_rw(). We
> > could export vfio_pci_default_rw() such that the vendor driver would be
> > responsible for calling it as necessary.
> >
> good idea :)
>
> > > > need the mediation provider to be able to dynamically setup the ops per
> > > May I confirm that you are not saying dynamic registering mediate ops
> > > after vfio-pci already opened a device, right?
> >
> > I'm not necessarily excluding or advocating for that.
> >
> ok. got it.
>
> > > > region and export the default handlers out for them to call?
> > > >
> > > could we still keep checking return value of the hooks rather than
> > > export default handlers? Otherwise at least vfio_pci_default_ioctl(),
> > > vfio_pci_default_rw(), and vfio_pci_default_mmap() need to be exported.
> >
> > The ugliness of vfio-pci having all these vendor branches is what I'm
> > trying to avoid, so I really am not a fan of the idea or mechanism that
> > the vfio-pci core code is directly involving a mediation driver and
> > handling the return for every entry point.
> >
> I see :)
> > > > > +
> > > > > index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> > > > >
> > > > > if (vma->vm_end < vma->vm_start)
> > > > > @@ -1629,8 +1722,17 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
> > > > >
> > > > > static void __exit vfio_pci_cleanup(void)
> > > > > {
> > > > > + struct vfio_pci_mediate_ops_list_entry *mentry, *n;
> > > > > +
> > > > > pci_unregister_driver(&vfio_pci_driver);
> > > > > vfio_pci_uninit_perm_bits();
> > > > > +
> > > > > + mutex_lock(&mediate_ops_list_lock);
> > > > > + list_for_each_entry_safe(mentry, n, &mediate_ops_list, next) {
> > > > > + list_del(&mentry->next);
> > > > > + kfree(mentry);
> > > > > + }
> > > > > + mutex_unlock(&mediate_ops_list_lock);
> > > >
> > > > Is it even possible to unload vfio-pci while there are mediation
> > > > drivers registered? I don't think the module interactions are well
> > > > thought out here, ex. do you really want i40e to have build and runtime
> > > > dependencies on vfio-pci? I don't think so.
> > > >
> > > Currently, yes, i40e has build dependency on vfio-pci.
> > > It's like this, if i40e decides to support SRIOV and compiles in vf
> > > related code who depends on vfio-pci, it will also have build dependency
> > > on vfio-pci. isn't it natural?
> >
> > No, this is not natural. There are certainly i40e VF use cases that
> > have no interest in vfio and having dependencies between the two
> > modules is unacceptable. I think you probably want to modularize the
> > i40e vfio support code and then perhaps register a table in vfio-pci
> > that the vfio-pci code can perform a module request when using a
> > compatible device. Just and idea, there might be better options. I
> > will not accept a solution that requires unloading the i40e driver in
> > order to unload the vfio-pci driver. It's inconvenient with just one
> > NIC driver, imagine how poorly that scales.
> >
> what about this way:
> mediate driver registers a module notifier and every time when
> vfio_pci is loaded, register to vfio_pci its mediate ops?
> (Just like in below sample code)
> This way vfio-pci is free to unload and this registering only gives
> vfio-pci a name of what module to request.
> After that,
> in vfio_pci_open(), vfio-pci requests the mediate driver. (or puts
> the mediate driver when mediate driver does not support mediating the
> device)
> in vfio_pci_release(), vfio-pci puts the mediate driver.
>
> static void register_mediate_ops(void)
> {
> int (*func)(struct vfio_pci_mediate_ops *ops) = NULL;
>
> func = symbol_get(vfio_pci_register_mediate_ops);
>
> if (func) {
> func(&igd_dt_ops);
> symbol_put(vfio_pci_register_mediate_ops);
> }
> }
>
> static int igd_module_notify(struct notifier_block *self,
> unsigned long val, void *data)
> {
> struct module *mod = data;
> int ret = 0;
>
> switch (val) {
> case MODULE_STATE_LIVE:
> if (!strcmp(mod->name, "vfio_pci"))
> register_mediate_ops();
> break;
> case MODULE_STATE_GOING:
> break;
> default:
> break;
> }
> return ret;
> }
>
> static struct notifier_block igd_module_nb = {
> .notifier_call = igd_module_notify,
> .priority = 0,
> };
>
>
>
> static int __init igd_dt_init(void)
> {
> ...
> register_mediate_ops();
> register_module_notifier(&igd_module_nb);
> ...
> return 0;
> }


No, this is bad. Please look at MODULE_ALIAS() and request_module() as
used in the vfio-platform for loading reset driver modules. I think
the correct approach is that vfio-pci should perform a request_module()
based on the device being probed. Having the mediation provider
listening for vfio-pci and registering itself regardless of whether we
intend to use it assumes that we will want to use it and assumes that
the mediation provider module is already loaded. We should be able to
support demand loading of modules that may serve no other purpose than
providing this mediation. Thanks,
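
For example (only MODULE_ALIAS() and request_module() are existing kernel
interfaces here; the alias string format is made up):

	/* in the mediation provider module (e.g. modularized i40e VF support) */
	MODULE_ALIAS("vfio-pci-mediate:8086:154c");

	/* in vfio-pci, when probing or opening a matching device */
	request_module("vfio-pci-mediate:%04x:%04x",
		       pdev->vendor, pdev->device);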

Alex

> > > > > }
> > > > >
> > > > > static void __init vfio_pci_fill_ids(void)
> > > > > @@ -1697,6 +1799,50 @@ static int __init vfio_pci_init(void)
> > > > > return ret;
> > > > > }
> > > > >
> > > > > +int vfio_pci_register_mediate_ops(struct vfio_pci_mediate_ops *ops)
> > > > > +{
> > > > > + struct vfio_pci_mediate_ops_list_entry *mentry;
> > > > > +
> > > > > + mutex_lock(&mediate_ops_list_lock);
> > > > > + mentry = kzalloc(sizeof(*mentry), GFP_KERNEL);
> > > > > + if (!mentry) {
> > > > > + mutex_unlock(&mediate_ops_list_lock);
> > > > > + return -ENOMEM;
> > > > > + }
> > > > > +
> > > > > + mentry->ops = ops;
> > > > > + mentry->refcnt = 0;
> > > >
> > > > It's kZalloc'd, this is unnecessary.
> > > >
> > > right :)
> > > > > + list_add(&mentry->next, &mediate_ops_list);
> > > >
> > > > Check for duplicates?
> > > >
> > > ok. will do it.
> > > > > +
> > > > > + pr_info("registered dm ops %s\n", ops->name);
> > > > > + mutex_unlock(&mediate_ops_list_lock);
> > > > > +
> > > > > + return 0;
> > > > > +}
> > > > > +EXPORT_SYMBOL(vfio_pci_register_mediate_ops);
> > > > > +
> > > > > +void vfio_pci_unregister_mediate_ops(struct vfio_pci_mediate_ops *ops)
> > > > > +{
> > > > > + struct vfio_pci_mediate_ops_list_entry *mentry, *n;
> > > > > +
> > > > > + mutex_lock(&mediate_ops_list_lock);
> > > > > + list_for_each_entry_safe(mentry, n, &mediate_ops_list, next) {
> > > > > + if (mentry->ops != ops)
> > > > > + continue;
> > > > > +
> > > > > + mentry->refcnt--;
> > > >
> > > > Whose reference is this removing?
> > > >
> > > I intended to prevent mediate driver from calling unregister mediate ops
> > > while there're still opened devices in it.
> > > after a successful mediate_ops->open(), mentry->refcnt++.
> > > after calling mediate_ops->release(). mentry->refcnt--.
> > >
> > > (seems in this RFC, I missed a mentry->refcnt-- after calling
> > > mediate_ops->release())
> > >
> > >
> > > > > + if (!mentry->refcnt) {
> > > > > + list_del(&mentry->next);
> > > > > + kfree(mentry);
> > > > > + } else
> > > > > + pr_err("vfio_pci unregister mediate ops %s error\n",
> > > > > + mentry->ops->name);
> > > >
> > > > This is bad, we should hold a reference to the module providing these
> > > > ops for each use of it such that the module cannot be removed while
> > > > it's in use. Otherwise we enter a very bad state here and it's
> > > > trivially accessible by an admin remove the module while in use.
> > > mediate driver is supposed to ref its own module on a success
> > > mediate_ops->open(), and deref its own module on mediate_ops->release().
> > > so, it can't be accidentally removed.
> >
> > Where was that semantic expressed in this series? We should create
> > interfaces that are hard to use incorrectly. It is far too easy for a
> > vendor driver to overlook such a requirement, which means fixing the
> > same bugs repeatedly for each vendor. It needs to be improved. Thanks,
>
> right. will improve it.
>
> Thanks
> Yan
>

2019-12-10 02:53:38

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

> > > > Currently, yes, i40e has build dependency on vfio-pci.
> > > > It's like this, if i40e decides to support SRIOV and compiles in vf
> > > > related code who depends on vfio-pci, it will also have build dependency
> > > > on vfio-pci. isn't it natural?
> > >
> > > No, this is not natural. There are certainly i40e VF use cases that
> > > have no interest in vfio and having dependencies between the two
> > > modules is unacceptable. I think you probably want to modularize the
> > > i40e vfio support code and then perhaps register a table in vfio-pci
> > > that the vfio-pci code can perform a module request when using a
> > > compatible device. Just and idea, there might be better options. I
> > > will not accept a solution that requires unloading the i40e driver in
> > > order to unload the vfio-pci driver. It's inconvenient with just one
> > > NIC driver, imagine how poorly that scales.
> > >
> > what about this way:
> > mediate driver registers a module notifier and every time when
> > vfio_pci is loaded, register to vfio_pci its mediate ops?
> > (Just like in below sample code)
> > This way vfio-pci is free to unload and this registering only gives
> > vfio-pci a name of what module to request.
> > After that,
> > in vfio_pci_open(), vfio-pci requests the mediate driver. (or puts
> > the mediate driver when mediate driver does not support mediating the
> > device)
> > in vfio_pci_release(), vfio-pci puts the mediate driver.
> >
> > static void register_mediate_ops(void)
> > {
> > int (*func)(struct vfio_pci_mediate_ops *ops) = NULL;
> >
> > func = symbol_get(vfio_pci_register_mediate_ops);
> >
> > if (func) {
> > func(&igd_dt_ops);
> > symbol_put(vfio_pci_register_mediate_ops);
> > }
> > }
> >
> > static int igd_module_notify(struct notifier_block *self,
> > unsigned long val, void *data)
> > {
> > struct module *mod = data;
> > int ret = 0;
> >
> > switch (val) {
> > case MODULE_STATE_LIVE:
> > if (!strcmp(mod->name, "vfio_pci"))
> > register_mediate_ops();
> > break;
> > case MODULE_STATE_GOING:
> > break;
> > default:
> > break;
> > }
> > return ret;
> > }
> >
> > static struct notifier_block igd_module_nb = {
> > .notifier_call = igd_module_notify,
> > .priority = 0,
> > };
> >
> >
> >
> > static int __init igd_dt_init(void)
> > {
> > ...
> > register_mediate_ops();
> > register_module_notifier(&igd_module_nb);
> > ...
> > return 0;
> > }
>
>
> No, this is bad. Please look at MODULE_ALIAS() and request_module() as
> used in the vfio-platform for loading reset driver modules. I think
> the correct approach is that vfio-pci should perform a request_module()
> based on the device being probed. Having the mediation provider
> listening for vfio-pci and registering itself regardless of whether we
> intend to use it assumes that we will want to use it and assumes that
> the mediation provider module is already loaded. We should be able to
> support demand loading of modules that may serve no other purpose than
> providing this mediation. Thanks,
hi Alex
Thanks for this message.
So would it be good to create a separate module as the mediation provider
driver and alias its module name to "vfio-pci-mediate-vid-did"?
Then, when vfio-pci probes the device, it would request the module of that name?

Thanks
Yan

2019-12-10 07:53:40

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:
> On Mon, 9 Dec 2019 01:22:12 -0500
> Yan Zhao <[email protected]> wrote:
>
> > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:
> > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > Yan Zhao <[email protected]> wrote:
> > >
> > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> > > > > On Wed, 4 Dec 2019 22:26:50 -0500
> > > > > Yan Zhao <[email protected]> wrote:
> > > > >
> > > > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > > > communicate dynamic trap info. It is of type
> > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > >
> > > > > > This region has two fields: dt_fd and trap.
> > > > > > When QEMU detects a device regions of this type, it will create an
> > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > When vendor drivre signals this eventfd, QEMU reads trap field of this
> > > > > > info region.
> > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > > > mmaped subregion is disablable).
> > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > >
> > > > > > A typical usage is
> > > > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > > > sparse mmap array. So initally, all its bar 0 are passthroughed.
> > > > > > 2. vendor driver specifys part of bar 0 sections to be disablable.
> > > > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > > re-passthrough the whole bar 0 region.
> > > > > >
> > > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > vfio_pci_mediate_ops->open().
> > > > > >
> > > > > > If vfio-pci detects this cap, it will create a default
> > > > > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > > > > and region->ops=null.
> > > > > > Vendor driver should override this region's len, flags, rw, mmap in its
> > > > > > vfio_pci_mediate_ops.
> > > > >
> > > > > TBH, I don't like this interface at all. Userspace doesn't pass data
> > > > > to the kernel via INFO ioctls. We have a SET_IRQS ioctl for
> > > > > configuring user signaling with eventfds. I think we only need to
> > > > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > > > information for a region. The user would enumerate the device IRQs via
> > > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > > indicate which region(s) should be re-evaluated on signaling. The user
> > > > > would enable that signaling via SET_IRQS and simply re-evaluate the
> > > > ok. I'll try to switch to this way. Thanks for this suggestion.
> > > >
> > > > > sparse mmap capability for the associated regions when signaled.
> > > >
> > > > Do you like the "disablable" flag of sparse mmap ?
> > > > I think it's a lightweight way for user to switch mmap state of a whole region,
> > > > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > > > region might be too heavy.
> > >
> > > No, I don't like the disable-able flag. At what frequency do we expect
> > > regions to change? It seems like we'd only change when switching into
> > > and out of the _SAVING state, which is rare. It seems easy for
> > > userspace, at least QEMU, to drop the entire mmap configuration and
> > ok. I'll try this way.
> >
> > > re-read it. Another concern here is how do we synchronize the event?
> > > Are we assuming that this event would occur when a user switch to
> > > _SAVING mode on the device? That operation is synchronous, the device
> > > must be in saving mode after the write to device state completes, but
> > > it seems like this might be trying to add an asynchronous dependency.
> > > Will the write to device_state only complete once the user handles the
> > > eventfd? How would the kernel know when the mmap re-evaluation is
> > > complete. It seems like there are gaps here that the vendor driver
> > > could miss traps required for migration because the user hasn't
> > > completed the mmap transition yet. Thanks,
> > >
> > > Alex
> >
> > yes, this asynchronous event notification will cause vendor driver miss
> > traps. But it's supposed to be of very short period time. That's also a
> > reason for us to wish the re-evaluation to be lightweight. E.g. if it's
> > able to be finished before the first iterate, it's still safe.
>
> Making the re-evaluation lightweight cannot solve the race, it only
> masks it.
>
> > But I agree, the timing is not guaranteed, and so it's best for kernel
> > to wait for mmap re-evaluation to complete.
> >
> > migration_thread
> > |->qemu_savevm_state_setup
> > | |->ram_save_setup
> > | | |->migration_bitmap_sync
> > | | |->kvm_log_sync
> > | | |->vfio_log_sync
> > | |
> > | |->vfio_save_setup
> > | |->set_device_state(_SAVING)
> > |
> > |->qemu_savevm_state_pending
> > | |->ram_save_pending
> > | | |->migration_bitmap_sync
> > | | |->kvm_log_sync
> > | | |->vfio_log_sync
> > | |->vfio_save_pending
> > |
> > |->qemu_savevm_state_iterate
> > | |->ram_save_iterate //send pages
> > | |->vfio_save_iterate
> > ...
> >
> >
> > Actually, we previously let qemu trigger the re-evaluation when migration starts.
> > And now the reason for we to wish kernel to trigger the mmap re-evaluation is that
> > there're other two possible use cases:
> > (1) keep passing through devices when migration starts and track dirty pages
> > using hardware IOMMU. Then when migration is about to complete, stop the
> > device and start trap PCI BARs for software emulation. (we made some
> > changes to let device stop ahead of vcpu )
>
> How is that possible? I/O devices need to continue to work until the
> vCPU stops otherwise the vCPU can get blocked on the device. Maybe QEMU
hi Alex
Devices like DSA [1] can support SVM mode. In this mode, when a page fault
happens, the Intel DSA device blocks until the fault is resolved if PRS is
enabled; otherwise the fault is reported as an error.

Therefore, to pass DSA through to a guest and live migrate with it, it is
desirable to stop DSA before stopping the vCPUs, as there may be an
outstanding page fault still to be resolved.

During the period when DSA is stopped and the vCPUs are still running, all
the pass-through resources are trapped and emulated by the host mediation
driver until the vCPUs stop.


[1] https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf


> should assume all mmaps should be dropped on vfio device after we pass
> some point of the migration process.
>
yes, it should be workable for the use case of DSA.

> If there are a fixed set of mmap settings for a region and discrete
> conditions under which they become active (ex. switch device to SAVING
> mode) then QEMU could choose the right mapping itself and we wouldn't
> need to worry about this asynchronous signaling problem, it would just
> be defined as part of the protocol userspace needs to use.
>
It's ok to let QEMU trigger the dynamic trap on a certain condition (like
switching the device to SAVING mode), but it seems that there's no fixed set
of mmap settings for a region.
For example, some devices may want to trap whole BARs, while others only
require trapping a range of pages within a BAR for performance reasons.

If the "disable-able" flag is not preferable, maybe the re-evaluation way is
the only choice? But it is a burden to ask for re-evaluation when it is not
required.

What about introducing a "region_bitmask" in the ctl header of the migration
region? When QEMU writes a region index to the "region_bitmask", it can read
back from this field a bitmask telling it which mmaps to disable.
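
Roughly, and only as a sketch of the idea (the field name and semantics
below are one possible reading, not a finalized layout):

    /*
     * Hypothetical field appended to the migration region's ctl header.
     * Write: a region index to query.
     * Read:  a bitmask of that region's sparse mmap areas that should
     *        currently be disabled (trapped) by QEMU.
     */
    __u64 region_bitmask;

QEMU would write the index of a BAR region here and read back which of its
mmap'd areas to tear down before continuing with migration.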

> > (2) performance optimization. There's an example in GVT (mdev case):
> > PCI BARs are passed through on vGPU initialization and are mmaped to a host
> > dummy buffer. Then after initialization done, start trap of PCI BARs of
> > vGPUs and start normal host mediation. The initial pass-through can save
> > 1000000 times of mmio trap.
>
> Much of this discussion has me worried that many assumptions are being
> made about the user and device interaction. Backwards compatible
> behavior is required. If a mdev device presents an initial sparse mmap
> capability for this acceleration, how do you support an existing
> userspace that doesn't understand the new dynamic mmap semantics and
> continues to try to operate with the initial sparse mmap? Doesn't this
> introduce another example of the raciness of the device trying to
> switch mmaps? Seems that if QEMU doesn't handle the eventfd with
> sufficient timeliness the switch back to trap behavior could miss an
> important transaction. This also seems like an optimization targeted
> at VMs running for only a short time, where it's not obvious to me that
> GVT-g overlaps those sorts of use cases. How much initialization time
> is actually being saved with such a hack? Thanks,
>
It can save about 4s of initialization time with such a hack. But you are
right, backward compatibility is a problem, and we are not going to upstream
that; it's just an example to show the usage.
It's fine if we drop the asynchronous kernel notification approach.

Thanks
Yan

2019-12-10 16:40:39

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

On Tue, 10 Dec 2019 02:44:44 -0500
Yan Zhao <[email protected]> wrote:

> On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:
> > On Mon, 9 Dec 2019 01:22:12 -0500
> > Yan Zhao <[email protected]> wrote:
> >
> > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:
> > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > Yan Zhao <[email protected]> wrote:
> > > >
> > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> > > > > > On Wed, 4 Dec 2019 22:26:50 -0500
> > > > > > Yan Zhao <[email protected]> wrote:
> > > > > >
> > > > > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > > > > communicate dynamic trap info. It is of type
> > > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > >
> > > > > > > This region has two fields: dt_fd and trap.
> > > > > > > When QEMU detects a device regions of this type, it will create an
> > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > When vendor drivre signals this eventfd, QEMU reads trap field of this
> > > > > > > info region.
> > > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > > > > mmaped subregion is disablable).
> > > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > >
> > > > > > > A typical usage is
> > > > > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > > > > sparse mmap array. So initally, all its bar 0 are passthroughed.
> > > > > > > 2. vendor driver specifys part of bar 0 sections to be disablable.
> > > > > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > > > re-passthrough the whole bar 0 region.
> > > > > > >
> > > > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > > vfio_pci_mediate_ops->open().
> > > > > > >
> > > > > > > If vfio-pci detects this cap, it will create a default
> > > > > > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > > > > > and region->ops=null.
> > > > > > > Vvendor driver should override this region's len, flags, rw, mmap in its
> > > > > > > vfio_pci_mediate_ops.
> > > > > >
> > > > > > TBH, I don't like this interface at all. Userspace doesn't pass data
> > > > > > to the kernel via INFO ioctls. We have a SET_IRQS ioctl for
> > > > > > configuring user signaling with eventfds. I think we only need to
> > > > > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > > > > information for a region. The user would enumerate the device IRQs via
> > > > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > > > indicate which region(s) should be re-evaluated on signaling. The user
> > > > > > would enable that signaling via SET_IRQS and simply re-evaluate the
> > > > > ok. I'll try to switch to this way. Thanks for this suggestion.
> > > > >
> > > > > > sparse mmap capability for the associated regions when signaled.
> > > > >
> > > > > Do you like the "disablable" flag of sparse mmap ?
> > > > > I think it's a lightweight way for user to switch mmap state of a whole region,
> > > > > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > > > > region might be too heavy.
> > > >
> > > > No, I don't like the disable-able flag. At what frequency do we expect
> > > > regions to change? It seems like we'd only change when switching into
> > > > and out of the _SAVING state, which is rare. It seems easy for
> > > > userspace, at least QEMU, to drop the entire mmap configuration and
> > > ok. I'll try this way.
> > >
> > > > re-read it. Another concern here is how do we synchronize the event?
> > > > Are we assuming that this event would occur when a user switch to
> > > > _SAVING mode on the device? That operation is synchronous, the device
> > > > must be in saving mode after the write to device state completes, but
> > > > it seems like this might be trying to add an asynchronous dependency.
> > > > Will the write to device_state only complete once the user handles the
> > > > eventfd? How would the kernel know when the mmap re-evaluation is
> > > > complete. It seems like there are gaps here that the vendor driver
> > > > could miss traps required for migration because the user hasn't
> > > > completed the mmap transition yet. Thanks,
> > > >
> > > > Alex
> > >
> > > yes, this asynchronous event notification will cause vendor driver miss
> > > traps. But it's supposed to be of very short period time. That's also a
> > > reason for us to wish the re-evaluation to be lightweight. E.g. if it's
> > > able to be finished before the first iterate, it's still safe.
> >
> > Making the re-evaluation lightweight cannot solve the race, it only
> > masks it.
> >
> > > But I agree, the timing is not guaranteed, and so it's best for kernel
> > > to wait for mmap re-evaluation to complete.
> > >
> > > migration_thread
> > > |->qemu_savevm_state_setup
> > > | |->ram_save_setup
> > > | | |->migration_bitmap_sync
> > > | | |->kvm_log_sync
> > > | | |->vfio_log_sync
> > > | |
> > > | |->vfio_save_setup
> > > | |->set_device_state(_SAVING)
> > > |
> > > |->qemu_savevm_state_pending
> > > | |->ram_save_pending
> > > | | |->migration_bitmap_sync
> > > | | |->kvm_log_sync
> > > | | |->vfio_log_sync
> > > | |->vfio_save_pending
> > > |
> > > |->qemu_savevm_state_iterate
> > > | |->ram_save_iterate //send pages
> > > | |->vfio_save_iterate
> > > ...
> > >
> > >
> > > Actually, we previously let qemu trigger the re-evaluation when migration starts.
> > > And now the reason for we to wish kernel to trigger the mmap re-evaluation is that
> > > there're other two possible use cases:
> > > (1) keep passing through devices when migration starts and track dirty pages
> > > using hardware IOMMU. Then when migration is about to complete, stop the
> > > device and start trap PCI BARs for software emulation. (we made some
> > > changes to let device stop ahead of vcpu )
> >
> > How is that possible? I/O devices need to continue to work until the
> > vCPU stops otherwise the vCPU can get blocked on the device. Maybe QEMU
> hi Alex
> For devices like DSA [1], it can support SVM mode. In this mode, when a
> page fault happens, the Intel DSA device blocks until the page fault is
> resolved, if PRS is enabled; otherwise it is reported as an error.
>
> Therefore, to pass through DSA into guest and do live migration with it,
> it is desired to stop DSA before stopping vCPU, as there may be an
> outstanding page fault to be resolved.
>
> During the period when DSA is stopped and vCPUs are still running, all the
> pass-through resources are trapped and emulated by host mediation driver until
> vCPUs stop.

If the DSA is stopped and resources are trapped and emulated, then is
the device really stopped from a QEMU perspective or has it simply
switched modes underneath QEMU? If the device is truly stopped, then
I'd like to understand how a vCPU doing a PIO read from the device
wouldn't wedge the VM.

> [1] https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf
>
>
> > should assume all mmaps should be dropped on vfio device after we pass
> > some point of the migration process.
> >
> yes, it should be workable for the use case of DSA.
>
> > If there are a fixed set of mmap settings for a region and discrete
> > conditions under which they become active (ex. switch device to SAVING
> > mode) then QEMU could choose the right mapping itself and we wouldn't
> > need to worry about this asynchronous signaling problem, it would just
> > be defined as part of the protocol userspace needs to use.
> >
> It's ok to let QEMU trigger dynamic trap on certain condition (like switching
> device to SAVING mode), but it seems that there's no fixed set of mmap settings
> for a region.
> For example, some devices may want to trap the whole BARs, but some devices
> only requires to trap a range of pages in a BAR for performance consideration.
>
> If the "disable-able" flag is not preferable, maybe re-evaluation way is
> the only choice? But it is a burden to ask for re-evaluation if they are
> not required.
>
> What about introducing a "region_bitmask" in ctl header of the migration region?
> when QEMU writes a region index to the "region_bitmask", it can read back
> from this field a bitmask to know which mmap to disable.

If a vendor driver wanted to have a migration sparse mmap that's
different from its runtime sparse mmap, we could simply add a new
capability in the region_info. Userspace would only need to switch to
a different mapping for regions which advertise a new migration sparse
mmap capability. Doesn't that serve the same purpose as the proposed
bitmap?
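
Just to illustrate (the capability ID and struct name below are
placeholders, patterned on the existing sparse mmap capability in
<linux/vfio.h>):

    /*
     * Hypothetical: same layout as VFIO_REGION_INFO_CAP_SPARSE_MMAP, but
     * describing the mmaps userspace should switch to once it puts the
     * device into the _SAVING state.
     */
    #define VFIO_REGION_INFO_CAP_MIGRATION_SPARSE_MMAP  9 /* placeholder */

    struct vfio_region_info_cap_migration_sparse_mmap {
            struct vfio_info_cap_header header;
            __u32 nr_areas;
            __u32 reserved;
            struct vfio_region_sparse_mmap_area areas[];
    };

Userspace that recognizes the capability switches its mappings for that
region when it transitions the device to _SAVING; no asynchronous signaling
is needed.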

> > > (2) performance optimization. There's an example in GVT (mdev case):
> > > PCI BARs are passed through on vGPU initialization and are mmaped to a host
> > > dummy buffer. Then after initialization done, start trap of PCI BARs of
> > > vGPUs and start normal host mediation. The initial pass-through can save
> > > 1000000 times of mmio trap.
> >
> > Much of this discussion has me worried that many assumptions are being
> > made about the user and device interaction. Backwards compatible
> > behavior is required. If a mdev device presents an initial sparse mmap
> > capability for this acceleration, how do you support an existing
> > userspace that doesn't understand the new dynamic mmap semantics and
> > continues to try to operate with the initial sparse mmap? Doesn't this
> > introduce another example of the raciness of the device trying to
> > switch mmaps? Seems that if QEMU doesn't handle the eventfd with
> > sufficient timeliness the switch back to trap behavior could miss an
> > important transaction. This also seems like an optimization targeted
> > at VMs running for only a short time, where it's not obvious to me that
> > GVT-g overlaps those sorts of use cases. How much initialization time
> > is actually being saved with such a hack? Thanks,
> >
> It can save about 4s initialization time with such a hack. But you are
> right, the backward compatibility is a problem and we are not going to
> upstream that. Just an example to show the usage.
> It's fine if we drop the way of asynchronous kernel notification.

I think to handle such a situation we'd need a mechanism to revoke the
user's mmap. We can make use of an asynchronous mechanism to improve
performance of a device, but we need a synchronous mechanism to
maintain correctness. For this example, the sparse mmap capability
could advertise the section of the BAR as mmap'able and revoke that
user mapping after the device finishes the initialization phase.
Potentially the user re-evaluating region_info after the initialization
phase would see a different sparse mmap capability excluding these
sections, but then we might need to think whether we want to suggest
that the user always re-read the region_info after device reset. AFAIK,
we currently have no mechanism to revoke user mmaps. Thanks,

Alex

2019-12-10 17:00:46

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

On Mon, 9 Dec 2019 21:44:23 -0500
Yan Zhao <[email protected]> wrote:

> > > > > Currently, yes, i40e has build dependency on vfio-pci.
> > > > > It's like this, if i40e decides to support SRIOV and compiles in vf
> > > > > related code who depends on vfio-pci, it will also have build dependency
> > > > > on vfio-pci. isn't it natural?
> > > >
> > > > No, this is not natural. There are certainly i40e VF use cases that
> > > > have no interest in vfio and having dependencies between the two
> > > > modules is unacceptable. I think you probably want to modularize the
> > > > i40e vfio support code and then perhaps register a table in vfio-pci
> > > > that the vfio-pci code can perform a module request when using a
> > > > compatible device. Just and idea, there might be better options. I
> > > > will not accept a solution that requires unloading the i40e driver in
> > > > order to unload the vfio-pci driver. It's inconvenient with just one
> > > > NIC driver, imagine how poorly that scales.
> > > >
> > > what about this way:
> > > mediate driver registers a module notifier and every time when
> > > vfio_pci is loaded, register to vfio_pci its mediate ops?
> > > (Just like in below sample code)
> > > This way vfio-pci is free to unload and this registering only gives
> > > vfio-pci a name of what module to request.
> > > After that,
> > > in vfio_pci_open(), vfio-pci requests the mediate driver. (or puts
> > > the mediate driver when mediate driver does not support mediating the
> > > device)
> > > in vfio_pci_release(), vfio-pci puts the mediate driver.
> > >
> > > static void register_mediate_ops(void)
> > > {
> > > int (*func)(struct vfio_pci_mediate_ops *ops) = NULL;
> > >
> > > func = symbol_get(vfio_pci_register_mediate_ops);
> > >
> > > if (func) {
> > > func(&igd_dt_ops);
> > > symbol_put(vfio_pci_register_mediate_ops);
> > > }
> > > }
> > >
> > > static int igd_module_notify(struct notifier_block *self,
> > > unsigned long val, void *data)
> > > {
> > > struct module *mod = data;
> > > int ret = 0;
> > >
> > > switch (val) {
> > > case MODULE_STATE_LIVE:
> > > if (!strcmp(mod->name, "vfio_pci"))
> > > register_mediate_ops();
> > > break;
> > > case MODULE_STATE_GOING:
> > > break;
> > > default:
> > > break;
> > > }
> > > return ret;
> > > }
> > >
> > > static struct notifier_block igd_module_nb = {
> > > .notifier_call = igd_module_notify,
> > > .priority = 0,
> > > };
> > >
> > >
> > >
> > > static int __init igd_dt_init(void)
> > > {
> > > ...
> > > register_mediate_ops();
> > > register_module_notifier(&igd_module_nb);
> > > ...
> > > return 0;
> > > }
> >
> >
> > No, this is bad. Please look at MODULE_ALIAS() and request_module() as
> > used in the vfio-platform for loading reset driver modules. I think
> > the correct approach is that vfio-pci should perform a request_module()
> > based on the device being probed. Having the mediation provider
> > listening for vfio-pci and registering itself regardless of whether we
> > intend to use it assumes that we will want to use it and assumes that
> > the mediation provider module is already loaded. We should be able to
> > support demand loading of modules that may serve no other purpose than
> > providing this mediation. Thanks,
> hi Alex
> Thanks for this message.
> So is it good to create a separate module as mediation provider driver,
> and alias its module name to "vfio-pci-mediate-vid-did".
> Then when vfio-pci probes the device, it requests module of that name ?

I think this would give us an option to have the mediator as a separate
module, but not require it. Maybe rather than a request_module(),
where if we follow the platform reset example we'd then expect the init
code for the module to register into a list, we could do a
symbol_request(). AIUI, this would give us a reference to the symbol
if the module providing it is already loaded, and request a module
(perhaps via an alias) if it's not already loaded. Thanks,

Alex
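
A rough sketch of what that could look like in vfio-pci (the exported symbol
name below is an assumption, purely for illustration):

    /* Exported by the (possibly not yet loaded) mediation provider: */
    extern struct vfio_pci_mediate_ops i40e_vf_mediate_ops;

    static struct vfio_pci_mediate_ops *vfio_pci_find_mediate_ops(void)
    {
            /*
             * Takes a reference on the symbol if the providing module is
             * already loaded; otherwise asks the module loader for the
             * module that exports it and tries again.
             */
            return symbol_request(i40e_vf_mediate_ops);
    }

    static void vfio_pci_put_mediate_ops(void)
    {
            /* Drop the module reference taken by symbol_request(). */
            symbol_put(i40e_vf_mediate_ops);
    }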

2019-12-11 01:29:01

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

On Wed, Dec 11, 2019 at 12:58:24AM +0800, Alex Williamson wrote:
> On Mon, 9 Dec 2019 21:44:23 -0500
> Yan Zhao <[email protected]> wrote:
>
> > > > > > Currently, yes, i40e has build dependency on vfio-pci.
> > > > > > It's like this, if i40e decides to support SRIOV and compiles in vf
> > > > > > related code who depends on vfio-pci, it will also have build dependency
> > > > > > on vfio-pci. isn't it natural?
> > > > >
> > > > > No, this is not natural. There are certainly i40e VF use cases that
> > > > > have no interest in vfio and having dependencies between the two
> > > > > modules is unacceptable. I think you probably want to modularize the
> > > > > i40e vfio support code and then perhaps register a table in vfio-pci
> > > > > that the vfio-pci code can perform a module request when using a
> > > > > compatible device. Just and idea, there might be better options. I
> > > > > will not accept a solution that requires unloading the i40e driver in
> > > > > order to unload the vfio-pci driver. It's inconvenient with just one
> > > > > NIC driver, imagine how poorly that scales.
> > > > >
> > > > what about this way:
> > > > mediate driver registers a module notifier and every time when
> > > > vfio_pci is loaded, register to vfio_pci its mediate ops?
> > > > (Just like in below sample code)
> > > > This way vfio-pci is free to unload and this registering only gives
> > > > vfio-pci a name of what module to request.
> > > > After that,
> > > > in vfio_pci_open(), vfio-pci requests the mediate driver. (or puts
> > > > the mediate driver when mediate driver does not support mediating the
> > > > device)
> > > > in vfio_pci_release(), vfio-pci puts the mediate driver.
> > > >
> > > > static void register_mediate_ops(void)
> > > > {
> > > > int (*func)(struct vfio_pci_mediate_ops *ops) = NULL;
> > > >
> > > > func = symbol_get(vfio_pci_register_mediate_ops);
> > > >
> > > > if (func) {
> > > > func(&igd_dt_ops);
> > > > symbol_put(vfio_pci_register_mediate_ops);
> > > > }
> > > > }
> > > >
> > > > static int igd_module_notify(struct notifier_block *self,
> > > > unsigned long val, void *data)
> > > > {
> > > > struct module *mod = data;
> > > > int ret = 0;
> > > >
> > > > switch (val) {
> > > > case MODULE_STATE_LIVE:
> > > > if (!strcmp(mod->name, "vfio_pci"))
> > > > register_mediate_ops();
> > > > break;
> > > > case MODULE_STATE_GOING:
> > > > break;
> > > > default:
> > > > break;
> > > > }
> > > > return ret;
> > > > }
> > > >
> > > > static struct notifier_block igd_module_nb = {
> > > > .notifier_call = igd_module_notify,
> > > > .priority = 0,
> > > > };
> > > >
> > > >
> > > >
> > > > static int __init igd_dt_init(void)
> > > > {
> > > > ...
> > > > register_mediate_ops();
> > > > register_module_notifier(&igd_module_nb);
> > > > ...
> > > > return 0;
> > > > }
> > >
> > >
> > > No, this is bad. Please look at MODULE_ALIAS() and request_module() as
> > > used in the vfio-platform for loading reset driver modules. I think
> > > the correct approach is that vfio-pci should perform a request_module()
> > > based on the device being probed. Having the mediation provider
> > > listening for vfio-pci and registering itself regardless of whether we
> > > intend to use it assumes that we will want to use it and assumes that
> > > the mediation provider module is already loaded. We should be able to
> > > support demand loading of modules that may serve no other purpose than
> > > providing this mediation. Thanks,
> > hi Alex
> > Thanks for this message.
> > So is it good to create a separate module as mediation provider driver,
> > and alias its module name to "vfio-pci-mediate-vid-did".
> > Then when vfio-pci probes the device, it requests module of that name ?
>
> I think this would give us an option to have the mediator as a separate
> module, but not require it. Maybe rather than a request_module(),
> where if we follow the platform reset example we'd then expect the init
> code for the module to register into a list, we could do a
> symbol_request(). AIUI, this would give us a reference to the symbol
> if the module providing it is already loaded, and request a module
> (perhaps via an alias) if it's not already load. Thanks,
>
ok. got it!
Thank you :)

Yan

2019-12-11 06:34:55

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

On Wed, Dec 11, 2019 at 12:38:05AM +0800, Alex Williamson wrote:
> On Tue, 10 Dec 2019 02:44:44 -0500
> Yan Zhao <[email protected]> wrote:
>
> > On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:
> > > On Mon, 9 Dec 2019 01:22:12 -0500
> > > Yan Zhao <[email protected]> wrote:
> > >
> > > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:
> > > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > > Yan Zhao <[email protected]> wrote:
> > > > >
> > > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> > > > > > > On Wed, 4 Dec 2019 22:26:50 -0500
> > > > > > > Yan Zhao <[email protected]> wrote:
> > > > > > >
> > > > > > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > > > > > communicate dynamic trap info. It is of type
> > > > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > > >
> > > > > > > > This region has two fields: dt_fd and trap.
> > > > > > > > When QEMU detects a device regions of this type, it will create an
> > > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > > When vendor drivre signals this eventfd, QEMU reads trap field of this
> > > > > > > > info region.
> > > > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > > > > > mmaped subregion is disablable).
> > > > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > > >
> > > > > > > > A typical usage is
> > > > > > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > > > > > sparse mmap array. So initally, all its bar 0 are passthroughed.
> > > > > > > > 2. vendor driver specifys part of bar 0 sections to be disablable.
> > > > > > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > > > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > > > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > > > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > > > > re-passthrough the whole bar 0 region.
> > > > > > > >
> > > > > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > > > vfio_pci_mediate_ops->open().
> > > > > > > >
> > > > > > > > If vfio-pci detects this cap, it will create a default
> > > > > > > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > > > > > > and region->ops=null.
> > > > > > > > Vvendor driver should override this region's len, flags, rw, mmap in its
> > > > > > > > vfio_pci_mediate_ops.
> > > > > > >
> > > > > > > TBH, I don't like this interface at all. Userspace doesn't pass data
> > > > > > > to the kernel via INFO ioctls. We have a SET_IRQS ioctl for
> > > > > > > configuring user signaling with eventfds. I think we only need to
> > > > > > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > > > > > information for a region. The user would enumerate the device IRQs via
> > > > > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > > > > indicate which region(s) should be re-evaluated on signaling. The user
> > > > > > > would enable that signaling via SET_IRQS and simply re-evaluate the
> > > > > > ok. I'll try to switch to this way. Thanks for this suggestion.
> > > > > >
> > > > > > > sparse mmap capability for the associated regions when signaled.
> > > > > >
> > > > > > Do you like the "disablable" flag of sparse mmap ?
> > > > > > I think it's a lightweight way for user to switch mmap state of a whole region,
> > > > > > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > > > > > region might be too heavy.
> > > > >
> > > > > No, I don't like the disable-able flag. At what frequency do we expect
> > > > > regions to change? It seems like we'd only change when switching into
> > > > > and out of the _SAVING state, which is rare. It seems easy for
> > > > > userspace, at least QEMU, to drop the entire mmap configuration and
> > > > ok. I'll try this way.
> > > >
> > > > > re-read it. Another concern here is how do we synchronize the event?
> > > > > Are we assuming that this event would occur when a user switch to
> > > > > _SAVING mode on the device? That operation is synchronous, the device
> > > > > must be in saving mode after the write to device state completes, but
> > > > > it seems like this might be trying to add an asynchronous dependency.
> > > > > Will the write to device_state only complete once the user handles the
> > > > > eventfd? How would the kernel know when the mmap re-evaluation is
> > > > > complete. It seems like there are gaps here that the vendor driver
> > > > > could miss traps required for migration because the user hasn't
> > > > > completed the mmap transition yet. Thanks,
> > > > >
> > > > > Alex
> > > >
> > > > yes, this asynchronous event notification will cause vendor driver miss
> > > > traps. But it's supposed to be of very short period time. That's also a
> > > > reason for us to wish the re-evaluation to be lightweight. E.g. if it's
> > > > able to be finished before the first iterate, it's still safe.
> > >
> > > Making the re-evaluation lightweight cannot solve the race, it only
> > > masks it.
> > >
> > > > But I agree, the timing is not guaranteed, and so it's best for kernel
> > > > to wait for mmap re-evaluation to complete.
> > > >
> > > > migration_thread
> > > > |->qemu_savevm_state_setup
> > > > | |->ram_save_setup
> > > > | | |->migration_bitmap_sync
> > > > | | |->kvm_log_sync
> > > > | | |->vfio_log_sync
> > > > | |
> > > > | |->vfio_save_setup
> > > > | |->set_device_state(_SAVING)
> > > > |
> > > > |->qemu_savevm_state_pending
> > > > | |->ram_save_pending
> > > > | | |->migration_bitmap_sync
> > > > | | |->kvm_log_sync
> > > > | | |->vfio_log_sync
> > > > | |->vfio_save_pending
> > > > |
> > > > |->qemu_savevm_state_iterate
> > > > | |->ram_save_iterate //send pages
> > > > | |->vfio_save_iterate
> > > > ...
> > > >
> > > >
> > > > Actually, we previously let qemu trigger the re-evaluation when migration starts.
> > > > And now the reason for we to wish kernel to trigger the mmap re-evaluation is that
> > > > there're other two possible use cases:
> > > > (1) keep passing through devices when migration starts and track dirty pages
> > > > using hardware IOMMU. Then when migration is about to complete, stop the
> > > > device and start trap PCI BARs for software emulation. (we made some
> > > > changes to let device stop ahead of vcpu )
> > >
> > > How is that possible? I/O devices need to continue to work until the
> > > vCPU stops otherwise the vCPU can get blocked on the device. Maybe QEMU
> > hi Alex
> > For devices like DSA [1], it can support SVM mode. In this mode, when a
> > page fault happens, the Intel DSA device blocks until the page fault is
> > resolved, if PRS is enabled; otherwise it is reported as an error.
> >
> > Therefore, to pass through DSA into guest and do live migration with it,
> > it is desired to stop DSA before stopping vCPU, as there may be an
> > outstanding page fault to be resolved.
> >
> > During the period when DSA is stopped and vCPUs are still running, all the
> > pass-through resources are trapped and emulated by host mediation driver until
> > vCPUs stop.
>
> If the DSA is stopped and resources are trapped and emulated, then is
> the device really stopped from a QEMU perspective or has it simply
> switched modes underneath QEMU? If the device is truly stopped, then
> I'd like to understand how a vCPU doing a PIO read from the device
> wouldn't wedge the VM.
>
It doesn't matter whether the device is truly stopped or not (although from
my point of view, just draining commands and keeping the device running is
better, as it handles live migration failure better).
PIOs also need to be trapped and emulated if a vCPU accesses them.

> > [1] https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf
> >
> >
> > > should assume all mmaps should be dropped on vfio device after we pass
> > > some point of the migration process.
> > >
> > yes, it should be workable for the use case of DSA.
> >
> > > If there are a fixed set of mmap settings for a region and discrete
> > > conditions under which they become active (ex. switch device to SAVING
> > > mode) then QEMU could choose the right mapping itself and we wouldn't
> > > need to worry about this asynchronous signaling problem, it would just
> > > be defined as part of the protocol userspace needs to use.
> > >
> > It's ok to let QEMU trigger dynamic trap on certain condition (like switching
> > device to SAVING mode), but it seems that there's no fixed set of mmap settings
> > for a region.
> > For example, some devices may want to trap the whole BARs, but some devices
> > only requires to trap a range of pages in a BAR for performance consideration.
> >
> > If the "disable-able" flag is not preferable, maybe re-evaluation way is
> > the only choice? But it is a burden to ask for re-evaluation if they are
> > not required.
> >
> > What about introducing a "region_bitmask" in ctl header of the migration region?
> > when QEMU writes a region index to the "region_bitmask", it can read back
> > from this field a bitmask to know which mmap to disable.
>
> If a vendor driver wanted to have a migration sparse mmap that's
> different from its runtime sparse mmap, we could simply add a new
> capability in the region_info. Userspace would only need to switch to
> a different mapping for regions which advertise a new migration sparse
> mmap capability. Doesn't that serve the same purpose as the proposed
> bitmap?

Yes, it does.
I will try this way in the next version.

> > > > (2) performance optimization. There's an example in GVT (mdev case):
> > > > PCI BARs are passed through on vGPU initialization and are mmaped to a host
> > > > dummy buffer. Then after initialization done, start trap of PCI BARs of
> > > > vGPUs and start normal host mediation. The initial pass-through can save
> > > > 1000000 times of mmio trap.
> > >
> > > Much of this discussion has me worried that many assumptions are being
> > > made about the user and device interaction. Backwards compatible
> > > behavior is required. If a mdev device presents an initial sparse mmap
> > > capability for this acceleration, how do you support an existing
> > > userspace that doesn't understand the new dynamic mmap semantics and
> > > continues to try to operate with the initial sparse mmap? Doesn't this
> > > introduce another example of the raciness of the device trying to
> > > switch mmaps? Seems that if QEMU doesn't handle the eventfd with
> > > sufficient timeliness the switch back to trap behavior could miss an
> > > important transaction. This also seems like an optimization targeted
> > > at VMs running for only a short time, where it's not obvious to me that
> > > GVT-g overlaps those sorts of use cases. How much initialization time
> > > is actually being saved with such a hack? Thanks,
> > >
> > It can save about 4s initialization time with such a hack. But you are
> > right, the backward compatibility is a problem and we are not going to
> > upstream that. Just an example to show the usage.
> > It's fine if we drop the way of asynchronous kernel notification.
>
> I think to handle such a situation we'd need a mechanism to revoke the
> user's mmap. We can make use of an asynchronous mechanism to improve
> performance of a device, but we need a synchronous mechanism to
> maintain correctness. For this example, the sparse mmap capability
> could advertise the section of the BAR as mmap'able and revoke that
> user mapping after the device finishes the initialization phase.
> Potentially the user re-evaluating region_info after the initialization
> phase would see a different sparse mmap capability excluding these
> sections, but then we might need to think whether we want to suggest
> that the user always re-read the region_info after device reset. AFAIK,
> we currently have no mechanism to revoke user mmaps. Thanks,
>
Actually I think the "disable-able" flag is good, except for its backward
compatibility problem :)

Thanks
Yan

2019-12-11 18:58:56

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

On Wed, 11 Dec 2019 01:25:55 -0500
Yan Zhao <[email protected]> wrote:

> On Wed, Dec 11, 2019 at 12:38:05AM +0800, Alex Williamson wrote:
> > On Tue, 10 Dec 2019 02:44:44 -0500
> > Yan Zhao <[email protected]> wrote:
> >
> > > On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:
> > > > On Mon, 9 Dec 2019 01:22:12 -0500
> > > > Yan Zhao <[email protected]> wrote:
> > > >
> > > > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:
> > > > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > > > Yan Zhao <[email protected]> wrote:
> > > > > >
> > > > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> > > > > > > > On Wed, 4 Dec 2019 22:26:50 -0500
> > > > > > > > Yan Zhao <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > > > > > > communicate dynamic trap info. It is of type
> > > > > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > > > >
> > > > > > > > > This region has two fields: dt_fd and trap.
> > > > > > > > > When QEMU detects a device regions of this type, it will create an
> > > > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > > > When vendor drivre signals this eventfd, QEMU reads trap field of this
> > > > > > > > > info region.
> > > > > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > > > > > > mmaped subregion is disablable).
> > > > > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > > > >
> > > > > > > > > A typical usage is
> > > > > > > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > > > > > > sparse mmap array. So initally, all its bar 0 are passthroughed.
> > > > > > > > > 2. vendor driver specifys part of bar 0 sections to be disablable.
> > > > > > > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > > > > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > > > > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > > > > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > > > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > > > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > > > > > re-passthrough the whole bar 0 region.
> > > > > > > > >
> > > > > > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > > > > vfio_pci_mediate_ops->open().
> > > > > > > > >
> > > > > > > > > If vfio-pci detects this cap, it will create a default
> > > > > > > > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > > > > > > > and region->ops=null.
> > > > > > > > > Vvendor driver should override this region's len, flags, rw, mmap in its
> > > > > > > > > vfio_pci_mediate_ops.
> > > > > > > >
> > > > > > > > TBH, I don't like this interface at all. Userspace doesn't pass data
> > > > > > > > to the kernel via INFO ioctls. We have a SET_IRQS ioctl for
> > > > > > > > configuring user signaling with eventfds. I think we only need to
> > > > > > > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > > > > > > information for a region. The user would enumerate the device IRQs via
> > > > > > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > > > > > indicate which region(s) should be re-evaluated on signaling. The user
> > > > > > > > would enable that signaling via SET_IRQS and simply re-evaluate the
> > > > > > > ok. I'll try to switch to this way. Thanks for this suggestion.
> > > > > > >
> > > > > > > > sparse mmap capability for the associated regions when signaled.
> > > > > > >
> > > > > > > Do you like the "disablable" flag of sparse mmap ?
> > > > > > > I think it's a lightweight way for user to switch mmap state of a whole region,
> > > > > > > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > > > > > > region might be too heavy.
> > > > > >
> > > > > > No, I don't like the disable-able flag. At what frequency do we expect
> > > > > > regions to change? It seems like we'd only change when switching into
> > > > > > and out of the _SAVING state, which is rare. It seems easy for
> > > > > > userspace, at least QEMU, to drop the entire mmap configuration and
> > > > > ok. I'll try this way.
> > > > >
> > > > > > re-read it. Another concern here is how do we synchronize the event?
> > > > > > Are we assuming that this event would occur when a user switch to
> > > > > > _SAVING mode on the device? That operation is synchronous, the device
> > > > > > must be in saving mode after the write to device state completes, but
> > > > > > it seems like this might be trying to add an asynchronous dependency.
> > > > > > Will the write to device_state only complete once the user handles the
> > > > > > eventfd? How would the kernel know when the mmap re-evaluation is
> > > > > > complete. It seems like there are gaps here that the vendor driver
> > > > > > could miss traps required for migration because the user hasn't
> > > > > > completed the mmap transition yet. Thanks,
> > > > > >
> > > > > > Alex
> > > > >
> > > > > yes, this asynchronous event notification will cause vendor driver miss
> > > > > traps. But it's supposed to be of very short period time. That's also a
> > > > > reason for us to wish the re-evaluation to be lightweight. E.g. if it's
> > > > > able to be finished before the first iterate, it's still safe.
> > > >
> > > > Making the re-evaluation lightweight cannot solve the race, it only
> > > > masks it.
> > > >
> > > > > But I agree, the timing is not guaranteed, and so it's best for kernel
> > > > > to wait for mmap re-evaluation to complete.
> > > > >
> > > > > migration_thread
> > > > > |->qemu_savevm_state_setup
> > > > > | |->ram_save_setup
> > > > > | | |->migration_bitmap_sync
> > > > > | | |->kvm_log_sync
> > > > > | | |->vfio_log_sync
> > > > > | |
> > > > > | |->vfio_save_setup
> > > > > | |->set_device_state(_SAVING)
> > > > > |
> > > > > |->qemu_savevm_state_pending
> > > > > | |->ram_save_pending
> > > > > | | |->migration_bitmap_sync
> > > > > | | |->kvm_log_sync
> > > > > | | |->vfio_log_sync
> > > > > | |->vfio_save_pending
> > > > > |
> > > > > |->qemu_savevm_state_iterate
> > > > > | |->ram_save_iterate //send pages
> > > > > | |->vfio_save_iterate
> > > > > ...
> > > > >
> > > > >
> > > > > Actually, we previously let qemu trigger the re-evaluation when migration starts.
> > > > > And now the reason for we to wish kernel to trigger the mmap re-evaluation is that
> > > > > there're other two possible use cases:
> > > > > (1) keep passing through devices when migration starts and track dirty pages
> > > > > using hardware IOMMU. Then when migration is about to complete, stop the
> > > > > device and start trap PCI BARs for software emulation. (we made some
> > > > > changes to let device stop ahead of vcpu )
> > > >
> > > > How is that possible? I/O devices need to continue to work until the
> > > > vCPU stops otherwise the vCPU can get blocked on the device. Maybe QEMU
> > > hi Alex
> > > For devices like DSA [1], it can support SVM mode. In this mode, when a
> > > page fault happens, the Intel DSA device blocks until the page fault is
> > > resolved, if PRS is enabled; otherwise it is reported as an error.
> > >
> > > Therefore, to pass through DSA into guest and do live migration with it,
> > > it is desired to stop DSA before stopping vCPU, as there may be an
> > > outstanding page fault to be resolved.
> > >
> > > During the period when DSA is stopped and vCPUs are still running, all the
> > > pass-through resources are trapped and emulated by host mediation driver until
> > > vCPUs stop.
> >
> > If the DSA is stopped and resources are trapped and emulated, then is
> > the device really stopped from a QEMU perspective or has it simply
> > switched modes underneath QEMU? If the device is truly stopped, then
> > I'd like to understand how a vCPU doing a PIO read from the device
> > wouldn't wedge the VM.
> >
> It doesn't matter if the device is truly stopped or not (although from
> my point of view, just draining commands and keeping device running is
> better as it handles live migration failure better).
> PIOs also need to be trapped and emulated if a vCPU accesses them.

We seem to be talking around each other here. If PIOs are trapped and
emulated then the device is not "stopped" as far as QEMU is concerned,
right? "Stopping" a device suggests to me that a running vCPU doing a
PIO read from the device would block and cause problems in the still
running VM. So I think you're suggesting some sort of mode switch in
the device where direct access is disabled and emulation takes over
until the vCPUs are stopped.

> > > [1] https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf
> > >
> > >
> > > > should assume all mmaps should be dropped on vfio device after we pass
> > > > some point of the migration process.
> > > >
> > > yes, it should be workable for the use case of DSA.
> > >
> > > > If there are a fixed set of mmap settings for a region and discrete
> > > > conditions under which they become active (ex. switch device to SAVING
> > > > mode) then QEMU could choose the right mapping itself and we wouldn't
> > > > need to worry about this asynchronous signaling problem, it would just
> > > > be defined as part of the protocol userspace needs to use.
> > > >
> > > It's ok to let QEMU trigger dynamic trap on certain condition (like switching
> > > device to SAVING mode), but it seems that there's no fixed set of mmap settings
> > > for a region.
> > > For example, some devices may want to trap the whole BARs, but some devices
> > > only requires to trap a range of pages in a BAR for performance consideration.
> > >
> > > If the "disable-able" flag is not preferable, maybe re-evaluation way is
> > > the only choice? But it is a burden to ask for re-evaluation if they are
> > > not required.
> > >
> > > What about introducing a "region_bitmask" in ctl header of the migration region?
> > > when QEMU writes a region index to the "region_bitmask", it can read back
> > > from this field a bitmask to know which mmap to disable.
> >
> > If a vendor driver wanted to have a migration sparse mmap that's
> > different from its runtime sparse mmap, we could simply add a new
> > capability in the region_info. Userspace would only need to switch to
> > a different mapping for regions which advertise a new migration sparse
> > mmap capability. Doesn't that serve the same purpose as the proposed
> > bitmap?
>
> yes, it does.
> I will try this way in next version.
>
> > > > > (2) performance optimization. There's an example in GVT (mdev case):
> > > > > PCI BARs are passed through on vGPU initialization and are mmaped to a host
> > > > > dummy buffer. Then after initialization done, start trap of PCI BARs of
> > > > > vGPUs and start normal host mediation. The initial pass-through can save
> > > > > 1000000 times of mmio trap.
> > > >
> > > > Much of this discussion has me worried that many assumptions are being
> > > > made about the user and device interaction. Backwards compatible
> > > > behavior is required. If a mdev device presents an initial sparse mmap
> > > > capability for this acceleration, how do you support an existing
> > > > userspace that doesn't understand the new dynamic mmap semantics and
> > > > continues to try to operate with the initial sparse mmap? Doesn't this
> > > > introduce another example of the raciness of the device trying to
> > > > switch mmaps? Seems that if QEMU doesn't handle the eventfd with
> > > > sufficient timeliness the switch back to trap behavior could miss an
> > > > important transaction. This also seems like an optimization targeted
> > > > at VMs running for only a short time, where it's not obvious to me that
> > > > GVT-g overlaps those sorts of use cases. How much initialization time
> > > > is actually being saved with such a hack? Thanks,
> > > >
> > > It can save about 4s initialization time with such a hack. But you are
> > > right, the backward compatibility is a problem and we are not going to
> > > upstream that. Just an example to show the usage.
> > > It's fine if we drop the way of asynchronous kernel notification.
> >
> > I think to handle such a situation we'd need a mechanism to revoke the
> > user's mmap. We can make use of an asynchronous mechanism to improve
> > performance of a device, but we need a synchronous mechanism to
> > maintain correctness. For this example, the sparse mmap capability
> > could advertise the section of the BAR as mmap'able and revoke that
> > user mapping after the device finishes the initialization phase.
> > Potentially the user re-evaluating region_info after the initialization
> > phase would see a different sparse mmap capability excluding these
> > sections, but then we might need to think whether we want to suggest
> > that the user always re-read the region_info after device reset. AFAIK,
> > we currently have no mechanism to revoke user mmaps. Thanks,
> >
> Actually I think the "disable-able" flag is good except for its backward
> compatibility :)

Setting a flag on a section of a region doesn't solve the asynchronous
problem. Thanks,

Alex

2019-12-12 02:11:58

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

On Thu, Dec 12, 2019 at 02:56:55AM +0800, Alex Williamson wrote:
> On Wed, 11 Dec 2019 01:25:55 -0500
> Yan Zhao <[email protected]> wrote:
>
> > On Wed, Dec 11, 2019 at 12:38:05AM +0800, Alex Williamson wrote:
> > > On Tue, 10 Dec 2019 02:44:44 -0500
> > > Yan Zhao <[email protected]> wrote:
> > >
> > > > On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:
> > > > > On Mon, 9 Dec 2019 01:22:12 -0500
> > > > > Yan Zhao <[email protected]> wrote:
> > > > >
> > > > > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:
> > > > > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > > > > Yan Zhao <[email protected]> wrote:
> > > > > > >
> > > > > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> > > > > > > > > On Wed, 4 Dec 2019 22:26:50 -0500
> > > > > > > > > Yan Zhao <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > > > > > > > communicate dynamic trap info. It is of type
> > > > > > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > > > > >
> > > > > > > > > > This region has two fields: dt_fd and trap.
> > > > > > > > > > When QEMU detects a device regions of this type, it will create an
> > > > > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > > > > When vendor drivre signals this eventfd, QEMU reads trap field of this
> > > > > > > > > > info region.
> > > > > > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > > > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > > > > > > > mmaped subregion is disablable).
> > > > > > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > > > > >
> > > > > > > > > > A typical usage is
> > > > > > > > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > > > > > > > sparse mmap array. So initially, all its bar 0 are passthroughed.
> > > > > > > > > > 2. vendor driver specifies part of bar 0 sections to be disablable.
> > > > > > > > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > > > > > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > > > > > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > > > > > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > > > > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > > > > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > > > > > > re-passthrough the whole bar 0 region.
> > > > > > > > > >
> > > > > > > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > > > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > > > > > vfio_pci_mediate_ops->open().
> > > > > > > > > >
> > > > > > > > > > If vfio-pci detects this cap, it will create a default
> > > > > > > > > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > > > > > > > > and region->ops=null.
> > > > > > > > > > Vendor driver should override this region's len, flags, rw, mmap in its
> > > > > > > > > > vfio_pci_mediate_ops.
> > > > > > > > >
> > > > > > > > > TBH, I don't like this interface at all. Userspace doesn't pass data
> > > > > > > > > to the kernel via INFO ioctls. We have a SET_IRQS ioctl for
> > > > > > > > > configuring user signaling with eventfds. I think we only need to
> > > > > > > > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > > > > > > > information for a region. The user would enumerate the device IRQs via
> > > > > > > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > > > > > > indicate which region(s) should be re-evaluated on signaling. The user
> > > > > > > > > would enable that signaling via SET_IRQS and simply re-evaluate the
> > > > > > > > ok. I'll try to switch to this way. Thanks for this suggestion.
> > > > > > > >
> > > > > > > > > sparse mmap capability for the associated regions when signaled.
> > > > > > > >
> > > > > > > > Do you like the "disablable" flag of sparse mmap ?
> > > > > > > > I think it's a lightweight way for user to switch mmap state of a whole region,
> > > > > > > > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > > > > > > > region might be too heavy.
> > > > > > >
> > > > > > > No, I don't like the disable-able flag. At what frequency do we expect
> > > > > > > regions to change? It seems like we'd only change when switching into
> > > > > > > and out of the _SAVING state, which is rare. It seems easy for
> > > > > > > userspace, at least QEMU, to drop the entire mmap configuration and
> > > > > > ok. I'll try this way.
> > > > > >
> > > > > > > re-read it. Another concern here is how do we synchronize the event?
> > > > > > > Are we assuming that this event would occur when a user switch to
> > > > > > > _SAVING mode on the device? That operation is synchronous, the device
> > > > > > > must be in saving mode after the write to device state completes, but
> > > > > > > it seems like this might be trying to add an asynchronous dependency.
> > > > > > > Will the write to device_state only complete once the user handles the
> > > > > > > eventfd? How would the kernel know when the mmap re-evaluation is
> > > > > > > complete. It seems like there are gaps here that the vendor driver
> > > > > > > could miss traps required for migration because the user hasn't
> > > > > > > completed the mmap transition yet. Thanks,
> > > > > > >
> > > > > > > Alex
> > > > > >
> > > > > > yes, this asynchronous event notification will cause vendor driver miss
> > > > > > traps. But it's supposed to be of very short period time. That's also a
> > > > > > reason for us to wish the re-evaluation to be lightweight. E.g. if it's
> > > > > > able to be finished before the first iterate, it's still safe.
> > > > >
> > > > > Making the re-evaluation lightweight cannot solve the race, it only
> > > > > masks it.
> > > > >
> > > > > > But I agree, the timing is not guaranteed, and so it's best for kernel
> > > > > > to wait for mmap re-evaluation to complete.
> > > > > >
> > > > > > migration_thread
> > > > > > |->qemu_savevm_state_setup
> > > > > > | |->ram_save_setup
> > > > > > | | |->migration_bitmap_sync
> > > > > > | | |->kvm_log_sync
> > > > > > | | |->vfio_log_sync
> > > > > > | |
> > > > > > | |->vfio_save_setup
> > > > > > | |->set_device_state(_SAVING)
> > > > > > |
> > > > > > |->qemu_savevm_state_pending
> > > > > > | |->ram_save_pending
> > > > > > | | |->migration_bitmap_sync
> > > > > > | | |->kvm_log_sync
> > > > > > | | |->vfio_log_sync
> > > > > > | |->vfio_save_pending
> > > > > > |
> > > > > > |->qemu_savevm_state_iterate
> > > > > > | |->ram_save_iterate //send pages
> > > > > > | |->vfio_save_iterate
> > > > > > ...
> > > > > >
> > > > > >
> > > > > > Actually, we previously let qemu trigger the re-evaluation when migration starts.
> > > > > > And now the reason for we to wish kernel to trigger the mmap re-evaluation is that
> > > > > > there're other two possible use cases:
> > > > > > (1) keep passing through devices when migration starts and track dirty pages
> > > > > > using hardware IOMMU. Then when migration is about to complete, stop the
> > > > > > device and start trap PCI BARs for software emulation. (we made some
> > > > > > changes to let device stop ahead of vcpu )
> > > > >
> > > > > How is that possible? I/O devices need to continue to work until the
> > > > > vCPU stops otherwise the vCPU can get blocked on the device. Maybe QEMU
> > > > hi Alex
> > > > For devices like DSA [1], it can support SVM mode. In this mode, when a
> > > > page fault happens, the Intel DSA device blocks until the page fault is
> > > > resolved, if PRS is enabled; otherwise it is reported as an error.
> > > >
> > > > Therefore, to pass through DSA into guest and do live migration with it,
> > > > it is desired to stop DSA before stopping vCPU, as there may be an
> > > > outstanding page fault to be resolved.
> > > >
> > > > During the period when DSA is stopped and vCPUs are still running, all the
> > > > pass-through resources are trapped and emulated by host mediation driver until
> > > > vCPUs stop.
> > >
> > > If the DSA is stopped and resources are trapped and emulated, then is
> > > the device really stopped from a QEMU perspective or has it simply
> > > switched modes underneath QEMU? If the device is truly stopped, then
> > > I'd like to understand how a vCPU doing a PIO read from the device
> > > wouldn't wedge the VM.
> > >
> > It doesn't matter if the device is truly stopped or not (although from
> > my point of view, just draining commands and keeping device running is
> > better as it handles live migration failure better).
> > PIOs also need to be trapped and emulated if a vCPU accesses them.
>
> We seem to be talking around each other here. If PIOs are trapped and
> emulated then the device is not "stopped" as far as QEMU is concerned,
> right? "Stopping" a device suggests to me that a running vCPU doing a
> PIO read from the device would block and cause problems in the still
> running VM. So I think you're suggesting some sort of mode switch in
> the device where direct access is disabled and emulation takes over
> until the vCPUs are stopped.

sorry for this confusion.
yes, it's a kind of mode switch from a QEMU perspective.
Currently, its implementation in our local branch is like this:
1. before the migration thread stops vCPUs, a migration state
(COMPLETING) notification is sent to the vfio migration state notifier,
which puts the device state to !RUNNING and puts all BARs into trap
state.
2. in the kernel, when the device state is set to !RUNNING, the vendor
driver drains all pending device requests and starts emulation.

This implementation has two issues:
1. it requires hardcoding in QEMU to put all BARs into trap state, and
the time spent on revoking mmaps is unnecessary for devices that do not
need it.
2. the !RUNNING state here is not accurate and will confuse vendor
drivers that stop their devices after vCPUs stop.

For the 2nd issue, I think we can propose a new device state like
PRE-STOPPING.

But for the 1st issue, I'm not sure how to fix it right now.
Maybe we can still add an asynchronous kernel notification and wait until
QEMU has switched the region mmap state?
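
As a rough sketch of the PRE-STOPPING idea, the device_state field of the
proposed migration region could carry one more bit. The bit value, the
PRE_STOPPING name and the vendor-side helpers below are only assumptions
for illustration, not part of any posted uAPI:

#include <linux/types.h>

/* RUNNING/SAVING follow the style of the proposed migration region;
 * PRE_STOPPING is hypothetical. */
#define VFIO_DEVICE_STATE_RUNNING       (1 << 0)
#define VFIO_DEVICE_STATE_SAVING        (1 << 1)
#define VFIO_DEVICE_STATE_PRE_STOPPING  (1 << 3)        /* hypothetical */

struct vendor_vf;                                       /* hypothetical VF context */
void vendor_vf_drain_requests(struct vendor_vf *vf);    /* hypothetical helpers */
void vendor_vf_start_emulation(struct vendor_vf *vf);
void vendor_vf_stop(struct vendor_vf *vf);

/* Possible vendor-driver reaction to a device_state write: in
 * PRE-STOPPING the vCPUs are still running, so the driver only drains
 * outstanding requests and switches BAR accesses to software emulation;
 * the device is really stopped once RUNNING is cleared after vCPUs stop. */
static int vendor_vf_set_state(struct vendor_vf *vf, u32 state)
{
        if (state & VFIO_DEVICE_STATE_PRE_STOPPING) {
                vendor_vf_drain_requests(vf);
                vendor_vf_start_emulation(vf);
        } else if (!(state & VFIO_DEVICE_STATE_RUNNING)) {
                vendor_vf_stop(vf);
        }
        return 0;
}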


> > > > [1] https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf
> > > >
> > > >
> > > > > should assume all mmaps should be dropped on vfio device after we pass
> > > > > some point of the migration process.
> > > > >
> > > > yes, it should be workable for the use case of DSA.
> > > >
> > > > > If there are a fixed set of mmap settings for a region and discrete
> > > > > conditions under which they become active (ex. switch device to SAVING
> > > > > mode) then QEMU could choose the right mapping itself and we wouldn't
> > > > > need to worry about this asynchronous signaling problem, it would just
> > > > > be defined as part of the protocol userspace needs to use.
> > > > >
> > > > It's ok to let QEMU trigger dynamic trap on certain condition (like switching
> > > > device to SAVING mode), but it seems that there's no fixed set of mmap settings
> > > > for a region.
> > > > For example, some devices may want to trap the whole BARs, but some devices
> > > > only requires to trap a range of pages in a BAR for performance consideration.
> > > >
> > > > If the "disable-able" flag is not preferable, maybe re-evaluation way is
> > > > the only choice? But it is a burden to ask for re-evaluation if they are
> > > > not required.
> > > >
> > > > What about introducing a "region_bitmask" in ctl header of the migration region?
> > > > when QEMU writes a region index to the "region_bitmask", it can read back
> > > > from this field a bitmask to know which mmap to disable.
> > >
> > > If a vendor driver wanted to have a migration sparse mmap that's
> > > different from its runtime sparse mmap, we could simply add a new
> > > capability in the region_info. Userspace would only need to switch to
> > > a different mapping for regions which advertise a new migration sparse
> > > mmap capability. Doesn't that serve the same purpose as the proposed
> > > bitmap?
> >
> > yes, it does.
> > I will try this way in next version.
> >
> > > > > > (2) performance optimization. There's an example in GVT (mdev case):
> > > > > > PCI BARs are passed through on vGPU initialization and are mmaped to a host
> > > > > > dummy buffer. Then after initialization done, start trap of PCI BARs of
> > > > > > vGPUs and start normal host mediation. The initial pass-through can save
> > > > > > 1000000 times of mmio trap.
> > > > >
> > > > > Much of this discussion has me worried that many assumptions are being
> > > > > made about the user and device interaction. Backwards compatible
> > > > > behavior is required. If a mdev device presents an initial sparse mmap
> > > > > capability for this acceleration, how do you support an existing
> > > > > userspace that doesn't understand the new dynamic mmap semantics and
> > > > > continues to try to operate with the initial sparse mmap? Doesn't this
> > > > > introduce another example of the raciness of the device trying to
> > > > > switch mmaps? Seems that if QEMU doesn't handle the eventfd with
> > > > > sufficient timeliness the switch back to trap behavior could miss an
> > > > > important transaction. This also seems like an optimization targeted
> > > > > at VMs running for only a short time, where it's not obvious to me that
> > > > > GVT-g overlaps those sorts of use cases. How much initialization time
> > > > > is actually being saved with such a hack? Thanks,
> > > > >
> > > > It can save about 4s initialization time with such a hack. But you are
> > > > right, the backward compatibility is a problem and we are not going to
> > > > upstream that. Just an example to show the usage.
> > > > It's fine if we drop the way of asynchronous kernel notification.
> > >
> > > I think to handle such a situation we'd need a mechanism to revoke the
> > > user's mmap. We can make use of an asynchronous mechanism to improve
> > > performance of a device, but we need a synchronous mechanism to
> > > maintain correctness. For this example, the sparse mmap capability
> > > could advertise the section of the BAR as mmap'able and revoke that
> > > user mapping after the device finishes the initialization phase.
> > > Potentially the user re-evaluating region_info after the initialization
> > > phase would see a different sparse mmap capability excluding these
> > > sections, but then we might need to think whether we want to suggest
> > > that the user always re-read the region_info after device reset. AFAIK,
> > > we currently have no mechanism to revoke user mmaps. Thanks,
> > >
> > Actually I think the "disable-able" flag is good except for its backward
> > compatibility :)
>
> Setting a flag on a section of a region doesn't solve the asynchronous
> problem. Thanks,
>
yes. I mean we are ok to give up the kernel-triggered re-evaluation for
now, as the DSA and GVT use cases are not yet at the upstream phase and we
can add it later if necessary.
And I think the "disable-able" flag is a way to re-use the existing sparse
mmap, because otherwise we have to either re-evaluate the region_info or
introduce new caps like migration_sparse_mmap and reset_sparse_mmap. But I
agree, this flag may cause problems for old QEMUs.
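
To make the new-cap alternative concrete, here is a minimal userspace-side
sketch that reuses the existing sparse mmap layout under a new capability
id. The VFIO_REGION_INFO_CAP_MIGRATION_SPARSE_MMAP name and its value are
hypothetical, not existing uAPI; only the cap-chain walk and the runtime
sparse mmap cap are real:

#include <stdbool.h>
#include <stdint.h>
#include <linux/vfio.h>

#define VFIO_REGION_INFO_CAP_MIGRATION_SPARSE_MMAP 6   /* hypothetical id */

/* info points at the full buffer returned by VFIO_DEVICE_GET_REGION_INFO,
 * so the capability chain that follows the base struct is available. */
static struct vfio_info_cap_header *
find_region_cap(struct vfio_region_info *info, uint16_t id)
{
        struct vfio_info_cap_header *hdr;
        uint32_t off;

        if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS))
                return NULL;

        for (off = info->cap_offset; off; off = hdr->next) {
                hdr = (struct vfio_info_cap_header *)((char *)info + off);
                if (hdr->id == id)
                        return hdr;
        }
        return NULL;
}

/* After switching the device to _SAVING, prefer migration-time sparse
 * mmap areas if the vendor driver advertises them, otherwise fall back
 * to the runtime VFIO_REGION_INFO_CAP_SPARSE_MMAP areas. */
static struct vfio_region_info_cap_sparse_mmap *
select_sparse_mmap(struct vfio_region_info *info, bool saving)
{
        struct vfio_info_cap_header *hdr = NULL;

        if (saving)
                hdr = find_region_cap(info,
                                VFIO_REGION_INFO_CAP_MIGRATION_SPARSE_MMAP);
        if (!hdr)
                hdr = find_region_cap(info, VFIO_REGION_INFO_CAP_SPARSE_MMAP);

        return (struct vfio_region_info_cap_sparse_mmap *)hdr;
}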

Thanks
Yan

2019-12-12 03:09:01

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

On Wed, 11 Dec 2019 21:02:40 -0500
Yan Zhao <[email protected]> wrote:

> On Thu, Dec 12, 2019 at 02:56:55AM +0800, Alex Williamson wrote:
> > On Wed, 11 Dec 2019 01:25:55 -0500
> > Yan Zhao <[email protected]> wrote:
> >
> > > On Wed, Dec 11, 2019 at 12:38:05AM +0800, Alex Williamson wrote:
> > > > On Tue, 10 Dec 2019 02:44:44 -0500
> > > > Yan Zhao <[email protected]> wrote:
> > > >
> > > > > On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:
> > > > > > On Mon, 9 Dec 2019 01:22:12 -0500
> > > > > > Yan Zhao <[email protected]> wrote:
> > > > > >
> > > > > > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:
> > > > > > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > > > > > Yan Zhao <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> > > > > > > > > > On Wed, 4 Dec 2019 22:26:50 -0500
> > > > > > > > > > Yan Zhao <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > > > > > > > > communicate dynamic trap info. It is of type
> > > > > > > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > > > > > >
> > > > > > > > > > > This region has two fields: dt_fd and trap.
> > > > > > > > > > > When QEMU detects a device region of this type, it will create an
> > > > > > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > > > > > When vendor driver signals this eventfd, QEMU reads trap field of this
> > > > > > > > > > > info region.
> > > > > > > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > > > > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > > > > > > > > mmaped subregion is disablable).
> > > > > > > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > > > > > >
> > > > > > > > > > > A typical usage is
> > > > > > > > > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > > > > > > > > sparse mmap array. So initially, all its bar 0 are passthroughed.
> > > > > > > > > > > 2. vendor driver specifies part of bar 0 sections to be disablable.
> > > > > > > > > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > > > > > > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > > > > > > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > > > > > > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > > > > > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > > > > > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > > > > > > > re-passthrough the whole bar 0 region.
> > > > > > > > > > >
> > > > > > > > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > > > > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > > > > > > vfio_pci_mediate_ops->open().
> > > > > > > > > > >
> > > > > > > > > > > If vfio-pci detects this cap, it will create a default
> > > > > > > > > > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > > > > > > > > > and region->ops=null.
> > > > > > > > > > > Vendor driver should override this region's len, flags, rw, mmap in its
> > > > > > > > > > > vfio_pci_mediate_ops.
> > > > > > > > > >
> > > > > > > > > > TBH, I don't like this interface at all. Userspace doesn't pass data
> > > > > > > > > > to the kernel via INFO ioctls. We have a SET_IRQS ioctl for
> > > > > > > > > > configuring user signaling with eventfds. I think we only need to
> > > > > > > > > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > > > > > > > > information for a region. The user would enumerate the device IRQs via
> > > > > > > > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > > > > > > > indicate which region(s) should be re-evaluated on signaling. The user
> > > > > > > > > > would enable that signaling via SET_IRQS and simply re-evaluate the
> > > > > > > > > ok. I'll try to switch to this way. Thanks for this suggestion.
> > > > > > > > >
> > > > > > > > > > sparse mmap capability for the associated regions when signaled.
> > > > > > > > >
> > > > > > > > > Do you like the "disablable" flag of sparse mmap ?
> > > > > > > > > I think it's a lightweight way for user to switch mmap state of a whole region,
> > > > > > > > > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > > > > > > > > region might be too heavy.
> > > > > > > >
> > > > > > > > No, I don't like the disable-able flag. At what frequency do we expect
> > > > > > > > regions to change? It seems like we'd only change when switching into
> > > > > > > > and out of the _SAVING state, which is rare. It seems easy for
> > > > > > > > userspace, at least QEMU, to drop the entire mmap configuration and
> > > > > > > ok. I'll try this way.
> > > > > > >
> > > > > > > > re-read it. Another concern here is how do we synchronize the event?
> > > > > > > > Are we assuming that this event would occur when a user switch to
> > > > > > > > _SAVING mode on the device? That operation is synchronous, the device
> > > > > > > > must be in saving mode after the write to device state completes, but
> > > > > > > > it seems like this might be trying to add an asynchronous dependency.
> > > > > > > > Will the write to device_state only complete once the user handles the
> > > > > > > > eventfd? How would the kernel know when the mmap re-evaluation is
> > > > > > > > complete. It seems like there are gaps here that the vendor driver
> > > > > > > > could miss traps required for migration because the user hasn't
> > > > > > > > completed the mmap transition yet. Thanks,
> > > > > > > >
> > > > > > > > Alex
> > > > > > >
> > > > > > > yes, this asynchronous event notification will cause vendor driver miss
> > > > > > > traps. But it's supposed to be of very short period time. That's also a
> > > > > > > reason for us to wish the re-evaluation to be lightweight. E.g. if it's
> > > > > > > able to be finished before the first iterate, it's still safe.
> > > > > >
> > > > > > Making the re-evaluation lightweight cannot solve the race, it only
> > > > > > masks it.
> > > > > >
> > > > > > > But I agree, the timing is not guaranteed, and so it's best for kernel
> > > > > > > to wait for mmap re-evaluation to complete.
> > > > > > >
> > > > > > > migration_thread
> > > > > > > |->qemu_savevm_state_setup
> > > > > > > | |->ram_save_setup
> > > > > > > | | |->migration_bitmap_sync
> > > > > > > | | |->kvm_log_sync
> > > > > > > | | |->vfio_log_sync
> > > > > > > | |
> > > > > > > | |->vfio_save_setup
> > > > > > > | |->set_device_state(_SAVING)
> > > > > > > |
> > > > > > > |->qemu_savevm_state_pending
> > > > > > > | |->ram_save_pending
> > > > > > > | | |->migration_bitmap_sync
> > > > > > > | | |->kvm_log_sync
> > > > > > > | | |->vfio_log_sync
> > > > > > > | |->vfio_save_pending
> > > > > > > |
> > > > > > > |->qemu_savevm_state_iterate
> > > > > > > | |->ram_save_iterate //send pages
> > > > > > > | |->vfio_save_iterate
> > > > > > > ...
> > > > > > >
> > > > > > >
> > > > > > > Actually, we previously let qemu trigger the re-evaluation when migration starts.
> > > > > > > And now the reason for we to wish kernel to trigger the mmap re-evaluation is that
> > > > > > > there're other two possible use cases:
> > > > > > > (1) keep passing through devices when migration starts and track dirty pages
> > > > > > > using hardware IOMMU. Then when migration is about to complete, stop the
> > > > > > > device and start trap PCI BARs for software emulation. (we made some
> > > > > > > changes to let device stop ahead of vcpu )
> > > > > >
> > > > > > How is that possible? I/O devices need to continue to work until the
> > > > > > vCPU stops otherwise the vCPU can get blocked on the device. Maybe QEMU
> > > > > hi Alex
> > > > > For devices like DSA [1], it can support SVM mode. In this mode, when a
> > > > > page fault happens, the Intel DSA device blocks until the page fault is
> > > > > resolved, if PRS is enabled; otherwise it is reported as an error.
> > > > >
> > > > > Therefore, to pass through DSA into guest and do live migration with it,
> > > > > it is desired to stop DSA before stopping vCPU, as there may be an
> > > > > outstanding page fault to be resolved.
> > > > >
> > > > > During the period when DSA is stopped and vCPUs are still running, all the
> > > > > pass-through resources are trapped and emulated by host mediation driver until
> > > > > vCPUs stop.
> > > >
> > > > If the DSA is stopped and resources are trapped and emulated, then is
> > > > the device really stopped from a QEMU perspective or has it simply
> > > > switched modes underneath QEMU? If the device is truly stopped, then
> > > > I'd like to understand how a vCPU doing a PIO read from the device
> > > > wouldn't wedge the VM.
> > > >
> > > It doesn't matter if the device is truly stopped or not (although from
> > > my point of view, just draining commands and keeping device running is
> > > better as it handles live migration failure better).
> > > PIOs also need to be trapped and emulated if a vCPU accesses them.
> >
> > We seem to be talking around each other here. If PIOs are trapped and
> > emulated then the device is not "stopped" as far as QEMU is concerned,
> > right? "Stopping" a device suggests to me that a running vCPU doing a
> > PIO read from the device would block and cause problems in the still
> > running VM. So I think you're suggesting some sort of mode switch in
> > the device where direct access is disabled and emulation takes over
> > until the vCPUs are stopped.
>
> sorry for this confusion.
> yes, it's a kind of mode switch from a QEMU perspective.
> Currently, its implementation in our local branch is like that:
> 1. before migration thread stopping vCPUs, a migration state
> (COMPLETING) notification is sent to vfio migration state notifier, and
> this notifier would put device state to !RUNNING, and put all BARs to trap
> state.
> 2. in the kernel, when device state is set to !RUNNING, draining all
> pending device requests, and starts emulation.
>
> This implementation has two issues:
> 1. it requires hardcode in QEMU to put all BARs trapped and the time
> spending on revoking mmaps is not necessary for devices that do not need it.
> 2. !RUNNING state here is not accurate and it will confuse vendor
> drivers who stop devices after vCPUs stop.
>
> For the 2nd issue, I think we can propose a new device state like
> PRE-STOPPING.

Yes, this is absolutely abusing the !RUNNING state; if the device is
still processing accesses by the vCPU, it's still running.

> But for the 1st issue, not sure how to fix it right now.
> Maybe we can still add an asynchronous kernel notification and wait until
> QEMU have switched the region mmap state?

It seems like you're preemptively trying to optimize the SAVING state
before we even have migration working. Shouldn't SAVING be the point
at which you switch to trapping the device in order to track it?
Thanks,

Alex

2019-12-12 03:21:57

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

On Thu, Dec 12, 2019 at 11:07:42AM +0800, Alex Williamson wrote:
> On Wed, 11 Dec 2019 21:02:40 -0500
> Yan Zhao <[email protected]> wrote:
>
> > On Thu, Dec 12, 2019 at 02:56:55AM +0800, Alex Williamson wrote:
> > > On Wed, 11 Dec 2019 01:25:55 -0500
> > > Yan Zhao <[email protected]> wrote:
> > >
> > > > On Wed, Dec 11, 2019 at 12:38:05AM +0800, Alex Williamson wrote:
> > > > > On Tue, 10 Dec 2019 02:44:44 -0500
> > > > > Yan Zhao <[email protected]> wrote:
> > > > >
> > > > > > On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:
> > > > > > > On Mon, 9 Dec 2019 01:22:12 -0500
> > > > > > > Yan Zhao <[email protected]> wrote:
> > > > > > >
> > > > > > > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:
> > > > > > > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > > > > > > Yan Zhao <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> > > > > > > > > > > On Wed, 4 Dec 2019 22:26:50 -0500
> > > > > > > > > > > Yan Zhao <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > > > > > > > > > communicate dynamic trap info. It is of type
> > > > > > > > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > > > > > > >
> > > > > > > > > > > > This region has two fields: dt_fd and trap.
> > > > > > > > > > > > When QEMU detects a device region of this type, it will create an
> > > > > > > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > > > > > > When vendor driver signals this eventfd, QEMU reads trap field of this
> > > > > > > > > > > > info region.
> > > > > > > > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > > > > > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > > > > > > > > > mmaped subregion is disablable).
> > > > > > > > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > > > > > > >
> > > > > > > > > > > > A typical usage is
> > > > > > > > > > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > > > > > > > > > sparse mmap array. So initially, all its bar 0 are passthroughed.
> > > > > > > > > > > > 2. vendor driver specifies part of bar 0 sections to be disablable.
> > > > > > > > > > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > > > > > > > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > > > > > > > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > > > > > > > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > > > > > > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > > > > > > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > > > > > > > > re-passthrough the whole bar 0 region.
> > > > > > > > > > > >
> > > > > > > > > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > > > > > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > > > > > > > vfio_pci_mediate_ops->open().
> > > > > > > > > > > >
> > > > > > > > > > > > If vfio-pci detects this cap, it will create a default
> > > > > > > > > > > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > > > > > > > > > > and region->ops=null.
> > > > > > > > > > > > Vendor driver should override this region's len, flags, rw, mmap in its
> > > > > > > > > > > > vfio_pci_mediate_ops.
> > > > > > > > > > >
> > > > > > > > > > > TBH, I don't like this interface at all. Userspace doesn't pass data
> > > > > > > > > > > to the kernel via INFO ioctls. We have a SET_IRQS ioctl for
> > > > > > > > > > > configuring user signaling with eventfds. I think we only need to
> > > > > > > > > > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > > > > > > > > > information for a region. The user would enumerate the device IRQs via
> > > > > > > > > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > > > > > > > > indicate which region(s) should be re-evaluated on signaling. The user
> > > > > > > > > > > would enable that signaling via SET_IRQS and simply re-evaluate the
> > > > > > > > > > ok. I'll try to switch to this way. Thanks for this suggestion.
> > > > > > > > > >
> > > > > > > > > > > sparse mmap capability for the associated regions when signaled.
> > > > > > > > > >
> > > > > > > > > > Do you like the "disablable" flag of sparse mmap ?
> > > > > > > > > > I think it's a lightweight way for user to switch mmap state of a whole region,
> > > > > > > > > > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > > > > > > > > > region might be too heavy.
> > > > > > > > >
> > > > > > > > > No, I don't like the disable-able flag. At what frequency do we expect
> > > > > > > > > regions to change? It seems like we'd only change when switching into
> > > > > > > > > and out of the _SAVING state, which is rare. It seems easy for
> > > > > > > > > userspace, at least QEMU, to drop the entire mmap configuration and
> > > > > > > > ok. I'll try this way.
> > > > > > > >
> > > > > > > > > re-read it. Another concern here is how do we synchronize the event?
> > > > > > > > > Are we assuming that this event would occur when a user switch to
> > > > > > > > > _SAVING mode on the device? That operation is synchronous, the device
> > > > > > > > > must be in saving mode after the write to device state completes, but
> > > > > > > > > it seems like this might be trying to add an asynchronous dependency.
> > > > > > > > > Will the write to device_state only complete once the user handles the
> > > > > > > > > eventfd? How would the kernel know when the mmap re-evaluation is
> > > > > > > > > complete. It seems like there are gaps here that the vendor driver
> > > > > > > > > could miss traps required for migration because the user hasn't
> > > > > > > > > completed the mmap transition yet. Thanks,
> > > > > > > > >
> > > > > > > > > Alex
> > > > > > > >
> > > > > > > > yes, this asynchronous event notification will cause vendor driver miss
> > > > > > > > traps. But it's supposed to be of very short period time. That's also a
> > > > > > > > reason for us to wish the re-evaluation to be lightweight. E.g. if it's
> > > > > > > > able to be finished before the first iterate, it's still safe.
> > > > > > >
> > > > > > > Making the re-evaluation lightweight cannot solve the race, it only
> > > > > > > masks it.
> > > > > > >
> > > > > > > > But I agree, the timing is not guaranteed, and so it's best for kernel
> > > > > > > > to wait for mmap re-evaluation to complete.
> > > > > > > >
> > > > > > > > migration_thread
> > > > > > > > |->qemu_savevm_state_setup
> > > > > > > > | |->ram_save_setup
> > > > > > > > | | |->migration_bitmap_sync
> > > > > > > > | | |->kvm_log_sync
> > > > > > > > | | |->vfio_log_sync
> > > > > > > > | |
> > > > > > > > | |->vfio_save_setup
> > > > > > > > | |->set_device_state(_SAVING)
> > > > > > > > |
> > > > > > > > |->qemu_savevm_state_pending
> > > > > > > > | |->ram_save_pending
> > > > > > > > | | |->migration_bitmap_sync
> > > > > > > > | | |->kvm_log_sync
> > > > > > > > | | |->vfio_log_sync
> > > > > > > > | |->vfio_save_pending
> > > > > > > > |
> > > > > > > > |->qemu_savevm_state_iterate
> > > > > > > > | |->ram_save_iterate //send pages
> > > > > > > > | |->vfio_save_iterate
> > > > > > > > ...
> > > > > > > >
> > > > > > > >
> > > > > > > > Actually, we previously let qemu trigger the re-evaluation when migration starts.
> > > > > > > > And now the reason for we to wish kernel to trigger the mmap re-evaluation is that
> > > > > > > > there're other two possible use cases:
> > > > > > > > (1) keep passing through devices when migration starts and track dirty pages
> > > > > > > > using hardware IOMMU. Then when migration is about to complete, stop the
> > > > > > > > device and start trap PCI BARs for software emulation. (we made some
> > > > > > > > changes to let device stop ahead of vcpu )
> > > > > > >
> > > > > > > How is that possible? I/O devices need to continue to work until the
> > > > > > > vCPU stops otherwise the vCPU can get blocked on the device. Maybe QEMU
> > > > > > hi Alex
> > > > > > For devices like DSA [1], it can support SVM mode. In this mode, when a
> > > > > > page fault happens, the Intel DSA device blocks until the page fault is
> > > > > > resolved, if PRS is enabled; otherwise it is reported as an error.
> > > > > >
> > > > > > Therefore, to pass through DSA into guest and do live migration with it,
> > > > > > it is desired to stop DSA before stopping vCPU, as there may be an
> > > > > > outstanding page fault to be resolved.
> > > > > >
> > > > > > During the period when DSA is stopped and vCPUs are still running, all the
> > > > > > pass-through resources are trapped and emulated by host mediation driver until
> > > > > > vCPUs stop.
> > > > >
> > > > > If the DSA is stopped and resources are trapped and emulated, then is
> > > > > the device really stopped from a QEMU perspective or has it simply
> > > > > switched modes underneath QEMU? If the device is truly stopped, then
> > > > > I'd like to understand how a vCPU doing a PIO read from the device
> > > > > wouldn't wedge the VM.
> > > > >
> > > > It doesn't matter if the device is truly stopped or not (although from
> > > > my point of view, just draining commands and keeping device running is
> > > > better as it handles live migration failure better).
> > > > PIOs also need to be trapped and emulated if a vCPU accesses them.
> > >
> > > We seem to be talking around each other here. If PIOs are trapped and
> > > emulated then the device is not "stopped" as far as QEMU is concerned,
> > > right? "Stopping" a device suggests to me that a running vCPU doing a
> > > PIO read from the device would block and cause problems in the still
> > > running VM. So I think you're suggesting some sort of mode switch in
> > > the device where direct access is disabled and emulation takes over
> > > until the vCPUs are stopped.
> >
> > sorry for this confusion.
> > yes, it's a kind of mode switch from a QEMU perspective.
> > Currently, its implementation in our local branch is like that:
> > 1. before migration thread stopping vCPUs, a migration state
> > (COMPLETING) notification is sent to vfio migration state notifier, and
> > this notifier would put device state to !RUNNING, and put all BARs to trap
> > state.
> > 2. in the kernel, when device state is set to !RUNNING, draining all
> > pending device requests, and starts emulation.
> >
> > This implementation has two issues:
> > 1. it requires hardcode in QEMU to put all BARs trapped and the time
> > spending on revoking mmaps is not necessary for devices that do not need it.
> > 2. !RUNNING state here is not accurate and it will confuse vendor
> > drivers who stop devices after vCPUs stop.
> >
> > For the 2nd issue, I think we can propose a new device state like
> > PRE-STOPPING.
>
> Yes, this is absolutely abusing the !RUNNING state, if the device is
> still processing accesses by the vCPU, it's still running.
>
> > But for the 1st issue, not sure how to fix it right now.
> > Maybe we can still add an asynchronous kernel notification and wait until
> > QEMU have switched the region mmap state?
>
> It seems like you're preemptively trying to optimize the SAVING state
> before we even have migration working. Shouldn't SAVING be the point
> at which you switch to trapping the device in order to track it?
> Thanks,

But for some devices, starting to trap on entering the SAVING state is too
early. They don't really need the trapping until the PRE_STOPPING stage.
E.g. DSA can get dirty pages without trapping; the intention for it to
enter trap mode is not for SAVING, but for emulation.

Thanks
Yan

2019-12-12 03:51:21

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] Introduce mediate ops in vfio-pci


On 2019/12/6 8:49 PM, Yan Zhao wrote:
> On Fri, Dec 06, 2019 at 05:40:02PM +0800, Jason Wang wrote:
>> On 2019/12/6 4:22 PM, Yan Zhao wrote:
>>> On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
>>>> On 2019/12/5 4:51 PM, Yan Zhao wrote:
>>>>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
>>>>>> Hi:
>>>>>>
>>>>>> On 2019/12/5 11:24 AM, Yan Zhao wrote:
>>>>>>> For SRIOV devices, VFs are passthroughed into guest directly without host
>>>>>>> driver mediation. However, when VMs migrating with passthroughed VFs,
>>>>>>> dynamic host mediation is required to (1) get device states, (2) get
>>>>>>> dirty pages. Since device states as well as other critical information
>>>>>>> required for dirty page tracking for VFs are usually retrieved from PFs,
>>>>>>> it is handy to provide an extension in PF driver to centralizingly control
>>>>>>> VFs' migration.
>>>>>>>
>>>>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
>>>>>>> dynamically trap VFs' bars for dirty page tracking and
>>>>>> A silly question, what's the reason for doing this, is this a must for dirty
>>>>>> page tracking?
>>>>>>
>>>>> For performance consideration. VFs' bars should be passthoughed at
>>>>> normal time and only enter into trap state on need.
>>>> Right, but how does this matter for the case of dirty page tracking?
>>>>
>>> Take NIC as an example, to trap its VF dirty pages, software way is
>>> required to trap every write of ring tail that resides in BAR0.
>>
>> Interesting, but it looks like we need:
>> - decode the instruction
>> - mediate all access to BAR0
>> All of which seems a great burden for the VF driver. I wonder whether or
>> not doing interrupt relay and tracking head is better in this case.
>>
> hi Jason
>
> not familiar with the way you mentioned. could you elaborate more?


It looks to me that you want to intercept the bar that contains the
head. Then you can figure out the buffers submitted by the driver, but
you still need to decide a proper time to mark them as dirty.

What I meant is: intercept the interrupt, then you can still figure out
the buffers which have been modified by the device and mark them as
dirty.

Then there's no need to trap BAR and do decoding/emulation etc.

But it will still be tricky to be correct...
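
A very rough sketch of that interrupt-relay idea for a ring-based VF RX
queue; every structure and helper below is hypothetical, only meant to show
where the dirty marking would happen:

#include <linux/types.h>

struct vf_desc {                        /* hypothetical shadow descriptor */
        u64     dma_addr;
        u32     len;
};

struct vf_ring {                        /* hypothetical per-queue state */
        struct vf_desc  *desc;
        u32             size;
        u32             last_head;      /* last completion index seen */
};

u32 vf_read_ring_head(struct vf_ring *ring);            /* hypothetical helpers */
void vf_mark_guest_pages_dirty(u64 dma_addr, u32 len);
void vf_forward_irq_to_guest(struct vf_ring *ring);

static void vf_rx_irq_relay(struct vf_ring *ring)
{
        u32 head = vf_read_ring_head(ring);     /* e.g. from descriptor write-back */

        while (ring->last_head != head) {
                struct vf_desc *d = &ring->desc[ring->last_head];

                /* the device DMA'ed into this buffer, so its pages are dirty */
                vf_mark_guest_pages_dirty(d->dma_addr, d->len);
                ring->last_head = (ring->last_head + 1) % ring->size;
        }

        vf_forward_irq_to_guest(ring);          /* relay the original interrupt */
}

/* Caveat, as noted above: a guest driver that polls with interrupts
 * disabled never takes this path, which is part of why getting this right
 * in general is tricky. */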


>>> There's
>>> still no IOMMU Dirty bit available.
>>>>>>> (3) centralizing
>>>>>>> VF critical states retrieving and VF controls into one driver, we propose
>>>>>>> to introduce mediate ops on top of current vfio-pci device driver.
>>>>>>>
>>>>>>>
>>>>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>>>>>> __________ register mediate ops| ___________ ___________ |
>>>>>>> | |<-----------------------| VF | | |
>>>>>>> | vfio-pci | | | mediate | | PF driver | |
>>>>>>> |__________|----------------------->| driver | |___________|
>>>>>>> | open(pdev) | ----------- | |
>>>>>>> | |
>>>>>>> | |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
>>>>>>> \|/ \|/
>>>>>>> ----------- ------------
>>>>>>> | VF | | PF |
>>>>>>> ----------- ------------
>>>>>>>
>>>>>>>
>>>>>>> VF mediate driver could be a standalone driver that does not bind to
>>>>>>> any devices (as in demo code in patches 5-6) or it could be a built-in
>>>>>>> extension of PF driver (as in patches 7-9) .
>>>>>>>
>>>>>>> Rather than directly bind to VF, VF mediate driver register a mediate
>>>>>>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
>>>>>>> mediate ops.
>>>>>>> (Note that: VF mediate driver can register mediate ops into vfio-pci
>>>>>>> before vfio-pci binding to any devices. And VF mediate driver can
>>>>>>> support mediating multiple devices.)
>>>>>>>
>>>>>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
>>>>>>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
>>>>>>> device as a parameter.
>>>>>>> VF mediate driver should return success or failure depending on it
>>>>>>> supports the pdev or not.
>>>>>>> E.g. VF mediate driver would compare its supported VF devfn with the
>>>>>>> devfn of the passed-in pdev.
>>>>>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
>>>>>>> stop querying other mediate ops and bind the opening device with this
>>>>>>> mediate ops using the returned mediate handle.
>>>>>>>
>>>>>>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
>>>>>>> VF will be intercepted into VF mediate driver as
>>>>>>> vfio_pci_mediate_ops->get_region_info(),
>>>>>>> vfio_pci_mediate_ops->rw,
>>>>>>> vfio_pci_mediate_ops->mmap, and get customized.
>>>>>>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
>>>>>>> further return 'pt' to indicate whether vfio-pci should further
>>>>>>> passthrough data to hw.
>>>>>>>
>>>>>>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
>>>>>>> with a mediate handle as parameter.
>>>>>>>
>>>>>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
>>>>>>> mediate driver be able to differentiate two opening VFs of the same device
>>>>>>> id and vendor id.
>>>>>>>
>>>>>>> When VF mediate driver exits, it unregisters its mediate ops from
>>>>>>> vfio-pci.
>>>>>>>
>>>>>>>
>>>>>>> In this patchset, we enable vfio-pci to provide 3 things:
>>>>>>> (1) calling mediate ops to allow vendor driver customizing default
>>>>>>> region info/rw/mmap of a region.
>>>>>>> (2) provide a migration region to support migration
>>>>>> What's the benefit of introducing a region? It looks to me we don't expect
>>>>>> the region to be accessed directly from guest. Could we simply extend device
>>>>>> fd ioctl for doing such things?
>>>>>>
>>>>> You may take a look on mdev live migration discussions in
>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>>>>>
>>>>> or previous discussion at
>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
>>>>> which has kernel side implemetation https://patchwork.freedesktop.org/series/56876/
>>>>>
>>>>> generaly speaking, qemu part of live migration is consistent for
>>>>> vfio-pci + mediate ops way or mdev way.
>>>> So in mdev, do you still have a mediate driver? Or you expect the parent
>>>> to implement the region?
>>>>
>>> No, currently it's only for vfio-pci.
>> And specific to PCI.
>>
>>> mdev parent driver is free to customize its regions and hence does not
>>> requires this mediate ops hooks.
>>>
>>>>> The region is only a channel for
>>>>> QEMU and kernel to communicate information without introducing IOCTLs.
>>>> Well, at least you introduce new type of region in uapi. So this does
>>>> not answer why region is better than ioctl. If the region will only be
>>>> used by qemu, using ioctl is much more easier and straightforward.
>>>>
>>> It's not introduced by me :)
>>> mdev live migration is actually using this way, I'm just keeping
>>> compatible to the uapi.
>>
>> I meant e.g VFIO_REGION_TYPE_MIGRATION.
>>
> here's the history of vfio live migration:
> https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg05564.html
> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html
> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>
> If you have any concern of this region way, feel free to comment to the
> latest v9 patchset:
> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>
> The patchset here will always keep compatible to there.


Sure.


>>> From my own perspective, my answer is that a region is more flexible
>>> compared to ioctl. vendor driver can freely define the size,
>>>
>> Probably not since it's an ABI I think.
>>
> that's why I need to define VFIO_REGION_TYPE_MIGRATION here in this
> patchset, as it's not upstreamed yet.
> maybe I should make it into a prerequisite patch, indicating it is not
> introduced by this patchset


Yes.


>
>>> mmap cap of
>>> its data subregion.
>>>
>> It doesn't help much unless it can be mapped into guest (which I don't
>> think it was the case here).
>>
> it's access by host qemu, the same as how linux app access an mmaped
> memory. the mmap here is to reduce memory copy from kernel to user.
> No need to get mapped into guest.


But copy_to_user() is not a bad choice. If I read the code correctly,
only the dirty bitmap was mmaped. This means you probably need to deal
with the dcache carefully on some archs. [1]

Note that KVM doesn't use a shared dirty bitmap; it uses copy_to_user().

[1] https://lkml.org/lkml/2019/4/9/5
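
A minimal sketch of the copy_to_user() style for comparison; the per-VF
context, the locking and the read path are assumptions, only
copy_to_user() and bitmap_zero() are real kernel APIs:

#include <linux/bitmap.h>
#include <linux/errno.h>
#include <linux/mutex.h>
#include <linux/types.h>
#include <linux/uaccess.h>

struct vf_migration {                   /* hypothetical per-VF context */
        struct mutex    lock;
        unsigned long   *dirty_bitmap;
        unsigned long   bitmap_npages;  /* number of tracked pages */
};

/* Copy a snapshot of the dirty bitmap out to userspace and clear it,
 * instead of exposing the bitmap via mmap. */
static ssize_t vf_read_dirty_bitmap(struct vf_migration *mig,
                                    char __user *buf, size_t len)
{
        size_t bytes = BITS_TO_LONGS(mig->bitmap_npages) * sizeof(long);

        if (len < bytes)
                return -EINVAL;

        mutex_lock(&mig->lock);
        if (copy_to_user(buf, mig->dirty_bitmap, bytes)) {
                mutex_unlock(&mig->lock);
                return -EFAULT;
        }
        /* pages dirtied after this point show up in the next round */
        bitmap_zero(mig->dirty_bitmap, mig->bitmap_npages);
        mutex_unlock(&mig->lock);

        return bytes;
}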


>
>>> Also, there're already too many ioctls in vfio.
>> Probably not :) We had a brunch of  subsystems that have much more
>> ioctls than VFIO. (e.g DRM)
>>
>>>>>>> (3) provide a dynamic trap bar info region to allow vendor driver
>>>>>>> control trap/untrap of device pci bars
>>>>>>>
>>>>>>> This vfio-pci + mediate ops way differs from mdev way in that
>>>>>>> (1) medv way needs to create a 1:1 mdev device on top of one VF, device
>>>>>>> specific mdev parent driver is bound to VF directly.
>>>>>>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
>>>>>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
>>>>>>>
>>>>>>> The reason why we don't choose the way of writing mdev parent driver is
>>>>>>> that
>>>>>>> (1) VFs are almost all the time directly passthroughed. Directly binding
>>>>>>> to vfio-pci can make most of the code shared/reused.
>>>>>> Can we split out the common parts from vfio-pci?
>>>>>>
>>>>> That's very attractive. but one cannot implement a vfio-pci except
>>>>> export everything in it as common part :)
>>>> Well, I think there should be not hard to do that. E..g you can route it
>>>> back to like:
>>>>
>>>> vfio -> vfio_mdev -> parent -> vfio_pci
>>>>
>>> it's desired for us to have mediate driver binding to PF device.
>>> so once a VF device is created, only PF driver and vfio-pci are
>>> required. Just the same as what needs to be done for a normal VF passthrough.
>>> otherwise, a separate parent driver binding to VF is required.
>>> Also, this parent driver has many drawbacks as I mentions in this
>>> cover-letter.
>> Well, as discussed, no need to duplicate the code, bar trick should
>> still work. The main issues I saw with this proposal is:
>>
>> 1) PCI specific, other bus may need something similar
> vfio-pci is only for PCI of course.


I meant that if what you propose here makes sense, other bus drivers like
vfio-platform may want something similar.


>
>> 2) Function duplicated with mdev and mdev can do even more
>>
> could you elaborate how mdev can do solve the above saying problem ?


Well, I think both of us agree that mdev can do what the mediate ops do;
an mdev device implementation just needs to add the direct PCI access part.


>>>>>>> If we write a
>>>>>>> vendor specific mdev parent driver, most of the code (like passthrough
>>>>>>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
>>>>>>> actually a duplicated and tedious work.
>>>>>> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
>>>>>> me we need to consider live migration for mdev as well. In that case, do we
>>>>>> still expect mediate ops through VFIO directly?
>>>>>>
>>>>>>
>>>>>>> (2) For features like dynamically trap/untrap pci bars, if they are in
>>>>>>> vfio-pci, they can be available to most people without repeated code
>>>>>>> copying and re-testing.
>>>>>>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
>>>>>>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
>>>>>>> it runs into a real migration need. However, if vfio-pci is bound
>>>>>>> initially, they have no chance to do live migration when there's a need
>>>>>>> later.
>>>>>> We can teach management layer to do this.
>>>>>>
>>>>> No. not possible as vfio-pci by default has no migration region and
>>>>> dirty page tracking needs vendor's mediation at least for most
>>>>> passthrough devices now.
>>>> I'm not quite sure I get here but in this case, just tech them to use
>>>> the driver that has migration support?
>>>>
>>> That's a way, but as more and more passthrough devices have demands and
>>> caps to do migration, will vfio-pci be used in future any more ?
>>
>> This should not be a problem:
>> - If we introduce a common mdev for vfio-pci, we can just bind that
>> driver always
> what is common mdev for vfio-pci? a common mdev parent driver that have
> the same implementation as vfio-pci?


The common part is not PCI, of course. The common part is that both mdev
and mediate ops want to do some kind of mediation. Mdev is bus agnostic,
but what you propose here is PCI specific, though it should be bus agnostic
as well. Assuming we implement a bus-agnostic mediate ops, mdev could even
be built on top.
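
Purely to illustrate what "bus agnostic" could mean here (this is not
proposed kernel API, and the signatures are assumptions based on the ops
described in the cover letter), the mediate ops could be keyed on struct
device instead of struct pci_dev:

#include <linux/device.h>
#include <linux/mm_types.h>
#include <linux/types.h>
#include <linux/vfio.h>

/* Hypothetical bus-agnostic variant of the mediate ops: open() matches on
 * a struct device rather than a struct pci_dev, so a non-PCI vfio bus
 * driver or an mdev layer could sit on top of the same hooks. */
struct vfio_mediate_ops {
        /* return a per-device handle and supported caps, or an error if
         * this device is not handled by the mediate driver */
        int     (*open)(struct device *dev, u64 *caps, void **handle);
        void    (*release)(void *handle);
        int     (*get_region_info)(void *handle,
                                   struct vfio_region_info *info);
        /* 'pt' tells the vfio bus driver whether to also pass the access
         * through to hardware */
        ssize_t (*rw)(void *handle, char __user *buf, size_t count,
                      loff_t *ppos, bool iswrite, bool *pt);
        int     (*mmap)(void *handle, struct vm_area_struct *vma, bool *pt);
};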


>
> There's actually already a solution of creating only one mdev on top
> of each passthrough device, and make mdev share the same iommu group
> with it. We've also made an implementation on it already. here's a
> sample one made by Yi at https://patchwork.kernel.org/cover/11134695/.
>
> But, as I said, it's desired to re-use vfio-pci directly for SRIOV,
> which is straghtforward :)


Can we have a device that is capable of both SRIOV and function slicing?
If yes, does it mean you need to provide two drivers, one for mdev and
another for mediate ops?


>
>> - The most straightforward way to support dirty page tracking is done by
>> IOMMU instead of device specific operations.
>>
> No such IOMMU yet. And all kinds of platforms should be cared, right?


Or the device can track dirty pages by itself; otherwise it would be
very hard to implement dirty page tracking correctly without the help of
switching to a software datapath (or maybe you can post the BAR0
mediation and dirty page tracking parts that are missing from this series?)

Thanks


>
> Thanks
> Yan
>
>> Thanks
>>
>>> Thanks
>>> Yan
>>>
>>>> Thanks
>>>>
>>>>
>>>>> Thanks
>>>>> Yn
>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>> In this patchset,
>>>>>>> - patches 1-4 enable vfio-pci to call mediate ops registered by vendor
>>>>>>> driver to mediate/customize region info/rw/mmap.
>>>>>>>
>>>>>>> - patches 5-6 provide a standalone sample driver to register a mediate ops
>>>>>>> for Intel Graphics Devices. It does not bind to IGDs directly but decides
>>>>>>> what devices it supports via its pciidlist. It also demonstrates how to
>>>>>>> dynamic trap a device's PCI bars. (by adding more pciids in its
>>>>>>> pciidlist, this sample driver actually is not necessarily limited to
>>>>>>> support IGDs)
>>>>>>>
>>>>>>> - patch 7-9 provide a sample on i40e driver that supports Intel(R)
>>>>>>> Ethernet Controller XL710 Family of devices. It supports VF precopy live
>>>>>>> migration on Intel's 710 SRIOV. (but we commented out the real
>>>>>>> implementation of dirty page tracking and device state retrieving part
>>>>>>> to focus on demonstrating framework part. Will send out them in future
>>>>>>> versions)
>>>>>>> patch 7 registers/unregisters VF mediate ops when PF driver
>>>>>>> probes/removes. It specifies its supporting VFs via
>>>>>>> vfio_pci_mediate_ops->open(pdev)
>>>>>>>
>>>>>>> patch 8 reports device cap of VFIO_PCI_DEVICE_CAP_MIGRATION and
>>>>>>> provides a sample implementation of migration region.
>>>>>>> The QEMU part of vfio migration is based on v8
>>>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
>>>>>>> We do not based on recent v9 because we think there are still opens in
>>>>>>> dirty page track part in that series.
>>>>>>>
>>>>>>> patch 9 reports device cap of VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
>>>>>>> provides an example on how to trap part of bar0 when migration starts
>>>>>>> and passthrough this part of bar0 again when migration fails.
>>>>>>>
>>>>>>> Yan Zhao (9):
>>>>>>> vfio/pci: introduce mediate ops to intercept vfio-pci ops
>>>>>>> vfio/pci: test existence before calling region->ops
>>>>>>> vfio/pci: register a default migration region
>>>>>>> vfio-pci: register default dynamic-trap-bar-info region
>>>>>>> samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
>>>>>>> sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
>>>>>>> i40e/vf_migration: register mediate_ops to vfio-pci
>>>>>>> i40e/vf_migration: mediate migration region
>>>>>>> i40e/vf_migration: support dynamic trap of bar0
>>>>>>>
>>>>>>> drivers/net/ethernet/intel/Kconfig | 2 +-
>>>>>>> drivers/net/ethernet/intel/i40e/Makefile | 3 +-
>>>>>>> drivers/net/ethernet/intel/i40e/i40e.h | 2 +
>>>>>>> drivers/net/ethernet/intel/i40e/i40e_main.c | 3 +
>>>>>>> .../ethernet/intel/i40e/i40e_vf_migration.c | 626 ++++++++++++++++++
>>>>>>> .../ethernet/intel/i40e/i40e_vf_migration.h | 78 +++
>>>>>>> drivers/vfio/pci/vfio_pci.c | 189 +++++-
>>>>>>> drivers/vfio/pci/vfio_pci_private.h | 2 +
>>>>>>> include/linux/vfio.h | 18 +
>>>>>>> include/uapi/linux/vfio.h | 160 +++++
>>>>>>> samples/Kconfig | 6 +
>>>>>>> samples/Makefile | 1 +
>>>>>>> samples/vfio-pci/Makefile | 2 +
>>>>>>> samples/vfio-pci/igd_dt.c | 367 ++++++++++
>>>>>>> 14 files changed, 1455 insertions(+), 4 deletions(-)
>>>>>>> create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
>>>>>>> create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
>>>>>>> create mode 100644 samples/vfio-pci/Makefile
>>>>>>> create mode 100644 samples/vfio-pci/igd_dt.c
>>>>>>>

2019-12-12 04:12:43

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] Introduce mediate ops in vfio-pci


On 2019/12/7 1:42 AM, Alex Williamson wrote:
> On Fri, 6 Dec 2019 17:40:02 +0800
> Jason Wang <[email protected]> wrote:
>
>> On 2019/12/6 4:22 PM, Yan Zhao wrote:
>>> On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
>>>> On 2019/12/5 4:51 PM, Yan Zhao wrote:
>>>>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
>>>>>> Hi:
>>>>>>
>>>>>> On 2019/12/5 11:24 AM, Yan Zhao wrote:
>>>>>>> For SRIOV devices, VFs are passthroughed into guest directly without host
>>>>>>> driver mediation. However, when VMs migrating with passthroughed VFs,
>>>>>>> dynamic host mediation is required to (1) get device states, (2) get
>>>>>>> dirty pages. Since device states as well as other critical information
>>>>>>> required for dirty page tracking for VFs are usually retrieved from PFs,
>>>>>>> it is handy to provide an extension in PF driver to centralizingly control
>>>>>>> VFs' migration.
>>>>>>>
>>>>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
>>>>>>> dynamically trap VFs' bars for dirty page tracking and
>>>>>> A silly question, what's the reason for doing this, is this a must for dirty
>>>>>> page tracking?
>>>>>>
>>>>> For performance consideration. VFs' bars should be passthoughed at
>>>>> normal time and only enter into trap state on need.
>>>> Right, but how does this matter for the case of dirty page tracking?
>>>>
>>> Take NIC as an example, to trap its VF dirty pages, software way is
>>> required to trap every write of ring tail that resides in BAR0.
>>
>> Interesting, but it looks like we need:
>> - decode the instruction
>> - mediate all access to BAR0
>> All of which seems a great burden for the VF driver. I wonder whether or
>> not doing interrupt relay and tracking head is better in this case.
> This sounds like a NIC specific solution, I believe the goal here is to
> allow any device type to implement a partial mediation solution, in
> this case to sufficiently track the device while in the migration
> saving state.


I suspect there's a solution that can work for any device type. E.g. for
virtio, the avail index (head) doesn't belong to any BAR, and the device
may decide to disable the doorbell from the guest. The same goes for
interrupt relay, since the driver may choose to disable interrupts from
the device. In this case, the only way to track dirty pages correctly is
to switch to a software datapath.


>
>>> There's
>>> still no IOMMU Dirty bit available.
>>>>>>> (3) centralizing
>>>>>>> VF critical states retrieving and VF controls into one driver, we propose
>>>>>>> to introduce mediate ops on top of current vfio-pci device driver.
>>>>>>>
>>>>>>>
>>>>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>>>>>> __________ register mediate ops| ___________ ___________ |
>>>>>>> | |<-----------------------| VF | | |
>>>>>>> | vfio-pci | | | mediate | | PF driver | |
>>>>>>> |__________|----------------------->| driver | |___________|
>>>>>>> | open(pdev) | ----------- | |
>>>>>>> | |
>>>>>>> | |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
>>>>>>> \|/ \|/
>>>>>>> ----------- ------------
>>>>>>> | VF | | PF |
>>>>>>> ----------- ------------
>>>>>>>
>>>>>>>
>>>>>>> VF mediate driver could be a standalone driver that does not bind to
>>>>>>> any devices (as in demo code in patches 5-6) or it could be a built-in
>>>>>>> extension of PF driver (as in patches 7-9) .
>>>>>>>
>>>>>>> Rather than directly bind to VF, VF mediate driver register a mediate
>>>>>>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
>>>>>>> mediate ops.
>>>>>>> (Note that: VF mediate driver can register mediate ops into vfio-pci
>>>>>>> before vfio-pci binding to any devices. And VF mediate driver can
>>>>>>> support mediating multiple devices.)
>>>>>>>
>>>>>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
>>>>>>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
>>>>>>> device as a parameter.
>>>>>>> VF mediate driver should return success or failure depending on it
>>>>>>> supports the pdev or not.
>>>>>>> E.g. VF mediate driver would compare its supported VF devfn with the
>>>>>>> devfn of the passed-in pdev.
>>>>>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
>>>>>>> stop querying other mediate ops and bind the opening device with this
>>>>>>> mediate ops using the returned mediate handle.
>>>>>>>
>>>>>>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
>>>>>>> VF will be intercepted into VF mediate driver as
>>>>>>> vfio_pci_mediate_ops->get_region_info(),
>>>>>>> vfio_pci_mediate_ops->rw,
>>>>>>> vfio_pci_mediate_ops->mmap, and get customized.
>>>>>>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
>>>>>>> further return 'pt' to indicate whether vfio-pci should further
>>>>>>> passthrough data to hw.
>>>>>>>
>>>>>>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
>>>>>>> with a mediate handle as parameter.
>>>>>>>
>>>>>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
>>>>>>> mediate driver be able to differentiate two opening VFs of the same device
>>>>>>> id and vendor id.
>>>>>>>
>>>>>>> When VF mediate driver exits, it unregisters its mediate ops from
>>>>>>> vfio-pci.
>>>>>>>
>>>>>>>
>>>>>>> In this patchset, we enable vfio-pci to provide 3 things:
>>>>>>> (1) calling mediate ops to allow vendor driver customizing default
>>>>>>> region info/rw/mmap of a region.
>>>>>>> (2) provide a migration region to support migration
>>>>>> What's the benefit of introducing a region? It looks to me we don't expect
>>>>>> the region to be accessed directly from guest. Could we simply extend device
>>>>>> fd ioctl for doing such things?
>>>>>>
>>>>> You may take a look on mdev live migration discussions in
>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>>>>>
>>>>> or previous discussion at
>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
>>>>> which has kernel side implemetation https://patchwork.freedesktop.org/series/56876/
>>>>>
>>>>> generaly speaking, qemu part of live migration is consistent for
>>>>> vfio-pci + mediate ops way or mdev way.
>>>> So in mdev, do you still have a mediate driver? Or you expect the parent
>>>> to implement the region?
>>>>
>>> No, currently it's only for vfio-pci.
>> And specific to PCI.
> What's PCI specific? The implementation, yes, it's done in the bus
> vfio bus driver here but all device access is performed by the bus
> driver. I'm not sure how we could introduce the intercept at the
> vfio-core level, but I'm open to suggestions.


I haven't thought about this too much, but if we can intercept at the
core level, it can basically do what mdev can do right now.


>
>>> mdev parent driver is free to customize its regions and hence does not
>>> requires this mediate ops hooks.
>>>
>>>>> The region is only a channel for
>>>>> QEMU and kernel to communicate information without introducing IOCTLs.
>>>> Well, at least you introduce new type of region in uapi. So this does
>>>> not answer why region is better than ioctl. If the region will only be
>>>> used by qemu, using ioctl is much more easier and straightforward.
>>>>
>>> It's not introduced by me :)
>>> mdev live migration is actually using this way, I'm just keeping
>>> compatible to the uapi.
>>
>> I meant e.g VFIO_REGION_TYPE_MIGRATION.
>>
>>
>>> From my own perspective, my answer is that a region is more flexible
>>> compared to ioctl. vendor driver can freely define the size,
>>>
>> Probably not since it's an ABI I think.
> I think Kirti's thread proposing the migration interface is a better
> place for this discussion, I believe Yan has already linked to it. In
> general we prefer to be frugal in our introduction of new ioctls,
> especially when we have existing mechanisms via regions to support the
> interactions. The interface is designed to be flexible to the vendor
> driver needs, partially thanks to it being a region.
>
>>> mmap cap of
>>> its data subregion.
>>>
>> It doesn't help much unless it can be mapped into guest (which I don't
>> think it was the case here).
>> /
>>> Also, there're already too many ioctls in vfio.
>> Probably not :) We had a brunch of  subsystems that have much more
>> ioctls than VFIO. (e.g DRM)
> And this is a good thing?


Well, I just meant that "having too many ioctls already" is not a good
reason for not introducing new ones.


> We can more easily deprecate and revise
> region support than we can take back ioctls that have been previously
> used.


It belongs to the uapi; how easily can we deprecate that?


> I generally don't like the "let's create a new ioctl for that"
> approach versus trying to fit something within the existing
> architecture and convention.
>
>>>>>>> (3) provide a dynamic trap bar info region to allow vendor driver
>>>>>>> control trap/untrap of device pci bars
>>>>>>>
>>>>>>> This vfio-pci + mediate ops way differs from mdev way in that
>>>>>>> (1) medv way needs to create a 1:1 mdev device on top of one VF, device
>>>>>>> specific mdev parent driver is bound to VF directly.
>>>>>>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
>>>>>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
>>>>>>>
>>>>>>> The reason why we don't choose the way of writing mdev parent driver is
>>>>>>> that
>>>>>>> (1) VFs are almost all the time directly passthroughed. Directly binding
>>>>>>> to vfio-pci can make most of the code shared/reused.
>>>>>> Can we split out the common parts from vfio-pci?
>>>>>>
>>>>> That's very attractive. but one cannot implement a vfio-pci except
>>>>> export everything in it as common part :)
>>>> Well, I think there should be not hard to do that. E..g you can route it
>>>> back to like:
>>>>
>>>> vfio -> vfio_mdev -> parent -> vfio_pci
>>>>
>>> it's desired for us to have mediate driver binding to PF device.
>>> so once a VF device is created, only PF driver and vfio-pci are
>>> required. Just the same as what needs to be done for a normal VF passthrough.
>>> otherwise, a separate parent driver binding to VF is required.
>>> Also, this parent driver has many drawbacks as I mentions in this
>>> cover-letter.
>> Well, as discussed, no need to duplicate the code, bar trick should
>> still work. The main issues I saw with this proposal is:
>>
>> 1) PCI specific, other bus may need something similar
> Propose how it could be implemented higher in the vfio stack to make it
> device agnostic.


E.g. doing it in vfio_device_fops instead of vfio_pci_ops?
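
Just to make that concrete, a rough sketch of how the same hooks could be
expressed bus-agnostically at the vfio core (mirroring the open/release
and rw/mmap-with-'pt' callbacks described in the cover letter, but taking
a plain struct device). All names below are invented for illustration and
are not part of this series:

/* illustrative only: a bus-agnostic flavour of the mediate ops idea,
 * registered with the vfio core rather than with vfio-pci
 */
struct vfio_mediate_ops {
        /* return a per-device handle, or an error if dev is unsupported */
        void *(*open)(struct device *dev);
        void (*release)(void *handle);

        int (*get_region_info)(void *handle, struct vfio_region_info *info);

        /* '*pt' tells the core whether to pass the access through to hw */
        ssize_t (*rw)(void *handle, char __user *buf, size_t count,
                      loff_t *ppos, bool iswrite, bool *pt);
        int (*mmap)(void *handle, struct vm_area_struct *vma, bool *pt);
};

int vfio_register_mediate_ops(struct vfio_mediate_ops *ops);
void vfio_unregister_mediate_ops(struct vfio_mediate_ops *ops);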


>
>> 2) Function duplicated with mdev and mdev can do even more
> mdev also comes with a device lifecycle interface that doesn't really
> make sense when a driver is only trying to partially mediate a single
> physical device rather than multiplex a physical device into virtual
> devices.


Yes, but that part could be decoupled out of mdev.


> mdev would also require vendor drivers to re-implement
> much of vfio-pci for the direct access mechanisms. Also, do we really
> want users or management tools to decide between binding a device to
> vfio-pci or a separate mdev driver to get this functionality. We've
> already been burnt trying to use mdev beyond its scope.


The problem is, if we had a device that supports both SRIOV and mdev,
does this mean we need to prepare two sets of drivers?


>
>>>>>>> If we write a
>>>>>>> vendor specific mdev parent driver, most of the code (like passthrough
>>>>>>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
>>>>>>> actually a duplicated and tedious work.
>>>>>> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
>>>>>> me we need to consider live migration for mdev as well. In that case, do we
>>>>>> still expect mediate ops through VFIO directly?
>>>>>>
>>>>>>
>>>>>>> (2) For features like dynamically trap/untrap pci bars, if they are in
>>>>>>> vfio-pci, they can be available to most people without repeated code
>>>>>>> copying and re-testing.
>>>>>>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
>>>>>>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
>>>>>>> it runs into a real migration need. However, if vfio-pci is bound
>>>>>>> initially, they have no chance to do live migration when there's a need
>>>>>>> later.
>>>>>> We can teach management layer to do this.
>>>>>>
>>>>> No. not possible as vfio-pci by default has no migration region and
>>>>> dirty page tracking needs vendor's mediation at least for most
>>>>> passthrough devices now.
>>>> I'm not quite sure I get here but in this case, just tech them to use
>>>> the driver that has migration support?
>>>>
>>> That's a way, but as more and more passthrough devices have demands and
>>> caps to do migration, will vfio-pci be used in future any more ?
>>
>> This should not be a problem:
>> - If we introduce a common mdev for vfio-pci, we can just bind that
>> driver always
> There's too much of mdev that doesn't make sense for this usage model,
> this is why Yi's proposed generic mdev PCI wrapper is only a sample
> driver. I think we do not want to introduce user confusion regarding
> which driver to use and there are outstanding non-singleton group
> issues with mdev that don't seem worthwhile to resolve.


I agree, but I think what users want is a unified driver that works for
both SRIOV and mdev. That's why trying to have a common way of doing
mediation may make sense.

Thanks


>
>> - The most straightforward way to support dirty page tracking is done by
>> IOMMU instead of device specific operations.
> Of course, but it doesn't exist yet. We're attempting to design the
> dirty page tracking in a way that's mostly transparent for current mdev
> drivers, would provide generic support for IOMMU-based dirty tracking,
> and extensible to the inevitability of vendor driver tracking. Thanks,
>
> Alex

2019-12-12 05:56:45

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

On Thu, Dec 12, 2019 at 11:48:25AM +0800, Jason Wang wrote:
>
> On 2019/12/6 下午8:49, Yan Zhao wrote:
> > On Fri, Dec 06, 2019 at 05:40:02PM +0800, Jason Wang wrote:
> >> On 2019/12/6 下午4:22, Yan Zhao wrote:
> >>> On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
> >>>> On 2019/12/5 下午4:51, Yan Zhao wrote:
> >>>>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
> >>>>>> Hi:
> >>>>>>
> >>>>>> On 2019/12/5 上午11:24, Yan Zhao wrote:
> >>>>>>> For SRIOV devices, VFs are passthroughed into guest directly without host
> >>>>>>> driver mediation. However, when VMs migrating with passthroughed VFs,
> >>>>>>> dynamic host mediation is required to (1) get device states, (2) get
> >>>>>>> dirty pages. Since device states as well as other critical information
> >>>>>>> required for dirty page tracking for VFs are usually retrieved from PFs,
> >>>>>>> it is handy to provide an extension in PF driver to centralizingly control
> >>>>>>> VFs' migration.
> >>>>>>>
> >>>>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> >>>>>>> dynamically trap VFs' bars for dirty page tracking and
> >>>>>> A silly question, what's the reason for doing this, is this a must for dirty
> >>>>>> page tracking?
> >>>>>>
> >>>>> For performance consideration. VFs' bars should be passthoughed at
> >>>>> normal time and only enter into trap state on need.
> >>>> Right, but how does this matter for the case of dirty page tracking?
> >>>>
> >>> Take NIC as an example, to trap its VF dirty pages, software way is
> >>> required to trap every write of ring tail that resides in BAR0.
> >>
> >> Interesting, but it looks like we need:
> >> - decode the instruction
> >> - mediate all access to BAR0
> >> All of which seems a great burden for the VF driver. I wonder whether or
> >> not doing interrupt relay and tracking head is better in this case.
> >>
> > hi Jason
> >
> > not familiar with the way you mentioned. could you elaborate more?
>
>
> It looks to me that you want to intercept the bar that contains the
> head. Then you can figure out the buffers submitted from driver and you
> still need to decide a proper time to mark them as dirty.
>
It doesn't need to be accurate, right? Just a superset of the real dirty
bitmap is enough.

> What I meant is, intercept the interrupt, then you can figure still
> figure out the buffers which has been modified by the device and make
> them as dirty.
>
> Then there's no need to trap BAR and do decoding/emulation etc.
>
> But it will still be tricky to be correct...
>
Intercepting the interrupt is a little hard if posted interrupts are
enabled. I think what you're worried about here is the timing of marking
dirty pages, right? Upon receiving an interrupt, you regard the DMAs as
finished and safe to mark dirty.
But with the BAR trap way, we can at least keep those dirtied pages
marked dirty until the device stops. Of course we have other methods to
optimize it.
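
To show roughly what the trapped path does, here is a heavily simplified
sketch with invented names and structures (not the real i40e code, which
is not part of this posting): on each trapped write of the ring tail,
walk the newly posted descriptors, mark the pages behind their buffers
dirty, then forward the write to the device so the datapath keeps
running.

/* simplified sketch, invented names; vf_iova_to_pfn() stands in for the
 * vendor-specific IOVA-to-guest-pfn translation
 */
struct vf_dirty_track {
        void __iomem *bar0;             /* device's real BAR0 mapping */
        u32 tail_off;                   /* tail register offset in BAR0 */
        u32 last_tail, ring_size;
        struct { u64 buf_iova; u32 buf_len; } *desc;  /* shadow ring */
        unsigned long *dirty;           /* bitmap, one bit per pfn */
};

static void vf_trap_tail_write(struct vf_dirty_track *t, u32 new_tail)
{
        u32 i = t->last_tail;

        while (i != new_tail) {
                unsigned long pfn = vf_iova_to_pfn(t->desc[i].buf_iova);
                unsigned long npfns = DIV_ROUND_UP(t->desc[i].buf_len,
                                                   PAGE_SIZE);

                /* a superset of the real dirty set is acceptable */
                bitmap_set(t->dirty, pfn, npfns);
                i = (i + 1) % t->ring_size;
        }
        t->last_tail = new_tail;

        /* pass the tail write through so the device keeps running */
        writel(new_tail, t->bar0 + t->tail_off);
}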

>
> >>> There's
> >>> still no IOMMU Dirty bit available.
> >>>>>>> (3) centralizing
> >>>>>>> VF critical states retrieving and VF controls into one driver, we propose
> >>>>>>> to introduce mediate ops on top of current vfio-pci device driver.
> >>>>>>>
> >>>>>>>
> >>>>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>>>>>> __________ register mediate ops| ___________ ___________ |
> >>>>>>> | |<-----------------------| VF | | |
> >>>>>>> | vfio-pci | | | mediate | | PF driver | |
> >>>>>>> |__________|----------------------->| driver | |___________|
> >>>>>>> | open(pdev) | ----------- | |
> >>>>>>> | |
> >>>>>>> | |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
> >>>>>>> \|/ \|/
> >>>>>>> ----------- ------------
> >>>>>>> | VF | | PF |
> >>>>>>> ----------- ------------
> >>>>>>>
> >>>>>>>
> >>>>>>> VF mediate driver could be a standalone driver that does not bind to
> >>>>>>> any devices (as in demo code in patches 5-6) or it could be a built-in
> >>>>>>> extension of PF driver (as in patches 7-9) .
> >>>>>>>
> >>>>>>> Rather than directly bind to VF, VF mediate driver register a mediate
> >>>>>>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
> >>>>>>> mediate ops.
> >>>>>>> (Note that: VF mediate driver can register mediate ops into vfio-pci
> >>>>>>> before vfio-pci binding to any devices. And VF mediate driver can
> >>>>>>> support mediating multiple devices.)
> >>>>>>>
> >>>>>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
> >>>>>>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
> >>>>>>> device as a parameter.
> >>>>>>> VF mediate driver should return success or failure depending on it
> >>>>>>> supports the pdev or not.
> >>>>>>> E.g. VF mediate driver would compare its supported VF devfn with the
> >>>>>>> devfn of the passed-in pdev.
> >>>>>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
> >>>>>>> stop querying other mediate ops and bind the opening device with this
> >>>>>>> mediate ops using the returned mediate handle.
> >>>>>>>
> >>>>>>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
> >>>>>>> VF will be intercepted into VF mediate driver as
> >>>>>>> vfio_pci_mediate_ops->get_region_info(),
> >>>>>>> vfio_pci_mediate_ops->rw,
> >>>>>>> vfio_pci_mediate_ops->mmap, and get customized.
> >>>>>>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
> >>>>>>> further return 'pt' to indicate whether vfio-pci should further
> >>>>>>> passthrough data to hw.
> >>>>>>>
> >>>>>>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
> >>>>>>> with a mediate handle as parameter.
> >>>>>>>
> >>>>>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
> >>>>>>> mediate driver be able to differentiate two opening VFs of the same device
> >>>>>>> id and vendor id.
> >>>>>>>
> >>>>>>> When VF mediate driver exits, it unregisters its mediate ops from
> >>>>>>> vfio-pci.
> >>>>>>>
> >>>>>>>
> >>>>>>> In this patchset, we enable vfio-pci to provide 3 things:
> >>>>>>> (1) calling mediate ops to allow vendor driver customizing default
> >>>>>>> region info/rw/mmap of a region.
> >>>>>>> (2) provide a migration region to support migration
> >>>>>> What's the benefit of introducing a region? It looks to me we don't expect
> >>>>>> the region to be accessed directly from guest. Could we simply extend device
> >>>>>> fd ioctl for doing such things?
> >>>>>>
> >>>>> You may take a look on mdev live migration discussions in
> >>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
> >>>>>
> >>>>> or previous discussion at
> >>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
> >>>>> which has kernel side implemetation https://patchwork.freedesktop.org/series/56876/
> >>>>>
> >>>>> generaly speaking, qemu part of live migration is consistent for
> >>>>> vfio-pci + mediate ops way or mdev way.
> >>>> So in mdev, do you still have a mediate driver? Or you expect the parent
> >>>> to implement the region?
> >>>>
> >>> No, currently it's only for vfio-pci.
> >> And specific to PCI.
> >>
> >>> mdev parent driver is free to customize its regions and hence does not
> >>> requires this mediate ops hooks.
> >>>
> >>>>> The region is only a channel for
> >>>>> QEMU and kernel to communicate information without introducing IOCTLs.
> >>>> Well, at least you introduce new type of region in uapi. So this does
> >>>> not answer why region is better than ioctl. If the region will only be
> >>>> used by qemu, using ioctl is much more easier and straightforward.
> >>>>
> >>> It's not introduced by me :)
> >>> mdev live migration is actually using this way, I'm just keeping
> >>> compatible to the uapi.
> >>
> >> I meant e.g VFIO_REGION_TYPE_MIGRATION.
> >>
> > here's the history of vfio live migration:
> > https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg05564.html
> > https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html
> > https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
> >
> > If you have any concern of this region way, feel free to comment to the
> > latest v9 patchset:
> > https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
> >
> > The patchset here will always keep compatible to there.
>
>
> Sure.
>
>
> >>> From my own perspective, my answer is that a region is more flexible
> >>> compared to ioctl. vendor driver can freely define the size,
> >>>
> >> Probably not since it's an ABI I think.
> >>
> > that's why I need to define VFIO_REGION_TYPE_MIGRATION here in this
> > patchset, as it's not upstreamed yet.
> > maybe I should make it into a prerequisite patch, indicating it is not
> > introduced by this patchset
>
>
> Yes.
>
>
> >
> >>> mmap cap of
> >>> its data subregion.
> >>>
> >> It doesn't help much unless it can be mapped into guest (which I don't
> >> think it was the case here).
> >>
> > it's access by host qemu, the same as how linux app access an mmaped
> > memory. the mmap here is to reduce memory copy from kernel to user.
> > No need to get mapped into guest.
>
>
> But copy_to_user() is not a bad choice. If I read the code correctly
> only the dirty bitmap was mmaped. This means you probably need to deal
> with dcache carefully on some archs. [1]
>
> Note KVM doesn't use shared dirty bitmap, it uses copy_to_user().
>
> [1] https://lkml.org/lkml/2019/4/9/5
>
On those platforms, mmap can be safely disabled by the vendor driver at
will. Also, when mmap is disabled, copy_to_user() is used in the region
way too (see the sketch below).
Anyway, please raise your concern in Kirti's thread for this common part.
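
For reference, the non-mmap path is just the normal region rw. A rough
sketch (names invented, not from this series) of what the vendor driver
does for reads of the dirty-bitmap area when mmap is off:

/* rough sketch, invented names: without mmap, userspace reads the
 * dirty bitmap through the region rw path and we copy_to_user() it
 */
struct vf_migration {
        void *dirty_bitmap;     /* kernel buffer holding the bitmap */
        size_t bitmap_size;
};

static ssize_t mig_bitmap_read(struct vf_migration *mig,
                               char __user *buf, size_t count, loff_t pos)
{
        /* pos is the offset within the dirty-bitmap area, already
         * derived from the region offset by the caller
         */
        if (pos + count > mig->bitmap_size)
                return -EINVAL;
        if (copy_to_user(buf, (u8 *)mig->dirty_bitmap + pos, count))
                return -EFAULT;
        return count;
}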

>
> >
> >>> Also, there're already too many ioctls in vfio.
> >> Probably not :) We had a brunch of  subsystems that have much more
> >> ioctls than VFIO. (e.g DRM)
> >>
> >>>>>>> (3) provide a dynamic trap bar info region to allow vendor driver
> >>>>>>> control trap/untrap of device pci bars
> >>>>>>>
> >>>>>>> This vfio-pci + mediate ops way differs from mdev way in that
> >>>>>>> (1) medv way needs to create a 1:1 mdev device on top of one VF, device
> >>>>>>> specific mdev parent driver is bound to VF directly.
> >>>>>>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
> >>>>>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
> >>>>>>>
> >>>>>>> The reason why we don't choose the way of writing mdev parent driver is
> >>>>>>> that
> >>>>>>> (1) VFs are almost all the time directly passthroughed. Directly binding
> >>>>>>> to vfio-pci can make most of the code shared/reused.
> >>>>>> Can we split out the common parts from vfio-pci?
> >>>>>>
> >>>>> That's very attractive. but one cannot implement a vfio-pci except
> >>>>> export everything in it as common part :)
> >>>> Well, I think there should be not hard to do that. E..g you can route it
> >>>> back to like:
> >>>>
> >>>> vfio -> vfio_mdev -> parent -> vfio_pci
> >>>>
> >>> it's desired for us to have mediate driver binding to PF device.
> >>> so once a VF device is created, only PF driver and vfio-pci are
> >>> required. Just the same as what needs to be done for a normal VF passthrough.
> >>> otherwise, a separate parent driver binding to VF is required.
> >>> Also, this parent driver has many drawbacks as I mentions in this
> >>> cover-letter.
> >> Well, as discussed, no need to duplicate the code, bar trick should
> >> still work. The main issues I saw with this proposal is:
> >>
> >> 1) PCI specific, other bus may need something similar
> > vfio-pci is only for PCI of course.
>
>
> I meant if what propose here makes sense, other bus driver like
> vfio-platform may want something similar.
>
Sure, they can follow.
>
> >
> >> 2) Function duplicated with mdev and mdev can do even more
> >>
> > could you elaborate how mdev can do solve the above saying problem ?
>
>
> Well, I think both of us agree the mdev can do what mediate ops did,
> mdev device implementation just need to add the direct PCI access part.
>
>
> >>>>>>> If we write a
> >>>>>>> vendor specific mdev parent driver, most of the code (like passthrough
> >>>>>>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
> >>>>>>> actually a duplicated and tedious work.
> >>>>>> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
> >>>>>> me we need to consider live migration for mdev as well. In that case, do we
> >>>>>> still expect mediate ops through VFIO directly?
> >>>>>>
> >>>>>>
> >>>>>>> (2) For features like dynamically trap/untrap pci bars, if they are in
> >>>>>>> vfio-pci, they can be available to most people without repeated code
> >>>>>>> copying and re-testing.
> >>>>>>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
> >>>>>>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
> >>>>>>> it runs into a real migration need. However, if vfio-pci is bound
> >>>>>>> initially, they have no chance to do live migration when there's a need
> >>>>>>> later.
> >>>>>> We can teach management layer to do this.
> >>>>>>
> >>>>> No. not possible as vfio-pci by default has no migration region and
> >>>>> dirty page tracking needs vendor's mediation at least for most
> >>>>> passthrough devices now.
> >>>> I'm not quite sure I get here but in this case, just tech them to use
> >>>> the driver that has migration support?
> >>>>
> >>> That's a way, but as more and more passthrough devices have demands and
> >>> caps to do migration, will vfio-pci be used in future any more ?
> >>
> >> This should not be a problem:
> >> - If we introduce a common mdev for vfio-pci, we can just bind that
> >> driver always
> > what is common mdev for vfio-pci? a common mdev parent driver that have
> > the same implementation as vfio-pci?
>
>
> The common part is not PCI of course. The common part is the both mdev
> and mediate ops want to do some kind of mediation. Mdev is bus agnostic,
> but what you propose here is PCI specific but should be bus agnostic as
> well. Assume we implement a bug agnostic mediate ops, mdev could be even
> built on top.
>
I believe Alex has already replied to the above better than I could.
>
> >
> > There's actually already a solution of creating only one mdev on top
> > of each passthrough device, and make mdev share the same iommu group
> > with it. We've also made an implementation on it already. here's a
> > sample one made by Yi at https://patchwork.kernel.org/cover/11134695/.
> >
> > But, as I said, it's desired to re-use vfio-pci directly for SRIOV,
> > which is straghtforward :)
>
>
> Can we have a device that is capable of both SRIOV and function slicing?
> If yes, does it mean you need to provides two drivers? One for mdev,
> another for mediate ops?
>
What do you mean by "function slicing"? SIOV?
For the vendor driver, in SRIOV:
- with the mdev approach, two drivers are required: an mdev parent
driver on the VF, plus the PF driver.
- with mediate ops + vfio-pci: one driver on the PF.

In SIOV, only one driver on the PF is needed in either case.


> >
> >> - The most straightforward way to support dirty page tracking is done by
> >> IOMMU instead of device specific operations.
> >>
> > No such IOMMU yet. And all kinds of platforms should be cared, right?
>
>
> Or the device can track dirty pages by itself, otherwise it would be
> very hard to implement dirty page tracking correctly without the help of
> switching to software datapath (or maybe you can post the part of BAR0

I think you mixed up "correct" and "accurate".
DMA pre-inspection is a long-existing term, and we have implemented and
verified it in NICs for both the precopy and postcopy cases. Though I
can't promise it is 100% bug-free, the method is right.

Also, whether to trap BARs for dirty page tracking is vendor specific
and is not something this interface part should care about.

> mediation and dirty page tracking which is missed in this series?)
>

Currently, that part of the code is owned by Shaopeng's team. The code I
posted only demonstrates how to use the interface. Shaopeng's team is
responsible for upstreaming their part on their own schedule.

Thanks
Yan

> >>>>>>
> >>>>>>> In this patchset,
> >>>>>>> - patches 1-4 enable vfio-pci to call mediate ops registered by vendor
> >>>>>>> driver to mediate/customize region info/rw/mmap.
> >>>>>>>
> >>>>>>> - patches 5-6 provide a standalone sample driver to register a mediate ops
> >>>>>>> for Intel Graphics Devices. It does not bind to IGDs directly but decides
> >>>>>>> what devices it supports via its pciidlist. It also demonstrates how to
> >>>>>>> dynamic trap a device's PCI bars. (by adding more pciids in its
> >>>>>>> pciidlist, this sample driver actually is not necessarily limited to
> >>>>>>> support IGDs)
> >>>>>>>
> >>>>>>> - patch 7-9 provide a sample on i40e driver that supports Intel(R)
> >>>>>>> Ethernet Controller XL710 Family of devices. It supports VF precopy live
> >>>>>>> migration on Intel's 710 SRIOV. (but we commented out the real
> >>>>>>> implementation of dirty page tracking and device state retrieving part
> >>>>>>> to focus on demonstrating framework part. Will send out them in future
> >>>>>>> versions)
> >>>>>>> patch 7 registers/unregisters VF mediate ops when PF driver
> >>>>>>> probes/removes. It specifies its supporting VFs via
> >>>>>>> vfio_pci_mediate_ops->open(pdev)
> >>>>>>>
> >>>>>>> patch 8 reports device cap of VFIO_PCI_DEVICE_CAP_MIGRATION and
> >>>>>>> provides a sample implementation of migration region.
> >>>>>>> The QEMU part of vfio migration is based on v8
> >>>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
> >>>>>>> We do not based on recent v9 because we think there are still opens in
> >>>>>>> dirty page track part in that series.
> >>>>>>>
> >>>>>>> patch 9 reports device cap of VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
> >>>>>>> provides an example on how to trap part of bar0 when migration starts
> >>>>>>> and passthrough this part of bar0 again when migration fails.
> >>>>>>>
> >>>>>>> Yan Zhao (9):
> >>>>>>> vfio/pci: introduce mediate ops to intercept vfio-pci ops
> >>>>>>> vfio/pci: test existence before calling region->ops
> >>>>>>> vfio/pci: register a default migration region
> >>>>>>> vfio-pci: register default dynamic-trap-bar-info region
> >>>>>>> samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
> >>>>>>> sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
> >>>>>>> i40e/vf_migration: register mediate_ops to vfio-pci
> >>>>>>> i40e/vf_migration: mediate migration region
> >>>>>>> i40e/vf_migration: support dynamic trap of bar0
> >>>>>>>
> >>>>>>> drivers/net/ethernet/intel/Kconfig | 2 +-
> >>>>>>> drivers/net/ethernet/intel/i40e/Makefile | 3 +-
> >>>>>>> drivers/net/ethernet/intel/i40e/i40e.h | 2 +
> >>>>>>> drivers/net/ethernet/intel/i40e/i40e_main.c | 3 +
> >>>>>>> .../ethernet/intel/i40e/i40e_vf_migration.c | 626 ++++++++++++++++++
> >>>>>>> .../ethernet/intel/i40e/i40e_vf_migration.h | 78 +++
> >>>>>>> drivers/vfio/pci/vfio_pci.c | 189 +++++-
> >>>>>>> drivers/vfio/pci/vfio_pci_private.h | 2 +
> >>>>>>> include/linux/vfio.h | 18 +
> >>>>>>> include/uapi/linux/vfio.h | 160 +++++
> >>>>>>> samples/Kconfig | 6 +
> >>>>>>> samples/Makefile | 1 +
> >>>>>>> samples/vfio-pci/Makefile | 2 +
> >>>>>>> samples/vfio-pci/igd_dt.c | 367 ++++++++++
> >>>>>>> 14 files changed, 1455 insertions(+), 4 deletions(-)
> >>>>>>> create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
> >>>>>>> create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
> >>>>>>> create mode 100644 samples/vfio-pci/Makefile
> >>>>>>> create mode 100644 samples/vfio-pci/igd_dt.c
> >>>>>>>
>

2019-12-12 18:41:36

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

On Thu, 12 Dec 2019 12:09:48 +0800
Jason Wang <[email protected]> wrote:

> On 2019/12/7 上午1:42, Alex Williamson wrote:
> > On Fri, 6 Dec 2019 17:40:02 +0800
> > Jason Wang <[email protected]> wrote:
> >
> >> On 2019/12/6 下午4:22, Yan Zhao wrote:
> >>> On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
> >>>> On 2019/12/5 下午4:51, Yan Zhao wrote:
> >>>>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
> >>>>>> Hi:
> >>>>>>
> >>>>>> On 2019/12/5 上午11:24, Yan Zhao wrote:
> >>>>>>> For SRIOV devices, VFs are passthroughed into guest directly without host
> >>>>>>> driver mediation. However, when VMs migrating with passthroughed VFs,
> >>>>>>> dynamic host mediation is required to (1) get device states, (2) get
> >>>>>>> dirty pages. Since device states as well as other critical information
> >>>>>>> required for dirty page tracking for VFs are usually retrieved from PFs,
> >>>>>>> it is handy to provide an extension in PF driver to centralizingly control
> >>>>>>> VFs' migration.
> >>>>>>>
> >>>>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> >>>>>>> dynamically trap VFs' bars for dirty page tracking and
> >>>>>> A silly question, what's the reason for doing this, is this a must for dirty
> >>>>>> page tracking?
> >>>>>>
> >>>>> For performance consideration. VFs' bars should be passthoughed at
> >>>>> normal time and only enter into trap state on need.
> >>>> Right, but how does this matter for the case of dirty page tracking?
> >>>>
> >>> Take NIC as an example, to trap its VF dirty pages, software way is
> >>> required to trap every write of ring tail that resides in BAR0.
> >>
> >> Interesting, but it looks like we need:
> >> - decode the instruction
> >> - mediate all access to BAR0
> >> All of which seems a great burden for the VF driver. I wonder whether or
> >> not doing interrupt relay and tracking head is better in this case.
> > This sounds like a NIC specific solution, I believe the goal here is to
> > allow any device type to implement a partial mediation solution, in
> > this case to sufficiently track the device while in the migration
> > saving state.
>
>
> I suspect there's a solution that can work for any device type. E.g for
> virtio, avail index (head) doesn't belongs to any BAR and device may
> decide to disable doorbell from guest. So did interrupt relay since
> driver may choose to disable interrupt from device. In this case, the
> only way to track dirty pages correctly is to switch to software datapath.
>
>
> >
> >>> There's
> >>> still no IOMMU Dirty bit available.
> >>>>>>> (3) centralizing
> >>>>>>> VF critical states retrieving and VF controls into one driver, we propose
> >>>>>>> to introduce mediate ops on top of current vfio-pci device driver.
> >>>>>>>
> >>>>>>>
> >>>>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>>>>>> __________ register mediate ops| ___________ ___________ |
> >>>>>>> | |<-----------------------| VF | | |
> >>>>>>> | vfio-pci | | | mediate | | PF driver | |
> >>>>>>> |__________|----------------------->| driver | |___________|
> >>>>>>> | open(pdev) | ----------- | |
> >>>>>>> | |
> >>>>>>> | |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
> >>>>>>> \|/ \|/
> >>>>>>> ----------- ------------
> >>>>>>> | VF | | PF |
> >>>>>>> ----------- ------------
> >>>>>>>
> >>>>>>>
> >>>>>>> VF mediate driver could be a standalone driver that does not bind to
> >>>>>>> any devices (as in demo code in patches 5-6) or it could be a built-in
> >>>>>>> extension of PF driver (as in patches 7-9) .
> >>>>>>>
> >>>>>>> Rather than directly bind to VF, VF mediate driver register a mediate
> >>>>>>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
> >>>>>>> mediate ops.
> >>>>>>> (Note that: VF mediate driver can register mediate ops into vfio-pci
> >>>>>>> before vfio-pci binding to any devices. And VF mediate driver can
> >>>>>>> support mediating multiple devices.)
> >>>>>>>
> >>>>>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
> >>>>>>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
> >>>>>>> device as a parameter.
> >>>>>>> VF mediate driver should return success or failure depending on it
> >>>>>>> supports the pdev or not.
> >>>>>>> E.g. VF mediate driver would compare its supported VF devfn with the
> >>>>>>> devfn of the passed-in pdev.
> >>>>>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
> >>>>>>> stop querying other mediate ops and bind the opening device with this
> >>>>>>> mediate ops using the returned mediate handle.
> >>>>>>>
> >>>>>>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
> >>>>>>> VF will be intercepted into VF mediate driver as
> >>>>>>> vfio_pci_mediate_ops->get_region_info(),
> >>>>>>> vfio_pci_mediate_ops->rw,
> >>>>>>> vfio_pci_mediate_ops->mmap, and get customized.
> >>>>>>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
> >>>>>>> further return 'pt' to indicate whether vfio-pci should further
> >>>>>>> passthrough data to hw.
> >>>>>>>
> >>>>>>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
> >>>>>>> with a mediate handle as parameter.
> >>>>>>>
> >>>>>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
> >>>>>>> mediate driver be able to differentiate two opening VFs of the same device
> >>>>>>> id and vendor id.
> >>>>>>>
> >>>>>>> When VF mediate driver exits, it unregisters its mediate ops from
> >>>>>>> vfio-pci.
> >>>>>>>
> >>>>>>>
> >>>>>>> In this patchset, we enable vfio-pci to provide 3 things:
> >>>>>>> (1) calling mediate ops to allow vendor driver customizing default
> >>>>>>> region info/rw/mmap of a region.
> >>>>>>> (2) provide a migration region to support migration
> >>>>>> What's the benefit of introducing a region? It looks to me we don't expect
> >>>>>> the region to be accessed directly from guest. Could we simply extend device
> >>>>>> fd ioctl for doing such things?
> >>>>>>
> >>>>> You may take a look on mdev live migration discussions in
> >>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
> >>>>>
> >>>>> or previous discussion at
> >>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
> >>>>> which has kernel side implemetation https://patchwork.freedesktop.org/series/56876/
> >>>>>
> >>>>> generaly speaking, qemu part of live migration is consistent for
> >>>>> vfio-pci + mediate ops way or mdev way.
> >>>> So in mdev, do you still have a mediate driver? Or you expect the parent
> >>>> to implement the region?
> >>>>
> >>> No, currently it's only for vfio-pci.
> >> And specific to PCI.
> > What's PCI specific? The implementation, yes, it's done in the bus
> > vfio bus driver here but all device access is performed by the bus
> > driver. I'm not sure how we could introduce the intercept at the
> > vfio-core level, but I'm open to suggestions.
>
>
> I haven't thought this too much, but if we can intercept at core level,
> it basically can do what mdev can do right now.

An intercept at the core level is essentially a new vfio bus driver.

> >>> mdev parent driver is free to customize its regions and hence does not
> >>> requires this mediate ops hooks.
> >>>
> >>>>> The region is only a channel for
> >>>>> QEMU and kernel to communicate information without introducing IOCTLs.
> >>>> Well, at least you introduce new type of region in uapi. So this does
> >>>> not answer why region is better than ioctl. If the region will only be
> >>>> used by qemu, using ioctl is much more easier and straightforward.
> >>>>
> >>> It's not introduced by me :)
> >>> mdev live migration is actually using this way, I'm just keeping
> >>> compatible to the uapi.
> >>
> >> I meant e.g VFIO_REGION_TYPE_MIGRATION.
> >>
> >>
> >>> From my own perspective, my answer is that a region is more flexible
> >>> compared to ioctl. vendor driver can freely define the size,
> >>>
> >> Probably not since it's an ABI I think.
> > I think Kirti's thread proposing the migration interface is a better
> > place for this discussion, I believe Yan has already linked to it. In
> > general we prefer to be frugal in our introduction of new ioctls,
> > especially when we have existing mechanisms via regions to support the
> > interactions. The interface is designed to be flexible to the vendor
> > driver needs, partially thanks to it being a region.
> >
> >>> mmap cap of
> >>> its data subregion.
> >>>
> >> It doesn't help much unless it can be mapped into guest (which I don't
> >> think it was the case here).
> >> /
> >>> Also, there're already too many ioctls in vfio.
> >> Probably not :) We had a brunch of  subsystems that have much more
> >> ioctls than VFIO. (e.g DRM)
> > And this is a good thing?
>
>
> Well, I just meant that "having too much ioctls already" is not a good
> reason for not introducing new ones.

Avoiding ioctl proliferation is a reason to require a high bar for any
new ioctl though. Push back on every ioctl and maybe we won't get to
that point.

> > We can more easily deprecate and revise
> > region support than we can take back ioctls that have been previously
> > used.
>
>
> It belongs to uapi, how easily can we deprecate that?

I'm not saying there shouldn't be a deprecation process, but the core
uapi for vfio remains (relatively) unchanged. The user has a protocol
for discovering the features of a device and if we decide we've screwed
up the implementation of the migration_region-v1 we can simply add a
migration_region-v2 and both can be exposed via the same core ioctls
until we decide to no longer expose v1. Our address space of region
types is defined within vfio, not shared with every driver in the
kernel. The "add an ioctl for that" approach is not the model I
advocate.
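
(To illustrate the discovery protocol: userspace already walks regions
with VFIO_DEVICE_GET_REGION_INFO and matches the type/subtype capability,
so a hypothetical -v2 subtype is just another value to match against,
with no new ioctl. Rough userspace sketch; the actual type/subtype
constants are whatever the migration uapi ends up defining:)

/* userspace sketch; needs <linux/vfio.h>, <sys/ioctl.h>, <stdlib.h>.
 * Returns the index of the region whose type/subtype capability
 * matches, or -1 if none does.
 */
static int find_region_by_type(int device_fd, unsigned int num_regions,
                               __u32 type, __u32 subtype)
{
        for (unsigned int i = 0; i < num_regions; i++) {
                struct vfio_region_info probe = {
                        .argsz = sizeof(probe), .index = i };
                struct vfio_region_info *info;
                __u32 off;
                int found = 0;

                if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &probe) ||
                    !(probe.flags & VFIO_REGION_INFO_FLAG_CAPS))
                        continue;

                /* re-query with room for the capability chain */
                info = calloc(1, probe.argsz);
                if (!info)
                        return -1;
                info->argsz = probe.argsz;
                info->index = i;
                if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, info)) {
                        free(info);
                        continue;
                }

                for (off = info->cap_offset; off; ) {
                        struct vfio_info_cap_header *hdr =
                                (void *)((__u8 *)info + off);

                        if (hdr->id == VFIO_REGION_INFO_CAP_TYPE) {
                                struct vfio_region_info_cap_type *ct =
                                        (void *)hdr;

                                if (ct->type == type &&
                                    ct->subtype == subtype)
                                        found = 1;
                        }
                        off = hdr->next;
                }
                free(info);
                if (found)
                        return i;
        }
        return -1;
}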

> > I generally don't like the "let's create a new ioctl for that"
> > approach versus trying to fit something within the existing
> > architecture and convention.
> >
> >>>>>>> (3) provide a dynamic trap bar info region to allow vendor driver
> >>>>>>> control trap/untrap of device pci bars
> >>>>>>>
> >>>>>>> This vfio-pci + mediate ops way differs from mdev way in that
> >>>>>>> (1) medv way needs to create a 1:1 mdev device on top of one VF, device
> >>>>>>> specific mdev parent driver is bound to VF directly.
> >>>>>>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
> >>>>>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
> >>>>>>>
> >>>>>>> The reason why we don't choose the way of writing mdev parent driver is
> >>>>>>> that
> >>>>>>> (1) VFs are almost all the time directly passthroughed. Directly binding
> >>>>>>> to vfio-pci can make most of the code shared/reused.
> >>>>>> Can we split out the common parts from vfio-pci?
> >>>>>>
> >>>>> That's very attractive. but one cannot implement a vfio-pci except
> >>>>> export everything in it as common part :)
> >>>> Well, I think there should be not hard to do that. E..g you can route it
> >>>> back to like:
> >>>>
> >>>> vfio -> vfio_mdev -> parent -> vfio_pci
> >>>>
> >>> it's desired for us to have mediate driver binding to PF device.
> >>> so once a VF device is created, only PF driver and vfio-pci are
> >>> required. Just the same as what needs to be done for a normal VF passthrough.
> >>> otherwise, a separate parent driver binding to VF is required.
> >>> Also, this parent driver has many drawbacks as I mentions in this
> >>> cover-letter.
> >> Well, as discussed, no need to duplicate the code, bar trick should
> >> still work. The main issues I saw with this proposal is:
> >>
> >> 1) PCI specific, other bus may need something similar
> > Propose how it could be implemented higher in the vfio stack to make it
> > device agnostic.
>
>
> E.g doing it in vfio_device_fops instead of vfio_pci_ops?

Which is essentially a new vfio bus driver. This is something vfio has
supported since day one. Issues with doing that here are that it puts
the burden on the mediation/vendor driver to re-implement or re-use a
lot of existing code in vfio-pci, and I think it creates user confusion
around which driver to use for what feature set when using a device
through vfio. You're complaining this series is PCI specific, when
re-using the vfio-pci code is exactly what we're trying to achieve.
Other bus types can do something similar and injecting vendor
specific drivers a layer above the bus driver is already a fundamental
part of the infrastructure.

> >> 2) Function duplicated with mdev and mdev can do even more
> > mdev also comes with a device lifecycle interface that doesn't really
> > make sense when a driver is only trying to partially mediate a single
> > physical device rather than multiplex a physical device into virtual
> > devices.
>
>
> Yes, but that part could be decoupled out of mdev.

There would be nothing left. vfio-mdev is essentially nothing more than
a vfio bus driver that forwards through to mdev to provide that
lifecycle interface to the vendor driver. Without that, it's just
another vfio bus driver.

> > mdev would also require vendor drivers to re-implement
> > much of vfio-pci for the direct access mechanisms. Also, do we really
> > want users or management tools to decide between binding a device to
> > vfio-pci or a separate mdev driver to get this functionality. We've
> > already been burnt trying to use mdev beyond its scope.
>
>
> The problem is, if we had a device that support both SRIOV and mdev.
> Does this mean we need prepare two set of drivers?

We have this situation today, modulo SR-IOV, but that's a red herring
anyway, VF vs PF is irrelevant. For example we can either directly
assign IGD graphics to a VM with vfio-pci or we can enable GVT-g
support in the i915 driver, which registers vGPU support via mdev.
These are different use cases, expose different features, and have
different support models. NVIDIA is the same way, assigning a GPU via
vfio-pci or a vGPU via vfio-mdev are entirely separate usage models.
Once we use mdev, it's at the vendor driver's discretion how the device
resources are backed, they might make use of the resource isolation of
SR-IOV or they might divide a single function.

If your question is whether there's a concern around proliferation of
vfio bus drivers and user confusion over which to use for what
features, yes, absolutely. I think this is why we're starting with
seeing what it looks like to add mediation to vfio-pci rather than
modularize vfio-pci and ask Intel to develop a new vfio-pci-intel-dsa
driver. I'm not yet convinced we won't eventually come back to that
latter approach though if this initial draft is what we can expect of a
mediated vfio-pci.

> >>>>>>> If we write a
> >>>>>>> vendor specific mdev parent driver, most of the code (like passthrough
> >>>>>>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
> >>>>>>> actually a duplicated and tedious work.
> >>>>>> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
> >>>>>> me we need to consider live migration for mdev as well. In that case, do we
> >>>>>> still expect mediate ops through VFIO directly?
> >>>>>>
> >>>>>>
> >>>>>>> (2) For features like dynamically trap/untrap pci bars, if they are in
> >>>>>>> vfio-pci, they can be available to most people without repeated code
> >>>>>>> copying and re-testing.
> >>>>>>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
> >>>>>>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
> >>>>>>> it runs into a real migration need. However, if vfio-pci is bound
> >>>>>>> initially, they have no chance to do live migration when there's a need
> >>>>>>> later.
> >>>>>> We can teach management layer to do this.
> >>>>>>
> >>>>> No. not possible as vfio-pci by default has no migration region and
> >>>>> dirty page tracking needs vendor's mediation at least for most
> >>>>> passthrough devices now.
> >>>> I'm not quite sure I get here but in this case, just tech them to use
> >>>> the driver that has migration support?
> >>>>
> >>> That's a way, but as more and more passthrough devices have demands and
> >>> caps to do migration, will vfio-pci be used in future any more ?
> >>
> >> This should not be a problem:
> >> - If we introduce a common mdev for vfio-pci, we can just bind that
> >> driver always
> > There's too much of mdev that doesn't make sense for this usage model,
> > this is why Yi's proposed generic mdev PCI wrapper is only a sample
> > driver. I think we do not want to introduce user confusion regarding
> > which driver to use and there are outstanding non-singleton group
> > issues with mdev that don't seem worthwhile to resolve.
>
>
> I agree, but I think what user want is a unified driver that works for
> both SRIOV and mdev. That's why trying to have a common way for doing
> mediation may make sense.

I don't think we can get to one driver, nor is it clear to me that we
should. Direct assignment and mdev currently provide different usage
models. Both are valid, both are useful. That said, I don't
necessarily want a user to need to choose whether to bind a device to
vfio-pci for base functionality or vfio-pci-vendor-foo for extended
functionality either. I think that's why we're exploring this
mediation approach and there's already precedent in vfio-pci for some
extent of device specific support. It's still a question though
whether it can be done clean enough to make it worthwhile. Thanks,

Alex