2021-01-17 18:18:53

by Max Gurtovoy

Subject: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

Hi Alex and Cornelia,

This series splits the vfio_pci driver into 2 parts: a pci driver and a
subsystem driver that will also be a library of code. The pci driver,
vfio_pci.ko, will be used as before and will bind to the subsystem
driver vfio_pci_core.ko to register to the VFIO subsystem. This patchset
is fully backward compatible. This is a typical Linux subsystem
framework behaviour. This framework can also be adopted by vfio_mdev
devices, as we'll see in the sketch below.

This series addresses the issues that were raised in the previous
attempt to extend vfio-pci for vendor specific functionality:
https://lkml.org/lkml/2020/5/17/376 by Yan Zhao.

This solution is also deterministic in the sense that when a user binds
to a vendor specific vfio-pci driver, it gets all the special goodies of
the HW.

This subsystem framework will also ease adding vendor specific
functionality to VFIO devices in the future by allowing another module
to provide the pci_driver, which can set up a number of details before
registering to the VFIO subsystem (such as injecting its own operations).
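
For illustration only, a minimal sketch of such a module, using the entry
points added in patch 2 (the foo_* names are hypothetical):

static const struct vfio_device_ops foo_vfio_pci_ops = {
	.name		= "foo-vfio-pci",
	.open		= vfio_pci_core_open,
	.release	= vfio_pci_core_release,
	.ioctl		= vfio_pci_core_ioctl,
	.read		= vfio_pci_core_read,
	.write		= vfio_pci_core_write,
	.mmap		= vfio_pci_core_mmap,
	.request	= vfio_pci_core_request,
	.match		= vfio_pci_core_match,
};

static int foo_vfio_pci_probe(struct pci_dev *pdev,
			      const struct pci_device_id *id)
{
	struct vfio_pci_device *vdev;

	/* vendor specific setup can run here, before the VFIO char
	 * device is created
	 */
	vdev = vfio_create_pci_device(pdev, &foo_vfio_pci_ops, NULL);
	if (IS_ERR(vdev))
		return PTR_ERR(vdev);

	return 0;
}

static void foo_vfio_pci_remove(struct pci_dev *pdev)
{
	vfio_destroy_pci_device(pdev);
}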

Below we can see the proposed changes (this patchset only deals with
VFIO_PCI subsystem but it can easily be extended to VFIO_MDEV subsystem
as well):

+----------------------------------------------------------------------+
| |
| VFIO |
| |
+----------------------------------------------------------------------+

+--------------------------------+ +--------------------------------+
| | | |
| VFIO_PCI_CORE | | VFIO_MDEV_CORE |
| | | |
+--------------------------------+ +--------------------------------+

+---------------+ +--------------+ +---------------+ +--------------+
| | | | | | | |
| | | | | | | |
| VFIO_PCI | | MLX5_VFIO_PCI| | VFIO_MDEV | |MLX5_VFIO_MDEV|
| | | | | | | |
| | | | | | | |
+---------------+ +--------------+ +---------------+ +--------------+

The first two patches introduce the above changes for vfio_pci and
vfio_pci_core.

Patch (3/3) introduces a new mlx5 vfio-pci module that registers to the
VFIO subsystem using vfio_pci_core. It also registers to the Auxiliary
bus for binding to mlx5_core, which is the parent of mlx5-vfio-pci
devices. This will allow extending mlx5-vfio-pci devices with HW
specific features such as Live Migration (the mlx5_core patches are not
part of this series, which only proposes the changes needed for the
vfio-pci subsystem).

These devices will be seen on the Auxiliary bus as:
mlx5_core.vfio_pci.2048 -> ../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/mlx5_core.vfio_pci.2048
mlx5_core.vfio_pci.2304 -> ../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.1/mlx5_core.vfio_pci.2304

2048 and 2304 are the decimal representations of BDFs 08:00.0 and
09:00.0, respectively. In this manner, the administrator will be able to
locate the correct vfio-pci module to bind the desired BDF to (by
finding the module according to the Auxiliary driver of that BDF).
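
For illustration, a short sketch of that mapping, assuming (as in the
mlx5-vfio-pci matching code of patch 3) that the auxiliary device id is
pci_dev_id() of the VF; the helper name is hypothetical:

/* pci_dev_id() == PCI_DEVID(bus, devfn) == (bus << 8) | devfn */
static u16 foo_vfio_pci_aux_id(u8 bus, u8 slot, u8 fn)
{
	return (bus << 8) | PCI_DEVFN(slot, fn);
}

/*
 * foo_vfio_pci_aux_id(0x08, 0x00, 0) == 0x0800 == 2048
 * foo_vfio_pci_aux_id(0x09, 0x00, 0) == 0x0900 == 2304
 */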

In this way, we'll use the HW vendor driver core to manage the lifecycle
of these devices. This is reasonable since only the vendor driver knows
exactly the status of its internal state and the capabilities of its
accelerators, for example.

TODOs:
1. For this RFC we still haven't cleaned up all the vendor specific code
that was merged in the past into vfio_pci (such as VFIO_PCI_IGD and
VFIO_PCI_NVLINK2).
2. Create a subsystem module for VFIO_MDEV. This can be used for vendor
specific scalable functions (SFs), for example.
3. Add Live migration functionality for mlx5 SNAP devices
(NVMe/Virtio-BLK).
4. Add Live migration functionality for mlx5 VFs.
5. Add the needed functionality for mlx5_core.

I would like to thank the great team that was involved in this
development, design and internal review:
Oren, Liran, Jason, Leon, Aviad, Shahaf, Gary, Artem, Kirti, Neo, Andy
and others.

This series applies cleanly on top of kernel 5.11-rc2+ commit 2ff90100ace8:
"Merge tag 'hwmon-for-v5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging"
from Linus.

Note: Live migration for MLX5 SNAP devices is WIP and will be the first
example of adding a vendor extension to vfio-pci devices. As the
changes to the subsystem must be defined as a pre-condition for
this work, we've decided to split the submission for now.

Max Gurtovoy (3):
vfio-pci: rename vfio_pci.c to vfio_pci_core.c
vfio-pci: introduce vfio_pci_core subsystem driver
mlx5-vfio-pci: add new vfio_pci driver for mlx5 devices

drivers/vfio/pci/Kconfig | 22 +-
drivers/vfio/pci/Makefile | 16 +-
drivers/vfio/pci/mlx5_vfio_pci.c | 253 +++
drivers/vfio/pci/vfio_pci.c | 2386 +--------------------------
drivers/vfio/pci/vfio_pci_core.c | 2311 ++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_private.h | 21 +
include/linux/mlx5/vfio_pci.h | 36 +
7 files changed, 2734 insertions(+), 2311 deletions(-)
create mode 100644 drivers/vfio/pci/mlx5_vfio_pci.c
create mode 100644 drivers/vfio/pci/vfio_pci_core.c
create mode 100644 include/linux/mlx5/vfio_pci.h

--
2.25.4


2021-01-17 18:19:12

by Max Gurtovoy

Subject: [PATCH 2/3] vfio-pci: introduce vfio_pci_core subsystem driver

Split the vfio_pci driver into two parts, the 'struct pci_driver'
(vfio_pci) and a library of code (vfio_pci_core) that helps create a
VFIO device on top of a PCI device.

As before, vfio_pci.ko continues to present the same interface under
sysfs, and this change should have no functional impact.

vfio_pci_core exposes an interface that is similar to a typical
Linux subsystem, in that a pci_driver doing probe() can set up a number
of details and then create the VFIO char device.

Allowing another module to provide the pci_driver allows that module
to customize how VFIO is set up, inject its own operations, and easily
extend vendor specific functionality.
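
As a sketch (mirroring what patch 3 does for mlx5; the foo_* names are
hypothetical), a module can wrap one exported core handler and leave the
rest pointing directly at the vfio_pci_core_* functions:

static ssize_t foo_vfio_pci_write(void *device_data,
				  const char __user *buf,
				  size_t count, loff_t *ppos)
{
	/* vendor specific handling could go here, before or after
	 * the generic path
	 */
	return vfio_pci_core_write(device_data, buf, count, ppos);
}

/* plugged into the module's vfio_device_ops as .write */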

This is complementary to how VFIO's mediated devices work. Instead of
custom device lifecycle management and a special bus, drivers using
this approach rely on the normal driver core lifecycle (e.g.
bind/unbind) management, and this is optimized to effectively support
customization that makes only small modifications to what vfio_pci
would do normally.

This approach is also a pluggable alternative for the hard wired
CONFIG_VFIO_PCI_IGD and CONFIG_VFIO_PCI_NVLINK2 "drivers" that are
built into vfio-pci. Using this work, all of that code can be moved to
dedicated device-specific modules and cleanly split out of the
generic vfio_pci driver.

Below is an example of adding a new driver to the vfio-pci subsystem:
+----------------------------------+
| |
| VFIO |
| |
+----------------------------------+

+----------------------------------+
| |
| VFIO_PCI_CORE |
| |
+----------------------------------+

+----------------+ +---------------+
| | | |
| VFIO_PCI | | MLX5_VFIO_PCI |
| | | |
+----------------+ +---------------+

In this way, mlx5_vfio_pci will use vfio_pci_core to register to the
vfio subsystem and also use the generic PCI functionality exported from
it. Additionally, it will add the needed vendor specific logic for HW
specific features such as Live Migration.

Signed-off-by: Max Gurtovoy <[email protected]>
---
drivers/vfio/pci/Kconfig | 12 +-
drivers/vfio/pci/Makefile | 13 +-
drivers/vfio/pci/vfio_pci.c | 220 ++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_core.c | 255 +++++++---------------------
drivers/vfio/pci/vfio_pci_private.h | 21 +++
5 files changed, 321 insertions(+), 200 deletions(-)
create mode 100644 drivers/vfio/pci/vfio_pci.c

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 40a223381ab6..5f90be27fba0 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only
-config VFIO_PCI
- tristate "VFIO support for PCI devices"
+config VFIO_PCI_CORE
+ tristate "VFIO core support for PCI devices"
depends on VFIO && PCI && EVENTFD
select VFIO_VIRQFD
select IRQ_BYPASS_MANAGER
@@ -10,6 +10,14 @@ config VFIO_PCI

If you don't know what to do here, say N.

+config VFIO_PCI
+ tristate "VFIO support for PCI devices"
+ depends on VFIO_PCI_CORE
+ help
+ This provides a generic PCI support using the VFIO framework.
+
+ If you don't know what to do here, say N.
+
config VFIO_PCI_VGA
bool "VFIO PCI support for VGA devices"
depends on VFIO_PCI && X86 && VGA_ARB
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index d5555d350b9b..3f2a27e222cd 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,8 +1,11 @@
# SPDX-License-Identifier: GPL-2.0-only

-vfio-pci-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
-vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
-vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
-vfio-pci-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
-
+obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+
+vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
+
+vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
new file mode 100644
index 000000000000..4e115a136930
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -0,0 +1,220 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
+ * Author: Alex Williamson <[email protected]>
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/vgaarb.h>
+#include <linux/nospec.h>
+#include <linux/sched/mm.h>
+
+#include "vfio_pci_private.h"
+
+#define DRIVER_VERSION "0.2"
+#define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
+#define DRIVER_DESC "VFIO PCI - User Level meta-driver"
+
+static char ids[1024] __initdata;
+module_param_string(ids, ids, sizeof(ids), 0);
+MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
+
+static bool enable_sriov;
+#ifdef CONFIG_PCI_IOV
+module_param(enable_sriov, bool, 0644);
+MODULE_PARM_DESC(enable_sriov, "Enable support for SR-IOV configuration. Enabling SR-IOV on a PF typically requires support of the userspace PF driver, enabling VFs without such support may result in non-functional VFs or PF.");
+#endif
+
+static bool disable_denylist;
+module_param(disable_denylist, bool, 0444);
+MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");
+
+static bool vfio_pci_dev_in_denylist(struct pci_dev *pdev)
+{
+ switch (pdev->vendor) {
+ case PCI_VENDOR_ID_INTEL:
+ switch (pdev->device) {
+ case PCI_DEVICE_ID_INTEL_QAT_C3XXX:
+ case PCI_DEVICE_ID_INTEL_QAT_C3XXX_VF:
+ case PCI_DEVICE_ID_INTEL_QAT_C62X:
+ case PCI_DEVICE_ID_INTEL_QAT_C62X_VF:
+ case PCI_DEVICE_ID_INTEL_QAT_DH895XCC:
+ case PCI_DEVICE_ID_INTEL_QAT_DH895XCC_VF:
+ return true;
+ default:
+ return false;
+ }
+ }
+
+ return false;
+}
+
+static bool vfio_pci_is_denylisted(struct pci_dev *pdev)
+{
+ if (!vfio_pci_dev_in_denylist(pdev))
+ return false;
+
+ if (disable_denylist) {
+ pci_warn(pdev,
+ "device denylist disabled - allowing device %04x:%04x.\n",
+ pdev->vendor, pdev->device);
+ return false;
+ }
+
+ pci_warn(pdev, "%04x:%04x exists in vfio-pci device denylist, driver probing disallowed.\n",
+ pdev->vendor, pdev->device);
+
+ return true;
+}
+
+static const struct vfio_device_ops vfio_pci_ops = {
+ .name = "vfio-pci",
+ .open = vfio_pci_core_open,
+ .release = vfio_pci_core_release,
+ .ioctl = vfio_pci_core_ioctl,
+ .read = vfio_pci_core_read,
+ .write = vfio_pci_core_write,
+ .mmap = vfio_pci_core_mmap,
+ .request = vfio_pci_core_request,
+ .match = vfio_pci_core_match,
+};
+
+static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+ struct vfio_pci_device *vdev;
+
+ if (vfio_pci_is_denylisted(pdev))
+ return -EINVAL;
+
+ vdev = vfio_create_pci_device(pdev, &vfio_pci_ops, NULL);
+ if (IS_ERR(vdev))
+ return PTR_ERR(vdev);
+
+ return 0;
+}
+
+static void vfio_pci_remove(struct pci_dev *pdev)
+{
+ vfio_destroy_pci_device(pdev);
+}
+
+static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
+ pci_channel_state_t state)
+{
+ return vfio_pci_core_aer_err_detected(pdev, state);
+}
+
+static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
+{
+ might_sleep();
+
+ if (!enable_sriov)
+ return -ENOENT;
+
+ return vfio_pci_core_sriov_configure(pdev, nr_virtfn);
+}
+
+static const struct pci_error_handlers vfio_err_handlers = {
+ .error_detected = vfio_pci_aer_err_detected,
+};
+
+static struct pci_driver vfio_pci_driver = {
+ .name = "vfio-pci",
+ .id_table = NULL, /* only dynamic ids */
+ .probe = vfio_pci_probe,
+ .remove = vfio_pci_remove,
+ .sriov_configure = vfio_pci_sriov_configure,
+ .err_handler = &vfio_err_handlers,
+};
+
+static void __exit vfio_pci_cleanup(void)
+{
+ pci_unregister_driver(&vfio_pci_driver);
+}
+
+static void __init vfio_pci_fill_ids(void)
+{
+ char *p, *id;
+ int rc;
+
+ /* no ids passed actually */
+ if (ids[0] == '\0')
+ return;
+
+ /* add ids specified in the module parameter */
+ p = ids;
+ while ((id = strsep(&p, ","))) {
+ unsigned int vendor, device, subvendor = PCI_ANY_ID,
+ subdevice = PCI_ANY_ID, class = 0, class_mask = 0;
+ int fields;
+
+ if (!strlen(id))
+ continue;
+
+ fields = sscanf(id, "%x:%x:%x:%x:%x:%x",
+ &vendor, &device, &subvendor, &subdevice,
+ &class, &class_mask);
+
+ if (fields < 2) {
+ pr_warn("invalid id string \"%s\"\n", id);
+ continue;
+ }
+
+ rc = pci_add_dynid(&vfio_pci_driver, vendor, device,
+ subvendor, subdevice, class, class_mask, 0);
+ if (rc)
+ pr_warn("failed to add dynamic id [%04x:%04x[%04x:%04x]] class %#08x/%08x (%d)\n",
+ vendor, device, subvendor, subdevice,
+ class, class_mask, rc);
+ else
+ pr_info("add [%04x:%04x[%04x:%04x]] class %#08x/%08x\n",
+ vendor, device, subvendor, subdevice,
+ class, class_mask);
+ }
+}
+
+static int __init vfio_pci_init(void)
+{
+ int ret;
+
+ /* Register and scan for devices */
+ ret = pci_register_driver(&vfio_pci_driver);
+ if (ret)
+ return ret;
+
+ vfio_pci_fill_ids();
+
+ if (disable_denylist)
+ pr_warn("device denylist disabled.\n");
+
+ return 0;
+}
+
+module_init(vfio_pci_init);
+module_exit(vfio_pci_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 706de3ef94bb..e87f828faf71 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -32,11 +32,7 @@

#define DRIVER_VERSION "0.2"
#define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
-#define DRIVER_DESC "VFIO PCI - User Level meta-driver"
-
-static char ids[1024] __initdata;
-module_param_string(ids, ids, sizeof(ids), 0);
-MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
+#define DRIVER_DESC "core driver for VFIO based PCI devices"

static bool nointxmask;
module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
@@ -54,16 +50,6 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
MODULE_PARM_DESC(disable_idle_d3,
"Disable using the PCI D3 low power state for idle, unused devices");

-static bool enable_sriov;
-#ifdef CONFIG_PCI_IOV
-module_param(enable_sriov, bool, 0644);
-MODULE_PARM_DESC(enable_sriov, "Enable support for SR-IOV configuration. Enabling SR-IOV on a PF typically requires support of the userspace PF driver, enabling VFs without such support may result in non-functional VFs or PF.");
-#endif
-
-static bool disable_denylist;
-module_param(disable_denylist, bool, 0444);
-MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");
-
static inline bool vfio_vga_disabled(void)
{
#ifdef CONFIG_VFIO_PCI_VGA
@@ -73,44 +59,6 @@ static inline bool vfio_vga_disabled(void)
#endif
}

-static bool vfio_pci_dev_in_denylist(struct pci_dev *pdev)
-{
- switch (pdev->vendor) {
- case PCI_VENDOR_ID_INTEL:
- switch (pdev->device) {
- case PCI_DEVICE_ID_INTEL_QAT_C3XXX:
- case PCI_DEVICE_ID_INTEL_QAT_C3XXX_VF:
- case PCI_DEVICE_ID_INTEL_QAT_C62X:
- case PCI_DEVICE_ID_INTEL_QAT_C62X_VF:
- case PCI_DEVICE_ID_INTEL_QAT_DH895XCC:
- case PCI_DEVICE_ID_INTEL_QAT_DH895XCC_VF:
- return true;
- default:
- return false;
- }
- }
-
- return false;
-}
-
-static bool vfio_pci_is_denylisted(struct pci_dev *pdev)
-{
- if (!vfio_pci_dev_in_denylist(pdev))
- return false;
-
- if (disable_denylist) {
- pci_warn(pdev,
- "device denylist disabled - allowing device %04x:%04x.\n",
- pdev->vendor, pdev->device);
- return false;
- }
-
- pci_warn(pdev, "%04x:%04x exists in vfio-pci device denylist, driver probing disallowed.\n",
- pdev->vendor, pdev->device);
-
- return true;
-}
-
/*
* Our VGA arbiter participation is limited since we don't know anything
* about the device itself. However, if the device is the only VGA device
@@ -515,8 +463,6 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
vfio_pci_set_power_state(vdev, PCI_D3hot);
}

-static struct pci_driver vfio_pci_driver;
-
static struct vfio_pci_device *get_pf_vdev(struct vfio_pci_device *vdev,
struct vfio_device **pf_dev)
{
@@ -529,7 +475,7 @@ static struct vfio_pci_device *get_pf_vdev(struct vfio_pci_device *vdev,
if (!*pf_dev)
return NULL;

- if (pci_dev_driver(physfn) != &vfio_pci_driver) {
+ if (pci_dev_driver(physfn) != pci_dev_driver(vdev->pdev)) {
vfio_device_put(*pf_dev);
return NULL;
}
@@ -553,7 +499,7 @@ static void vfio_pci_vf_token_user_add(struct vfio_pci_device *vdev, int val)
vfio_device_put(pf_dev);
}

-static void vfio_pci_release(void *device_data)
+void vfio_pci_core_release(void *device_data)
{
struct vfio_pci_device *vdev = device_data;

@@ -580,8 +526,9 @@ static void vfio_pci_release(void *device_data)

module_put(THIS_MODULE);
}
+EXPORT_SYMBOL_GPL(vfio_pci_core_release);

-static int vfio_pci_open(void *device_data)
+int vfio_pci_core_open(void *device_data)
{
struct vfio_pci_device *vdev = device_data;
int ret = 0;
@@ -606,6 +553,7 @@ static int vfio_pci_open(void *device_data)
module_put(THIS_MODULE);
return ret;
}
+EXPORT_SYMBOL_GPL(vfio_pci_core_open);

static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
{
@@ -797,8 +745,8 @@ struct vfio_devices {
int max_index;
};

-static long vfio_pci_ioctl(void *device_data,
- unsigned int cmd, unsigned long arg)
+long vfio_pci_core_ioctl(void *device_data, unsigned int cmd,
+ unsigned long arg)
{
struct vfio_pci_device *vdev = device_data;
unsigned long minsz;
@@ -1403,12 +1351,12 @@ static long vfio_pci_ioctl(void *device_data,

return -ENOTTY;
}
+EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);

-static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
+static ssize_t vfio_pci_rw(struct vfio_pci_device *vdev, char __user *buf,
size_t count, loff_t *ppos, bool iswrite)
{
unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
- struct vfio_pci_device *vdev = device_data;

if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
return -EINVAL;
@@ -1436,23 +1384,27 @@ static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
return -EINVAL;
}

-static ssize_t vfio_pci_read(void *device_data, char __user *buf,
- size_t count, loff_t *ppos)
+ssize_t vfio_pci_core_read(void *device_data, char __user *buf,
+ size_t count, loff_t *ppos)
{
+ struct vfio_pci_device *vdev = device_data;
if (!count)
return 0;

- return vfio_pci_rw(device_data, buf, count, ppos, false);
+ return vfio_pci_rw(vdev, buf, count, ppos, false);
}
+EXPORT_SYMBOL_GPL(vfio_pci_core_read);

-static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
- size_t count, loff_t *ppos)
+ssize_t vfio_pci_core_write(void *device_data,
+ const char __user *buf, size_t count, loff_t *ppos)
{
+ struct vfio_pci_device *vdev = device_data;
if (!count)
return 0;

- return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
+ return vfio_pci_rw(vdev, (char __user *)buf, count, ppos, true);
}
+EXPORT_SYMBOL_GPL(vfio_pci_core_write);

/* Return 1 on zap and vma_lock acquired, 0 on contention (only with @try) */
static int vfio_pci_zap_and_vma_lock(struct vfio_pci_device *vdev, bool try)
@@ -1648,7 +1600,8 @@ static const struct vm_operations_struct vfio_pci_mmap_ops = {
.fault = vfio_pci_mmap_fault,
};

-static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
+int vfio_pci_core_mmap(void *device_data,
+ struct vm_area_struct *vma)
{
struct vfio_pci_device *vdev = device_data;
struct pci_dev *pdev = vdev->pdev;
@@ -1715,8 +1668,9 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)

return 0;
}
+EXPORT_SYMBOL_GPL(vfio_pci_core_mmap);

-static void vfio_pci_request(void *device_data, unsigned int count)
+void vfio_pci_core_request(void *device_data, unsigned int count)
{
struct vfio_pci_device *vdev = device_data;
struct pci_dev *pdev = vdev->pdev;
@@ -1736,6 +1690,7 @@ static void vfio_pci_request(void *device_data, unsigned int count)

mutex_unlock(&vdev->igate);
}
+EXPORT_SYMBOL_GPL(vfio_pci_core_request);

static int vfio_pci_validate_vf_token(struct vfio_pci_device *vdev,
bool vf_token, uuid_t *uuid)
@@ -1832,7 +1787,7 @@ static int vfio_pci_validate_vf_token(struct vfio_pci_device *vdev,

#define VF_TOKEN_ARG "vf_token="

-static int vfio_pci_match(void *device_data, char *buf)
+int vfio_pci_core_match(void *device_data, char *buf)
{
struct vfio_pci_device *vdev = device_data;
bool vf_token = false;
@@ -1880,18 +1835,7 @@ static int vfio_pci_match(void *device_data, char *buf)

return 1; /* Match */
}
-
-static const struct vfio_device_ops vfio_pci_ops = {
- .name = "vfio-pci",
- .open = vfio_pci_open,
- .release = vfio_pci_release,
- .ioctl = vfio_pci_ioctl,
- .read = vfio_pci_read,
- .write = vfio_pci_write,
- .mmap = vfio_pci_mmap,
- .request = vfio_pci_request,
- .match = vfio_pci_match,
-};
+EXPORT_SYMBOL_GPL(vfio_pci_core_match);

static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
@@ -1910,12 +1854,12 @@ static int vfio_pci_bus_notifier(struct notifier_block *nb,
pci_info(vdev->pdev, "Captured SR-IOV VF %s driver_override\n",
pci_name(pdev));
pdev->driver_override = kasprintf(GFP_KERNEL, "%s",
- vfio_pci_ops.name);
+ vdev->vfio_pci_ops->name);
} else if (action == BUS_NOTIFY_BOUND_DRIVER &&
pdev->is_virtfn && physfn == vdev->pdev) {
struct pci_driver *drv = pci_dev_driver(pdev);

- if (drv && drv != &vfio_pci_driver)
+ if (drv && drv != pci_dev_driver(vdev->pdev))
pci_warn(vdev->pdev,
"VF %s bound to driver %s while PF bound to vfio-pci\n",
pci_name(pdev), drv->name);
@@ -1924,17 +1868,16 @@ static int vfio_pci_bus_notifier(struct notifier_block *nb,
return 0;
}

-static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+struct vfio_pci_device *vfio_create_pci_device(struct pci_dev *pdev,
+ const struct vfio_device_ops *vfio_pci_ops,
+ void *dd_data)
{
struct vfio_pci_device *vdev;
struct iommu_group *group;
int ret;

- if (vfio_pci_is_denylisted(pdev))
- return -EINVAL;
-
if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
- return -EINVAL;
+ return ERR_PTR(-EINVAL);

/*
* Prevent binding to PFs with VFs enabled, the VFs might be in use
@@ -1946,12 +1889,12 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
*/
if (pci_num_vf(pdev)) {
pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
- return -EBUSY;
+ return ERR_PTR(-EBUSY);
}

group = vfio_iommu_group_get(&pdev->dev);
if (!group)
- return -EINVAL;
+ return ERR_PTR(-EINVAL);

vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
if (!vdev) {
@@ -1960,6 +1903,8 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
}

vdev->pdev = pdev;
+ vdev->vfio_pci_ops = vfio_pci_ops;
+ vdev->dd_data = dd_data;
vdev->irq_type = VFIO_PCI_NUM_IRQS;
mutex_init(&vdev->igate);
spin_lock_init(&vdev->irqlock);
@@ -1970,7 +1915,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
INIT_LIST_HEAD(&vdev->vma_list);
init_rwsem(&vdev->memory_lock);

- ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
+ ret = vfio_add_group_dev(&pdev->dev, vfio_pci_ops, vdev);
if (ret)
goto out_free;

@@ -2016,7 +1961,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
vfio_pci_set_power_state(vdev, PCI_D3hot);
}

- return ret;
+ return vdev;

out_vf_token:
kfree(vdev->vf_token);
@@ -2028,10 +1973,11 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
kfree(vdev);
out_group_put:
vfio_iommu_group_put(group, &pdev->dev);
- return ret;
+ return ERR_PTR(ret);
}
+EXPORT_SYMBOL_GPL(vfio_create_pci_device);

-static void vfio_pci_remove(struct pci_dev *pdev)
+void vfio_destroy_pci_device(struct pci_dev *pdev)
{
struct vfio_pci_device *vdev;

@@ -2069,9 +2015,10 @@ static void vfio_pci_remove(struct pci_dev *pdev)
VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
}
}
+EXPORT_SYMBOL_GPL(vfio_destroy_pci_device);

-static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
- pci_channel_state_t state)
+pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
+ pci_channel_state_t state)
{
struct vfio_pci_device *vdev;
struct vfio_device *device;
@@ -2097,18 +2044,14 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,

return PCI_ERS_RESULT_CAN_RECOVER;
}
+EXPORT_SYMBOL_GPL(vfio_pci_core_aer_err_detected);

-static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
+int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
{
struct vfio_pci_device *vdev;
struct vfio_device *device;
int ret = 0;

- might_sleep();
-
- if (!enable_sriov)
- return -ENOENT;
-
device = vfio_device_get_from_dev(&pdev->dev);
if (!device)
return -ENODEV;
@@ -2128,19 +2071,7 @@ static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)

return ret < 0 ? ret : nr_virtfn;
}
-
-static const struct pci_error_handlers vfio_err_handlers = {
- .error_detected = vfio_pci_aer_err_detected,
-};
-
-static struct pci_driver vfio_pci_driver = {
- .name = "vfio-pci",
- .id_table = NULL, /* only dynamic ids */
- .probe = vfio_pci_probe,
- .remove = vfio_pci_remove,
- .sriov_configure = vfio_pci_sriov_configure,
- .err_handler = &vfio_err_handlers,
-};
+EXPORT_SYMBOL_GPL(vfio_pci_core_sriov_configure);

static DEFINE_MUTEX(reflck_lock);

@@ -2173,13 +2104,13 @@ static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
if (!device)
return 0;

- if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+ vdev = vfio_device_data(device);
+
+ if (pci_dev_driver(pdev) != pci_dev_driver(vdev->pdev)) {
vfio_device_put(device);
return 0;
}

- vdev = vfio_device_data(device);
-
if (vdev->reflck) {
vfio_pci_reflck_get(vdev->reflck);
*preflck = vdev->reflck;
@@ -2235,13 +2166,13 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
if (!device)
return -EINVAL;

- if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+ vdev = vfio_device_data(device);
+
+ if (pci_dev_driver(pdev) != pci_dev_driver(vdev->pdev)) {
vfio_device_put(device);
return -EBUSY;
}

- vdev = vfio_device_data(device);
-
/* Fault if the device is not unused */
if (vdev->refcnt) {
vfio_device_put(device);
@@ -2265,13 +2196,13 @@ static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data)
if (!device)
return -EINVAL;

- if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+ vdev = vfio_device_data(device);
+
+ if (pci_dev_driver(pdev) != pci_dev_driver(vdev->pdev)) {
vfio_device_put(device);
return -EBUSY;
}

- vdev = vfio_device_data(device);
-
/*
* Locking multiple devices is prone to deadlock, runaway and
* unwind if we hit contention.
@@ -2360,81 +2291,19 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
kfree(devs.devices);
}

-static void __exit vfio_pci_cleanup(void)
+static void __exit vfio_pci_core_cleanup(void)
{
- pci_unregister_driver(&vfio_pci_driver);
vfio_pci_uninit_perm_bits();
}

-static void __init vfio_pci_fill_ids(void)
-{
- char *p, *id;
- int rc;
-
- /* no ids passed actually */
- if (ids[0] == '\0')
- return;
-
- /* add ids specified in the module parameter */
- p = ids;
- while ((id = strsep(&p, ","))) {
- unsigned int vendor, device, subvendor = PCI_ANY_ID,
- subdevice = PCI_ANY_ID, class = 0, class_mask = 0;
- int fields;
-
- if (!strlen(id))
- continue;
-
- fields = sscanf(id, "%x:%x:%x:%x:%x:%x",
- &vendor, &device, &subvendor, &subdevice,
- &class, &class_mask);
-
- if (fields < 2) {
- pr_warn("invalid id string \"%s\"\n", id);
- continue;
- }
-
- rc = pci_add_dynid(&vfio_pci_driver, vendor, device,
- subvendor, subdevice, class, class_mask, 0);
- if (rc)
- pr_warn("failed to add dynamic id [%04x:%04x[%04x:%04x]] class %#08x/%08x (%d)\n",
- vendor, device, subvendor, subdevice,
- class, class_mask, rc);
- else
- pr_info("add [%04x:%04x[%04x:%04x]] class %#08x/%08x\n",
- vendor, device, subvendor, subdevice,
- class, class_mask);
- }
-}
-
-static int __init vfio_pci_init(void)
+static int __init vfio_pci_core_init(void)
{
- int ret;
-
/* Allocate shared config space permision data used by all devices */
- ret = vfio_pci_init_perm_bits();
- if (ret)
- return ret;
-
- /* Register and scan for devices */
- ret = pci_register_driver(&vfio_pci_driver);
- if (ret)
- goto out_driver;
-
- vfio_pci_fill_ids();
-
- if (disable_denylist)
- pr_warn("device denylist disabled.\n");
-
- return 0;
-
-out_driver:
- vfio_pci_uninit_perm_bits();
- return ret;
+ return vfio_pci_init_perm_bits();
}

-module_init(vfio_pci_init);
-module_exit(vfio_pci_cleanup);
+module_init(vfio_pci_core_init);
+module_exit(vfio_pci_core_cleanup);

MODULE_VERSION(DRIVER_VERSION);
MODULE_LICENSE("GPL v2");
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 5c90e560c5c7..331a8db6b537 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -101,6 +101,7 @@ struct vfio_pci_mmap_vma {

struct vfio_pci_device {
struct pci_dev *pdev;
+ const struct vfio_device_ops *vfio_pci_ops;
void __iomem *barmap[PCI_STD_NUM_BARS];
bool bar_mmap_supported[PCI_STD_NUM_BARS];
u8 *pci_config_map;
@@ -142,6 +143,7 @@ struct vfio_pci_device {
struct mutex vma_lock;
struct list_head vma_list;
struct rw_semaphore memory_lock;
+ void *dd_data;
};

#define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
@@ -225,4 +227,23 @@ static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
}
#endif

+/* Exported functions */
+struct vfio_pci_device *vfio_create_pci_device(struct pci_dev *pdev,
+ const struct vfio_device_ops *vfio_pci_ops, void *dd_data);
+void vfio_destroy_pci_device(struct pci_dev *pdev);
+int vfio_pci_core_open(void *device_data);
+void vfio_pci_core_release(void *device_data);
+long vfio_pci_core_ioctl(void *device_data, unsigned int cmd,
+ unsigned long arg);
+ssize_t vfio_pci_core_read(void *device_data, char __user *buf, size_t count,
+ loff_t *ppos);
+ssize_t vfio_pci_core_write(void *device_data, const char __user *buf,
+ size_t count, loff_t *ppos);
+int vfio_pci_core_mmap(void *device_data, struct vm_area_struct *vma);
+void vfio_pci_core_request(void *device_data, unsigned int count);
+int vfio_pci_core_match(void *device_data, char *buf);
+int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn);
+pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
+ pci_channel_state_t state);
+
#endif /* VFIO_PCI_PRIVATE_H */
--
2.25.4

2021-01-17 18:20:24

by Max Gurtovoy

Subject: [PATCH 1/3] vfio-pci: rename vfio_pci.c to vfio_pci_core.c

This is a preparation patch for separating the vfio_pci driver into a
subsystem driver and a generic pci driver. This patch doesn't change any
logic.

Signed-off-by: Max Gurtovoy <[email protected]>
---
drivers/vfio/pci/Makefile | 2 +-
drivers/vfio/pci/{vfio_pci.c => vfio_pci_core.c} | 0
2 files changed, 1 insertion(+), 1 deletion(-)
rename drivers/vfio/pci/{vfio_pci.c => vfio_pci_core.c} (100%)

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 781e0809d6ee..d5555d350b9b 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only

-vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
vfio-pci-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci_core.c
similarity index 100%
rename from drivers/vfio/pci/vfio_pci.c
rename to drivers/vfio/pci/vfio_pci_core.c
--
2.25.4

2021-01-17 18:20:33

by Max Gurtovoy

Subject: [PATCH 3/3] mlx5-vfio-pci: add new vfio_pci driver for mlx5 devices

This driver registers to both the PCI bus and the Auxiliary bus. If the
probe of both devices succeeds, we'll have a vendor specific VFIO
PCI device. mlx5_vfio_pci uses vfio_pci_core to register and create a
VFIO device and uses an auxiliary_device to get the needed extension
from the vendor device driver. If one of the probe() functions fails,
the VFIO char device will not be created. For now, we only register and
bind the auxiliary_device to the pci_device when the auxiliary_device id
matches the pci_device BDF. Later, vendor specific features such as live
migration will be added and will be available to the virtualization
software.

Note: Although we've created mlx5-vfio-pci.ko, binding to vfio-pci.ko
will still work as before. It's fully backward compatible.
Of course, the extended vendor functionality will not exist if one
binds the device to the generic vfio_pci.ko.

Signed-off-by: Max Gurtovoy <[email protected]>
---
drivers/vfio/pci/Kconfig | 10 ++
drivers/vfio/pci/Makefile | 3 +
drivers/vfio/pci/mlx5_vfio_pci.c | 253 +++++++++++++++++++++++++++++++
include/linux/mlx5/vfio_pci.h | 36 +++++
4 files changed, 302 insertions(+)
create mode 100644 drivers/vfio/pci/mlx5_vfio_pci.c
create mode 100644 include/linux/mlx5/vfio_pci.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 5f90be27fba0..2133cd2f9c92 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -65,3 +65,13 @@ config VFIO_PCI_ZDEV
for zPCI devices passed through via VFIO on s390.

Say Y here.
+
+config MLX5_VFIO_PCI
+ tristate "VFIO support for MLX5 PCI devices"
+ depends on VFIO_PCI_CORE && MLX5_CORE
+ select AUXILIARY_BUS
+ help
+ This provides a generic PCI support for MLX5 devices using the VFIO
+ framework.
+
+ If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 3f2a27e222cd..9f67edca31c5 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,6 +2,7 @@

obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o

vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
@@ -9,3 +10,5 @@ vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o

vfio-pci-y := vfio_pci.o
+
+mlx5-vfio-pci-y := mlx5_vfio_pci.o
diff --git a/drivers/vfio/pci/mlx5_vfio_pci.c b/drivers/vfio/pci/mlx5_vfio_pci.c
new file mode 100644
index 000000000000..98cc2d037b0d
--- /dev/null
+++ b/drivers/vfio/pci/mlx5_vfio_pci.c
@@ -0,0 +1,253 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+ * Author: Max Gurtovoy <[email protected]>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/sched/mm.h>
+#include <linux/mlx5/vfio_pci.h>
+
+#include "vfio_pci_private.h"
+
+#define DRIVER_VERSION "0.1"
+#define DRIVER_AUTHOR "Max Gurtovoy <[email protected]>"
+#define DRIVER_DESC "MLX5 VFIO PCI - User Level meta-driver for NVIDIA MLX5 device family"
+
+static LIST_HEAD(aux_devs_list);
+static DEFINE_MUTEX(aux_devs_lock);
+
+static struct mlx5_vfio_pci_adev *mlx5_vfio_pci_find_adev(struct pci_dev *pdev)
+{
+ struct mlx5_vfio_pci_adev *mvadev, *found = NULL;
+
+ mutex_lock(&aux_devs_lock);
+ list_for_each_entry(mvadev, &aux_devs_list, entry) {
+ if (mvadev->madev.adev.id == pci_dev_id(pdev)) {
+ found = mvadev;
+ break;
+ }
+ }
+ mutex_unlock(&aux_devs_lock);
+
+ return found;
+}
+
+static int mlx5_vfio_pci_aux_probe(struct auxiliary_device *adev,
+ const struct auxiliary_device_id *id)
+{
+ struct mlx5_vfio_pci_adev *mvadev;
+
+ mvadev = adev_to_mvadev(adev);
+
+ pr_info("%s aux probing bdf %02x:%02x.%d mdev is %s\n",
+ adev->name,
+ PCI_BUS_NUM(adev->id & 0xffff),
+ PCI_SLOT(adev->id & 0xff),
+ PCI_FUNC(adev->id & 0xff), dev_name(mvadev->madev.mdev->device));
+
+ mutex_lock(&aux_devs_lock);
+ list_add(&mvadev->entry, &aux_devs_list);
+ mutex_unlock(&aux_devs_lock);
+
+ return 0;
+}
+
+static void mlx5_vfio_pci_aux_remove(struct auxiliary_device *adev)
+{
+ struct mlx5_vfio_pci_adev *mvadev = adev_to_mvadev(adev);
+ struct vfio_pci_device *vdev = dev_get_drvdata(&adev->dev);
+
+ /* TODO: is this the right thing to do ? maybe FLR ? */
+ if (vdev)
+ pci_reset_function(vdev->pdev);
+
+ mutex_lock(&aux_devs_lock);
+ list_del(&mvadev->entry);
+ mutex_unlock(&aux_devs_lock);
+}
+
+static const struct auxiliary_device_id mlx5_vfio_pci_aux_id_table[] = {
+ { .name = MLX5_ADEV_NAME ".vfio_pci", },
+ {},
+};
+
+MODULE_DEVICE_TABLE(auxiliary, mlx5_vfio_pci_aux_id_table);
+
+static struct auxiliary_driver mlx5_vfio_pci_aux_driver = {
+ .name = "vfio_pci_ex",
+ .probe = mlx5_vfio_pci_aux_probe,
+ .remove = mlx5_vfio_pci_aux_remove,
+ .id_table = mlx5_vfio_pci_aux_id_table,
+};
+
+static struct pci_driver mlx5_vfio_pci_driver;
+
+static ssize_t mlx5_vfio_pci_write(void *device_data,
+ const char __user *buf, size_t count, loff_t *ppos)
+{
+ /* DO vendor specific stuff here ? */
+
+ return vfio_pci_core_write(device_data, buf, count, ppos);
+}
+
+static ssize_t mlx5_vfio_pci_read(void *device_data, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ /* DO vendor specific stuff here ? */
+
+ return vfio_pci_core_read(device_data, buf, count, ppos);
+}
+
+static long mlx5_vfio_pci_ioctl(void *device_data, unsigned int cmd,
+ unsigned long arg)
+{
+ /* DO vendor specific stuff here ? */
+
+ return vfio_pci_core_ioctl(device_data, cmd, arg);
+}
+
+static void mlx5_vfio_pci_release(void *device_data)
+{
+ /* DO vendor specific stuff here ? */
+
+ vfio_pci_core_release(device_data);
+}
+
+static int mlx5_vfio_pci_open(void *device_data)
+{
+ /* DO vendor specific stuff here ? */
+
+ return vfio_pci_core_open(device_data);
+}
+
+static const struct vfio_device_ops mlx5_vfio_pci_ops = {
+ .name = "mlx5-vfio-pci",
+ .open = mlx5_vfio_pci_open,
+ .release = mlx5_vfio_pci_release,
+ .ioctl = mlx5_vfio_pci_ioctl,
+ .read = mlx5_vfio_pci_read,
+ .write = mlx5_vfio_pci_write,
+ .mmap = vfio_pci_core_mmap,
+ .request = vfio_pci_core_request,
+ .match = vfio_pci_core_match,
+};
+
+static int mlx5_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+ struct vfio_pci_device *vdev;
+ struct mlx5_vfio_pci_adev *mvadev;
+
+ mvadev = mlx5_vfio_pci_find_adev(pdev);
+ if (!mvadev) {
+ pr_err("failed to find aux device for %s\n",
+ dev_name(&pdev->dev));
+ return -ENODEV;
+ }
+
+ vdev = vfio_create_pci_device(pdev, &mlx5_vfio_pci_ops, mvadev);
+ if (IS_ERR(vdev))
+ return PTR_ERR(vdev);
+
+ dev_set_drvdata(&mvadev->madev.adev.dev, vdev);
+ return 0;
+}
+
+static void mlx5_vfio_pci_remove(struct pci_dev *pdev)
+{
+ struct mlx5_vfio_pci_adev *mvadev;
+
+ mvadev = mlx5_vfio_pci_find_adev(pdev);
+ if (mvadev)
+ dev_set_drvdata(&mvadev->madev.adev.dev, NULL);
+
+ vfio_destroy_pci_device(pdev);
+}
+
+static pci_ers_result_t mlx5_vfio_pci_aer_err_detected(struct pci_dev *pdev,
+ pci_channel_state_t state)
+{
+ /* DO vendor specific stuff here ? */
+
+ return vfio_pci_core_aer_err_detected(pdev, state);
+}
+
+#ifdef CONFIG_PCI_IOV
+static int mlx5_vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
+{
+ might_sleep();
+
+ /* DO vendor specific stuff here */
+
+ return vfio_pci_core_sriov_configure(pdev, nr_virtfn);
+}
+#endif
+
+static const struct pci_error_handlers mlx5_vfio_err_handlers = {
+ .error_detected = mlx5_vfio_pci_aer_err_detected,
+};
+
+static const struct pci_device_id mlx5_vfio_pci_table[] = {
+ { PCI_VDEVICE(MELLANOX, 0x6001) }, /* NVMe SNAP controllers */
+ { PCI_DEVICE_SUB(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1042,
+ PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID) }, /* Virtio SNAP controllers */
+ { 0, }
+};
+
+static struct pci_driver mlx5_vfio_pci_driver = {
+ .name = "mlx5-vfio-pci",
+ .id_table = mlx5_vfio_pci_table,
+ .probe = mlx5_vfio_pci_probe,
+ .remove = mlx5_vfio_pci_remove,
+#ifdef CONFIG_PCI_IOV
+ .sriov_configure = mlx5_vfio_pci_sriov_configure,
+#endif
+ .err_handler = &mlx5_vfio_err_handlers,
+};
+
+static void __exit mlx5_vfio_pci_cleanup(void)
+{
+ auxiliary_driver_unregister(&mlx5_vfio_pci_aux_driver);
+ pci_unregister_driver(&mlx5_vfio_pci_driver);
+}
+
+static int __init mlx5_vfio_pci_init(void)
+{
+ int ret;
+
+ ret = pci_register_driver(&mlx5_vfio_pci_driver);
+ if (ret)
+ return ret;
+
+ ret = auxiliary_driver_register(&mlx5_vfio_pci_aux_driver);
+ if (ret)
+ goto out_unregister;
+
+ return 0;
+
+out_unregister:
+ pci_unregister_driver(&mlx5_vfio_pci_driver);
+ return ret;
+}
+
+module_init(mlx5_vfio_pci_init);
+module_exit(mlx5_vfio_pci_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/linux/mlx5/vfio_pci.h b/include/linux/mlx5/vfio_pci.h
new file mode 100644
index 000000000000..c1e7b4d6da30
--- /dev/null
+++ b/include/linux/mlx5/vfio_pci.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/*
+ * Copyright (c) 2020 NVIDIA Corporation
+ */
+
+#ifndef _VFIO_PCI_H
+#define _VFIO_PCI_H
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/auxiliary_bus.h>
+#include <linux/mlx5/device.h>
+#include <linux/mlx5/driver.h>
+
+struct mlx5_vfio_pci_adev {
+ struct mlx5_adev madev;
+
+ /* These fields should not be used outside mlx5_vfio_pci.ko */
+ struct list_head entry;
+};
+
+static inline struct mlx5_vfio_pci_adev*
+madev_to_mvadev(struct mlx5_adev *madev)
+{
+ return container_of(madev, struct mlx5_vfio_pci_adev, madev);
+}
+
+static inline struct mlx5_vfio_pci_adev*
+adev_to_mvadev(struct auxiliary_device *adev)
+{
+ struct mlx5_adev *madev = container_of(adev, struct mlx5_adev, adev);
+
+ return madev_to_mvadev(madev);
+}
+
+#endif
--
2.25.4

2021-01-18 16:57:35

by Jason Gunthorpe

Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Mon, Jan 18, 2021 at 02:38:06PM +0100, Cornelia Huck wrote:

> > These devices will be seen on the Auxiliary bus as:
> > mlx5_core.vfio_pci.2048 -> ../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/mlx5_core.vfio_pci.2048
> > mlx5_core.vfio_pci.2304 -> ../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.1/mlx5_core.vfio_pci.2304
> >
> > 2048 represents BDF 08:00.0 and 2304 represents BDF 09:00.0 in decimal
> > view. In this manner, the administrator will be able to locate the
> > correct vfio-pci module it should bind the desired BDF to (by finding
> > the pointer to the module according to the Auxiliary driver of that
> > BDF).
>
> I'm not familiar with that auxiliary framework (it seems to be fairly
> new?);

Auxiliary bus is for connecting two parts of the same driver that
reside in different modules/subsystems. Here it is connecting the vfio
part to the core part of mlx5 (running on the PF).

> but can you maybe create an auxiliary device unconditionally and
> contain all hardware-specific things inside a driver for it? Or is
> that not flexible enough?

The goal is to make a vfio device, auxiliary bus is only in the
picture because a VF device under vfio needs to interact with the PF
mlx5_core driver, auxiliary bus provides that connection.

You can say that all the HW specific things are in the mlx5_vfio_pci
driver. It is an unusual driver because it must bind to both the PCI
VF with a pci_driver and to the mlx5_core PF using an
auxiliary_driver. This is needed for the object lifetimes to be
correct.

The PF is providing services to control a full VF which mlx5_vfio_pci
converts into VFIO API.

> > drivers/vfio/pci/Kconfig | 22 +-
> > drivers/vfio/pci/Makefile | 16 +-
> > drivers/vfio/pci/mlx5_vfio_pci.c | 253 +++
> > drivers/vfio/pci/vfio_pci.c | 2386 +--------------------------
> > drivers/vfio/pci/vfio_pci_core.c | 2311 ++++++++++++++++++++++++++
>
> Especially regarding this diffstat...

It is a bit misleading because it doesn't show the rename

Jason

2021-01-18 18:20:53

by Jason Gunthorpe

Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Mon, Jan 18, 2021 at 05:00:09PM +0100, Cornelia Huck wrote:

> > You can say that all the HW specific things are in the mlx5_vfio_pci
> > driver. It is an unusual driver because it must bind to both the PCI
> > VF with a pci_driver and to the mlx5_core PF using an
> > auxiliary_driver. This is needed for the object lifetimes to be
> > correct.
>
> Hm... I might be confused about the usage of the term 'driver' here.
> IIUC, there are two drivers, one on the pci bus and one on the
> auxiliary bus. Is the 'driver' you're talking about here more the
> module you load (and not a driver in the driver core sense?)

Here "driver" would be the common term meaning the code that realizes
a subsytem for HW - so mlx5_vfio_pci is a VFIO driver because it
ultimately creates a /dev/vfio* through the vfio subsystem.

The same way we usually call something like mlx5_en an "ethernet
driver" not just a "pci driver"

> Yes, sure. But it also shows that mlx5_vfio_pci aka the device-specific
> code is rather small in comparison to the common vfio-pci code.
> Therefore my question whether it will gain more specific changes (that
> cannot be covered via the auxiliary driver.)

I'm not sure what you mean by "via the auxiliary driver" - there is only
one mlx5_vfio_pci, and the non-RFC version with all the migration code
is fairly big.

The pci_driver contributes a 'struct pci_device *' and the
auxiliary_driver contributes a 'struct mlx5_core_dev *'. mlx5_vfio_pci
fuses them together into a VFIO device. Depending on the VFIO
callback, it may use an API from the pci_device or from the
mlx5_core_dev device, or both.

Jason

2021-01-19 04:23:29

by Cornelia Huck

Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Sun, 17 Jan 2021 18:15:31 +0000
Max Gurtovoy <[email protected]> wrote:

> Hi Alex and Cornelia,
>
> This series split the vfio_pci driver into 2 parts: pci driver and a
> subsystem driver that will also be library of code. The pci driver,
> vfio_pci.ko will be used as before and it will bind to the subsystem
> driver vfio_pci_core.ko to register to the VFIO subsystem. This patchset
> if fully backward compatible. This is a typical Linux subsystem
> framework behaviour. This framework can be also adopted by vfio_mdev
> devices as we'll see in the below sketch.
>
> This series is coming to solve the issues that were raised in the
> previous attempt for extending vfio-pci for vendor specific
> functionality: https://lkml.org/lkml/2020/5/17/376 by Yan Zhao.
>
> This solution is also deterministic in a sense that when a user will
> bind to a vendor specific vfio-pci driver, it will get all the special
> goodies of the HW.
>
> This subsystem framework will also ease on adding vendor specific
> functionality to VFIO devices in the future by allowing another module
> to provide the pci_driver that can setup number of details before
> registering to VFIO subsystem (such as inject its own operations).
>
> Below we can see the proposed changes (this patchset only deals with
> VFIO_PCI subsystem but it can easily be extended to VFIO_MDEV subsystem
> as well):
>
> +----------------------------------------------------------------------+
> | |
> | VFIO |
> | |
> +----------------------------------------------------------------------+
>
> +--------------------------------+ +--------------------------------+
> | | | |
> | VFIO_PCI_CORE | | VFIO_MDEV_CORE |
> | | | |
> +--------------------------------+ +--------------------------------+
>
> +---------------+ +--------------+ +---------------+ +--------------+
> | | | | | | | |
> | | | | | | | |
> | VFIO_PCI | | MLX5_VFIO_PCI| | VFIO_MDEV | |MLX5_VFIO_MDEV|
> | | | | | | | |
> | | | | | | | |
> +---------------+ +--------------+ +---------------+ +--------------+
>
> First 2 patches introduce the above changes for vfio_pci and
> vfio_pci_core.
>
> Patch (3/3) introduces a new mlx5 vfio-pci module that registers to VFIO
> subsystem using vfio_pci_core. It also registers to Auxiliary bus for
> binding to mlx5_core that is the parent of mlx5-vfio-pci devices. This
> will allow extending mlx5-vfio-pci devices with HW specific features
> such as Live Migration (mlx5_core patches are not part of this series
> that comes for proposing the changes need for the vfio pci subsystem).
>
> These devices will be seen on the Auxiliary bus as:
> mlx5_core.vfio_pci.2048 -> ../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/mlx5_core.vfio_pci.2048
> mlx5_core.vfio_pci.2304 -> ../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.1/mlx5_core.vfio_pci.2304
>
> 2048 represents BDF 08:00.0 and 2304 represents BDF 09:00.0 in decimal
> view. In this manner, the administrator will be able to locate the
> correct vfio-pci module it should bind the desired BDF to (by finding
> the pointer to the module according to the Auxiliary driver of that
> BDF).

I'm not familiar with that auxiliary framework (it seems to be fairly
new?); but can you maybe create an auxiliary device unconditionally and
contain all hardware-specific things inside a driver for it? Or is that
not flexible enough?

>
> In this way, we'll use the HW vendor driver core to manage the lifecycle
> of these devices. This is reasonable since only the vendor driver knows
> exactly about the status on its internal state and the capabilities of
> its acceleratots, for example.
>
> TODOs:
> 1. For this RFC we still haven't cleaned all vendor specific stuff that
> were merged in the past into vfio_pci (such as VFIO_PCI_IG and
> VFIO_PCI_NVLINK2).
> 2. Create subsystem module for VFIO_MDEV. This can be used for vendor
> specific scalable functions for example (SFs).
> 3. Add Live migration functionality for mlx5 SNAP devices
> (NVMe/Virtio-BLK).
> 4. Add Live migration functionality for mlx5 VFs
> 5. Add the needed functionality for mlx5_core
>
> I would like to thank the great team that was involved in this
> development, design and internal review:
> Oren, Liran, Jason, Leon, Aviad, Shahaf, Gary, Artem, Kirti, Neo, Andy
> and others.
>
> This series applies cleanly on top of kernel 5.11-rc2+ commit 2ff90100ace8:
> "Merge tag 'hwmon-for-v5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging"
> from Linus.
>
> Note: Live migration for MLX5 SNAP devices is WIP and will be the first
> example for adding vendor extension to vfio-pci devices. As the
> changes to the subsystem must be defined as a pre-condition for
> this work, we've decided to split the submission for now.
>
> Max Gurtovoy (3):
> vfio-pci: rename vfio_pci.c to vfio_pci_core.c
> vfio-pci: introduce vfio_pci_core subsystem driver
> mlx5-vfio-pci: add new vfio_pci driver for mlx5 devices
>
> drivers/vfio/pci/Kconfig | 22 +-
> drivers/vfio/pci/Makefile | 16 +-
> drivers/vfio/pci/mlx5_vfio_pci.c | 253 +++
> drivers/vfio/pci/vfio_pci.c | 2386 +--------------------------
> drivers/vfio/pci/vfio_pci_core.c | 2311 ++++++++++++++++++++++++++

Especially regarding this diffstat... from a quick glance at patch 3,
it mostly forwards to vfio_pci_core anyway. Do you expect a huge amount
of device-specific callback invocations?

[I have not looked at this in detail yet.]

> drivers/vfio/pci/vfio_pci_private.h | 21 +
> include/linux/mlx5/vfio_pci.h | 36 +
> 7 files changed, 2734 insertions(+), 2311 deletions(-)
> create mode 100644 drivers/vfio/pci/mlx5_vfio_pci.c
> create mode 100644 drivers/vfio/pci/vfio_pci_core.c
> create mode 100644 include/linux/mlx5/vfio_pci.h
>

2021-01-19 04:46:33

by Cornelia Huck

Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Mon, 18 Jan 2021 11:10:20 -0400
Jason Gunthorpe <[email protected]> wrote:

> On Mon, Jan 18, 2021 at 02:38:06PM +0100, Cornelia Huck wrote:
>
> > > These devices will be seen on the Auxiliary bus as:
> > > mlx5_core.vfio_pci.2048 -> ../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/mlx5_core.vfio_pci.2048
> > > mlx5_core.vfio_pci.2304 -> ../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.1/mlx5_core.vfio_pci.2304
> > >
> > > 2048 represents BDF 08:00.0 and 2304 represents BDF 09:00.0 in decimal
> > > view. In this manner, the administrator will be able to locate the
> > > correct vfio-pci module it should bind the desired BDF to (by finding
> > > the pointer to the module according to the Auxiliary driver of that
> > > BDF).
> >
> > I'm not familiar with that auxiliary framework (it seems to be fairly
> > new?);
>
> Auxillary bus is for connecting two parts of the same driver that
> reside in different modules/subystems. Here it is connecting the vfio
> part to the core part of mlx5 (running on the PF).

Yes, that's also what I understood so far; still need to do more reading up.

>
> > but can you maybe create an auxiliary device unconditionally and
> > contain all hardware-specific things inside a driver for it? Or is
> > that not flexible enough?
>
> The goal is to make a vfio device, auxiliary bus is only in the
> picture because a VF device under vfio needs to interact with the PF
> mlx5_core driver, auxiliary bus provides that connection.

Nod.

>
> You can say that all the HW specific things are in the mlx5_vfio_pci
> driver. It is an unusual driver because it must bind to both the PCI
> VF with a pci_driver and to the mlx5_core PF using an
> auxiliary_driver. This is needed for the object lifetimes to be
> correct.

Hm... I might be confused about the usage of the term 'driver' here.
IIUC, there are two drivers, one on the pci bus and one on the
auxiliary bus. Is the 'driver' you're talking about here more the
module you load (and not a driver in the driver core sense?)

>
> The PF is providing services to control a full VF which mlx5_vfio_pci
> converts into VFIO API.
>
> > > drivers/vfio/pci/Kconfig | 22 +-
> > > drivers/vfio/pci/Makefile | 16 +-
> > > drivers/vfio/pci/mlx5_vfio_pci.c | 253 +++
> > > drivers/vfio/pci/vfio_pci.c | 2386 +--------------------------
> > > drivers/vfio/pci/vfio_pci_core.c | 2311 ++++++++++++++++++++++++++
> >
> > Especially regarding this diffstat...
>
> It is a bit misleading because it doesn't show the rename

Yes, sure. But it also shows that mlx5_vfio_pci aka the device-specific
code is rather small in comparison to the common vfio-pci code.
Therefore my question whether it will gain more specific changes (that
cannot be covered via the auxiliary driver.)
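
To make the dual binding described above concrete, here is a rough
skeleton of a module that registers both a pci_driver (for the VF) and
an auxiliary_driver (for the connection to the mlx5_core PF). All
mlx5_vfio_pci_* names, the aux id string and the registration order are
illustrative assumptions, not code taken from patch 3/3:

#include <linux/module.h>
#include <linux/pci.h>
#include <linux/auxiliary_bus.h>

static int mlx5_vfio_pci_probe(struct pci_dev *pdev,
                               const struct pci_device_id *id)
{
        /* would call the RFC's vfio_create_pci_device() here */
        return 0;
}

static void mlx5_vfio_pci_remove(struct pci_dev *pdev)
{
        /* would call the RFC's vfio_destroy_pci_device() here */
}

static struct pci_driver mlx5_vfio_pci_driver = {
        .name   = "mlx5-vfio-pci",
        /* .id_table omitted: bound via dynamic ids or driver_override */
        .probe  = mlx5_vfio_pci_probe,
        .remove = mlx5_vfio_pci_remove,
};

static int mlx5_vfio_pci_aux_probe(struct auxiliary_device *adev,
                                   const struct auxiliary_device_id *id)
{
        /* connect to the mlx5_core PF that created this aux device */
        return 0;
}

static void mlx5_vfio_pci_aux_remove(struct auxiliary_device *adev)
{
}

static const struct auxiliary_device_id mlx5_vfio_pci_aux_ids[] = {
        { .name = "mlx5_core.vfio_pci" },  /* matches mlx5_core.vfio_pci.N */
        {}
};
MODULE_DEVICE_TABLE(auxiliary, mlx5_vfio_pci_aux_ids);

static struct auxiliary_driver mlx5_vfio_pci_aux_driver = {
        .name     = "mlx5_vfio_pci",
        .probe    = mlx5_vfio_pci_aux_probe,
        .remove   = mlx5_vfio_pci_aux_remove,
        .id_table = mlx5_vfio_pci_aux_ids,
};

static int __init mlx5_vfio_pci_init(void)
{
        int ret;

        ret = auxiliary_driver_register(&mlx5_vfio_pci_aux_driver);
        if (ret)
                return ret;

        ret = pci_register_driver(&mlx5_vfio_pci_driver);
        if (ret)
                auxiliary_driver_unregister(&mlx5_vfio_pci_aux_driver);
        return ret;
}

static void __exit mlx5_vfio_pci_exit(void)
{
        pci_unregister_driver(&mlx5_vfio_pci_driver);
        auxiliary_driver_unregister(&mlx5_vfio_pci_aux_driver);
}

module_init(mlx5_vfio_pci_init);
module_exit(mlx5_vfio_pci_exit);
MODULE_LICENSE("GPL v2");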

2021-01-19 19:14:44

by Cornelia Huck

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Mon, 18 Jan 2021 14:16:26 -0400
Jason Gunthorpe <[email protected]> wrote:

> On Mon, Jan 18, 2021 at 05:00:09PM +0100, Cornelia Huck wrote:
>
> > > You can say that all the HW specific things are in the mlx5_vfio_pci
> > > driver. It is an unusual driver because it must bind to both the PCI
> > > VF with a pci_driver and to the mlx5_core PF using an
> > > auxiliary_driver. This is needed for the object lifetimes to be
> > > correct.
> >
> > Hm... I might be confused about the usage of the term 'driver' here.
> > IIUC, there are two drivers, one on the pci bus and one on the
> > auxiliary bus. Is the 'driver' you're talking about here more the
> > module you load (and not a driver in the driver core sense?)
>
> Here "driver" would be the common term meaning the code that realizes
> a subsystem for HW - so mlx5_vfio_pci is a VFIO driver because it
> ultimately creates a /dev/vfio* through the vfio subsystem.
>
> The same way we usually call something like mlx5_en an "ethernet
> driver" not just a "pci driver"
>
> > Yes, sure. But it also shows that mlx5_vfio_pci aka the device-specific
> > code is rather small in comparison to the common vfio-pci code.
> > Therefore my question whether it will gain more specific changes (that
> > cannot be covered via the auxiliary driver.)
>
> I'm not sure what you mean "via the auxiliary driver" - there is only
> one mlx5_vfio_pci, and the non-RFC version with all the migration code
> is fairly big.
>
> The pci_driver contributes a 'struct pci_device *' and the
> auxiliary_driver contributes a 'struct mlx5_core_dev *'. mlx5_vfio_pci
> fuses them together into a VFIO device. Depending on the VFIO
> callback, it may use an API from the pci_device or from the
> mlx5_core_dev device, or both.

Let's rephrase my question a bit:

This proposal splits the existing vfio-pci driver into a "core"
component and code actually implementing the "driver" part. For mlx5,
an alternative "driver" is introduced that reuses the "core" component
and also hooks into mlx5-specific code parts via the auxiliary device
framework. (IIUC, the plan is to make existing special cases for
devices follow mlx5's lead later.)

I've been thinking of an alternative split: Keep vfio-pci as it is now,
but add an auxiliary device. For mlx5, an auxiliary device_driver can
match to that device and implement mlx5-specific things. From the code
in this RFC, it is not clear to me whether this would be feasible: most
callbacks seem to simply forward to the core component, and that might
be something a purely auxiliary device_driver could do; but this
may or may not work well for additional functionality.

I guess my question is: into which callbacks will the additional
functionality hook? If there's no good way to do what they need to do
without manipulating the vfio-pci calls, my proposal will not work, and
this proposal looks like the better way. But it's hard to tell without
seeing the code, which is why I'm asking :)

2021-01-19 19:54:35

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Tue, Jan 19, 2021 at 07:56:10PM +0100, Cornelia Huck wrote:
> On Mon, 18 Jan 2021 14:16:26 -0400
> Jason Gunthorpe <[email protected]> wrote:
>
> > On Mon, Jan 18, 2021 at 05:00:09PM +0100, Cornelia Huck wrote:
> >
> > > > You can say that all the HW specific things are in the mlx5_vfio_pci
> > > > driver. It is an unusual driver because it must bind to both the PCI
> > > > VF with a pci_driver and to the mlx5_core PF using an
> > > > auxiliary_driver. This is needed for the object lifetimes to be
> > > > correct.
> > >
> > > Hm... I might be confused about the usage of the term 'driver' here.
> > > IIUC, there are two drivers, one on the pci bus and one on the
> > > auxiliary bus. Is the 'driver' you're talking about here more the
> > > module you load (and not a driver in the driver core sense?)
> >
> > Here "driver" would be the common term meaning the code that realizes
> > a subsystem for HW - so mlx5_vfio_pci is a VFIO driver because it
> > ultimately creates a /dev/vfio* through the vfio subsystem.
> >
> > The same way we usually call something like mlx5_en an "ethernet
> > driver" not just a "pci driver"
> >
> > > Yes, sure. But it also shows that mlx5_vfio_pci aka the device-specific
> > > code is rather small in comparison to the common vfio-pci code.
> > > Therefore my question whether it will gain more specific changes (that
> > > cannot be covered via the auxiliary driver.)
> >
> > I'm not sure what you mean "via the auxiliary driver" - there is only
> > one mlx5_vfio_pci, and the non-RFC version with all the migration code
> > is fairly big.
> >
> > The pci_driver contributes a 'struct pci_device *' and the
> > auxiliary_driver contributes a 'struct mlx5_core_dev *'. mlx5_vfio_pci
> > fuses them together into a VFIO device. Depending on the VFIO
> > callback, it may use an API from the pci_device or from the
> > mlx5_core_dev device, or both.
>
> Let's rephrase my question a bit:
>
> This proposal splits the existing vfio-pci driver into a "core"
> component and code actually implementing the "driver" part. For mlx5,
> an alternative "driver" is introduced that reuses the "core" component
> and also hooks into mlx5-specific code parts via the auxiliary device
> framework.

Yes, I think you understand it well

> (IIUC, the plan is to make existing special cases for devices follow
> mlx5's lead later.)

Well, it is a direction to go. I think adding 'if pci matches XX then
squeeze in driver Y' to vfio-pci was a hacky thing to do; this is a
way out.

We could just add 'if pci matches mlx5 then squeeze in driver mlx5'
too - but that is really too horrific to seriously consider.

> I've been thinking of an alternative split: Keep vfio-pci as it is now,
> but add an auxiliary device.

vfio-pci cannot use an auxiliary device. It is only for connecting parts
of the same driver together. vfio-pci has no other parts to connect.

Further, that is not how the driver core in Linux is designed to
work. We don't have subsystems provide a pci_driver and then go look
around for a 2nd driver to somehow mix in. How would it know what
driver to pick? How would drivers be autoloaded? How can the user know
things worked out right? What if they didn't want that? So many
questions.

The standard driver core flow is always
pci_driver -> subsystem -> userspace

I don't think VFIO's needs are special, so why deviate?

> I guess my question is: into which callbacks will the additional
> functionality hook? If there's no good way to do what they need to do
> without manipulating the vfio-pci calls, my proposal will not work, and
> this proposal looks like the better way. But it's hard to tell without
> seeing the code, which is why I'm asking :)

Well, could we have done the existing special devices we have today
with that approach? What about the Intel thing I saw RFC'd a while
ago? Or the next set of mlx5 devices beyond storage? Or a future SIOV
device?

If you have doubts the idea is flexible enough, then I think you
already answered the question :)

LWN had a good article on these design patterns. This RFC is following
what LWN called the "tool box" pattern. We keep the driver and
subsystem close together and provide useful tools to share common
code. vfio_pci_core's stuff is a tool. It is a proven and flexible
pattern.

I think you are suggesting what LWN called a "midlayer mistake"
pattern. While that can be workable, it doesn't seem right
here. vfio-pci is not really a logical midlayer, and something acting
as a midlayer should never have a device_driver.

To quote lwn:
The core thesis of the "midlayer mistake" is that midlayers are bad
and should not exist. That common functionality which it is so
tempting to put in a midlayer should instead be provided as library
routines which can be used, augmented, or ignored by each bottom level
driver independently. Thus every subsystem that supports multiple
implementations (or drivers) should provide a very thin top layer
which calls directly into the bottom layer drivers, and a rich library
of support code that eases the implementation of those drivers. This
library is available to, but not forced upon, those drivers.

Given the exciting future of VFIO I believe strongly the above is the
right general design direction.

Jason

2021-01-22 19:40:50

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Sun, 17 Jan 2021 18:15:31 +0000
Max Gurtovoy <[email protected]> wrote:

> Hi Alex and Cornelia,
>
> This series split the vfio_pci driver into 2 parts: pci driver and a
> subsystem driver that will also be library of code. The pci driver,
> vfio_pci.ko will be used as before and it will bind to the subsystem
> driver vfio_pci_core.ko to register to the VFIO subsystem. This patchset
> if fully backward compatible. This is a typical Linux subsystem
> framework behaviour. This framework can be also adopted by vfio_mdev
> devices as we'll see in the below sketch.
>
> This series is coming to solve the issues that were raised in the
> previous attempt for extending vfio-pci for vendor specific
> functionality: https://lkml.org/lkml/2020/5/17/376 by Yan Zhao.

Is the only significant difference here the fact that function
declarations remain in a private header? Otherwise it seems to
have all the risks of previous attempts, ie. exported symbols with a
lot of implicit behavior shared between them.

> This solution is also deterministic in a sense that when a user will
> bind to a vendor specific vfio-pci driver, it will get all the special
> goodies of the HW.

That determinism really comes after the point where we were concerned
about deterministic behavior, ie. how does a user know which driver to
use for which device. It seems with the aux bus approach we're letting
the default host driver expose sub-functions for assignment, similar to
mdev, but nothing about this code split requires such an approach. As
noted in 2/3, IGD support could be a separate module, but that's a
direct assignment driver, so then the user must decide to use vfio-pci
or igd-vfio-pci, depending on which features they want.

> This subsystem framework will also ease on adding vendor specific
> functionality to VFIO devices in the future by allowing another module
> to provide the pci_driver that can setup number of details before
> registering to VFIO subsystem (such as inject its own operations).

Which leads us directly back to the problem that we then have numerous
drivers that a user might choose for a given device.

> Below we can see the proposed changes (this patchset only deals with
> VFIO_PCI subsystem but it can easily be extended to VFIO_MDEV subsystem
> as well):
>
> +----------------------------------------------------------------------+
> | |
> | VFIO |
> | |
> +----------------------------------------------------------------------+
>
> +--------------------------------+ +--------------------------------+
> | | | |
> | VFIO_PCI_CORE | | VFIO_MDEV_CORE |
> | | | |
> +--------------------------------+ +--------------------------------+
>
> +---------------+ +--------------+ +---------------+ +--------------+
> | | | | | | | |
> | | | | | | | |
> | VFIO_PCI | | MLX5_VFIO_PCI| | VFIO_MDEV | |MLX5_VFIO_MDEV|
> | | | | | | | |
> | | | | | | | |
> +---------------+ +--------------+ +---------------+ +--------------+
>
> First 2 patches introduce the above changes for vfio_pci and
> vfio_pci_core.
>
> Patch (3/3) introduces a new mlx5 vfio-pci module that registers to VFIO
> subsystem using vfio_pci_core. It also registers to Auxiliary bus for
> binding to mlx5_core that is the parent of mlx5-vfio-pci devices. This
> will allow extending mlx5-vfio-pci devices with HW specific features
> such as Live Migration (mlx5_core patches are not part of this series
> that comes for proposing the changes need for the vfio pci subsystem).
>
> These devices will be seen on the Auxiliary bus as:
> mlx5_core.vfio_pci.2048 -> ../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/mlx5_core.vfio_pci.2048
> mlx5_core.vfio_pci.2304 -> ../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.1/mlx5_core.vfio_pci.2304
>
> 2048 represents BDF 08:00.0 and 2304 represents BDF 09:00.0 in decimal

And what are devices 08:00.0 and 09:00.0 on the host? We can only see
from above that they're children of a PCI device.

> view. In this manner, the administrator will be able to locate the
> correct vfio-pci module it should bind the desired BDF to (by finding
> the pointer to the module according to the Auxiliary driver of that
> BDF).

Sorry, clear as mud. Are we really expecting users to decode a decimal
aux bus ID to a PCI BDF and how is the ID standardized among aux bus
devices?
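
For reference, the decimal ids shown earlier do decode mechanically if
one assumes they are PCI_DEVID(bus, devfn), i.e. (bus << 8) | devfn,
printed in decimal; that is inferred from the example values above, not
from any documented convention:

#include <linux/pci.h>
#include <linux/printk.h>

/* 2048 == 0x0800 -> 08:00.0, 2304 == 0x0900 -> 09:00.0 */
static void decode_aux_id(u16 id)
{
        pr_info("%02x:%02x.%d\n",
                id >> 8, PCI_SLOT(id & 0xff), PCI_FUNC(id & 0xff));
}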

> In this way, we'll use the HW vendor driver core to manage the lifecycle
> of these devices. This is reasonable since only the vendor driver knows
> exactly about the status on its internal state and the capabilities of
> its acceleratots, for example.

But mdev provides that too, or the vendor could write their own vfio
bus driver for the device, this doesn't really justify or delve deep
enough to show examples beyond "TODO" remarks for a vendor driver
actually interacting with vfio-pci-core in an extensible way. One of
the concerns of previous efforts was that it's trying to directly
expose vfio-pci's implementation as an API for vendor drivers, I don't
really see that anything has changed in that respect here. Thanks,

Alex

> TODOs:
> 1. For this RFC we still haven't cleaned all vendor specific stuff that
> were merged in the past into vfio_pci (such as VFIO_PCI_IG and
> VFIO_PCI_NVLINK2).
> 2. Create subsystem module for VFIO_MDEV. This can be used for vendor
> specific scalable functions for example (SFs).
> 3. Add Live migration functionality for mlx5 SNAP devices
> (NVMe/Virtio-BLK).
> 4. Add Live migration functionality for mlx5 VFs
> 5. Add the needed functionality for mlx5_core
>
> I would like to thank the great team that was involved in this
> development, design and internal review:
> Oren, Liran, Jason, Leon, Aviad, Shahaf, Gary, Artem, Kirti, Neo, Andy
> and others.
>
> This series applies cleanly on top of kernel 5.11-rc2+ commit 2ff90100ace8:
> "Merge tag 'hwmon-for-v5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging"
> from Linus.
>
> Note: Live migration for MLX5 SNAP devices is WIP and will be the first
> example for adding vendor extension to vfio-pci devices. As the
> changes to the subsystem must be defined as a pre-condition for
> this work, we've decided to split the submission for now.
>
> Max Gurtovoy (3):
> vfio-pci: rename vfio_pci.c to vfio_pci_core.c
> vfio-pci: introduce vfio_pci_core subsystem driver
> mlx5-vfio-pci: add new vfio_pci driver for mlx5 devices
>
> drivers/vfio/pci/Kconfig | 22 +-
> drivers/vfio/pci/Makefile | 16 +-
> drivers/vfio/pci/mlx5_vfio_pci.c | 253 +++
> drivers/vfio/pci/vfio_pci.c | 2386 +--------------------------
> drivers/vfio/pci/vfio_pci_core.c | 2311 ++++++++++++++++++++++++++
> drivers/vfio/pci/vfio_pci_private.h | 21 +
> include/linux/mlx5/vfio_pci.h | 36 +
> 7 files changed, 2734 insertions(+), 2311 deletions(-)
> create mode 100644 drivers/vfio/pci/mlx5_vfio_pci.c
> create mode 100644 drivers/vfio/pci/vfio_pci_core.c
> create mode 100644 include/linux/mlx5/vfio_pci.h
>

2021-01-22 20:23:51

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Fri, Jan 22, 2021 at 12:25:03PM -0700, Alex Williamson wrote:

> Is the only significant difference here the fact that function
> declarations remain in a private header? Otherwise it seems to
> have all the risks of previous attempts, ie. exported symbols with a
> lot of implicit behavior shared between them.

If you want to use substantial amounts of vfio-pci for multiple
drivers then what other choice is there? You and I previously talked
about shrinking vfio-pci by moving things to common code, outside
VFIO, and you didn't like that direction either.

So if this shared code lives in vfio-pci, vfio-pci must become a
library, because this code must be shared.

The big difference between the May patch series and this one is that
this one actually starts down the path of turning vfio-pci into a proper
library.

The general long term goal is to make the interface exposed from the
vfio_pci_core.ko library well defined and sensible. Obviously the
first patches to make this split are not going to get everything
polished, but they set a direction for future work.

This approach is actually pretty clean because only the ops are
exported and the ops API is already well defined by VFIO. The internal
bits of vfio-pci-core can remain hidden behind a defined/restricted
API.

The May series lacks a clear demarcation for where the library begins/ends
and vfio_pci.ko starts, and doesn't really present a clean way to get
anywhere better.

Also the May series developed its own internalized copy of the driver
core, which is a big no-no. Don't duplicate driver core functionality in
subsystems. This uses the driver core much closer to how it was
intended - only the dual binding is a bit odd, but necessary.

> noted in 2/3, IGD support could be a separate module, but that's a
> direct assignment driver, so then the user must decide to use vfio-pci
> or igd-vfio-pci, depending on which features they want.

Which I think actually makes sense. Having vfio-pci transparently do
more than just the basic PCI-sig defined stuff is a very confusing
path.

Trying to make vfio_pci do everything for everyone will just be a huge
incomprehensible mess.

> > This subsystem framework will also ease on adding vendor specific
> > functionality to VFIO devices in the future by allowing another module
> > to provide the pci_driver that can setup number of details before
> > registering to VFIO subsystem (such as inject its own operations).
>
> Which leads us directly back to the problem that we then have numerous
> drivers that a user might choose for a given device.

Sure, but that is a problem that can be tackled in several different
ways from userspace. Max here mused about some general algorithm to
process aux device names; you could have userspace look in the module
data to find VFIO drivers and autoload them like everything else in
Linux; you could have some VFIO netlink thing; there are many options
worth exploring.

Solving the proliferation of VFIO drivers feels like a tangent - let's
first agree that having multiple VFIO drivers is the right technical
direction within the Linux driver core model before we try to figure
out what to do for userspace.

> > In this way, we'll use the HW vendor driver core to manage the lifecycle
> > of these devices. This is reasonable since only the vendor driver knows
> > exactly about the status on its internal state and the capabilities of
> > its acceleratots, for example.
>
> But mdev provides that too, or the vendor could write their own vfio

Not really, mdev has a completely different lifecycle model that is
not very compatible with what is required here.

And writing a VFIO driver is basically what this does, just a large
portion of the driver is reusing code from the normal vfio-pci cases.

Jason

2021-01-25 16:26:53

by Cornelia Huck

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Fri, 22 Jan 2021 16:04:21 -0400
Jason Gunthorpe <[email protected]> wrote:

> On Fri, Jan 22, 2021 at 12:25:03PM -0700, Alex Williamson wrote:

> > > In this way, we'll use the HW vendor driver core to manage the lifecycle
> > > of these devices. This is reasonable since only the vendor driver knows
> > > exactly about the status on its internal state and the capabilities of
> > > its acceleratots, for example.
> >
> > But mdev provides that too, or the vendor could write their own vfio
>
> Not really, mdev has a completely different lifecycle model that is
> not very compatible with what is required here.
>
> And writing a VFIO driver is basically what this does, just a large
> portion of the driver is reusing code from the normal vfio-pci cases.

I think you cut out an important part of Alex' comment, so let me
repost it here:

"But mdev provides that too, or the vendor could write their own vfio
bus driver for the device, this doesn't really justify or delve deep
enough to show examples beyond "TODO" remarks for a vendor driver
actually interacting with vfio-pci-core in an extensible way. One of
the concerns of previous efforts was that it's trying to directly
expose vfio-pci's implementation as an API for vendor drivers, I don't
really see that anything has changed in that respect here."

I'm missing the bigger picture of how this api is supposed to work out,
a driver with a lot of TODOs does not help much to figure out whether
this split makes sense and is potentially useful for a number of use
cases, or whether mdev (even with its different lifecycle model) or a
different vfio bus driver might be a better fit for the more involved
cases. (For example, can s390 ISM fit here?)

2021-01-25 18:14:42

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Mon, Jan 25, 2021 at 05:20:35PM +0100, Cornelia Huck wrote:

> I think you cut out an important part of Alex' comment, so let me
> repost it here:

Yes, I've already responded to this.

> I'm missing the bigger picture of how this api is supposed to work out,
> a driver with a lot of TODOs does not help much to figure out whether
> this split makes sense and is potentially useful for a number of use
> cases

The change to vfio-pci is essentially complete, ignoring some
additional cleanup. I'm not sure seeing details how some mlx5 driver
uses it will be substantially more informative if it is useful for
S390, or Intel.

As far as API it is clear to see:

> +/* Exported functions */
> +struct vfio_pci_device *vfio_create_pci_device(struct pci_dev *pdev,
> + const struct vfio_device_ops *vfio_pci_ops, void *dd_data);
> +void vfio_destroy_pci_device(struct pci_dev *pdev);

Called from probe/remove of the consuming driver

> +int vfio_pci_core_open(void *device_data);
> +void vfio_pci_core_release(void *device_data);
> +long vfio_pci_core_ioctl(void *device_data, unsigned int cmd,
> + unsigned long arg);
> +ssize_t vfio_pci_core_read(void *device_data, char __user *buf, size_t count,
> + loff_t *ppos);
> +ssize_t vfio_pci_core_write(void *device_data, const char __user *buf,
> + size_t count, loff_t *ppos);
> +int vfio_pci_core_mmap(void *device_data, struct vm_area_struct *vma);
> +void vfio_pci_core_request(void *device_data, unsigned int count);
> +int vfio_pci_core_match(void *device_data, char *buf);

Called from vfio_device_ops and has the existing well defined API of
the matching ops.
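
In other words, under the API listed above, a driver that wants stock
vfio-pci behaviour could simply wire the core helpers straight into its
ops table; "my-vfio-pci" below is a made-up name, and the callback slots
follow the exported functions quoted above:

static const struct vfio_device_ops my_vfio_pci_ops = {
        .name    = "my-vfio-pci",
        .open    = vfio_pci_core_open,
        .release = vfio_pci_core_release,
        .ioctl   = vfio_pci_core_ioctl,
        .read    = vfio_pci_core_read,
        .write   = vfio_pci_core_write,
        .mmap    = vfio_pci_core_mmap,
        .request = vfio_pci_core_request,
        .match   = vfio_pci_core_match,
};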

> +int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn);
> +pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
> + pci_channel_state_t state);
> +

Callbacks from the PCI core, API defined by the PCI subsystem.

Notice there is no major new API exposed from vfio-pci, nor are
vfio-pci internals any more exposed than they used to be.

Except create/destroy, every single newly exported function was
already available via a function pointer someplace else in the
API. This is a key point, because it is very much NOT like the May
series.

Because the new driver sits before vfio_pci because it owns the
vfio_device_ops, it introduces nearly nothing new. The May series put
the new driver after vfio-pci as some internalized sub-driver and
exposed a whole new API, wrappers and callbacks to go along with it.

For instance if a new driver wants to implement some new thing under
ioctl, like migration, then it would do

static long new_driver_pci_core_ioctl(void *device_data, unsigned int cmd,
                                      unsigned long arg)
{
        switch (cmd) {
        case NEW_THING: return new_driver_thing();
        default: return vfio_pci_core_ioctl(device_data, cmd, arg);
        }
}

static const struct vfio_device_ops new_driver_pci_ops = {
        [...]
        .ioctl = new_driver_pci_core_ioctl,
};

Simple enough; if you understand the above, then you also understand
what direction the mlx5 driver will go in.

This is also why it is clearly useful for a wide range of cases, as a
driver can use as much or as little of the vfio-pci-core ops as it
wants. The driver doesn't get to reach into vfio-pci, but it can sit
before it and intercept the entire uAPI. That is very powerful.

> or whether mdev (even with its different lifecycle model) or a

I think it is appropriate to think of mdev as only the special
lifecycle model, it doesn't have any other functionality.

mdev's lifecycle model does nothing to solve the need to libraryize
vfio-pci.

> different vfio bus driver might be a better fit for the more

What is a "vfio bus driver" ? Why would we need a bus here?

> involved cases. (For example, can s390 ISM fit here?)

I don't know what is special about ISM. What TODO do you think will help
answer that question?

Jason

2021-01-25 23:35:40

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Mon, 25 Jan 2021 14:04:40 -0400
Jason Gunthorpe <[email protected]> wrote:

> On Mon, Jan 25, 2021 at 05:20:35PM +0100, Cornelia Huck wrote:
>
> > I think you cut out an important part of Alex' comment, so let me
> > repost it here:
>
> Yes, I've already respnded to this.

Not really. For example, how can struct vfio_pci_device be exposed
as-is to the various drivers that may be derived from vfio-pci-core?
As soon as such a driver starts touching those fields or performing
operations on its own without making proper use of those fields, then
the API is not the simple set of exported functions below; it's the
entire mechanics of every bit of data in that structure and the
structures it includes. If we're making a library then we need to
define public and private data. We're just tossing everything here
into the private header which is necessarily public to the derived
drivers if they're to have any means to operate on the device.

> > I'm missing the bigger picture of how this api is supposed to work out,
> > a driver with a lot of TODOs does not help much to figure out whether
> > this split makes sense and is potentially useful for a number of use
> > cases
>
> The change to vfio-pci is essentially complete, ignoring some
> additional cleanup. I'm not sure seeing details how some mlx5 driver
> uses it will be substantially more informative if it is useful for
> S390, or Intel.

We're supposed to be enlightened by a vendor driver that does nothing
more than pass the opaque device_data through to the core functions,
but in reality this is exactly the point of concern above. At a
minimum that vendor driver needs to look at the vdev to get the pdev,
but then what else does it look at, consume, or modify. Now we have
vendor drivers misusing the core because it's not clear which fields
are private and how public fields can be used safely, any core
extensions potentially break vendor drivers, etc. We're only even hand
waving that existing device specific support could be farmed out to new
device specific drivers without even going to the effort to prove that.

> As far as API it is clear to see:
>
> > +/* Exported functions */
> > +struct vfio_pci_device *vfio_create_pci_device(struct pci_dev *pdev,
> > + const struct vfio_device_ops *vfio_pci_ops, void *dd_data);

Nit, dd_data is never defined or used. The function returns a struct
vfio_pci_device, but I'm not sure whether that's supposed to be public
or private. IMO, it should be private and perhaps access functions
should be provided for fields we consider public.
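
For illustration, accessors of the kind being suggested might look like
the following two prototypes; they are hypothetical, not something the
RFC provides:

/* Hypothetical accessors so derived drivers never need the private header. */
struct pci_dev *vfio_pci_core_pdev(struct vfio_pci_device *vdev);
void *vfio_pci_core_dd_data(struct vfio_pci_device *vdev);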

> > +void vfio_destroy_pci_device(struct pci_dev *pdev);
>
> Called from probe/remove of the consuming driver
>
> > +int vfio_pci_core_open(void *device_data);
> > +void vfio_pci_core_release(void *device_data);
> > +long vfio_pci_core_ioctl(void *device_data, unsigned int cmd,
> > + unsigned long arg);
> > +ssize_t vfio_pci_core_read(void *device_data, char __user *buf, size_t count,
> > + loff_t *ppos);
> > +ssize_t vfio_pci_core_write(void *device_data, const char __user *buf,
> > + size_t count, loff_t *ppos);
> > +int vfio_pci_core_mmap(void *device_data, struct vm_area_struct *vma);
> > +void vfio_pci_core_request(void *device_data, unsigned int count);
> > +int vfio_pci_core_match(void *device_data, char *buf);
>
> Called from vfio_device_ops and has the existing well defined API of
> the matching ops.

All of these require using struct vfio_pci_device to do anything beyond
the stub implementation in 3/3.

> > +int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn);
> > +pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
> > + pci_channel_state_t state);
> > +
>
> Callbacks from the PCI core, API defined by the PCI subsystem.
>
> Notice there is no major new API exposed from vfio-pci, nor are
> vfio-pci internals any more exposed then they used to be.

They absolutely are. These are all private, non-exported functions and
data structures. Here the vfio_device_ops and all the internal state
are all laid bare.

> Except create/destroy, every single newly exported function was
> already available via a function pointer someplace else in the
> API. This is a key point, because it is very much NOT like the May
> series.

They were available to vfio-core, which honored that device_data was
private.

> Because the new driver sits before vfio_pci because it owns the
> vfio_device_ops, it introduces nearly nothing new. The May series put
> the new driver after vfio-pci as some internalized sub-driver and
> exposed a whole new API, wrappers and callbacks to go along with it.
>
> For instance if a new driver wants to implement some new thing under
> ioctl, like migration, then it would do
>
> static long new_driver_pci_core_ioctl(void *device_data, unsigned int cmd,
> unsigned long arg)
> {
> switch (cmd) {
> case NEW_THING: return new_driver_thing();

new_driver_thing relative to what? Now we need to decode the
vfio-pci-core private data and parse through it to access or affect the
device.

> default: return vfio_pci_core_ioctl(device_data, cmd, arg);
> }
> }
> static const struct vfio_device_ops new_driver_pci_ops = {
> [...]
> .ioctl = new_driver_ioctl,
> };
>
> Simple enough, if you understand the above, then you also understand
> what direction the mlx5 driver will go in.
>
> This is also why it is clearly useful for a wide range of cases, as a
> driver can use as much or as little of the vfio-pci-core ops as it
> wants. The driver doesn't get to reach into vfio-pci, but it can sit
> before it and intercept the entire uAPI. That is very powerful.

But vfio-pci-core is actually vfio-pci and there's no way for a vendor
derived driver to do much of anything without reaching into struct
vfio_pci_device, so yes, it is reaching into vfio-pci.

> > or whether mdev (even with its different lifecycle model) or a
>
> I think it is appropriate to think of mdev as only the special
> lifecycle model, it doesn't have any other functionality.
>
> mdev's lifecycle model does nothing to solve the need to libraryize
> vfio-pci.
>
> > different vfio bus driver might be a better fit for the more
>
> What is a "vfio bus driver" ? Why would we need a bus here?

./Documentation/driver-api/vfio.rst

We refer to drivers that bridge vfio-core to bus implementations, ex.
vfio-pci, as "vfio bus drivers". This is partially why we've so far
extended vfio-pci with device specific features, as a bus specific
driver rather than a device specific driver.

> > involved cases. (For example, can s390 ISM fit here?)
>
> Don't know what is special about ISM? What TODO do you think will help
> answer that question?

So far the TODOs rather mask the dirty little secrets of the extension
rather than showing how a vendor derived driver needs to root around in
struct vfio_pci_device to do something useful, so probably porting
actual device specific support rather than further hand waving would be
more helpful. Thanks,

Alex

2021-01-26 10:52:32

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Mon, Jan 25, 2021 at 04:31:51PM -0700, Alex Williamson wrote:

> We're supposed to be enlightened by a vendor driver that does nothing
> more than pass the opaque device_data through to the core functions,
> but in reality this is exactly the point of concern above. At a
> minimum that vendor driver needs to look at the vdev to get the
> pdev,

The end driver already has the pdev; the RFC doesn't go deep enough into
those bits, it is a good comment.

The dd_data passed to vfio_create_pci_device() will be retrieved
from the ops to get back to the end driver's data. This can cleanly
include everything: the VF pci_device, PF pci_device, mlx5_core
pointer, vfio_device and vfio_pci_device.

This is why the example passes in the mvadev:

+ vdev = vfio_create_pci_device(pdev, &mlx5_vfio_pci_ops, mvadev);

The mvadev has the PF, VF, and mlx5 core driver pointer.

Getting that back out during the ops is enough to do what the mlx5
driver needs to do, which is relay migration related IOCTLs to the PF
function via the mlx5_core driver so the device can execute them on
behalf of the VF.
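
A rough sketch of that flow is below. All mlx5_vfio_pci_* names, the
mlx5_vfio_pci_find_adev() lookup and the error-handling convention of
vfio_create_pci_device() are assumptions for illustration, not code
from the RFC (struct vfio_pci_device and the create/destroy calls come
from the RFC's proposed headers):

#include <linux/err.h>
#include <linux/pci.h>
#include <linux/auxiliary_bus.h>
#include <linux/mlx5/driver.h>          /* struct mlx5_core_dev */

/* Hypothetical container for everything the VFIO ops need later (dd_data). */
struct mlx5_vfio_pci_adev {
        struct auxiliary_device adev;   /* mlx5_core.vfio_pci.<id> */
        struct mlx5_core_dev *mdev;     /* PF side, relays migration commands */
        struct pci_dev *pf_pdev;        /* the PF */
        struct pci_dev *vf_pdev;        /* the VF being assigned */
};

extern const struct vfio_device_ops mlx5_vfio_pci_ops;  /* defined elsewhere */

/* hypothetical: look up the aux device the PF driver created for this VF */
struct mlx5_vfio_pci_adev *mlx5_vfio_pci_find_adev(struct pci_dev *pdev);

static int mlx5_vfio_pci_probe(struct pci_dev *pdev,
                               const struct pci_device_id *id)
{
        struct mlx5_vfio_pci_adev *mvadev;
        struct vfio_pci_device *vdev;

        mvadev = mlx5_vfio_pci_find_adev(pdev);
        if (!mvadev)
                return -EPROBE_DEFER;
        mvadev->vf_pdev = pdev;

        /* mvadev is the dd_data the ops get back to reach PF/VF/mlx5_core */
        vdev = vfio_create_pci_device(pdev, &mlx5_vfio_pci_ops, mvadev);
        if (IS_ERR(vdev))               /* error convention assumed */
                return PTR_ERR(vdev);
        return 0;
}

static void mlx5_vfio_pci_remove(struct pci_dev *pdev)
{
        vfio_destroy_pci_device(pdev);
}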

> but then what else does it look at, consume, or modify. Now we have
> vendor drivers misusing the core because it's not clear which fields
> are private and how public fields can be used safely,

The kernel has never followed rigid rules for data isolation, it is
normal to have whole private structs exposed in headers so that
container_of can be used to properly compose data structures.

Look at struct device, for instance. Most of that is private to the
driver core.
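
As a generic example of the composition pattern being referred to
(nothing vfio-specific; my_device and to_my_device are made-up names):

#include <linux/device.h>
#include <linux/kernel.h>       /* container_of() */
#include <linux/slab.h>

struct my_device {
        struct device dev;      /* fully visible in the public header */
        int my_private_state;
};

static inline struct my_device *to_my_device(struct device *dev)
{
        return container_of(dev, struct my_device, dev);
}

static void my_device_release(struct device *dev)
{
        kfree(to_my_device(dev));       /* recover the containing struct */
}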

A few 'private to vfio-pci-core' comments would help, it is good
feedback to make that more clear.

> extensions potentially break vendor drivers, etc. We're only even hand
> waving that existing device specific support could be farmed out to new
> device specific drivers without even going to the effort to prove that.

This is an RFC, not a complete patch series. The RFC is to get feedback
on the general design before everyone commits a lot of resources and
positions get dug in.

Do you really think the existing device specific support would be a
problem to lift? It already looks pretty clean with the
vfio_pci_regops, looks easy enough to lift to the parent.

> So far the TODOs rather mask the dirty little secrets of the
> extension rather than showing how a vendor derived driver needs to
> root around in struct vfio_pci_device to do something useful, so
> probably porting actual device specific support rather than further
> hand waving would be more helpful.

It would be helpful to get actual feedback on the high level design -
something like this was already tried in May and didn't go anywhere -
are you surprised that we are reluctant to commit a lot of resources
doing a complete job just to have it go nowhere again?

Thanks,
Jason

2021-01-26 11:01:28

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Mon, 25 Jan 2021 20:45:22 -0400
Jason Gunthorpe <[email protected]> wrote:

> On Mon, Jan 25, 2021 at 04:31:51PM -0700, Alex Williamson wrote:
>
> > We're supposed to be enlightened by a vendor driver that does nothing
> > more than pass the opaque device_data through to the core functions,
> > but in reality this is exactly the point of concern above. At a
> > minimum that vendor driver needs to look at the vdev to get the
> > pdev,
>
> The end driver already havs the pdev, the RFC doesn't go enough into
> those bits, it is a good comment.
>
> The dd_data pased to the vfio_create_pci_device() will be retrieved
> from the ops to get back to the end drivers data. This can cleanly
> include everything: the VF pci_device, PF pci_device, mlx5_core
> pointer, vfio_device and vfio_pci_device.
>
> This is why the example passes in the mvadev:
>
> + vdev = vfio_create_pci_device(pdev, &mlx5_vfio_pci_ops, mvadev);
>
> The mvadev has the PF, VF, and mlx5 core driver pointer.
>
> Getting that back out during the ops is enough to do what the mlx5
> driver needs to do, which is relay migration related IOCTLs to the PF
> function via the mlx5_core driver so the device can execute them on
> behalf of the VF.
>
> > but then what else does it look at, consume, or modify. Now we have
> > vendor drivers misusing the core because it's not clear which fields
> > are private and how public fields can be used safely,
>
> The kernel has never followed rigid rules for data isolation, it is
> normal to have whole private structs exposed in headers so that
> container_of can be used to properly compose data structures.

I reject this assertion, there are plenty of structs that clearly
indicate private vs public data or, as we've done in mdev, clearly
marking the private data in a "private" header and provide access
functions for public fields. Including a "private" header to make use
of a "library" is just wrong. In the example above, there's no way for
the mlx vendor driver to get back dd_data without extracting it from
struct vfio_pci_device itself.

> Look at struct device, for instance. Most of that is private to the
> driver core.
>
> A few 'private to vfio-pci-core' comments would help, it is good
> feedback to make that more clear.
>
> > extensions potentially break vendor drivers, etc. We're only even hand
> > waving that existing device specific support could be farmed out to new
> > device specific drivers without even going to the effort to prove that.
>
> This is a RFC, not a complete patch series. The RFC is to get feedback
> on the general design before everyone comits alot of resources and
> positions get dug in.
>
> Do you really think the existing device specific support would be a
> problem to lift? It already looks pretty clean with the
> vfio_pci_regops, looks easy enough to lift to the parent.
>
> > So far the TODOs rather mask the dirty little secrets of the
> > extension rather than showing how a vendor derived driver needs to
> > root around in struct vfio_pci_device to do something useful, so
> > probably porting actual device specific support rather than further
> > hand waving would be more helpful.
>
> It would be helpful to get actual feedback on the high level design -
> someting like this was already tried in May and didn't go anywhere -
> are you surprised that we are reluctant to commit alot of resources
> doing a complete job just to have it go nowhere again?

That's not really what I'm getting from your feedback, indicating
vfio-pci is essentially done, the mlx stub driver should be enough to
see the direction, and additional concerns can be handled with TODO
comments. Sorry if this is not construed as actual feedback, I think
both Connie and I are making an effort to understand this and being
hampered by lack of a clear api or a vendor driver that's anything more
than vfio-pci plus an aux bus interface. Thanks,

Alex

2021-01-26 18:54:55

by Max Gurtovoy

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

Hi Alex, Cornelia and Jason,

thanks for reviewing this.

On 1/26/2021 5:34 AM, Alex Williamson wrote:
> On Mon, 25 Jan 2021 20:45:22 -0400
> Jason Gunthorpe <[email protected]> wrote:
>
>> On Mon, Jan 25, 2021 at 04:31:51PM -0700, Alex Williamson wrote:
>>
>>> We're supposed to be enlightened by a vendor driver that does nothing
>>> more than pass the opaque device_data through to the core functions,
>>> but in reality this is exactly the point of concern above. At a
>>> minimum that vendor driver needs to look at the vdev to get the
>>> pdev,
>> The end driver already havs the pdev, the RFC doesn't go enough into
>> those bits, it is a good comment.
>>
>> The dd_data pased to the vfio_create_pci_device() will be retrieved
>> from the ops to get back to the end drivers data. This can cleanly
>> include everything: the VF pci_device, PF pci_device, mlx5_core
>> pointer, vfio_device and vfio_pci_device.
>>
>> This is why the example passes in the mvadev:
>>
>> + vdev = vfio_create_pci_device(pdev, &mlx5_vfio_pci_ops, mvadev);
>>
>> The mvadev has the PF, VF, and mlx5 core driver pointer.
>>
>> Getting that back out during the ops is enough to do what the mlx5
>> driver needs to do, which is relay migration related IOCTLs to the PF
>> function via the mlx5_core driver so the device can execute them on
>> behalf of the VF.
>>
>>> but then what else does it look at, consume, or modify. Now we have
>>> vendor drivers misusing the core because it's not clear which fields
>>> are private and how public fields can be used safely,
>> The kernel has never followed rigid rules for data isolation, it is
>> normal to have whole private structs exposed in headers so that
>> container_of can be used to properly compose data structures.
> I reject this assertion, there are plenty of structs that clearly
> indicate private vs public data or, as we've done in mdev, clearly
> marking the private data in a "private" header and provide access
> functions for public fields. Including a "private" header to make use
> of a "library" is just wrong. In the example above, there's no way for
> the mlx vendor driver to get back dd_data without extracting it from
> struct vfio_pci_device itself.

I'll create a better separation between private/public fields according
to my understanding for the V2.

I'll just mention that beyond this separation, future improvements will
be needed and can be done incrementally.

I don't think that we should do so many changes in one shot. The
incremental process is safer from a subsystem point of view.

I also think that upstreaming mlx5_vfio_pci.ko and upstreaming the
vfio-pci separation into 2 modules don't have to happen in one shot.

But again, to make our point in this RFC, I'll improve it for V2.

>
>> Look at struct device, for instance. Most of that is private to the
>> driver core.
>>
>> A few 'private to vfio-pci-core' comments would help, it is good
>> feedback to make that more clear.
>>
>>> extensions potentially break vendor drivers, etc. We're only even hand
>>> waving that existing device specific support could be farmed out to new
>>> device specific drivers without even going to the effort to prove that.
>> This is a RFC, not a complete patch series. The RFC is to get feedback
>> on the general design before everyone comits alot of resources and
>> positions get dug in.
>>
>> Do you really think the existing device specific support would be a
>> problem to lift? It already looks pretty clean with the
>> vfio_pci_regops, looks easy enough to lift to the parent.
>>
>>> So far the TODOs rather mask the dirty little secrets of the
>>> extension rather than showing how a vendor derived driver needs to
>>> root around in struct vfio_pci_device to do something useful, so
>>> probably porting actual device specific support rather than further
>>> hand waving would be more helpful.
>> It would be helpful to get actual feedback on the high level design -
>> someting like this was already tried in May and didn't go anywhere -
>> are you surprised that we are reluctant to commit alot of resources
>> doing a complete job just to have it go nowhere again?
> That's not really what I'm getting from your feedback, indicating
> vfio-pci is essentially done, the mlx stub driver should be enough to
> see the direction, and additional concerns can be handled with TODO
> comments. Sorry if this is not construed as actual feedback, I think
> both Connie and I are making an effort to understand this and being
> hampered by lack of a clear api or a vendor driver that's anything more
> than vfio-pci plus an aux bus interface. Thanks,

I think I got the main idea and I'll try to summarize it:

The separation into vfio-pci.ko and vfio-pci-core.ko is acceptable, and we
do need it to be able to create a vendor-vfio-pci.ko driver in the future
to include vendor special sauce inside.

The separation implementation, the question of what is private and
what is public, and the core APIs to the various drivers should be
improved or better demonstrated in the V2.

I'll work on improving it and I'll send the V2.


If you have some feedback of the things/fields/structs you think should
remain private to vfio-pci-core please let us know.

Thanks for the effort in the review,

-Max.

> Alex
>

2021-01-27 00:41:11

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Mon, Jan 25, 2021 at 08:34:29PM -0700, Alex Williamson wrote:

> > someting like this was already tried in May and didn't go anywhere -
> > are you surprised that we are reluctant to commit alot of resources
> > doing a complete job just to have it go nowhere again?
>
> That's not really what I'm getting from your feedback, indicating
> vfio-pci is essentially done, the mlx stub driver should be enough to
> see the direction, and additional concerns can be handled with TODO
> comments.

I think we are looking at this RFC in different ways. I see it as
largely "done" showing the general design of few big ideas:

- new vfio drivers will be creating treating VFIO PCI as a "VFIO bus
driver" library
- These new drivers are PCI devices bound via driver core as peers to
vfio-pci, vs sub drivers of vfio-pci
- It uses the subsystem -> driver -> library pattern for composing drivers
instead of the subsystem -> midlayer -> driver pattern mdev/platform use
- It will have some driver facing API from vfio-pci-core that is
close to what is shown in the RFC
- The drivers can "double bind" in the driver core to access the PF
resources via aux devices from the VF VFIO driver.

The point of a RFC discussion is to try to come to some community
understanding on a general high level direction.

It is not a perfectly polished illustration of things that shouldn't
be contentious or technically hard. There are a lot of things that can
be polished here; this illustration has lots of stuff in vfio-pci-core
that really should be in vfio-pci - it will take time and effort to
properly split things up and do a great job here.

> Sorry if this is not construed as actual feedback, I think both
> Connie and I are making an effort to understand this and being
> hampered by lack of a clear api or a vendor driver that's anything
> more than vfio-pci plus an aux bus interface. Thanks,

I appreciate the effort, and there is a lot to understand here. Most
of this stuff is very new technology and not backed by industry
standards bodies.

I really do think this simplified RFC will help the process - I've
seen the internal prototype and it is a mass of opaque device specific
code. Max's V2 should flesh things out more.

Thanks,
Jason

2021-01-28 16:34:57

by Cornelia Huck

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Tue, 26 Jan 2021 15:27:43 +0200
Max Gurtovoy <[email protected]> wrote:

> Hi Alex, Cornelia and Jason,
>
> thanks for the reviewing this.
>
> On 1/26/2021 5:34 AM, Alex Williamson wrote:
> > On Mon, 25 Jan 2021 20:45:22 -0400
> > Jason Gunthorpe <[email protected]> wrote:
> >
> >> On Mon, Jan 25, 2021 at 04:31:51PM -0700, Alex Williamson wrote:
> >>
> >>> We're supposed to be enlightened by a vendor driver that does nothing
> >>> more than pass the opaque device_data through to the core functions,
> >>> but in reality this is exactly the point of concern above. At a
> >>> minimum that vendor driver needs to look at the vdev to get the
> >>> pdev,
> >> The end driver already havs the pdev, the RFC doesn't go enough into
> >> those bits, it is a good comment.
> >>
> >> The dd_data pased to the vfio_create_pci_device() will be retrieved
> >> from the ops to get back to the end drivers data. This can cleanly
> >> include everything: the VF pci_device, PF pci_device, mlx5_core
> >> pointer, vfio_device and vfio_pci_device.
> >>
> >> This is why the example passes in the mvadev:
> >>
> >> + vdev = vfio_create_pci_device(pdev, &mlx5_vfio_pci_ops, mvadev);
> >>
> >> The mvadev has the PF, VF, and mlx5 core driver pointer.
> >>
> >> Getting that back out during the ops is enough to do what the mlx5
> >> driver needs to do, which is relay migration related IOCTLs to the PF
> >> function via the mlx5_core driver so the device can execute them on
> >> behalf of the VF.
> >>
> >>> but then what else does it look at, consume, or modify. Now we have
> >>> vendor drivers misusing the core because it's not clear which fields
> >>> are private and how public fields can be used safely,
> >> The kernel has never followed rigid rules for data isolation, it is
> >> normal to have whole private structs exposed in headers so that
> >> container_of can be used to properly compose data structures.
> > I reject this assertion, there are plenty of structs that clearly
> > indicate private vs public data or, as we've done in mdev, clearly
> > marking the private data in a "private" header and provide access
> > functions for public fields. Including a "private" header to make use
> > of a "library" is just wrong. In the example above, there's no way for
> > the mlx vendor driver to get back dd_data without extracting it from
> > struct vfio_pci_device itself.
>
> I'll create a better separation between private/public fields according
> to my understanding for the V2.
>
> I'll just mention that beyond this separation, future improvements will
> be needed and can be done incrementally.
>
> I don't think that we should do so many changes at one shut. The
> incremental process is safer from subsystem point of view.
>
> I also think that upstreaming mlx5_vfio_pci.ko and upstreaming vfio-pci
> separation into 2 modules doesn't have to happen in one-shut.

The design can probably benefit from tossing a non-mlx5 driver into the
mix.

So, let me suggest the zdev support for that experiment (see
e6b817d4b8217a9528fcfd59719b924ab8a5ff23 and friends). It is quite
straightforward: it injects some capabilities into the info ioctl. A
specialized driver would basically only need to provide special
handling for the ioctl callback and just forward anything else. It also
would not need any matching for device ids etc., as it would only make
sense on s390x, but would apply regardless of the device. It could, however, help
us to get an idea what a good split would look like.
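
A very rough sketch of what that could look like on top of the helpers
proposed in this RFC; vfio_zpci_get_info() is a hypothetical helper that
would build the usual info response plus the zPCI capability chain, and
every other vfio_device_ops entry would point straight at the
vfio_pci_core_* functions, as in the forwarding table shown earlier:

static long vfio_zpci_ioctl(void *device_data, unsigned int cmd,
                            unsigned long arg)
{
        switch (cmd) {
        case VFIO_DEVICE_GET_INFO:
                /* usual info plus the injected zPCI (CLP) capabilities */
                return vfio_zpci_get_info(device_data, arg);
        default:
                /* plain vfio-pci behaviour for everything else */
                return vfio_pci_core_ioctl(device_data, cmd, arg);
        }
}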

>
> But again, to make our point in this RFC, I'll improve it for V2.
>
> >
> >> Look at struct device, for instance. Most of that is private to the
> >> driver core.
> >>
> >> A few 'private to vfio-pci-core' comments would help, it is good
> >> feedback to make that more clear.
> >>
> >>> extensions potentially break vendor drivers, etc. We're only even hand
> >>> waving that existing device specific support could be farmed out to new
> >>> device specific drivers without even going to the effort to prove that.
> >> This is a RFC, not a complete patch series. The RFC is to get feedback
> >> on the general design before everyone comits alot of resources and
> >> positions get dug in.
> >>
> >> Do you really think the existing device specific support would be a
> >> problem to lift? It already looks pretty clean with the
> >> vfio_pci_regops, looks easy enough to lift to the parent.
> >>
> >>> So far the TODOs rather mask the dirty little secrets of the
> >>> extension rather than showing how a vendor derived driver needs to
> >>> root around in struct vfio_pci_device to do something useful, so
> >>> probably porting actual device specific support rather than further
> >>> hand waving would be more helpful.
> >> It would be helpful to get actual feedback on the high level design -
> >> someting like this was already tried in May and didn't go anywhere -
> >> are you surprised that we are reluctant to commit alot of resources
> >> doing a complete job just to have it go nowhere again?
> > That's not really what I'm getting from your feedback, indicating
> > vfio-pci is essentially done, the mlx stub driver should be enough to
> > see the direction, and additional concerns can be handled with TODO
> > comments. Sorry if this is not construed as actual feedback, I think
> > both Connie and I are making an effort to understand this and being
> > hampered by lack of a clear api or a vendor driver that's anything more
> > than vfio-pci plus an aux bus interface. Thanks,
>
> I think I got the main idea and I'll try to summarize it:
>
> The separation to vfio-pci.ko and vfio-pci-core.ko is acceptable, and we
> do need it to be able to create vendor-vfio-pci.ko driver in the future
> to include vendor special souse inside.

One other thing I'd like to bring up: What needs to be done in
userspace? Does a userspace driver like QEMU need changes to actually
exploit this? Does management software like libvirt need to be involved
in decision making, or does it just need to provide the knobs to make
the driver configurable?

>
> The separation implementation and the question of what is private and
> what is public, and the core APIs to the various drivers should be
> improved or better demonstrated in the V2.
>
> I'll work on improving it and I'll send the V2.
>
>
> If you have some feedback of the things/fields/structs you think should
> remain private to vfio-pci-core please let us know.
>
> Thanks for the effort in the review,
>
> -Max.
>
> > Alex
> >
>

2021-01-28 21:06:49

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Thu, 28 Jan 2021 17:29:30 +0100
Cornelia Huck <[email protected]> wrote:

> On Tue, 26 Jan 2021 15:27:43 +0200
> Max Gurtovoy <[email protected]> wrote:
> > On 1/26/2021 5:34 AM, Alex Williamson wrote:
> > > On Mon, 25 Jan 2021 20:45:22 -0400
> > > Jason Gunthorpe <[email protected]> wrote:
> > >
> > >> On Mon, Jan 25, 2021 at 04:31:51PM -0700, Alex Williamson wrote:
> > >>> extensions potentially break vendor drivers, etc. We're only even hand
> > >>> waving that existing device specific support could be farmed out to new
> > >>> device specific drivers without even going to the effort to prove that.
> > >> This is a RFC, not a complete patch series. The RFC is to get feedback
> > >> on the general design before everyone comits alot of resources and
> > >> positions get dug in.
> > >>
> > >> Do you really think the existing device specific support would be a
> > >> problem to lift? It already looks pretty clean with the
> > >> vfio_pci_regops, looks easy enough to lift to the parent.
> > >>
> > >>> So far the TODOs rather mask the dirty little secrets of the
> > >>> extension rather than showing how a vendor derived driver needs to
> > >>> root around in struct vfio_pci_device to do something useful, so
> > >>> probably porting actual device specific support rather than further
> > >>> hand waving would be more helpful.
> > >> It would be helpful to get actual feedback on the high level design -
> > >> someting like this was already tried in May and didn't go anywhere -
> > >> are you surprised that we are reluctant to commit alot of resources
> > >> doing a complete job just to have it go nowhere again?
> > > That's not really what I'm getting from your feedback, indicating
> > > vfio-pci is essentially done, the mlx stub driver should be enough to
> > > see the direction, and additional concerns can be handled with TODO
> > > comments. Sorry if this is not construed as actual feedback, I think
> > > both Connie and I are making an effort to understand this and being
> > > hampered by lack of a clear api or a vendor driver that's anything more
> > > than vfio-pci plus an aux bus interface. Thanks,
> >
> > I think I got the main idea and I'll try to summarize it:
> >
> > The separation to vfio-pci.ko and vfio-pci-core.ko is acceptable, and we
> > do need it to be able to create vendor-vfio-pci.ko driver in the future
> > to include vendor special souse inside.
>
> One other thing I'd like to bring up: What needs to be done in
> userspace? Does a userspace driver like QEMU need changes to actually
> exploit this? Does management software like libvirt need to be involved
> in decision making, or does it just need to provide the knobs to make
> the driver configurable?

I'm still pretty nervous about the userspace aspect of this as well.
QEMU and other actual vfio drivers are probably the least affected;
at least for QEMU, it'll happily open any device that has a pointer to
an IOMMU group that's reflected as a vfio group device. Tools like
libvirt, on the other hand, actually do driver binding and we need to
consider how they make driver decisions. Jason suggested that the
vfio-pci driver ought to be only spec compliant behavior, which sounds
like some deprecation process of splitting out the IGD, NVLink, zpci,
etc. features into sub-drivers and eventually removing that device
specific support from vfio-pci. Would we expect libvirt to know, "this
is an 8086 graphics device, try to bind it to vfio-pci-igd" or "uname
-m says we're running on s390, try to bind it to vfio-zpci"? Maybe we
expect derived drivers to only bind to devices they recognize, so
libvirt could blindly try a whole chain of drivers, ending in vfio-pci.
Obviously if we have competing drivers that support the same device in
different ways, that quickly falls apart.

Libvirt could also expand its available driver models for the user to
specify a variant, I'd support that for overriding a choice that libvirt
might make otherwise, but forcing the user to know this information is
just passing the buck.

Some derived drivers could probably actually include device IDs rather
than only relying on dynamic ids, but then we get into the problem that
we're competing with native host driver for a device. The aux bus
example here is essentially the least troublesome variation since it
works in conjunction with the native host driver rather than replacing
it. Thanks,

Alex

2021-01-31 20:15:50

by Max Gurtovoy

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem


On 1/28/2021 6:29 PM, Cornelia Huck wrote:
> On Tue, 26 Jan 2021 15:27:43 +0200
> Max Gurtovoy <[email protected]> wrote:
>
>> Hi Alex, Cornelia and Jason,
>>
>> thanks for reviewing this.
>>
>> On 1/26/2021 5:34 AM, Alex Williamson wrote:
>>> On Mon, 25 Jan 2021 20:45:22 -0400
>>> Jason Gunthorpe <[email protected]> wrote:
>>>
>>>> On Mon, Jan 25, 2021 at 04:31:51PM -0700, Alex Williamson wrote:
>>>>
>>>>> We're supposed to be enlightened by a vendor driver that does nothing
>>>>> more than pass the opaque device_data through to the core functions,
>>>>> but in reality this is exactly the point of concern above. At a
>>>>> minimum that vendor driver needs to look at the vdev to get the
>>>>> pdev,
>>>> The end driver already has the pdev; the RFC doesn't go into those
>>>> bits enough, it is a good comment.
>>>>
>>>> The dd_data passed to the vfio_create_pci_device() will be retrieved
>>>> from the ops to get back to the end drivers data. This can cleanly
>>>> include everything: the VF pci_device, PF pci_device, mlx5_core
>>>> pointer, vfio_device and vfio_pci_device.
>>>>
>>>> This is why the example passes in the mvadev:
>>>>
>>>> + vdev = vfio_create_pci_device(pdev, &mlx5_vfio_pci_ops, mvadev);
>>>>
>>>> The mvadev has the PF, VF, and mlx5 core driver pointer.
>>>>
>>>> Getting that back out during the ops is enough to do what the mlx5
>>>> driver needs to do, which is relay migration related IOCTLs to the PF
>>>> function via the mlx5_core driver so the device can execute them on
>>>> behalf of the VF.
>>>>
>>>>> but then what else does it look at, consume, or modify. Now we have
>>>>> vendor drivers misusing the core because it's not clear which fields
>>>>> are private and how public fields can be used safely,
>>>> The kernel has never followed rigid rules for data isolation, it is
>>>> normal to have whole private structs exposed in headers so that
>>>> container_of can be used to properly compose data structures.
>>> I reject this assertion, there are plenty of structs that clearly
>>> indicate private vs public data or, as we've done in mdev, clearly
>>> marking the private data in a "private" header and provide access
>>> functions for public fields. Including a "private" header to make use
>>> of a "library" is just wrong. In the example above, there's no way for
>>> the mlx vendor driver to get back dd_data without extracting it from
>>> struct vfio_pci_device itself.
>> I'll create a better separation between private/public fields according
>> to my understanding for the V2.
>>
>> I'll just mention that beyond this separation, future improvements will
>> be needed and can be done incrementally.
>>
>> I don't think that we should do so many changes in one shot. The
>> incremental process is safer from a subsystem point of view.
>>
>> I also think that upstreaming mlx5_vfio_pci.ko and upstreaming the
>> vfio-pci separation into 2 modules don't have to happen in one shot.
> The design can probably benefit from tossing a non-mlx5 driver into the
> mix.
>
> So, let me suggest the zdev support for that experiment (see
> e6b817d4b8217a9528fcfd59719b924ab8a5ff23 and friends.) It is quite
> straightforward: it injects some capabilities into the info ioctl. A
> specialized driver would basically only need to provide special
> handling for the ioctl callback and just forward anything else. It also
> would not need any matching for device ids etc., as it would only make
> sense on s390x, but regardless of the device. It could, however, help
> us to get an idea what a good split would look like.
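
(As an illustration only: a minimal sketch of the forwarding driver suggested
above, assuming the split exports core implementations of the vfio_device_ops
callbacks. The vfio_pci_core_* and zdev_vfio_pci_get_info names below are
placeholders, not code from this RFC.)

static long zdev_vfio_pci_ioctl(void *device_data, unsigned int cmd,
                                unsigned long arg)
{
        /* special-case only VFIO_DEVICE_GET_INFO to inject the zdev capabilities */
        if (cmd == VFIO_DEVICE_GET_INFO)
                return zdev_vfio_pci_get_info(device_data, arg);

        /* forward everything else unchanged to vfio-pci-core */
        return vfio_pci_core_ioctl(device_data, cmd, arg);
}

static const struct vfio_device_ops zdev_vfio_pci_ops = {
        .name    = "vfio-pci-zdev",
        .open    = vfio_pci_core_open,
        .release = vfio_pci_core_release,
        .ioctl   = zdev_vfio_pci_ioctl,
        .read    = vfio_pci_core_read,
        .write   = vfio_pci_core_write,
        .mmap    = vfio_pci_core_mmap,
        .request = vfio_pci_core_request,
};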

AFAIU, s390x is related to the IBM architecture and not to a specific PCI
device. So I guess it should stay in the core, as many PCI devices will
need these capabilities on IBM systems.

I think I'll use NVLINK2 P9 stuff as an example of the split and add it
to vfio_pci.ko instead of vfio_pci_core.ko as a first step.

Later, we can create a dedicated module for these devices (V100 GPUs).

>> But again, to make our point in this RFC, I'll improve it for V2.
>>
>>>
>>>> Look at struct device, for instance. Most of that is private to the
>>>> driver core.
>>>>
>>>> A few 'private to vfio-pci-core' comments would help, it is good
>>>> feedback to make that more clear.
>>>>
>>>>> extensions potentially break vendor drivers, etc. We're only even hand
>>>>> waving that existing device specific support could be farmed out to new
>>>>> device specific drivers without even going to the effort to prove that.
>>>> This is an RFC, not a complete patch series. The RFC is to get feedback
>>>> on the general design before everyone commits a lot of resources and
>>>> positions get dug in.
>>>>
>>>> Do you really think the existing device specific support would be a
>>>> problem to lift? It already looks pretty clean with the
>>>> vfio_pci_regops, looks easy enough to lift to the parent.
>>>>
>>>>> So far the TODOs rather mask the dirty little secrets of the
>>>>> extension rather than showing how a vendor derived driver needs to
>>>>> root around in struct vfio_pci_device to do something useful, so
>>>>> probably porting actual device specific support rather than further
>>>>> hand waving would be more helpful.
>>>> It would be helpful to get actual feedback on the high level design -
>>>> something like this was already tried in May and didn't go anywhere -
>>>> are you surprised that we are reluctant to commit a lot of resources
>>>> doing a complete job just to have it go nowhere again?
>>> That's not really what I'm getting from your feedback, indicating
>>> vfio-pci is essentially done, the mlx stub driver should be enough to
>>> see the direction, and additional concerns can be handled with TODO
>>> comments. Sorry if this is not construed as actual feedback, I think
>>> both Connie and I are making an effort to understand this and being
>>> hampered by lack of a clear api or a vendor driver that's anything more
>>> than vfio-pci plus an aux bus interface. Thanks,
>> I think I got the main idea and I'll try to summarize it:
>>
>> The separation into vfio-pci.ko and vfio-pci-core.ko is acceptable, and we
>> do need it to be able to create a vendor-vfio-pci.ko driver in the future
>> to include vendor special sauce inside.
> One other thing I'd like to bring up: What needs to be done in
> userspace? Does a userspace driver like QEMU need changes to actually
> exploit this? Does management software like libvirt need to be involved
> in decision making, or does it just need to provide the knobs to make
> the driver configurable?
>
>> The separation implementation, the question of what is private and
>> what is public, and the core APIs exposed to the various drivers should
>> be improved or better demonstrated in V2.
>>
>> I'll work on improving it and I'll send the V2.
>>
>>
>> If you have some feedback on the things/fields/structs you think should
>> remain private to vfio-pci-core please let us know.
>>
>> Thanks for the effort in the review,
>>
>> -Max.
>>
>>> Alex
>>>

2021-01-31 21:36:33

by Max Gurtovoy

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem


On 1/28/2021 11:02 PM, Alex Williamson wrote:
> On Thu, 28 Jan 2021 17:29:30 +0100
> Cornelia Huck <[email protected]> wrote:
>
>> On Tue, 26 Jan 2021 15:27:43 +0200
>> Max Gurtovoy <[email protected]> wrote:
>>> On 1/26/2021 5:34 AM, Alex Williamson wrote:
>>>> On Mon, 25 Jan 2021 20:45:22 -0400
>>>> Jason Gunthorpe <[email protected]> wrote:
>>>>
>>>>> On Mon, Jan 25, 2021 at 04:31:51PM -0700, Alex Williamson wrote:
>>>>>> extensions potentially break vendor drivers, etc. We're only even hand
>>>>>> waving that existing device specific support could be farmed out to new
>>>>>> device specific drivers without even going to the effort to prove that.
>>>>> This is an RFC, not a complete patch series. The RFC is to get feedback
>>>>> on the general design before everyone commits a lot of resources and
>>>>> positions get dug in.
>>>>>
>>>>> Do you really think the existing device specific support would be a
>>>>> problem to lift? It already looks pretty clean with the
>>>>> vfio_pci_regops, looks easy enough to lift to the parent.
>>>>>
>>>>>> So far the TODOs rather mask the dirty little secrets of the
>>>>>> extension rather than showing how a vendor derived driver needs to
>>>>>> root around in struct vfio_pci_device to do something useful, so
>>>>>> probably porting actual device specific support rather than further
>>>>>> hand waving would be more helpful.
>>>>> It would be helpful to get actual feedback on the high level design -
>>>>> something like this was already tried in May and didn't go anywhere -
>>>>> are you surprised that we are reluctant to commit a lot of resources
>>>>> doing a complete job just to have it go nowhere again?
>>>> That's not really what I'm getting from your feedback, indicating
>>>> vfio-pci is essentially done, the mlx stub driver should be enough to
>>>> see the direction, and additional concerns can be handled with TODO
>>>> comments. Sorry if this is not construed as actual feedback, I think
>>>> both Connie and I are making an effort to understand this and being
>>>> hampered by lack of a clear api or a vendor driver that's anything more
>>>> than vfio-pci plus an aux bus interface. Thanks,
>>> I think I got the main idea and I'll try to summarize it:
>>>
>>> The separation into vfio-pci.ko and vfio-pci-core.ko is acceptable, and we
>>> do need it to be able to create a vendor-vfio-pci.ko driver in the future
>>> to include vendor special sauce inside.
>> One other thing I'd like to bring up: What needs to be done in
>> userspace? Does a userspace driver like QEMU need changes to actually
>> exploit this? Does management software like libvirt need to be involved
>> in decision making, or does it just need to provide the knobs to make
>> the driver configurable?
> I'm still pretty nervous about the userspace aspect of this as well.
> QEMU and other actual vfio drivers are probably the least affected,
> at least for QEMU, it'll happily open any device that has a pointer to
> an IOMMU group that's reflected as a vfio group device. Tools like
> libvirt, on the other hand, actually do driver binding and we need to
> consider how they make driver decisions. Jason suggested that the
> vfio-pci driver ought to be only spec compliant behavior, which sounds
> like some deprecation process of splitting out the IGD, NVLink, zpci,
> etc. features into sub-drivers and eventually removing that device
> specific support from vfio-pci. Would we expect libvirt to know, "this
> is an 8086 graphics device, try to bind it to vfio-pci-igd" or "uname
> -m says we're running on s390, try to bind it to vfio-zpci"? Maybe we
> expect derived drivers to only bind to devices they recognize, so
> libvirt could blindly try a whole chain of drivers, ending in vfio-pci.
> Obviously if we have competing drivers that support the same device in
> different ways, that quickly falls apart.

I think we can leave common arch-specific stuff, such as s390 (IIUC), in
the core driver, and only create vfio_pci drivers for
vendor/device/subvendor specific stuff.

Also, the competing drivers issue can happen today, right? After adding
a new_id to vfio_pci, I don't know how Linux will behave if we plug a
new device with the same ID into the system. Which driver will probe it?

I'm not really afraid of competing drivers, since we can ask vendor
vfio pci_drivers to add vendor_id, device_id, subsystem_vendor and
subsystem_device so we won't have this problem. I don't think there
will be 2 drivers that drive the same device with all 4 of these IDs.
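
For illustration, such a vendor driver could match on all four IDs with
PCI_DEVICE_SUB, so it only ever binds to the exact parts it knows about
(the IDs below are placeholders, not real hardware):

static const struct pci_device_id vendor_vfio_pci_table[] = {
        /* vendor, device, subsystem vendor, subsystem device */
        { PCI_DEVICE_SUB(0x1234, 0x5678, 0x1234, 0x0001) },
        { 0, }
};
MODULE_DEVICE_TABLE(pci, vendor_vfio_pci_table);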

A userspace tool can have a map of IDs to drivers and bind the device to
the right vfio-pci vendor driver if it has one; if not, bind to vfio_pci.ko.

>
> Libvirt could also expand its available driver models for the user to
> specify a variant, I'd support that for overriding a choice that libvirt
> might make otherwise, but forcing the user to know this information is
> just passing the buck.

We can add code to libvirt as mentioned above.

>
> Some derived drivers could probably actually include device IDs rather
> than only relying on dynamic ids, but then we get into the problem that
> we're competing with native host driver for a device. The aux bus
> example here is essentially the least troublesome variation since it
> works in conjunction with the native host driver rather than replacing
> it. Thanks,

The same competition exists after we add a new_id to vfio_pci, right?

>
> Alex

A pointer to the needed additions to libvirt would be awesome (or any other hint).

I'll send the V2 soon and then move to libvirt.

-Max.

2021-02-01 04:39:05

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Sun, 31 Jan 2021 20:46:40 +0200
Max Gurtovoy <[email protected]> wrote:

> On 1/28/2021 11:02 PM, Alex Williamson wrote:
> > On Thu, 28 Jan 2021 17:29:30 +0100
> > Cornelia Huck <[email protected]> wrote:
> >
> >> On Tue, 26 Jan 2021 15:27:43 +0200
> >> Max Gurtovoy <[email protected]> wrote:
> >>> On 1/26/2021 5:34 AM, Alex Williamson wrote:
> >>>> On Mon, 25 Jan 2021 20:45:22 -0400
> >>>> Jason Gunthorpe <[email protected]> wrote:
> >>>>
> >>>>> On Mon, Jan 25, 2021 at 04:31:51PM -0700, Alex Williamson wrote:
> >>>>>> extensions potentially break vendor drivers, etc. We're only even hand
> >>>>>> waving that existing device specific support could be farmed out to new
> >>>>>> device specific drivers without even going to the effort to prove that.
> >>>>> This is an RFC, not a complete patch series. The RFC is to get feedback
> >>>>> on the general design before everyone commits a lot of resources and
> >>>>> positions get dug in.
> >>>>>
> >>>>> Do you really think the existing device specific support would be a
> >>>>> problem to lift? It already looks pretty clean with the
> >>>>> vfio_pci_regops, looks easy enough to lift to the parent.
> >>>>>
> >>>>>> So far the TODOs rather mask the dirty little secrets of the
> >>>>>> extension rather than showing how a vendor derived driver needs to
> >>>>>> root around in struct vfio_pci_device to do something useful, so
> >>>>>> probably porting actual device specific support rather than further
> >>>>>> hand waving would be more helpful.
> >>>>> It would be helpful to get actual feedback on the high level design -
> >>>>> something like this was already tried in May and didn't go anywhere -
> >>>>> are you surprised that we are reluctant to commit a lot of resources
> >>>>> doing a complete job just to have it go nowhere again?
> >>>> That's not really what I'm getting from your feedback, indicating
> >>>> vfio-pci is essentially done, the mlx stub driver should be enough to
> >>>> see the direction, and additional concerns can be handled with TODO
> >>>> comments. Sorry if this is not construed as actual feedback, I think
> >>>> both Connie and I are making an effort to understand this and being
> >>>> hampered by lack of a clear api or a vendor driver that's anything more
> >>>> than vfio-pci plus an aux bus interface. Thanks,
> >>> I think I got the main idea and I'll try to summarize it:
> >>>
> >>> The separation into vfio-pci.ko and vfio-pci-core.ko is acceptable, and we
> >>> do need it to be able to create a vendor-vfio-pci.ko driver in the future
> >>> to include vendor special sauce inside.
> >> One other thing I'd like to bring up: What needs to be done in
> >> userspace? Does a userspace driver like QEMU need changes to actually
> >> exploit this? Does management software like libvirt need to be involved
> >> in decision making, or does it just need to provide the knobs to make
> >> the driver configurable?
> > I'm still pretty nervous about the userspace aspect of this as well.
> > QEMU and other actual vfio drivers are probably the least affected,
> > at least for QEMU, it'll happily open any device that has a pointer to
> > an IOMMU group that's reflected as a vfio group device. Tools like
> > libvirt, on the other hand, actually do driver binding and we need to
> > consider how they make driver decisions. Jason suggested that the
> > vfio-pci driver ought to be only spec compliant behavior, which sounds
> > like some deprecation process of splitting out the IGD, NVLink, zpci,
> > etc. features into sub-drivers and eventually removing that device
> > specific support from vfio-pci. Would we expect libvirt to know, "this
> > is an 8086 graphics device, try to bind it to vfio-pci-igd" or "uname
> > -m says we're running on s390, try to bind it to vfio-zpci"? Maybe we
> > expect derived drivers to only bind to devices they recognize, so
> > libvirt could blindly try a whole chain of drivers, ending in vfio-pci.
> > Obviously if we have competing drivers that support the same device in
> > different ways, that quickly falls apart.
>
> I think we can leave common arch specific stuff, such as s390 (IIUC) in
> the core driver. And only create vfio_pci drivers for
> vendor/device/subvendor specific stuff.

So on one hand you're telling us that the design principles here can be
applied to various other device/platform specific support, but on the
other you're saying, but don't do that...

> Also, the competing drivers issue can also happen today, right ? after
> adding new_id to vfio_pci I don't know how linux will behave if we'll
> plug new device with same id to the system. which driver will probe it ?

new_id is non-deterministic, that's why we have driver_override.

> I don't really afraid of competing drivers since we can ask from vendor
> vfio pci_drivers to add vendor_id, device_id, subsystem_vendor and
> subsystem_device so we won't have this problem. I don't think that there
> will be 2 drivers that drive the same device with these 4 ids.
>
> Userspace tool can have a map of ids to drivers and bind the device to
> the right vfio-pci vendor driver if it has one. if not, bind to vfio_pci.ko.

As I've outlined, the support is not really per device, there might be
a preferred default driver for the platform, ex. s390.

> > Libvirt could also expand its available driver models for the user to
> > specify a variant, I'd support that for overriding a choice that libvirt
> > might make otherwise, but forcing the user to know this information is
> > just passing the buck.
>
> We can add a code to libvirt as mentioned above.

That's rather the question here, what is that algorithm by which a
userspace tool such as libvirt would determine the optimal driver for a
device?

> > Some derived drivers could probably actually include device IDs rather
> > than only relying on dynamic ids, but then we get into the problem that
> > we're competing with native host driver for a device. The aux bus
> > example here is essentially the least troublesome variation since it
> > works in conjunction with the native host driver rather than replacing
> > it. Thanks,
>
> same competition after we add new_id to vfio_pci, right ?

new_id is already superseded by driver_override to avoid the ambiguity,
but which driver does a userspace tool like libvirt define as the
ultimate target driver for a device, and how?

> A pointer to the needed additions to libvirt would be awesome (or any other hint).
>
> I'll send the V2 soon and then move to libvirt.

The libvirt driver for a device likely needs to accept vfio variants
and allow users to specify a variant, but the real question is how
libvirt makes an educated guess which variant to use initially, which I
don't really have any good ideas to resolve. Thanks,

Alex

2021-02-01 09:43:08

by Max Gurtovoy

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem


On 2/1/2021 6:32 AM, Alex Williamson wrote:
> On Sun, 31 Jan 2021 20:46:40 +0200
> Max Gurtovoy <[email protected]> wrote:
>
>> On 1/28/2021 11:02 PM, Alex Williamson wrote:
>>> On Thu, 28 Jan 2021 17:29:30 +0100
>>> Cornelia Huck <[email protected]> wrote:
>>>
>>>> On Tue, 26 Jan 2021 15:27:43 +0200
>>>> Max Gurtovoy <[email protected]> wrote:
>>>>> On 1/26/2021 5:34 AM, Alex Williamson wrote:
>>>>>> On Mon, 25 Jan 2021 20:45:22 -0400
>>>>>> Jason Gunthorpe <[email protected]> wrote:
>>>>>>
>>>>>>> On Mon, Jan 25, 2021 at 04:31:51PM -0700, Alex Williamson wrote:
>>>>>>>> extensions potentially break vendor drivers, etc. We're only even hand
>>>>>>>> waving that existing device specific support could be farmed out to new
>>>>>>>> device specific drivers without even going to the effort to prove that.
>>>>>>> This is an RFC, not a complete patch series. The RFC is to get feedback
>>>>>>> on the general design before everyone commits a lot of resources and
>>>>>>> positions get dug in.
>>>>>>>
>>>>>>> Do you really think the existing device specific support would be a
>>>>>>> problem to lift? It already looks pretty clean with the
>>>>>>> vfio_pci_regops, looks easy enough to lift to the parent.
>>>>>>>
>>>>>>>> So far the TODOs rather mask the dirty little secrets of the
>>>>>>>> extension rather than showing how a vendor derived driver needs to
>>>>>>>> root around in struct vfio_pci_device to do something useful, so
>>>>>>>> probably porting actual device specific support rather than further
>>>>>>>> hand waving would be more helpful.
>>>>>>> It would be helpful to get actual feedback on the high level design -
>>>>>>> something like this was already tried in May and didn't go anywhere -
>>>>>>> are you surprised that we are reluctant to commit a lot of resources
>>>>>>> doing a complete job just to have it go nowhere again?
>>>>>> That's not really what I'm getting from your feedback, indicating
>>>>>> vfio-pci is essentially done, the mlx stub driver should be enough to
>>>>>> see the direction, and additional concerns can be handled with TODO
>>>>>> comments. Sorry if this is not construed as actual feedback, I think
>>>>>> both Connie and I are making an effort to understand this and being
>>>>>> hampered by lack of a clear api or a vendor driver that's anything more
>>>>>> than vfio-pci plus an aux bus interface. Thanks,
>>>>> I think I got the main idea and I'll try to summarize it:
>>>>>
>>>>> The separation into vfio-pci.ko and vfio-pci-core.ko is acceptable, and we
>>>>> do need it to be able to create a vendor-vfio-pci.ko driver in the future
>>>>> to include vendor special sauce inside.
>>>> One other thing I'd like to bring up: What needs to be done in
>>>> userspace? Does a userspace driver like QEMU need changes to actually
>>>> exploit this? Does management software like libvirt need to be involved
>>>> in decision making, or does it just need to provide the knobs to make
>>>> the driver configurable?
>>> I'm still pretty nervous about the userspace aspect of this as well.
>>> QEMU and other actual vfio drivers are probably the least affected,
>>> at least for QEMU, it'll happily open any device that has a pointer to
>>> an IOMMU group that's reflected as a vfio group device. Tools like
>>> libvirt, on the other hand, actually do driver binding and we need to
>>> consider how they make driver decisions. Jason suggested that the
>>> vfio-pci driver ought to be only spec compliant behavior, which sounds
>>> like some deprecation process of splitting out the IGD, NVLink, zpci,
>>> etc. features into sub-drivers and eventually removing that device
>>> specific support from vfio-pci. Would we expect libvirt to know, "this
>>> is an 8086 graphics device, try to bind it to vfio-pci-igd" or "uname
>>> -m says we're running on s390, try to bind it to vfio-zpci"? Maybe we
>>> expect derived drivers to only bind to devices they recognize, so
>>> libvirt could blindly try a whole chain of drivers, ending in vfio-pci.
>>> Obviously if we have competing drivers that support the same device in
>>> different ways, that quickly falls apart.
>> I think we can leave common arch specific stuff, such as s390 (IIUC) in
>> the core driver. And only create vfio_pci drivers for
>> vendor/device/subvendor specific stuff.
> So on one hand you're telling us that the design principles here can be
> applied to various other device/platform specific support, but on the
> other you're saying, but don't do that...

I guess I was looking at nvlink2 as device specific.

But let's update the nvlink2, s390 and IGD a bit:

1. s390 - config VFIO_PCI_ZDEV renamed to config VFIO_PCI_S390 (it will
include all needed tweaks for S390)

2. nvlink2 - config VFIO_PCI_NVLINK2 renamed to config VFIO_PCI_P9 (it
will include all needed tweaks for P9)

3. igd - config VFIO_PCI_IGD renamed to config VFIO_PCI_X86 (it will
include all needed tweaks for X86)

All 3 stay in vfio-pci-core.ko, since we might need the S390 stuff if we
plug in a network adapter from vendor A or an NVMe adapter from vendor B,
for example. This is platform specific and we don't want to duplicate it
in each vendor driver.

Same for P9 (and nvlink2 is only a special case in there) and X86.

>
>> Also, the competing drivers issue can also happen today, right ? after
>> adding new_id to vfio_pci I don't know how linux will behave if we'll
>> plug new device with same id to the system. which driver will probe it ?
> new_id is non-deterministic, that's why we have driver_override.

I'm not sure I understand how driver_override helps in the competition?

It only enforces driver binding to a device.

If we have device AAA0 that is driven by aaa.ko, and we add AAA as a new_id
to vfio_pci, and afterwards we plug in AAA1 that is also driven by aaa.ko
and can be driven by vfio_pci.ko, what will happen? Will it always be the
wanted behavior?

We will have a competition in any case in the current Linux design; we are
only adding new players to the competition.

How does libvirt use driver_override?

And why will that change in the case of a vendor specific vfio-pci driver?


>
>> I don't really afraid of competing drivers since we can ask from vendor
>> vfio pci_drivers to add vendor_id, device_id, subsystem_vendor and
>> subsystem_device so we won't have this problem. I don't think that there
>> will be 2 drivers that drive the same device with these 4 ids.
>>
>> Userspace tool can have a map of ids to drivers and bind the device to
>> the right vfio-pci vendor driver if it has one. if not, bind to vfio_pci.ko.
> As I've outlined, the support is not really per device, there might be
> a preferred default driver for the platform, ex. s390.
>
>>> Libvirt could also expand its available driver models for the user to
>>> specify a variant, I'd support that for overriding a choice that libvirt
>>> might make otherwise, but forcing the user to know this information is
>>> just passing the buck.
>> We can add a code to libvirt as mentioned above.
> That's rather the question here, what is that algorithm by which a
> userspace tool such as libvirt would determine the optimal driver for a
> device?

If one exists, the optimal driver is the vendor driver, according to a
mapping of device_id + vendor_id + subsystem_device + subsystem_vendor to
vendor-vfio-pci.ko.

If not, bind to vfio-pci.ko.
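
Roughly, the userspace side of that lookup could look like the sketch below
(the table entry and the driver name are hypothetical, just to show the
fallback):

struct vfio_variant {
        unsigned short vendor, device, subvendor, subdevice;
        const char *driver;
};

/* hypothetical entries; where this table lives and who maintains it is open */
static const struct vfio_variant variants[] = {
        { 0x15b3, 0x6001, 0x15b3, 0x0001, "mlx5-vfio-pci" },
};

static const char *pick_vfio_driver(unsigned short vendor, unsigned short device,
                                    unsigned short subvendor, unsigned short subdevice)
{
        unsigned int i;

        for (i = 0; i < sizeof(variants) / sizeof(variants[0]); i++) {
                if (variants[i].vendor == vendor &&
                    variants[i].device == device &&
                    variants[i].subvendor == subvendor &&
                    variants[i].subdevice == subdevice)
                        return variants[i].driver;
        }
        return "vfio-pci";      /* no vendor driver known, use the generic one */
}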

Platform specific stuff will be handled in vfio-pci-core.ko and not in a
vendor driver. Vendor drivers are for PCI devices, not platform tweaks.

>
>>> Some derived drivers could probably actually include device IDs rather
>>> than only relying on dynamic ids, but then we get into the problem that
>>> we're competing with native host driver for a device. The aux bus
>>> example here is essentially the least troublesome variation since it
>>> works in conjunction with the native host driver rather than replacing
>>> it. Thanks,
>> same competition after we add new_id to vfio_pci, right ?
> new_id is already superseded by driver_override to avoid the ambiguity,
> but to which driver does a userspace tool like libvirt define as the
> ultimate target driver for a device and how?

It will have a lookup table, as mentioned above.


>
> >> A pointer to the needed additions to libvirt would be awesome (or any other hint).
>>
>> I'll send the V2 soon and then move to libvirt.
> The libvirt driver for a device likely needs to accept vfio variants
> and allow users to specify a variant, but the real question is how
> libvirt makes an educated guess which variant to use initially, which I
> don't really have any good ideas to resolve. Thanks,
>
> Alex
>

2021-02-01 17:21:23

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Sun, Jan 31, 2021 at 09:32:28PM -0700, Alex Williamson wrote:

> > I think we can leave common arch specific stuff, such as s390 (IIUC) in
> > the core driver. And only create vfio_pci drivers for
> > vendor/device/subvendor specific stuff.
>
> So on one hand you're telling us that the design principles here can be
> applied to various other device/platform specific support, but on the
> other you're saying, but don't do that...

The kernel code for the other three can be reworked to follow this
scheme, but because of the uABI we have established, binding vfio_pci
is going to have to continue to automatically use the other drivers
for a long time.

For instance this can be accomplished by structuring the three drivers
in the new way and diverting the driver ops from vfio_pci to the other
drivers in a hardcoded way.

Whether this is worth doing from a maintainability perspective vs. just
continuing to hardwire the PCI IDs in vfio_pci is something to look
at.

The point was it could work.

> > We can add a code to libvirt as mentioned above.
>
> That's rather the question here, what is that algorithm by which a
> userspace tool such as libvirt would determine the optimal driver for a
> device?

Well, the PCI drivers do specify their PCI ids:

+static const struct pci_device_id mlx5_vfio_pci_table[] = {
+        { PCI_VDEVICE(MELLANOX, 0x6001) }, /* NVMe SNAP controllers */
+        { PCI_DEVICE_SUB(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1042,
+                         PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID) }, /* Virtio SNAP controllers */
+        { 0, }
+};

The problem in VFIO is that it doesn't have any way to create a VFIO
from a '/sys/device/blah/'. It expects userspace to know the module
name as part of the uAPI.

Generally speaking in the kernel we'd want to see some uAPI that was
'create a VFIO from /sys/device/blah' that does automatic module
loading and automatic driver selection.

For instance, by forming the '/sys/device/blah' into a normal modalias,
and then searching for the most specific VFIO modalias to handle the
device.
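
The sysfs modalias for a PCI device already encodes exactly these IDs, so the
matching step itself is just a pattern match. A rough userspace sketch,
assuming the alias patterns look like the modules.alias entries generated
from pci_device_id tables:

#include <fnmatch.h>
#include <stdio.h>
#include <string.h>

/* Reads something like "pci:v000015B3d00006001sv...sd...bc02sc00i00". */
static int read_modalias(const char *bdf, char *buf, size_t len)
{
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/modalias", bdf);
        f = fopen(path, "r");
        if (!f)
                return -1;
        if (!fgets(buf, (int)len, f)) {
                fclose(f);
                return -1;
        }
        fclose(f);
        buf[strcspn(buf, "\n")] = '\0';
        return 0;
}

/* e.g. pattern "pci:v000015B3d00006001sv*sd*bc*sc*i*" for a vendor/device match */
static int alias_matches(const char *modalias, const char *pattern)
{
        return fnmatch(pattern, modalias, 0) == 0;
}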

When implemented in netlink the whole thing is fully automatic, users
never have to know or care about any modules or their names. This is
the gold standard.

Jason

2021-02-01 17:35:11

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

On Mon, 1 Feb 2021 11:40:45 +0200
Max Gurtovoy <[email protected]> wrote:

> On 2/1/2021 6:32 AM, Alex Williamson wrote:
> > On Sun, 31 Jan 2021 20:46:40 +0200
> > Max Gurtovoy <[email protected]> wrote:
> >
> >> On 1/28/2021 11:02 PM, Alex Williamson wrote:
> >>> On Thu, 28 Jan 2021 17:29:30 +0100
> >>> Cornelia Huck <[email protected]> wrote:
> >>>
> >>>> On Tue, 26 Jan 2021 15:27:43 +0200
> >>>> Max Gurtovoy <[email protected]> wrote:
> >>>>> On 1/26/2021 5:34 AM, Alex Williamson wrote:
> >>>>>> On Mon, 25 Jan 2021 20:45:22 -0400
> >>>>>> Jason Gunthorpe <[email protected]> wrote:
> >>>>>>
> >>>>>>> On Mon, Jan 25, 2021 at 04:31:51PM -0700, Alex Williamson wrote:
> >>>>>>>> extensions potentially break vendor drivers, etc. We're only even hand
> >>>>>>>> waving that existing device specific support could be farmed out to new
> >>>>>>>> device specific drivers without even going to the effort to prove that.
> >>>>>>> This is an RFC, not a complete patch series. The RFC is to get feedback
> >>>>>>> on the general design before everyone commits a lot of resources and
> >>>>>>> positions get dug in.
> >>>>>>>
> >>>>>>> Do you really think the existing device specific support would be a
> >>>>>>> problem to lift? It already looks pretty clean with the
> >>>>>>> vfio_pci_regops, looks easy enough to lift to the parent.
> >>>>>>>
> >>>>>>>> So far the TODOs rather mask the dirty little secrets of the
> >>>>>>>> extension rather than showing how a vendor derived driver needs to
> >>>>>>>> root around in struct vfio_pci_device to do something useful, so
> >>>>>>>> probably porting actual device specific support rather than further
> >>>>>>>> hand waving would be more helpful.
> >>>>>>> It would be helpful to get actual feedback on the high level design -
> >>>>>>> something like this was already tried in May and didn't go anywhere -
> >>>>>>> are you surprised that we are reluctant to commit a lot of resources
> >>>>>>> doing a complete job just to have it go nowhere again?
> >>>>>> That's not really what I'm getting from your feedback, indicating
> >>>>>> vfio-pci is essentially done, the mlx stub driver should be enough to
> >>>>>> see the direction, and additional concerns can be handled with TODO
> >>>>>> comments. Sorry if this is not construed as actual feedback, I think
> >>>>>> both Connie and I are making an effort to understand this and being
> >>>>>> hampered by lack of a clear api or a vendor driver that's anything more
> >>>>>> than vfio-pci plus an aux bus interface. Thanks,
> >>>>> I think I got the main idea and I'll try to summarize it:
> >>>>>
> >>>>> The separation into vfio-pci.ko and vfio-pci-core.ko is acceptable, and we
> >>>>> do need it to be able to create a vendor-vfio-pci.ko driver in the future
> >>>>> to include vendor special sauce inside.
> >>>> One other thing I'd like to bring up: What needs to be done in
> >>>> userspace? Does a userspace driver like QEMU need changes to actually
> >>>> exploit this? Does management software like libvirt need to be involved
> >>>> in decision making, or does it just need to provide the knobs to make
> >>>> the driver configurable?
> >>> I'm still pretty nervous about the userspace aspect of this as well.
> >>> QEMU and other actual vfio drivers are probably the least affected,
> >>> at least for QEMU, it'll happily open any device that has a pointer to
> >>> an IOMMU group that's reflected as a vfio group device. Tools like
> >>> libvirt, on the other hand, actually do driver binding and we need to
> >>> consider how they make driver decisions. Jason suggested that the
> >>> vfio-pci driver ought to be only spec compliant behavior, which sounds
> >>> like some deprecation process of splitting out the IGD, NVLink, zpci,
> >>> etc. features into sub-drivers and eventually removing that device
> >>> specific support from vfio-pci. Would we expect libvirt to know, "this
> >>> is an 8086 graphics device, try to bind it to vfio-pci-igd" or "uname
> >>> -m says we're running on s390, try to bind it to vfio-zpci"? Maybe we
> >>> expect derived drivers to only bind to devices they recognize, so
> >>> libvirt could blindly try a whole chain of drivers, ending in vfio-pci.
> >>> Obviously if we have competing drivers that support the same device in
> >>> different ways, that quickly falls apart.
> >> I think we can leave common arch specific stuff, such as s390 (IIUC) in
> >> the core driver. And only create vfio_pci drivers for
> >> vendor/device/subvendor specific stuff.
> > So on one hand you're telling us that the design principles here can be
> > applied to various other device/platform specific support, but on the
> > other you're saying, but don't do that...
>
> I guess I was looking at nvlink2 as device specific.

It's device specific w/ platform dependencies as I see it.

> But let's update the nvlink2, s390 and IGD a bit:
>
> 1. s390 - config VFIO_PCI_ZDEV renamed to config VFIO_PCI_S390 (it will
> include all needed tweaks for S390)
>
> 2. nvlink2 - config VFIO_PCI_NVLINK2 renamed to config VFIO_PCI_P9 (it
> will include all needed tweaks for P9)
>
> 3. igd - config VFIO_PCI_IGD renamed to config VFIO_PCI_X86 (it will
> include all needed tweaks for X86)
>
> All the 3 stays in the vfio-pci-core.ko since we might need S390 stuff
> if we plug Network adapter from vendor-A or  NVMe adapter from vendor-B
> for example. This is platform specific and we don't want to duplicate it
> in each vendor driver.
>
> Same for P9 (and nvlink2 is only a special case in there) and X86.

I'm not a fan of this; you're essentially avoiding the issue by turning
everything into an architecture specific version of the driver. That
takes us down a path where everything gets added to vfio-pci-core with
a bunch of Kconfig switches, which seems to be exactly the opposite
direction of creating a core module as a library for derived device
and/or platform specific modules. This also appears to be the opposite
of Jason's suggestion that vfio-pci become a pure PCI spec module
without various device/vendor quirks.

> >> Also, the competing drivers issue can also happen today, right ? after
> >> adding new_id to vfio_pci I don't know how linux will behave if we'll
> >> plug new device with same id to the system. which driver will probe it ?
> > new_id is non-deterministic, that's why we have driver_override.
>
> I'm not sure I understand how driver_override help in the competition ?
>
> it's only enforce driver binding to a device.
>
> if we have device AAA0 that is driven by aaa.ko and we add AAA as new_id
> to vfio_pci and afterwards we plug AAA1 that is also driven by aaa.ko
> and can be driven by vfio_pci.ko. what will happen ? will it be the
> wanted behavior always ?

I think with AAA vs AAA0 and AAA1 you're suggesting a wildcard in
new_id where both the aaa.ko driver and vfio-pci.ko (once the wildcard
is added) could bind to the device. At that point, which driver gets
first shot at a compatible device depends on the driver load order, ie.
if the aaa.ko module was loaded before vfio-pci.ko, it might win. If
an event happens that causes the competing driver to be loaded between
setting a new_id and binding the device to the driver, that competing
module load could claim the device instead. driver_override helps by
allowing the user to define that a device will match a driver rather
than a driver matching a device. The user can write a driver name to
the driver_override of the device such that that device can only be
bound to the specified driver.

> We will have a competition in any case in the current linux design. Only
> now we add new players to the competition.
>
> how does libvirt use driver_override ?

Libvirt would write "vfio-pci" to the driver_override attribute for a
device such that when an unbind and drivers_probe occurs, that device
can only be bound to vfio-pci. There is no longer a race with another
driver nor is there non-determinism based on module load order.
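
(For reference, that sequence, generalized to an arbitrary vfio variant
driver, is just three sysfs writes. A rough sketch with error handling
trimmed:)

#include <stdio.h>

static int sysfs_write(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fputs(val, f);
        return fclose(f);
}

/* bdf is e.g. "0000:08:00.0"; driver is "vfio-pci" or a vendor variant */
static int bind_vfio_driver(const char *bdf, const char *driver)
{
        char path[128];

        /* from now on only this driver may claim the device */
        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/driver_override", bdf);
        if (sysfs_write(path, driver))
                return -1;

        /* release the device from whatever driver currently has it bound */
        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/driver/unbind", bdf);
        sysfs_write(path, bdf);

        /* reprobe; the override makes the result deterministic */
        return sysfs_write("/sys/bus/pci/drivers_probe", bdf);
}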

> and why will it change in case of vendor specific vfio-pci driver ?

Libvirt needs to know *what* driver to set as the driver_override, so
if we had vfio-pci, vfio-pci-zdev, vfio-zpci-ism, vfio-pci-igd,
vfio-pci-ppc-nvlink, etc., how does libvirt decide which driver it
should use? The drivers themselves cannot populate their match ids or
else we get into the problem above where for example vfio-pci-igd could
possibly claim the Intel graphics device before i915 depending on the
module load order, which would be a support issue.

> >> I don't really afraid of competing drivers since we can ask from vendor
> >> vfio pci_drivers to add vendor_id, device_id, subsystem_vendor and
> >> subsystem_device so we won't have this problem. I don't think that there
> >> will be 2 drivers that drive the same device with these 4 ids.
> >>
> >> Userspace tool can have a map of ids to drivers and bind the device to
> >> the right vfio-pci vendor driver if it has one. if not, bind to vfio_pci.ko.
> > As I've outlined, the support is not really per device, there might be
> > a preferred default driver for the platform, ex. s390.
> >
> >>> Libvirt could also expand its available driver models for the user to
> >>> specify a variant, I'd support that for overriding a choice that libvirt
> >>> might make otherwise, but forcing the user to know this information is
> >>> just passing the buck.
> >> We can add a code to libvirt as mentioned above.
> > That's rather the question here, what is that algorithm by which a
> > userspace tool such as libvirt would determine the optimal driver for a
> > device?
>
> If exist, the optimal driver is the vendor driver according to mapping
> of device_id + vendor_id + subsystem_device + subsystem_vendor to
> vendor-vfio-pci.ko.

And how is that mapping done? The only sane way would be depmod, but
that implies that vendor-vfio-pci.ko fills the ids table for that
device, which can only happen for devices that have no competing host
driver. For example, how would Intel go about creating their vendor
vfio-pci module that can bind to an xl710 VF that wouldn't create
competition and non-determinism in module loading vs the existing
iavf.ko module? This only works with your aux bus based driver, but
that's also the only vfio-pci derived driver proposed that can take
advantage of that approach.

> If not, bind to vfio-pci.ko.
>
> Platform specific stuff will be handled in vfio-pci-core.ko and not in a
> vendor driver. Vendor drivers are for PCI devices, not platform tweaks.

I don't think that's the correct, or a sustainable approach. vfio-pci
will necessarily make use of platform access functions, but trying to
loop in devices that have platform dependencies as platform extensions
of vfio-pci-core rather than vendor extensions for a device seems wrong
to me.

> >>> Some derived drivers could probably actually include device IDs rather
> >>> than only relying on dynamic ids, but then we get into the problem that
> >>> we're competing with native host driver for a device. The aux bus
> >>> example here is essentially the least troublesome variation since it
> >>> works in conjunction with the native host driver rather than replacing
> >>> it. Thanks,
> >> same competition after we add new_id to vfio_pci, right ?
> > new_id is already superseded by driver_override to avoid the ambiguity,
> > but to which driver does a userspace tool like libvirt define as the
> > ultimate target driver for a device and how?
>
> it will have a lookup table as mentioned above.

So the expectation is that each user application that manages binding
devices to vfio-pci derived drivers will have a lookup table that's
manually managed to provide an optimal device to driver mapping?
That's terrible. Thanks,

Alex