2022-12-06 06:21:12

by Rao, Lei

Subject: [RFC PATCH 0/5] Add new VFIO PCI driver for NVMe devices

Some devices, such as Infrastructure Processing Units (IPUs) and Data
Processing Units (DPUs), expose SR-IOV capable NVMe devices to the host.
Their VF devices support live migration via specific NVMe commands issued
through the PF device's admin queue.

One of the problems is that there are no standardized NVMe live migration
commands, which prevents us from creating a spec-compliant implementation.
So we've created a reference implementation based on the vendor-specific
command fields of the protocol. We want these commands to be standardized
so that the implementation can become spec compliant in future versions,
and we offer this RFC as a basis for that work.

More importantly, we are sharing our work to help the community and start
the discussion, so everyone can participate in and benefit from an NVMe
live migration implementation and help drive the standardization effort.

This series implements a device-specific vfio-pci driver to support live
migration of NVMe devices, adds a new interface in the generic NVMe driver
to forward the VF device's command submissions to the PF device's admin
queue, and documents the details of the NVMe device live migration
commands.

Any feedback and comments are greatly appreciated.

Lei Rao (5):
nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for
VF driver.
nvme-vfio: add new vfio-pci driver for NVMe device
nvme-vfio: enable the function of VFIO live migration.
nvme-vfio: check if the hardware supports live migration
nvme-vfio: Add a document for the NVMe device

drivers/nvme/host/pci.c | 18 +
drivers/vfio/pci/Kconfig | 2 +
drivers/vfio/pci/Makefile | 2 +
drivers/vfio/pci/nvme/Kconfig | 10 +
drivers/vfio/pci/nvme/Makefile | 3 +
drivers/vfio/pci/nvme/nvme.c | 636 +++++++++++++++++++++++++++++++++
drivers/vfio/pci/nvme/nvme.h | 111 ++++++
drivers/vfio/pci/nvme/nvme.txt | 278 ++++++++++++++
8 files changed, 1060 insertions(+)
create mode 100644 drivers/vfio/pci/nvme/Kconfig
create mode 100644 drivers/vfio/pci/nvme/Makefile
create mode 100644 drivers/vfio/pci/nvme/nvme.c
create mode 100644 drivers/vfio/pci/nvme/nvme.h
create mode 100644 drivers/vfio/pci/nvme/nvme.txt

--
2.34.1


2022-12-06 06:47:32

by Rao, Lei

Subject: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

The documentation describes the details of the NVMe hardware
extension to support VFIO live migration.
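
As a minimal illustration of the capability flag described in the document
below (byte 3072 of the vendor-specific area of the identify controller
data structure), a check could look roughly like this; the constant and
function names are ours, only the offset and values come from the document.

#include <linux/types.h>

/* Byte 3072 of the identify controller data, per this document. */
#define NVMEVF_LM_SUPPORT_OFFSET	3072

static bool nvmevf_live_migration_supported(const u8 *id_ctrl)
{
	/* 0x0: not supported, 0x1: supported, other values reserved. */
	return id_ctrl[NVMEVF_LM_SUPPORT_OFFSET] == 0x1;
}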

Signed-off-by: Lei Rao <[email protected]>
Signed-off-by: Yadong Li <[email protected]>
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Eddie Dong <[email protected]>
Reviewed-by: Hang Yuan <[email protected]>
---
drivers/vfio/pci/nvme/nvme.txt | 278 +++++++++++++++++++++++++++++++++
1 file changed, 278 insertions(+)
create mode 100644 drivers/vfio/pci/nvme/nvme.txt

diff --git a/drivers/vfio/pci/nvme/nvme.txt b/drivers/vfio/pci/nvme/nvme.txt
new file mode 100644
index 000000000000..eadcf2082eed
--- /dev/null
+++ b/drivers/vfio/pci/nvme/nvme.txt
@@ -0,0 +1,278 @@
+===========================
+NVMe Live Migration Support
+===========================
+
+Introduction
+------------
+To support live migration, the NVMe device defines its own implementation,
+consisting of five new vendor-specific admin commands and a capability flag
+in the vendor-specific field of the identify controller data structure.
+Software can use these live migration admin commands to query the device
+migration state data size, save and load that data, and suspend and resume
+a given VF device. The commands are submitted by software to the NVMe PF
+device's admin queue and are ignored if placed in the VF device's admin
+queue. This is because the NVMe VF device is passed through to the virtual
+machine in the virtualization scenario, so the VF device's admin queue is
+not available to the hypervisor for submitting VF live migration commands.
+The capability flag in the identify controller data structure can be used by
+software to detect whether the NVMe device supports live migration. The
+following chapters describe the detailed format of the commands and the flag.
+
+Definition of opcode for live migration commands
+------------------------------------------------
+
++---------------------------+-----------+-----------+------------+
+| | | | |
+| Opcode by Field | | | |
+| | | | |
++--------+---------+--------+ | | |
+| | | | Combined | Namespace | |
+| 07 | 06:02 | 01:00 | Opcode | Identifier| Command |
+| | | | | used | |
++--------+---------+--------+ | | |
+|Generic | Function| Data | | | |
+|command | |Transfer| | | |
++--------+---------+--------+-----------+-----------+------------+
+| |
+| Vendor Specific Opcode |
++--------+---------+--------+-----------+-----------+------------+
+| | | | | | Query the |
+| 1b | 10001 | 00 | 0xC4 | | data size |
++--------+---------+--------+-----------+-----------+------------+
+| | | | | | Suspend the|
+| 1b | 10010 | 00 | 0xC8 | | VF |
++--------+---------+--------+-----------+-----------+------------+
+| | | | | | Resume the |
+| 1b | 10011 | 00 | 0xCC | | VF |
++--------+---------+--------+-----------+-----------+------------+
+| | | | | | Save the |
+| 1b | 10100 | 10 | 0xD2 | |device data |
++--------+---------+--------+-----------+-----------+------------+
+| | | | | | Load the |
+| 1b | 10101 | 01 | 0xD5 | |device data |
++--------+---------+--------+-----------+-----------+------------+
+
+Definition of QUERY_DATA_SIZE command
+-------------------------------------
+
++---------+------------------------------------------------------------------------------------+
+| | |
+| Bytes | Description |
+| | |
++---------+------------------------------------------------------------------------------------+
+| | |
+| | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | Bits |Description | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 07:00 |Opcode(OPC):set to 0xC4 to indicate a query command | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for more details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| 03:00 | | 13:10 |Reserved | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 15:14 |PRP or SGL for Data Transfer(PSDT): See NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 31:16 |Command Identifier(CID) | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | |
+| | |
++---------+------------------------------------------------------------------------------------+
+| 39:04 | Reserved |
++---------+------------------------------------------------------------------------------------+
+| 41:40 | VF index: means which VF controller internal data size to query |
++---------+------------------------------------------------------------------------------------+
+| 63:42 | Reserved |
++---------+------------------------------------------------------------------------------------+
+
+The QUERY_DATA_SIZE command is used to query the NVMe VF internal data size for live migration.
+When the NVMe firmware receives the command, it will return the size of the NVMe VF internal
+data. The data size depends on how many IO queues have been created.
+
+Definition of SUSPEND command
+-----------------------------
+
++---------+------------------------------------------------------------------------------------+
+| | |
+| Bytes | Description |
+| | |
++---------+------------------------------------------------------------------------------------+
+| | |
+| | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | Bits |Description | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 07:00 |Opcode(OPC):set to 0xC8 to indicate a suspend command | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 09:08 |Fused Operation(FUSE):Please see NVMe specification for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| 03:00 | | 13:10 |Reserved | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 15:14 |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 31:16 |Command Identifier(CID) | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | |
+| | |
++---------+------------------------------------------------------------------------------------+
+| 39:04 | Reserved |
++---------+------------------------------------------------------------------------------------+
+| 41:40 | VF index: means which VF controller to suspend |
++---------+------------------------------------------------------------------------------------+
+| 63:42 | Reserved |
++---------+------------------------------------------------------------------------------------+
+
+The SUSPEND command is used to suspend the NVMe VF controller. When the NVMe firmware receives
+this command, it will suspend the NVMe VF controller.
+
+Definition of RESUME command
+----------------------------
+
++---------+------------------------------------------------------------------------------------+
+| | |
+| Bytes | Description |
+| | |
++---------+------------------------------------------------------------------------------------+
+| | |
+| | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | Bits |Description | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 07:00 |Opcode(OPC):set to 0xCC to indicate a resume command | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| 03:00 | | 13:10 |Reserved | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 15:14 |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 31:16 |Command Identifier(CID) | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | |
+| | |
++---------+------------------------------------------------------------------------------------+
+| 39:04 | Reserved |
++---------+------------------------------------------------------------------------------------+
+| 41:40 | VF index: means which VF controller to resume |
++---------+------------------------------------------------------------------------------------+
+| 63:42 | Reserved |
++---------+------------------------------------------------------------------------------------+
+
+The RESUME command is used to resume the NVMe VF controller. When firmware receives this command,
+it will restart the NVMe VF controller.
+
+Definition of SAVE_DATA command
+-------------------------------
+
++---------+------------------------------------------------------------------------------------+
+| | |
+| Bytes | Description |
+| | |
++---------+------------------------------------------------------------------------------------+
+| | |
+| | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | Bits |Description | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 07:00 |Opcode(OPC):set to 0xD2 to indicate a save command | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| 03:00 | | 13:10 |Reserved | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 15:14 |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 31:16 |Command Identifier(CID) | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | |
+| | |
++---------+------------------------------------------------------------------------------------+
+| 23:04 | Reserved |
++---------+------------------------------------------------------------------------------------+
+| 31:24 | PRP Entry1:the first PRP entry for the command or a PRP List Pointer |
++---------+------------------------------------------------------------------------------------+
+| 39:32 | PRP Entry2:the second address entry(reserved,page base address or PRP List Pointer)|
++---------+------------------------------------------------------------------------------------+
+| 41:40 | VF index: means which VF controller internal data to save |
++---------+------------------------------------------------------------------------------------+
+| 63:42 | Reserved |
++---------+------------------------------------------------------------------------------------+
+
+The SAVE_DATA command is used to save the NVMe VF internal data for live migration. When firmware
+receives this command, it will save the admin queue states, save some registers, drain IO SQs
+and CQs, save every IO queue state, disable the VF controller, and transfer all data to the
+host memory through DMA.
+
+Definition of LOAD_DATA command
+-------------------------------
+
++---------+------------------------------------------------------------------------------------+
+| | |
+| Bytes | Description |
+| | |
++---------+------------------------------------------------------------------------------------+
+| | |
+| | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | Bits |Description | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 07:00 |Opcode(OPC):set to 0xD5 to indicate a load command | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| 03:00 | | 13:10 |Reserved | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 15:14 |PRP or SGL for Data Transfer(PSDT): See NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 31:16 |Command Identifier(CID) | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | |
+| | |
++---------+------------------------------------------------------------------------------------+
+| 23:04 | Reserved |
++---------+------------------------------------------------------------------------------------+
+| 31:24 | PRP Entry1:the first PRP entry for the command or a PRP List Pointer |
++---------+------------------------------------------------------------------------------------+
+| 39:32 | PRP Entry2:the second address entry(reserved,page base address or PRP List Pointer)|
++---------+------------------------------------------------------------------------------------+
+| 41:40 | VF index: means which VF controller internal data to load |
++---------+------------------------------------------------------------------------------------+
+| 47:44 | Size: means the size of the device's internal data to be loaded |
++---------+------------------------------------------------------------------------------------+
+| 63:48 | Reserved |
++---------+------------------------------------------------------------------------------------+
+
+The LOAD_DATA command is used to restore the NVMe VF internal data. When firmware receives this
+command, it will read the device's internal data from the host memory through DMA, restore the
+admin queue states and some registers, and restore every IO queue state.
+
+Extensions of the vendor-specific field in the identify controller data structure
+---------------------------------------------------------------------------------
+
++---------+------+------+------+-------------------------------+
+| | | | | |
+| Bytes | I/O |Admin | Disc | Description |
+| | | | | |
++---------+------+------+------+-------------------------------+
+| | | | | |
+| 01:00 | M | M | R | PCI Vendor ID(VID) |
++---------+------+------+------+-------------------------------+
+| | | | | |
+| 03:02 | M | M | R | PCI Subsystem Vendor ID(SSVID) |
++---------+------+------+------+-------------------------------+
+| | | | | |
+| ... | ... | ... | ... | ... |
++---------+------+------+------+-------------------------------+
+| | | | | |
+| 3072 | O | O | O | Live Migration Support |
++---------+------+------+------+-------------------------------+
+| | | | | |
+|4095:3073| O | O | O | Vendor Specific |
++---------+------+------+------+-------------------------------+
+
+According to the NVMe specification, bytes 3072 to 4095 are vendor-specific fields.
+The NVMe device uses byte 3072 of the identify controller data structure to indicate
+whether live migration is supported: 0x0 means live migration is not supported, 0x01 means
+live migration is supported, and other values are reserved.
+
+[1] https://nvmexpress.org/wp-content/uploads/NVMe-NVM-Express-2.0a-2021.07.26-Ratified.pdf
--
2.34.1

2022-12-06 06:52:46

by Rao, Lei

Subject: [RFC PATCH 2/5] nvme-vfio: add new vfio-pci driver for NVMe device

The NVMe device has a device-specific live migration implementation.
Add a specific VFIO PCI driver for the NVMe device. Its live
migration support will be added in subsequent patches.

Signed-off-by: Lei Rao <[email protected]>
Signed-off-by: Yadong Li <[email protected]>
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Eddie Dong <[email protected]>
Reviewed-by: Hang Yuan <[email protected]>
---
drivers/vfio/pci/Kconfig | 2 +
drivers/vfio/pci/Makefile | 2 +
drivers/vfio/pci/nvme/Kconfig | 9 ++++
drivers/vfio/pci/nvme/Makefile | 3 ++
drivers/vfio/pci/nvme/nvme.c | 99 ++++++++++++++++++++++++++++++++++
5 files changed, 115 insertions(+)
create mode 100644 drivers/vfio/pci/nvme/Kconfig
create mode 100644 drivers/vfio/pci/nvme/Makefile
create mode 100644 drivers/vfio/pci/nvme/nvme.c

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index f9d0c908e738..fcd45144d3e3 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -59,4 +59,6 @@ source "drivers/vfio/pci/mlx5/Kconfig"

source "drivers/vfio/pci/hisilicon/Kconfig"

+source "drivers/vfio/pci/nvme/Kconfig"
+
endif
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 24c524224da5..eddc8e889726 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -11,3 +11,5 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5/

obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
+
+obj-$(CONFIG_NVME_VFIO_PCI) += nvme/
diff --git a/drivers/vfio/pci/nvme/Kconfig b/drivers/vfio/pci/nvme/Kconfig
new file mode 100644
index 000000000000..c281fe154007
--- /dev/null
+++ b/drivers/vfio/pci/nvme/Kconfig
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config NVME_VFIO_PCI
+ tristate "VFIO support for NVMe PCI devices"
+ depends on VFIO_PCI_CORE
+ help
+ This provides generic VFIO PCI support for NVMe device
+ using the VFIO framework.
+
+ If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/nvme/Makefile b/drivers/vfio/pci/nvme/Makefile
new file mode 100644
index 000000000000..2f4a0ad3d9cf
--- /dev/null
+++ b/drivers/vfio/pci/nvme/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_NVME_VFIO_PCI) += nvme-vfio-pci.o
+nvme-vfio-pci-y := nvme.o
diff --git a/drivers/vfio/pci/nvme/nvme.c b/drivers/vfio/pci/nvme/nvme.c
new file mode 100644
index 000000000000..f1386d8a9287
--- /dev/null
+++ b/drivers/vfio/pci/nvme/nvme.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2022, INTEL CORPORATION. All rights reserved
+ */
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/types.h>
+#include <linux/vfio.h>
+#include <linux/anon_inodes.h>
+#include <linux/kernel.h>
+#include <linux/vfio_pci_core.h>
+
+static int nvmevf_pci_open_device(struct vfio_device *core_vdev)
+{
+ struct vfio_pci_core_device *vdev =
+ container_of(core_vdev, struct vfio_pci_core_device, vdev);
+ int ret;
+
+ ret = vfio_pci_core_enable(vdev);
+ if (ret)
+ return ret;
+
+ vfio_pci_core_finish_enable(vdev);
+ return 0;
+}
+
+static const struct vfio_device_ops nvmevf_pci_ops = {
+ .name = "nvme-vfio-pci",
+ .init = vfio_pci_core_init_dev,
+ .release = vfio_pci_core_release_dev,
+ .open_device = nvmevf_pci_open_device,
+ .close_device = vfio_pci_core_close_device,
+ .ioctl = vfio_pci_core_ioctl,
+ .device_feature = vfio_pci_core_ioctl_feature,
+ .read = vfio_pci_core_read,
+ .write = vfio_pci_core_write,
+ .mmap = vfio_pci_core_mmap,
+ .request = vfio_pci_core_request,
+ .match = vfio_pci_core_match,
+};
+
+static int nvmevf_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+ struct vfio_pci_core_device *vdev;
+ int ret;
+
+ vdev = vfio_alloc_device(vfio_pci_core_device, vdev, &pdev->dev,
+ &nvmevf_pci_ops);
+ if (IS_ERR(vdev))
+ return PTR_ERR(vdev);
+
+ dev_set_drvdata(&pdev->dev, vdev);
+ ret = vfio_pci_core_register_device(vdev);
+ if (ret)
+ goto out_put_dev;
+
+ return 0;
+
+out_put_dev:
+ vfio_put_device(&vdev->vdev);
+ return ret;
+}
+
+static void nvmevf_pci_remove(struct pci_dev *pdev)
+{
+ struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);
+
+ vfio_pci_core_unregister_device(vdev);
+ vfio_put_device(&vdev->vdev);
+}
+
+static const struct pci_device_id nvmevf_pci_table[] = {
+ /* Intel IPU NVMe Virtual Function */
+ { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_INTEL, 0x1457) },
+ {}
+};
+
+MODULE_DEVICE_TABLE(pci, nvmevf_pci_table);
+
+static struct pci_driver nvmevf_pci_driver = {
+ .name = KBUILD_MODNAME,
+ .id_table = nvmevf_pci_table,
+ .probe = nvmevf_pci_probe,
+ .remove = nvmevf_pci_remove,
+ .err_handler = &vfio_pci_core_err_handlers,
+ .driver_managed_dma = true,
+};
+
+module_pci_driver(nvmevf_pci_driver);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Lei Rao <[email protected]>");
+MODULE_DESCRIPTION("NVMe VFIO PCI - Generic VFIO PCI driver for NVMe");
--
2.34.1

2022-12-06 07:04:00

by Rao, Lei

Subject: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

The new function nvme_submit_vf_cmd() helps the host VF driver issue
VF admin commands. It is useful in cases where the host NVMe driver
does not control the VF's admin queue. For example, in the virtualization
device pass-through case, the VF controller's admin queue is governed
by the guest NVMe driver, so the host VF driver relies on the PF device's
admin queue to control VF devices, e.g. for vendor-specific live
migration commands.
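
For illustration, a VF vfio variant driver could wrap this helper roughly
as follows. The SUSPEND opcode (0xC8) and the VF-index placement in cdw10
follow the vendor-specific layout documented in patch 5/5; the helper name
and error handling here are only a sketch, not code from this series.

/* Sketch only: suspend a VF through its parent PF's admin queue. */
static int nvmevf_suspend_vf(struct pci_dev *vf_dev, u16 vf_index)
{
	struct nvme_command cmd = { };

	cmd.common.opcode = 0xC8;                 /* vendor-specific SUSPEND */
	cmd.common.cdw10 = cpu_to_le32(vf_index); /* which VF to suspend */

	/* The helper looks up the PF's admin queue from the VF's pci_dev. */
	return nvme_submit_vf_cmd(vf_dev, &cmd, NULL, NULL, 0);
}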

Signed-off-by: Lei Rao <[email protected]>
Signed-off-by: Yadong Li <[email protected]>
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Eddie Dong <[email protected]>
Reviewed-by: Hang Yuan <[email protected]>
---
drivers/nvme/host/pci.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 488ad7dabeb8..3d9c54d8e7fc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3585,6 +3585,24 @@ static struct pci_driver nvme_driver = {
.err_handler = &nvme_err_handler,
};

+int nvme_submit_vf_cmd(struct pci_dev *dev, struct nvme_command *cmd,
+ size_t *result, void *buffer, unsigned int bufflen)
+{
+ struct nvme_dev *ndev = NULL;
+ union nvme_result res = { };
+ int ret;
+
+ ndev = pci_iov_get_pf_drvdata(dev, &nvme_driver);
+ if (IS_ERR(ndev))
+ return PTR_ERR(ndev);
+ ret = __nvme_submit_sync_cmd(ndev->ctrl.admin_q, cmd, &res, buffer, bufflen,
+ NVME_QID_ANY, 0, 0);
+ if (ret >= 0 && result)
+ *result = le32_to_cpu(res.u32);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(nvme_submit_vf_cmd);
+
static int __init nvme_init(void)
{
BUILD_BUG_ON(sizeof(struct nvme_create_cq) != 64);
--
2.34.1

2022-12-06 07:51:29

by Christoph Hellwig

Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

On Tue, Dec 06, 2022 at 01:58:16PM +0800, Lei Rao wrote:
> The documentation describes the details of the NVMe hardware
> extension to support VFIO live migration.

This is not an NVMe hardware extension, this is some really strange
and half-assed intel-specific extension to nvme, which, like any other
vendor-specific non-standard extension to nvme, we refuse to support
in Linux.

There is a TPAR for live migration building blocks under discussion in
the NVMe technical working group. It will still require mediation
of access to the admin queue to deal with the huge amount of state nvme
has that needs to be migrated (and which doesn't seem to be covered at
all here). In Linux the equivalent would be to implement an mdev driver
that allows passing through the I/O queues to a guest, but it might
be a better idea to handle the device model emulation entirely in
Qemu (or other userspace device models) and just find a way to expose
enough of the I/O queues to userspace.

The current TPAR seems to be very complicated for that, as in many
cases we'd only need a way to tie certain namespaces to certain I/O
queues and not waste a lot of resources on the rest of the controller.

2022-12-06 13:14:52

by Jason Gunthorpe

Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

On Tue, Dec 06, 2022 at 07:26:04AM +0100, Christoph Hellwig wrote:
> all here). In Linux the equivalent would be to implement an mdev driver
> that allows passing through the I/O queues to a guest, but it might

Definitely not - "mdev" drivers should be avoided as much as possible.

In this case Intel has a real PCI SRIOV VF to expose to the guest,
with a full VF RID. The proper VFIO abstraction is the variant PCI
driver as this series does. We want to use the variant PCI drivers
because they properly encapsulate all the PCI behaviors (MSI, config
space, regions, reset, etc) without requiring re-implementation of this
in mdev drivers.

mdev drivers should only be considered if a real PCI VF is not
available - eg because the device is doing "SIOV" or something.

We have several migration drivers in VFIO now following this general
pattern, from what I can see they have done it broadly properly from a
VFIO perspective.

> be a better idea to handle the device model emulation entirely in
> Qemu (or other userspace device models) and just find a way to expose
> enough of the I/O queues to userspace.

This is much closer to the VDPA model, which basically provides some
kernel support to access the IO queue and a lot of SW in qemu to
generate the PCI device in the VM.

The approach has positives and negatives; we have done both in mlx5
devices and we have a preference toward the VFIO model. VDPA
specifically is very big and complicated compared to the VFIO
approach.

Overall, having fully functional PCI SRIOV VFs available lets more
use cases work than just "qemu to create a VM". qemu can always build
a VDPA-like thing by using VFIO and VFIO live migration to shift
control of the device between qemu and HW.

I don't think we know enough about this space at the moment to fix a
specification to one path or the other, so I hope the TPAR will settle
on something that can support both models in SW and people can try
things out.

Jason

2022-12-06 13:37:33

by Christoph Hellwig

Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

On Tue, Dec 06, 2022 at 09:05:05AM -0400, Jason Gunthorpe wrote:
> In this case Intel has a real PCI SRIOV VF to expose to the guest,
> with a full VF RID.

RID?

> The proper VFIO abstraction is the variant PCI
> driver as this series does. We want to use the variant PCI drivers
> because they properly encapsulate all the PCI behaviors (MSI, config
> space, regions, reset, etc) without requiring re-implementation of this
> in mdev drivers.

I don't think the code in this series has any chance of actually
working. There is a lot of state associated with an NVMe subsystem,
controller and namespace, such as the serial number, subsystem NQN,
namespace unique identifiers, Get/Set features state, pending AENs,
log page content. Just migrating from one device to another without
capturing all this has no chance of actually working.

> I don't think we know enough about this space at the moment to fix a
> specification to one path or the other, so I hope the TPAR will settle
> on something that can support both models in SW and people can try
> things out.

I've not seen anyone from Intel actually contributing to the live
migration TPAR, which is almost two months old by now.

2022-12-06 14:05:20

by Jason Gunthorpe

Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

On Tue, Dec 06, 2022 at 02:09:01PM +0100, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 09:05:05AM -0400, Jason Gunthorpe wrote:
> > In this case Intel has a real PCI SRIOV VF to expose to the guest,
> > with a full VF RID.
>
> RID?

"Requester ID" - PCI SIG term that in Linux basically means you get to
assign an iommu_domain to the vfio device.

Compared to an mdev, where many vfio devices will share the same RID and
cannot have iommu_domains without using PASID.

> > The proper VFIO abstraction is the variant PCI
> > driver as this series does. We want to use the variant PCI drivers
> > because they properly encapsulate all the PCI behaviors (MSI, config
> > space, regions, reset, etc) without requiring re-implementation of this
> > in mdev drivers.
>
> I don't think the code in this series has any chance of actually
> working. There is a lot of state associated with an NVMe subsystem,
> controller and namespace, such as the serial number, subsystem NQN,
> namespace unique identifiers, Get/Set features state, pending AENs,
> log page content. Just migrating from one device to another without
> capturing all this has no chance of actually working.

From what I understood this series basically allows two Intel devices
to pass a big opaque blob of data. Intel didn't document what is in
that blob, so I assume it captures everything you mention above.

At least, that is the approach we have taken with mlx5. Every single
bit of device state is serialized into the blob and when the device
resumes it is indistinguishable from the original. Otherwise it is a
bug.

Jason

2022-12-06 14:09:19

by Christoph Hellwig

Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Tue, Dec 06, 2022 at 09:44:08AM -0400, Jason Gunthorpe wrote:
> Not speaking to NVMe - but this driver is clearly copying mlx5's live
> migration driver, almost completely - including this basic function.

Maybe that's not a good idea in an NVMe environment, and maybe they
should have talked to the standards committee before spending their
time on cargo cult engineering.

Most importantly NVMe is very quiet on the relationship between
VFs and PFs, and there is no way to guarantee that a PF is, at the
NVMe level, much in control of a VF at all. In other words this
concept really badly breaks NVMe abstractions.

> Thus, mlx5 has the same sort of design where the VF VFIO driver
> reaches into the PF kernel driver and asks the PF driver to perform
> some commands targeting the PF's own VFs. The DMA is then done using
> the RID of the PF, and reaches the kernel owned iommu_domain of the
> PF. This way the entire operation is secure against meddling by the
> guest.

And that works for you because you have a clearly defined relationship.
In NVMe we do not have this. We'd either need to define a way
to query that relationship or find another way to deal with the
problem. But when in doubt, the best design would be to drive VF
live migration entirely from the PF and turn the lookup from controlled
function to controlling function upside down, that is, a list of
controlled functions (which could very well be, and in some designs
are, additional PFs and not VFs) per controlling function. In fact
NVMe already has that list in its architecture with the
"Secondary Controller List".

2022-12-06 14:43:19

by Jason Gunthorpe

Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

On Tue, Dec 06, 2022 at 03:00:02PM +0100, Christoph Hellwig wrote:
> > From what I understood this series basically allows two Intel devices
> > to pass a big opaque blob of data. Intel didn't document what is in
> > that blob, so I assume it captures everything you mention above.
>
> Which would be just as bad, because it then changes the IDs under
> the live OS on a restore. This is not something that can be done
> behind the back of the hypervisors / control plane OS.

Sorry, what live OS?

In the VFIO restore model there is no "live OS" on resume. The
load/resume cycle is as destructive as reset to the vfio device.

When qemu operates vfio the destination CPU will not be running until
the load/resume of all the VFIO devices is completed.

So from the VM perspective it sees no change at all, so long as
the data blob causes the destination vfio device to fully match the
source, including all IDs, etc.

Jason

2022-12-06 14:54:23

by Christoph Hellwig

Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

On Tue, Dec 06, 2022 at 09:52:54AM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 06, 2022 at 02:09:01PM +0100, Christoph Hellwig wrote:
> > On Tue, Dec 06, 2022 at 09:05:05AM -0400, Jason Gunthorpe wrote:
> > > In this case Intel has a real PCI SRIOV VF to expose to the guest,
> > > with a full VF RID.
> >
> > RID?
>
> "Requester ID" - PCI SIG term that in Linux basically means you get to
> assign an iommu_domain to the vfio device.

Yeah, I know the Requester ID, I've just never seen that shortcut for it.

> From what I understood this series basically allows two Intel devices
> to pass a big opaque blob of data. Intel didn't document what is in
> that blob, so I assume it captures everything you mention above.

Which would be just as bad, because it then changes the IDs under
the live OS on a restore. This is not something that can be done
behind the back of the hypervisors / control plane OS.

2022-12-06 14:54:50

by Jason Gunthorpe

Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

On Tue, Dec 06, 2022 at 03:31:26PM +0100, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 10:20:26AM -0400, Jason Gunthorpe wrote:
> > In the VFIO restore model there is no "live OS" on resume. The
> > load/resume cycle is as destructive as reset to the vfio device.
>
> Of course there may be an OS. As soon as the VF is live Linux
> will by default bind to it. And that's the big problem here,
> the VF should not actually exist or at least not be usable when
> such a restore happens - or to say it in NVMe terms, the Secondary
> Controller better be in offline state when state is loaded into it.

Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
use.

What we do have is a transfer of control from the normal OS driver (eg
nvme) to the VFIO driver. Also, remember, that VFIO only does live
migration between VFIO devices. We cannot use live migration and end
up with a situation where the normal nvme driver is controlling the
VF.

The VFIO load model is explicitly destructive. We replace the current
VF with the loading VF. Both the VFIO variant driver and the VFIO
userspace issuing the load have to be aware of this and understand
that the whole device will change.

From an implementation perspective, I would expect the nvme variant
driver to either place the nvme device in the correct state during
load, or refuse to execute load if it is in the wrong state.

To be compatible with what qemu is doing, the "right state" should be
entered by completing a function level reset of the VF.
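
For illustration, the kind of state check described above might look
roughly like this in a variant driver, assuming the VFIO v2 migration
state machine; the structure and function names are ours, not code from
any posted driver.

struct nvmevf_migration {
	enum vfio_device_mig_state mig_state;	/* current VFIO migration state */
};

/*
 * Sketch only: refuse to start loading device state unless the device is
 * stopped, e.g. after an FLR has put the VF into a known-clean state.
 */
static int nvmevf_start_resuming(struct nvmevf_migration *mig)
{
	if (mig->mig_state != VFIO_DEVICE_STATE_STOP)
		return -EINVAL;		/* wrong state for a load */

	mig->mig_state = VFIO_DEVICE_STATE_RESUMING;
	return 0;
}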

The Linux/qemu parts are still being finalized, so if you see
something that could be changed to better match nvme it would be a
great time to understand that.

Jason

2022-12-06 14:55:41

by Christoph Hellwig

Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

On Tue, Dec 06, 2022 at 10:20:26AM -0400, Jason Gunthorpe wrote:
> In the VFIO restore model there is no "live OS" on resume. The
> load/resume cycle is as destructive as reset to the vfio device.

Of course there may be an OS. As soon as the VF is live Linux
will by default bind to it. And that's the big problem here,
the VF should not actually exist or at least not be usable when
such a restore happens - or to say it in NVMe terms, the Secondary
Controller better be in offline state when state is loaded into it.

2022-12-06 15:24:52

by Christoph Hellwig

Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

On Tue, Dec 06, 2022 at 10:48:22AM -0400, Jason Gunthorpe wrote:
> Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
> use.

Beware: the secondary function might just as well be a physical
function. In fact one of the major customers for "smart" multifunction
nvme devices prefers multi-PF devices over SR-IOV VFs (and all the
symmetric dual ported devices are multi-PF as well).

So this isn't really about a VF life cycle, but about how to manage live
migration, especially on the receive / restore side. And restoring
the entire controller state is extremely invasive and can't be done
on a controller that is in any classic form live. In fact a lot
of the state is subsystem-wide, so without some kind of virtualization
of the subsystem it is impossible to actually restore the state.

To cycle back to the hardware that is posted here, I'm really confused
how it actually has any chance to work and no one has even tried
to explain how it is supposed to work.

2022-12-06 15:46:05

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

On Tue, Dec 06, 2022 at 11:28:12AM -0400, Jason Gunthorpe wrote:
> I'm interested as well, my mental model goes as far as mlx5 and
> hisilicon, so if nvme prevents the VFs from being contained units, it
> is a really big deviation from VFIO's migration design..

In NVMe the controller (which maps to a PCIe physical or virtual
function) is unfortunately not very self contained. A lot of
state is subsystem-wide, where the subsystem is, roughly speaking,
the container for all controllers that share storage. That is
the right thing to do for, say, dual ported SSDs that are used for
clustering or multi-pathing; for tenant isolation it is about
as wrong as it gets.

There is nothing in the NVMe spec that prohibits you from
implementing multiple subsystems for multiple functions of a PCIe
device, but if you do that there is absolutely no support in the
spec to manage shared resources or any other interaction between
them.

2022-12-06 15:47:15

by Jason Gunthorpe

Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Tue, Dec 06, 2022 at 02:58:10PM +0100, Christoph Hellwig wrote:

> Most importantly NVMe is very quiet on the relationship between
> VFs and PFs, and there is no way to guarantee that a PF is, at the
> NVMe level, much in control of a VF at all. In other words this
> concept really badly breaks NVMe abstractions.

Yeah, I think the spec effort is going to be interesting for sure.

From a pure Linux and implementation perspective a decision must be
made early on how to label the DMAs for kernel/qemu vs VM controlled
items at the PCI TLP level.

> controlled functions (which could very well be, and in some designs
> are, additional PFs and not VFs) by controlling function.

In principle PF vs VF doesn't matter much - the question is really TLP
labeling. If the spec says RID A is the controlling RID and RID B is
the guest RID, then it doesn't matter if they have a PF/VF
relationship or PF/PF relationship.

We have locking issues in Linux SW connecting different SW drivers for
things that are not a PF/VF relationship, but perhaps that can be
solved.

Using VF RID / VF PASID is appealing at first glance, but there is a
list of PCI emulation details that have to be worked out for that to
be good. eg what do you do with guest triggered FLR? Or guest
triggered memory disable? How do you handle PCIe AER? Also lack of
PASID support in CPUs is problematic.

Lots of trade offs..

Jason

2022-12-06 15:51:06

by Christoph Hellwig

Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Tue, Dec 06, 2022 at 11:22:33AM -0400, Jason Gunthorpe wrote:
> > controlled functions (which could very well be, and in some designs
> > are, additional PFs and not VFs) by controlling function.
>
> In principle PF vs VF doesn't matter much - the question is really TLP
> labeling. If the spec says RID A is the controlling RID and RID B is
> the guest RID, then it doesn't matter if they have a PF/VF
> relationship or PF/PF relationship.

Yes. Or in fact if you use PASIDs inside a single function.

> We have locking issues in Linux SW connecting different SW drivers for
> things that are not a PF/VF relationship, but perhaps that can be
> solved.

And I think the only reasonable answer is that the entire workflow
must be 100% managed from the controlling function, and the controlled
function is just around for a ride, with the controlling function
enabling/disabling it as needed without ever interacting with software
that directly deals with the controlled function.

2022-12-06 16:02:05

by Jason Gunthorpe

Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

On Tue, Dec 06, 2022 at 04:01:31PM +0100, Christoph Hellwig wrote:

> So this isn't really about a VF life cycle, but about how to manage live
> migration, especially on the receive / restore side. And restoring
> the entire controller state is extremely invasive and can't be done
> on a controller that is in any classic form live. In fact a lot
> of the state is subsystem-wide, so without some kind of virtualization
> of the subsystem it is impossible to actually restore the state.

I cannot speak to nvme, but for mlx5 the VF is largely a contained
unit so we just replace the whole thing.

From the PF there is some observability, eg the VF's MAC address is
visible and a few other things. So the PF has to re-synchronize after
the migration to get those things aligned.

> To cycle back to the hardware that is posted here, I'm really confused
> how it actually has any chance to work and no one has even tried
> to explain how it is supposed to work.

I'm interested as well, my mental model goes as far as mlx5 and
hisilicon, so if nvme prevents the VFs from being contained units, it
is a really big deviation from VFIO's migration design..

Jason

2022-12-06 16:37:05

by Jason Gunthorpe

Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Tue, Dec 06, 2022 at 04:38:11PM +0100, Christoph Hellwig wrote:

> > We have locking issues in Linux SW connecting different SW drivers for
> > things that are not a PF/VF relationship, but perhaps that can be
> > solved.
>
> And I think the only reasonable answer is that the entire workflow
> must be 100% managed from the controlling function, and the controlled
> function is just around for a ride, with the controlling function
> enabling/disabling it as needed without ever interacting with software
> that directly deals with the controlled function.

That is a big deviation from where VFIO is right now, the controlled
function is the one with the VFIO driver, it should be the one that
drives the migration uAPI components.

Adding another uAPI that can manipulate the same VFIO device from some
unrelated chardev feels wrong.

There are certain things that need to be co-ordinated for everything to
work. Like you can't suspend the VFIO device unless you promise to
also stop MMIO operations. Stuff like FLR interferes with the migration
operation and has to be co-ordinated. Some migration operation
failures, like load failure, have to be healed through FLR.

It really does not want to be two different uAPIs even if that is
convenient for the kernel.

I'd much rather try to fix the problems PASID brings than try to make
this work :\

Jason

2022-12-06 17:10:48

by Christoph Hellwig

Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Tue, Dec 06, 2022 at 11:51:23AM -0400, Jason Gunthorpe wrote:
> That is a big deviation from where VFIO is right now, the controlled
> function is the one with the VFIO driver, it should be the one that
> drives the migration uAPI components.

Well, that is one way to see it, but I think the more natural
way to deal with it is to drive everything from the controlling
function, because that is by definition much more in control.

More importantly any sane design will have easy ways to list and
manipulate all the controlled functions from the controlling
functions, while getting from the controlled function to the
controlling one is extremely awkward, as anything that can be
used for that is by definition an information leak. It seems
mlx5 just gets away with that by saying controlled functions are
always VFs, and the controlling function is a PF, but that will
break down very easily, especially once you want to nest the
controlling scheme (and yes, I'm not making this up, as the
nesting scheme is being proposed for nvme privately).

2022-12-06 18:32:43

by Dong, Eddie

Subject: RE: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device



> On Tue, Dec 06, 2022 at 11:28:12AM -0400, Jason Gunthorpe wrote:
> > I'm interested as well, my mental model goes as far as mlx5 and
> > hisilicon, so if nvme prevents the VFs from being contained units, it
> > is a really big deviation from VFIO's migration design..
>
> In NVMe the controller (which maps to a PCIe physical or virtual
> function) is unfortunately not very self contained. A lot of state is subsystem-
> wide, where the subsystem is, roughly speaking, the container for all
> controllers that share storage. That is the right thing to do for, say, dual
> ported SSDs that are used for clustering or multi-pathing; for tenant isolation
> it is about as wrong as it gets.


The NVMe spec is general, but the implementation details (such as internal state) may
be vendor specific. If the migration happens between two identical NVMe devices
(from the same vendor/device with the same firmware version), migration of
subsystem-wide state can be naturally covered, right?

>
> There is nothing in the NVMe spec that prohibits you from implementing
> multiple subsystems for multiple functions of a PCIe device, but if you do that
> there is absolutely no support in the spec to manage shared resources or any
> other interaction between them.

In the IPU/DPU area, it seems multiple VFs with SR-IOV are widely adopted.

For VFs, the usage of shared resources can be viewed as implementation
specific, and saving/loading the state of a VF can rely on the
hardware/firmware itself. Migration of NVMe devices across vendors/devices
is another story: it may be useful, but it brings additional challenges.

2022-12-06 19:58:35

by Jason Gunthorpe

Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Tue, Dec 06, 2022 at 05:55:03PM +0100, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 11:51:23AM -0400, Jason Gunthorpe wrote:
> > That is a big deviation from where VFIO is right now, the controlled
> > function is the one with the VFIO driver, it should be the one that
> > drives the migration uAPI components.
>
> Well, that is one way to see it, but I think the more natural
> way to deal with it is to drive everything from the controlling
> function, because that is by definition much more in control.

Sure, the controlling function should (and does in mlx5) drive
everything here.

What the kernel is doing is providing the abstraction to link the
controlling function to the VFIO device in a general way.

We don't want to just punt this problem to user space and say 'good
luck finding the right cdev for migration control'. If the kernel
struggles to link them then userspace will not fare better on its own.

Especially, we do not want every VFIO device to have its own crazy way
for userspace to link the controlling/controlled functions
together. This is something the kernel has to abstract away.

So, IMHO, we must assume the kernel is aware of the relationship,
whatever algorithm it uses to become aware.

It just means the issue is doing the necessary cross-subsystem
locking.

That combined with the fact they really are two halfs of the same
thing - operations on the controlling function have to be sequenced
with operations on the VFIO device - makes me prefer the single uAPI.

> More importantly any sane design will have easy ways to list and
> manipulate all the controlled functions from the controlling
> functions, while getting from the controlled function to the
> controlling one is extremely awkward, as anything that can be
> used for that is by definition an information leak.

We have spent some time looking at this for mlx5. It is a hard
problem. The VFIO driver cannot operate the device, eg it cannot do
MMIO and things, so it is limited to items in the PCI config space to
figure out the device identity.

> It seems mlx5 just gets away with that by saying controlled
> functions are always VFs, and the controlling function is a PF, but
> that will break down very easily,

Yes, that is part of the current mlx5 model. It is not inherent to the
device design, but the problems with arbitrary nesting were hard enough
that they were not tackled at this point.

Jason

2022-12-07 03:34:20

by Max Gurtovoy

Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.


On 12/6/2022 9:15 PM, Jason Gunthorpe wrote:
> On Tue, Dec 06, 2022 at 05:55:03PM +0100, Christoph Hellwig wrote:
>> On Tue, Dec 06, 2022 at 11:51:23AM -0400, Jason Gunthorpe wrote:
>>> That is a big deviation from where VFIO is right now, the controlled
>>> function is the one with the VFIO driver, it should be the one that
>>> drives the migration uAPI components.
>> Well, that is one way to see it, but I think the more natural
>> way to deal with it is to drive everything from the controlling
>> function, because that is by definition much more in control.
> Sure, the controlling function should (and does in mlx5) drive
> everything here.
>
> What the kernel is doing is providing the abstraction to link the
> controlling function to the VFIO device in a general way.
>
> We don't want to just punt this problem to user space and say 'good
> luck finding the right cdev for migration control'. If the kernel
> struggles to link them then userspace will not fare better on its own.
>
> Especially, we do not want every VFIO device to have its own crazy way
> for userspace to link the controlling/controlled functions
> together. This is something the kernel has to abstract away.
>
> So, IMHO, we must assume the kernel is aware of the relationship,
> whatever algorithm it uses to become aware.
>
> It just means the issue is doing the necessary cross-subsystem
> locking.

Agree. This is not the first time we have cross-subsystem interactions
in the kernel, right?

>
> That combined with the fact they really are two halfs of the same
> thing - operations on the controlling function have to be sequenced
> with operations on the VFIO device - makes me prefer the single uAPI.
>
>> More importantly any sane design will have easy ways to list and
>> manipulate all the controlled functions from the controlling
>> functions, while getting from the controlled function to the
>> controlling one is extremely awkward, as anything that can be
>> used for that is by definition an information leak.
> We have spend some time looking at this for mlx5. It is a hard
> problem. The VFIO driver cannot operate the device, eg it cannot do
> MMIO and things, so it is limited to items in the PCI config space to
> figure out the device identity.

Christoph,

I'm not sure how awkward it is for the migration driver to ask the
controlling device driver to perform a migration action.

The controlling device driver can expose a limited API for that matter.

I don't see why it is wrong to submit some admin commands between
subsystems - we already have a way to send admin commands from the
Linux CLI or from nvmet drivers to an NVMe device.

Also, the concept of a primary controller that controls its secondary
controllers is already in the spec, so we can use it. It's not introduced
in this RFC, but we're here to fix that.

In our case the primary controller is the PF and the secondary
controllers are the VFs.

If we think of a concept of new optional "Migration Management"
admin commands, supported by primary controllers to manage the migration
process of their secondary controllers, we can at least solve the initial
step of migrating VFs via their parent PF.

>
>> It seems mlx5 just gets away with that by saying controlled
>> functions are always VFs, and the controlling function is a PF, but
>> that will break down very easily,
> Yes, that is part of the current mlx5 model. It is not inherent to the
> device design, but the problems with arbitary nesting were hard enough
> they were not tackled at this point.

Agree. There are a lot of challenges even in non-nested migration, so we
can just limit the NVMe spec to it and extend it later on - like we did
with other features such as Copy Offload.

Most of the infrastructure work for LM was done in the VFIO subsystem over
the last year, so we can now focus on the architecture aspects of NVMe.

>
> Jason

2022-12-07 08:09:50

by Christoph Hellwig

Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Wed, Dec 07, 2022 at 04:30:20AM +0200, Max Gurtovoy wrote:
> I'm not sure how awkward is for migration driver to ask the controlling
> device driver to operate a migration action.

It can't. That's the whole point. The controlled function that is
being migrated must be absolutely unaware of that (except for things
like quiescing access or FLRs that could happen anyway), because
otherwise you have a fundamental information leak.

> The controlling device driver can expose limited API for that matter.

No, it can't. It must be in charge.

> Also the concept of primary controller that control it's secondary
> controllers is already in the SPEC so we can use it. It's not introduced in
> this RFC but we're here to fix it.

Yes, it is, as I've pointed out multiple times. But that relationship
is only visible to the primary controller. It also doesn't help with
the general problem where the secondary controller must be able to
completely change its identity and thus the subsystem.

> In our case the primary controller is the PF and the secondary controllers
> are the VFs.

Yes, that's your case, and probably a very common one. But also far from
the only one, so there is no way Linux or the specification can rely
on that dumb fact. Never mind that there are virtualization schemes
(look at the s390 code) where the PF to VF relationship gets lost.

2022-12-07 08:10:57

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Tue, Dec 06, 2022 at 03:15:41PM -0400, Jason Gunthorpe wrote:
> What the kernel is doing is providing the abstraction to link the
> controlling function to the VFIO device in a general way.
>
> We don't want to just punt this problem to user space and say 'good
> luck finding the right cdev for migration control'. If the kernel
> struggles to link them then userspace will not fare better on its own.

Yes. But the right interface for that is to issue the userspace
commands for anything that is not at the normal PCIe function level
to the controlling function, and to discover the controlled functions
based on the controlling functions.

In other words: there should be absolutely no need to have any
special kernel support for the controlled function. Instead the
controlling function enumerates all the functions it controls, exports
that to userspace, and exposes the functionality to save state from
and restore state to the controlled functions.

> Especially, we do not want every VFIO device to have its own crazy way
> for userspace to link the controlling/controlled functions
> together. This is something the kernel has to abstract away.

Yes. But the direction must go controlling to controlled, not the
other way around.

2022-12-07 11:06:07

by Max Gurtovoy

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.


On 12/7/2022 9:54 AM, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 03:15:41PM -0400, Jason Gunthorpe wrote:
>> What the kernel is doing is providing the abstraction to link the
>> controlling function to the VFIO device in a general way.
>>
>> We don't want to just punt this problem to user space and say 'good
>> luck finding the right cdev for migration control'. If the kernel
>> struggles to link them then userspace will not fare better on its own.
> Yes. But the right interface for that is to issue the userspace
> commands for anything that is not normal PCIe function level
> to the controlling funtion, and to discover the controlled functions
> based on the controlling functions.
>
> In other words: there should be absolutely no need to have any
> special kernel support for the controlled function. Instead the
> controlling function enumerates all the function it controls exports
> that to userspace and exposes the functionality to save state from
> and restore state to the controlled functions.

Why is it preferred that the migration SW talks directly to the PF
and not via the VFIO interface?

It's just an implementation detail.

I feel like it even sounds more reasonable to have a common API like
we have today - save_state/resume_state/quiesce_device/freeze_device -
where each device implementation translates this functionality to its
own SPEC.
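
For illustration only, the kind of callback set being suggested might
look like this (the names mirror the prose above; this is not an
existing kernel interface):

#include <linux/types.h>

/* filled in by each vendor/protocol driver (mlx5/hisi/nvme/...), which
 * translates the calls into its own wire commands */
struct lm_device_ops {
        int (*quiesce_device)(void *priv);      /* stop accepting new work */
        int (*freeze_device)(void *priv);       /* stop internal state changes */
        int (*save_state)(void *priv, void *buf, size_t len);
        int (*resume_state)(void *priv, const void *buf, size_t len);
};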

If I understand correctly, your direction is to have QEMU code talk to
nvmecli/new_mlx5cli/my_device_cli to do that, and I'm not sure that's needed.

The controlled device is not aware of any of the migration process. Only
the migration SW, system admin and controlling device.

I see 2 orthogonal discussions here: NVMe standardization for LM and
Linux implementation for LM.

For the NVMe standardization: I think we all agree, at a high level, that
the primary controller manages the LM of its secondary controllers. The
primary controller can list the secondary controllers. The primary
controller exposes APIs through its admin queue to manage the LM process
of its secondary controllers. LM capabilities will be exposed using the
identify_ctrl admin cmd of the primary controller.

For the Linux implementation: the direction we started last year is to
have vendor-specific (mlx5/hisi/..) or protocol-specific
(nvme/virtio/..) vfio drivers. We built an infrastructure to do that by
dividing the vfio_pci driver into vfio_pci and vfio_pci_core and updated
the uAPIs as well to support the P2P case for live migration. Dirty page
tracking is also progressing. More work is still to be done to improve
this infrastructure for sure.
I hope that all the above efforts are going to be used with the NVMe LM
implementation as well, unless there is something NVMe-specific in the
way of migrating PCI functions that I can't see now.
If there is something that is NVMe-specific for LM then the migration SW
and QEMU will need to be aware of that, and with this awareness we lose
the benefit of the generic VFIO interface.

>
>> Especially, we do not want every VFIO device to have its own crazy way
>> for userspace to link the controlling/controlled functions
>> together. This is something the kernel has to abstract away.
> Yes. But the direction must go controlling to controlled, not the
> other way around.

So in the source:

1. We enable SRIOV on the NVMe driver

2. We list all the secondary controllers: nvme1, nvme2, nvme3

3. We allow migrating nvme1, nvme2, nvme3 - now these VFs are migratable
(controlling to controlled).

4. We bind nvme1, nvme2, nvme3 to VFIO NVMe driver

5. We pass these functions to VM

6. We start migration process.


And in the destination:

1. We enable SRIOV on the NVMe driver

2. We list all the secondary controllers: nvme1, nvme2, nvme3

3. We allow migration resume to nvme1, nvme2, nvme3 - now these VFs are
resumable (controlling to controlled).

4. We bind nvme1, nvme2, nvme3 to VFIO NVMe driver

5. We pass these functions to VM

6. We start migration resume process.



2022-12-07 14:02:32

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Wed, Dec 07, 2022 at 12:59:00PM +0200, Max Gurtovoy wrote:
> Why is it preferred that the migration SW will talk directly to the PF and
> not via VFIO interface ?

It should never talk directly to any hardware, but through a kernel
interface, and that's probably vfio. But that interface needs to be
centered around the controlling function for all the reasons I've
written down multiple times now.

> It's just an implementation detail.

No, it's not. While you could come up with awkward ways to map how
the hardware interface must work to a completely contrary kernel
interface, that's just going to create the need for lots of boilerplate
code _and_ confuse users. The function that is being migrated can
fundamentally not be in control of itself. Any interface that pretends
it is, is broken and a long-term nightmare for users and implementers.

> I feel like it's even sounds more reasonable to have a common API like we
> have today to save_state/resume_state/quiesce_device/freeze_device and each
> device implementation will translate this functionality to its own SPEC.

Absolutely.

> If I understand your direction is to have QEMU code to talk to
> nvmecli/new_mlx5cli/my_device_cli to do that and I'm not sure it's needed.

No.

> The controlled device is not aware of any of the migration process. Only
> the migration SW, system admin and controlling device.

Exactly.

> So in the source:
>
> 1. We enable SRIOV on the NVMe driver

Again. Nothing in live migration is tied to SR-IOV at all. SR-IOV
is just one way to get multiple functions.

> 2. We list all the secondary controllers: nvme1, nvme2, nvme3
>
> 3. We allow migrating nvme1, nvme2, nvme3 - now these VFs are migratable
> (controlling to controlled).
>
> 4. We bind nvme1, nvme2, nvme3 to VFIO NVMe driver
>
> 5. We pass these functions to VM

And you need to pass the controlling function (or rather a handle for
it), because there is absolutely no sane way to discover that from
the controlled function, as it can't have that information by the
fact that it is being passed to unprivileged VMs.

2022-12-07 14:21:04

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Wed, Dec 07, 2022 at 09:34:14AM -0400, Jason Gunthorpe wrote:
> The VFIO design assumes that the "vfio migration driver" will talk to
> both functions under the hood, and I don't see a fundamental problem
> with this beyond it being awkward with the driver core.

And while that is a fine concept per se, the current incarnation of
that is fundamentally broken as it is centered around the controlled
VM. Which really can't work.

> Even the basic assumption that there would be a controlling/controlled
> relationship is not universally true. The mdev type drivers, and
> SIOV-like devices are unlikely to have that. Once you can use PASID
> the reasons to split things at the HW level go away, and a VF could
> certainly self-migrate.

Even then you need a controlling and a controlled entity. The
controlling entity even in SIOV remains a PCIe function. The
controlled entity might just be a bunch of hardware resources and
a PASID. Making it important again that all migration is driven
by the controlling entity.

Also the whole concept that only VFIO can do live migration is
a little bogus. With checkpoint and restart it absolutely
does make sense to live migrate a container, and with that
the hardware interface (e.g. nvme controller) assigned to it.

> So, when you see both Intel and Pensando proposing this kind of
> layered model for NVMe where migration is subsystem-local to VFIO, I
> think this is where the inspiration is coming from. Their native DPU
> drivers already work this way.

Maybe they should have talked to someone not high on their own
supply before designing this.

2022-12-07 14:23:02

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Wed, Dec 07, 2022 at 08:54:15AM +0100, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 03:15:41PM -0400, Jason Gunthorpe wrote:
> > What the kernel is doing is providing the abstraction to link the
> > controlling function to the VFIO device in a general way.
> >
> > We don't want to just punt this problem to user space and say 'good
> > luck finding the right cdev for migration control'. If the kernel
> > struggles to link them then userspace will not fare better on its own.
>
> Yes. But the right interface for that is to issue the userspace
> commands for anything that is not normal PCIe function level
> to the controlling funtion, and to discover the controlled functions
> based on the controlling functions.

I don't think we should mix up how the HW works, what PCI function the
commands are executed at, with how the uAPI is designed.

The VFIO design assumes that the "vfio migration driver" will talk to
both functions under the hood, and I don't see a fundamental problem
with this beyond it being awkward with the driver core.

It is done this way deliberately. When we did the work on it we found
there are problems with some device models, like when you suspend them
and then wrongly touch them via MMIO, they can trigger an AER and maybe
even a machine check crash. One of the roles of the VFIO driver is to fix
these HW bugs and make the uAPI safe, e.g. by revoking mmaps or
whatever.

Even more importantly, we do not want migration to ever be operated
unless VFIO is in control of the device. In general, migration resume
basically allows userspace to command the device to do effectively any
DMA. This is a kernel security problem if the device is not
constrained by a user iommu_domain - for security we must not allow
userspace to resume a VF that is under kernel control and potentially
linked to a passthrough iommu_domain.

VFIO provides the security model to make all of this safe - the two
concepts cannot become disconnected. Even if we create a new migration
uAPI it just means that the nvme driver has to be awkwardly aware of
VFIO VF drivers and interlock with their state, and the uAPI is
useless unless you already have a VFIO FD open.

Even the basic assumption that there would be a controlling/controlled
relationship is not universally true. The mdev type drivers, and
SIOV-like devices are unlikely to have that. Once you can use PASID
the reasons to split things at the HW level go away, and a VF could
certainly self-migrate.

VFIO's goal is to abstract all of the above stuff. You get one char
device that inherently provides the security guarantees required to
operate migration. It provides all the necessary hooks to fix up HW
issues, which so far every merged device has:

- mlx5 has a weird issue where FLR on the VF resets the migration
context, we fix that in the VFIO driver (in mlx5 even though the
commands are issued via the PF they are logically executed inside
the VF context)

- hi-silicon has security problems because it doesn't have
controlling/controlled, so it needs to carve out BAR MMIO maps and
other stuff

So while I agree that, in principle, migration and the VFIO VF are
separate concerns, our broken reality makes them linked.

Even the idea that started this thread - that PF/PF could be a problem
- seems to have been explored by the Pensando RFC which is using the
aux devices to connect arbitrary PF/PF for their model.

The logical model we have been using on complex multi-function devices
(like every DPU driver) has been to split the driver into subsystem
local code and thread all the pieces together with aux devices.

The model has a PCI driver that operates the lowest level of the
device, eg the 'admin queue' and then aux devices create
subsystem-local drivers (netdev, rdma, vdpa, iscsi, vfio, etc, etc)
that ride on the common API exported by the PCI driver.
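
As a rough illustration of that layering (none of this is code from the
drivers mentioned above; the nvmex_* names and the adminq ops are
invented for the sketch), the controlling PCI driver could publish its
admin-queue service through an auxiliary device that a subsystem-local
VFIO live migration driver then binds to:

#include <linux/auxiliary_bus.h>
#include <linux/pci.h>
#include <linux/slab.h>

/* the "common API" the PCI driver exports to its subsystem-local drivers */
struct nvmex_adminq_ops {
        /* issue an opaque admin command on behalf of a controlled function */
        int (*submit)(struct pci_dev *pf, u16 ctrl_id, const void *cmd,
                      void *buf, size_t len);
};

struct nvmex_aux {
        struct auxiliary_device adev;
        struct pci_dev *pf;                     /* controlling function */
        const struct nvmex_adminq_ops *ops;
};

static void nvmex_aux_release(struct device *dev)
{
        kfree(container_of(dev, struct nvmex_aux, adev.dev));
}

/* called from the controlling PCI driver's probe(): publish the vfio piece */
static int nvmex_spawn_vfio_aux(struct pci_dev *pf,
                                const struct nvmex_adminq_ops *ops)
{
        struct nvmex_aux *aux;
        int ret;

        aux = kzalloc(sizeof(*aux), GFP_KERNEL);
        if (!aux)
                return -ENOMEM;

        aux->pf = pf;
        aux->ops = ops;
        aux->adev.name = "lm-vfio";             /* matched as "<module>.lm-vfio" */
        aux->adev.id = 0;
        aux->adev.dev.parent = &pf->dev;
        aux->adev.dev.release = nvmex_aux_release;

        ret = auxiliary_device_init(&aux->adev);
        if (ret) {
                kfree(aux);
                return ret;
        }

        ret = auxiliary_device_add(&aux->adev);
        if (ret)
                auxiliary_device_uninit(&aux->adev);
        return ret;
}

The VFIO side would then register an auxiliary_driver matching "lm-vfio"
and reach the device only through the exported ops, never through the
controlled function itself.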

So, when you see both Intel and Pensando proposing this kind of
layered model for NVMe where migration is subsystem-local to VFIO, I
think this is where the inspiration is coming from. Their native DPU
drivers already work this way.

Jason

2022-12-07 15:18:51

by Max Gurtovoy

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.


On 12/7/2022 3:46 PM, Christoph Hellwig wrote:
> On Wed, Dec 07, 2022 at 12:59:00PM +0200, Max Gurtovoy wrote:
>> Why is it preferred that the migration SW will talk directly to the PF and
>> not via VFIO interface ?
> It should never talk directly to any hardware, but through a kernel
> interface, and that's probably vfio. But that interface needs to
> centered around the controlling function for all the reasons I've
> written down multiple times now.
>
>> It's just an implementation detail.
> No, it's not. While you could come up with awkward ways to map how
> the hardware interface must work to a completely contrary kernel
> interface that's just going to create the need for lots of boilerplate
> code _and_ confuses users. The function that is beeing migrated can
> fundamentally not be in control of itself. Any interface that pretends
> it is broken and a long term nightmare for users and implementers.

We're defining the SPEC and interfaces now :)

Below is a possible direction I can think of.

>> I feel like it's even sounds more reasonable to have a common API like we
>> have today to save_state/resume_state/quiesce_device/freeze_device and each
>> device implementation will translate this functionality to its own SPEC.
> Absolutely.
>
>> If I understand your direction is to have QEMU code to talk to
>> nvmecli/new_mlx5cli/my_device_cli to do that and I'm not sure it's needed.
> No.
great.
>
>> The controlled device is not aware of any of the migration process. Only
>> the migration SW, system admin and controlling device.
> Exactly.
>
>> So in the source:
>>
>> 1. We enable SRIOV on the NVMe driver
> Again. Nothing in live migration is tied to SR-IOV at all. SR-IOV
> is just one way to get multiple functions.

Sure.

It's just an example. It can be some mdev.

>
>> 2. We list all the secondary controllers: nvme1, nvme2, nvme3
>>
>> 3. We allow migrating nvme1, nvme2, nvme3 - now these VFs are migratable
>> (controlling to controlled).
>>
>> 4. We bind nvme1, nvme2, nvme3 to VFIO NVMe driver
>>
>> 5. We pass these functions to VM
> And you need to pass the controlling function (or rather a handle for
> it), because there is absolutely no sane way to discover that from
> the controlled function as it can't have that information by the
> fact that it is beeing passed to unprivilged VMs.

Just thinking out loud:

When we perform step #3 we are narrowing its scope and maybe some of the
caps that you're concerned about. After this setting, the controlled
function is in LM mode (we should define what that means in order to be
able to migrate it correctly) and the controlling function is its
migration master. Both can be aware of that. The only one that can master
the controlled function is the controlling function in LM mode. Thus, it
will be easy to keep that handle inside the kernel for VFs and for MDEVs
as well.
Although I'm not against passing this handle to the migration SW somehow
on the QEMU command line, I still can't completely agree it's necessary.

2022-12-07 16:09:26

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Wed, Dec 07, 2022 at 02:52:03PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 07, 2022 at 09:34:14AM -0400, Jason Gunthorpe wrote:
> > The VFIO design assumes that the "vfio migration driver" will talk to
> > both functions under the hood, and I don't see a fundamental problem
> > with this beyond it being awkward with the driver core.
>
> And while that is a fine concept per see, the current incarnation of
> that is fundamentally broken is it centered around the controlled
> VM. Which really can't work.

I don't see why you keep saying this. It is centered around the struct
vfio_device object in the kernel, which is definitely NOT the VM.

The struct vfio_device is the handle for the hypervisor to control
the physical assigned device - and it is the hypervisor that controls
the migration.

We do not need the hypervisor userspace to have a handle to the hidden
controlling function. It provides no additional functionality,
security or insight to what qemu needs to do. Keeping that
relationship abstracted inside the kernel is a reasonable choice and
is not "fundamentally broken".

> > Even the basic assumption that there would be a controlling/controlled
> > relationship is not universally true. The mdev type drivers, and
> > SIOV-like devices are unlikely to have that. Once you can use PASID
> > the reasons to split things at the HW level go away, and a VF could
> > certainly self-migrate.
>
> Even then you need a controlling and a controlled entity. The
> controlling entity even in SIOV remains a PCIe function. The
> controlled entity might just be a bunch of hardware resoures and
> a PASID. Making it important again that all migration is driven
> by the controlling entity.

If they are the same driver implementing vfio_device you may be able
to claim they conceptually exist, but it is pretty artificial to draw
this kind of distinction inside a single driver.

> Also the whole concept that only VFIO can do live migration is
> a little bogus. With checkpoint and restart it absolutely
> does make sense to live migrate a container, and with that
> the hardware interface (e.g. nvme controller) assigned to it.

I agree people may want to do this, but it is very unclear how SRIOV
live migration can help do this.

SRIOV live migration is all about not disturbing the kernel driver,
assuming it is the same kernel driver on both sides. If you have two
different kernels there is nothing worth migrating. There isn't even
an assurance the DMA API will have IOMMU-mapped the same objects to
the same IOVAs, e.g. you have to re-establish your admin queue, IO
queues, etc. after migration anyhow.

Let alone how to solve the security problems of allowing userspace to
load arbitrary FW blobs into a device with potentially insecure DMA
access..

At that point it isn't really the same kind of migration.

Jason

2022-12-07 16:39:36

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Wed, Dec 07, 2022 at 04:50:00PM +0200, Max Gurtovoy wrote:
> When we perform step #3 we are narrowing it's scope and maybe some caps
> that you're concerned of. After this setting, the controlled function is in
> LM mode (we should define what does that mean in order to be able to
> migrate it correctly) and the controlling function is the migration master
> of it. Both can be aware of that. The only one that can master the
> controlled function is the controlling function in LM mode. Thus, it will
> be easy to keep that handle inside the kernel for VFs and for MDEVs as
> well.

Maybe. So you'd introduce a kernel linkage that both sides would have
to be part of?

2022-12-07 16:59:53

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Wed, Dec 07, 2022 at 11:07:11AM -0400, Jason Gunthorpe wrote:
> > And while that is a fine concept per see, the current incarnation of
> > that is fundamentally broken is it centered around the controlled
> > VM. Which really can't work.
>
> I don't see why you keep saying this. It is centered around the struct
> vfio_device object in the kernel, which is definately NOT the VM.

Sorry, I meant VF. Your continued use of SR-IOV terminology now keeps
confusing my mind so much that I start mistyping things.

> > Even then you need a controlling and a controlled entity. The
> > controlling entity even in SIOV remains a PCIe function. The
> > controlled entity might just be a bunch of hardware resoures and
> > a PASID. Making it important again that all migration is driven
> > by the controlling entity.
>
> If they are the same driver implementing vfio_device you may be able
> to claim they conceptually exist, but it is pretty artificial to draw
> this kind of distinction inside a single driver.

How are they in this reply? I can't parse how this even relates to
what I wrote.

> > Also the whole concept that only VFIO can do live migration is
> > a little bogus. With checkpoint and restart it absolutely
> > does make sense to live migrate a container, and with that
> > the hardware interface (e.g. nvme controller) assigned to it.
>
> I agree people may want to do this, but it is very unclear how SRIOV
> live migration can help do this.

SRIOV live migration doesn't, because honestly there is no such
thing as "SRIOV" live migration to start with.

> Let alone how to solve the security problems of allow userspace to
> load arbitary FW blobs into a device with potentially insecure DMA
> access..

Any time you assign a PCI device to userspace you might get into that.

2022-12-07 17:40:29

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Wed, Dec 07, 2022 at 05:38:57PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 07, 2022 at 11:07:11AM -0400, Jason Gunthorpe wrote:
> > > And while that is a fine concept per see, the current incarnation of
> > > that is fundamentally broken is it centered around the controlled
> > > VM. Which really can't work.
> >
> > I don't see why you keep saying this. It is centered around the struct
> > vfio_device object in the kernel, which is definately NOT the VM.
>
> Sorry, I meant VF. Your continued using of SR-IOV teminology now keeps
> confusing my mind so much that I start mistyping things.

Well, what words do you want to use?

Regardless of VF/VM, it doesn't matter - my point is that the
vfio_device is the hypervisor control for *whatever* is under the
vfio_device and it is not desirable to break it up along arbitrary HW
boundaries.

I've given lots of reasons why not to do this now.

I strongly suspect it can work technically - as ugly as it is Pensando
shows an approach.

So I don't think I've learned anything more about your objection.

"fundamentally broken" doesn't help

It is a major point, because if we are going to have to rip apart all
the uAPI we built here and are completing the qemu work for, we need
to come to an understanding very soon.

And given how difficult it has been to get to this point, I want a
*really* good reason why we have to start again, and rewrite all the
drivers that exist and are coming.

Jason

2022-12-07 18:56:56

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Wed, Dec 07, 2022 at 01:31:44PM -0400, Jason Gunthorpe wrote:
> > Sorry, I meant VF. Your continued using of SR-IOV teminology now keeps
> > confusing my mind so much that I start mistyping things.
>
> Well, what words do you want to use?

The same I've used through this whole thread: controlling and
controlled function.

> So I don't think I've learned anything more about your objection.
>
> "fundamentally broken" doesn't help

The objection is that:

- in hardware fundamentally only the controlling function can
control live migration features on the controlled function,
because the controlled function is assigned to a VM which has
control over it.
- for the same reason there is no portable way to even find
the controlling function from a controlled function, unless
you want to equate PF = controlling and VF = controlled,
and even that breaks down in some corner cases
- if you want to control live migration from the controlled
VM you need a new vfio subdriver for a function that has
absolutely no new functionality itself related to live
migration (quirks for bugs excepted).

So by this architecture you build a convoluted mess where you
need tons of vfio subdrivers that mostly just talk to the
driver for the controlling function, which they can't even
sanely discover. That's what I keep calling fundamentally
broken.

2022-12-07 20:44:06

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Wed, Dec 07, 2022 at 07:33:33PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 07, 2022 at 01:31:44PM -0400, Jason Gunthorpe wrote:
> > > Sorry, I meant VF. Your continued using of SR-IOV teminology now keeps
> > > confusing my mind so much that I start mistyping things.
> >
> > Well, what words do you want to use?
>
> The same I've used through this whole thread: controlling and
> controlled function.
>
> > So I don't think I've learned anything more about your objection.
> >
> > "fundamentally broken" doesn't help
>
> The objection is that:
>
> - in hardware fundamentally only the controlling funtion can
> control live migration features on the controlled function,
> because the controlled function is assigned to a VM which has
> control over it.

Yes

However hisilicon managed to do their implementation without this, or
rather you could say their "controlling function" is a single MMIO BAR
page in their PCI VF and their "controlled function" is the rest of
the PCI VF.

> - for the same reason there is no portable way to even find
> the controlling function from a controlled function, unless
> you want to equate PF = controlling and VF = controlled,
> and even that breaks down in some corner cases

As you say, the kernel must know the relationship between
controlling->controlled. Nothing else is sane.

If the kernel knows this information then we can find a way for the
vfio_device to have pointers to both controlling and controlled
objects. I have a suggestion below.

> - if you want to control live migration from the controlled
> VM you need a new vfio subdriver for a function that has
> absolutely no new functionality itself related to live
> migration (quirks for bugs excepted).

I see it differently, the VFIO driver *is* the live migration
driver. Look at all the drivers that have come through and they are
99% live migration code. They have, IMHO, properly split the live
migration concern out of their controlling/PF driver and placed it in
the "VFIO live migration driver".

We've done a pretty good job of allowing the VFIO live migration
driver to pretty much just focus on live migration stuff and delegate
the VFIO part to library code.

Excepting quirks and bugs sounds nice, except we actually can't ignore
them. Having all the VFIO capabilities is exactly how we are fixing
the quirks and bugs in the first place, and I don't see your vision
how we can continue to do that if we split all the live migration code
into yet another subsystem.

For instance how do I trap FLR like mlx5 must do if the
drivers/live_migration code cannot intercept the FLR VFIO ioctl?

How do I mangle and share the BAR like hisilicon does?

Which is really why this is in VFIO in the first place. It actually is
coupled in practice, if not in theory.

> So by this architecture you build a convoluted mess where you need
> tons of vfio subdrivers that mostly just talk to the driver for the
> controlling function, which they can't even sanely discover. That's
> what I keep calling fundamentally broken.

The VFIO live migration drivers will look basically the same if you
put them under drivers/live_migration. This cannot be considered a
"convoluted mess" as splitting things by subsystem is best-practice,
AFAIK.

If we accept that drivers/vfio can be the "live migration subsystem"
then let's go all the way and have the controlling driver call
vfio_device_group_register() to create the VFIO char device for the
controlled function.

This solves the "sanely discover" problem because of course the
controlling function driver knows what the controlled function is and
it can acquire both functions before it calls
vfio_device_group_register().
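
As a sketch of that flow (vfio_device_group_register() is the
placeholder name used above, not a function that exists in the kernel
today, and the nvmex_* structure is invented for the example):

#include <linux/pci.h>
#include <linux/vfio.h>

/* hypothetical per-function wrapper owned by the controlling driver */
struct nvmex_vfio_dev {
        struct vfio_device vdev;        /* the VFIO char device to be created */
        struct pci_dev *controlling;    /* e.g. the NVMe PF */
        struct pci_dev *controlled;     /* e.g. one of its VFs */
};

/*
 * Called by the controlling driver once it has provisioned the controlled
 * function and enabled live migration for it, so both pointers are known
 * before userspace ever sees the VFIO device.
 */
static int nvmex_publish_controlled_function(struct nvmex_vfio_dev *ndev)
{
        return vfio_device_group_register(&ndev->vdev); /* placeholder API */
}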

This is actually what I want to do anyhow for SIOV-like functions and
VFIO. Doing it for PCI VFs (or related PFs) is very nice symmetry. I
really dislike that our current SRIOV model in Linux forces the VF to
instantly exist without a chance for the controlling driver to
provision it.

We have some challenges on how to do this in the kernel, but I think
we can overcome them. VFIO is ready for this thanks to all the
restructuring work we already did.

I'd really like to get away from VFIO having to do all this crazy
sysfs crap to activate its driver. I think there is a lot of appeal to
having, say, an nvmecli command that just commands the controlling
driver to provision a function, enable live migration, configure it
and then make it visible via VFIO. The same API regardless of whether
the underlying controlled function technology is PF/VF/SIOV.

At least we have been very interested in doing that for networking
devices.

Jason

2022-12-09 02:11:52

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

> From: Christoph Hellwig <[email protected]>
> Sent: Tuesday, December 6, 2022 9:09 PM
>
>
> > I don't think we know enough about this space at the moment to fix a
> > specification to one path or the other, so I hope the TPAR will settle
> > on something that can support both models in SW and people can try
> > things out.
>
> I've not seen anyone from Intel actually contributing to the live
> migration TPAR, which is almost two month old by now.

Not an NVMe guy, but obviously Intel should join. I have forwarded this
to internal folks to check the actual status.

And I also asked them to prepare a document explaining how this
opaque cmd actually works to address your concerns.

2022-12-09 02:15:06

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

> From: Christoph Hellwig <[email protected]>
> Sent: Wednesday, December 7, 2022 3:59 PM
>
> On Wed, Dec 07, 2022 at 04:30:20AM +0200, Max Gurtovoy wrote:
> > I'm not sure how awkward is for migration driver to ask the controlling
> > device driver to operate a migration action.
>
> It can't. That's the whole point. The controlled function that is
> being migrate must be absolutely unaware of that (except for things
> like quiescing access or FLRs that could happen anyway), because
> otherwise your have a fundamental information leak.
>

Can you elaborate which information is leaked?

No matter who provides the uAPI (controlling or controlled), at the end
of the day it's the same cmd invoked on the controlling device to save/
restore the state of the controlled device itself.

2022-12-09 03:42:36

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, December 8, 2022 4:08 AM
>
> For instance how do I trap FLR like mlx5 must do if the
> drivers/live_migration code cannot intercept the FLR VFIO ioctl?
>
> How do I mangle and share the BAR like hisilicon does?
>
> Which is really why this is in VFIO in the first place. It actually is
> coupled in practice, if not in theory.
>

Those are good questions, and they are also why I buy into staying with
the current VFIO framework.

Actually the same controlling vs. controlled design choice also exists
on the hardware side. There are plenty of SR-IOV devices supporting
doorbells for the VF (controlled function) to call services on the PF
(controlling function) while the doorbell interface is implemented on
the VF side.

If we follow the direction of having the controlling function explicitly
provide services, then all those doorbells should be deprecated
and instead we want explicit communication between the VF driver
and the PF driver.

From the userspace driver's p.o.v. the VFIO uAPI is kind of a device
programming interface. Here we just present everything related
to the controlled device itself (including the state management)
via a centralized portal, though in the kernel there might be linkage
out of VFIO to reach the controlling driver - kind of a SW doorbell.

BTW, just to add more background to this work: half a year ago Lei
actually did a flavor [1] the other way.

The controlling function of Intel IPU card also supports a network gRPC
protocol to manage the state of controlled NVMe function.

Then that series attempted to introduce an out-of-band migration
framework in QEMU so that instead of doing in-line state management
via the kernel VFIO uAPI, QEMU could turn to an external plugin which
forwards the state cmd via gRPC to the controlling function.

Just that the controlling driver is not in the kernel.

It was dropped as the inline way was preferred.

Thanks
Kevin

[1] https://lore.kernel.org/all/[email protected]/T/

2022-12-09 17:30:04

by Li, Yadong

[permalink] [raw]
Subject: RE: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

>From: Tian, Kevin <[email protected]>

>Not a NVMe guy but obviously Intel should join. I have forwarded this
>to internal folks to check the actual status.
The Intel team has been actively participating in the TPAR definition over the last month.
The key leadership in the NVMe WG, Peter and Nick (from Intel), are on top of this TPAR,
and are helping shape the scope of the TPAR to better support IPU/DPUs. The IPU/DPU is a
new type of device that desperately needs a standards-based live migration solution.
Soon, you may also see more Intel people as part of the SW WG for the NVMe live
migration standardization effort.

>And I also asked them to prepare a document explaining how this
>opaque cmd actually works to address your concerns.
There is nothing secret. It's the VF state, including the VF CSR registers, IO QP states, and
the AdminQ state. The AdminQ state may also include the pending AER command.
As part of the NVMe live migration standardization effort, it's desirable to standardize
the structure as much as we can, and then have an extension for any device-specific
state that may be required to work around specific HW design limitations.
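
To keep the discussion concrete, a purely illustrative layout of such a
state blob could look like the one below; the field names and structure
are placeholders, and defining the real layout is exactly what the
standardization effort is for:

#include <linux/types.h>

struct nvmex_vf_state {
        __le32  version;        /* layout version of this blob */
        __le32  csr_len;        /* VF controller registers (CC, CSTS, AQA, ...) */
        __le16  nr_io_qp;       /* number of IO SQ/CQ pair states that follow */
        __le16  rsvd;
        __le32  adminq_len;     /* AdminQ state, incl. any pending AER command */
        __u8    data[];         /* csr | io_qp[0..nr_io_qp-1] | adminq */
};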

2022-12-09 19:11:53

by Dong, Eddie

[permalink] [raw]
Subject: RE: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

>
> If following the direction having controlling function to explicitly provide
> services then all those doorbells should be deprecated and instead we want
> explicit communications between VF driver and PF driver.

A hardware communication mechanism (a doorbell here) can be used to
support a broader scenario where the VF driver and the PF driver run in 2
different OS kernels (when the VF is assigned to a VM).

>
> From userspace driver p.o.v. the VFIO uAPI is kind of a device programming
> interface. Here we just present everything related to the controlled device
> itself (including the state management) via a centralized portal though in the
> kernel there might be linkage out of VFIO to reach the controlling driver. kind
> of a sw doorbell. ????

Yes.
And a SW doorbell is more efficient than a hardware doorbell.

2022-12-11 12:42:54

by Max Gurtovoy

[permalink] [raw]
Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device


On 12/6/2022 5:01 PM, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 10:48:22AM -0400, Jason Gunthorpe wrote:
>> Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
>> use.
> Beward: The secondary function might as well be a physical function
> as well. In fact one of the major customers for "smart" multifunction
> nvme devices prefers multi-PF devices over SR-IOV VFs. (and all the
> symmetric dual ported devices are multi-PF as well).
>
> So this isn't really about a VF live cycle, but how to manage life
> migration, especially on the receive / restore side. And restoring
> the entire controller state is extremely invasive and can't be done
> on a controller that is in any classic form live. In fact a lot
> of the state is subsystem-wide, so without some kind of virtualization
> of the subsystem it is impossible to actually restore the state.

ohh, great !

I read this subsystem virtualization proposal of yours after I sent my
proposal for subsystem virtualization in the patch 1/5 thread.
I guess this means that this is the right way to go.
Let's continue brainstorming this idea. I think this can be the way to
migrate NVMe controllers in a standard way.

>
> To cycle back to the hardware that is posted here, I'm really confused
> how it actually has any chance to work and no one has even tried
> to explain how it is supposed to work.

I guess in a vendor-specific implementation you can assume some things
that we are now discussing in order to make it a standard.


2022-12-11 13:06:51

by Max Gurtovoy

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.


On 12/7/2022 10:08 PM, Jason Gunthorpe wrote:
> On Wed, Dec 07, 2022 at 07:33:33PM +0100, Christoph Hellwig wrote:
>> On Wed, Dec 07, 2022 at 01:31:44PM -0400, Jason Gunthorpe wrote:
>>>> Sorry, I meant VF. Your continued using of SR-IOV teminology now keeps
>>>> confusing my mind so much that I start mistyping things.
>>> Well, what words do you want to use?
>> The same I've used through this whole thread: controlling and
>> controlled function.
>>
>>> So I don't think I've learned anything more about your objection.
>>>
>>> "fundamentally broken" doesn't help
>> The objection is that:
>>
>> - in hardware fundamentally only the controlling funtion can
>> control live migration features on the controlled function,
>> because the controlled function is assigned to a VM which has
>> control over it.
> Yes
>
> However hisilicon managed to do their implementation without this, or
> rather you could say their "controlling function" is a single MMIO BAR
> page in their PCI VF and their "controlled function" is the rest of
> the PCI VF.
>
>> - for the same reason there is no portable way to even find
>> the controlling function from a controlled function, unless
>> you want to equate PF = controlling and VF = controlled,
>> and even that breaks down in some corner cases
> As you say, the kernel must know the relationship between
> controlling->controlled. Nothing else is sane.
>
> If the kernel knows this information then we can find a way for the
> vfio_device to have pointers to both controlling and controlled
> objects. I have a suggestion below.
>
>> - if you want to control live migration from the controlled
>> VM you need a new vfio subdriver for a function that has
>> absolutely no new functionality itself related to live
>> migration (quirks for bugs excepted).
> I see it differently, the VFIO driver *is* the live migration
> driver. Look at all the drivers that have come through and they are
> 99% live migration code. They have, IMHO, properly split the live
> migration concern out of their controlling/PF driver and placed it in
> the "VFIO live migration driver".
>
> We've done a pretty good job of allowing the VFIO live migration
> driver to pretty much just focus on live migration stuff and delegate
> the VFIO part to library code.
>
> Excepting quirks and bugs sounds nice, except we actually can't ignore
> them. Having all the VFIO capabilities is exactly how we are fixing
> the quirks and bugs in the first place, and I don't see your vision
> how we can continue to do that if we split all the live migration code
> into yet another subsystem.
>
> For instance how do I trap FLR like mlx5 must do if the
> drivers/live_migration code cannot intercept the FLR VFIO ioctl?
>
> How do I mangle and share the BAR like hisilicon does?
>
> Which is really why this is in VFIO in the first place. It actually is
> coupled in practice, if not in theory.
>
>> So by this architecture you build a convoluted mess where you need
>> tons of vfio subdrivers that mostly just talk to the driver for the
>> controlling function, which they can't even sanely discover. That's
>> what I keep calling fundamentally broken.
> The VFIO live migration drivers will look basically the same if you
> put them under drivers/live_migration. This cannot be considered a
> "convoluted mess" as splitting things by subsystem is best-practice,
> AFAIK.
>
> If we accept that drivers/vfio can be the "live migration subsystem"
> then lets go all the way and have the controlling driver to call
> vfio_device_group_register() to create the VFIO char device for the
> controlled function.
>
> This solves the "sanely discover" problem because of course the
> controlling function driver knows what the controlled function is and
> it can acquire both functions before it calls
> vfio_device_group_register().
>
> This is actually what I want to do anyhow for SIOV-like functions and
> VFIO. Doing it for PCI VFs (or related PFs) is very nice symmetry. I
> really dislike that our current SRIOV model in Linux forces the VF to
> instantly exist without a chance for the controlling driver to
> provision it.
>
> We have some challenges on how to do this in the kernel, but I think
> we can overcome them. VFIO is ready for this thanks to all the
> restructuring work we already did.
>
> I'd really like to get away from VFIO having to do all this crazy
> sysfs crap to activate its driver. I think there is a lot of appeal to
> having, say, a nvmecli command that just commands the controlling
> driver to provision a function, enable live migration, configure it
> and then make it visible via VFIO. The same API regardless if the
> underlying controlled function technology is PF/VF/SIOV.
>
> At least we have been very interested in doing that for networking
> devices.
>
> Jason

Jason/Christoph,

As I mentioned earlier, we have 2 orthogonal paths here: implementation
and SPEC. They are for some reason mixed in this discussion.

I've tried to understand some SPEC-related issues that were raised here;
if we fix them, the implementation will be possible and all the
VFIO efforts we did can be re-used.

At a high level I think that for the SPEC:

1. We need to define a concept of a "virtual subsystem". A primary
controller will be able to create a virtual subsystem. Inside this
subsystem the primary controller will be the master ("the controlling")
of the migration process. It will also be able to add secondary
controllers to this virtual subsystem and assign a "virtual controller ID"
to each of them.
Something like:
- nvme virtual_subsys_create --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1"
--dev_vcid=1
- nvme virtual_subsys_add --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1"
--secondary_dev=/dev/nvme2 --secondary_dev_vcid=20

2. Now the primary controller has a list of ctrls inside its virtual
subsystem for migration. It has a handle to each of them that doesn't go
away after binding the controlled function to VFIO.

3. The same virtual subsystem should be created in the destination hypervisor.

4. Now the migration process starts using the VFIO uAPI - we will get to
a point where the VFIO driver of the controlled function needs to ask the
controlling function to send admin commands to manage the migration process.
How to do it? That's an implementation detail. We can set a pointer in the
pci_dev/dev structures, or we can call
nvme_migration_handle_get(controlled_function), or NVMe can expose APIs
dedicated to migration such as nvme_state_save(controlled_function); see
the sketch right after this list.
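
A minimal sketch of what that in-kernel linkage could look like (none of
these symbols exist today; they only expand on the names used in item 4):

#include <linux/pci.h>
#include <linux/types.h>

struct nvme_mig_handle; /* opaque, owned by the controlling nvme driver */

/* look up / release the controlling-side handle for a controlled function */
struct nvme_mig_handle *nvme_migration_handle_get(struct pci_dev *controlled);
void nvme_migration_handle_put(struct nvme_mig_handle *handle);

/* state save/load performed by the controlling function on behalf of the VF */
int nvme_state_save(struct nvme_mig_handle *handle, void *buf, size_t len);
int nvme_state_load(struct nvme_mig_handle *handle, const void *buf, size_t len);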


When creating a virtual subsystem and adding controllers to it, we can
control any leakage or narrow some functionality that makes migration
impossible. This can be done using more admin commands, for example.
After the migration process is over, one can remove the secondary
controller from the virtual subsystem and re-use it.

WDYT ?

2022-12-11 13:52:07

by Rao, Lei

[permalink] [raw]
Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device



On 12/11/2022 8:05 PM, Max Gurtovoy wrote:
>
> On 12/6/2022 5:01 PM, Christoph Hellwig wrote:
>> On Tue, Dec 06, 2022 at 10:48:22AM -0400, Jason Gunthorpe wrote:
>>> Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
>>> use.
>> Beward:  The secondary function might as well be a physical function
>> as well.  In fact one of the major customers for "smart" multifunction
>> nvme devices prefers multi-PF devices over SR-IOV VFs. (and all the
>> symmetric dual ported devices are multi-PF as well).
>>
>> So this isn't really about a VF live cycle, but how to manage life
>> migration, especially on the receive / restore side.  And restoring
>> the entire controller state is extremely invasive and can't be done
>> on a controller that is in any classic form live.  In fact a lot
>> of the state is subsystem-wide, so without some kind of virtualization
>> of the subsystem it is impossible to actually restore the state.
>
> ohh, great !
>
> I read this subsystem virtualization proposal of yours after I sent my proposal for subsystem virtualization in patch 1/5 thread.
> I guess this means that this is the right way to go.
> Lets continue brainstorming this idea. I think this can be the way to migrate NVMe controllers in a standard way.
>
>>
>> To cycle back to the hardware that is posted here, I'm really confused
>> how it actually has any chance to work and no one has even tried
>> to explain how it is supposed to work.
>
> I guess in vendor specific implementation you can assume some things that we are discussing now for making it as a standard.

Yes, as I wrote in the cover letter, this is a reference implementation to
start a discussion and help drive standardization efforts, but this series
works well for Intel IPU NVMe. As Jason said, there are two use cases:
shared medium and local medium. I think the live migration of the local medium
is complicated due to the large amount of user data that needs to be migrated.
I don't have a good idea of how to deal with this situation. But for Intel IPU NVMe,
each VF can connect to remote storage via the NVMF protocol to achieve storage
offloading. This is the shared medium. In this case, we don't need to migrate
the user data, which will significantly simplify the work of live migration.
The series tries to solve the problem of live migration of shared medium.
But it still lacks dirty page tracking and P2P support; we are also developing
these features.

About the NVMe device state: as described in my document, the VF state
includes the VF CSR registers, every IO queue pair state, and the AdminQ
state. During the implementation, I found that the device state data is
small per VF. So I decided to use the admin queue of the primary
controller to send the live migration commands to save and restore the VF
state, like mlx5 does.

Thanks,
Lei

2022-12-11 15:48:12

by Max Gurtovoy

[permalink] [raw]
Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device


On 12/11/2022 3:21 PM, Rao, Lei wrote:
>
>
> On 12/11/2022 8:05 PM, Max Gurtovoy wrote:
>>
>> On 12/6/2022 5:01 PM, Christoph Hellwig wrote:
>>> On Tue, Dec 06, 2022 at 10:48:22AM -0400, Jason Gunthorpe wrote:
>>>> Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
>>>> use.
>>> Beward:  The secondary function might as well be a physical function
>>> as well.  In fact one of the major customers for "smart" multifunction
>>> nvme devices prefers multi-PF devices over SR-IOV VFs. (and all the
>>> symmetric dual ported devices are multi-PF as well).
>>>
>>> So this isn't really about a VF live cycle, but how to manage life
>>> migration, especially on the receive / restore side.  And restoring
>>> the entire controller state is extremely invasive and can't be done
>>> on a controller that is in any classic form live.  In fact a lot
>>> of the state is subsystem-wide, so without some kind of virtualization
>>> of the subsystem it is impossible to actually restore the state.
>>
>> ohh, great !
>>
>> I read this subsystem virtualization proposal of yours after I sent
>> my proposal for subsystem virtualization in patch 1/5 thread.
>> I guess this means that this is the right way to go.
>> Lets continue brainstorming this idea. I think this can be the way to
>> migrate NVMe controllers in a standard way.
>>
>>>
>>> To cycle back to the hardware that is posted here, I'm really confused
>>> how it actually has any chance to work and no one has even tried
>>> to explain how it is supposed to work.
>>
>> I guess in vendor specific implementation you can assume some things
>> that we are discussing now for making it as a standard.
>
> Yes, as I wrote in the cover letter, this is a reference
> implementation to
> start a discussion and help drive standardization efforts, but this
> series
> works well for Intel IPU NVMe. As Jason said, there are two use cases:
> shared medium and local medium. I think the live migration of the
> local medium
> is complicated due to the large amount of user data that needs to be
> migrated.
> I don't have a good idea to deal with this situation. But for Intel
> IPU NVMe,
> each VF can connect to remote storage via the NVMF protocol to achieve
> storage
> offloading. This is the shared medium. In this case, we don't need to
> migrate
> the user data, which will significantly simplify the work of live
> migration.

I don't think that medium migration should be part of the SPEC. We can
specify that it's out of scope.

The whole idea of live migration is to have a short downtime, and I don't
think we can guarantee a short downtime if we need to copy a few terabytes
over the network.
If the media copy takes a few seconds, there is no need to do live
migration with a few milliseconds of downtime. Just do a regular migration
of the VM.

>
> The series tries to solve the problem of live migration of shared medium.
> But it still lacks dirty page tracking and P2P support, we are also
> developing
> these features.
>
> About the nvme device state, As described in my document, the VF
> states include
> VF CSR registers, Every IO Queue Pair state, and the AdminQ state.
> During the
> implementation, I found that the device state data is small per VF.
> So, I decided
> to use the admin queue of the Primary controller to send the live
> migration
> commands to save and restore the VF states like MLX5.

I think and hope we all agree that the AdminQ of the controlling NVMe
function will be used to migrate the controlled NVMe function.

Which document are you referring to?

>
> Thanks,
> Lei
>
>>
>>

2022-12-12 01:45:39

by Rao, Lei

[permalink] [raw]
Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device



On 12/11/2022 10:51 PM, Max Gurtovoy wrote:
>
> On 12/11/2022 3:21 PM, Rao, Lei wrote:
>>
>>
>> On 12/11/2022 8:05 PM, Max Gurtovoy wrote:
>>>
>>> On 12/6/2022 5:01 PM, Christoph Hellwig wrote:
>>>> On Tue, Dec 06, 2022 at 10:48:22AM -0400, Jason Gunthorpe wrote:
>>>>> Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
>>>>> use.
>>>> Beward:  The secondary function might as well be a physical function
>>>> as well.  In fact one of the major customers for "smart" multifunction
>>>> nvme devices prefers multi-PF devices over SR-IOV VFs. (and all the
>>>> symmetric dual ported devices are multi-PF as well).
>>>>
>>>> So this isn't really about a VF live cycle, but how to manage life
>>>> migration, especially on the receive / restore side.  And restoring
>>>> the entire controller state is extremely invasive and can't be done
>>>> on a controller that is in any classic form live.  In fact a lot
>>>> of the state is subsystem-wide, so without some kind of virtualization
>>>> of the subsystem it is impossible to actually restore the state.
>>>
>>> ohh, great !
>>>
>>> I read this subsystem virtualization proposal of yours after I sent my proposal for subsystem virtualization in patch 1/5 thread.
>>> I guess this means that this is the right way to go.
>>> Lets continue brainstorming this idea. I think this can be the way to migrate NVMe controllers in a standard way.
>>>
>>>>
>>>> To cycle back to the hardware that is posted here, I'm really confused
>>>> how it actually has any chance to work and no one has even tried
>>>> to explain how it is supposed to work.
>>>
>>> I guess in vendor specific implementation you can assume some things that we are discussing now for making it as a standard.
>>
>> Yes, as I wrote in the cover letter, this is a reference implementation to
>> start a discussion and help drive standardization efforts, but this series
>> works well for Intel IPU NVMe. As Jason said, there are two use cases:
>> shared medium and local medium. I think the live migration of the local medium
>> is complicated due to the large amount of user data that needs to be migrated.
>> I don't have a good idea to deal with this situation. But for Intel IPU NVMe,
>> each VF can connect to remote storage via the NVMF protocol to achieve storage
>> offloading. This is the shared medium. In this case, we don't need to migrate
>> the user data, which will significantly simplify the work of live migration.
>
> I don't think that medium migration should be part of the SPEC. We can specify it's out of scope.
>
> All the idea of live migration is to have a short downtime and I don't think we can guarantee short downtime if we need to copy few terabytes throw the networking.
> If the media copy is taking few seconds, there is no need to do live migration of few milisecs downtime. Just do regular migration of a
>
>>
>> The series tries to solve the problem of live migration of shared medium.
>> But it still lacks dirty page tracking and P2P support, we are also developing
>> these features.
>>
>> About the nvme device state, As described in my document, the VF states include
>> VF CSR registers, Every IO Queue Pair state, and the AdminQ state. During the
>> implementation, I found that the device state data is small per VF. So, I decided
>> to use the admin queue of the Primary controller to send the live migration
>> commands to save and restore the VF states like MLX5.
>
> I think and hope we all agree that the AdminQ of the controlling NVMe function will be used to migrate the controlled NVMe function.

Fully agree.

>
> which document are you refereeing to ?

The fifth patch includes the definition of these commands and how the
firmware handles these live migration commands. That is the document I was
referring to.

>>
>>>
>>>

2022-12-12 07:46:16

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Fri, Dec 09, 2022 at 02:11:21AM +0000, Tian, Kevin wrote:
> > It can't. That's the whole point. The controlled function that is
> > being migrated must be absolutely unaware of that (except for things
> > like quiescing access or FLRs that could happen anyway), because
> > otherwise you have a fundamental information leak.
> >
>
> Can you elaborate which information is leaked?

Information about what controllers exist, what namespaces exist, etc.

2022-12-12 08:30:27

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Sun, Dec 11, 2022 at 01:39:37PM +0200, Max Gurtovoy wrote:
> 1. Need to define a concept of a "virtual subsystem". A primary controller
> will be able to create a virtual subsystem. Inside this subsystem the
> primary controller will be the master ("the controlling") of the migration
> process. It will also be able to add secondary controllers to this virtual
> subsystem and assign a "virtual controller ID" to each.
> Something like:
> - nvme virtual_subsys_create --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1"
> --dev_vcid=1
> - nvme virtual_subsys_add --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1"
> --secondary_dev=/dev/nvme2 --secondary_dev_vcid=20

Yes. Note that there is a bit more state than just the NQN. You also
need at least a serial number, and also probably a different vendor
ID (the PCI vendor ID that is also mirrored in Identify Controller and
the IEEE OUI), and new unique namespace identifiers.
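
(To make that enumeration concrete, the per-subsystem identity state that
would need to be virtualized is roughly the following; the struct is purely
illustrative, with field sizes borrowed from the Identify Controller data
structure, and is not from any spec or patch in this series.)

#include <linux/types.h>

/* Illustrative only. */
struct virt_subsys_identity {
	char	subnqn[256];	/* virtual subsystem NQN */
	char	sn[20];		/* serial number (Identify Controller SN) */
	char	mn[40];		/* model number (Identify Controller MN) */
	__le16	vid;		/* PCI vendor ID mirrored in Identify Controller */
	__le16	ssvid;		/* PCI subsystem vendor ID */
	u8	ieee[3];	/* IEEE OUI used for EUI-64/NGUID generation */
	u16	vcid;		/* virtual controller ID within the subsystem */
	/* plus unique per-namespace identifiers (NGUID/EUI-64/UUID) */
};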

2022-12-12 08:33:49

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

On Tue, Dec 06, 2022 at 06:00:27PM +0000, Dong, Eddie wrote:
> NVMe spec is general, but the implementation details (such as internal state) may
> be vendor specific. If the migration happens between 2 identical NVMe devices
> (from same vendor/device w/ same firmware version), migration of
> subsystem-wide state can be naturally covered, right?

No. If you want live migration for nvme supported in Linux, it must
be specified in the NVMe technical working group and interoperate between
different implementations.

2022-12-12 08:35:01

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Wed, Dec 07, 2022 at 04:08:02PM -0400, Jason Gunthorpe wrote:
> However hisilicon managed to do their implementation without this, or
> rather you could say their "controlling function" is a single MMIO BAR
> page in their PCI VF and their "controlled function" is the rest of
> the PCI VF.

Eww. So you need to carefully filter the BAR and can't actually do
any DMA at all? I'm not sure that is actually practical, especially
not for something with a lot of state.

> If the kernel knows this information then we can find a way for the
> vfio_device to have pointers to both controlling and controlled
> objects. I have a suggestion below.

So now we need to write a vfio shim for every function even if there
is absolutely nothing special about that function? Migrating really
is the controlling function's behavior, and writing a new vfio bit
for every controlled thing just does not scale.

> I see it differently, the VFIO driver *is* the live migration
> driver. Look at all the drivers that have come through and they are
> 99% live migration code.

Yes, and that's the problem, because they are associated with the
controlled function, and now we have a communication problem between
the vfio driver binding to the controlled function and the driver
that actually controls live migration, which is associated with the
controlling function. In other words: you've created a giant mess.

> Excepting quirks and bugs sounds nice, except we actually can't ignore
> them.

I'm not proposing to ignore them. But they should not be needed most
of the time.

> For instance how do I trap FLR like mlx5 must do if the
> drivers/live_migration code cannot intercept the FLR VFIO ioctl?
>
> How do I mangle and share the BAR like hisilicon does?
>
> Which is really why this is in VFIO in the first place. It actually is
> coupled in practice, if not in theory.

So you've created a long-term userspace API around working around
buggy and/or misdesigned early designs and now want to force
it down everyone's throat?

Can we please take a step back and think about how things should work,
and only then think how to work around weirdo devices that do strange
things as a second step?

> If we accept that drivers/vfio can be the "live migration subsystem"
> then lets go all the way and have the controlling driver to call
> vfio_device_group_register() to create the VFIO char device for the
> controlled function.

While creating the VFs from the PF driver makes a lot more sense,
remember that vfio is absolutely not the only use case for VFs.
There are plenty of use cases where you want to use them with the normal
kernel driver as well. So the interface to create VFs needs a way
to decide if it should be vfio-exported, or use the normal kernel
binding.

> This solves the "sanely discover" problem because of course the
> controlling function driver knows what the controlled function is and
> it can acquire both functions before it calls
> vfio_device_group_register().

Yes.

> This is actually what I want to do anyhow for SIOV-like functions and
> VFIO. Doing it for PCI VFs (or related PFs) is very nice symmetry. I
> really dislike that our current SRIOV model in Linux forces the VF to
> instantly exist without a chance for the controlling driver to
> provision it.

For SIOV you have no other choice anyway. But I agree that it is
the right thing to do for VFIO. Now the next step is to control
live migration from the controlling function, so that for most sane
devices the controlled device does not need all the pointless
boilerplate of its own vfio driver.

> I'd really like to get away from VFIO having to do all this crazy
> sysfs crap to activate its driver. I think there is a lot of appeal to
> having, say, a nvmecli command that just commands the controlling
> driver to provision a function, enable live migration, configure it
> and then make it visible via VFIO. The same API regardless if the
> underlying controlled function technology is PF/VF/SIOV.

Yes.

2022-12-12 08:46:38

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

On Fri, Dec 09, 2022 at 04:53:47PM +0000, Li, Yadong wrote:
> The key leadership in NVMe WG, Peter and Nick (from Intel), are on top of this TPAR,
> and help influence the scope of the TPAR to better support IPU/DPUs.

You guys should talk more to each other. I think Peter especially has
been vocal to reduce the scope and not include this.

2022-12-12 08:51:05

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

On Sun, Dec 11, 2022 at 04:51:02PM +0200, Max Gurtovoy wrote:
> I don't think that medium migration should be part of the SPEC. We can
> specify it's out of scope.

This is the main item in the TPAR in the technical working group,
with SQ/CQ state being the other one. So instead of arguing here
I'd suggest you all get involved in the working group ASAP.

> The whole idea of live migration is to have a short downtime, and I don't
> think we can guarantee a short downtime if we need to copy a few terabytes
> through the network.

You can. Look at the existing qemu code for live migration of
image-based storage; the same concepts also work for hardware offloads.

> If the media copy takes a few seconds, there is no need to do a live
> migration with a few milliseconds of downtime. Just do a regular
> migration of a VM.

The point is of course to not do the data migration during the downtime,
but to track newly written LBAs after the start of the copy process.
Again look at qemu for how this has been done for years in software.
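
(As a toy sketch of that concept, not qemu's actual code: writes that land
while the bulk copy is in flight are recorded in a dirty bitmap, and only the
dirtied chunks are re-copied until the residual fits the downtime budget.
Chunk size, naming and locking below are all made up.)

#include <linux/types.h>
#include <linux/bitops.h>

#define CHUNK_LBAS	2048		/* track dirtiness per 1 MiB of 512B LBAs */

static unsigned long *lm_dirty_bitmap;	/* one bit per chunk */

/* Hooked into the write path while the bulk copy runs. */
static void lm_mark_dirty(u64 slba, u32 nlb)
{
	u64 chunk = slba / CHUNK_LBAS;
	u64 last = (slba + nlb - 1) / CHUNK_LBAS;

	for (; chunk <= last; chunk++)
		set_bit(chunk, lm_dirty_bitmap);
}

/*
 * After the bulk copy: repeatedly snapshot-and-clear the bitmap and re-copy
 * only the dirty chunks, iterating until the remaining dirty set is small
 * enough to be flushed within the allowed downtime.
 */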

2022-12-12 15:21:30

by Max Gurtovoy

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.


On 12/12/2022 9:55 AM, Christoph Hellwig wrote:
> On Sun, Dec 11, 2022 at 01:39:37PM +0200, Max Gurtovoy wrote:
>> 1. Need to define a concept of a "virtual subsystem". A primary controller
>> will be able to create a virtual subsystem. Inside this subsystem the
>> primary controller will be the master ("the controlling") of the migration
>> process. It will also be able to add secondary controllers to this virtual
>> subsystem and assign a "virtual controller ID" to each.
>> Something like:
>> - nvme virtual_subsys_create --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1"
>> --dev_vcid=1
>> - nvme virtual_subsys_add --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1"
>> --secondary_dev=/dev/nvme2 --secondary_dev_vcid=20
> Yes. Note that there is a bit more state than just the NQN. You also
> need at least a serial number, and also probably a different vendor
> ID (the PCI vendor ID that is also mirrored in Identify Controller and
> the IEEE OUI), and new unique namespace identifiers.

Yes, for sure there are more bits we should consider.

I wanted to describe the high level.

I think we can maybe say that when a secondary function is moved to
a virtual subsystem, the feature set of its virtual ctrl is narrowed to
the mandatory NVMe feature set.
And we'll provide an API to set/extend its feature set up to a maximum of
the feature set that the original ctrl of the secondary function had.
Then the sysadmin will configure the virtual ctrl on both src/dst hosts.

The high level idea is to have a programmable way to set the features of
a virtual controller inside a virtual subsystem that is also programmable.
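
Continuing the hypothetical nvme-cli syntax from above (none of these
commands exist today, and the option names are invented), the programming
step might look like:

- nvme virtual_ctrl_set_features --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1"
  --vcid=20 --features=mandatory+sgl+write_zeroes --num_io_queues=8

run identically on the source and destination hosts, so that both virtual
ctrls expose the same, migratable feature set.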

2022-12-13 15:09:57

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Mon, Dec 12, 2022 at 08:50:46AM +0100, Christoph Hellwig wrote:
> On Wed, Dec 07, 2022 at 04:08:02PM -0400, Jason Gunthorpe wrote:
> > However hisilicon managed to do their implementation without this, or
> > rather you could say their "controlling function" is a single MMIO BAR
> > page in their PCI VF and their "controlled function" is the rest of
> > the PCI VF.
>
> Eww. So you need to carefully filter the BAR and can't actually do
> any DMA at all? I'm not sure that is actually practical, especially
> not for something with a lot of state.

Indeed, but it is what they did and the HW should be supported by the
kernel, IMO.

> > If the kernel knows this information then we can find a way for the
> > vfio_device to have pointers to both controlling and controlled
> > objects. I have a suggestion below.
>
> So now we need to write a vfio shim for every function even if there
> is absolutely nothing special about that function? Migrating really
> is the controlling functions behavior, and writing a new vfio bit
> for every controlled thing just does not scale.

Huh? "does not scale?" We are looking at boilerplates of around 20-30
lines to make a VFIO driver for a real PCI device. Why is that even
something we should worry about optimizing?
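
(For concreteness, the boilerplate in question is on the order of the sketch
below, modeled loosely on the in-tree vfio-pci variant drivers of this period;
the helper names follow the ~v6.1 core API and shift between kernel versions,
so treat it as illustrative rather than exact.)

#include <linux/pci.h>
#include <linux/vfio.h>
#include <linux/vfio_pci_core.h>

static int my_vfio_pci_open_device(struct vfio_device *core_vdev)
{
	struct vfio_pci_core_device *vdev =
		container_of(core_vdev, struct vfio_pci_core_device, vdev);
	int ret;

	ret = vfio_pci_core_enable(vdev);
	if (ret)
		return ret;

	/* device-specific setup (e.g. migration ops) would go here */

	vfio_pci_core_finish_enable(vdev);
	return 0;
}

static const struct vfio_device_ops my_vfio_pci_ops = {
	.name		= "my-vfio-pci",
	.init		= vfio_pci_core_init_dev,
	.release	= vfio_pci_core_release_dev,
	.open_device	= my_vfio_pci_open_device,
	.close_device	= vfio_pci_core_close_device,
	.ioctl		= vfio_pci_core_ioctl,
	.device_feature	= vfio_pci_core_ioctl_feature,
	.read		= vfio_pci_core_read,
	.write		= vfio_pci_core_write,
	.mmap		= vfio_pci_core_mmap,
	.request	= vfio_pci_core_request,
	.match		= vfio_pci_core_match,
};

static int my_vfio_pci_probe(struct pci_dev *pdev,
			     const struct pci_device_id *id)
{
	struct vfio_pci_core_device *vdev;
	int ret;

	vdev = vfio_alloc_device(vfio_pci_core_device, vdev, &pdev->dev,
				 &my_vfio_pci_ops);
	if (IS_ERR(vdev))
		return PTR_ERR(vdev);

	dev_set_drvdata(&pdev->dev, vdev);
	ret = vfio_pci_core_register_device(vdev);
	if (ret) {
		vfio_put_device(&vdev->vdev);
		return ret;
	}
	return 0;
}

/* plus a struct pci_driver with an id_table and a remove() that unregisters */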

And when you get into exciting future devices like SIOV you already
need to make a special VFIO driver anyhow.

> Yes, and that's the problem, because they are associated with the
> controlled function, and now we have a communication problem between
> the vfio driver binding to the controlled function and the driver
> that actually controls live migration, which is associated with the
> controlling function. In other words: you've created a giant mess.

So far 100% of the drivers that have been presented, including the two
RFC ones, have entanglements between live migration and vfio. Shifting
things to dev/live_migration doesn't make the "communication problem"
go away; it just shifts it into another subsystem.

This is my point, I've yet to even see a driver that meets your
theoretical standard that it can exist without vfio entanglement. So
I'd be much more interested in this idea if we had a stable of drivers
that obviously were harmed by VFIO. We don't have that, and I don't
even think that we ever will, considering both RFC drivers have
devices that also stepped on the mlx5 FLR problem too.

The FLR thing actually makes sense, because it is not actually the
controlling function that is doing the live migration inside the
devices. The controlling function is just a *proxy* to deliver
commands to the controlled function. So FLR on the controlled device
affects commands being executed on the controlling function. It is a
pain.

As it is, live migration is only useful with VFIO, so they are put
together to keep things simpler. The only complexity is the
controlled/controlling issue and for all existing SRIOV PF/VF
relationships we have an OK solution (at least it isn't buggy).

nvme's higher flexibility needs more work, but that doesn't mean the
idea is so wrong. I think you are really overstating the "mess".

> I'm not proposing to ignore them. But they should not be needed most
> of the time.

I'm not seeing that in the drivers I've looked at.

> > Which is really why this is in VFIO in the first place. It actually is
> > coupled in practice, if not in theory.
>
> So you've created a long-term userspace API around working around
> buggy and/or misdesigned early designs and now want to force
> it down everyone's throat?

No, we coupled live migration and VFIO because they are never useful
apart, and we accept that we must find reasonable solutions to linking
the controlling/controlled device because it is necessary in all cases
we've seen.

> > If we accept that drivers/vfio can be the "live migration subsystem"
> > then lets go all the way and have the controlling driver to call
> > vfio_device_group_register() to create the VFIO char device for the
> > controlled function.
>
> While creating the VFs from the PF driver makes a lot more sense,
> remember that vfio is absolutely not the only use case for VFs.
> There are plenty of use cases where you want to use them with the normal
> kernel driver as well. So the interface to create VFs needs a way
> to decide if it should be vfio-exported, or use the normal kernel
> binding.

Yes, that is why this problem has been open for so long. Fixing it
well requires some reconsideration of how the driver core works :(

It is worse than just VFIO vs one kernel driver, like mlx5 could spawn
a controlled function that is NVMe, VDPA, mlx5, virtio-net, VFIO,
etc.

When we create the function we really want to tell the device what
kind of function it is, and that also tells the kernel what driver
should be bound to it.

mlx5 even has weird limitations, like a controlled function that is
live migration capable has fewer features than a function that is
not. So the user must specify what parameters it wants the controlled
function to have..

Jason

2022-12-13 16:22:08

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Tue, Dec 13, 2022 at 10:01:03AM -0400, Jason Gunthorpe wrote:
> > So now we need to write a vfio shim for every function even if there
> > is absolutely nothing special about that function? Migrating really
> > is the controlling functions behavior, and writing a new vfio bit
> > for every controlled thing just does not scale.
>
> Huh? "does not scale?" We are looking at boilerplates of around 20-30
> lines to make a VFIO driver for a real PCI device. Why is that even
> something we should worry about optimizing?

But we need a new driver for every controlled function now, which
is very different from the classic VFIO model where we had one
vfio_pci.

> And when you get into exciting future devices like SIOV you already
> need to make a special VFIO driver anyhow.

You need special support for it. It's probably not another
Linux driver but part of the parent one, though.

> So far 100% of the drivers that have been presented, including the two
> RFC ones, have entanglements between live migration and vfio. Shifting
> things to dev/live_migration doesn't make the "communication problem"
> away, it just shifted it into another subsystem.

The main entanglement seems to be that it needs to support a vfio
interface for live migration while the actual commands go to the
parent device.

> This is my point, I've yet to even see a driver that meets your
> theoretical standard that it can exist without vfio entanglement.

It can't right now due to the VFIO design.

> > While creating the VFs from the PF driver makes a lot more sense,
> > remember that vfio is absolutely not the only use case for VFs.
> > There are plenty of use cases where you want to use them with the normal
> > kernel driver as well. So the interface to create VFs needs a way
> > to decide if it should be vfio-exported, or use the normal kernel
> > binding.
>
> Yes, that is why this problem has been open for so long. Fixing it
> well requires some reconsideration of how the driver core works :(
>
> It is worse than just VFIO vs one kernel driver, like mlx5 could spawn
> a controlled function that is NVMe, VDPA, mlx5, virtio-net, VFIO,
> etc.

This seems to violate the PCIe spec, which says:

"All VFs associated with a PF must be the same device type as the PF,
(e.g., the same network device type or the same storage device type.)",

which is also enforced by not allowing the vendor/device/class
fields to be read from VFs.

(not that I'm arguing that this is a good limit, but that's how
PCIe does it).

> When we create the function we really want to tell the device what
> kind of function it is, and that also tells the kernel what driver
> should be bound to it.

I'd rather have different ways to probe by passing a "kind" or "type"
argument along with the device IDs during probing, e.g. "driver"
and "vfio", and then only match for the kind that the creator of the
device added them to the device model for.
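
(Purely as a sketch of that concept; no such driver-core interface exists
today, and all of the names below are invented.)

/* Conceptual only: neither the enum nor the match hook exists in the kernel. */
enum probe_kind {
	PROBE_KIND_DRIVER,	/* bind the normal kernel driver */
	PROBE_KIND_VFIO,	/* export through vfio instead */
};

/*
 * The creator of the function (PF/DPU driver) would tag the new device with
 * a kind at provisioning time, and the bus match() would only consider
 * drivers registered for that kind, so no sysfs bind games are needed to
 * steer a device to vfio.
 */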

> mlx5 even has weird limitations, like a controlled function that is
> live migration capable has fewer features than a function that is
> not. So the user must specify what parameters it wants the controlled
> function to have..

I don't think that is weird. If you want to live migrate, you need to

a) make sure the feature set is compatible with the other side
b) make sure there is only state that actually is migratable

so I'd expect that for any other sufficiently complex device. NVMe
for sure will have limits like this.

2022-12-13 18:30:27

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

On Tue, Dec 13, 2022 at 05:08:07PM +0100, Christoph Hellwig wrote:
> On Tue, Dec 13, 2022 at 10:01:03AM -0400, Jason Gunthorpe wrote:
> > > So now we need to write a vfio shim for every function even if there
> > > is absolutely nothing special about that function? Migrating really
> > > is the controlling functions behavior, and writing a new vfio bit
> > > for every controlled thing just does not scale.
> >
> > Huh? "does not scale?" We are looking at boilerplates of around 20-30
> > lines to make a VFIO driver for a real PCI device. Why is that even
> > something we should worry about optimizing?
>
> But we need a new driver for every controlled function now, which
> is very different from the classic VFIO model where we had one
> vfio_pci.

To be fair, mainly vfio_pci had that model. Other uses of VFIO have
device-specific drivers already. We have the reset drivers in vfio
platform, and the mdevs already. SIOV drivers are coming and they will
not be general either. I also know of a few upcoming non-migration
VFIO PCI variant drivers that deal with HW issues.

Remember, we did a bunch of work to make this reasonable. Userspace
can properly probe the correct VFIO driver for the HW it wants to use,
just like normal devices. If we spawn the VFIO from the controlling
function then it obviously will bring the correct driver along too.

The mental model I have for VFIO is that every vfio_device has a
driver, and we have three "universal" drivers that wildcard match to
many devices (pci, fsl, and platform acpi reset). Otherwise VFIO is
like every other driver subsystem out there, with physical devices and
matching drivers that support them.

Creating drivers for HW is not a problem, that is what a driver
subsystem is for. We already invested effort in VFIO to make this
scalable.

> > And when you get into exciting future devices like SIOV you already
> > need to make a special VFIO driver anyhow.
>
> You need special support for it. It's probably not another
> Linux driver but part of the parent one, though.

The designs we have done in mlx5 are split. The "parent" has just
enough shim to describe what the SIOV is in terms of a 'slice of the
parent's resources' and then we layer another driver, located in the
proper subsystem, to operate that slice. VDPA makes a
/dev/virtio-whatever, VFIO would make a fake PCI function, mlx5 makes
a netdev, etc.

It is not so different from how a PF/VF relationship works, just that
the SIOV is described by a struct auxiliary_device, not a struct
pci_dev.
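
(As a rough illustration of that split, the consumer side of such a slice
looks like the sketch below; the "mlx5_core.vfio" id and the callbacks are
invented, only the auxiliary-bus registration pattern itself is real.)

#include <linux/module.h>
#include <linux/auxiliary_bus.h>

static int my_slice_probe(struct auxiliary_device *adev,
			  const struct auxiliary_device_id *id)
{
	/*
	 * adev describes the parent's slice of resources; the subsystem
	 * driver (VDPA, VFIO, netdev, ...) is layered on top of it here.
	 */
	return 0;
}

static void my_slice_remove(struct auxiliary_device *adev)
{
}

static const struct auxiliary_device_id my_slice_id_table[] = {
	{ .name = "mlx5_core.vfio" },	/* hypothetical slice name */
	{ }
};
MODULE_DEVICE_TABLE(auxiliary, my_slice_id_table);

static struct auxiliary_driver my_slice_driver = {
	.probe		= my_slice_probe,
	.remove		= my_slice_remove,
	.id_table	= my_slice_id_table,
};
module_auxiliary_driver(my_slice_driver);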

I don't really like implementing VFIO drivers outside drivers/vfio, I
think that has historically had bad outcomes in other subsystems.

> > So far 100% of the drivers that have been presented, including the two
> > RFC ones, have entanglements between live migration and vfio. Shifting
> > things to dev/live_migration doesn't make the "communication problem"
> > away, it just shifted it into another subsystem.
>
> The main entanglement seems to be that it needs to support a vfio
> interface for live migration while the actual commands go to the
> parent device.

Not at all, that is only a couple function calls in 4 of the drivers
so far.

The entanglement is that the live migration FSM and the VFIO device
operation are not isolated. I keep repeating this - mlx5 and the two
RFC drivers must trap VFIO operations and relay them to their
migration logic. hns has to mangle its BARs. These are things that
only exist on the VFIO side.

So, you are viewing live migration as orthogonal and separable to
VFIO, and I don't agree with this because I haven't yet seen any proof
in implementations.

Let's go through the nvme spec process and see how it works out. If
NVMe can address things which are tripping up other implementations,
like FLR of the controlled function, then we may have the first
example. If not, then it is just how things are.

FLR is tricky; it is not obvious to me that you want a definition of
migration that isolates controlled-function FLR from the migration
FSM.

There are advantages to having a reliable, universal, way to bring a
function back to a clean slate, including restoring it to full
operation (ie canceling any migration operation). The current
definition of FLR provides this.

> > It is worse than just VFIO vs one kernel driver, like mlx5 could spawn
> > a controlled function that is NVMe, VDPA, mlx5, virtio-net, VFIO,
> > etc.
>
> This seems to violate the PCIe spec, which says:
>
> "All VFs associated with a PF must be the same device type as the PF,
> (e.g., the same network device type or the same storage device type.)",

For VFs there are multiple PFs to follow the above, and for SIOV this
language doesn't apply.

It seems the PDS RFC driver does violate this spec requirement, though.

> > When we create the function we really want to tell the device what
> > kind of function it is, and that also tells the kernel what driver
> > should be bound to it.
>
> I'd rather have different ways to probe by passing a "kind" or "type"
> argument along the device IDs during probing. E.g. "driver"
> and "vfio", and then only match for the kind the creator of the device
> added them to the device model for.

Not everything can be done during driver probing. There are certainly
steps at SIOV instantiation time or VF provisioning that impact what
exactly is available on the controlled function. Eg on mlx5 when we
create a VDPA device it actually is different from a full-function
mlx5 device and that customization was done before any driver was
probed.

In fact, not only is it done before driver binding, but it can be
enforced as a security property from the DPU side when the DPU is the
thing creating the function.

I like the general idea of a type to help specify the driver to probe;
we tried to work on something like that once and it didn't go far, but
I did like the concept of it.

> > mlx5 even has weird limitations, like a controlled function that is
> > live migration capable has fewer features than a function that is
> > not. So the user must specify what parameters it wants the controlled
> > function to have..
>
> I don't think that is weird. If you want to live migrate, you need to
>
> a) make sure the feature set is compatible with the other side
> b) there is only state that actually is migratable
>
> so I'd expect that for any other sufficiently complex device. NVMe
> for sure will have limits like this.

Oy, this has been pretty hard to define in mlx5 already :( Hopefully
nvme-cli can sort it out for NVMe configurables.

Jason