2023-10-17 00:42:59

by Si-Wei Liu

[permalink] [raw]
Subject: [PATCH v2 0/4] vdpa: decouple reset of iotlb mapping from device reset

In order to reduce needlessly high setup and teardown cost
of iotlb mapping during live migration, it's crucial to
decouple the vhost-vdpa iotlb abstraction from the virtio
device life cycle, i.e. iotlb mappings should be left
intact across virtio device reset [1]. For it to work, the
on-chip IOMMU parent device could implement a separate
.reset_map() operation callback to restore 1:1 DMA mapping
without having to resort to the .reset() callback, the
latter of which is mainly used to reset virtio device state.
This new .reset_map() callback will be invoked only before
the vhost-vdpa driver is to be removed and detached from
the vdpa bus, such that other vdpa bus drivers, e.g.
virtio-vdpa, can start with 1:1 DMA mapping when they
are attached. For the context, those on-chip IOMMU parent
devices, create the 1:1 DMA mapping at vdpa device creation,
and they would implicitly destroy the 1:1 mapping when
the first .set_map or .dma_map callback is invoked.

This patchset is based off of the descriptor group v3 series
from Dragos. [2]

[1] Reducing vdpa migration downtime because of memory pin / maps
https://www.mail-archive.com/[email protected]/msg953755.html
[2] [PATCH vhost v3 00/16] vdpa: Add support for vq descriptor mappings
https://lore.kernel.org/lkml/[email protected]/

---
v2:
- improved commit message to clarify the intended csope of .reset_map API
- improved commit messages to clarify no breakage on older userspace

v1:
- rewrote commit messages to include more detailed description and background
- reword to vendor specific IOMMU implementation from on-chip IOMMU
- include parent device backend feautres to persistent iotlb precondition
- reimplement mlx5_vdpa patch on top of descriptor group series

RFC v3:
- fix missing return due to merge error in patch #4

RFC v2:
- rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group" series:
https://lore.kernel.org/virtualization/[email protected]/

---

Si-Wei Liu (4):
vdpa: introduce .reset_map operation callback
vhost-vdpa: reset vendor specific mapping to initial state in .release
vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
vdpa/mlx5: implement .reset_map driver op

drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 +
drivers/vdpa/mlx5/core/mr.c | 17 ++++++++++++++++
drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 ++++++++++++-----
drivers/vhost/vdpa.c | 31 ++++++++++++++++++++++++++++++
include/linux/vdpa.h | 10 ++++++++++
include/uapi/linux/vhost_types.h | 2 ++
6 files changed, 74 insertions(+), 5 deletions(-)

--
2.39.3


2023-10-17 00:43:06

by Si-Wei Liu

[permalink] [raw]
Subject: [PATCH v2 3/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit

Userspace needs this feature flag to distinguish if vhost-vdpa
iotlb in the kernel can be trusted to persist IOTLB mapping
across vDPA reset. Without it, userspace has no way to tell apart
if it's running on an older kernel, which could silently drop
all iotlb mapping across vDPA reset, especially with broken
parent driver implementation for the .reset driver op. The broken
driver may incorrectly drop all mappings of its own as part of
.reset, which inadvertently ends up with corrupted mapping state
between vhost-vdpa userspace and the kernel. As a workaround, to
make the mapping behaviour predictable across reset, userspace
has to pro-actively remove all mappings before vDPA reset, and
then restore all the mappings afterwards. This workaround is done
unconditionally on top of all parent drivers today, due to the
parent driver implementation issue and no means to differentiate.
This workaround had been utilized in QEMU since day one when the
corresponding vhost-vdpa userspace backend came to the world.

There are 3 cases that backend may claim this feature bit on for:

- parent device that has to work with platform IOMMU
- parent device with on-chip IOMMU that has the expected
.reset_map support in driver
- parent device with vendor specific IOMMU implementation with
persistent IOTLB mapping already that has to specifically
declare this backend feature

The reason why .reset_map is being one of the pre-condition for
persistent iotlb is because without it, vhost-vdpa can't switch
back iotlb to the initial state later on, especially for the
on-chip IOMMU case which starts with identity mapping at device
creation. virtio-vdpa requires on-chip IOMMU to perform 1:1
passthrough translation from PA to IOVA as-is to begin with, and
.reset_map is the only means to turn back iotlb to the identity
mapping mode after vhost-vdpa is gone.

The difference in behavior did not matter as QEMU unmaps all the
memory unregistering the memory listener at vhost_vdpa_dev_start(
started = false), but the backend acknowledging this feature flag
allows QEMU to make sure it is safe to skip this unmap & map in the
case of vhost stop & start cycle.

In that sense, this feature flag is actually a signal for userspace
to know that the driver bug has been solved. Not offering it
indicates that userspace cannot trust the kernel will retain the
maps.

Signed-off-by: Si-Wei Liu <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
---
drivers/vhost/vdpa.c | 15 +++++++++++++++
include/uapi/linux/vhost_types.h | 2 ++
2 files changed, 17 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index a3f8160c9807..9202986a7d81 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -438,6 +438,15 @@ static u64 vhost_vdpa_get_backend_features(const struct vhost_vdpa *v)
return ops->get_backend_features(vdpa);
}

+static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v)
+{
+ struct vdpa_device *vdpa = v->vdpa;
+ const struct vdpa_config_ops *ops = vdpa->config;
+
+ return (!ops->set_map && !ops->dma_map) || ops->reset_map ||
+ vhost_vdpa_get_backend_features(v) & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST);
+}
+
static long vhost_vdpa_set_features(struct vhost_vdpa *v, u64 __user *featurep)
{
struct vdpa_device *vdpa = v->vdpa;
@@ -725,6 +734,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
return -EFAULT;
if (features & ~(VHOST_VDPA_BACKEND_FEATURES |
BIT_ULL(VHOST_BACKEND_F_DESC_ASID) |
+ BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST) |
BIT_ULL(VHOST_BACKEND_F_SUSPEND) |
BIT_ULL(VHOST_BACKEND_F_RESUME) |
BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK)))
@@ -741,6 +751,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) &&
!vhost_vdpa_has_desc_group(v))
return -EOPNOTSUPP;
+ if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) &&
+ !vhost_vdpa_has_persistent_map(v))
+ return -EOPNOTSUPP;
vhost_set_backend_features(&v->vdev, features);
return 0;
}
@@ -796,6 +809,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
features |= BIT_ULL(VHOST_BACKEND_F_RESUME);
if (vhost_vdpa_has_desc_group(v))
features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID);
+ if (vhost_vdpa_has_persistent_map(v))
+ features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST);
features |= vhost_vdpa_get_backend_features(v);
if (copy_to_user(featurep, &features, sizeof(features)))
r = -EFAULT;
diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
index 18ad6ae7ab5c..d7656908f730 100644
--- a/include/uapi/linux/vhost_types.h
+++ b/include/uapi/linux/vhost_types.h
@@ -190,5 +190,7 @@ struct vhost_vdpa_iova_range {
* buffers may reside. Requires VHOST_BACKEND_F_IOTLB_ASID.
*/
#define VHOST_BACKEND_F_DESC_ASID 0x7
+/* IOTLB don't flush memory mapping across device reset */
+#define VHOST_BACKEND_F_IOTLB_PERSIST 0x8

#endif
--
2.39.3

2023-10-17 00:43:10

by Si-Wei Liu

[permalink] [raw]
Subject: [PATCH v2 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release

Devices with on-chip IOMMU or vendor specific IOTLB implementation
may need to restore iotlb mapping to the initial or default state
using the .reset_map op, as it's desirable for some parent devices
to solely manipulate mappings by its own, independent of virtio device
state. For instance, device reset does not cause mapping go away on
such IOTLB model in need of persistent mapping. Before vhost-vdpa
is going away, give them a chance to reset iotlb back to the initial
state in vhost_vdpa_cleanup().

Signed-off-by: Si-Wei Liu <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
---
drivers/vhost/vdpa.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 851535f57b95..a3f8160c9807 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
return vhost_vdpa_alloc_as(v, asid);
}

+static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
+{
+ struct vdpa_device *vdpa = v->vdpa;
+ const struct vdpa_config_ops *ops = vdpa->config;
+
+ if (ops->reset_map)
+ ops->reset_map(vdpa, asid);
+}
+
static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
{
struct vhost_vdpa_as *as = asid_to_as(v, asid);
@@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)

hlist_del(&as->hash_link);
vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
+ /*
+ * Devices with vendor specific IOMMU may need to restore
+ * iotlb to the initial or default state which is not done
+ * through device reset, as the IOTLB mapping manipulation
+ * could be decoupled from the virtio device life cycle.
+ */
+ vhost_vdpa_reset_map(v, asid);
kfree(as);

return 0;
--
2.39.3

2023-10-17 00:43:14

by Si-Wei Liu

[permalink] [raw]
Subject: [PATCH v2 1/4] vdpa: introduce .reset_map operation callback

Device specific IOMMU parent driver who wishes to see mapping to be
decoupled from virtio or vdpa device life cycle (device reset) can use
it to restore memory mapping in the device IOMMU to the initial or
default state. The reset of mapping is done per address space basis.

The reason why a separate .reset_map op is introduced is because this
allows a simple on-chip IOMMU model without exposing too much device
implementation details to the upper vdpa layer. The .dma_map/unmap or
.set_map driver API is meant to be used to manipulate the IOTLB mappings,
and has been abstracted in a way similar to how a real IOMMU device maps
or unmaps pages for certain memory ranges. However, apart from this there
also exists other mapping needs, in which case 1:1 passthrough mapping
has to be used by other users (read virtio-vdpa). To ease parent/vendor
driver implementation and to avoid abusing DMA ops in an unexpacted way,
these on-chip IOMMU devices can start with 1:1 passthrough mapping mode
initially at the he time of creation. Then the .reset_map op can be used
to switch iotlb back to this initial state without having to expose a
complex two-dimensional IOMMU device model.

The .reset_map is not a MUST for every parent that implements the
.dma_map or .set_map API, because there could be software vDPA devices
(which has use_va=true) that don't have to pin kernel memory so they
don't care much about high mapping cost during device reset. And those
software devices may have also implemented their own DMA ops, so don't
have to use .reset_map to achieve a simple IOMMU device model for 1:1
passthrough mapping, either.

Signed-off-by: Si-Wei Liu <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Acked-by: Jason Wang <[email protected]>
---
include/linux/vdpa.h | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index d376309b99cf..26ae6ae1eac3 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -327,6 +327,15 @@ struct vdpa_map_file {
* @iova: iova to be unmapped
* @size: size of the area
* Returns integer: success (0) or error (< 0)
+ * @reset_map: Reset device memory mapping to the default
+ * state (optional)
+ * Needed for devices that are using device
+ * specific DMA translation and prefer mapping
+ * to be decoupled from the virtio life cycle,
+ * i.e. device .reset op does not reset mapping
+ * @vdev: vdpa device
+ * @asid: address space identifier
+ * Returns integer: success (0) or error (< 0)
* @get_vq_dma_dev: Get the dma device for a specific
* virtqueue (optional)
* @vdev: vdpa device
@@ -405,6 +414,7 @@ struct vdpa_config_ops {
u64 iova, u64 size, u64 pa, u32 perm, void *opaque);
int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid,
u64 iova, u64 size);
+ int (*reset_map)(struct vdpa_device *vdev, unsigned int asid);
int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group,
unsigned int asid);
struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx);
--
2.39.3

2023-10-17 00:43:25

by Si-Wei Liu

[permalink] [raw]
Subject: [PATCH v2 4/4] vdpa/mlx5: implement .reset_map driver op

Since commit 6f5312f80183 ("vdpa/mlx5: Add support for running with
virtio_vdpa"), mlx5_vdpa starts with preallocate 1:1 DMA MR at device
creation time. This 1:1 DMA MR will be implicitly destroyed while
the first .set_map call is invoked, in which case callers like
vhost-vdpa will start to set up custom mappings. When the .reset
callback is invoked, the custom mappings will be cleared and the 1:1
DMA MR will be re-created.

In order to reduce excessive memory mapping cost in live migration,
it is desirable to decouple the vhost-vdpa IOTLB abstraction from
the virtio device life cycle, i.e. mappings can be kept around intact
across virtio device reset. Leverage the .reset_map callback, which
is meant to destroy the regular MR on the given ASID and recreate the
initial DMA mapping. That way, the device .reset op can run free from
having to maintain and clean up memory mappings by itself.

The cvq mapping also needs to be cleared if is in the given ASID.

Co-developed-by: Dragos Tatulea <[email protected]>
Signed-off-by: Dragos Tatulea <[email protected]>
Signed-off-by: Si-Wei Liu <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
---
drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 +
drivers/vdpa/mlx5/core/mr.c | 17 +++++++++++++++++
drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +++++++++++++-----
3 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
index db988ced5a5d..84547d998bcf 100644
--- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
+++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
@@ -127,6 +127,7 @@ int mlx5_vdpa_update_cvq_iotlb(struct mlx5_vdpa_dev *mvdev,
struct vhost_iotlb *iotlb,
unsigned int asid);
int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev);
+int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid);

#define mlx5_vdpa_warn(__dev, format, ...) \
dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \
diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
index 66530e28f327..2197c46e563a 100644
--- a/drivers/vdpa/mlx5/core/mr.c
+++ b/drivers/vdpa/mlx5/core/mr.c
@@ -645,3 +645,20 @@ int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev)

return mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, 0);
}
+
+int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid)
+{
+ if (asid >= MLX5_VDPA_NUM_AS)
+ return -EINVAL;
+
+ mlx5_vdpa_destroy_mr(mvdev, mvdev->mr[asid]);
+
+ if (asid == 0 && MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) {
+ if (mlx5_vdpa_create_dma_mr(mvdev))
+ mlx5_vdpa_warn(mvdev, "create DMA MR failed\n");
+ } else {
+ mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, asid);
+ }
+
+ return 0;
+}
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 6abe02310f2b..928e71bc5571 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -2838,7 +2838,6 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev)
unregister_link_notifier(ndev);
teardown_driver(ndev);
clear_vqs_ready(ndev);
- mlx5_vdpa_destroy_mr_resources(&ndev->mvdev);
ndev->mvdev.status = 0;
ndev->mvdev.suspended = false;
ndev->cur_num_vqs = 0;
@@ -2849,10 +2848,6 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev)
init_group_to_asid_map(mvdev);
++mvdev->generation;

- if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) {
- if (mlx5_vdpa_create_dma_mr(mvdev))
- mlx5_vdpa_warn(mvdev, "create MR failed\n");
- }
up_write(&ndev->reslock);

return 0;
@@ -2932,6 +2927,18 @@ static int mlx5_vdpa_set_map(struct vdpa_device *vdev, unsigned int asid,
return err;
}

+static int mlx5_vdpa_reset_map(struct vdpa_device *vdev, unsigned int asid)
+{
+ struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
+ struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
+ int err;
+
+ down_write(&ndev->reslock);
+ err = mlx5_vdpa_reset_mr(mvdev, asid);
+ up_write(&ndev->reslock);
+ return err;
+}
+
static struct device *mlx5_get_vq_dma_dev(struct vdpa_device *vdev, u16 idx)
{
struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
@@ -3199,6 +3206,7 @@ static const struct vdpa_config_ops mlx5_vdpa_ops = {
.set_config = mlx5_vdpa_set_config,
.get_generation = mlx5_vdpa_get_generation,
.set_map = mlx5_vdpa_set_map,
+ .reset_map = mlx5_vdpa_reset_map,
.set_group_asid = mlx5_set_group_asid,
.get_vq_dma_dev = mlx5_get_vq_dma_dev,
.free = mlx5_vdpa_free,
--
2.39.3