2023-02-07 21:20:38

by Nicolin Chen

[permalink] [raw]
Subject: [PATCH v2 00/10] Add IO page table replacement support

Changelog
v1->v2:
* Rebased on top of vfio_device cdev v2 series.
* Update the kdoc and commit message of iommu_group_replace_domain().
* Dropped revert-to-core-domain part in iommu_group_replace_domain().
* Dropped !ops->dma_unmap check in vfio_iommufd_emulated_attach_ioas().
* Added missing rc value in vfio_iommufd_emulated_attach_ioas() from the
iommufd_access_set_ioas() call.
* Added a new patch in vfio_main to deny vfio_pin/unpin_pages() calls if
vdev->ops->dma_unmap is not implemented.
* Added a __iommmufd_device_detach helper and let the replace routine do
a partial detach().
* Added restriction on auto_domains to use the replace feature.
* Added the patch "iommufd/device: Make hwpt_list list_add/del symmetric"
from the has_group removal series.

Hi all,

The existing IOMMU APIs provide a pair of functions: iommu_attach_group()
for callers to attach a device from the default_domain (NULL if not being
supported) to a given iommu domain, and iommu_detach_group() for callers
to detach a device from a given domain to the default_domain. Internally,
the detach_dev op is deprecated for the newer drivers with default_domain.
This means that those drivers likely can switch an attaching domain to
another one, without stagging the device at a blocking or default domain,
for use cases such as:
1) vPASID mode, when a guest wants to replace a single pasid (PASID=0)
table with a larger table (PASID=N)
2) Nesting mode, when switching the attaching device from an S2 domain
to an S1 domain, or when switching between relevant S1 domains.

This series introduces a new iommu_group_replace_domain() for that. And
add corresponding support throughout the uAPI. So user space can do such
a REPLACE ioctl reusing the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT. This
means that user space needs to be aware whether the device is attached or
not: an unattached device calling VFIO_DEVICE_ATTACH_IOMMUFD_PT means a
regular ATTACH; an attached device calling VFIO_DEVICE_ATTACH_IOMMUFD_PT
on the other hand means a REPLACE.

QEMU with this feature should have the vIOMMU maintain a cache of the
guest io page table addresses and assign a unique IOAS to each unique
guest page table.

As the guest writes the page table address to the HW registers qemu should
then use the 'replace domain' operation on VFIO to assign the VFIO device
to the correct de-duplicated page table.

The algorithm where QEMU uses one VFIO container per-device and removes
all the mappings to change the assignment should ideally not be used with
iommufd.

To apply this series, please rebase on top of the following patches:
1) [PATCH v2 00/14] Add vfio_device cdev for iommufd support
https://lore.kernel.org/kvm/[email protected]/

Or you can also find this series on Github:
https://github.com/nicolinc/iommufd/commits/iommu_group_replace_domain-v2

Thank you
Nicolin Chen

Nicolin Chen (9):
iommu: Introduce a new iommu_group_replace_domain() API
iommufd: Create access in vfio_iommufd_emulated_bind()
iommufd/selftest: Add IOMMU_TEST_OP_ACCESS_SET_IOAS coverage
iommufd: Add replace support in iommufd_access_set_ioas()
iommufd/selftest: Add coverage for access->ioas replacement
iommufd/device: Make hwpt_list list_add/del symmetric
iommufd/device: Use iommu_group_replace_domain()
vfio: Support IO page table replacement
vfio: Do not allow !ops->dma_unmap in vfio_pin/unpin_pages()

Yi Liu (1):
iommu: Move dev_iommu_ops() to private header

drivers/iommu/iommu-priv.h | 22 ++
drivers/iommu/iommu.c | 30 +++
drivers/iommu/iommufd/device.c | 221 +++++++++++++-----
drivers/iommu/iommufd/iommufd_private.h | 4 +
drivers/iommu/iommufd/iommufd_test.h | 4 +
drivers/iommu/iommufd/selftest.c | 25 +-
drivers/vfio/iommufd.c | 30 ++-
drivers/vfio/vfio_main.c | 4 +
include/linux/iommu.h | 11 -
include/linux/iommufd.h | 3 +-
include/uapi/linux/vfio.h | 6 +
tools/testing/selftests/iommu/iommufd.c | 29 ++-
tools/testing/selftests/iommu/iommufd_utils.h | 22 +-
13 files changed, 321 insertions(+), 90 deletions(-)
create mode 100644 drivers/iommu/iommu-priv.h

--
2.39.1



2023-02-07 21:20:42

by Nicolin Chen

[permalink] [raw]
Subject: [PATCH v2 01/10] iommu: Move dev_iommu_ops() to private header

From: Yi Liu <[email protected]>

dev_iommu_ops() is essentially only used in iommu subsystem, so
move to a private header to avoid being abused by other drivers.

Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommu-priv.h | 18 ++++++++++++++++++
drivers/iommu/iommu.c | 2 ++
include/linux/iommu.h | 11 -----------
3 files changed, 20 insertions(+), 11 deletions(-)
create mode 100644 drivers/iommu/iommu-priv.h

diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
new file mode 100644
index 000000000000..9e1497027cff
--- /dev/null
+++ b/drivers/iommu/iommu-priv.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef __LINUX_IOMMU_PRIV_H
+#define __LINUX_IOMMU_PRIV_H
+
+#include <linux/iommu.h>
+
+static inline const struct iommu_ops *dev_iommu_ops(struct device *dev)
+{
+ /*
+ * Assume that valid ops must be installed if iommu_probe_device()
+ * has succeeded. The device ops are essentially for internal use
+ * within the IOMMU subsystem itself, so we should be able to trust
+ * ourselves not to misuse the helper.
+ */
+ return dev->iommu->iommu_dev->ops;
+}
+#endif /* __LINUX_IOMMU_PRIV_H */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 4b5a21ac5e88..a18b7f1a4e6e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -35,6 +35,8 @@

#include "iommu-sva.h"

+#include "iommu-priv.h"
+
static struct kset *iommu_group_kset;
static DEFINE_IDA(iommu_group_ida);

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a8063f26ff69..cb586d054c57 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -445,17 +445,6 @@ static inline void iommu_iotlb_gather_init(struct iommu_iotlb_gather *gather)
};
}

-static inline const struct iommu_ops *dev_iommu_ops(struct device *dev)
-{
- /*
- * Assume that valid ops must be installed if iommu_probe_device()
- * has succeeded. The device ops are essentially for internal use
- * within the IOMMU subsystem itself, so we should be able to trust
- * ourselves not to misuse the helper.
- */
- return dev->iommu->iommu_dev->ops;
-}
-
extern int bus_iommu_probe(struct bus_type *bus);
extern bool iommu_present(struct bus_type *bus);
extern bool device_iommu_capable(struct device *dev, enum iommu_cap cap);
--
2.39.1


2023-02-07 21:20:46

by Nicolin Chen

[permalink] [raw]
Subject: [PATCH v2 03/10] iommufd: Create access in vfio_iommufd_emulated_bind()

To prepare for an access->ioas replacement, move iommufd_access_create()
call into vfio_iommufd_emulated_bind(), making it symmetric with the
__vfio_iommufd_access_destroy() call in vfio_iommufd_emulated_unbind().
This means an access is created/destroyed by the bind()/unbind(), and the
vfio_iommufd_emulated_attach_ioas() only updates the access->ioas pointer.

Since there's no longer an ioas_id input for iommufd_access_create(), add
a new helper iommufd_access_set_ioas() to set access->ioas. We can later
add a "replace" feature simply to the new iommufd_access_set_ioas() too.

Leaving the access->ioas in vfio_iommufd_emulated_attach_ioas(), however,
can introduce some potential of a race condition during pin_/unpin_pages()
call where access->ioas->iopt is getting referenced. So, add an ioas_lock
to protect it.

Note that the "refcount_dec(&access->ioas->obj.users)" line is also moved
to the new iommufd_access_set_ioas() from iommufd_access_destroy_object()
for symmetry. Without this change, the old_ioas would also lose the track
of its refcount when the replace support is added.

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommufd/device.c | 100 ++++++++++++++++++------
drivers/iommu/iommufd/iommufd_private.h | 1 +
drivers/iommu/iommufd/selftest.c | 5 +-
drivers/vfio/iommufd.c | 27 ++++---
include/linux/iommufd.h | 3 +-
5 files changed, 97 insertions(+), 39 deletions(-)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index d81f93a321af..f4bd6f532a90 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -418,9 +418,9 @@ void iommufd_access_destroy_object(struct iommufd_object *obj)
struct iommufd_access *access =
container_of(obj, struct iommufd_access, obj);

- iopt_remove_access(&access->ioas->iopt, access);
+ iommufd_access_set_ioas(access, 0);
iommufd_ctx_put(access->ictx);
- refcount_dec(&access->ioas->obj.users);
+ mutex_destroy(&access->ioas_lock);
}

/**
@@ -437,12 +437,10 @@ void iommufd_access_destroy_object(struct iommufd_object *obj)
* The provided ops are required to use iommufd_access_pin_pages().
*/
struct iommufd_access *
-iommufd_access_create(struct iommufd_ctx *ictx, u32 ioas_id,
+iommufd_access_create(struct iommufd_ctx *ictx,
const struct iommufd_access_ops *ops, void *data)
{
struct iommufd_access *access;
- struct iommufd_object *obj;
- int rc;

/*
* There is no uAPI for the access object, but to keep things symmetric
@@ -455,33 +453,18 @@ iommufd_access_create(struct iommufd_ctx *ictx, u32 ioas_id,
access->data = data;
access->ops = ops;

- obj = iommufd_get_object(ictx, ioas_id, IOMMUFD_OBJ_IOAS);
- if (IS_ERR(obj)) {
- rc = PTR_ERR(obj);
- goto out_abort;
- }
- access->ioas = container_of(obj, struct iommufd_ioas, obj);
- iommufd_ref_to_users(obj);
-
if (ops->needs_pin_pages)
access->iova_alignment = PAGE_SIZE;
else
access->iova_alignment = 1;
- rc = iopt_add_access(&access->ioas->iopt, access);
- if (rc)
- goto out_put_ioas;

/* The calling driver is a user until iommufd_access_destroy() */
refcount_inc(&access->obj.users);
+ mutex_init(&access->ioas_lock);
access->ictx = ictx;
iommufd_ctx_get(ictx);
iommufd_object_finalize(ictx, &access->obj);
return access;
-out_put_ioas:
- refcount_dec(&access->ioas->obj.users);
-out_abort:
- iommufd_object_abort(ictx, &access->obj);
- return ERR_PTR(rc);
}
EXPORT_SYMBOL_NS_GPL(iommufd_access_create, IOMMUFD);

@@ -500,6 +483,50 @@ void iommufd_access_destroy(struct iommufd_access *access)
}
EXPORT_SYMBOL_NS_GPL(iommufd_access_destroy, IOMMUFD);

+int iommufd_access_set_ioas(struct iommufd_access *access, u32 ioas_id)
+{
+ struct iommufd_ioas *new_ioas = NULL, *cur_ioas;
+ struct iommufd_ctx *ictx = access->ictx;
+ struct iommufd_object *obj;
+ int rc = 0;
+
+ if (ioas_id) {
+ obj = iommufd_get_object(ictx, ioas_id, IOMMUFD_OBJ_IOAS);
+ if (IS_ERR(obj))
+ return PTR_ERR(obj);
+ new_ioas = container_of(obj, struct iommufd_ioas, obj);
+ }
+
+ mutex_lock(&access->ioas_lock);
+ cur_ioas = access->ioas;
+ if (cur_ioas == new_ioas)
+ goto out_unlock;
+
+ if (new_ioas) {
+ rc = iopt_add_access(&new_ioas->iopt, access);
+ if (rc)
+ goto out_unlock;
+ iommufd_ref_to_users(obj);
+ }
+
+ if (cur_ioas) {
+ iopt_remove_access(&cur_ioas->iopt, access);
+ refcount_dec(&cur_ioas->obj.users);
+ }
+
+ access->ioas = new_ioas;
+ mutex_unlock(&access->ioas_lock);
+
+ return 0;
+
+out_unlock:
+ mutex_unlock(&access->ioas_lock);
+ if (new_ioas)
+ iommufd_put_object(obj);
+ return rc;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_access_set_ioas, IOMMUFD);
+
/**
* iommufd_access_notify_unmap - Notify users of an iopt to stop using it
* @iopt: iopt to work on
@@ -550,8 +577,8 @@ void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
void iommufd_access_unpin_pages(struct iommufd_access *access,
unsigned long iova, unsigned long length)
{
- struct io_pagetable *iopt = &access->ioas->iopt;
struct iopt_area_contig_iter iter;
+ struct io_pagetable *iopt;
unsigned long last_iova;
struct iopt_area *area;

@@ -559,6 +586,13 @@ void iommufd_access_unpin_pages(struct iommufd_access *access,
WARN_ON(check_add_overflow(iova, length - 1, &last_iova)))
return;

+ mutex_lock(&access->ioas_lock);
+ if (!access->ioas) {
+ mutex_unlock(&access->ioas_lock);
+ return;
+ }
+ iopt = &access->ioas->iopt;
+
down_read(&iopt->iova_rwsem);
iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova)
iopt_area_remove_access(
@@ -568,6 +602,7 @@ void iommufd_access_unpin_pages(struct iommufd_access *access,
min(last_iova, iopt_area_last_iova(area))));
up_read(&iopt->iova_rwsem);
WARN_ON(!iopt_area_contig_done(&iter));
+ mutex_unlock(&access->ioas_lock);
}
EXPORT_SYMBOL_NS_GPL(iommufd_access_unpin_pages, IOMMUFD);

@@ -613,8 +648,8 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
unsigned long length, struct page **out_pages,
unsigned int flags)
{
- struct io_pagetable *iopt = &access->ioas->iopt;
struct iopt_area_contig_iter iter;
+ struct io_pagetable *iopt;
unsigned long last_iova;
struct iopt_area *area;
int rc;
@@ -629,6 +664,13 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
if (check_add_overflow(iova, length - 1, &last_iova))
return -EOVERFLOW;

+ mutex_lock(&access->ioas_lock);
+ if (!access->ioas) {
+ mutex_unlock(&access->ioas_lock);
+ return -ENOENT;
+ }
+ iopt = &access->ioas->iopt;
+
down_read(&iopt->iova_rwsem);
iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) {
unsigned long last = min(last_iova, iopt_area_last_iova(area));
@@ -659,6 +701,7 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
}

up_read(&iopt->iova_rwsem);
+ mutex_unlock(&access->ioas_lock);
return 0;

err_remove:
@@ -673,6 +716,7 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
iopt_area_last_iova(area))));
}
up_read(&iopt->iova_rwsem);
+ mutex_unlock(&access->ioas_lock);
return rc;
}
EXPORT_SYMBOL_NS_GPL(iommufd_access_pin_pages, IOMMUFD);
@@ -692,8 +736,8 @@ EXPORT_SYMBOL_NS_GPL(iommufd_access_pin_pages, IOMMUFD);
int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
void *data, size_t length, unsigned int flags)
{
- struct io_pagetable *iopt = &access->ioas->iopt;
struct iopt_area_contig_iter iter;
+ struct io_pagetable *iopt;
struct iopt_area *area;
unsigned long last_iova;
int rc;
@@ -703,6 +747,13 @@ int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
if (check_add_overflow(iova, length - 1, &last_iova))
return -EOVERFLOW;

+ mutex_lock(&access->ioas_lock);
+ if (!access->ioas) {
+ mutex_unlock(&access->ioas_lock);
+ return -ENOENT;
+ }
+ iopt = &access->ioas->iopt;
+
down_read(&iopt->iova_rwsem);
iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) {
unsigned long last = min(last_iova, iopt_area_last_iova(area));
@@ -729,6 +780,7 @@ int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
rc = -ENOENT;
err_out:
up_read(&iopt->iova_rwsem);
+ mutex_unlock(&access->ioas_lock);
return rc;
}
EXPORT_SYMBOL_NS_GPL(iommufd_access_rw, IOMMUFD);
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 222e86591f8a..2f4bb106bac6 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -261,6 +261,7 @@ struct iommufd_access {
struct iommufd_object obj;
struct iommufd_ctx *ictx;
struct iommufd_ioas *ioas;
+ struct mutex ioas_lock;
const struct iommufd_access_ops *ops;
void *data;
unsigned long iova_alignment;
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index cfb5fe9a5e0e..db4011bdc8a9 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -571,7 +571,7 @@ static int iommufd_test_create_access(struct iommufd_ucmd *ucmd,
}

access = iommufd_access_create(
- ucmd->ictx, ioas_id,
+ ucmd->ictx,
(flags & MOCK_FLAGS_ACCESS_CREATE_NEEDS_PIN_PAGES) ?
&selftest_access_ops_pin :
&selftest_access_ops,
@@ -580,6 +580,9 @@ static int iommufd_test_create_access(struct iommufd_ucmd *ucmd,
rc = PTR_ERR(access);
goto out_put_fdno;
}
+ rc = iommufd_access_set_ioas(access, ioas_id);
+ if (rc)
+ goto out_destroy;
cmd->create_access.out_access_fd = fdno;
rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
if (rc)
diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index 026f81a87dd7..dc9feab73db7 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -141,10 +141,19 @@ static const struct iommufd_access_ops vfio_user_ops = {
int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
struct iommufd_ctx *ictx, u32 *out_device_id)
{
+ struct iommufd_access *user;
+
lockdep_assert_held(&vdev->dev_set->lock);

- vdev->iommufd_ictx = ictx;
iommufd_ctx_get(ictx);
+ user = iommufd_access_create(vdev->iommufd_ictx, &vfio_user_ops, vdev);
+ if (IS_ERR(user)) {
+ iommufd_ctx_put(vdev->iommufd_ictx);
+ return PTR_ERR(user);
+ }
+ iommufd_access_set_ioas(user, 0);
+ vdev->iommufd_access = user;
+ vdev->iommufd_ictx = ictx;
return 0;
}
EXPORT_SYMBOL_GPL(vfio_iommufd_emulated_bind);
@@ -168,22 +177,14 @@ EXPORT_SYMBOL_GPL(vfio_iommufd_emulated_unbind);

int vfio_iommufd_emulated_attach_ioas(struct vfio_device *vdev, u32 *pt_id)
{
- struct iommufd_access *user;
-
lockdep_assert_held(&vdev->dev_set->lock);

if (!vdev->iommufd_ictx)
return -EINVAL;
+ if (!vdev->iommufd_access)
+ return -ENOENT;

- if (vdev->iommufd_access)
- return -EBUSY;
-
- user = iommufd_access_create(vdev->iommufd_ictx, *pt_id, &vfio_user_ops,
- vdev);
- if (IS_ERR(user))
- return PTR_ERR(user);
- vdev->iommufd_access = user;
- return 0;
+ return iommufd_access_set_ioas(vdev->iommufd_access, *pt_id);
}
EXPORT_SYMBOL_GPL(vfio_iommufd_emulated_attach_ioas);

@@ -194,6 +195,6 @@ void vfio_iommufd_emulated_detach_ioas(struct vfio_device *vdev)
if (!vdev->iommufd_ictx || !vdev->iommufd_access)
return;

- __vfio_iommufd_access_destroy(vdev);
+ iommufd_access_set_ioas(vdev->iommufd_access, 0);
}
EXPORT_SYMBOL_GPL(vfio_iommufd_emulated_detach_ioas);
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 9672cf839687..f9bac6f9db2e 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -46,9 +46,10 @@ enum {
};

struct iommufd_access *
-iommufd_access_create(struct iommufd_ctx *ictx, u32 ioas_id,
+iommufd_access_create(struct iommufd_ctx *ictx,
const struct iommufd_access_ops *ops, void *data);
void iommufd_access_destroy(struct iommufd_access *access);
+int iommufd_access_set_ioas(struct iommufd_access *access, u32 ioas_id);

void iommufd_ctx_get(struct iommufd_ctx *ictx);

--
2.39.1


2023-02-07 21:20:49

by Nicolin Chen

[permalink] [raw]
Subject: [PATCH v2 05/10] iommufd: Add replace support in iommufd_access_set_ioas()

Support an access->ioas replacement in iommufd_access_set_ioas(), which
sets the access->ioas to NULL provisionally so that any further incoming
iommufd_access_pin_pages() callback can be blocked.

Then, call access->ops->unmap() to clean up the entire iopt. To allow an
iommufd_access_unpin_pages() callback to happen via this unmap() call,
add an ioas_unpin pointer so the unpin routine won't be affected by the
"access->ioas = NULL" trick above.

Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommufd/device.c | 16 ++++++++++++++--
drivers/iommu/iommufd/iommufd_private.h | 1 +
2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index f4bd6f532a90..10ce47484ffa 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -509,11 +509,23 @@ int iommufd_access_set_ioas(struct iommufd_access *access, u32 ioas_id)
iommufd_ref_to_users(obj);
}

+ /*
+ * Set ioas to NULL to block any further iommufd_access_pin_pages().
+ * iommufd_access_unpin_pages() can continue using access->ioas_unpin.
+ */
+ access->ioas = NULL;
+
if (cur_ioas) {
+ if (new_ioas) {
+ mutex_unlock(&access->ioas_lock);
+ access->ops->unmap(access->data, 0, ULONG_MAX);
+ mutex_lock(&access->ioas_lock);
+ }
iopt_remove_access(&cur_ioas->iopt, access);
refcount_dec(&cur_ioas->obj.users);
}

+ access->ioas_unpin = new_ioas;
access->ioas = new_ioas;
mutex_unlock(&access->ioas_lock);

@@ -587,11 +599,11 @@ void iommufd_access_unpin_pages(struct iommufd_access *access,
return;

mutex_lock(&access->ioas_lock);
- if (!access->ioas) {
+ if (!access->ioas_unpin) {
mutex_unlock(&access->ioas_lock);
return;
}
- iopt = &access->ioas->iopt;
+ iopt = &access->ioas_unpin->iopt;

down_read(&iopt->iova_rwsem);
iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 2f4bb106bac6..593138bb37b8 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -261,6 +261,7 @@ struct iommufd_access {
struct iommufd_object obj;
struct iommufd_ctx *ictx;
struct iommufd_ioas *ioas;
+ struct iommufd_ioas *ioas_unpin;
struct mutex ioas_lock;
const struct iommufd_access_ops *ops;
void *data;
--
2.39.1


2023-02-07 21:20:54

by Nicolin Chen

[permalink] [raw]
Subject: [PATCH v2 02/10] iommu: Introduce a new iommu_group_replace_domain() API

qemu has a need to replace the translations associated with a domain
when the guest does large-scale operations like switching between an
IDENTITY domain and, say, dma-iommu.c.

Currently, it does this by replacing all the mappings in a single
domain, but this is very inefficient and means that domains have to be
per-device rather than per-translation.

Provide a high-level API to allow replacements of one domain with
another. This is similar to a detach/attach cycle except it doesn't
force the group to go to the blocking domain in-between.

By removing this forced blocking domain the iommu driver has the
opportunity to implement an atomic replacement of the domains to the
greatest extent its hardware allows.

It could be possible to adderss this by simply removing the protection
from the iommu_attach_group(), but it is not so clear if that is safe
for the few users. Thus, add a new API to serve this new purpose.

Atomic replacement allows the qemu emulation of the viommu to be more
complete, as real hardware has this ability.

All drivers are already required to support changing between active
UNMANAGED domains when using their attach_dev ops.

This API is expected to be used by IOMMUFD, so add to the iommu-priv
header and mark it as IOMMUFD_INTERNAL.

Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommu-priv.h | 4 ++++
drivers/iommu/iommu.c | 28 ++++++++++++++++++++++++++++
2 files changed, 32 insertions(+)

diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
index 9e1497027cff..b546795a7e49 100644
--- a/drivers/iommu/iommu-priv.h
+++ b/drivers/iommu/iommu-priv.h
@@ -15,4 +15,8 @@ static inline const struct iommu_ops *dev_iommu_ops(struct device *dev)
*/
return dev->iommu->iommu_dev->ops;
}
+
+extern int iommu_group_replace_domain(struct iommu_group *group,
+ struct iommu_domain *new_domain);
+
#endif /* __LINUX_IOMMU_PRIV_H */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index a18b7f1a4e6e..15e07d39cd8d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2151,6 +2151,34 @@ int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
}
EXPORT_SYMBOL_GPL(iommu_attach_group);

+/**
+ * iommu_group_replace_domain - replace the domain that a group is attached to
+ * @new_domain: new IOMMU domain to replace with
+ * @group: IOMMU group that will be attached to the new domain
+ *
+ * This API allows the group to switch domains without being forced to go to
+ * the blocking domain in-between.
+ *
+ * If the currently attached domain is a core domain (e.g. a default_domain),
+ * it will act just like the iommu_attach_group().
+ */
+int iommu_group_replace_domain(struct iommu_group *group,
+ struct iommu_domain *new_domain)
+{
+ int ret;
+
+ if (!new_domain)
+ return -EINVAL;
+
+ mutex_lock(&group->mutex);
+ ret = __iommu_group_set_domain(group, new_domain);
+ if (ret)
+ __iommu_group_set_domain(group, group->domain);
+ mutex_unlock(&group->mutex);
+ return ret;
+}
+EXPORT_SYMBOL_NS_GPL(iommu_group_replace_domain, IOMMUFD_INTERNAL);
+
static int iommu_group_do_set_platform_dma(struct device *dev, void *data)
{
const struct iommu_ops *ops = dev_iommu_ops(dev);
--
2.39.1


2023-02-07 21:20:58

by Nicolin Chen

[permalink] [raw]
Subject: [PATCH v2 04/10] iommufd/selftest: Add IOMMU_TEST_OP_ACCESS_SET_IOAS coverage

Add a new IOMMU_TEST_OP_ACCESS_SET_IOAS to allow setting access->ioas
individually, corresponding to the iommufd_access_set_ioas() helper.

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommufd/iommufd_test.h | 4 +++
drivers/iommu/iommufd/selftest.c | 26 +++++++++++++++----
tools/testing/selftests/iommu/iommufd_utils.h | 22 ++++++++++++++--
3 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index 1d96a8f466fd..f2c61a9500e7 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -13,6 +13,7 @@ enum {
IOMMU_TEST_OP_MD_CHECK_MAP,
IOMMU_TEST_OP_MD_CHECK_REFS,
IOMMU_TEST_OP_CREATE_ACCESS,
+ IOMMU_TEST_OP_ACCESS_SET_IOAS,
IOMMU_TEST_OP_DESTROY_ACCESS_PAGES,
IOMMU_TEST_OP_ACCESS_PAGES,
IOMMU_TEST_OP_ACCESS_RW,
@@ -66,6 +67,9 @@ struct iommu_test_cmd {
__u32 out_access_fd;
__u32 flags;
} create_access;
+ struct {
+ __u32 ioas_id;
+ } access_set_ioas;
struct {
__u32 access_pages_id;
} destroy_access_pages;
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index db4011bdc8a9..b94870f93138 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -549,7 +549,7 @@ static struct selftest_access *iommufd_test_alloc_access(void)
}

static int iommufd_test_create_access(struct iommufd_ucmd *ucmd,
- unsigned int ioas_id, unsigned int flags)
+ unsigned int flags)
{
struct iommu_test_cmd *cmd = ucmd->cmd;
struct selftest_access *staccess;
@@ -580,9 +580,6 @@ static int iommufd_test_create_access(struct iommufd_ucmd *ucmd,
rc = PTR_ERR(access);
goto out_put_fdno;
}
- rc = iommufd_access_set_ioas(access, ioas_id);
- if (rc)
- goto out_destroy;
cmd->create_access.out_access_fd = fdno;
rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
if (rc)
@@ -601,6 +598,22 @@ static int iommufd_test_create_access(struct iommufd_ucmd *ucmd,
return rc;
}

+static int iommufd_test_access_set_ioas(struct iommufd_ucmd *ucmd,
+ unsigned int access_id,
+ unsigned int ioas_id)
+{
+ struct selftest_access *staccess;
+ int rc;
+
+ staccess = iommufd_access_get(access_id);
+ if (IS_ERR(staccess))
+ return PTR_ERR(staccess);
+
+ rc = iommufd_access_set_ioas(staccess->access, ioas_id);
+ fput(staccess->file);
+ return rc;
+}
+
/* Check that the pages in a page array match the pages in the user VA */
static int iommufd_test_check_pages(void __user *uptr, struct page **pages,
size_t npages)
@@ -810,8 +823,11 @@ int iommufd_test(struct iommufd_ucmd *ucmd)
ucmd, u64_to_user_ptr(cmd->check_refs.uptr),
cmd->check_refs.length, cmd->check_refs.refs);
case IOMMU_TEST_OP_CREATE_ACCESS:
- return iommufd_test_create_access(ucmd, cmd->id,
+ return iommufd_test_create_access(ucmd,
cmd->create_access.flags);
+ case IOMMU_TEST_OP_ACCESS_SET_IOAS:
+ return iommufd_test_access_set_ioas(
+ ucmd, cmd->id, cmd->access_set_ioas.ioas_id);
case IOMMU_TEST_OP_ACCESS_PAGES:
return iommufd_test_access_pages(
ucmd, cmd->id, cmd->access_pages.iova,
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 0d1f46369c2a..67805afc620f 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -66,13 +66,31 @@ static int _test_cmd_mock_domain(int fd, unsigned int ioas_id, __u32 *device_id,
EXPECT_ERRNO(_errno, _test_cmd_mock_domain(self->fd, ioas_id, \
device_id, hwpt_id))

+static int _test_cmd_access_set_ioas(int fd, __u32 access_id,
+ unsigned int ioas_id)
+{
+ struct iommu_test_cmd cmd = {
+ .size = sizeof(cmd),
+ .op = IOMMU_TEST_OP_ACCESS_SET_IOAS,
+ .id = access_id,
+ .access_set_ioas = { .ioas_id = ioas_id },
+ };
+ int ret;
+
+ ret = ioctl(fd, IOMMU_TEST_CMD, &cmd);
+ if (ret)
+ return ret;
+ return 0;
+}
+#define test_cmd_access_set_ioas(access_id, ioas_id) \
+ ASSERT_EQ(0, _test_cmd_access_set_ioas(self->fd, access_id, ioas_id))
+
static int _test_cmd_create_access(int fd, unsigned int ioas_id,
__u32 *access_id, unsigned int flags)
{
struct iommu_test_cmd cmd = {
.size = sizeof(cmd),
.op = IOMMU_TEST_OP_CREATE_ACCESS,
- .id = ioas_id,
.create_access = { .flags = flags },
};
int ret;
@@ -81,7 +99,7 @@ static int _test_cmd_create_access(int fd, unsigned int ioas_id,
if (ret)
return ret;
*access_id = cmd.create_access.out_access_fd;
- return 0;
+ return _test_cmd_access_set_ioas(fd, *access_id, ioas_id);
}
#define test_cmd_create_access(ioas_id, access_id, flags) \
ASSERT_EQ(0, _test_cmd_create_access(self->fd, ioas_id, access_id, \
--
2.39.1


2023-02-07 21:21:02

by Nicolin Chen

[permalink] [raw]
Subject: [PATCH v2 07/10] iommufd/device: Make hwpt_list list_add/del symmetric

Because list_del() is together with iopt_table_remove_domain(), it makes
sense to have list_add_tail() together with iopt_table_add_domain().

Also place the mutex outside the iommufd_device_do_attach() call, similar
to what's in the iommufd_device_auto_get_domain() function.

Co-developed-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommufd/device.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 10ce47484ffa..b8c3e3baccb5 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -200,6 +200,8 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
phys_addr_t sw_msi_start = PHYS_ADDR_MAX;
int rc;

+ lockdep_assert_held(&hwpt->ioas->mutex);
+
mutex_lock(&hwpt->devices_lock);

/*
@@ -243,6 +245,7 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
hwpt->domain);
if (rc)
goto out_detach;
+ list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list);
}
}

@@ -304,7 +307,6 @@ static int iommufd_device_auto_get_domain(struct iommufd_device *idev,
rc = iommufd_device_do_attach(idev, hwpt);
if (rc)
goto out_abort;
- list_add_tail(&hwpt->hwpt_item, &ioas->hwpt_list);

mutex_unlock(&ioas->mutex);
iommufd_object_finalize(idev->ictx, &hwpt->obj);
@@ -343,13 +345,11 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id)
struct iommufd_hw_pagetable *hwpt =
container_of(pt_obj, struct iommufd_hw_pagetable, obj);

+ mutex_lock(&hwpt->ioas->mutex);
rc = iommufd_device_do_attach(idev, hwpt);
+ mutex_unlock(&hwpt->ioas->mutex);
if (rc)
goto out_put_pt_obj;
-
- mutex_lock(&hwpt->ioas->mutex);
- list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list);
- mutex_unlock(&hwpt->ioas->mutex);
break;
}
case IOMMUFD_OBJ_IOAS: {
--
2.39.1


2023-02-07 21:21:06

by Nicolin Chen

[permalink] [raw]
Subject: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

iommu_group_replace_domain() is introduced to support use cases where an
iommu_group can be attached to a new domain without getting detached from
the old one. This replacement feature will be useful, for cases such as:
1) vPASID mode, when a guest wants to replace a single pasid (PASID=0)
table with a larger table (PASID=N)
2) Nesting mode, when switching the attaching device from an S2 domain
to an S1 domain, or when switching between relevant S1 domains.
as it allows these cases to switch seamlessly without a DMA disruption.

So, call iommu_group_replace_domain() in the iommufd_device_do_attach().
And add a __iommmufd_device_detach helper to allow the replace routine to
do a partial detach on the current hwpt that's being replaced. Though the
updated locking logic is overcomplicated, it will be eased, once those
iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's
allocation/destroy() functions in the coming nesting series, as that'll
depend on a new ->domain_alloc_user op in the iommu core.

Also, block replace operations that are from/to auto_domains, i.e. only
user-allocated hw_pagetables can be replaced or replaced with.

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommufd/device.c | 101 +++++++++++++++++-------
drivers/iommu/iommufd/iommufd_private.h | 2 +
2 files changed, 76 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index b8c3e3baccb5..8a9834fc129a 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -9,6 +9,8 @@
#include "io_pagetable.h"
#include "iommufd_private.h"

+MODULE_IMPORT_NS(IOMMUFD_INTERNAL);
+
static bool allow_unsafe_interrupts;
module_param(allow_unsafe_interrupts, bool, S_IRUGO | S_IWUSR);
MODULE_PARM_DESC(
@@ -194,9 +196,61 @@ static bool iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt,
return false;
}

+/**
+ * __iommmufd_device_detach - Detach a device from idev->hwpt to new_hwpt
+ * @idev: device to detach
+ * @new_hwpt: new hw_pagetable to attach (pass in NULL for a simple detach)
+ * @detach_group: flag to call iommu_detach_group
+ *
+ * This is a cleanup helper shared by the replace and detach routines. Comparing
+ * to a detach routine, a replace routine only needs a partial detach procedure:
+ * it does not need the iommu_detach_group(); it will attach the device to a new
+ * hw_pagetable after a partial detach from the currently attached hw_pagetable,
+ * so certain steps can be skipped if two hw_pagetables have the same IOAS.
+ */
+static void __iommmufd_device_detach(struct iommufd_device *idev,
+ struct iommufd_hw_pagetable *new_hwpt,
+ bool detach_group)
+{
+ struct iommufd_hw_pagetable *hwpt = idev->hwpt;
+ struct iommufd_ioas *new_ioas = NULL;
+
+ if (new_hwpt)
+ new_ioas = new_hwpt->ioas;
+
+ mutex_lock(&hwpt->devices_lock);
+ list_del(&idev->devices_item);
+ if (hwpt->ioas != new_ioas)
+ mutex_lock(&hwpt->ioas->mutex);
+ if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
+ if (list_empty(&hwpt->devices)) {
+ iopt_table_remove_domain(&hwpt->ioas->iopt,
+ hwpt->domain);
+ list_del(&hwpt->hwpt_item);
+ }
+ if (detach_group)
+ iommu_detach_group(hwpt->domain, idev->group);
+ }
+ if (hwpt->ioas != new_ioas) {
+ iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
+ mutex_unlock(&hwpt->ioas->mutex);
+ }
+ mutex_unlock(&hwpt->devices_lock);
+
+ if (hwpt->auto_domain)
+ iommufd_object_destroy_user(idev->ictx, &hwpt->obj);
+ else
+ refcount_dec(&hwpt->obj.users);
+
+ idev->hwpt = NULL;
+
+ refcount_dec(&idev->obj.users);
+}
+
static int iommufd_device_do_attach(struct iommufd_device *idev,
struct iommufd_hw_pagetable *hwpt)
{
+ struct iommufd_hw_pagetable *cur_hwpt = idev->hwpt;
phys_addr_t sw_msi_start = PHYS_ADDR_MAX;
int rc;

@@ -236,7 +290,7 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
* the group once for the first device that is in the group.
*/
if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
- rc = iommu_attach_group(hwpt->domain, idev->group);
+ rc = iommu_group_replace_domain(idev->group, hwpt->domain);
if (rc)
goto out_iova;

@@ -249,6 +303,10 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
}
}

+ /* Replace the cur_hwpt without iommu_detach_group() */
+ if (cur_hwpt)
+ __iommmufd_device_detach(idev, hwpt, false);
+
idev->hwpt = hwpt;
refcount_inc(&hwpt->obj.users);
list_add(&idev->devices_item, &hwpt->devices);
@@ -256,7 +314,10 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
return 0;

out_detach:
- iommu_detach_group(hwpt->domain, idev->group);
+ if (cur_hwpt)
+ iommu_group_replace_domain(idev->group, cur_hwpt->domain);
+ else
+ iommu_detach_group(hwpt->domain, idev->group);
out_iova:
iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
out_unlock:
@@ -345,6 +406,13 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id)
struct iommufd_hw_pagetable *hwpt =
container_of(pt_obj, struct iommufd_hw_pagetable, obj);

+ if (idev->hwpt == hwpt)
+ goto out_done;
+ if (idev->hwpt && idev->hwpt->auto_domain) {
+ rc = -EBUSY;
+ goto out_put_pt_obj;
+ }
+
mutex_lock(&hwpt->ioas->mutex);
rc = iommufd_device_do_attach(idev, hwpt);
mutex_unlock(&hwpt->ioas->mutex);
@@ -356,6 +424,8 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id)
struct iommufd_ioas *ioas =
container_of(pt_obj, struct iommufd_ioas, obj);

+ if (idev->hwpt)
+ return -EBUSY;
rc = iommufd_device_auto_get_domain(idev, ioas);
if (rc)
goto out_put_pt_obj;
@@ -367,6 +437,7 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id)
}

refcount_inc(&idev->obj.users);
+out_done:
*pt_id = idev->hwpt->obj.id;
rc = 0;

@@ -385,31 +456,7 @@ EXPORT_SYMBOL_NS_GPL(iommufd_device_attach, IOMMUFD);
*/
void iommufd_device_detach(struct iommufd_device *idev)
{
- struct iommufd_hw_pagetable *hwpt = idev->hwpt;
-
- mutex_lock(&hwpt->ioas->mutex);
- mutex_lock(&hwpt->devices_lock);
- list_del(&idev->devices_item);
- if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
- if (list_empty(&hwpt->devices)) {
- iopt_table_remove_domain(&hwpt->ioas->iopt,
- hwpt->domain);
- list_del(&hwpt->hwpt_item);
- }
- iommu_detach_group(hwpt->domain, idev->group);
- }
- iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
- mutex_unlock(&hwpt->devices_lock);
- mutex_unlock(&hwpt->ioas->mutex);
-
- if (hwpt->auto_domain)
- iommufd_object_destroy_user(idev->ictx, &hwpt->obj);
- else
- refcount_dec(&hwpt->obj.users);
-
- idev->hwpt = NULL;
-
- refcount_dec(&idev->obj.users);
+ __iommmufd_device_detach(idev, NULL, true);
}
EXPORT_SYMBOL_NS_GPL(iommufd_device_detach, IOMMUFD);

diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 593138bb37b8..200c783800ad 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -9,6 +9,8 @@
#include <linux/refcount.h>
#include <linux/uaccess.h>

+#include "../iommu-priv.h"
+
struct iommu_domain;
struct iommu_group;
struct iommu_option;
--
2.39.1


2023-02-07 21:21:09

by Nicolin Chen

[permalink] [raw]
Subject: [PATCH v2 06/10] iommufd/selftest: Add coverage for access->ioas replacement

Add replace coverage as a part of user_copy test case. It basically
repeats the copy test after replacing the old ioas with a new one.

Signed-off-by: Nicolin Chen <[email protected]>
---
tools/testing/selftests/iommu/iommufd.c | 29 +++++++++++++++++++++++--
1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index fa08209268c4..1e293950ac88 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -1239,7 +1239,13 @@ TEST_F(iommufd_mock_domain, user_copy)
.dst_iova = MOCK_APERTURE_START,
.length = BUFFER_SIZE,
};
- unsigned int ioas_id;
+ struct iommu_ioas_unmap unmap_cmd = {
+ .size = sizeof(unmap_cmd),
+ .ioas_id = self->ioas_id,
+ .iova = MOCK_APERTURE_START,
+ .length = BUFFER_SIZE,
+ };
+ unsigned int new_ioas_id, ioas_id;

/* Pin the pages in an IOAS with no domains then copy to an IOAS with domains */
test_ioctl_ioas_alloc(&ioas_id);
@@ -1257,11 +1263,30 @@ TEST_F(iommufd_mock_domain, user_copy)
ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd));
check_mock_iova(buffer, MOCK_APERTURE_START, BUFFER_SIZE);

+ /* Now replace the ioas with a new one */
+ test_ioctl_ioas_alloc(&new_ioas_id);
+ test_ioctl_ioas_map_id(new_ioas_id, buffer, BUFFER_SIZE,
+ &copy_cmd.src_iova);
+ test_cmd_access_set_ioas(access_cmd.id, new_ioas_id);
+
+ /* Destroy the old ioas and cleanup copied mapping */
+ ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+ test_ioctl_destroy(ioas_id);
+
+ /* Then run the same test again with the new ioas */
+ access_cmd.access_pages.iova = copy_cmd.src_iova;
+ ASSERT_EQ(0,
+ ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+ &access_cmd));
+ copy_cmd.src_ioas_id = new_ioas_id;
+ ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd));
+ check_mock_iova(buffer, MOCK_APERTURE_START, BUFFER_SIZE);
+
test_cmd_destroy_access_pages(
access_cmd.id, access_cmd.access_pages.out_access_pages_id);
test_cmd_destroy_access(access_cmd.id);

- test_ioctl_destroy(ioas_id);
+ test_ioctl_destroy(new_ioas_id);
}

/* VFIO compatibility IOCTLs */
--
2.39.1


2023-02-07 21:21:13

by Nicolin Chen

[permalink] [raw]
Subject: [PATCH v2 09/10] vfio: Support IO page table replacement

Remove the vdev->iommufd_attached check, since the kernel can internally
handle a replacement of the IO page table now.

Also update the VFIO_DEVICE_ATTACH_IOMMUFD_PT kdoc in the uAPI header.

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/vfio/iommufd.c | 3 ---
include/uapi/linux/vfio.h | 6 ++++++
2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index dc9feab73db7..8b719d9424b8 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -97,9 +97,6 @@ int vfio_iommufd_physical_attach_ioas(struct vfio_device *vdev, u32 *pt_id)
if (!vdev->iommufd_device)
return -EINVAL;

- if (vdev->iommufd_attached)
- return -EBUSY;
-
rc = iommufd_device_attach(vdev->iommufd_device, pt_id);
if (rc)
return rc;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index c86cfe442884..69f3ceb18d7d 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -236,6 +236,12 @@ struct vfio_device_bind_iommufd {
*
* Undo by VFIO_DEVICE_DETACH_IOMMUFD_PT or device fd close.
*
+ * If a vfio device is currently attached to a valid hw_pagetable, without doing
+ * a VFIO_DEVICE_DETACH_IOMMUFD_PT, a second VFIO_DEVICE_ATTACH_IOMMUFD_PT ioctl
+ * passing in another hw_pagetable (hwpt) id is allowed. This action, also known
+ * as a hw_pagetable replacement, will replace the device's currently attached
+ * hw_pagetable with a new hw_pagetable corresponding to the given pt_id.
+ *
* @argsz: user filled size of this data.
* @flags: must be 0.
* @pt_id: Input the target id which can represent an ioas or a hwpt
--
2.39.1


2023-02-07 21:21:15

by Nicolin Chen

[permalink] [raw]
Subject: [PATCH v2 10/10] vfio: Do not allow !ops->dma_unmap in vfio_pin/unpin_pages()

A driver that doesn't implement ops->dma_unmap shouldn't be allowed to do
vfio_pin/unpin_pages(), though it can use vfio_dma_rw() to access an iova
range. Deny !ops->dma_unmap cases in vfio_pin/unpin_pages().

Suggested-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/vfio/vfio_main.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 8559c3dfb335..c7f3251ad6e5 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1543,6 +1543,8 @@ int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,

if (iova > ULONG_MAX)
return -EINVAL;
+ if (!device->ops->dma_unmap)
+ return -EINVAL;
/*
* VFIO ignores the sub page offset, npages is from the start of
* a PAGE_SIZE chunk of IOVA. The caller is expected to recover
@@ -1580,6 +1582,8 @@ void vfio_unpin_pages(struct vfio_device *device, dma_addr_t iova, int npage)
if (device->iommufd_access) {
if (WARN_ON(iova > ULONG_MAX))
return;
+ if (!device->ops->dma_unmap)
+ return;
iommufd_access_unpin_pages(device->iommufd_access,
ALIGN_DOWN(iova, PAGE_SIZE),
npage * PAGE_SIZE);
--
2.39.1


2023-02-08 08:08:53

by Yi Liu

[permalink] [raw]
Subject: RE: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

Hi Nic,

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, February 8, 2023 5:18 AM
>
> iommu_group_replace_domain() is introduced to support use cases where
> an
> iommu_group can be attached to a new domain without getting detached
> from
> the old one. This replacement feature will be useful, for cases such as:
> 1) vPASID mode, when a guest wants to replace a single pasid (PASID=0)
> table with a larger table (PASID=N)
> 2) Nesting mode, when switching the attaching device from an S2 domain
> to an S1 domain, or when switching between relevant S1 domains.
> as it allows these cases to switch seamlessly without a DMA disruption.
>
> So, call iommu_group_replace_domain() in the
> iommufd_device_do_attach().
> And add a __iommmufd_device_detach helper to allow the replace routine
> to
> do a partial detach on the current hwpt that's being replaced. Though the
> updated locking logic is overcomplicated, it will be eased, once those
> iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's
> allocation/destroy() functions in the coming nesting series, as that'll
> depend on a new ->domain_alloc_user op in the iommu core.
>
> Also, block replace operations that are from/to auto_domains, i.e. only
> user-allocated hw_pagetables can be replaced or replaced with.
>
> Signed-off-by: Nicolin Chen <[email protected]>
> ---
> drivers/iommu/iommufd/device.c | 101 +++++++++++++++++-------
> drivers/iommu/iommufd/iommufd_private.h | 2 +
> 2 files changed, 76 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/iommu/iommufd/device.c
> b/drivers/iommu/iommufd/device.c
> index b8c3e3baccb5..8a9834fc129a 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -9,6 +9,8 @@
> #include "io_pagetable.h"
> #include "iommufd_private.h"
>
> +MODULE_IMPORT_NS(IOMMUFD_INTERNAL);
> +
> static bool allow_unsafe_interrupts;
> module_param(allow_unsafe_interrupts, bool, S_IRUGO | S_IWUSR);
> MODULE_PARM_DESC(
> @@ -194,9 +196,61 @@ static bool
> iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt,
> return false;
> }
>
> +/**
> + * __iommmufd_device_detach - Detach a device from idev->hwpt to
> new_hwpt

This function doesn't do anything to make this device attached to new_hwpt.
It is done in the iommufd_device_attach_ioas(). New_hwpt here indicates if
this detach requires to do some extra thing. E.g. remove reserved iova from
the idev->hwpt->ioas. So may just say " Detach a device from idev->hwpt",
and explain the usage of new_hwpt in the below.

> + * @idev: device to detach
> + * @new_hwpt: new hw_pagetable to attach (pass in NULL for a simple
> detach)

The new hw_pagetable to be attached.

> + * @detach_group: flag to call iommu_detach_group
> + *
> + * This is a cleanup helper shared by the replace and detach routines.
> Comparing
> + * to a detach routine, a replace routine only needs a partial detach
> procedure:
> + * it does not need the iommu_detach_group(); it will attach the device to
> a new
> + * hw_pagetable after a partial detach from the currently attached
> hw_pagetable,
> + * so certain steps can be skipped if two hw_pagetables have the same
> IOAS.
> + */
> +static void __iommmufd_device_detach(struct iommufd_device *idev,
> + struct iommufd_hw_pagetable
> *new_hwpt,
> + bool detach_group)
> +{
> + struct iommufd_hw_pagetable *hwpt = idev->hwpt;
> + struct iommufd_ioas *new_ioas = NULL;
> +
> + if (new_hwpt)
> + new_ioas = new_hwpt->ioas;
> +
> + mutex_lock(&hwpt->devices_lock);
> + list_del(&idev->devices_item);
> + if (hwpt->ioas != new_ioas)
> + mutex_lock(&hwpt->ioas->mutex);

The lock order is mostly hwpt->ioas->mutex and then hwpt->devices_lock.
See the iommufd_device_auto_get_domain(). If possible, may switch the
order sequence here. Also, rename hwpt to be cur_hwpt, this may help
reviewers to distinguish it from the hwpt in the caller of this function. It
looks to be a deadlock at first look, but not after closer reading.

> + if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
> + if (list_empty(&hwpt->devices)) {
> + iopt_table_remove_domain(&hwpt->ioas->iopt,
> + hwpt->domain);
> + list_del(&hwpt->hwpt_item);
> + }
> + if (detach_group)
> + iommu_detach_group(hwpt->domain, idev->group);
> + }
> + if (hwpt->ioas != new_ioas) {
> + iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev-
> >dev);
> + mutex_unlock(&hwpt->ioas->mutex);
> + }
> + mutex_unlock(&hwpt->devices_lock);
> +
> + if (hwpt->auto_domain)
> + iommufd_object_destroy_user(idev->ictx, &hwpt->obj);
> + else
> + refcount_dec(&hwpt->obj.users);
> +
> + idev->hwpt = NULL;
> +
> + refcount_dec(&idev->obj.users);
> +}
> +
> static int iommufd_device_do_attach(struct iommufd_device *idev,
> struct iommufd_hw_pagetable *hwpt)
> {
> + struct iommufd_hw_pagetable *cur_hwpt = idev->hwpt;
> phys_addr_t sw_msi_start = PHYS_ADDR_MAX;
> int rc;
>
> @@ -236,7 +290,7 @@ static int iommufd_device_do_attach(struct
> iommufd_device *idev,
> * the group once for the first device that is in the group.
> */
> if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
> - rc = iommu_attach_group(hwpt->domain, idev->group);
> + rc = iommu_group_replace_domain(idev->group, hwpt-
> >domain);
> if (rc)
> goto out_iova;
>
> @@ -249,6 +303,10 @@ static int iommufd_device_do_attach(struct
> iommufd_device *idev,
> }
> }
>
> + /* Replace the cur_hwpt without iommu_detach_group() */
> + if (cur_hwpt)
> + __iommmufd_device_detach(idev, hwpt, false);
> +
> idev->hwpt = hwpt;
> refcount_inc(&hwpt->obj.users);
> list_add(&idev->devices_item, &hwpt->devices);
> @@ -256,7 +314,10 @@ static int iommufd_device_do_attach(struct
> iommufd_device *idev,
> return 0;
>
> out_detach:
> - iommu_detach_group(hwpt->domain, idev->group);
> + if (cur_hwpt)
> + iommu_group_replace_domain(idev->group, cur_hwpt-
> >domain);
> + else
> + iommu_detach_group(hwpt->domain, idev->group);
> out_iova:
> iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
> out_unlock:
> @@ -345,6 +406,13 @@ int iommufd_device_attach(struct iommufd_device
> *idev, u32 *pt_id)
> struct iommufd_hw_pagetable *hwpt =
> container_of(pt_obj, struct
> iommufd_hw_pagetable, obj);
>
> + if (idev->hwpt == hwpt)
> + goto out_done;
> + if (idev->hwpt && idev->hwpt->auto_domain) {
> + rc = -EBUSY;

This means if device was attached to an auto_created hwpt, then we
cannot replace it with a user allocated hwpt? If yes, this means the
replace is not available until user hwpt support, which is part of nesting.

> + goto out_put_pt_obj;
> + }
> +
> mutex_lock(&hwpt->ioas->mutex);
> rc = iommufd_device_do_attach(idev, hwpt);
> mutex_unlock(&hwpt->ioas->mutex);
> @@ -356,6 +424,8 @@ int iommufd_device_attach(struct iommufd_device
> *idev, u32 *pt_id)
> struct iommufd_ioas *ioas =
> container_of(pt_obj, struct iommufd_ioas, obj);
>
> + if (idev->hwpt)
> + return -EBUSY;

So we don't allow ioas replacement for physical devices. Is it?
Looks like emulated devices allows it.

> rc = iommufd_device_auto_get_domain(idev, ioas);
> if (rc)
> goto out_put_pt_obj;
> @@ -367,6 +437,7 @@ int iommufd_device_attach(struct iommufd_device
> *idev, u32 *pt_id)
> }
>
> refcount_inc(&idev->obj.users);
> +out_done:
> *pt_id = idev->hwpt->obj.id;
> rc = 0;
>
> @@ -385,31 +456,7 @@
> EXPORT_SYMBOL_NS_GPL(iommufd_device_attach, IOMMUFD);
> */
> void iommufd_device_detach(struct iommufd_device *idev)
> {
> - struct iommufd_hw_pagetable *hwpt = idev->hwpt;
> -
> - mutex_lock(&hwpt->ioas->mutex);
> - mutex_lock(&hwpt->devices_lock);
> - list_del(&idev->devices_item);
> - if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
> - if (list_empty(&hwpt->devices)) {
> - iopt_table_remove_domain(&hwpt->ioas->iopt,
> - hwpt->domain);
> - list_del(&hwpt->hwpt_item);
> - }
> - iommu_detach_group(hwpt->domain, idev->group);
> - }
> - iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
> - mutex_unlock(&hwpt->devices_lock);
> - mutex_unlock(&hwpt->ioas->mutex);
> -
> - if (hwpt->auto_domain)
> - iommufd_object_destroy_user(idev->ictx, &hwpt->obj);
> - else
> - refcount_dec(&hwpt->obj.users);
> -
> - idev->hwpt = NULL;
> -
> - refcount_dec(&idev->obj.users);
> + __iommmufd_device_detach(idev, NULL, true);
> }
> EXPORT_SYMBOL_NS_GPL(iommufd_device_detach, IOMMUFD);
>
> diff --git a/drivers/iommu/iommufd/iommufd_private.h
> b/drivers/iommu/iommufd/iommufd_private.h
> index 593138bb37b8..200c783800ad 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -9,6 +9,8 @@
> #include <linux/refcount.h>
> #include <linux/uaccess.h>
>
> +#include "../iommu-priv.h"
> +
> struct iommu_domain;
> struct iommu_group;
> struct iommu_option;
> --
> 2.39.1

Regards,
Yi Liu

2023-02-08 08:13:10

by Yi Liu

[permalink] [raw]
Subject: RE: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, February 8, 2023 5:18 AM
>
> +/**
> + * __iommmufd_device_detach - Detach a device from idev->hwpt to

Nit: s/_iommmufd/__iommufd

Regards,
Yi Liu


2023-02-09 02:53:53

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 01/10] iommu: Move dev_iommu_ops() to private header

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, February 8, 2023 5:18 AM
>
> From: Yi Liu <[email protected]>
>
> dev_iommu_ops() is essentially only used in iommu subsystem, so
> move to a private header to avoid being abused by other drivers.
>
> Suggested-by: Jason Gunthorpe <[email protected]>
> Signed-off-by: Yi Liu <[email protected]>
> Signed-off-by: Nicolin Chen <[email protected]>

Reviewed-by: Kevin Tian <[email protected]>

2023-02-09 02:53:59

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 00/10] Add IO page table replacement support

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, February 8, 2023 5:18 AM
>
> QEMU with this feature should have the vIOMMU maintain a cache of the
> guest io page table addresses and assign a unique IOAS to each unique
> guest page table.
>

I still didn't see a clear reply why above is required. Please elaborate.

2023-02-09 02:59:28

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 02/10] iommu: Introduce a new iommu_group_replace_domain() API

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, February 8, 2023 5:18 AM
>
> +int iommu_group_replace_domain(struct iommu_group *group,
> + struct iommu_domain *new_domain)
> +{
> + int ret;
> +
> + if (!new_domain)
> + return -EINVAL;

Is there value of allowing NULL new domain so this plays like
iommu_detach_group() then iommufd only needs call one
function in both attach/detach path?

> +
> + mutex_lock(&group->mutex);
> + ret = __iommu_group_set_domain(group, new_domain);
> + if (ret)
> + __iommu_group_set_domain(group, group->domain);
> + mutex_unlock(&group->mutex);
> + return ret;
> +}
> +EXPORT_SYMBOL_NS_GPL(iommu_group_replace_domain,
> IOMMUFD_INTERNAL);
> +
> static int iommu_group_do_set_platform_dma(struct device *dev, void
> *data)
> {
> const struct iommu_ops *ops = dev_iommu_ops(dev);
> --
> 2.39.1
>


2023-02-09 03:03:05

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 03/10] iommufd: Create access in vfio_iommufd_emulated_bind()

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, February 8, 2023 5:18 AM
>
> @@ -141,10 +141,19 @@ static const struct iommufd_access_ops
> vfio_user_ops = {
> int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
> struct iommufd_ctx *ictx, u32 *out_device_id)
> {
> + struct iommufd_access *user;
> +
> lockdep_assert_held(&vdev->dev_set->lock);
>
> - vdev->iommufd_ictx = ictx;
> iommufd_ctx_get(ictx);
> + user = iommufd_access_create(vdev->iommufd_ictx, &vfio_user_ops,
> vdev);
> + if (IS_ERR(user)) {
> + iommufd_ctx_put(vdev->iommufd_ictx);
> + return PTR_ERR(user);
> + }
> + iommufd_access_set_ioas(user, 0);

this is not required since ioas has been NULL after creation.

otherwise,

Reviewed-by: Kevin Tian <[email protected]>

2023-02-09 03:03:24

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 04/10] iommufd/selftest: Add IOMMU_TEST_OP_ACCESS_SET_IOAS coverage

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, February 8, 2023 5:18 AM
>
> Add a new IOMMU_TEST_OP_ACCESS_SET_IOAS to allow setting access->ioas
> individually, corresponding to the iommufd_access_set_ioas() helper.
>
> Signed-off-by: Nicolin Chen <[email protected]>

Reviewed-by: Kevin Tian <[email protected]>

2023-02-09 03:13:36

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 05/10] iommufd: Add replace support in iommufd_access_set_ioas()

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, February 8, 2023 5:18 AM
>
> Support an access->ioas replacement in iommufd_access_set_ioas(), which
> sets the access->ioas to NULL provisionally so that any further incoming
> iommufd_access_pin_pages() callback can be blocked.
>
> Then, call access->ops->unmap() to clean up the entire iopt. To allow an
> iommufd_access_unpin_pages() callback to happen via this unmap() call,
> add an ioas_unpin pointer so the unpin routine won't be affected by the
> "access->ioas = NULL" trick above.
>
> Suggested-by: Jason Gunthorpe <[email protected]>
> Signed-off-by: Nicolin Chen <[email protected]>

move patch10 before this one.

> ---
> drivers/iommu/iommufd/device.c | 16 ++++++++++++++--
> drivers/iommu/iommufd/iommufd_private.h | 1 +
> 2 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/iommufd/device.c
> b/drivers/iommu/iommufd/device.c
> index f4bd6f532a90..10ce47484ffa 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -509,11 +509,23 @@ int iommufd_access_set_ioas(struct
> iommufd_access *access, u32 ioas_id)
> iommufd_ref_to_users(obj);
> }
>
> + /*
> + * Set ioas to NULL to block any further iommufd_access_pin_pages().
> + * iommufd_access_unpin_pages() can continue using access-
> >ioas_unpin.
> + */
> + access->ioas = NULL;
> +
> if (cur_ioas) {
> + if (new_ioas) {
> + mutex_unlock(&access->ioas_lock);
> + access->ops->unmap(access->data, 0, ULONG_MAX);
> + mutex_lock(&access->ioas_lock);
> + }

why does above only apply to a valid new_ioas? this is the cleanup on
cur_ioas then required even when new_ioas=NULL.

Instead here the check should be "if (access->ops->unmap)". with
patch10 the cleanup is required only for driver which is allowed to
pin pages.

2023-02-09 03:24:02

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 07/10] iommufd/device: Make hwpt_list list_add/del symmetric

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, February 8, 2023 5:18 AM
>
> Because list_del() is together with iopt_table_remove_domain(), it makes
> sense to have list_add_tail() together with iopt_table_add_domain().
>
> Also place the mutex outside the iommufd_device_do_attach() call, similar
> to what's in the iommufd_device_auto_get_domain() function.
>
> Co-developed-by: Yi Liu <[email protected]>
> Signed-off-by: Yi Liu <[email protected]>
> Reviewed-by: Kevin Tian <[email protected]>
> Signed-off-by: Nicolin Chen <[email protected]>

shouldn't this be a separate bug fix and backported? double adding a
list item would certainly clobber the list...


2023-02-09 04:03:12

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, February 8, 2023 5:18 AM
>
> iommu_group_replace_domain() is introduced to support use cases where
> an
> iommu_group can be attached to a new domain without getting detached
> from
> the old one. This replacement feature will be useful, for cases such as:
> 1) vPASID mode, when a guest wants to replace a single pasid (PASID=0)
> table with a larger table (PASID=N)
> 2) Nesting mode, when switching the attaching device from an S2 domain
> to an S1 domain, or when switching between relevant S1 domains.
> as it allows these cases to switch seamlessly without a DMA disruption.
>
> So, call iommu_group_replace_domain() in the iommufd_device_do_attach().
> And add a __iommmufd_device_detach helper to allow the replace routine
> to
> do a partial detach on the current hwpt that's being replaced. Though the
> updated locking logic is overcomplicated, it will be eased, once those
> iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's
> allocation/destroy() functions in the coming nesting series, as that'll
> depend on a new ->domain_alloc_user op in the iommu core.

then why not moving those changes into this series to make it simple?

>
> Also, block replace operations that are from/to auto_domains, i.e. only
> user-allocated hw_pagetables can be replaced or replaced with.

where does this restriction come from? iommu_group_replace_domain()
can switch between any two UNMANAGED domains. What is the extra
problem in iommufd to support from/to auto domains?

>
> Signed-off-by: Nicolin Chen <[email protected]>
> ---
> drivers/iommu/iommufd/device.c | 101 +++++++++++++++++-------
> drivers/iommu/iommufd/iommufd_private.h | 2 +
> 2 files changed, 76 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/iommu/iommufd/device.c
> b/drivers/iommu/iommufd/device.c
> index b8c3e3baccb5..8a9834fc129a 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -9,6 +9,8 @@
> #include "io_pagetable.h"
> #include "iommufd_private.h"
>
> +MODULE_IMPORT_NS(IOMMUFD_INTERNAL);
> +
> static bool allow_unsafe_interrupts;
> module_param(allow_unsafe_interrupts, bool, S_IRUGO | S_IWUSR);
> MODULE_PARM_DESC(
> @@ -194,9 +196,61 @@ static bool
> iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt,
> return false;
> }
>
> +/**
> + * __iommmufd_device_detach - Detach a device from idev->hwpt to
> new_hwpt

'from ... to ...' means a replace semantics. then this should be called
iommufd_device_replace_hwpt().

> + * @idev: device to detach
> + * @new_hwpt: new hw_pagetable to attach (pass in NULL for a simple
> detach)
> + * @detach_group: flag to call iommu_detach_group
> + *
> + * This is a cleanup helper shared by the replace and detach routines.
> Comparing
> + * to a detach routine, a replace routine only needs a partial detach
> procedure:
> + * it does not need the iommu_detach_group(); it will attach the device to a
> new
> + * hw_pagetable after a partial detach from the currently attached
> hw_pagetable,
> + * so certain steps can be skipped if two hw_pagetables have the same IOAS.
> + */
> +static void __iommmufd_device_detach(struct iommufd_device *idev,
> + struct iommufd_hw_pagetable
> *new_hwpt,
> + bool detach_group)
> +{
> + struct iommufd_hw_pagetable *hwpt = idev->hwpt;
> + struct iommufd_ioas *new_ioas = NULL;
> +
> + if (new_hwpt)
> + new_ioas = new_hwpt->ioas;
> +
> + mutex_lock(&hwpt->devices_lock);
> + list_del(&idev->devices_item);
> + if (hwpt->ioas != new_ioas)
> + mutex_lock(&hwpt->ioas->mutex);

I think new_ioas->mutext was meant here.

> + if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
> + if (list_empty(&hwpt->devices)) {
> + iopt_table_remove_domain(&hwpt->ioas->iopt,
> + hwpt->domain);
> + list_del(&hwpt->hwpt_item);
> + }

I'm not sure how this can be fully shared between detach and replace.
Here some work e.g. above needs to be done before calling
iommu_group_replace_domain() while others can be done afterwards.

2023-02-09 04:06:44

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 09/10] vfio: Support IO page table replacement

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, February 8, 2023 5:18 AM
>
> Remove the vdev->iommufd_attached check, since the kernel can internally
> handle a replacement of the IO page table now.
>
> Also update the VFIO_DEVICE_ATTACH_IOMMUFD_PT kdoc in the uAPI
> header.
>
> Signed-off-by: Nicolin Chen <[email protected]>

Reviewed-by: Kevin Tian <[email protected]>

2023-02-09 04:20:41

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 10/10] vfio: Do not allow !ops->dma_unmap in vfio_pin/unpin_pages()

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, February 8, 2023 5:18 AM
>
> A driver that doesn't implement ops->dma_unmap shouldn't be allowed to
> do
> vfio_pin/unpin_pages(), though it can use vfio_dma_rw() to access an iova
> range. Deny !ops->dma_unmap cases in vfio_pin/unpin_pages().
>
> Suggested-by: Kevin Tian <[email protected]>
> Signed-off-by: Nicolin Chen <[email protected]>
> ---
> drivers/vfio/vfio_main.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 8559c3dfb335..c7f3251ad6e5 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -1543,6 +1543,8 @@ int vfio_pin_pages(struct vfio_device *device,
> dma_addr_t iova,
>
> if (iova > ULONG_MAX)
> return -EINVAL;
> + if (!device->ops->dma_unmap)
> + return -EINVAL;
> /*
> * VFIO ignores the sub page offset, npages is from the start
> of
> * a PAGE_SIZE chunk of IOVA. The caller is expected to
> recover
> @@ -1580,6 +1582,8 @@ void vfio_unpin_pages(struct vfio_device *device,
> dma_addr_t iova, int npage)
> if (device->iommufd_access) {
> if (WARN_ON(iova > ULONG_MAX))
> return;
> + if (!device->ops->dma_unmap)
> + return;

IMHO this restriction applies to both iommufd and legacy container.

2023-02-09 13:23:14

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 02/10] iommu: Introduce a new iommu_group_replace_domain() API

On Thu, Feb 09, 2023 at 02:55:24AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <[email protected]>
> > Sent: Wednesday, February 8, 2023 5:18 AM
> >
> > +int iommu_group_replace_domain(struct iommu_group *group,
> > + struct iommu_domain *new_domain)
> > +{
> > + int ret;
> > +
> > + if (!new_domain)
> > + return -EINVAL;
>
> Is there value of allowing NULL new domain so this plays like
> iommu_detach_group() then iommufd only needs call one
> function in both attach/detach path?

We've used NULL to mean the 'platform domain' in the iommu core code
in a few places, I'd prefer to avoid overloading NULL.

IMHO it doesn't help iommufd to substitute detach with replace.

Jason

2023-02-09 13:24:23

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] iommufd/device: Make hwpt_list list_add/del symmetric

On Thu, Feb 09, 2023 at 03:23:47AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <[email protected]>
> > Sent: Wednesday, February 8, 2023 5:18 AM
> >
> > Because list_del() is together with iopt_table_remove_domain(), it makes
> > sense to have list_add_tail() together with iopt_table_add_domain().
> >
> > Also place the mutex outside the iommufd_device_do_attach() call, similar
> > to what's in the iommufd_device_auto_get_domain() function.
> >
> > Co-developed-by: Yi Liu <[email protected]>
> > Signed-off-by: Yi Liu <[email protected]>
> > Reviewed-by: Kevin Tian <[email protected]>
> > Signed-off-by: Nicolin Chen <[email protected]>
>
> shouldn't this be a separate bug fix and backported? double adding a
> list item would certainly clobber the list...

AFAIK there is no bug, this is just reorganizing things

Jason

2023-02-09 13:26:26

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 10/10] vfio: Do not allow !ops->dma_unmap in vfio_pin/unpin_pages()

On Thu, Feb 09, 2023 at 04:10:04AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <[email protected]>
> > Sent: Wednesday, February 8, 2023 5:18 AM
> >
> > A driver that doesn't implement ops->dma_unmap shouldn't be allowed to
> > do
> > vfio_pin/unpin_pages(), though it can use vfio_dma_rw() to access an iova
> > range. Deny !ops->dma_unmap cases in vfio_pin/unpin_pages().
> >
> > Suggested-by: Kevin Tian <[email protected]>
> > Signed-off-by: Nicolin Chen <[email protected]>
> > ---
> > drivers/vfio/vfio_main.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> > index 8559c3dfb335..c7f3251ad6e5 100644
> > --- a/drivers/vfio/vfio_main.c
> > +++ b/drivers/vfio/vfio_main.c
> > @@ -1543,6 +1543,8 @@ int vfio_pin_pages(struct vfio_device *device,
> > dma_addr_t iova,
> >
> > if (iova > ULONG_MAX)
> > return -EINVAL;
> > + if (!device->ops->dma_unmap)
> > + return -EINVAL;
> > /*
> > * VFIO ignores the sub page offset, npages is from the start
> > of
> > * a PAGE_SIZE chunk of IOVA. The caller is expected to
> > recover
> > @@ -1580,6 +1582,8 @@ void vfio_unpin_pages(struct vfio_device *device,
> > dma_addr_t iova, int npage)
> > if (device->iommufd_access) {
> > if (WARN_ON(iova > ULONG_MAX))
> > return;
> > + if (!device->ops->dma_unmap)
> > + return;
>
> IMHO this restriction applies to both iommufd and legacy container.

Yeah that makes sense

Jason

2023-02-09 16:13:31

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] Add IO page table replacement support

On Thu, Feb 09, 2023 at 02:50:39AM +0000, Tian, Kevin wrote:
> External email: Use caution opening links or attachments
>
>
> > From: Nicolin Chen <[email protected]>
> > Sent: Wednesday, February 8, 2023 5:18 AM
> >
> > QEMU with this feature should have the vIOMMU maintain a cache of the
> > guest io page table addresses and assign a unique IOAS to each unique
> > guest page table.
> >
>
> I still didn't see a clear reply why above is required. Please elaborate.

Hmm..this piece was directly copied from Jason's inputs before I
sent v1. And I thought he explained in the previous version. I
can drop it in v3 if that makes you happy.

Thanks
Nic

2023-02-09 16:16:00

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v2 03/10] iommufd: Create access in vfio_iommufd_emulated_bind()

On Thu, Feb 09, 2023 at 02:56:39AM +0000, Tian, Kevin wrote:
> External email: Use caution opening links or attachments
>
>
> > From: Nicolin Chen <[email protected]>
> > Sent: Wednesday, February 8, 2023 5:18 AM
> >
> > @@ -141,10 +141,19 @@ static const struct iommufd_access_ops
> > vfio_user_ops = {
> > int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
> > struct iommufd_ctx *ictx, u32 *out_device_id)
> > {
> > + struct iommufd_access *user;
> > +
> > lockdep_assert_held(&vdev->dev_set->lock);
> >
> > - vdev->iommufd_ictx = ictx;
> > iommufd_ctx_get(ictx);
> > + user = iommufd_access_create(vdev->iommufd_ictx, &vfio_user_ops,
> > vdev);
> > + if (IS_ERR(user)) {
> > + iommufd_ctx_put(vdev->iommufd_ictx);
> > + return PTR_ERR(user);
> > + }
> > + iommufd_access_set_ioas(user, 0);
>
> this is not required since ioas has been NULL after creation.

Will drop it.

> otherwise,
>
> Reviewed-by: Kevin Tian <[email protected]>

And add this too.

Thanks!
Nic

2023-02-09 16:19:59

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v2 10/10] vfio: Do not allow !ops->dma_unmap in vfio_pin/unpin_pages()

On Thu, Feb 09, 2023 at 09:26:15AM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 09, 2023 at 04:10:04AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <[email protected]>
> > > Sent: Wednesday, February 8, 2023 5:18 AM
> > >
> > > A driver that doesn't implement ops->dma_unmap shouldn't be allowed to
> > > do
> > > vfio_pin/unpin_pages(), though it can use vfio_dma_rw() to access an iova
> > > range. Deny !ops->dma_unmap cases in vfio_pin/unpin_pages().
> > >
> > > Suggested-by: Kevin Tian <[email protected]>
> > > Signed-off-by: Nicolin Chen <[email protected]>
> > > ---
> > > drivers/vfio/vfio_main.c | 4 ++++
> > > 1 file changed, 4 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> > > index 8559c3dfb335..c7f3251ad6e5 100644
> > > --- a/drivers/vfio/vfio_main.c
> > > +++ b/drivers/vfio/vfio_main.c
> > > @@ -1543,6 +1543,8 @@ int vfio_pin_pages(struct vfio_device *device,
> > > dma_addr_t iova,
> > >
> > > if (iova > ULONG_MAX)
> > > return -EINVAL;
> > > + if (!device->ops->dma_unmap)
> > > + return -EINVAL;
> > > /*
> > > * VFIO ignores the sub page offset, npages is from the start
> > > of
> > > * a PAGE_SIZE chunk of IOVA. The caller is expected to
> > > recover
> > > @@ -1580,6 +1582,8 @@ void vfio_unpin_pages(struct vfio_device *device,
> > > dma_addr_t iova, int npage)
> > > if (device->iommufd_access) {
> > > if (WARN_ON(iova > ULONG_MAX))
> > > return;
> > > + if (!device->ops->dma_unmap)
> > > + return;
> >
> > IMHO this restriction applies to both iommufd and legacy container.
>
> Yeah that makes sense

Will move them to the beginning of vfio_pin_pages/vfio_unpin_pages.

Thanks
Nic

2023-02-09 18:59:16

by Eric Farman

[permalink] [raw]
Subject: Re: [PATCH v2 03/10] iommufd: Create access in vfio_iommufd_emulated_bind()

On Tue, 2023-02-07 at 13:17 -0800, Nicolin Chen wrote:
...snip...
> diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
> index 026f81a87dd7..dc9feab73db7 100644
> --- a/drivers/vfio/iommufd.c
> +++ b/drivers/vfio/iommufd.c
> @@ -141,10 +141,19 @@ static const struct iommufd_access_ops
> vfio_user_ops = {
>  int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
>                                struct iommufd_ctx *ictx, u32
> *out_device_id)
>  {
> +       struct iommufd_access *user;
> +
>         lockdep_assert_held(&vdev->dev_set->lock);
>  
> -       vdev->iommufd_ictx = ictx;
>         iommufd_ctx_get(ictx);
> +       user = iommufd_access_create(vdev->iommufd_ictx,
> &vfio_user_ops, vdev);
> +       if (IS_ERR(user)) {
> +               iommufd_ctx_put(vdev->iommufd_ictx);

Matthew noticed a vfio-ccw and -ap regression that blames this patch.

Probably both the iommufd_access_create() and iommufd_ctx_put() calls
want the ictx variable itself, instead of the (uninitialized) pointer
in the vfio_device. (At least that gets -ccw and -ap working again.)

Thanks,
Eric

> +               return PTR_ERR(user);
> +       }
> +       iommufd_access_set_ioas(user, 0);
> +       vdev->iommufd_access = user;
> +       vdev->iommufd_ictx = ictx;
>         return 0;
>  }
>  EXPORT_SYMBOL_GPL(vfio_iommufd_emulated_bind);

2023-02-09 19:54:25

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v2 03/10] iommufd: Create access in vfio_iommufd_emulated_bind()

On Thu, Feb 09, 2023 at 01:58:47PM -0500, Eric Farman wrote:
> External email: Use caution opening links or attachments
>
>
> On Tue, 2023-02-07 at 13:17 -0800, Nicolin Chen wrote:
> ...snip...
> > diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
> > index 026f81a87dd7..dc9feab73db7 100644
> > --- a/drivers/vfio/iommufd.c
> > +++ b/drivers/vfio/iommufd.c
> > @@ -141,10 +141,19 @@ static const struct iommufd_access_ops
> > vfio_user_ops = {
> > int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
> > struct iommufd_ctx *ictx, u32
> > *out_device_id)
> > {
> > + struct iommufd_access *user;
> > +
> > lockdep_assert_held(&vdev->dev_set->lock);
> >
> > - vdev->iommufd_ictx = ictx;
> > iommufd_ctx_get(ictx);
> > + user = iommufd_access_create(vdev->iommufd_ictx,
> > &vfio_user_ops, vdev);
> > + if (IS_ERR(user)) {
> > + iommufd_ctx_put(vdev->iommufd_ictx);
>
> Matthew noticed a vfio-ccw and -ap regression that blames this patch.
>
> Probably both the iommufd_access_create() and iommufd_ctx_put() calls
> want the ictx variable itself, instead of the (uninitialized) pointer
> in the vfio_device. (At least that gets -ccw and -ap working again.)

Oops. Yes, it should be:

iommufd_ctx_get(ictx);
user = iommufd_access_create(ictx, &vfio_user_ops, vdev);
if (IS_ERR(user)) {
iommufd_ctx_put(ictx);

Will fix in v3.

Thanks!
Nic

2023-02-09 20:29:09

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v2 05/10] iommufd: Add replace support in iommufd_access_set_ioas()

On Thu, Feb 09, 2023 at 03:13:08AM +0000, Tian, Kevin wrote:

> > --- a/drivers/iommu/iommufd/device.c
> > +++ b/drivers/iommu/iommufd/device.c
> > @@ -509,11 +509,23 @@ int iommufd_access_set_ioas(struct
> > iommufd_access *access, u32 ioas_id)
> > iommufd_ref_to_users(obj);
> > }
> >
> > + /*
> > + * Set ioas to NULL to block any further iommufd_access_pin_pages().
> > + * iommufd_access_unpin_pages() can continue using access-
> > >ioas_unpin.
> > + */
> > + access->ioas = NULL;
> > +
> > if (cur_ioas) {
> > + if (new_ioas) {
> > + mutex_unlock(&access->ioas_lock);
> > + access->ops->unmap(access->data, 0, ULONG_MAX);
> > + mutex_lock(&access->ioas_lock);
> > + }
>
> why does above only apply to a valid new_ioas? this is the cleanup on
> cur_ioas then required even when new_ioas=NULL.

Though it'd make sense to put it in the common path, our current
detach routine doesn't call this unmap. If we do so, it'd become
something new to the normal detach routine. Or does this mean the
detach routine has been missing an unmap call so far?

Thanks
Nic

2023-02-09 20:50:39

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 05/10] iommufd: Add replace support in iommufd_access_set_ioas()

On Thu, Feb 09, 2023 at 12:28:45PM -0800, Nicolin Chen wrote:
> On Thu, Feb 09, 2023 at 03:13:08AM +0000, Tian, Kevin wrote:
>
> > > --- a/drivers/iommu/iommufd/device.c
> > > +++ b/drivers/iommu/iommufd/device.c
> > > @@ -509,11 +509,23 @@ int iommufd_access_set_ioas(struct
> > > iommufd_access *access, u32 ioas_id)
> > > iommufd_ref_to_users(obj);
> > > }
> > >
> > > + /*
> > > + * Set ioas to NULL to block any further iommufd_access_pin_pages().
> > > + * iommufd_access_unpin_pages() can continue using access-
> > > >ioas_unpin.
> > > + */
> > > + access->ioas = NULL;
> > > +
> > > if (cur_ioas) {
> > > + if (new_ioas) {
> > > + mutex_unlock(&access->ioas_lock);
> > > + access->ops->unmap(access->data, 0, ULONG_MAX);
> > > + mutex_lock(&access->ioas_lock);
> > > + }
> >
> > why does above only apply to a valid new_ioas? this is the cleanup on
> > cur_ioas then required even when new_ioas=NULL.
>
> Though it'd make sense to put it in the common path, our current
> detach routine doesn't call this unmap. If we do so, it'd become
> something new to the normal detach routine. Or does this mean the
> detach routine has been missing an unmap call so far?

By the time vfio_iommufd_emulated_unbind() is called the driver's
close_device() has already returned

At this point the driver should have removed all active pins.

We should not call back into the driver with unmap after its
close_device() returns.

However, this function is not on the close_device path so it should
always flush all existing mappings before attempting to change the
ioas to anything.

Jason

2023-02-09 20:56:10

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

On Wed, Feb 08, 2023 at 08:08:42AM +0000, Liu, Yi L wrote:

> > From: Nicolin Chen <[email protected]>
> > Sent: Wednesday, February 8, 2023 5:18 AM
> >
> > iommu_group_replace_domain() is introduced to support use cases where
> > an
> > iommu_group can be attached to a new domain without getting detached
> > from
> > the old one. This replacement feature will be useful, for cases such as:
> > 1) vPASID mode, when a guest wants to replace a single pasid (PASID=0)
> > table with a larger table (PASID=N)
> > 2) Nesting mode, when switching the attaching device from an S2 domain
> > to an S1 domain, or when switching between relevant S1 domains.
> > as it allows these cases to switch seamlessly without a DMA disruption.
> >
> > So, call iommu_group_replace_domain() in the
> > iommufd_device_do_attach().
> > And add a __iommmufd_device_detach helper to allow the replace routine
> > to
> > do a partial detach on the current hwpt that's being replaced. Though the
> > updated locking logic is overcomplicated, it will be eased, once those
> > iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's
> > allocation/destroy() functions in the coming nesting series, as that'll
> > depend on a new ->domain_alloc_user op in the iommu core.
> >
> > Also, block replace operations that are from/to auto_domains, i.e. only
> > user-allocated hw_pagetables can be replaced or replaced with.
> >
> > Signed-off-by: Nicolin Chen <[email protected]>
> > ---
> > drivers/iommu/iommufd/device.c | 101 +++++++++++++++++-------
> > drivers/iommu/iommufd/iommufd_private.h | 2 +
> > 2 files changed, 76 insertions(+), 27 deletions(-)
> >
> > diff --git a/drivers/iommu/iommufd/device.c
> > b/drivers/iommu/iommufd/device.c
> > index b8c3e3baccb5..8a9834fc129a 100644
> > --- a/drivers/iommu/iommufd/device.c
> > +++ b/drivers/iommu/iommufd/device.c
> > @@ -9,6 +9,8 @@
> > #include "io_pagetable.h"
> > #include "iommufd_private.h"
> >
> > +MODULE_IMPORT_NS(IOMMUFD_INTERNAL);
> > +
> > static bool allow_unsafe_interrupts;
> > module_param(allow_unsafe_interrupts, bool, S_IRUGO | S_IWUSR);
> > MODULE_PARM_DESC(
> > @@ -194,9 +196,61 @@ static bool
> > iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt,
> > return false;
> > }
> >
> > +/**
> > + * __iommmufd_device_detach - Detach a device from idev->hwpt to
> > new_hwpt
>
> This function doesn't do anything to make this device attached to new_hwpt.
> It is done in the iommufd_device_attach_ioas(). New_hwpt here indicates if
> this detach requires to do some extra thing. E.g. remove reserved iova from
> the idev->hwpt->ioas. So may just say " Detach a device from idev->hwpt",
> and explain the usage of new_hwpt in the below.

Yea. You are right.

> > + * @idev: device to detach
> > + * @new_hwpt: new hw_pagetable to attach (pass in NULL for a simple
> > detach)
>
> The new hw_pagetable to be attached.

OK.

> > + * @detach_group: flag to call iommu_detach_group
> > + *
> > + * This is a cleanup helper shared by the replace and detach routines.
> > Comparing
> > + * to a detach routine, a replace routine only needs a partial detach
> > procedure:
> > + * it does not need the iommu_detach_group(); it will attach the device to
> > a new
> > + * hw_pagetable after a partial detach from the currently attached
> > hw_pagetable,
> > + * so certain steps can be skipped if two hw_pagetables have the same
> > IOAS.
> > + */
> > +static void __iommmufd_device_detach(struct iommufd_device *idev,
> > + struct iommufd_hw_pagetable
> > *new_hwpt,
> > + bool detach_group)
> > +{
> > + struct iommufd_hw_pagetable *hwpt = idev->hwpt;
> > + struct iommufd_ioas *new_ioas = NULL;
> > +
> > + if (new_hwpt)
> > + new_ioas = new_hwpt->ioas;
> > +
> > + mutex_lock(&hwpt->devices_lock);
> > + list_del(&idev->devices_item);
> > + if (hwpt->ioas != new_ioas)
> > + mutex_lock(&hwpt->ioas->mutex);
>
> The lock order is mostly hwpt->ioas->mutex and then hwpt->devices_lock.
> See the iommufd_device_auto_get_domain(). If possible, may switch the
> order sequence here.

Yea, I know it's a bit strange. Yet...

Our nesting series simplifies this part to:
if (cur_ioas != new_ioas) {
mutex_lock(&hwpt->ioas->mutex);
iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
mutex_unlock(&hwpt->ioas->mutex);
}

So, here is trying to avoid something like:
if (cur_ioas != new_ioas)
mutex_lock(&hwpt->ioas->mutex);
// doing something
if (cur_ioas != new_ioas)
iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
// doing something
if (cur_ioas != new_ioas)
mutex_unlock(&hwpt->ioas->mutex);

> Also, rename hwpt to be cur_hwpt, this may help
> reviewers to distinguish it from the hwpt in the caller of this function. It
> looks to be a deadlock at first look, but not after closer reading.

Sure.

> > @@ -345,6 +406,13 @@ int iommufd_device_attach(struct iommufd_device
> > *idev, u32 *pt_id)
> > struct iommufd_hw_pagetable *hwpt =
> > container_of(pt_obj, struct
> > iommufd_hw_pagetable, obj);
> >
> > + if (idev->hwpt == hwpt)
> > + goto out_done;
> > + if (idev->hwpt && idev->hwpt->auto_domain) {
> > + rc = -EBUSY;
>
> This means if device was attached to an auto_created hwpt, then we
> cannot replace it with a user allocated hwpt? If yes, this means the
> replace is not available until user hwpt support, which is part of nesting.

After aligning with Jason, this limit here might be wrong, as we
should be able to support replacing an IOAS. I'd need to take a
closer look and fix it in v3.

> > + if (idev->hwpt)
> > + return -EBUSY;
>
> So we don't allow ioas replacement for physical devices. Is it?
> Looks like emulated devices allows it.

This was to avoid an replace with an auto_domain. Similarly, it's
likely wrong, as I replied above.

Thanks
Nic

2023-02-09 20:56:37

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

On Wed, Feb 08, 2023 at 08:12:55AM +0000, Liu, Yi L wrote:
> External email: Use caution opening links or attachments
>
>
> > From: Nicolin Chen <[email protected]>
> > Sent: Wednesday, February 8, 2023 5:18 AM
> >
> > +/**
> > + * __iommmufd_device_detach - Detach a device from idev->hwpt to
>
> Nit: s/_iommmufd/__iommufd

Will fix it. Thanks!

2023-02-09 21:13:35

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

On Thu, Feb 09, 2023 at 04:00:52AM +0000, Tian, Kevin wrote:

> > From: Nicolin Chen <[email protected]>
> > Sent: Wednesday, February 8, 2023 5:18 AM
> >
> > iommu_group_replace_domain() is introduced to support use cases where
> > an
> > iommu_group can be attached to a new domain without getting detached
> > from
> > the old one. This replacement feature will be useful, for cases such as:
> > 1) vPASID mode, when a guest wants to replace a single pasid (PASID=0)
> > table with a larger table (PASID=N)
> > 2) Nesting mode, when switching the attaching device from an S2 domain
> > to an S1 domain, or when switching between relevant S1 domains.
> > as it allows these cases to switch seamlessly without a DMA disruption.
> >
> > So, call iommu_group_replace_domain() in the iommufd_device_do_attach().
> > And add a __iommmufd_device_detach helper to allow the replace routine
> > to
> > do a partial detach on the current hwpt that's being replaced. Though the
> > updated locking logic is overcomplicated, it will be eased, once those
> > iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's
> > allocation/destroy() functions in the coming nesting series, as that'll
> > depend on a new ->domain_alloc_user op in the iommu core.
>
> then why not moving those changes into this series to make it simple?

The simplification depends on the new ->domain_alloc_user op and
its implementation in SMMU driver, which would be introduced by
the nesting series of VT-d and SMMU respectively.

At this point, it's hard to decide the best sequence of our three
series. Putting this replace series first simply because it seems
to be closer to get merged than the other two bigger series.

> > Also, block replace operations that are from/to auto_domains, i.e. only
> > user-allocated hw_pagetables can be replaced or replaced with.
>
> where does this restriction come from? iommu_group_replace_domain()
> can switch between any two UNMANAGED domains. What is the extra
> problem in iommufd to support from/to auto domains?

It was my misunderstanding. We should have supported that.
Will fix in v3 and add the corresponding support.

> > +/**
> > + * __iommmufd_device_detach - Detach a device from idev->hwpt to
> > new_hwpt
>
> 'from ... to ...' means a replace semantics. then this should be called
> iommufd_device_replace_hwpt().

Had a hard time to think about the naming, it's indeed a detach-
only routine, but it takes a parameter named new_hwpt...

Perhaps I should just follow Yi's suggestion by rephrasing the
narrative of this function.

> > +static void __iommmufd_device_detach(struct iommufd_device *idev,
> > + struct iommufd_hw_pagetable
> > *new_hwpt,
> > + bool detach_group)
> > +{
> > + struct iommufd_hw_pagetable *hwpt = idev->hwpt;
> > + struct iommufd_ioas *new_ioas = NULL;
> > +
> > + if (new_hwpt)
> > + new_ioas = new_hwpt->ioas;
> > +
> > + mutex_lock(&hwpt->devices_lock);
> > + list_del(&idev->devices_item);
> > + if (hwpt->ioas != new_ioas)
> > + mutex_lock(&hwpt->ioas->mutex);
>
> I think new_ioas->mutext was meant here.

new_hwpt is an input from an replace routine, where it holds the
new_ioas->mutex already. Yi's right that the code here is a bit
confusing. I will try to change it a bit for readability.

> > + if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
> > + if (list_empty(&hwpt->devices)) {
> > + iopt_table_remove_domain(&hwpt->ioas->iopt,
> > + hwpt->domain);
> > + list_del(&hwpt->hwpt_item);
> > + }
>
> I'm not sure how this can be fully shared between detach and replace.
> Here some work e.g. above needs to be done before calling
> iommu_group_replace_domain() while others can be done afterwards.

This iopt_table_remove_domain/list_del is supposed to be done in
the hwpt's destroy() actually. We couldn't move it because it'd
need the new domain_alloc_user op and its implementation in ARM
driver. Overall, I think it should be safe to put it behind the
iommu_group_replace_domain().

Thanks
Nic

2023-02-09 22:18:32

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v2 05/10] iommufd: Add replace support in iommufd_access_set_ioas()

On Thu, Feb 09, 2023 at 04:49:55PM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 09, 2023 at 12:28:45PM -0800, Nicolin Chen wrote:
> > On Thu, Feb 09, 2023 at 03:13:08AM +0000, Tian, Kevin wrote:
> >
> > > > --- a/drivers/iommu/iommufd/device.c
> > > > +++ b/drivers/iommu/iommufd/device.c
> > > > @@ -509,11 +509,23 @@ int iommufd_access_set_ioas(struct
> > > > iommufd_access *access, u32 ioas_id)
> > > > iommufd_ref_to_users(obj);
> > > > }
> > > >
> > > > + /*
> > > > + * Set ioas to NULL to block any further iommufd_access_pin_pages().
> > > > + * iommufd_access_unpin_pages() can continue using access-
> > > > >ioas_unpin.
> > > > + */
> > > > + access->ioas = NULL;
> > > > +
> > > > if (cur_ioas) {
> > > > + if (new_ioas) {
> > > > + mutex_unlock(&access->ioas_lock);
> > > > + access->ops->unmap(access->data, 0, ULONG_MAX);
> > > > + mutex_lock(&access->ioas_lock);
> > > > + }
> > >
> > > why does above only apply to a valid new_ioas? this is the cleanup on
> > > cur_ioas then required even when new_ioas=NULL.
> >
> > Though it'd make sense to put it in the common path, our current
> > detach routine doesn't call this unmap. If we do so, it'd become
> > something new to the normal detach routine. Or does this mean the
> > detach routine has been missing an unmap call so far?
>
> By the time vfio_iommufd_emulated_unbind() is called the driver's
> close_device() has already returned
>
> At this point the driver should have removed all active pins.
>
> We should not call back into the driver with unmap after its
> close_device() returns.

I see. Just found that vfio_device_last_close().

> However, this function is not on the close_device path so it should
> always flush all existing mappings before attempting to change the
> ioas to anything.

OK. That means I also need to fix my change here:

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 8a9834fc129a..ba3fd35b7540 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -465,7 +465,10 @@ void iommufd_access_destroy_object(struct iommufd_object *obj)
struct iommufd_access *access =
container_of(obj, struct iommufd_access, obj);

- iommufd_access_set_ioas(access, 0);
+ if (access->ioas) {
+ iopt_remove_access(&access->ioas->iopt, access);
+ refcount_dec(&access->ioas->obj.users);
+ }
iommufd_ctx_put(access->ictx);
mutex_destroy(&access->ioas_lock);
}

Otherwise, the close_device path would invoke this function via
the unbind() call.

Thanks
Nic

2023-02-10 00:01:09

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

On Thu, Feb 09, 2023 at 01:13:07PM -0800, Nicolin Chen wrote:
> On Thu, Feb 09, 2023 at 04:00:52AM +0000, Tian, Kevin wrote:
>
> > > From: Nicolin Chen <[email protected]>
> > > Sent: Wednesday, February 8, 2023 5:18 AM
> > >
> > > iommu_group_replace_domain() is introduced to support use cases where
> > > an
> > > iommu_group can be attached to a new domain without getting detached
> > > from
> > > the old one. This replacement feature will be useful, for cases such as:
> > > 1) vPASID mode, when a guest wants to replace a single pasid (PASID=0)
> > > table with a larger table (PASID=N)
> > > 2) Nesting mode, when switching the attaching device from an S2 domain
> > > to an S1 domain, or when switching between relevant S1 domains.
> > > as it allows these cases to switch seamlessly without a DMA disruption.
> > >
> > > So, call iommu_group_replace_domain() in the iommufd_device_do_attach().
> > > And add a __iommmufd_device_detach helper to allow the replace routine
> > > to
> > > do a partial detach on the current hwpt that's being replaced. Though the
> > > updated locking logic is overcomplicated, it will be eased, once those
> > > iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's
> > > allocation/destroy() functions in the coming nesting series, as that'll
> > > depend on a new ->domain_alloc_user op in the iommu core.
> >
> > then why not moving those changes into this series to make it simple?
>
> The simplification depends on the new ->domain_alloc_user op and
> its implementation in SMMU driver, which would be introduced by
> the nesting series of VT-d and SMMU respectively.

Since we are not fixing all the drivers at this point, this argument
doesn't hold water.

It is as I said in the other email, there should be no changes to the
normal attach/replace path when adding manual HWPT creation - once
those are removed there should be minimal connection between these two
series.

Jason

2023-02-10 01:35:16

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 00/10] Add IO page table replacement support

> From: Nicolin Chen <[email protected]>
> Sent: Friday, February 10, 2023 12:13 AM
>
> On Thu, Feb 09, 2023 at 02:50:39AM +0000, Tian, Kevin wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > > From: Nicolin Chen <[email protected]>
> > > Sent: Wednesday, February 8, 2023 5:18 AM
> > >
> > > QEMU with this feature should have the vIOMMU maintain a cache of the
> > > guest io page table addresses and assign a unique IOAS to each unique
> > > guest page table.
> > >
> >
> > I still didn't see a clear reply why above is required. Please elaborate.
>
> Hmm..this piece was directly copied from Jason's inputs before I
> sent v1. And I thought he explained in the previous version. I
> can drop it in v3 if that makes you happy.
>

Jason's reply was:

> I think the guidance is about the change to VFIO uAPI where it is now
> possible to change the domain attached, previously that was not
> possible

I cannot connect it to the above statement. Anyway if you think
that guidance to userspace is necessary please connect it to the
replace semantics to explain the motivation.

2023-02-10 01:35:59

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 02/10] iommu: Introduce a new iommu_group_replace_domain() API

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, February 9, 2023 9:23 PM
>
> On Thu, Feb 09, 2023 at 02:55:24AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <[email protected]>
> > > Sent: Wednesday, February 8, 2023 5:18 AM
> > >
> > > +int iommu_group_replace_domain(struct iommu_group *group,
> > > + struct iommu_domain *new_domain)
> > > +{
> > > + int ret;
> > > +
> > > + if (!new_domain)
> > > + return -EINVAL;
> >
> > Is there value of allowing NULL new domain so this plays like
> > iommu_detach_group() then iommufd only needs call one
> > function in both attach/detach path?
>
> We've used NULL to mean the 'platform domain' in the iommu core code
> in a few places, I'd prefer to avoid overloading NULL.
>
> IMHO it doesn't help iommufd to substitute detach with replace.
>

OK

2023-02-10 01:46:29

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 07/10] iommufd/device: Make hwpt_list list_add/del symmetric

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, February 9, 2023 9:24 PM
>
> On Thu, Feb 09, 2023 at 03:23:47AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <[email protected]>
> > > Sent: Wednesday, February 8, 2023 5:18 AM
> > >
> > > Because list_del() is together with iopt_table_remove_domain(), it makes
> > > sense to have list_add_tail() together with iopt_table_add_domain().
> > >
> > > Also place the mutex outside the iommufd_device_do_attach() call,
> similar
> > > to what's in the iommufd_device_auto_get_domain() function.
> > >
> > > Co-developed-by: Yi Liu <[email protected]>
> > > Signed-off-by: Yi Liu <[email protected]>
> > > Reviewed-by: Kevin Tian <[email protected]>
> > > Signed-off-by: Nicolin Chen <[email protected]>
> >
> > shouldn't this be a separate bug fix and backported? double adding a
> > list item would certainly clobber the list...
>
> AFAIK there is no bug, this is just reorganizing things
>

there is semantics change.

here is the current code:

case IOMMUFD_OBJ_HW_PAGETABLE: {
struct iommufd_hw_pagetable *hwpt =
container_of(pt_obj, struct iommufd_hw_pagetable, obj);

rc = iommufd_device_do_attach(idev, hwpt);
if (rc)
goto out_put_pt_obj;

mutex_lock(&hwpt->ioas->mutex);
list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list);
mutex_unlock(&hwpt->ioas->mutex);
break;
}

Above means every attach to hwpt will try to add the hwpt to the
list tail. Isn't it a bug?

with this patch the hwpt is added to the list only when it's attached
by the first device, i.e. when iopt_table_add_domain() is called.

if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
rc = iommu_attach_group(hwpt->domain, idev->group);
if (rc)
goto out_iova;

if (list_empty(&hwpt->devices)) {
rc = iopt_table_add_domain(&hwpt->ioas->iopt,
hwpt->domain);
if (rc)
goto out_detach;
list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list);
}
}

so it's actually a bug fix.

2023-02-10 02:11:30

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

> From: Nicolin Chen <[email protected]>
> Sent: Friday, February 10, 2023 5:13 AM
>
> > > + if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
> > > + if (list_empty(&hwpt->devices)) {
> > > + iopt_table_remove_domain(&hwpt->ioas->iopt,
> > > + hwpt->domain);
> > > + list_del(&hwpt->hwpt_item);
> > > + }
> >
> > I'm not sure how this can be fully shared between detach and replace.
> > Here some work e.g. above needs to be done before calling
> > iommu_group_replace_domain() while others can be done afterwards.
>
> This iopt_table_remove_domain/list_del is supposed to be done in
> the hwpt's destroy() actually. We couldn't move it because it'd
> need the new domain_alloc_user op and its implementation in ARM
> driver. Overall, I think it should be safe to put it behind the
> iommu_group_replace_domain().
>

My confusion is that we have different flows between detach/attach
and replace.

today with separate detach+attach we have following flow:

Remove device from current hwpt;
if (last_device in hwpt) {
Remove hwpt domain from current iopt;
if (last_device in group)
detach group from hwpt domain;
}

if (first device in group) {
attach group to new hwpt domain;
if (first_device in hwpt)
Add hwpt domain to new iopt;
Add device to new hwpt;

but replace flow is different on the detach part:

if (first device in group) {
replace group's domain from current hwpt to new hwpt;
if (first_device in hwpt)
Add hwpt domain to new iopt;
}

Remove device from old hwpt;
if (last_device in old hwpt)
Remove hwpt domain from old iopt;

Add device to new hwpt;

I'm yet to figure out whether we have sufficient lock protection to
prevent other paths from using old iopt/hwpt to find the device
which is already attached to a different domain.

2023-02-10 20:50:41

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

On Thu, Feb 09, 2023 at 08:01:00PM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 09, 2023 at 01:13:07PM -0800, Nicolin Chen wrote:
> > On Thu, Feb 09, 2023 at 04:00:52AM +0000, Tian, Kevin wrote:
> >
> > > > From: Nicolin Chen <[email protected]>
> > > > Sent: Wednesday, February 8, 2023 5:18 AM
> > > >
> > > > iommu_group_replace_domain() is introduced to support use cases where
> > > > an
> > > > iommu_group can be attached to a new domain without getting detached
> > > > from
> > > > the old one. This replacement feature will be useful, for cases such as:
> > > > 1) vPASID mode, when a guest wants to replace a single pasid (PASID=0)
> > > > table with a larger table (PASID=N)
> > > > 2) Nesting mode, when switching the attaching device from an S2 domain
> > > > to an S1 domain, or when switching between relevant S1 domains.
> > > > as it allows these cases to switch seamlessly without a DMA disruption.
> > > >
> > > > So, call iommu_group_replace_domain() in the iommufd_device_do_attach().
> > > > And add a __iommmufd_device_detach helper to allow the replace routine
> > > > to
> > > > do a partial detach on the current hwpt that's being replaced. Though the
> > > > updated locking logic is overcomplicated, it will be eased, once those
> > > > iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's
> > > > allocation/destroy() functions in the coming nesting series, as that'll
> > > > depend on a new ->domain_alloc_user op in the iommu core.
> > >
> > > then why not moving those changes into this series to make it simple?
> >
> > The simplification depends on the new ->domain_alloc_user op and
> > its implementation in SMMU driver, which would be introduced by
> > the nesting series of VT-d and SMMU respectively.
>
> Since we are not fixing all the drivers at this point, this argument
> doesn't hold water.
>
> It is as I said in the other email, there should be no changes to the
> normal attach/replace path when adding manual HWPT creation - once
> those are removed there should be minimal connection between these two
> series.

Yes. I replied here earlier than the discussion with you in
that thread. I think I should just drop this line of commit
messages.

Thanks
Nic

2023-02-10 21:18:01

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] iommufd/device: Make hwpt_list list_add/del symmetric

On Fri, Feb 10, 2023 at 01:46:18AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Thursday, February 9, 2023 9:24 PM
> >
> > On Thu, Feb 09, 2023 at 03:23:47AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <[email protected]>
> > > > Sent: Wednesday, February 8, 2023 5:18 AM
> > > >
> > > > Because list_del() is together with iopt_table_remove_domain(), it makes
> > > > sense to have list_add_tail() together with iopt_table_add_domain().
> > > >
> > > > Also place the mutex outside the iommufd_device_do_attach() call,
> > similar
> > > > to what's in the iommufd_device_auto_get_domain() function.
> > > >
> > > > Co-developed-by: Yi Liu <[email protected]>
> > > > Signed-off-by: Yi Liu <[email protected]>
> > > > Reviewed-by: Kevin Tian <[email protected]>
> > > > Signed-off-by: Nicolin Chen <[email protected]>
> > >
> > > shouldn't this be a separate bug fix and backported? double adding a
> > > list item would certainly clobber the list...
> >
> > AFAIK there is no bug, this is just reorganizing things
> >
>
> there is semantics change.
>
> here is the current code:
>
> case IOMMUFD_OBJ_HW_PAGETABLE: {
> struct iommufd_hw_pagetable *hwpt =
> container_of(pt_obj, struct iommufd_hw_pagetable, obj);
>
> rc = iommufd_device_do_attach(idev, hwpt);
> if (rc)
> goto out_put_pt_obj;
>
> mutex_lock(&hwpt->ioas->mutex);
> list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list);
> mutex_unlock(&hwpt->ioas->mutex);
> break;
> }
>
> Above means every attach to hwpt will try to add the hwpt to the
> list tail. Isn't it a bug?

Yes, that looks like a bug..

But this patch isn't the right way to fix that.

The HWPT should be permanently linked to the IOAS as long as it
exists, and the linkage should happen when it is first created.

So attaching a HWPT to another device should never re-link it to the
ioas, thus delete these lines here.

However it looks like iommufd_device_detach() is technically wrong
too, it should only detach the IOAS and HWPT if it is going to destroy
the HWPT. We can't hit those kinds of bugs ATM because we cannot
create naked HWPTs that are not autodomains.

Maybe something like this.. I'll look closer next week

Jason

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index d81f93a321afcb..4e87a44533048a 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -279,7 +279,7 @@ static int iommufd_device_auto_get_domain(struct iommufd_device *idev,
*/
mutex_lock(&ioas->mutex);
list_for_each_entry(hwpt, &ioas->hwpt_list, hwpt_item) {
- if (!hwpt->auto_domain)
+ if (!hwpt->auto_domain || iommufd_object_alive(&hwpt->obj))
continue;

rc = iommufd_device_do_attach(idev, hwpt);
@@ -304,6 +304,7 @@ static int iommufd_device_auto_get_domain(struct iommufd_device *idev,
rc = iommufd_device_do_attach(idev, hwpt);
if (rc)
goto out_abort;
+
list_add_tail(&hwpt->hwpt_item, &ioas->hwpt_list);

mutex_unlock(&ioas->mutex);
@@ -346,10 +347,6 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id)
rc = iommufd_device_do_attach(idev, hwpt);
if (rc)
goto out_put_pt_obj;
-
- mutex_lock(&hwpt->ioas->mutex);
- list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list);
- mutex_unlock(&hwpt->ioas->mutex);
break;
}
case IOMMUFD_OBJ_IOAS: {
@@ -390,14 +387,8 @@ void iommufd_device_detach(struct iommufd_device *idev)
mutex_lock(&hwpt->ioas->mutex);
mutex_lock(&hwpt->devices_lock);
list_del(&idev->devices_item);
- if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
- if (list_empty(&hwpt->devices)) {
- iopt_table_remove_domain(&hwpt->ioas->iopt,
- hwpt->domain);
- list_del(&hwpt->hwpt_item);
- }
+ if (!iommufd_hw_pagetable_has_group(hwpt, idev->group))
iommu_detach_group(hwpt->domain, idev->group);
- }
iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
mutex_unlock(&hwpt->devices_lock);
mutex_unlock(&hwpt->ioas->mutex);
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 43d473989a0667..b11738bbdff7ec 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -13,6 +13,11 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)

WARN_ON(!list_empty(&hwpt->devices));

+ mutex_lock(&hwpt->ioas->mutex);
+ iopt_table_remove_domain(&hwpt->ioas->iopt, hwpt->domain);
+ list_del(&hwpt->hwpt_item);
+ mutex_unlock(&hwpt->ioas->mutex);
+
iommu_domain_free(hwpt->domain);
refcount_dec(&hwpt->ioas->obj.users);
mutex_destroy(&hwpt->devices_lock);

2023-02-10 23:52:11

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v2 02/10] iommu: Introduce a new iommu_group_replace_domain() API

On Tue, 7 Feb 2023 13:17:54 -0800
Nicolin Chen <[email protected]> wrote:

> qemu has a need to replace the translations associated with a domain
> when the guest does large-scale operations like switching between an
> IDENTITY domain and, say, dma-iommu.c.
>
> Currently, it does this by replacing all the mappings in a single
> domain, but this is very inefficient and means that domains have to be
> per-device rather than per-translation.
>
> Provide a high-level API to allow replacements of one domain with
> another. This is similar to a detach/attach cycle except it doesn't
> force the group to go to the blocking domain in-between.
>
> By removing this forced blocking domain the iommu driver has the
> opportunity to implement an atomic replacement of the domains to the
> greatest extent its hardware allows.
>
> It could be possible to adderss this by simply removing the protection
> from the iommu_attach_group(), but it is not so clear if that is safe
> for the few users. Thus, add a new API to serve this new purpose.
>
> Atomic replacement allows the qemu emulation of the viommu to be more
> complete, as real hardware has this ability.

I was under the impression that we could not atomically switch a
device's domain relative to in-flight DMA. IIRC, the discussion was
relative to VT-d, and I vaguely recall something about the domain
needing to be invalidated before it could be replaced. Am I
mis-remembering or has this since been solved? Adding Ashok, who might
have been involved in one of those conversations.

Or maybe atomic is the wrong word here since we expect no in-flight DMA
during the sort of mode transitions referred to here, and we're really
just trying to convey that we can do this via a single operation with
reduced latency? Thanks,

Alex

> All drivers are already required to support changing between active
> UNMANAGED domains when using their attach_dev ops.
>
> This API is expected to be used by IOMMUFD, so add to the iommu-priv
> header and mark it as IOMMUFD_INTERNAL.
>
> Suggested-by: Jason Gunthorpe <[email protected]>
> Signed-off-by: Nicolin Chen <[email protected]>
> ---
> drivers/iommu/iommu-priv.h | 4 ++++
> drivers/iommu/iommu.c | 28 ++++++++++++++++++++++++++++
> 2 files changed, 32 insertions(+)
>
> diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
> index 9e1497027cff..b546795a7e49 100644
> --- a/drivers/iommu/iommu-priv.h
> +++ b/drivers/iommu/iommu-priv.h
> @@ -15,4 +15,8 @@ static inline const struct iommu_ops *dev_iommu_ops(struct device *dev)
> */
> return dev->iommu->iommu_dev->ops;
> }
> +
> +extern int iommu_group_replace_domain(struct iommu_group *group,
> + struct iommu_domain *new_domain);
> +
> #endif /* __LINUX_IOMMU_PRIV_H */
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index a18b7f1a4e6e..15e07d39cd8d 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2151,6 +2151,34 @@ int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
> }
> EXPORT_SYMBOL_GPL(iommu_attach_group);
>
> +/**
> + * iommu_group_replace_domain - replace the domain that a group is attached to
> + * @new_domain: new IOMMU domain to replace with
> + * @group: IOMMU group that will be attached to the new domain
> + *
> + * This API allows the group to switch domains without being forced to go to
> + * the blocking domain in-between.
> + *
> + * If the currently attached domain is a core domain (e.g. a default_domain),
> + * it will act just like the iommu_attach_group().
> + */
> +int iommu_group_replace_domain(struct iommu_group *group,
> + struct iommu_domain *new_domain)
> +{
> + int ret;
> +
> + if (!new_domain)
> + return -EINVAL;
> +
> + mutex_lock(&group->mutex);
> + ret = __iommu_group_set_domain(group, new_domain);
> + if (ret)
> + __iommu_group_set_domain(group, group->domain);
> + mutex_unlock(&group->mutex);
> + return ret;
> +}
> +EXPORT_SYMBOL_NS_GPL(iommu_group_replace_domain, IOMMUFD_INTERNAL);
> +
> static int iommu_group_do_set_platform_dma(struct device *dev, void *data)
> {
> const struct iommu_ops *ops = dev_iommu_ops(dev);


2023-02-11 00:10:50

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:

> > > > + if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
> > > > + if (list_empty(&hwpt->devices)) {
> > > > + iopt_table_remove_domain(&hwpt->ioas->iopt,
> > > > + hwpt->domain);
> > > > + list_del(&hwpt->hwpt_item);
> > > > + }
> > >
> > > I'm not sure how this can be fully shared between detach and replace.
> > > Here some work e.g. above needs to be done before calling
> > > iommu_group_replace_domain() while others can be done afterwards.
> >
> > This iopt_table_remove_domain/list_del is supposed to be done in
> > the hwpt's destroy() actually. We couldn't move it because it'd
> > need the new domain_alloc_user op and its implementation in ARM
> > driver. Overall, I think it should be safe to put it behind the
> > iommu_group_replace_domain().
> >
>
> My confusion is that we have different flows between detach/attach
> and replace.
>
> today with separate detach+attach we have following flow:
>
> Remove device from current hwpt;
> if (last_device in hwpt) {
> Remove hwpt domain from current iopt;
> if (last_device in group)
> detach group from hwpt domain;
> }
>
> if (first device in group) {
> attach group to new hwpt domain;
> if (first_device in hwpt)
> Add hwpt domain to new iopt;
> Add device to new hwpt;
>
> but replace flow is different on the detach part:
>
> if (first device in group) {
> replace group's domain from current hwpt to new hwpt;
> if (first_device in hwpt)
> Add hwpt domain to new iopt;
> }
>
> Remove device from old hwpt;
> if (last_device in old hwpt)
> Remove hwpt domain from old iopt;
>
> Add device to new hwpt;

Oh... thinking it carefully, I see the flow does look a bit off.
Perhaps it's better to have a similar flow for replace.

However, I think something would be still different due to its
tricky nature, especially for a multi-device iommu_group.

An iommu_group_detach happens only when a device is the last one
in its group to go through the routine via a DETACH ioctl, while
an iommu_group_replace_domain() happens only when the device is
the first one in its group to go through the routine via another
ATTACH ioctl. However, when the first device does a replace, the
cleanup routine of the old hwpt is a NOP, since there are still
other devices (same group) in the old hwpt. And two implications
here:
1) Any other device in the same group has to forcibly switch to
the new domain, when the first device does a replace.
2) The actual hwpt cleanup can only happen at the last device's
replace call.

This also means that kernel has to rely on the integrity of the
user space that it must replace all active devices in the group:

For a three-device iommu_group,
[scenario 1]
a. ATTACH dev1 to hwpt1;
b. ATTACH dev2 to hwpt1;
c. ATTACH dev3 to hwpt1;
d. ATTACH (REPLACE) dev1 to hwpt2;
(no hwpt1 cleanup; replace dev2&3 too; do hwpt2 init)
e. ATTACH (REPLACE) dev2 to hwpt2; // user space must do
(no hwpt1 cleanup; no dev2 replace; no hwpt2 init)
f. ATTACH (REPLACE) dev3 to hwpt2; // user space must do
(do hwpt1 cleanup; no dev3 replace; no hwpt2 init)

[scenario 2]
a. ATTACH dev1 to hwpt1;
b. ATTACH dev2 to hwpt1;
c. ATTACH dev3 to hwpt1;
d. DETACH dev3 from hwpt1;
(detach dev3; no hwpt1 cleanup)
f. ATTACH (REPLACE) dev1 to hwpt2;
(no hwpt1 cleanup; replace dev2&3 too; do hwpt2 init)
g. ATTACH (REPLACE) dev2 to hwpt2; // user space must do
(do hwpt1 cleanup; no dev2 replace; no hwpt2 init)
h. (optional) ATTACH dev3 to hwpt2; // clean ATTACH, not a REPLACE
(no hwpt1 cleanup; no dev3 replace; no hwpt2 init)

[scenario 3]
a. ATTACH dev1 to hwpt1;
b. ATTACH dev2 to hwpt1;
c. ATTACH dev3 to hwpt1;
d. DETACH dev3 from hwpt1;
(detach dev3; no hwpt1 cleanup)
e. DETACH dev2 from hwpt1;
(detach dev2; no hwpt1 cleanup)
f. ATTACH (REPLACE) dev1 to hwpt2;
(do hwpt1 cleanup; replace dev2&3 too; do hwpt2 init)
g. (optional) ATTACH dev2 to hwpt2; // clean ATTACH, not a REPLACE
(no hwpt1 cleanup; no dev2 replace; no hwpt2 init)
h. (optional) ATTACH dev3 to hwpt2; // clean ATTACH, not a REPLACE
(no hwpt1 cleanup; no dev3 replace; no hwpt2 init)

Following the narratives above,

[current detach+attach flow]
// DETACH dev1 from hwpt1;
Log dev1 out of the hwpt1's device list;
NOP; // hwpt1 has its group;
iopt_remove_reserved_iova(hwpt1->iopt, dev1);
idev1->hwpt = NULL;
refcount_dec();
// DETACH dev2 from hwpt1;
Log dev2 out of the hwpt1's device list;
if (hwpt1 does not have its group) { // last device to detach
if (hwpt1's device list is empty)
iopt_table_remove_domain/list_del(hwpt1);
iommu_detach_group();
}
iopt_remove_reserved_iova(hwpt1->iopt, dev2);
idev2->hwpt = NULL;
refcount_dec();
...
// ATTACH dev1 to hwpt2;
iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev1);
if (hwpt2 does not have its group) { // first device to attach
iommu_attach_group();
if (hwpt2's device list is empty)
iopt_table_add_domain/list_add(hwpt2);
}
idev1->hwpt = hwpt2;
refcount_inc();
Log dev1 in the hwpt2's device list;
// ATTACH dev2 to hwpt2;
iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev2);
NOP; // hwpt2 has its group;
idev2->hwpt = hwpt2;
refcount_inc();
Log dev2 in to the hwpt2's device list;


[correct (?) replace flow - scenario 1 above]

// 1.d Switch (REPLACE) dev1 from hwpt1 to hwpt2;
partial detach (dev1) {
Log dev1 out of the hwpt1's device list;
NOP // hwpt1 has its group, and hwpt1's device list isn't empty
iopt_remove_reserved_iova(hwpt1->iopt, dev1);
refcount_dec();
}
iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev1);
if (hwpt2 does not have its group) { // first device to replace
iommu_group_replace_domain();
if (hwpt2's device list is empty)
iopt_table_add_domain/list_add(hwpt2);
}
idev1->hwpt = hwpt2;
refcount_int();
Log dev1 in the hwpt2's device list;

// 1.e Switch (REPLACE) dev2 from hwpt1 to hwpt2;
partial detach (dev2) {
Log dev2 out of the hwpt1's device list;
NOP // hwpt1 has its group, and hwpt1's device list isn't empty
iopt_remove_reserved_iova(hwpt1->iopt, dev2);
refcount_dec();
}
iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev2);
NOP; // hwpt2 has its group, and hwpt2's device list isn't empty
idev2->hwpt = hwpt2;
refcount_int();
Log dev2 in the hwpt2's device list;

// 1.f Switch (REPLACE) dev3 from hwpt1 to hwpt2;
partial detach (dev3) {
Log dev3 out of the hwpt1's device list;
if (hwpt1 does not have its group) { // last device to detach
if (hwpt1's device list is empty)
iopt_table_remove_domain/list_del(hwpt1);
}
iopt_remove_reserved_iova(hwpt1->iopt, dev3);
refcount_dec();
}
iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev3);
NOP; // hwpt2 has its group, and hwpt2's device list isn't empty
idev3->hwpt = hwpt2;
refcount_int();
Log dev3 in the hwpt2's device list;

And, this would also complicate the error-out routines too...

> I'm yet to figure out whether we have sufficient lock protection to
> prevent other paths from using old iopt/hwpt to find the device
> which is already attached to a different domain.

With the correct (?) flow, I think it'd be safer for one-device
group. But it's gets tricker for the multi-device case above:
the dev2 and dev3 are already in the new domain, but all their
iommufd objects (idev->hwpt and iopt) are lagging. Do we need a
lock across the three IOCTLs?

One way to avoid it is to force-update idev2 and idev3 too when
idev1 does a replace -- by iterating all same-group devices out
of the old hwpt. This, however, feels a violation against being
device-centric...

i.e. scenario 1:
a. ATTACH dev1 to hwpt1;
b. ATTACH dev2 to hwpt1;
c. ATTACH dev3 to hwpt1;
d. ATTACH (REPLACE) dev1 to hwpt2;
(do hwpt1 cleanup; replace dev2&3 and update idev2&3 too; do hwpt2 init)
e. ATTACH (REPLACE) dev2 to hwpt2; // user space must do, or be aware of 1.d
(kernel does dummy)
f. ATTACH (REPLACE) dev3 to hwpt2; // user space must do, or be aware of 1.d
(kernel does dummy)

Thanks
Nic

2023-02-11 00:45:04

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 02/10] iommu: Introduce a new iommu_group_replace_domain() API

On Fri, Feb 10, 2023 at 04:51:10PM -0700, Alex Williamson wrote:
> On Tue, 7 Feb 2023 13:17:54 -0800
> Nicolin Chen <[email protected]> wrote:
>
> > qemu has a need to replace the translations associated with a domain
> > when the guest does large-scale operations like switching between an
> > IDENTITY domain and, say, dma-iommu.c.
> >
> > Currently, it does this by replacing all the mappings in a single
> > domain, but this is very inefficient and means that domains have to be
> > per-device rather than per-translation.
> >
> > Provide a high-level API to allow replacements of one domain with
> > another. This is similar to a detach/attach cycle except it doesn't
> > force the group to go to the blocking domain in-between.
> >
> > By removing this forced blocking domain the iommu driver has the
> > opportunity to implement an atomic replacement of the domains to the
> > greatest extent its hardware allows.
> >
> > It could be possible to adderss this by simply removing the protection
> > from the iommu_attach_group(), but it is not so clear if that is safe
> > for the few users. Thus, add a new API to serve this new purpose.
> >
> > Atomic replacement allows the qemu emulation of the viommu to be more
> > complete, as real hardware has this ability.
>
> I was under the impression that we could not atomically switch a
> device's domain relative to in-flight DMA.

Certainly all the HW can be proper atomic but not necessarily easily -
the usual issue is a SW complication to manage the software controlled
cache tags in a way that doesn't corrupt the cache.

This is because the cache tag and the io page table top are in
different 64 bit words so atomic writes don't cover both, and thus the
IOMMU HW could tear the two stores and mismatch the cache tag to the
table top. This would corrupt the cache.

The easiest way to avoid this is for SW to use the same DID for the
new and old tables. This is possible if this translation entry is the
only user of the DID. A more complex way would use a temporary DID
that can be safely corrupted. But realistically I'd expect VT-d
drivers to simply make the PASID invalid for the duration of the
update.

However something like AMD has a single cache tag for every entry in
the PASID table so you could do an atomic replace trivially. Just
update the PASID and invalidate the caches.

ARM has a flexible PASID table and atomic replace would be part of
resizing it. eg you can atomically update the top of the PASID table
with a single 64 bit store.

So replace lets the drivers implement those special behaviors if it
makes sense for them.

> Or maybe atomic is the wrong word here since we expect no in-flight DMA
> during the sort of mode transitions referred to here, and we're really
> just trying to convey that we can do this via a single operation with
> reduced latency? Thanks,

atomic means DMA will either translate with the old domain or the new
domain but never a blocking domain. Keep in mind that with nesting
"domain" can mean a full PASID table in guest memory.

I should reiterate that replace is not an API that is required to be
atomic.

Jason

2023-02-13 02:12:58

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 07/10] iommufd/device: Make hwpt_list list_add/del symmetric

> From: Jason Gunthorpe <[email protected]>
> Sent: Saturday, February 11, 2023 5:17 AM
>
> >
> > there is semantics change.
> >
> > here is the current code:
> >
> > case IOMMUFD_OBJ_HW_PAGETABLE: {
> > struct iommufd_hw_pagetable *hwpt =
> > container_of(pt_obj, struct iommufd_hw_pagetable,
> obj);
> >
> > rc = iommufd_device_do_attach(idev, hwpt);
> > if (rc)
> > goto out_put_pt_obj;
> >
> > mutex_lock(&hwpt->ioas->mutex);
> > list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list);
> > mutex_unlock(&hwpt->ioas->mutex);
> > break;
> > }
> >
> > Above means every attach to hwpt will try to add the hwpt to the
> > list tail. Isn't it a bug?
>
> Yes, that looks like a bug..
>
> But this patch isn't the right way to fix that.
>
> The HWPT should be permanently linked to the IOAS as long as it
> exists, and the linkage should happen when it is first created.
>
> So attaching a HWPT to another device should never re-link it to the
> ioas, thus delete these lines here.
>

yes, this makes more sense

2023-02-13 02:24:52

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 02/10] iommu: Introduce a new iommu_group_replace_domain() API

> From: Jason Gunthorpe <[email protected]>
> Sent: Saturday, February 11, 2023 8:45 AM
>
> On Fri, Feb 10, 2023 at 04:51:10PM -0700, Alex Williamson wrote:
> > On Tue, 7 Feb 2023 13:17:54 -0800
> > Nicolin Chen <[email protected]> wrote:
> >
> > > qemu has a need to replace the translations associated with a domain
> > > when the guest does large-scale operations like switching between an
> > > IDENTITY domain and, say, dma-iommu.c.
> > >
> > > Currently, it does this by replacing all the mappings in a single
> > > domain, but this is very inefficient and means that domains have to be
> > > per-device rather than per-translation.
> > >
> > > Provide a high-level API to allow replacements of one domain with
> > > another. This is similar to a detach/attach cycle except it doesn't
> > > force the group to go to the blocking domain in-between.
> > >
> > > By removing this forced blocking domain the iommu driver has the
> > > opportunity to implement an atomic replacement of the domains to the
> > > greatest extent its hardware allows.
> > >
> > > It could be possible to adderss this by simply removing the protection
> > > from the iommu_attach_group(), but it is not so clear if that is safe
> > > for the few users. Thus, add a new API to serve this new purpose.
> > >
> > > Atomic replacement allows the qemu emulation of the viommu to be
> more
> > > complete, as real hardware has this ability.
> >
> > I was under the impression that we could not atomically switch a
> > device's domain relative to in-flight DMA.

it's possible as long as the mappings for in-flight DMA don't change
in the transition.

>
> Certainly all the HW can be proper atomic but not necessarily easily -
> the usual issue is a SW complication to manage the software controlled
> cache tags in a way that doesn't corrupt the cache.
>
> This is because the cache tag and the io page table top are in
> different 64 bit words so atomic writes don't cover both, and thus the
> IOMMU HW could tear the two stores and mismatch the cache tag to the
> table top. This would corrupt the cache.

VT-d spec recommends using 128bit cmpxchg instruction to update
page table pointer and DID together.

>
> The easiest way to avoid this is for SW to use the same DID for the
> new and old tables. This is possible if this translation entry is the
> only user of the DID. A more complex way would use a temporary DID
> that can be safely corrupted. But realistically I'd expect VT-d
> drivers to simply make the PASID invalid for the duration of the
> update.
>
> However something like AMD has a single cache tag for every entry in
> the PASID table so you could do an atomic replace trivially. Just
> update the PASID and invalidate the caches.
>
> ARM has a flexible PASID table and atomic replace would be part of
> resizing it. eg you can atomically update the top of the PASID table
> with a single 64 bit store.
>
> So replace lets the drivers implement those special behaviors if it
> makes sense for them.
>
> > Or maybe atomic is the wrong word here since we expect no in-flight DMA
> > during the sort of mode transitions referred to here, and we're really
> > just trying to convey that we can do this via a single operation with
> > reduced latency? Thanks,
>
> atomic means DMA will either translate with the old domain or the new
> domain but never a blocking domain. Keep in mind that with nesting
> "domain" can mean a full PASID table in guest memory.
>
> I should reiterate that replace is not an API that is required to be
> atomic.
>

yes. Just as explained in the commit message:

"
By removing this forced blocking domain the iommu driver has the
opportunity to implement an atomic replacement of the domains to the
greatest extent its hardware allows.
"

API level replace only implies removing transition to/from blocking
domain. and then it's driver's call whether it wants to take this
chance to implement a true 'atomic' behavior.

2023-02-13 02:34:26

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

> From: Nicolin Chen <[email protected]>
> Sent: Saturday, February 11, 2023 8:10 AM
>
> On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:
>
> > > > > + if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
> > > > > + if (list_empty(&hwpt->devices)) {
> > > > > + iopt_table_remove_domain(&hwpt->ioas->iopt,
> > > > > + hwpt->domain);
> > > > > + list_del(&hwpt->hwpt_item);
> > > > > + }
> > > >
> > > > I'm not sure how this can be fully shared between detach and replace.
> > > > Here some work e.g. above needs to be done before calling
> > > > iommu_group_replace_domain() while others can be done afterwards.
> > >
> > > This iopt_table_remove_domain/list_del is supposed to be done in
> > > the hwpt's destroy() actually. We couldn't move it because it'd
> > > need the new domain_alloc_user op and its implementation in ARM
> > > driver. Overall, I think it should be safe to put it behind the
> > > iommu_group_replace_domain().
> > >
> >
> > My confusion is that we have different flows between detach/attach
> > and replace.
> >
> > today with separate detach+attach we have following flow:
> >
> > Remove device from current hwpt;
> > if (last_device in hwpt) {
> > Remove hwpt domain from current iopt;
> > if (last_device in group)
> > detach group from hwpt domain;
> > }
> >
> > if (first device in group) {
> > attach group to new hwpt domain;
> > if (first_device in hwpt)
> > Add hwpt domain to new iopt;
> > Add device to new hwpt;
> >
> > but replace flow is different on the detach part:
> >
> > if (first device in group) {
> > replace group's domain from current hwpt to new hwpt;
> > if (first_device in hwpt)
> > Add hwpt domain to new iopt;
> > }
> >
> > Remove device from old hwpt;
> > if (last_device in old hwpt)
> > Remove hwpt domain from old iopt;
> >
> > Add device to new hwpt;
>
> Oh... thinking it carefully, I see the flow does look a bit off.
> Perhaps it's better to have a similar flow for replace.
>
> However, I think something would be still different due to its
> tricky nature, especially for a multi-device iommu_group.
>
> An iommu_group_detach happens only when a device is the last one
> in its group to go through the routine via a DETACH ioctl, while
> an iommu_group_replace_domain() happens only when the device is
> the first one in its group to go through the routine via another
> ATTACH ioctl. However, when the first device does a replace, the
> cleanup routine of the old hwpt is a NOP, since there are still
> other devices (same group) in the old hwpt. And two implications
> here:
> 1) Any other device in the same group has to forcibly switch to
> the new domain, when the first device does a replace.
> 2) The actual hwpt cleanup can only happen at the last device's
> replace call.
>
> This also means that kernel has to rely on the integrity of the
> user space that it must replace all active devices in the group:
>

Jason suggested to move hwpt cleanup out of the detach path
in his reply to patch7. Presumably with that fix the major tricky
point about hwpt in following scenarios would disappear. Let's
see how it will work out then. ????

2023-02-13 07:48:57

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

On Mon, Feb 13, 2023 at 02:34:18AM +0000, Tian, Kevin wrote:
> External email: Use caution opening links or attachments
>
>
> > From: Nicolin Chen <[email protected]>
> > Sent: Saturday, February 11, 2023 8:10 AM
> >
> > On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:
> >
> > > > > > + if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
> > > > > > + if (list_empty(&hwpt->devices)) {
> > > > > > + iopt_table_remove_domain(&hwpt->ioas->iopt,
> > > > > > + hwpt->domain);
> > > > > > + list_del(&hwpt->hwpt_item);
> > > > > > + }
> > > > >
> > > > > I'm not sure how this can be fully shared between detach and replace.
> > > > > Here some work e.g. above needs to be done before calling
> > > > > iommu_group_replace_domain() while others can be done afterwards.
> > > >
> > > > This iopt_table_remove_domain/list_del is supposed to be done in
> > > > the hwpt's destroy() actually. We couldn't move it because it'd
> > > > need the new domain_alloc_user op and its implementation in ARM
> > > > driver. Overall, I think it should be safe to put it behind the
> > > > iommu_group_replace_domain().
> > > >
> > >
> > > My confusion is that we have different flows between detach/attach
> > > and replace.
> > >
> > > today with separate detach+attach we have following flow:
> > >
> > > Remove device from current hwpt;
> > > if (last_device in hwpt) {
> > > Remove hwpt domain from current iopt;
> > > if (last_device in group)
> > > detach group from hwpt domain;
> > > }
> > >
> > > if (first device in group) {
> > > attach group to new hwpt domain;
> > > if (first_device in hwpt)
> > > Add hwpt domain to new iopt;
> > > Add device to new hwpt;
> > >
> > > but replace flow is different on the detach part:
> > >
> > > if (first device in group) {
> > > replace group's domain from current hwpt to new hwpt;
> > > if (first_device in hwpt)
> > > Add hwpt domain to new iopt;
> > > }
> > >
> > > Remove device from old hwpt;
> > > if (last_device in old hwpt)
> > > Remove hwpt domain from old iopt;
> > >
> > > Add device to new hwpt;
> >
> > Oh... thinking it carefully, I see the flow does look a bit off.
> > Perhaps it's better to have a similar flow for replace.
> >
> > However, I think something would be still different due to its
> > tricky nature, especially for a multi-device iommu_group.
> >
> > An iommu_group_detach happens only when a device is the last one
> > in its group to go through the routine via a DETACH ioctl, while
> > an iommu_group_replace_domain() happens only when the device is
> > the first one in its group to go through the routine via another
> > ATTACH ioctl. However, when the first device does a replace, the
> > cleanup routine of the old hwpt is a NOP, since there are still
> > other devices (same group) in the old hwpt. And two implications
> > here:
> > 1) Any other device in the same group has to forcibly switch to
> > the new domain, when the first device does a replace.
> > 2) The actual hwpt cleanup can only happen at the last device's
> > replace call.
> >
> > This also means that kernel has to rely on the integrity of the
> > user space that it must replace all active devices in the group:
> >
>
> Jason suggested to move hwpt cleanup out of the detach path
> in his reply to patch7. Presumably with that fix the major tricky
> point about hwpt in following scenarios would disappear. Let's
> see how it will work out then. ????

What about point 1? If dev2 and dev3 are already replaced when
doing iommu_group_replace_domain() on dev1, their idev objects
still have the old hwpt/iopt until user space does another two
IOCTLs on them, right?

Should we only call iommu_group_replace_domain() when the last
device in the group gets a replace IOCTL?

Thanks
Nic

2023-02-13 08:27:56

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

> From: Nicolin Chen <[email protected]>
> Sent: Monday, February 13, 2023 3:49 PM
>
> On Mon, Feb 13, 2023 at 02:34:18AM +0000, Tian, Kevin wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > > From: Nicolin Chen <[email protected]>
> > > Sent: Saturday, February 11, 2023 8:10 AM
> > >
> > > On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:
> > >
> > > > > > > + if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
> > > > > > > + if (list_empty(&hwpt->devices)) {
> > > > > > > + iopt_table_remove_domain(&hwpt->ioas->iopt,
> > > > > > > + hwpt->domain);
> > > > > > > + list_del(&hwpt->hwpt_item);
> > > > > > > + }
> > > > > >
> > > > > > I'm not sure how this can be fully shared between detach and
> replace.
> > > > > > Here some work e.g. above needs to be done before calling
> > > > > > iommu_group_replace_domain() while others can be done
> afterwards.
> > > > >
> > > > > This iopt_table_remove_domain/list_del is supposed to be done in
> > > > > the hwpt's destroy() actually. We couldn't move it because it'd
> > > > > need the new domain_alloc_user op and its implementation in ARM
> > > > > driver. Overall, I think it should be safe to put it behind the
> > > > > iommu_group_replace_domain().
> > > > >
> > > >
> > > > My confusion is that we have different flows between detach/attach
> > > > and replace.
> > > >
> > > > today with separate detach+attach we have following flow:
> > > >
> > > > Remove device from current hwpt;
> > > > if (last_device in hwpt) {
> > > > Remove hwpt domain from current iopt;
> > > > if (last_device in group)
> > > > detach group from hwpt domain;
> > > > }
> > > >
> > > > if (first device in group) {
> > > > attach group to new hwpt domain;
> > > > if (first_device in hwpt)
> > > > Add hwpt domain to new iopt;
> > > > Add device to new hwpt;
> > > >
> > > > but replace flow is different on the detach part:
> > > >
> > > > if (first device in group) {
> > > > replace group's domain from current hwpt to new hwpt;
> > > > if (first_device in hwpt)
> > > > Add hwpt domain to new iopt;
> > > > }
> > > >
> > > > Remove device from old hwpt;
> > > > if (last_device in old hwpt)
> > > > Remove hwpt domain from old iopt;
> > > >
> > > > Add device to new hwpt;
> > >
> > > Oh... thinking it carefully, I see the flow does look a bit off.
> > > Perhaps it's better to have a similar flow for replace.
> > >
> > > However, I think something would be still different due to its
> > > tricky nature, especially for a multi-device iommu_group.
> > >
> > > An iommu_group_detach happens only when a device is the last one
> > > in its group to go through the routine via a DETACH ioctl, while
> > > an iommu_group_replace_domain() happens only when the device is
> > > the first one in its group to go through the routine via another
> > > ATTACH ioctl. However, when the first device does a replace, the
> > > cleanup routine of the old hwpt is a NOP, since there are still
> > > other devices (same group) in the old hwpt. And two implications
> > > here:
> > > 1) Any other device in the same group has to forcibly switch to
> > > the new domain, when the first device does a replace.
> > > 2) The actual hwpt cleanup can only happen at the last device's
> > > replace call.
> > >
> > > This also means that kernel has to rely on the integrity of the
> > > user space that it must replace all active devices in the group:
> > >
> >
> > Jason suggested to move hwpt cleanup out of the detach path
> > in his reply to patch7. Presumably with that fix the major tricky
> > point about hwpt in following scenarios would disappear. Let's
> > see how it will work out then. ????
>
> What about point 1? If dev2 and dev3 are already replaced when
> doing iommu_group_replace_domain() on dev1, their idev objects
> still have the old hwpt/iopt until user space does another two
> IOCTLs on them, right?
>
> Should we only call iommu_group_replace_domain() when the last
> device in the group gets a replace IOCTL?
>

With Jason's proposal now the attach/detach paths only handle
the connection between device and hwpt. I don't see a problem
here with devices in a group attached to different hwpt's in this
short transition window.

There is no impact to ioas/iopt operations (though the user is not
expected to change it before the group transition is fully completed).
Just that the old hwpt cannot be destroyed until the last dev
detaches from it.

2023-02-13 08:35:04

by Lu Baolu

[permalink] [raw]
Subject: Re: [PATCH v2 02/10] iommu: Introduce a new iommu_group_replace_domain() API

On 2023/2/13 10:24, Tian, Kevin wrote:
>> From: Jason Gunthorpe<[email protected]>
>> Sent: Saturday, February 11, 2023 8:45 AM
>>
>> On Fri, Feb 10, 2023 at 04:51:10PM -0700, Alex Williamson wrote:
>>> On Tue, 7 Feb 2023 13:17:54 -0800
>>> Nicolin Chen<[email protected]> wrote:
>>>
>>>> qemu has a need to replace the translations associated with a domain
>>>> when the guest does large-scale operations like switching between an
>>>> IDENTITY domain and, say, dma-iommu.c.
>>>>
>>>> Currently, it does this by replacing all the mappings in a single
>>>> domain, but this is very inefficient and means that domains have to be
>>>> per-device rather than per-translation.
>>>>
>>>> Provide a high-level API to allow replacements of one domain with
>>>> another. This is similar to a detach/attach cycle except it doesn't
>>>> force the group to go to the blocking domain in-between.
>>>>
>>>> By removing this forced blocking domain the iommu driver has the
>>>> opportunity to implement an atomic replacement of the domains to the
>>>> greatest extent its hardware allows.
>>>>
>>>> It could be possible to adderss this by simply removing the protection
>>>> from the iommu_attach_group(), but it is not so clear if that is safe
>>>> for the few users. Thus, add a new API to serve this new purpose.
>>>>
>>>> Atomic replacement allows the qemu emulation of the viommu to be
>> more
>>>> complete, as real hardware has this ability.
>>> I was under the impression that we could not atomically switch a
>>> device's domain relative to in-flight DMA.
> it's possible as long as the mappings for in-flight DMA don't change
> in the transition.
>

It also requires the mappings in old and new domains are identical. In
another word, any IOVA should be translated to a same result no matter
through the old domain, the new domain, or hardware caches.

A similar replacement has been implemented in the code. For example,
the Intel IOMMU driver dynamically transforms a large range (2M or 1G)
mapping from discrete 4k pages to a super pages to improve the paging
cache efficiency.

Best regards,
baolu

2023-02-13 14:45:46

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 02/10] iommu: Introduce a new iommu_group_replace_domain() API

On Mon, Feb 13, 2023 at 02:24:40AM +0000, Tian, Kevin wrote:

> > This is because the cache tag and the io page table top are in
> > different 64 bit words so atomic writes don't cover both, and thus the
> > IOMMU HW could tear the two stores and mismatch the cache tag to the
> > table top. This would corrupt the cache.
>
> VT-d spec recommends using 128bit cmpxchg instruction to update
> page table pointer and DID together.

Oh really? Such a thing exists? That neat!

Jason

2023-02-13 14:50:43

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

On Sun, Feb 12, 2023 at 11:48:30PM -0800, Nicolin Chen wrote:

> What about point 1? If dev2 and dev3 are already replaced when
> doing iommu_group_replace_domain() on dev1, their idev objects
> still have the old hwpt/iopt until user space does another two
> IOCTLs on them, right?

We have a complicated model for multi-device groups...

The first device in the group to change domains must move all the
devices in the group

But userspace is still expected to run through and change all the
other devices

So replace should be a NOP if the group is already linked to the right
domain.

This patch needs to make sure that incosistency in the view betwen the
iommu_group and the iommufd model doesn't cause a functional
problem.

Jason

2023-02-14 03:29:37

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 02/10] iommu: Introduce a new iommu_group_replace_domain() API

> From: Jason Gunthorpe <[email protected]>
> Sent: Monday, February 13, 2023 10:46 PM
>
> On Mon, Feb 13, 2023 at 02:24:40AM +0000, Tian, Kevin wrote:
>
> > > This is because the cache tag and the io page table top are in
> > > different 64 bit words so atomic writes don't cover both, and thus the
> > > IOMMU HW could tear the two stores and mismatch the cache tag to the
> > > table top. This would corrupt the cache.
> >
> > VT-d spec recommends using 128bit cmpxchg instruction to update
> > page table pointer and DID together.
>
> Oh really? Such a thing exists? That neat!
>

yes. see cmpxchg_double().

2023-02-14 10:54:35

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

On Mon, Feb 13, 2023 at 10:49:34AM -0400, Jason Gunthorpe wrote:
> On Sun, Feb 12, 2023 at 11:48:30PM -0800, Nicolin Chen wrote:
>
> > What about point 1? If dev2 and dev3 are already replaced when
> > doing iommu_group_replace_domain() on dev1, their idev objects
> > still have the old hwpt/iopt until user space does another two
> > IOCTLs on them, right?
>
> We have a complicated model for multi-device groups...
>
> The first device in the group to change domains must move all the
> devices in the group
>
> But userspace is still expected to run through and change all the
> other devices
>
> So replace should be a NOP if the group is already linked to the right
> domain.
>
> This patch needs to make sure that incosistency in the view betwen the
> iommu_group and the iommufd model doesn't cause a functional
> problem.

Yea, I was thinking that we'd need to block any access to the
idev->hwpt of a pending device's, before the kernel finishes
the "NOP" IOCTL from userspace, maybe with a helper:
(iommu_get_domain_for_dev(idev->dev) != idev->hwpt->domain)

Thanks
Nic

2023-02-14 10:59:40

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:

> My confusion is that we have different flows between detach/attach
> and replace.
>
> today with separate detach+attach we have following flow:
>
> Remove device from current hwpt;
> if (last_device in hwpt) {
> Remove hwpt domain from current iopt;
> if (last_device in group)
> detach group from hwpt domain;
> }
>
> if (first device in group) {
> attach group to new hwpt domain;
> if (first_device in hwpt)
> Add hwpt domain to new iopt;
> Add device to new hwpt;
>
> but replace flow is different on the detach part:
>
> if (first device in group) {
> replace group's domain from current hwpt to new hwpt;
> if (first_device in hwpt)
> Add hwpt domain to new iopt;
> }
>
> Remove device from old hwpt;
> if (last_device in old hwpt)
> Remove hwpt domain from old iopt;
>
> Add device to new hwpt;
>
> I'm yet to figure out whether we have sufficient lock protection to
> prevent other paths from using old iopt/hwpt to find the device
> which is already attached to a different domain.

With Jason's new series, the detach() routine is lighter now.

I wonder if it'd be safer now to do the detach() call after
iommu_group_replace_domain()?

Thanks
Nic

@@ -196,10 +198,41 @@ static bool iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt,
return false;
}

+/**
+ * __iommufd_device_detach - Detach a device from idev->hwpt
+ * @idev: device to detach
+ * @detach_group: flag to call iommu_detach_group
+ *
+ * This is a cleanup helper shared by the replace and detach routines. Comparing
+ * to a detach routine, a replace call does not need the iommu_detach_group().
+ */
+static void __iommufd_device_detach(struct iommufd_device *idev,
+ bool detach_group)
+{
+ struct iommufd_hw_pagetable *hwpt = idev->hwpt;
+
+ mutex_lock(&hwpt->devices_lock);
+ list_del(&idev->devices_item);
+ if (detach_group && !iommufd_hw_pagetable_has_group(hwpt, idev->group))
+ iommu_detach_group(hwpt->domain, idev->group);
+ iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
+ mutex_unlock(&hwpt->devices_lock);
+
+ if (hwpt->auto_domain)
+ iommufd_object_destroy_user(idev->ictx, &hwpt->obj);
+ else
+ refcount_dec(&hwpt->obj.users);
+
+ idev->hwpt = NULL;
+
+ refcount_dec(&idev->obj.users);
+}
+
/* On success this consumes a hwpt reference from the caller */
static int iommufd_device_do_attach(struct iommufd_device *idev,
struct iommufd_hw_pagetable *hwpt)
{
+ struct iommufd_hw_pagetable *cur_hwpt = idev->hwpt;
phys_addr_t sw_msi_start = PHYS_ADDR_MAX;
int rc;

@@ -237,7 +270,7 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
* the group once for the first device that is in the group.
*/
if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
- rc = iommu_attach_group(hwpt->domain, idev->group);
+ rc = iommu_group_replace_domain(idev->group, hwpt->domain);
if (rc)
goto out_iova;

@@ -249,6 +282,10 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
}
}

+ /* Detach from the cur_hwpt without iommu_detach_group() */
+ if (cur_hwpt)
+ __iommufd_device_detach(idev, false);
+
idev->hwpt = hwpt;
/* The HWPT reference from the caller is moved to this list */
list_add(&idev->devices_item, &hwpt->devices);

2023-02-15 01:37:33

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

> From: Nicolin Chen <[email protected]>
> Sent: Tuesday, February 14, 2023 6:54 PM
>
> On Mon, Feb 13, 2023 at 10:49:34AM -0400, Jason Gunthorpe wrote:
> > On Sun, Feb 12, 2023 at 11:48:30PM -0800, Nicolin Chen wrote:
> >
> > > What about point 1? If dev2 and dev3 are already replaced when
> > > doing iommu_group_replace_domain() on dev1, their idev objects
> > > still have the old hwpt/iopt until user space does another two
> > > IOCTLs on them, right?
> >
> > We have a complicated model for multi-device groups...
> >
> > The first device in the group to change domains must move all the
> > devices in the group
> >
> > But userspace is still expected to run through and change all the
> > other devices
> >
> > So replace should be a NOP if the group is already linked to the right
> > domain.
> >
> > This patch needs to make sure that incosistency in the view betwen the
> > iommu_group and the iommufd model doesn't cause a functional
> > problem.
>
> Yea, I was thinking that we'd need to block any access to the
> idev->hwpt of a pending device's, before the kernel finishes
> the "NOP" IOCTL from userspace, maybe with a helper:
> (iommu_get_domain_for_dev(idev->dev) != idev->hwpt->domain)
>

I didn't see what would be broken w/o such blocking measure.

Can you elaborate?

2023-02-15 01:38:59

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

> From: Nicolin Chen <[email protected]>
> Sent: Tuesday, February 14, 2023 7:00 PM
>
> On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:
>
> > My confusion is that we have different flows between detach/attach
> > and replace.
> >
> > today with separate detach+attach we have following flow:
> >
> > Remove device from current hwpt;
> > if (last_device in hwpt) {
> > Remove hwpt domain from current iopt;
> > if (last_device in group)
> > detach group from hwpt domain;
> > }
> >
> > if (first device in group) {
> > attach group to new hwpt domain;
> > if (first_device in hwpt)
> > Add hwpt domain to new iopt;
> > Add device to new hwpt;
> >
> > but replace flow is different on the detach part:
> >
> > if (first device in group) {
> > replace group's domain from current hwpt to new hwpt;
> > if (first_device in hwpt)
> > Add hwpt domain to new iopt;
> > }
> >
> > Remove device from old hwpt;
> > if (last_device in old hwpt)
> > Remove hwpt domain from old iopt;
> >
> > Add device to new hwpt;
> >
> > I'm yet to figure out whether we have sufficient lock protection to
> > prevent other paths from using old iopt/hwpt to find the device
> > which is already attached to a different domain.
>
> With Jason's new series, the detach() routine is lighter now.
>
> I wonder if it'd be safer now to do the detach() call after
> iommu_group_replace_domain()?
>

yes, looks so.

2023-02-15 01:58:56

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

On Wed, Feb 15, 2023 at 01:37:19AM +0000, Tian, Kevin wrote:

> > From: Nicolin Chen <[email protected]>
> > Sent: Tuesday, February 14, 2023 6:54 PM
> >
> > On Mon, Feb 13, 2023 at 10:49:34AM -0400, Jason Gunthorpe wrote:
> > > On Sun, Feb 12, 2023 at 11:48:30PM -0800, Nicolin Chen wrote:
> > >
> > > > What about point 1? If dev2 and dev3 are already replaced when
> > > > doing iommu_group_replace_domain() on dev1, their idev objects
> > > > still have the old hwpt/iopt until user space does another two
> > > > IOCTLs on them, right?
> > >
> > > We have a complicated model for multi-device groups...
> > >
> > > The first device in the group to change domains must move all the
> > > devices in the group
> > >
> > > But userspace is still expected to run through and change all the
> > > other devices
> > >
> > > So replace should be a NOP if the group is already linked to the right
> > > domain.
> > >
> > > This patch needs to make sure that incosistency in the view betwen the
> > > iommu_group and the iommufd model doesn't cause a functional
> > > problem.
> >
> > Yea, I was thinking that we'd need to block any access to the
> > idev->hwpt of a pending device's, before the kernel finishes
> > the "NOP" IOCTL from userspace, maybe with a helper:
> > (iommu_get_domain_for_dev(idev->dev) != idev->hwpt->domain)
> >
>
> I didn't see what would be broken w/o such blocking measure.
>
> Can you elaborate?

iommu_group { idev1, idev2 }

(1) Attach all devices to domain1 {
group->domain = domain1;
idev1->hwpt = hwpt1; // domain1
idev2->hwpt = hwpt1; // domain1
}

(2) Attach (replace) idev1 only to domain2 {
group->domain = domain2
idev1->hwpt = hwpt2; // domain2
idev2->hwpt == domain1 // != iommu_get_domain_for_dev()
}

Then if user space isn't aware of these and continues to do
IOMMU_IOAS_MAP for idev2. Though IOVA mappings may be added
onto the domain2 correctly, yet domain2's iopt itree won't
reflect that, until idev2->hwpt is updated too, right?

And the tricky thing is that, though we advocate a device-
centric uAPI, we'd still have to ask user space to align the
devices in the same iommu_group, which is also not present
in QEMU's RFC v3 series.

The traditional detach+attach flow doesn't seem to have this
issue, since there's no re-entry so the work flow is always
that detaching all devices first before attaching to another
domain.

Thanks
Nic

2023-02-15 02:16:12

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, February 15, 2023 9:59 AM
>
> On Wed, Feb 15, 2023 at 01:37:19AM +0000, Tian, Kevin wrote:
>
> > > From: Nicolin Chen <[email protected]>
> > > Sent: Tuesday, February 14, 2023 6:54 PM
> > >
> > > On Mon, Feb 13, 2023 at 10:49:34AM -0400, Jason Gunthorpe wrote:
> > > > On Sun, Feb 12, 2023 at 11:48:30PM -0800, Nicolin Chen wrote:
> > > >
> > > > > What about point 1? If dev2 and dev3 are already replaced when
> > > > > doing iommu_group_replace_domain() on dev1, their idev objects
> > > > > still have the old hwpt/iopt until user space does another two
> > > > > IOCTLs on them, right?
> > > >
> > > > We have a complicated model for multi-device groups...
> > > >
> > > > The first device in the group to change domains must move all the
> > > > devices in the group
> > > >
> > > > But userspace is still expected to run through and change all the
> > > > other devices
> > > >
> > > > So replace should be a NOP if the group is already linked to the right
> > > > domain.
> > > >
> > > > This patch needs to make sure that incosistency in the view betwen the
> > > > iommu_group and the iommufd model doesn't cause a functional
> > > > problem.
> > >
> > > Yea, I was thinking that we'd need to block any access to the
> > > idev->hwpt of a pending device's, before the kernel finishes
> > > the "NOP" IOCTL from userspace, maybe with a helper:
> > > (iommu_get_domain_for_dev(idev->dev) != idev->hwpt->domain)
> > >
> >
> > I didn't see what would be broken w/o such blocking measure.
> >
> > Can you elaborate?
>
> iommu_group { idev1, idev2 }
>
> (1) Attach all devices to domain1 {
> group->domain = domain1;
> idev1->hwpt = hwpt1; // domain1
> idev2->hwpt = hwpt1; // domain1
> }
>
> (2) Attach (replace) idev1 only to domain2 {
> group->domain = domain2
> idev1->hwpt = hwpt2; // domain2
> idev2->hwpt == domain1 // != iommu_get_domain_for_dev()
> }
>
> Then if user space isn't aware of these and continues to do
> IOMMU_IOAS_MAP for idev2. Though IOVA mappings may be added
> onto the domain2 correctly, yet domain2's iopt itree won't
> reflect that, until idev2->hwpt is updated too, right?

IOMMU_IOAS_MAP is for ioas instead of for device.

even w/o any device attached we still allow ioas map.

also note in case of domain1 still actively attached to idev3 in
another group, banning IOAS_MAP due to idev2 in transition
is also incorrect for idev3.

>
> And the tricky thing is that, though we advocate a device-
> centric uAPI, we'd still have to ask user space to align the
> devices in the same iommu_group, which is also not present
> in QEMU's RFC v3 series.

IMHO this requirement has been stated clearly from the start
when designing this device centric interface. QEMU should be
improved to take care of it.


2023-02-15 06:10:57

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 02/10] iommu: Introduce a new iommu_group_replace_domain() API

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, February 8, 2023 5:18 AM
>
> +int iommu_group_replace_domain(struct iommu_group *group,
> + struct iommu_domain *new_domain)
> +{
> + int ret;
> +
> + if (!new_domain)
> + return -EINVAL;
> +
> + mutex_lock(&group->mutex);
> + ret = __iommu_group_set_domain(group, new_domain);
> + if (ret)
> + __iommu_group_set_domain(group, group->domain);

Just realize the error unwind is a nop given below:

__iommu_group_set_domain()
{
if (group->domain == new_domain)
return 0;

...

There was an attempt [1] to fix error unwind in iommu_attach_group(), by
temporarily set group->domain to NULL before calling set_domain().

Jason, I wonder why this recovering cannot be done in
__iommu_group_set_domain() directly, e.g.:

ret = __iommu_group_for_each_dev(group, new_domain,
iommu_group_do_attach_device);
if (ret) {
__iommu_group_for_each_dev(group, group->domain,
iommu_group_do_attach_device);
return ret;
}
group->domain = new_domain;

Thanks,
Kevin

[1] https://lore.kernel.org/linux-iommu/[email protected]/

2023-02-15 07:15:40

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

On Wed, Feb 15, 2023 at 02:15:59AM +0000, Tian, Kevin wrote:
> External email: Use caution opening links or attachments
>
>
> > From: Nicolin Chen <[email protected]>
> > Sent: Wednesday, February 15, 2023 9:59 AM
> >
> > On Wed, Feb 15, 2023 at 01:37:19AM +0000, Tian, Kevin wrote:
> >
> > > > From: Nicolin Chen <[email protected]>
> > > > Sent: Tuesday, February 14, 2023 6:54 PM
> > > >
> > > > On Mon, Feb 13, 2023 at 10:49:34AM -0400, Jason Gunthorpe wrote:
> > > > > On Sun, Feb 12, 2023 at 11:48:30PM -0800, Nicolin Chen wrote:
> > > > >
> > > > > > What about point 1? If dev2 and dev3 are already replaced when
> > > > > > doing iommu_group_replace_domain() on dev1, their idev objects
> > > > > > still have the old hwpt/iopt until user space does another two
> > > > > > IOCTLs on them, right?
> > > > >
> > > > > We have a complicated model for multi-device groups...
> > > > >
> > > > > The first device in the group to change domains must move all the
> > > > > devices in the group
> > > > >
> > > > > But userspace is still expected to run through and change all the
> > > > > other devices
> > > > >
> > > > > So replace should be a NOP if the group is already linked to the right
> > > > > domain.
> > > > >
> > > > > This patch needs to make sure that incosistency in the view betwen the
> > > > > iommu_group and the iommufd model doesn't cause a functional
> > > > > problem.
> > > >
> > > > Yea, I was thinking that we'd need to block any access to the
> > > > idev->hwpt of a pending device's, before the kernel finishes
> > > > the "NOP" IOCTL from userspace, maybe with a helper:
> > > > (iommu_get_domain_for_dev(idev->dev) != idev->hwpt->domain)
> > > >
> > >
> > > I didn't see what would be broken w/o such blocking measure.
> > >
> > > Can you elaborate?
> >
> > iommu_group { idev1, idev2 }
> >
> > (1) Attach all devices to domain1 {
> > group->domain = domain1;
> > idev1->hwpt = hwpt1; // domain1
> > idev2->hwpt = hwpt1; // domain1
> > }
> >
> > (2) Attach (replace) idev1 only to domain2 {
> > group->domain = domain2
> > idev1->hwpt = hwpt2; // domain2
> > idev2->hwpt == domain1 // != iommu_get_domain_for_dev()
> > }
> >
> > Then if user space isn't aware of these and continues to do
> > IOMMU_IOAS_MAP for idev2. Though IOVA mappings may be added
> > onto the domain2 correctly, yet domain2's iopt itree won't
> > reflect that, until idev2->hwpt is updated too, right?
>
> IOMMU_IOAS_MAP is for ioas instead of for device.
>
> even w/o any device attached we still allow ioas map.
>
> also note in case of domain1 still actively attached to idev3 in
> another group, banning IOAS_MAP due to idev2 in transition
> is also incorrect for idev3.

OK. That's likely true. It doesn't seem to be correct to block an
IOMMU_IOAS_MAP.

But things will be out of control, if user space continues mapping
something onto domain1's iopt for idev2, even after it is attached
covertly to domain2's iopt by the replace routine. I wonder how
kernel should handle this and keep the consistency between IOMMUFD
objects and iommu_group.

> > And the tricky thing is that, though we advocate a device-
> > centric uAPI, we'd still have to ask user space to align the
> > devices in the same iommu_group, which is also not present
> > in QEMU's RFC v3 series.
>
> IMHO this requirement has been stated clearly from the start
> when designing this device centric interface. QEMU should be
> improved to take care of it.

OK. It has to be so...

Thanks
Nic

2023-02-15 07:16:55

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

On Wed, Feb 15, 2023 at 01:38:32AM +0000, Tian, Kevin wrote:
> External email: Use caution opening links or attachments
>
>
> > From: Nicolin Chen <[email protected]>
> > Sent: Tuesday, February 14, 2023 7:00 PM
> >
> > On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:
> >
> > > My confusion is that we have different flows between detach/attach
> > > and replace.
> > >
> > > today with separate detach+attach we have following flow:
> > >
> > > Remove device from current hwpt;
> > > if (last_device in hwpt) {
> > > Remove hwpt domain from current iopt;
> > > if (last_device in group)
> > > detach group from hwpt domain;
> > > }
> > >
> > > if (first device in group) {
> > > attach group to new hwpt domain;
> > > if (first_device in hwpt)
> > > Add hwpt domain to new iopt;
> > > Add device to new hwpt;
> > >
> > > but replace flow is different on the detach part:
> > >
> > > if (first device in group) {
> > > replace group's domain from current hwpt to new hwpt;
> > > if (first_device in hwpt)
> > > Add hwpt domain to new iopt;
> > > }
> > >
> > > Remove device from old hwpt;
> > > if (last_device in old hwpt)
> > > Remove hwpt domain from old iopt;
> > >
> > > Add device to new hwpt;
> > >
> > > I'm yet to figure out whether we have sufficient lock protection to
> > > prevent other paths from using old iopt/hwpt to find the device
> > > which is already attached to a different domain.
> >
> > With Jason's new series, the detach() routine is lighter now.
> >
> > I wonder if it'd be safer now to do the detach() call after
> > iommu_group_replace_domain()?
> >
>
> yes, looks so.

Will rebase a v3 once Jason updates his series. Thanks!

2023-02-15 07:24:46

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, February 15, 2023 3:15 PM
>
> But things will be out of control, if user space continues mapping
> something onto domain1's iopt for idev2, even after it is attached
> covertly to domain2's iopt by the replace routine. I wonder how
> kernel should handle this and keep the consistency between IOMMUFD
> objects and iommu_group.
>

this is where I don't understand.

domain mapping just reflects what an address space has.

Take Qemu for example. w/o vIOMMU domain mappings is added/
removed according to the change in the GPA address space.

w/ vIOMMU then it is synced with guest page table.

why would userspace construct mappings for a specific device?

when device is moved from one domain to another domain, it just
bears with whatever the new domain allows it to access.


2023-02-15 12:51:25

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

On Tue, Feb 14, 2023 at 11:15:09PM -0800, Nicolin Chen wrote:

> But things will be out of control, if user space continues mapping
> something onto domain1's iopt for idev2, even after it is attached
> covertly to domain2's iopt by the replace routine. I wonder how
> kernel should handle this and keep the consistency between IOMMUFD
> objects and iommu_group.

I've been looking at this, the reason the locking is such a PITA is
because we are still trying to use the device list short cut.

We need to have a iommu group object instead then everything will work
smoothly.

Jason

2023-02-15 12:52:58

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 02/10] iommu: Introduce a new iommu_group_replace_domain() API

On Wed, Feb 15, 2023 at 06:10:47AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <[email protected]>
> > Sent: Wednesday, February 8, 2023 5:18 AM
> >
> > +int iommu_group_replace_domain(struct iommu_group *group,
> > + struct iommu_domain *new_domain)
> > +{
> > + int ret;
> > +
> > + if (!new_domain)
> > + return -EINVAL;
> > +
> > + mutex_lock(&group->mutex);
> > + ret = __iommu_group_set_domain(group, new_domain);
> > + if (ret)
> > + __iommu_group_set_domain(group, group->domain);
>
> Just realize the error unwind is a nop given below:
>
> __iommu_group_set_domain()
> {
> if (group->domain == new_domain)
> return 0;
>
> ...
>
> There was an attempt [1] to fix error unwind in iommu_attach_group(), by
> temporarily set group->domain to NULL before calling set_domain().
>
> Jason, I wonder why this recovering cannot be done in
> __iommu_group_set_domain() directly, e.g.:
>
> ret = __iommu_group_for_each_dev(group, new_domain,
> iommu_group_do_attach_device);
> if (ret) {
> __iommu_group_for_each_dev(group, group->domain,
> iommu_group_do_attach_device);
> return ret;
> }
> group->domain = new_domain;

We talked about this already, some times this is not the correct
recovery case, eg if we are going to a blocking domain we need to drop
all references to the prior domain, not put them back.

Failures are WARN_ON events not error recovery.

Jason

2023-02-22 02:12:03

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 02/10] iommu: Introduce a new iommu_group_replace_domain() API

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, February 15, 2023 8:53 PM
>
> On Wed, Feb 15, 2023 at 06:10:47AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <[email protected]>
> > > Sent: Wednesday, February 8, 2023 5:18 AM
> > >
> > > +int iommu_group_replace_domain(struct iommu_group *group,
> > > + struct iommu_domain *new_domain)
> > > +{
> > > + int ret;
> > > +
> > > + if (!new_domain)
> > > + return -EINVAL;
> > > +
> > > + mutex_lock(&group->mutex);
> > > + ret = __iommu_group_set_domain(group, new_domain);
> > > + if (ret)
> > > + __iommu_group_set_domain(group, group->domain);
> >
> > Just realize the error unwind is a nop given below:
> >
> > __iommu_group_set_domain()
> > {
> > if (group->domain == new_domain)
> > return 0;
> >
> > ...
> >
> > There was an attempt [1] to fix error unwind in iommu_attach_group(), by
> > temporarily set group->domain to NULL before calling set_domain().
> >
> > Jason, I wonder why this recovering cannot be done in
> > __iommu_group_set_domain() directly, e.g.:
> >
> > ret = __iommu_group_for_each_dev(group, new_domain,
> > iommu_group_do_attach_device);
> > if (ret) {
> > __iommu_group_for_each_dev(group, group->domain,
> > iommu_group_do_attach_device);
> > return ret;
> > }
> > group->domain = new_domain;
>
> We talked about this already, some times this is not the correct
> recovery case, eg if we are going to a blocking domain we need to drop
> all references to the prior domain, not put them back.
>
> Failures are WARN_ON events not error recovery.
>

OK, I remember that. Then here looks we also need temporarily
set group->domain to NULL before calling set_domain() to recover,
as [1] does.

[1] https://lore.kernel.org/linux-iommu/[email protected]/

2023-02-24 00:57:55

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 02/10] iommu: Introduce a new iommu_group_replace_domain() API

On Wed, Feb 22, 2023 at 02:11:39AM +0000, Tian, Kevin wrote:

> > > There was an attempt [1] to fix error unwind in iommu_attach_group(), by
> > > temporarily set group->domain to NULL before calling set_domain().
> > >
> > > Jason, I wonder why this recovering cannot be done in
> > > __iommu_group_set_domain() directly, e.g.:
> > >
> > > ret = __iommu_group_for_each_dev(group, new_domain,
> > > iommu_group_do_attach_device);
> > > if (ret) {
> > > __iommu_group_for_each_dev(group, group->domain,
> > > iommu_group_do_attach_device);
> > > return ret;
> > > }
> > > group->domain = new_domain;
> >
> > We talked about this already, some times this is not the correct
> > recovery case, eg if we are going to a blocking domain we need to drop
> > all references to the prior domain, not put them back.
> >
> > Failures are WARN_ON events not error recovery.
> >
>
> OK, I remember that. Then here looks we also need temporarily
> set group->domain to NULL before calling set_domain() to recover,
> as [1] does.

Sigh, this is too much.

I made a series to clean up all the domain attach logic so the error
handling is all in one place and all the same.

What do you think?

https://github.com/jgunthorpe/linux/commits/iommufd_hwpt

Jason

2023-02-24 08:07:53

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 02/10] iommu: Introduce a new iommu_group_replace_domain() API

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, February 24, 2023 8:58 AM
>
> On Wed, Feb 22, 2023 at 02:11:39AM +0000, Tian, Kevin wrote:
>
> > > > There was an attempt [1] to fix error unwind in iommu_attach_group(),
> by
> > > > temporarily set group->domain to NULL before calling set_domain().
> > > >
> > > > Jason, I wonder why this recovering cannot be done in
> > > > __iommu_group_set_domain() directly, e.g.:
> > > >
> > > > ret = __iommu_group_for_each_dev(group, new_domain,
> > > > iommu_group_do_attach_device);
> > > > if (ret) {
> > > > __iommu_group_for_each_dev(group, group->domain,
> > > > iommu_group_do_attach_device);
> > > > return ret;
> > > > }
> > > > group->domain = new_domain;
> > >
> > > We talked about this already, some times this is not the correct
> > > recovery case, eg if we are going to a blocking domain we need to drop
> > > all references to the prior domain, not put them back.
> > >
> > > Failures are WARN_ON events not error recovery.
> > >
> >
> > OK, I remember that. Then here looks we also need temporarily
> > set group->domain to NULL before calling set_domain() to recover,
> > as [1] does.
>
> Sigh, this is too much.
>
> I made a series to clean up all the domain attach logic so the error
> handling is all in one place and all the same.
>
> What do you think?
>
> https://github.com/jgunthorpe/linux/commits/iommufd_hwpt
>

Yeah, that sounds a right cleanup at a glance.