2021-04-09 03:46:54

by Shenming Lu

[permalink] [raw]
Subject: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough

Hi,

Requesting for your comments and suggestions. :-)

The static pinning and mapping problem in VFIO and possible solutions
have been discussed a lot [1, 2]. One of the solutions is to add I/O
Page Fault support for VFIO devices. Different from those relatively
complicated software approaches such as presenting a vIOMMU that provides
the DMA buffer information (might include para-virtualized optimizations),
IOPF mainly depends on the hardware faulting capability, such as the PCIe
PRI extension or Arm SMMU stall model. What's more, the IOPF support in
the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
support for VFIO passthrough based on the IOPF part of SVA in this series.

We have measured its performance with UADK [4] (passthrough an accelerator
to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):

Run hisi_sec_test...
- with varying sending times and message lengths
- with/without IOPF enabled (speed slowdown)

when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
slowdown (num of faults)
times VFIO IOPF host SVA
1 63.4% (518) 82.8% (512)
100 22.9% (1058) 47.9% (1024)
1000 2.6% (1071) 8.5% (1024)

when msg_len = 10MB (and PREMAP_LEN = 512):
slowdown (num of faults)
times VFIO IOPF
1 32.6% (13)
100 3.5% (26)
1000 1.6% (26)

History:

v2 -> v3
- Nit fixes.
- No reason to disable reporting the unrecoverable faults. (baolu)
- Maintain a global IOPF enabled group list.
- Split the pre-mapping optimization to be a separate patch.
- Add selective faulting support (use vfio_pin_pages to indicate the
non-faultable scope and add a new struct vfio_range to record it,
untested). (Kevin)

v1 -> v2
- Numerous improvements following the suggestions. Thanks a lot to all
of you.

Note that PRI is not supported at the moment since there is no hardware.

Links:
[1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS,
2016.
[2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer Tracking
for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020.
[3] https://patchwork.kernel.org/project/linux-arm-kernel/cover/[email protected]/
[4] https://github.com/Linaro/uadk

Thanks,
Shenming


Shenming Lu (8):
iommu: Evolve the device fault reporting framework
vfio/type1: Add a page fault handler
vfio/type1: Add an MMU notifier to avoid pinning
vfio/type1: Pre-map more pages than requested in the IOPF handling
vfio/type1: VFIO_IOMMU_ENABLE_IOPF
vfio/type1: No need to statically pin and map if IOPF enabled
vfio/type1: Add selective DMA faulting support
vfio: Add nested IOPF support

.../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 3 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 +-
drivers/iommu/iommu.c | 56 +-
drivers/vfio/vfio.c | 85 +-
drivers/vfio/vfio_iommu_type1.c | 1000 ++++++++++++++++-
include/linux/iommu.h | 19 +-
include/linux/vfio.h | 13 +
include/uapi/linux/iommu.h | 4 +
include/uapi/linux/vfio.h | 6 +
9 files changed, 1181 insertions(+), 23 deletions(-)

--
2.19.1


2021-04-09 03:46:57

by Shenming Lu

[permalink] [raw]
Subject: [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF

Since enabling IOPF for devices may lead to a slow ramp up of performance,
we add an ioctl VFIO_IOMMU_ENABLE_IOPF to make it configurable. And the
IOPF enabling of a VFIO device includes setting IOMMU_DEV_FEAT_IOPF and
registering the VFIO IOPF handler.

Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may be
inflight page faults when disabling.

Signed-off-by: Shenming Lu <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 223 +++++++++++++++++++++++++++++++-
include/uapi/linux/vfio.h | 6 +
2 files changed, 226 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 01e296c6dc9e..7df5711e743a 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -71,6 +71,7 @@ struct vfio_iommu {
struct rb_root dma_list;
struct blocking_notifier_head notifier;
struct mmu_notifier mn;
+ struct mm_struct *mm;
unsigned int dma_avail;
unsigned int vaddr_invalid_count;
uint64_t pgsize_bitmap;
@@ -81,6 +82,7 @@ struct vfio_iommu {
bool dirty_page_tracking;
bool pinned_page_dirty_scope;
bool container_open;
+ bool iopf_enabled;
};

struct vfio_domain {
@@ -461,6 +463,38 @@ vfio_find_iopf_group(struct iommu_group *iommu_group)
return node ? iopf_group : NULL;
}

+static void vfio_link_iopf_group(struct vfio_iopf_group *new)
+{
+ struct rb_node **link, *parent = NULL;
+ struct vfio_iopf_group *iopf_group;
+
+ mutex_lock(&iopf_group_list_lock);
+
+ link = &iopf_group_list.rb_node;
+
+ while (*link) {
+ parent = *link;
+ iopf_group = rb_entry(parent, struct vfio_iopf_group, node);
+
+ if (new->iommu_group < iopf_group->iommu_group)
+ link = &(*link)->rb_left;
+ else
+ link = &(*link)->rb_right;
+ }
+
+ rb_link_node(&new->node, parent, link);
+ rb_insert_color(&new->node, &iopf_group_list);
+
+ mutex_unlock(&iopf_group_list_lock);
+}
+
+static void vfio_unlink_iopf_group(struct vfio_iopf_group *old)
+{
+ mutex_lock(&iopf_group_list_lock);
+ rb_erase(&old->node, &iopf_group_list);
+ mutex_unlock(&iopf_group_list_lock);
+}
+
static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
{
struct mm_struct *mm;
@@ -2363,6 +2397,68 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
list_splice_tail(iova_copy, iova);
}

+static int vfio_dev_domian_nested(struct device *dev, int *nested)
+{
+ struct iommu_domain *domain;
+
+ domain = iommu_get_domain_for_dev(dev);
+ if (!domain)
+ return -ENODEV;
+
+ return iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, nested);
+}
+
+static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data);
+
+static int dev_enable_iopf(struct device *dev, void *data)
+{
+ int *enabled_dev_cnt = data;
+ int nested;
+ u32 flags;
+ int ret;
+
+ ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);
+ if (ret)
+ return ret;
+
+ ret = vfio_dev_domian_nested(dev, &nested);
+ if (ret)
+ goto out_disable;
+
+ if (nested)
+ flags = FAULT_REPORT_NESTED_L2;
+ else
+ flags = FAULT_REPORT_FLAT;
+
+ ret = iommu_register_device_fault_handler(dev,
+ vfio_iommu_type1_dma_map_iopf, flags, dev);
+ if (ret)
+ goto out_disable;
+
+ (*enabled_dev_cnt)++;
+ return 0;
+
+out_disable:
+ iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF);
+ return ret;
+}
+
+static int dev_disable_iopf(struct device *dev, void *data)
+{
+ int *enabled_dev_cnt = data;
+
+ if (enabled_dev_cnt && *enabled_dev_cnt <= 0)
+ return -1;
+
+ WARN_ON(iommu_unregister_device_fault_handler(dev));
+ WARN_ON(iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF));
+
+ if (enabled_dev_cnt)
+ (*enabled_dev_cnt)--;
+
+ return 0;
+}
+
static int vfio_iommu_type1_attach_group(void *iommu_data,
struct iommu_group *iommu_group)
{
@@ -2376,6 +2472,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
struct iommu_domain_geometry geo;
LIST_HEAD(iova_copy);
LIST_HEAD(group_resv_regions);
+ int iopf_enabled_dev_cnt = 0;
+ struct vfio_iopf_group *iopf_group = NULL;

mutex_lock(&iommu->lock);

@@ -2453,6 +2551,24 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
if (ret)
goto out_domain;

+ if (iommu->iopf_enabled) {
+ ret = iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
+ dev_enable_iopf);
+ if (ret)
+ goto out_detach;
+
+ iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
+ if (!iopf_group) {
+ ret = -ENOMEM;
+ goto out_detach;
+ }
+
+ iopf_group->iommu_group = iommu_group;
+ iopf_group->iommu = iommu;
+
+ vfio_link_iopf_group(iopf_group);
+ }
+
/* Get aperture info */
iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo);

@@ -2534,9 +2650,11 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
vfio_test_domain_fgsp(domain);

/* replay mappings on new domains */
- ret = vfio_iommu_replay(iommu, domain);
- if (ret)
- goto out_detach;
+ if (!iommu->iopf_enabled) {
+ ret = vfio_iommu_replay(iommu, domain);
+ if (ret)
+ goto out_detach;
+ }

if (resv_msi) {
ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
@@ -2567,6 +2685,15 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
iommu_domain_free(domain->domain);
vfio_iommu_iova_free(&iova_copy);
vfio_iommu_resv_free(&group_resv_regions);
+ if (iommu->iopf_enabled) {
+ if (iopf_group) {
+ vfio_unlink_iopf_group(iopf_group);
+ kfree(iopf_group);
+ }
+
+ iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
+ dev_disable_iopf);
+ }
out_free:
kfree(domain);
kfree(group);
@@ -2728,6 +2855,19 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
if (!group)
continue;

+ if (iommu->iopf_enabled) {
+ struct vfio_iopf_group *iopf_group;
+
+ iopf_group = vfio_find_iopf_group(iommu_group);
+ if (!WARN_ON(!iopf_group)) {
+ vfio_unlink_iopf_group(iopf_group);
+ kfree(iopf_group);
+ }
+
+ iommu_group_for_each_dev(iommu_group, NULL,
+ dev_disable_iopf);
+ }
+
vfio_iommu_detach_group(domain, group);
update_dirty_scope = !group->pinned_page_dirty_scope;
list_del(&group->next);
@@ -2846,6 +2986,11 @@ static void vfio_iommu_type1_release(void *iommu_data)

vfio_iommu_iova_free(&iommu->iova_list);

+ if (iommu->iopf_enabled) {
+ mmu_notifier_unregister(&iommu->mn, iommu->mm);
+ mmdrop(iommu->mm);
+ }
+
kfree(iommu);
}

@@ -3441,6 +3586,76 @@ static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
.invalidate_range = mn_invalidate_range,
};

+static int vfio_iommu_type1_enable_iopf(struct vfio_iommu *iommu)
+{
+ struct vfio_domain *d;
+ struct vfio_group *g;
+ struct vfio_iopf_group *iopf_group;
+ int enabled_dev_cnt = 0;
+ int ret;
+
+ if (!current->mm)
+ return -ENODEV;
+
+ mutex_lock(&iommu->lock);
+
+ mmgrab(current->mm);
+ iommu->mm = current->mm;
+ iommu->mn.ops = &vfio_iommu_type1_mn_ops;
+ ret = mmu_notifier_register(&iommu->mn, current->mm);
+ if (ret)
+ goto out_drop;
+
+ list_for_each_entry(d, &iommu->domain_list, next) {
+ list_for_each_entry(g, &d->group_list, next) {
+ ret = iommu_group_for_each_dev(g->iommu_group,
+ &enabled_dev_cnt, dev_enable_iopf);
+ if (ret)
+ goto out_unwind;
+
+ iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
+ if (!iopf_group) {
+ ret = -ENOMEM;
+ goto out_unwind;
+ }
+
+ iopf_group->iommu_group = g->iommu_group;
+ iopf_group->iommu = iommu;
+
+ vfio_link_iopf_group(iopf_group);
+ }
+ }
+
+ iommu->iopf_enabled = true;
+ goto out_unlock;
+
+out_unwind:
+ list_for_each_entry(d, &iommu->domain_list, next) {
+ list_for_each_entry(g, &d->group_list, next) {
+ iopf_group = vfio_find_iopf_group(g->iommu_group);
+ if (iopf_group) {
+ vfio_unlink_iopf_group(iopf_group);
+ kfree(iopf_group);
+ }
+
+ if (iommu_group_for_each_dev(g->iommu_group,
+ &enabled_dev_cnt, dev_disable_iopf))
+ goto out_unregister;
+ }
+ }
+
+out_unregister:
+ mmu_notifier_unregister(&iommu->mn, current->mm);
+
+out_drop:
+ iommu->mm = NULL;
+ mmdrop(current->mm);
+
+out_unlock:
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
static long vfio_iommu_type1_ioctl(void *iommu_data,
unsigned int cmd, unsigned long arg)
{
@@ -3457,6 +3672,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
return vfio_iommu_type1_unmap_dma(iommu, arg);
case VFIO_IOMMU_DIRTY_PAGES:
return vfio_iommu_type1_dirty_pages(iommu, arg);
+ case VFIO_IOMMU_ENABLE_IOPF:
+ return vfio_iommu_type1_enable_iopf(iommu);
default:
return -ENOTTY;
}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 8ce36c1d53ca..5497036bebdc 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1208,6 +1208,12 @@ struct vfio_iommu_type1_dirty_bitmap_get {

#define VFIO_IOMMU_DIRTY_PAGES _IO(VFIO_TYPE, VFIO_BASE + 17)

+/*
+ * IOCTL to enable IOPF for the container.
+ * Called right after VFIO_SET_IOMMU.
+ */
+#define VFIO_IOMMU_ENABLE_IOPF _IO(VFIO_TYPE, VFIO_BASE + 18)
+
/* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */

/*
--
2.19.1

2021-04-09 03:47:16

by Shenming Lu

[permalink] [raw]
Subject: [RFC PATCH v3 4/8] vfio/type1: Pre-map more pages than requested in the IOPF handling

To optimize for fewer page fault handlings, we can pre-map more pages
than requested at once.

Note that IOPF_PREMAP_LEN is just an arbitrary value for now, which we
could try further tuning.

Signed-off-by: Shenming Lu <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 131 ++++++++++++++++++++++++++++++--
1 file changed, 123 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 1cb9d1f2717b..01e296c6dc9e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -3217,6 +3217,91 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
return -EINVAL;
}

+/*
+ * To optimize for fewer page fault handlings, try to
+ * pre-map more pages than requested.
+ */
+#define IOPF_PREMAP_LEN 512
+
+/*
+ * Return 0 on success or a negative error code, the
+ * number of pages contiguously pinned is in @pinned.
+ */
+static int pin_pages_iopf(struct vfio_dma *dma, unsigned long vaddr,
+ unsigned long npages, unsigned long *pfn_base,
+ unsigned long *pinned, struct vfio_batch *batch)
+{
+ struct mm_struct *mm;
+ unsigned long pfn;
+ int ret = 0;
+ *pinned = 0;
+
+ mm = get_task_mm(dma->task);
+ if (!mm)
+ return -ENODEV;
+
+ if (batch->size) {
+ *pfn_base = page_to_pfn(batch->pages[batch->offset]);
+ pfn = *pfn_base;
+ } else {
+ *pfn_base = 0;
+ }
+
+ while (npages) {
+ if (!batch->size) {
+ unsigned long req_pages = min_t(unsigned long, npages,
+ batch->capacity);
+
+ ret = vaddr_get_pfns(mm, vaddr, req_pages, dma->prot,
+ &pfn, batch->pages);
+ if (ret < 0)
+ goto out;
+
+ batch->size = ret;
+ batch->offset = 0;
+ ret = 0;
+
+ if (!*pfn_base)
+ *pfn_base = pfn;
+ }
+
+ while (true) {
+ if (pfn != *pfn_base + *pinned)
+ goto out;
+
+ (*pinned)++;
+ npages--;
+ vaddr += PAGE_SIZE;
+ batch->offset++;
+ batch->size--;
+
+ if (!batch->size)
+ break;
+
+ pfn = page_to_pfn(batch->pages[batch->offset]);
+ }
+
+ if (unlikely(disable_hugepages))
+ break;
+ }
+
+out:
+ if (batch->size == 1 && !batch->offset) {
+ put_pfn(pfn, dma->prot);
+ batch->size = 0;
+ }
+
+ mmput(mm);
+ return ret;
+}
+
+static void unpin_pages_iopf(struct vfio_dma *dma,
+ unsigned long pfn, unsigned long npages)
+{
+ while (npages--)
+ put_pfn(pfn++, dma->prot);
+}
+
/* VFIO I/O Page Fault handler */
static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
{
@@ -3225,9 +3310,11 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
struct vfio_iopf_group *iopf_group;
struct vfio_iommu *iommu;
struct vfio_dma *dma;
+ struct vfio_batch batch;
dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
int access_flags = 0;
- unsigned long bit_offset, vaddr, pfn;
+ size_t premap_len, map_len, mapped_len = 0;
+ unsigned long bit_offset, vaddr, pfn, i, npages;
int ret;
enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
struct iommu_page_response resp = {0};
@@ -3263,19 +3350,47 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
if (IOPF_MAPPED_BITMAP_GET(dma, bit_offset))
goto out_success;

+ premap_len = IOPF_PREMAP_LEN << PAGE_SHIFT;
+ npages = dma->size >> PAGE_SHIFT;
+ map_len = PAGE_SIZE;
+ for (i = bit_offset + 1; i < npages; i++) {
+ if (map_len >= premap_len || IOPF_MAPPED_BITMAP_GET(dma, i))
+ break;
+ map_len += PAGE_SIZE;
+ }
vaddr = iova - dma->iova + dma->vaddr;
+ vfio_batch_init(&batch);

- if (vfio_pin_page_external(dma, vaddr, &pfn, false))
- goto out_invalid;
+ while (map_len) {
+ ret = pin_pages_iopf(dma, vaddr + mapped_len,
+ map_len >> PAGE_SHIFT, &pfn,
+ &npages, &batch);
+ if (!npages)
+ break;

- if (vfio_iommu_map(iommu, iova, pfn, 1, dma->prot)) {
- put_pfn(pfn, dma->prot);
- goto out_invalid;
+ if (vfio_iommu_map(iommu, iova + mapped_len, pfn,
+ npages, dma->prot)) {
+ unpin_pages_iopf(dma, pfn, npages);
+ vfio_batch_unpin(&batch, dma);
+ break;
+ }
+
+ bitmap_set(dma->iopf_mapped_bitmap,
+ bit_offset + (mapped_len >> PAGE_SHIFT), npages);
+
+ unpin_pages_iopf(dma, pfn, npages);
+
+ map_len -= npages << PAGE_SHIFT;
+ mapped_len += npages << PAGE_SHIFT;
+
+ if (ret)
+ break;
}

- bitmap_set(dma->iopf_mapped_bitmap, bit_offset, 1);
+ vfio_batch_fini(&batch);

- put_pfn(pfn, dma->prot);
+ if (!mapped_len)
+ goto out_invalid;

out_success:
status = IOMMU_PAGE_RESP_SUCCESS;
--
2.19.1

2021-04-09 03:47:23

by Shenming Lu

[permalink] [raw]
Subject: [RFC PATCH v3 6/8] vfio/type1: No need to statically pin and map if IOPF enabled

If IOPF enabled for the VFIO container, there is no need to statically
pin and map the entire DMA range, we can do it on demand. And unmap
according to the IOPF mapped bitmap when removing vfio_dma.

Note that we still mark all pages dirty even if IOPF enabled, we may
add IOPF-based fine grained dirty tracking support in the future.

Signed-off-by: Shenming Lu <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 38 +++++++++++++++++++++++++++------
1 file changed, 32 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 7df5711e743a..dcc93c3b258c 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -175,6 +175,7 @@ struct vfio_iopf_group {
#define IOPF_MAPPED_BITMAP_GET(dma, i) \
((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG] \
>> ((i) % BITS_PER_LONG)) & 0x1)
+#define IOPF_MAPPED_BITMAP_BYTES(n) DIRTY_BITMAP_BYTES(n)

#define WAITED 1

@@ -959,7 +960,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
* already pinned and accounted. Accouting should be done if there is no
* iommu capable domain in the container.
*/
- do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
+ do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
+ iommu->iopf_enabled;

for (i = 0; i < npage; i++) {
struct vfio_pfn *vpfn;
@@ -1048,7 +1050,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,

mutex_lock(&iommu->lock);

- do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
+ do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
+ iommu->iopf_enabled;
for (i = 0; i < npage; i++) {
struct vfio_dma *dma;
dma_addr_t iova;
@@ -1169,7 +1172,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
if (!dma->size)
return 0;

- if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
+ if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
return 0;

/*
@@ -1306,11 +1309,20 @@ static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu,
}
}

+static void vfio_dma_clean_iopf(struct vfio_iommu *iommu, struct vfio_dma *dma)
+{
+ vfio_unmap_partial_iopf(iommu, dma, dma->iova, dma->iova + dma->size);
+
+ kfree(dma->iopf_mapped_bitmap);
+}
+
static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
{
WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
vfio_unmap_unpin(iommu, dma, true);
vfio_unlink_dma(iommu, dma);
+ if (iommu->iopf_enabled)
+ vfio_dma_clean_iopf(iommu, dma);
put_task_struct(dma->task);
vfio_dma_bitmap_free(dma);
if (dma->vaddr_invalid) {
@@ -1359,7 +1371,8 @@ static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
* mark all pages dirty if any IOMMU capable device is not able
* to report dirty pages and all pages are pinned and mapped.
*/
- if (iommu->num_non_pinned_groups && dma->iommu_mapped)
+ if (iommu->num_non_pinned_groups &&
+ (dma->iommu_mapped || iommu->iopf_enabled))
bitmap_set(dma->bitmap, 0, nbits);

if (shift) {
@@ -1772,6 +1785,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
goto out_unlock;
}

+ if (iommu->iopf_enabled) {
+ dma->iopf_mapped_bitmap = kvzalloc(IOPF_MAPPED_BITMAP_BYTES(
+ size >> PAGE_SHIFT), GFP_KERNEL);
+ if (!dma->iopf_mapped_bitmap) {
+ ret = -ENOMEM;
+ kfree(dma);
+ goto out_unlock;
+ }
+ }
+
iommu->dma_avail--;
dma->iova = iova;
dma->vaddr = vaddr;
@@ -1811,8 +1834,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
/* Insert zero-sized and grow as we map chunks of it */
vfio_link_dma(iommu, dma);

- /* Don't pin and map if container doesn't contain IOMMU capable domain*/
- if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
+ /*
+ * Don't pin and map if container doesn't contain IOMMU capable domain,
+ * or IOPF enabled for the container.
+ */
+ if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
dma->size = size;
else
ret = vfio_pin_map_dma(iommu, dma, size);
--
2.19.1

2021-04-09 03:47:52

by Shenming Lu

[permalink] [raw]
Subject: [RFC PATCH v3 3/8] vfio/type1: Add an MMU notifier to avoid pinning

To avoid pinning pages when they are mapped in IOMMU page tables, we
add an MMU notifier to tell the addresses which are no longer valid
and try to unmap them.

Signed-off-by: Shenming Lu <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 112 +++++++++++++++++++++++++++++++-
1 file changed, 109 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index ab0ff60ee207..1cb9d1f2717b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -40,6 +40,7 @@
#include <linux/notifier.h>
#include <linux/dma-iommu.h>
#include <linux/irqdomain.h>
+#include <linux/mmu_notifier.h>

#define DRIVER_VERSION "0.2"
#define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
@@ -69,6 +70,7 @@ struct vfio_iommu {
struct mutex lock;
struct rb_root dma_list;
struct blocking_notifier_head notifier;
+ struct mmu_notifier mn;
unsigned int dma_avail;
unsigned int vaddr_invalid_count;
uint64_t pgsize_bitmap;
@@ -1204,6 +1206,72 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
return unlocked;
}

+/* Unmap the IOPF mapped pages in the specified range. */
+static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu,
+ struct vfio_dma *dma,
+ dma_addr_t start, dma_addr_t end)
+{
+ struct iommu_iotlb_gather *gathers;
+ struct vfio_domain *d;
+ int i, num_domains = 0;
+
+ list_for_each_entry(d, &iommu->domain_list, next)
+ num_domains++;
+
+ gathers = kzalloc(sizeof(*gathers) * num_domains, GFP_KERNEL);
+ if (gathers) {
+ for (i = 0; i < num_domains; i++)
+ iommu_iotlb_gather_init(&gathers[i]);
+ }
+
+ while (start < end) {
+ unsigned long bit_offset;
+ size_t len;
+
+ bit_offset = (start - dma->iova) >> PAGE_SHIFT;
+
+ for (len = 0; start + len < end; len += PAGE_SIZE) {
+ if (!IOPF_MAPPED_BITMAP_GET(dma,
+ bit_offset + (len >> PAGE_SHIFT)))
+ break;
+ }
+
+ if (len) {
+ i = 0;
+ list_for_each_entry(d, &iommu->domain_list, next) {
+ size_t unmapped;
+
+ if (gathers)
+ unmapped = iommu_unmap_fast(d->domain,
+ start, len,
+ &gathers[i++]);
+ else
+ unmapped = iommu_unmap(d->domain,
+ start, len);
+
+ if (WARN_ON(unmapped != len))
+ goto out;
+ }
+
+ bitmap_clear(dma->iopf_mapped_bitmap,
+ bit_offset, len >> PAGE_SHIFT);
+
+ cond_resched();
+ }
+
+ start += (len + PAGE_SIZE);
+ }
+
+out:
+ if (gathers) {
+ i = 0;
+ list_for_each_entry(d, &iommu->domain_list, next)
+ iommu_iotlb_sync(d->domain, &gathers[i++]);
+
+ kfree(gathers);
+ }
+}
+
static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
{
WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
@@ -3197,17 +3265,18 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)

vaddr = iova - dma->iova + dma->vaddr;

- if (vfio_pin_page_external(dma, vaddr, &pfn, true))
+ if (vfio_pin_page_external(dma, vaddr, &pfn, false))
goto out_invalid;

if (vfio_iommu_map(iommu, iova, pfn, 1, dma->prot)) {
- if (put_pfn(pfn, dma->prot))
- vfio_lock_acct(dma, -1, true);
+ put_pfn(pfn, dma->prot);
goto out_invalid;
}

bitmap_set(dma->iopf_mapped_bitmap, bit_offset, 1);

+ put_pfn(pfn, dma->prot);
+
out_success:
status = IOMMU_PAGE_RESP_SUCCESS;

@@ -3220,6 +3289,43 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
return 0;
}

+static void mn_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct vfio_iommu *iommu = container_of(mn, struct vfio_iommu, mn);
+ struct rb_node *n;
+ int ret;
+
+ mutex_lock(&iommu->lock);
+
+ ret = vfio_wait_all_valid(iommu);
+ if (WARN_ON(ret < 0))
+ return;
+
+ for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
+ struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
+ unsigned long start_n, end_n;
+
+ if (end <= dma->vaddr || start >= dma->vaddr + dma->size)
+ continue;
+
+ start_n = ALIGN_DOWN(max_t(unsigned long, start, dma->vaddr),
+ PAGE_SIZE);
+ end_n = ALIGN(min_t(unsigned long, end, dma->vaddr + dma->size),
+ PAGE_SIZE);
+
+ vfio_unmap_partial_iopf(iommu, dma,
+ start_n - dma->vaddr + dma->iova,
+ end_n - dma->vaddr + dma->iova);
+ }
+
+ mutex_unlock(&iommu->lock);
+}
+
+static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
+ .invalidate_range = mn_invalidate_range,
+};
+
static long vfio_iommu_type1_ioctl(void *iommu_data,
unsigned int cmd, unsigned long arg)
{
--
2.19.1

2021-04-09 03:48:01

by Shenming Lu

[permalink] [raw]
Subject: [RFC PATCH v3 7/8] vfio/type1: Add selective DMA faulting support

Some devices only allow selective DMA faulting. Similar to the selective
dirty page tracking, the vendor driver can call vfio_pin_pages() to
indicate the non-faultable scope, we add a new struct vfio_range to
record it, then when the IOPF handler receives any page request out
of the scope, we can directly return with an invalid response.

Suggested-by: Kevin Tian <[email protected]>
Signed-off-by: Shenming Lu <[email protected]>
---
drivers/vfio/vfio.c | 4 +-
drivers/vfio/vfio_iommu_type1.c | 357 +++++++++++++++++++++++++++++++-
include/linux/vfio.h | 1 +
3 files changed, 358 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 38779e6fd80c..44c8dfabf7de 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -2013,7 +2013,8 @@ int vfio_unpin_pages(struct device *dev, unsigned long *user_pfn, int npage)
container = group->container;
driver = container->iommu_driver;
if (likely(driver && driver->ops->unpin_pages))
- ret = driver->ops->unpin_pages(container->iommu_data, user_pfn,
+ ret = driver->ops->unpin_pages(container->iommu_data,
+ group->iommu_group, user_pfn,
npage);
else
ret = -ENOTTY;
@@ -2112,6 +2113,7 @@ int vfio_group_unpin_pages(struct vfio_group *group,
driver = container->iommu_driver;
if (likely(driver && driver->ops->unpin_pages))
ret = driver->ops->unpin_pages(container->iommu_data,
+ group->iommu_group,
user_iova_pfn, npage);
else
ret = -ENOTTY;
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index dcc93c3b258c..ba2b5a1cf6e9 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -150,10 +150,19 @@ struct vfio_regions {
static struct rb_root iopf_group_list = RB_ROOT;
static DEFINE_MUTEX(iopf_group_list_lock);

+struct vfio_range {
+ struct rb_node node;
+ dma_addr_t base_iova;
+ size_t span;
+ unsigned int ref_count;
+};
+
struct vfio_iopf_group {
struct rb_node node;
struct iommu_group *iommu_group;
struct vfio_iommu *iommu;
+ struct rb_root pinned_range_list;
+ bool selective_faulting;
};

#define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) \
@@ -496,6 +505,255 @@ static void vfio_unlink_iopf_group(struct vfio_iopf_group *old)
mutex_unlock(&iopf_group_list_lock);
}

+/*
+ * Helper functions for range list, handle one page at a time.
+ */
+static struct vfio_range *vfio_find_range(struct rb_root *range_list,
+ dma_addr_t iova)
+{
+ struct rb_node *node = range_list->rb_node;
+ struct vfio_range *range;
+
+ while (node) {
+ range = rb_entry(node, struct vfio_range, node);
+
+ if (iova + PAGE_SIZE <= range->base_iova)
+ node = node->rb_left;
+ else if (iova >= range->base_iova + range->span)
+ node = node->rb_right;
+ else
+ return range;
+ }
+
+ return NULL;
+}
+
+/* Do the possible merge adjacent to the input range. */
+static void vfio_merge_range_list(struct rb_root *range_list,
+ struct vfio_range *range)
+{
+ struct rb_node *node_prev = rb_prev(&range->node);
+ struct rb_node *node_next = rb_next(&range->node);
+
+ if (node_next) {
+ struct vfio_range *range_next = rb_entry(node_next,
+ struct vfio_range,
+ node);
+
+ if (range_next->base_iova == (range->base_iova + range->span) &&
+ range_next->ref_count == range->ref_count) {
+ rb_erase(node_next, range_list);
+ range->span += range_next->span;
+ kfree(range_next);
+ }
+ }
+
+ if (node_prev) {
+ struct vfio_range *range_prev = rb_entry(node_prev,
+ struct vfio_range,
+ node);
+
+ if (range->base_iova == (range_prev->base_iova + range_prev->span)
+ && range->ref_count == range_prev->ref_count) {
+ rb_erase(&range->node, range_list);
+ range_prev->span += range->span;
+ kfree(range);
+ }
+ }
+}
+
+static void vfio_link_range(struct rb_root *range_list, struct vfio_range *new)
+{
+ struct rb_node **link, *parent = NULL;
+ struct vfio_range *range;
+
+ link = &range_list->rb_node;
+
+ while (*link) {
+ parent = *link;
+ range = rb_entry(parent, struct vfio_range, node);
+
+ if (new->base_iova < range->base_iova)
+ link = &(*link)->rb_left;
+ else
+ link = &(*link)->rb_right;
+ }
+
+ rb_link_node(&new->node, parent, link);
+ rb_insert_color(&new->node, range_list);
+
+ vfio_merge_range_list(range_list, new);
+}
+
+static int vfio_add_to_range_list(struct rb_root *range_list,
+ dma_addr_t iova)
+{
+ struct vfio_range *range = vfio_find_range(range_list, iova);
+
+ if (range) {
+ struct vfio_range *new_prev, *new_next;
+ size_t span_prev, span_next;
+
+ /* May split the found range into three parts. */
+ span_prev = iova - range->base_iova;
+ span_next = range->span - span_prev - PAGE_SIZE;
+
+ if (span_prev) {
+ new_prev = kzalloc(sizeof(*new_prev), GFP_KERNEL);
+ if (!new_prev)
+ return -ENOMEM;
+
+ new_prev->base_iova = range->base_iova;
+ new_prev->span = span_prev;
+ new_prev->ref_count = range->ref_count;
+ }
+
+ if (span_next) {
+ new_next = kzalloc(sizeof(*new_next), GFP_KERNEL);
+ if (!new_next) {
+ if (span_prev)
+ kfree(new_prev);
+ return -ENOMEM;
+ }
+
+ new_next->base_iova = iova + PAGE_SIZE;
+ new_next->span = span_next;
+ new_next->ref_count = range->ref_count;
+ }
+
+ range->base_iova = iova;
+ range->span = PAGE_SIZE;
+ range->ref_count++;
+ vfio_merge_range_list(range_list, range);
+
+ if (span_prev)
+ vfio_link_range(range_list, new_prev);
+
+ if (span_next)
+ vfio_link_range(range_list, new_next);
+ } else {
+ struct vfio_range *new;
+
+ new = kzalloc(sizeof(*new), GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+
+ new->base_iova = iova;
+ new->span = PAGE_SIZE;
+ new->ref_count = 1;
+
+ vfio_link_range(range_list, new);
+ }
+
+ return 0;
+}
+
+static int vfio_remove_from_range_list(struct rb_root *range_list,
+ dma_addr_t iova)
+{
+ struct vfio_range *range = vfio_find_range(range_list, iova);
+ struct vfio_range *news[3];
+ size_t span_prev, span_in, span_next;
+ int i, num_news;
+
+ if (!range)
+ return 0;
+
+ span_prev = iova - range->base_iova;
+ span_in = range->ref_count > 1 ? PAGE_SIZE : 0;
+ span_next = range->span - span_prev - PAGE_SIZE;
+
+ num_news = (int)!!span_prev + (int)!!span_in + (int)!!span_next;
+ if (!num_news) {
+ rb_erase(&range->node, range_list);
+ kfree(range);
+ return 0;
+ }
+
+ for (i = 0; i < num_news - 1; i++) {
+ news[i] = kzalloc(sizeof(struct vfio_range), GFP_KERNEL);
+ if (!news[i]) {
+ if (i > 0)
+ kfree(news[0]);
+ return -ENOMEM;
+ }
+ }
+ /* Reuse the found range. */
+ news[i] = range;
+
+ i = 0;
+ if (span_prev) {
+ news[i]->base_iova = range->base_iova;
+ news[i]->span = span_prev;
+ news[i++]->ref_count = range->ref_count;
+ }
+ if (span_in) {
+ news[i]->base_iova = iova;
+ news[i]->span = span_in;
+ news[i++]->ref_count = range->ref_count - 1;
+ }
+ if (span_next) {
+ news[i]->base_iova = iova + PAGE_SIZE;
+ news[i]->span = span_next;
+ news[i]->ref_count = range->ref_count;
+ }
+
+ vfio_merge_range_list(range_list, range);
+
+ for (i = 0; i < num_news - 1; i++)
+ vfio_link_range(range_list, news[i]);
+
+ return 0;
+}
+
+static void vfio_range_list_free(struct rb_root *range_list)
+{
+ struct rb_node *n;
+
+ while ((n = rb_first(range_list))) {
+ struct vfio_range *range = rb_entry(n, struct vfio_range, node);
+
+ rb_erase(&range->node, range_list);
+ kfree(range);
+ }
+}
+
+static int vfio_range_list_get_copy(struct vfio_iopf_group *iopf_group,
+ struct rb_root *range_list_copy)
+{
+ struct rb_root *range_list = &iopf_group->pinned_range_list;
+ struct rb_node *n, **link = &range_list_copy->rb_node, *parent = NULL;
+ int ret;
+
+ for (n = rb_first(range_list); n; n = rb_next(n)) {
+ struct vfio_range *range, *range_copy;
+
+ range = rb_entry(n, struct vfio_range, node);
+
+ range_copy = kzalloc(sizeof(*range_copy), GFP_KERNEL);
+ if (!range_copy) {
+ ret = -ENOMEM;
+ goto out_free;
+ }
+
+ range_copy->base_iova = range->base_iova;
+ range_copy->span = range->span;
+ range_copy->ref_count = range->ref_count;
+
+ rb_link_node(&range_copy->node, parent, link);
+ rb_insert_color(&range_copy->node, range_list_copy);
+
+ parent = *link;
+ link = &(*link)->rb_right;
+ }
+
+ return 0;
+
+out_free:
+ vfio_range_list_free(range_list_copy);
+ return ret;
+}
+
static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
{
struct mm_struct *mm;
@@ -910,6 +1168,9 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
return unlocked;
}

+static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
+ struct iommu_group *iommu_group);
+
static int vfio_iommu_type1_pin_pages(void *iommu_data,
struct iommu_group *iommu_group,
unsigned long *user_pfn,
@@ -923,6 +1184,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
struct vfio_dma *dma;
bool do_accounting;
dma_addr_t iova;
+ struct vfio_iopf_group *iopf_group = NULL;
+ struct rb_root range_list_copy = RB_ROOT;

if (!iommu || !user_pfn || !phys_pfn)
return -EINVAL;
@@ -955,6 +1218,31 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
goto pin_done;
}

+ /*
+ * Some devices only allow selective DMA faulting. Similar to the
+ * selective dirty tracking, the vendor driver can call vfio_pin_pages()
+ * to indicate the non-faultable scope, and we record it to filter
+ * out the invalid page requests in the IOPF handler.
+ */
+ if (iommu->iopf_enabled) {
+ iopf_group = vfio_find_iopf_group(iommu_group);
+ if (iopf_group) {
+ /*
+ * We don't want to work on the original range
+ * list as the list gets modified and in case
+ * of failure we have to retain the original
+ * list. Get a copy here.
+ */
+ ret = vfio_range_list_get_copy(iopf_group,
+ &range_list_copy);
+ if (ret)
+ goto pin_done;
+ } else {
+ WARN_ON(!find_iommu_group(iommu->external_domain,
+ iommu_group));
+ }
+ }
+
/*
* If iommu capable domain exist in the container then all pages are
* already pinned and accounted. Accouting should be done if there is no
@@ -981,6 +1269,15 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
vpfn = vfio_iova_get_vfio_pfn(dma, iova);
if (vpfn) {
phys_pfn[i] = vpfn->pfn;
+ if (iopf_group) {
+ ret = vfio_add_to_range_list(&range_list_copy,
+ iova);
+ if (ret) {
+ vfio_unpin_page_external(dma, iova,
+ do_accounting);
+ goto pin_unwind;
+ }
+ }
continue;
}

@@ -997,6 +1294,15 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
goto pin_unwind;
}

+ if (iopf_group) {
+ ret = vfio_add_to_range_list(&range_list_copy, iova);
+ if (ret) {
+ vfio_unpin_page_external(dma, iova,
+ do_accounting);
+ goto pin_unwind;
+ }
+ }
+
if (iommu->dirty_page_tracking) {
unsigned long pgshift = __ffs(iommu->pgsize_bitmap);

@@ -1010,6 +1316,13 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
}
ret = i;

+ if (iopf_group) {
+ vfio_range_list_free(&iopf_group->pinned_range_list);
+ iopf_group->pinned_range_list.rb_node = range_list_copy.rb_node;
+ if (!iopf_group->selective_faulting)
+ iopf_group->selective_faulting = true;
+ }
+
group = vfio_iommu_find_iommu_group(iommu, iommu_group);
if (!group->pinned_page_dirty_scope) {
group->pinned_page_dirty_scope = true;
@@ -1019,6 +1332,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
goto pin_done;

pin_unwind:
+ if (iopf_group)
+ vfio_range_list_free(&range_list_copy);
phys_pfn[i] = 0;
for (j = 0; j < i; j++) {
dma_addr_t iova;
@@ -1034,12 +1349,14 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
}

static int vfio_iommu_type1_unpin_pages(void *iommu_data,
+ struct iommu_group *iommu_group,
unsigned long *user_pfn,
int npage)
{
struct vfio_iommu *iommu = iommu_data;
+ struct vfio_iopf_group *iopf_group = NULL;
bool do_accounting;
- int i;
+ int i, ret;

if (!iommu || !user_pfn)
return -EINVAL;
@@ -1050,6 +1367,13 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,

mutex_lock(&iommu->lock);

+ if (iommu->iopf_enabled) {
+ iopf_group = vfio_find_iopf_group(iommu_group);
+ if (!iopf_group)
+ WARN_ON(!find_iommu_group(iommu->external_domain,
+ iommu_group));
+ }
+
do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
iommu->iopf_enabled;
for (i = 0; i < npage; i++) {
@@ -1058,14 +1382,24 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,

iova = user_pfn[i] << PAGE_SHIFT;
dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
- if (!dma)
+ if (!dma) {
+ ret = -EINVAL;
goto unpin_exit;
+ }
+
+ if (iopf_group) {
+ ret = vfio_remove_from_range_list(
+ &iopf_group->pinned_range_list, iova);
+ if (ret)
+ goto unpin_exit;
+ }
+
vfio_unpin_page_external(dma, iova, do_accounting);
}

unpin_exit:
mutex_unlock(&iommu->lock);
- return i > npage ? npage : (i > 0 ? i : -EINVAL);
+ return i > npage ? npage : (i > 0 ? i : ret);
}

static long vfio_sync_unpin(struct vfio_dma *dma, struct vfio_domain *domain,
@@ -2591,6 +2925,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,

iopf_group->iommu_group = iommu_group;
iopf_group->iommu = iommu;
+ iopf_group->pinned_range_list = RB_ROOT;

vfio_link_iopf_group(iopf_group);
}
@@ -2886,6 +3221,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,

iopf_group = vfio_find_iopf_group(iommu_group);
if (!WARN_ON(!iopf_group)) {
+ WARN_ON(!RB_EMPTY_ROOT(
+ &iopf_group->pinned_range_list));
vfio_unlink_iopf_group(iopf_group);
kfree(iopf_group);
}
@@ -3482,6 +3819,7 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
struct vfio_iommu *iommu;
struct vfio_dma *dma;
struct vfio_batch batch;
+ struct vfio_range *range;
dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
int access_flags = 0;
size_t premap_len, map_len, mapped_len = 0;
@@ -3506,6 +3844,12 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)

mutex_lock(&iommu->lock);

+ if (iopf_group->selective_faulting) {
+ range = vfio_find_range(&iopf_group->pinned_range_list, iova);
+ if (!range)
+ goto out_invalid;
+ }
+
ret = vfio_find_dma_valid(iommu, iova, PAGE_SIZE, &dma);
if (ret < 0)
goto out_invalid;
@@ -3523,6 +3867,12 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)

premap_len = IOPF_PREMAP_LEN << PAGE_SHIFT;
npages = dma->size >> PAGE_SHIFT;
+ if (iopf_group->selective_faulting) {
+ dma_addr_t range_end = range->base_iova + range->span;
+
+ if (range_end < dma->iova + dma->size)
+ npages = (range_end - dma->iova) >> PAGE_SHIFT;
+ }
map_len = PAGE_SIZE;
for (i = bit_offset + 1; i < npages; i++) {
if (map_len >= premap_len || IOPF_MAPPED_BITMAP_GET(dma, i))
@@ -3647,6 +3997,7 @@ static int vfio_iommu_type1_enable_iopf(struct vfio_iommu *iommu)

iopf_group->iommu_group = g->iommu_group;
iopf_group->iommu = iommu;
+ iopf_group->pinned_range_list = RB_ROOT;

vfio_link_iopf_group(iopf_group);
}
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index b7e18bde5aa8..a7b426d579df 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -87,6 +87,7 @@ struct vfio_iommu_driver_ops {
int npage, int prot,
unsigned long *phys_pfn);
int (*unpin_pages)(void *iommu_data,
+ struct iommu_group *group,
unsigned long *user_pfn, int npage);
int (*register_notifier)(void *iommu_data,
unsigned long *events,
--
2.19.1

2021-04-09 03:49:39

by Shenming Lu

[permalink] [raw]
Subject: [RFC PATCH v3 8/8] vfio: Add nested IOPF support

To set up nested mode, drivers such as vfio_pci need to register a
handler to receive stage/level 1 faults from the IOMMU, but since
currently each device can only have one iommu dev fault handler,
and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF),
we choose to update the registered handler (a consolidated one) via
flags (set FAULT_REPORT_NESTED_L1), and further deliver the received
stage 1 faults in the handler to the guest through a newly added
vfio_device_ops callback.

Signed-off-by: Shenming Lu <[email protected]>
---
drivers/vfio/vfio.c | 81 +++++++++++++++++++++++++++++++++
drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++-
include/linux/vfio.h | 12 +++++
3 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 44c8dfabf7de..4245f15914bf 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
}
EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);

+/*
+ * Register/Update the VFIO IOPF handler to receive
+ * nested stage/level 1 faults.
+ */
+int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
+{
+ struct vfio_container *container;
+ struct vfio_group *group;
+ struct vfio_iommu_driver *driver;
+ int ret;
+
+ if (!dev)
+ return -EINVAL;
+
+ group = vfio_group_get_from_dev(dev);
+ if (!group)
+ return -ENODEV;
+
+ ret = vfio_group_add_container_user(group);
+ if (ret)
+ goto out;
+
+ container = group->container;
+ driver = container->iommu_driver;
+ if (likely(driver && driver->ops->register_handler))
+ ret = driver->ops->register_handler(container->iommu_data, dev);
+ else
+ ret = -ENOTTY;
+
+ vfio_group_try_dissolve_container(group);
+
+out:
+ vfio_group_put(group);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
+
+int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
+{
+ struct vfio_container *container;
+ struct vfio_group *group;
+ struct vfio_iommu_driver *driver;
+ int ret;
+
+ if (!dev)
+ return -EINVAL;
+
+ group = vfio_group_get_from_dev(dev);
+ if (!group)
+ return -ENODEV;
+
+ ret = vfio_group_add_container_user(group);
+ if (ret)
+ goto out;
+
+ container = group->container;
+ driver = container->iommu_driver;
+ if (likely(driver && driver->ops->unregister_handler))
+ ret = driver->ops->unregister_handler(container->iommu_data, dev);
+ else
+ ret = -ENOTTY;
+
+ vfio_group_try_dissolve_container(group);
+
+out:
+ vfio_group_put(group);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested);
+
+int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault *fault)
+{
+ struct vfio_device *device = dev_get_drvdata(dev);
+
+ if (unlikely(!device->ops->transfer))
+ return -EOPNOTSUPP;
+
+ return device->ops->transfer(device->device_data, fault);
+}
+EXPORT_SYMBOL_GPL(vfio_transfer_iommu_fault);
+
/**
* Module/class support
*/
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index ba2b5a1cf6e9..9d1adeddb303 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -3821,13 +3821,32 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
struct vfio_batch batch;
struct vfio_range *range;
dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
- int access_flags = 0;
+ int access_flags = 0, nested;
size_t premap_len, map_len, mapped_len = 0;
unsigned long bit_offset, vaddr, pfn, i, npages;
int ret;
enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
struct iommu_page_response resp = {0};

+ if (vfio_dev_domian_nested(dev, &nested))
+ return -ENODEV;
+
+ /*
+ * When configured in nested mode, further deliver the
+ * stage/level 1 faults to the guest.
+ */
+ if (nested) {
+ bool l2;
+
+ if (fault->type == IOMMU_FAULT_PAGE_REQ)
+ l2 = fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2;
+ if (fault->type == IOMMU_FAULT_DMA_UNRECOV)
+ l2 = fault->event.flags & IOMMU_FAULT_UNRECOV_L2;
+
+ if (!l2)
+ return vfio_transfer_iommu_fault(dev, fault);
+ }
+
if (fault->type != IOMMU_FAULT_PAGE_REQ)
return -EOPNOTSUPP;

@@ -4201,6 +4220,32 @@ static void vfio_iommu_type1_notify(void *iommu_data,
wake_up_all(&iommu->vaddr_wait);
}

+static int vfio_iommu_type1_register_handler(void *iommu_data,
+ struct device *dev)
+{
+ struct vfio_iommu *iommu = iommu_data;
+
+ if (iommu->iopf_enabled)
+ return iommu_update_device_fault_handler(dev, ~0,
+ FAULT_REPORT_NESTED_L1);
+ else
+ return iommu_register_device_fault_handler(dev,
+ vfio_iommu_type1_dma_map_iopf,
+ FAULT_REPORT_NESTED_L1, dev);
+}
+
+static int vfio_iommu_type1_unregister_handler(void *iommu_data,
+ struct device *dev)
+{
+ struct vfio_iommu *iommu = iommu_data;
+
+ if (iommu->iopf_enabled)
+ return iommu_update_device_fault_handler(dev,
+ ~FAULT_REPORT_NESTED_L1, 0);
+ else
+ return iommu_unregister_device_fault_handler(dev);
+}
+
static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
.name = "vfio-iommu-type1",
.owner = THIS_MODULE,
@@ -4216,6 +4261,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
.dma_rw = vfio_iommu_type1_dma_rw,
.group_iommu_domain = vfio_iommu_type1_group_iommu_domain,
.notify = vfio_iommu_type1_notify,
+ .register_handler = vfio_iommu_type1_register_handler,
+ .unregister_handler = vfio_iommu_type1_unregister_handler,
};

static int __init vfio_iommu_type1_init(void)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index a7b426d579df..4621d8f0395d 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -29,6 +29,8 @@
* @match: Optional device name match callback (return: 0 for no-match, >0 for
* match, -errno for abort (ex. match with insufficient or incorrect
* additional args)
+ * @transfer: Optional. Transfer the received stage/level 1 faults to the guest
+ * for nested mode.
*/
struct vfio_device_ops {
char *name;
@@ -43,6 +45,7 @@ struct vfio_device_ops {
int (*mmap)(void *device_data, struct vm_area_struct *vma);
void (*request)(void *device_data, unsigned int count);
int (*match)(void *device_data, char *buf);
+ int (*transfer)(void *device_data, struct iommu_fault *fault);
};

extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
@@ -100,6 +103,10 @@ struct vfio_iommu_driver_ops {
struct iommu_group *group);
void (*notify)(void *iommu_data,
enum vfio_iommu_notify_type event);
+ int (*register_handler)(void *iommu_data,
+ struct device *dev);
+ int (*unregister_handler)(void *iommu_data,
+ struct device *dev);
};

extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
@@ -161,6 +168,11 @@ extern int vfio_unregister_notifier(struct device *dev,
struct kvm;
extern void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm);

+extern int vfio_iommu_dev_fault_handler_register_nested(struct device *dev);
+extern int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev);
+extern int vfio_transfer_iommu_fault(struct device *dev,
+ struct iommu_fault *fault);
+
/*
* Sub-module helpers
*/
--
2.19.1

2021-04-26 01:43:31

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough

On 2021/4/9 11:44, Shenming Lu wrote:
> Hi,
>
> Requesting for your comments and suggestions. :-)

Kind ping...

>
> The static pinning and mapping problem in VFIO and possible solutions
> have been discussed a lot [1, 2]. One of the solutions is to add I/O
> Page Fault support for VFIO devices. Different from those relatively
> complicated software approaches such as presenting a vIOMMU that provides
> the DMA buffer information (might include para-virtualized optimizations),
> IOPF mainly depends on the hardware faulting capability, such as the PCIe
> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
> support for VFIO passthrough based on the IOPF part of SVA in this series.
>
> We have measured its performance with UADK [4] (passthrough an accelerator
> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
>
> Run hisi_sec_test...
> - with varying sending times and message lengths
> - with/without IOPF enabled (speed slowdown)
>
> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
> slowdown (num of faults)
> times VFIO IOPF host SVA
> 1 63.4% (518) 82.8% (512)
> 100 22.9% (1058) 47.9% (1024)
> 1000 2.6% (1071) 8.5% (1024)
>
> when msg_len = 10MB (and PREMAP_LEN = 512):
> slowdown (num of faults)
> times VFIO IOPF
> 1 32.6% (13)
> 100 3.5% (26)
> 1000 1.6% (26)
>
> History:
>
> v2 -> v3
> - Nit fixes.
> - No reason to disable reporting the unrecoverable faults. (baolu)
> - Maintain a global IOPF enabled group list.
> - Split the pre-mapping optimization to be a separate patch.
> - Add selective faulting support (use vfio_pin_pages to indicate the
> non-faultable scope and add a new struct vfio_range to record it,
> untested). (Kevin)
>
> v1 -> v2
> - Numerous improvements following the suggestions. Thanks a lot to all
> of you.
>
> Note that PRI is not supported at the moment since there is no hardware.
>
> Links:
> [1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS,
> 2016.
> [2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer Tracking
> for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020.
> [3] https://patchwork.kernel.org/project/linux-arm-kernel/cover/[email protected]/
> [4] https://github.com/Linaro/uadk
>
> Thanks,
> Shenming
>
>
> Shenming Lu (8):
> iommu: Evolve the device fault reporting framework
> vfio/type1: Add a page fault handler
> vfio/type1: Add an MMU notifier to avoid pinning
> vfio/type1: Pre-map more pages than requested in the IOPF handling
> vfio/type1: VFIO_IOMMU_ENABLE_IOPF
> vfio/type1: No need to statically pin and map if IOPF enabled
> vfio/type1: Add selective DMA faulting support
> vfio: Add nested IOPF support
>
> .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 3 +-
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 +-
> drivers/iommu/iommu.c | 56 +-
> drivers/vfio/vfio.c | 85 +-
> drivers/vfio/vfio_iommu_type1.c | 1000 ++++++++++++++++-
> include/linux/iommu.h | 19 +-
> include/linux/vfio.h | 13 +
> include/uapi/linux/iommu.h | 4 +
> include/uapi/linux/vfio.h | 6 +
> 9 files changed, 1181 insertions(+), 23 deletions(-)
>

2021-05-11 11:32:27

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough

Hi Alex,

Hope for some suggestions or comments from you since there seems to be many unsure
points in this series. :-)

Thanks,
Shenming


On 2021/4/26 9:41, Shenming Lu wrote:
> On 2021/4/9 11:44, Shenming Lu wrote:
>> Hi,
>>
>> Requesting for your comments and suggestions. :-)
>
> Kind ping...
>
>>
>> The static pinning and mapping problem in VFIO and possible solutions
>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
>> Page Fault support for VFIO devices. Different from those relatively
>> complicated software approaches such as presenting a vIOMMU that provides
>> the DMA buffer information (might include para-virtualized optimizations),
>> IOPF mainly depends on the hardware faulting capability, such as the PCIe
>> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
>> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
>> support for VFIO passthrough based on the IOPF part of SVA in this series.
>>
>> We have measured its performance with UADK [4] (passthrough an accelerator
>> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
>>
>> Run hisi_sec_test...
>> - with varying sending times and message lengths
>> - with/without IOPF enabled (speed slowdown)
>>
>> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
>> slowdown (num of faults)
>> times VFIO IOPF host SVA
>> 1 63.4% (518) 82.8% (512)
>> 100 22.9% (1058) 47.9% (1024)
>> 1000 2.6% (1071) 8.5% (1024)
>>
>> when msg_len = 10MB (and PREMAP_LEN = 512):
>> slowdown (num of faults)
>> times VFIO IOPF
>> 1 32.6% (13)
>> 100 3.5% (26)
>> 1000 1.6% (26)
>>
>> History:
>>
>> v2 -> v3
>> - Nit fixes.
>> - No reason to disable reporting the unrecoverable faults. (baolu)
>> - Maintain a global IOPF enabled group list.
>> - Split the pre-mapping optimization to be a separate patch.
>> - Add selective faulting support (use vfio_pin_pages to indicate the
>> non-faultable scope and add a new struct vfio_range to record it,
>> untested). (Kevin)
>>
>> v1 -> v2
>> - Numerous improvements following the suggestions. Thanks a lot to all
>> of you.
>>
>> Note that PRI is not supported at the moment since there is no hardware.
>>
>> Links:
>> [1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS,
>> 2016.
>> [2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer Tracking
>> for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020.
>> [3] https://patchwork.kernel.org/project/linux-arm-kernel/cover/[email protected]/
>> [4] https://github.com/Linaro/uadk
>>
>> Thanks,
>> Shenming
>>
>>
>> Shenming Lu (8):
>> iommu: Evolve the device fault reporting framework
>> vfio/type1: Add a page fault handler
>> vfio/type1: Add an MMU notifier to avoid pinning
>> vfio/type1: Pre-map more pages than requested in the IOPF handling
>> vfio/type1: VFIO_IOMMU_ENABLE_IOPF
>> vfio/type1: No need to statically pin and map if IOPF enabled
>> vfio/type1: Add selective DMA faulting support
>> vfio: Add nested IOPF support
>>
>> .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 3 +-
>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 +-
>> drivers/iommu/iommu.c | 56 +-
>> drivers/vfio/vfio.c | 85 +-
>> drivers/vfio/vfio_iommu_type1.c | 1000 ++++++++++++++++-
>> include/linux/iommu.h | 19 +-
>> include/linux/vfio.h | 13 +
>> include/uapi/linux/iommu.h | 4 +
>> include/uapi/linux/vfio.h | 6 +
>> 9 files changed, 1181 insertions(+), 23 deletions(-)
>>

2021-05-19 18:30:21

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support

On Fri, 9 Apr 2021 11:44:20 +0800
Shenming Lu <[email protected]> wrote:

> To set up nested mode, drivers such as vfio_pci need to register a
> handler to receive stage/level 1 faults from the IOMMU, but since
> currently each device can only have one iommu dev fault handler,
> and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF),
> we choose to update the registered handler (a consolidated one) via
> flags (set FAULT_REPORT_NESTED_L1), and further deliver the received
> stage 1 faults in the handler to the guest through a newly added
> vfio_device_ops callback.

Are there proposed in-kernel drivers that would use any of these
symbols?

> Signed-off-by: Shenming Lu <[email protected]>
> ---
> drivers/vfio/vfio.c | 81 +++++++++++++++++++++++++++++++++
> drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++-
> include/linux/vfio.h | 12 +++++
> 3 files changed, 141 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 44c8dfabf7de..4245f15914bf 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
> }
> EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
>
> +/*
> + * Register/Update the VFIO IOPF handler to receive
> + * nested stage/level 1 faults.
> + */
> +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
> +{
> + struct vfio_container *container;
> + struct vfio_group *group;
> + struct vfio_iommu_driver *driver;
> + int ret;
> +
> + if (!dev)
> + return -EINVAL;
> +
> + group = vfio_group_get_from_dev(dev);
> + if (!group)
> + return -ENODEV;
> +
> + ret = vfio_group_add_container_user(group);
> + if (ret)
> + goto out;
> +
> + container = group->container;
> + driver = container->iommu_driver;
> + if (likely(driver && driver->ops->register_handler))
> + ret = driver->ops->register_handler(container->iommu_data, dev);
> + else
> + ret = -ENOTTY;
> +
> + vfio_group_try_dissolve_container(group);
> +
> +out:
> + vfio_group_put(group);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
> +
> +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
> +{
> + struct vfio_container *container;
> + struct vfio_group *group;
> + struct vfio_iommu_driver *driver;
> + int ret;
> +
> + if (!dev)
> + return -EINVAL;
> +
> + group = vfio_group_get_from_dev(dev);
> + if (!group)
> + return -ENODEV;
> +
> + ret = vfio_group_add_container_user(group);
> + if (ret)
> + goto out;
> +
> + container = group->container;
> + driver = container->iommu_driver;
> + if (likely(driver && driver->ops->unregister_handler))
> + ret = driver->ops->unregister_handler(container->iommu_data, dev);
> + else
> + ret = -ENOTTY;
> +
> + vfio_group_try_dissolve_container(group);
> +
> +out:
> + vfio_group_put(group);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested);
> +
> +int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault *fault)
> +{
> + struct vfio_device *device = dev_get_drvdata(dev);
> +
> + if (unlikely(!device->ops->transfer))
> + return -EOPNOTSUPP;
> +
> + return device->ops->transfer(device->device_data, fault);
> +}
> +EXPORT_SYMBOL_GPL(vfio_transfer_iommu_fault);
> +
> /**
> * Module/class support
> */
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index ba2b5a1cf6e9..9d1adeddb303 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -3821,13 +3821,32 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
> struct vfio_batch batch;
> struct vfio_range *range;
> dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
> - int access_flags = 0;
> + int access_flags = 0, nested;
> size_t premap_len, map_len, mapped_len = 0;
> unsigned long bit_offset, vaddr, pfn, i, npages;
> int ret;
> enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
> struct iommu_page_response resp = {0};
>
> + if (vfio_dev_domian_nested(dev, &nested))
> + return -ENODEV;
> +
> + /*
> + * When configured in nested mode, further deliver the
> + * stage/level 1 faults to the guest.
> + */
> + if (nested) {
> + bool l2;
> +
> + if (fault->type == IOMMU_FAULT_PAGE_REQ)
> + l2 = fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2;
> + if (fault->type == IOMMU_FAULT_DMA_UNRECOV)
> + l2 = fault->event.flags & IOMMU_FAULT_UNRECOV_L2;
> +
> + if (!l2)
> + return vfio_transfer_iommu_fault(dev, fault);
> + }
> +
> if (fault->type != IOMMU_FAULT_PAGE_REQ)
> return -EOPNOTSUPP;
>
> @@ -4201,6 +4220,32 @@ static void vfio_iommu_type1_notify(void *iommu_data,
> wake_up_all(&iommu->vaddr_wait);
> }
>
> +static int vfio_iommu_type1_register_handler(void *iommu_data,
> + struct device *dev)
> +{
> + struct vfio_iommu *iommu = iommu_data;
> +
> + if (iommu->iopf_enabled)
> + return iommu_update_device_fault_handler(dev, ~0,
> + FAULT_REPORT_NESTED_L1);
> + else
> + return iommu_register_device_fault_handler(dev,
> + vfio_iommu_type1_dma_map_iopf,
> + FAULT_REPORT_NESTED_L1, dev);
> +}
> +
> +static int vfio_iommu_type1_unregister_handler(void *iommu_data,
> + struct device *dev)
> +{
> + struct vfio_iommu *iommu = iommu_data;
> +
> + if (iommu->iopf_enabled)
> + return iommu_update_device_fault_handler(dev,
> + ~FAULT_REPORT_NESTED_L1, 0);
> + else
> + return iommu_unregister_device_fault_handler(dev);
> +}


The path through vfio to register this is pretty ugly, but I don't see
any reason for the the update interfaces here, the previously
registered handler just changes its behavior.


> +
> static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
> .name = "vfio-iommu-type1",
> .owner = THIS_MODULE,
> @@ -4216,6 +4261,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
> .dma_rw = vfio_iommu_type1_dma_rw,
> .group_iommu_domain = vfio_iommu_type1_group_iommu_domain,
> .notify = vfio_iommu_type1_notify,
> + .register_handler = vfio_iommu_type1_register_handler,
> + .unregister_handler = vfio_iommu_type1_unregister_handler,
> };
>
> static int __init vfio_iommu_type1_init(void)
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index a7b426d579df..4621d8f0395d 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -29,6 +29,8 @@
> * @match: Optional device name match callback (return: 0 for no-match, >0 for
> * match, -errno for abort (ex. match with insufficient or incorrect
> * additional args)
> + * @transfer: Optional. Transfer the received stage/level 1 faults to the guest
> + * for nested mode.
> */
> struct vfio_device_ops {
> char *name;
> @@ -43,6 +45,7 @@ struct vfio_device_ops {
> int (*mmap)(void *device_data, struct vm_area_struct *vma);
> void (*request)(void *device_data, unsigned int count);
> int (*match)(void *device_data, char *buf);
> + int (*transfer)(void *device_data, struct iommu_fault *fault);
> };
>
> extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
> @@ -100,6 +103,10 @@ struct vfio_iommu_driver_ops {
> struct iommu_group *group);
> void (*notify)(void *iommu_data,
> enum vfio_iommu_notify_type event);
> + int (*register_handler)(void *iommu_data,
> + struct device *dev);
> + int (*unregister_handler)(void *iommu_data,
> + struct device *dev);
> };
>
> extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
> @@ -161,6 +168,11 @@ extern int vfio_unregister_notifier(struct device *dev,
> struct kvm;
> extern void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm);
>
> +extern int vfio_iommu_dev_fault_handler_register_nested(struct device *dev);
> +extern int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev);
> +extern int vfio_transfer_iommu_fault(struct device *dev,
> + struct iommu_fault *fault);
> +
> /*
> * Sub-module helpers
> */


2021-05-19 18:30:32

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH v3 6/8] vfio/type1: No need to statically pin and map if IOPF enabled

On Fri, 9 Apr 2021 11:44:18 +0800
Shenming Lu <[email protected]> wrote:

> If IOPF enabled for the VFIO container, there is no need to statically
> pin and map the entire DMA range, we can do it on demand. And unmap
> according to the IOPF mapped bitmap when removing vfio_dma.
>
> Note that we still mark all pages dirty even if IOPF enabled, we may
> add IOPF-based fine grained dirty tracking support in the future.
>
> Signed-off-by: Shenming Lu <[email protected]>
> ---
> drivers/vfio/vfio_iommu_type1.c | 38 +++++++++++++++++++++++++++------
> 1 file changed, 32 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 7df5711e743a..dcc93c3b258c 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -175,6 +175,7 @@ struct vfio_iopf_group {
> #define IOPF_MAPPED_BITMAP_GET(dma, i) \
> ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG] \
> >> ((i) % BITS_PER_LONG)) & 0x1)
> +#define IOPF_MAPPED_BITMAP_BYTES(n) DIRTY_BITMAP_BYTES(n)
>
> #define WAITED 1
>
> @@ -959,7 +960,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> * already pinned and accounted. Accouting should be done if there is no
> * iommu capable domain in the container.
> */
> - do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
> + do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
> + iommu->iopf_enabled;
>
> for (i = 0; i < npage; i++) {
> struct vfio_pfn *vpfn;
> @@ -1048,7 +1050,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
>
> mutex_lock(&iommu->lock);
>
> - do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
> + do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
> + iommu->iopf_enabled;

pin/unpin are actually still pinning pages, why does iopf exempt them
from accounting?


> for (i = 0; i < npage; i++) {
> struct vfio_dma *dma;
> dma_addr_t iova;
> @@ -1169,7 +1172,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> if (!dma->size)
> return 0;
>
> - if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
> + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
> return 0;
>
> /*
> @@ -1306,11 +1309,20 @@ static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu,
> }
> }
>
> +static void vfio_dma_clean_iopf(struct vfio_iommu *iommu, struct vfio_dma *dma)
> +{
> + vfio_unmap_partial_iopf(iommu, dma, dma->iova, dma->iova + dma->size);
> +
> + kfree(dma->iopf_mapped_bitmap);
> +}
> +
> static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> {
> WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
> vfio_unmap_unpin(iommu, dma, true);
> vfio_unlink_dma(iommu, dma);
> + if (iommu->iopf_enabled)
> + vfio_dma_clean_iopf(iommu, dma);
> put_task_struct(dma->task);
> vfio_dma_bitmap_free(dma);
> if (dma->vaddr_invalid) {
> @@ -1359,7 +1371,8 @@ static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
> * mark all pages dirty if any IOMMU capable device is not able
> * to report dirty pages and all pages are pinned and mapped.
> */
> - if (iommu->num_non_pinned_groups && dma->iommu_mapped)
> + if (iommu->num_non_pinned_groups &&
> + (dma->iommu_mapped || iommu->iopf_enabled))
> bitmap_set(dma->bitmap, 0, nbits);

This seems like really poor integration of iopf into dirty page
tracking. I'd expect dirty logging to flush the mapped pages and
write faults to mark pages dirty. Shouldn't the fault handler also
provide only the access faulted, so for example a read fault wouldn't
mark the page dirty?

>
> if (shift) {
> @@ -1772,6 +1785,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> goto out_unlock;
> }
>
> + if (iommu->iopf_enabled) {
> + dma->iopf_mapped_bitmap = kvzalloc(IOPF_MAPPED_BITMAP_BYTES(
> + size >> PAGE_SHIFT), GFP_KERNEL);
> + if (!dma->iopf_mapped_bitmap) {
> + ret = -ENOMEM;
> + kfree(dma);
> + goto out_unlock;
> + }


So we're assuming nothing can fault and therefore nothing can reference
the iopf_mapped_bitmap until this point in the series?


> + }
> +
> iommu->dma_avail--;
> dma->iova = iova;
> dma->vaddr = vaddr;
> @@ -1811,8 +1834,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> /* Insert zero-sized and grow as we map chunks of it */
> vfio_link_dma(iommu, dma);
>
> - /* Don't pin and map if container doesn't contain IOMMU capable domain*/
> - if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
> + /*
> + * Don't pin and map if container doesn't contain IOMMU capable domain,
> + * or IOPF enabled for the container.
> + */
> + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
> dma->size = size;
> else
> ret = vfio_pin_map_dma(iommu, dma, size);


2021-05-19 18:30:57

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/8] vfio/type1: Add an MMU notifier to avoid pinning

On Fri, 9 Apr 2021 11:44:15 +0800
Shenming Lu <[email protected]> wrote:

> To avoid pinning pages when they are mapped in IOMMU page tables, we
> add an MMU notifier to tell the addresses which are no longer valid
> and try to unmap them.
>
> Signed-off-by: Shenming Lu <[email protected]>
> ---
> drivers/vfio/vfio_iommu_type1.c | 112 +++++++++++++++++++++++++++++++-
> 1 file changed, 109 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index ab0ff60ee207..1cb9d1f2717b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -40,6 +40,7 @@
> #include <linux/notifier.h>
> #include <linux/dma-iommu.h>
> #include <linux/irqdomain.h>
> +#include <linux/mmu_notifier.h>
>
> #define DRIVER_VERSION "0.2"
> #define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
> @@ -69,6 +70,7 @@ struct vfio_iommu {
> struct mutex lock;
> struct rb_root dma_list;
> struct blocking_notifier_head notifier;
> + struct mmu_notifier mn;
> unsigned int dma_avail;
> unsigned int vaddr_invalid_count;
> uint64_t pgsize_bitmap;
> @@ -1204,6 +1206,72 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> return unlocked;
> }
>
> +/* Unmap the IOPF mapped pages in the specified range. */
> +static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu,
> + struct vfio_dma *dma,
> + dma_addr_t start, dma_addr_t end)
> +{
> + struct iommu_iotlb_gather *gathers;
> + struct vfio_domain *d;
> + int i, num_domains = 0;
> +
> + list_for_each_entry(d, &iommu->domain_list, next)
> + num_domains++;
> +
> + gathers = kzalloc(sizeof(*gathers) * num_domains, GFP_KERNEL);
> + if (gathers) {
> + for (i = 0; i < num_domains; i++)
> + iommu_iotlb_gather_init(&gathers[i]);
> + }


If we're always serialized on iommu->lock, would it make sense to have
gathers pre-allocated on the vfio_iommu object?

> +
> + while (start < end) {
> + unsigned long bit_offset;
> + size_t len;
> +
> + bit_offset = (start - dma->iova) >> PAGE_SHIFT;
> +
> + for (len = 0; start + len < end; len += PAGE_SIZE) {
> + if (!IOPF_MAPPED_BITMAP_GET(dma,
> + bit_offset + (len >> PAGE_SHIFT)))
> + break;


There are bitmap helpers for this, find_first_bit(),
find_next_zero_bit().


> + }
> +
> + if (len) {
> + i = 0;
> + list_for_each_entry(d, &iommu->domain_list, next) {
> + size_t unmapped;
> +
> + if (gathers)
> + unmapped = iommu_unmap_fast(d->domain,
> + start, len,
> + &gathers[i++]);
> + else
> + unmapped = iommu_unmap(d->domain,
> + start, len);
> +
> + if (WARN_ON(unmapped != len))

The IOMMU API does not guarantee arbitrary unmap sizes, this will
trigger and this exit path is wrong. If we've already unmapped the
IOMMU, shouldn't we proceed with @unmapped rather than @len so the
device can re-fault the extra mappings? Otherwise the IOMMU state
doesn't match the iopf bitmap state.

> + goto out;
> + }
> +
> + bitmap_clear(dma->iopf_mapped_bitmap,
> + bit_offset, len >> PAGE_SHIFT);
> +
> + cond_resched();
> + }
> +
> + start += (len + PAGE_SIZE);
> + }
> +
> +out:
> + if (gathers) {
> + i = 0;
> + list_for_each_entry(d, &iommu->domain_list, next)
> + iommu_iotlb_sync(d->domain, &gathers[i++]);
> +
> + kfree(gathers);
> + }
> +}
> +
> static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> {
> WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
> @@ -3197,17 +3265,18 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
>
> vaddr = iova - dma->iova + dma->vaddr;
>
> - if (vfio_pin_page_external(dma, vaddr, &pfn, true))
> + if (vfio_pin_page_external(dma, vaddr, &pfn, false))
> goto out_invalid;
>
> if (vfio_iommu_map(iommu, iova, pfn, 1, dma->prot)) {
> - if (put_pfn(pfn, dma->prot))
> - vfio_lock_acct(dma, -1, true);
> + put_pfn(pfn, dma->prot);
> goto out_invalid;
> }
>
> bitmap_set(dma->iopf_mapped_bitmap, bit_offset, 1);
>
> + put_pfn(pfn, dma->prot);
> +
> out_success:
> status = IOMMU_PAGE_RESP_SUCCESS;
>
> @@ -3220,6 +3289,43 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
> return 0;
> }
>
> +static void mn_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
> + unsigned long start, unsigned long end)
> +{
> + struct vfio_iommu *iommu = container_of(mn, struct vfio_iommu, mn);
> + struct rb_node *n;
> + int ret;
> +
> + mutex_lock(&iommu->lock);
> +
> + ret = vfio_wait_all_valid(iommu);
> + if (WARN_ON(ret < 0))
> + return;

Is WARN_ON sufficient for this error condition? We've been told to
evacuate a range of mm, the device still has DMA access, we've removed
page pinning. This seems like a BUG_ON condition to me, we can't allow
the system to continue in any way with pages getting unmapped from
under the device.

> +
> + for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
> + struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> + unsigned long start_n, end_n;
> +
> + if (end <= dma->vaddr || start >= dma->vaddr + dma->size)
> + continue;
> +
> + start_n = ALIGN_DOWN(max_t(unsigned long, start, dma->vaddr),
> + PAGE_SIZE);
> + end_n = ALIGN(min_t(unsigned long, end, dma->vaddr + dma->size),
> + PAGE_SIZE);
> +
> + vfio_unmap_partial_iopf(iommu, dma,
> + start_n - dma->vaddr + dma->iova,
> + end_n - dma->vaddr + dma->iova);
> + }
> +
> + mutex_unlock(&iommu->lock);
> +}
> +
> +static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
> + .invalidate_range = mn_invalidate_range,
> +};
> +
> static long vfio_iommu_type1_ioctl(void *iommu_data,
> unsigned int cmd, unsigned long arg)
> {

Again, this patch series is difficult to follow because we're
introducing dead code until the mmu notifier actually has a path to be
registered. We shouldn't be taking any faults until iopf is enabled,
so it seems like we can add more of the core support alongside this
code.


2021-05-19 18:31:02

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH v3 4/8] vfio/type1: Pre-map more pages than requested in the IOPF handling

On Fri, 9 Apr 2021 11:44:16 +0800
Shenming Lu <[email protected]> wrote:

> To optimize for fewer page fault handlings, we can pre-map more pages
> than requested at once.
>
> Note that IOPF_PREMAP_LEN is just an arbitrary value for now, which we
> could try further tuning.

I'd prefer that the series introduced full end-to-end functionality
before trying to improve performance. The pre-map value seems
arbitrary here and as noted in the previous patch, the IOMMU API does
not guarantee unmaps of ranges smaller than the original mapping. This
would need to map with single page granularity in order to guarantee
page granularity at the mmu notifier when the IOMMU supports
superpages. Thanks,

Alex


2021-05-19 18:31:58

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH v3 7/8] vfio/type1: Add selective DMA faulting support

On Fri, 9 Apr 2021 11:44:19 +0800
Shenming Lu <[email protected]> wrote:

> Some devices only allow selective DMA faulting. Similar to the selective
> dirty page tracking, the vendor driver can call vfio_pin_pages() to
> indicate the non-faultable scope, we add a new struct vfio_range to
> record it, then when the IOPF handler receives any page request out
> of the scope, we can directly return with an invalid response.

Seems like this highlights a deficiency in the design, that the user
can't specify mappings as iopf enabled or disabled. Also, if the
vendor driver has pinned pages within the range, shouldn't that prevent
them from faulting in the first place? Why do we need yet more
tracking structures? Pages pinned by the vendor driver need to count
against the user's locked memory limits regardless of iopf. Thanks,

Alex


2021-05-19 18:32:55

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough

On Fri, 9 Apr 2021 11:44:12 +0800
Shenming Lu <[email protected]> wrote:

> Hi,
>
> Requesting for your comments and suggestions. :-)
>
> The static pinning and mapping problem in VFIO and possible solutions
> have been discussed a lot [1, 2]. One of the solutions is to add I/O
> Page Fault support for VFIO devices. Different from those relatively
> complicated software approaches such as presenting a vIOMMU that provides
> the DMA buffer information (might include para-virtualized optimizations),
> IOPF mainly depends on the hardware faulting capability, such as the PCIe
> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
> support for VFIO passthrough based on the IOPF part of SVA in this series.

The SVA proposals are being reworked to make use of a new IOASID
object, it's not clear to me that this shouldn't also make use of that
work as it does a significant expansion of the type1 IOMMU with fault
handlers that would duplicate new work using that new model.

> We have measured its performance with UADK [4] (passthrough an accelerator
> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
>
> Run hisi_sec_test...
> - with varying sending times and message lengths
> - with/without IOPF enabled (speed slowdown)
>
> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
> slowdown (num of faults)
> times VFIO IOPF host SVA
> 1 63.4% (518) 82.8% (512)
> 100 22.9% (1058) 47.9% (1024)
> 1000 2.6% (1071) 8.5% (1024)
>
> when msg_len = 10MB (and PREMAP_LEN = 512):
> slowdown (num of faults)
> times VFIO IOPF
> 1 32.6% (13)
> 100 3.5% (26)
> 1000 1.6% (26)

It seems like this is only an example that you can make a benchmark
show anything you want. The best results would be to pre-map
everything, which is what we have without this series. What is an
acceptable overhead to incur to avoid page pinning? What if userspace
had more fine grained control over which mappings were available for
faulting and which were statically mapped? I don't really see what
sense the pre-mapping range makes. If I assume the user is QEMU in a
non-vIOMMU configuration, pre-mapping the beginning of each RAM section
doesn't make any logical sense relative to device DMA.

Comments per patch to follow. Thanks,

Alex


> History:
>
> v2 -> v3
> - Nit fixes.
> - No reason to disable reporting the unrecoverable faults. (baolu)
> - Maintain a global IOPF enabled group list.
> - Split the pre-mapping optimization to be a separate patch.
> - Add selective faulting support (use vfio_pin_pages to indicate the
> non-faultable scope and add a new struct vfio_range to record it,
> untested). (Kevin)
>
> v1 -> v2
> - Numerous improvements following the suggestions. Thanks a lot to all
> of you.
>
> Note that PRI is not supported at the moment since there is no hardware.
>
> Links:
> [1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS,
> 2016.
> [2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer Tracking
> for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020.
> [3] https://patchwork.kernel.org/project/linux-arm-kernel/cover/[email protected]/
> [4] https://github.com/Linaro/uadk
>
> Thanks,
> Shenming
>
>
> Shenming Lu (8):
> iommu: Evolve the device fault reporting framework
> vfio/type1: Add a page fault handler
> vfio/type1: Add an MMU notifier to avoid pinning
> vfio/type1: Pre-map more pages than requested in the IOPF handling
> vfio/type1: VFIO_IOMMU_ENABLE_IOPF
> vfio/type1: No need to statically pin and map if IOPF enabled
> vfio/type1: Add selective DMA faulting support
> vfio: Add nested IOPF support
>
> .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 3 +-
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 +-
> drivers/iommu/iommu.c | 56 +-
> drivers/vfio/vfio.c | 85 +-
> drivers/vfio/vfio_iommu_type1.c | 1000 ++++++++++++++++-
> include/linux/iommu.h | 19 +-
> include/linux/vfio.h | 13 +
> include/uapi/linux/iommu.h | 4 +
> include/uapi/linux/vfio.h | 6 +
> 9 files changed, 1181 insertions(+), 23 deletions(-)
>


2021-05-19 18:33:03

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF

On Fri, 9 Apr 2021 11:44:17 +0800
Shenming Lu <[email protected]> wrote:

> Since enabling IOPF for devices may lead to a slow ramp up of performance,
> we add an ioctl VFIO_IOMMU_ENABLE_IOPF to make it configurable. And the
> IOPF enabling of a VFIO device includes setting IOMMU_DEV_FEAT_IOPF and
> registering the VFIO IOPF handler.
>
> Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may be
> inflight page faults when disabling.
>
> Signed-off-by: Shenming Lu <[email protected]>
> ---
> drivers/vfio/vfio_iommu_type1.c | 223 +++++++++++++++++++++++++++++++-
> include/uapi/linux/vfio.h | 6 +
> 2 files changed, 226 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 01e296c6dc9e..7df5711e743a 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -71,6 +71,7 @@ struct vfio_iommu {
> struct rb_root dma_list;
> struct blocking_notifier_head notifier;
> struct mmu_notifier mn;
> + struct mm_struct *mm;

We currently have no requirement that a single mm is used for all
container mappings. Does enabling IOPF impose such a requirement?
Shouldn't MAP/UNMAP enforce that?

> unsigned int dma_avail;
> unsigned int vaddr_invalid_count;
> uint64_t pgsize_bitmap;
> @@ -81,6 +82,7 @@ struct vfio_iommu {
> bool dirty_page_tracking;
> bool pinned_page_dirty_scope;
> bool container_open;
> + bool iopf_enabled;
> };
>
> struct vfio_domain {
> @@ -461,6 +463,38 @@ vfio_find_iopf_group(struct iommu_group *iommu_group)
> return node ? iopf_group : NULL;
> }
>
> +static void vfio_link_iopf_group(struct vfio_iopf_group *new)
> +{
> + struct rb_node **link, *parent = NULL;
> + struct vfio_iopf_group *iopf_group;
> +
> + mutex_lock(&iopf_group_list_lock);
> +
> + link = &iopf_group_list.rb_node;
> +
> + while (*link) {
> + parent = *link;
> + iopf_group = rb_entry(parent, struct vfio_iopf_group, node);
> +
> + if (new->iommu_group < iopf_group->iommu_group)
> + link = &(*link)->rb_left;
> + else
> + link = &(*link)->rb_right;
> + }
> +
> + rb_link_node(&new->node, parent, link);
> + rb_insert_color(&new->node, &iopf_group_list);
> +
> + mutex_unlock(&iopf_group_list_lock);
> +}
> +
> +static void vfio_unlink_iopf_group(struct vfio_iopf_group *old)
> +{
> + mutex_lock(&iopf_group_list_lock);
> + rb_erase(&old->node, &iopf_group_list);
> + mutex_unlock(&iopf_group_list_lock);
> +}
> +
> static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
> {
> struct mm_struct *mm;
> @@ -2363,6 +2397,68 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
> list_splice_tail(iova_copy, iova);
> }
>
> +static int vfio_dev_domian_nested(struct device *dev, int *nested)


s/domian/domain/


> +{
> + struct iommu_domain *domain;
> +
> + domain = iommu_get_domain_for_dev(dev);
> + if (!domain)
> + return -ENODEV;
> +
> + return iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, nested);


Wouldn't this be easier to use if it returned bool, false on -errno?


> +}
> +
> +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data);
> +
> +static int dev_enable_iopf(struct device *dev, void *data)
> +{
> + int *enabled_dev_cnt = data;
> + int nested;
> + u32 flags;
> + int ret;
> +
> + ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);
> + if (ret)
> + return ret;
> +
> + ret = vfio_dev_domian_nested(dev, &nested);
> + if (ret)
> + goto out_disable;
> +
> + if (nested)
> + flags = FAULT_REPORT_NESTED_L2;
> + else
> + flags = FAULT_REPORT_FLAT;
> +
> + ret = iommu_register_device_fault_handler(dev,
> + vfio_iommu_type1_dma_map_iopf, flags, dev);
> + if (ret)
> + goto out_disable;
> +
> + (*enabled_dev_cnt)++;
> + return 0;
> +
> +out_disable:
> + iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF);
> + return ret;
> +}
> +
> +static int dev_disable_iopf(struct device *dev, void *data)
> +{
> + int *enabled_dev_cnt = data;
> +
> + if (enabled_dev_cnt && *enabled_dev_cnt <= 0)
> + return -1;


Use an errno.


> +
> + WARN_ON(iommu_unregister_device_fault_handler(dev));
> + WARN_ON(iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF));
> +
> + if (enabled_dev_cnt)
> + (*enabled_dev_cnt)--;


Why do we need these counters?

> +
> + return 0;
> +}
> +
> static int vfio_iommu_type1_attach_group(void *iommu_data,
> struct iommu_group *iommu_group)
> {
> @@ -2376,6 +2472,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> struct iommu_domain_geometry geo;
> LIST_HEAD(iova_copy);
> LIST_HEAD(group_resv_regions);
> + int iopf_enabled_dev_cnt = 0;
> + struct vfio_iopf_group *iopf_group = NULL;
>
> mutex_lock(&iommu->lock);
>
> @@ -2453,6 +2551,24 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> if (ret)
> goto out_domain;
>
> + if (iommu->iopf_enabled) {
> + ret = iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
> + dev_enable_iopf);
> + if (ret)
> + goto out_detach;
> +
> + iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
> + if (!iopf_group) {
> + ret = -ENOMEM;
> + goto out_detach;
> + }
> +
> + iopf_group->iommu_group = iommu_group;
> + iopf_group->iommu = iommu;
> +
> + vfio_link_iopf_group(iopf_group);

This seems backwards, once we enable iopf we can take a fault, so the
structure to lookup the data for the device should be setup first. I'm
still not convinced this iopf_group rb tree is a good solution to get
from the device to the iommu either. vfio-core could traverse dev ->
iommu_group -> vfio_group -> container -> iommu_data.


> + }
> +
> /* Get aperture info */
> iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo);
>
> @@ -2534,9 +2650,11 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> vfio_test_domain_fgsp(domain);
>
> /* replay mappings on new domains */
> - ret = vfio_iommu_replay(iommu, domain);
> - if (ret)
> - goto out_detach;
> + if (!iommu->iopf_enabled) {
> + ret = vfio_iommu_replay(iommu, domain);
> + if (ret)
> + goto out_detach;
> + }

Test in vfio_iommu_replay()?

>
> if (resv_msi) {
> ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
> @@ -2567,6 +2685,15 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> iommu_domain_free(domain->domain);
> vfio_iommu_iova_free(&iova_copy);
> vfio_iommu_resv_free(&group_resv_regions);
> + if (iommu->iopf_enabled) {
> + if (iopf_group) {
> + vfio_unlink_iopf_group(iopf_group);
> + kfree(iopf_group);
> + }
> +
> + iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
> + dev_disable_iopf);

Backwards, disable fault reporting before unlinking lookup.

> + }
> out_free:
> kfree(domain);
> kfree(group);
> @@ -2728,6 +2855,19 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
> if (!group)
> continue;
>
> + if (iommu->iopf_enabled) {
> + struct vfio_iopf_group *iopf_group;
> +
> + iopf_group = vfio_find_iopf_group(iommu_group);
> + if (!WARN_ON(!iopf_group)) {
> + vfio_unlink_iopf_group(iopf_group);
> + kfree(iopf_group);
> + }
> +
> + iommu_group_for_each_dev(iommu_group, NULL,
> + dev_disable_iopf);


Backwards. Also unregistering the fault handler can fail if there are
pending faults. It appears the fault handler and iopf lookup could also
race with this function, ex. fault handler gets an iopf object before
it's removed from the rb-tree, blocks trying to acquire iommu->lock,
disable_iopf fails, detach proceeds, fault handler has use after free
error.


> + }
> +
> vfio_iommu_detach_group(domain, group);
> update_dirty_scope = !group->pinned_page_dirty_scope;
> list_del(&group->next);
> @@ -2846,6 +2986,11 @@ static void vfio_iommu_type1_release(void *iommu_data)
>
> vfio_iommu_iova_free(&iommu->iova_list);
>
> + if (iommu->iopf_enabled) {
> + mmu_notifier_unregister(&iommu->mn, iommu->mm);
> + mmdrop(iommu->mm);
> + }
> +
> kfree(iommu);
> }
>
> @@ -3441,6 +3586,76 @@ static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
> .invalidate_range = mn_invalidate_range,
> };
>
> +static int vfio_iommu_type1_enable_iopf(struct vfio_iommu *iommu)
> +{
> + struct vfio_domain *d;
> + struct vfio_group *g;
> + struct vfio_iopf_group *iopf_group;
> + int enabled_dev_cnt = 0;
> + int ret;
> +
> + if (!current->mm)
> + return -ENODEV;
> +
> + mutex_lock(&iommu->lock);
> +
> + mmgrab(current->mm);


As above, this is a new and undocumented requirement.


> + iommu->mm = current->mm;
> + iommu->mn.ops = &vfio_iommu_type1_mn_ops;
> + ret = mmu_notifier_register(&iommu->mn, current->mm);
> + if (ret)
> + goto out_drop;
> +
> + list_for_each_entry(d, &iommu->domain_list, next) {
> + list_for_each_entry(g, &d->group_list, next) {
> + ret = iommu_group_for_each_dev(g->iommu_group,
> + &enabled_dev_cnt, dev_enable_iopf);
> + if (ret)
> + goto out_unwind;
> +
> + iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
> + if (!iopf_group) {
> + ret = -ENOMEM;
> + goto out_unwind;
> + }
> +
> + iopf_group->iommu_group = g->iommu_group;
> + iopf_group->iommu = iommu;
> +
> + vfio_link_iopf_group(iopf_group);
> + }
> + }
> +
> + iommu->iopf_enabled = true;
> + goto out_unlock;
> +
> +out_unwind:
> + list_for_each_entry(d, &iommu->domain_list, next) {
> + list_for_each_entry(g, &d->group_list, next) {
> + iopf_group = vfio_find_iopf_group(g->iommu_group);
> + if (iopf_group) {
> + vfio_unlink_iopf_group(iopf_group);
> + kfree(iopf_group);
> + }
> +
> + if (iommu_group_for_each_dev(g->iommu_group,
> + &enabled_dev_cnt, dev_disable_iopf))
> + goto out_unregister;


This seems broken, we break out of the unwind if any group_for_each
fails??

> + }
> + }
> +
> +out_unregister:
> + mmu_notifier_unregister(&iommu->mn, current->mm);
> +
> +out_drop:
> + iommu->mm = NULL;
> + mmdrop(current->mm);
> +
> +out_unlock:
> + mutex_unlock(&iommu->lock);
> + return ret;
> +}


Is there an assumption that the user does this before performing any
mapping? I don't see where current mappings would be converted which
means we'll have pinned pages leaked and vfio_dma objects without a
pinned page bitmap.

> +
> static long vfio_iommu_type1_ioctl(void *iommu_data,
> unsigned int cmd, unsigned long arg)
> {
> @@ -3457,6 +3672,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> return vfio_iommu_type1_unmap_dma(iommu, arg);
> case VFIO_IOMMU_DIRTY_PAGES:
> return vfio_iommu_type1_dirty_pages(iommu, arg);
> + case VFIO_IOMMU_ENABLE_IOPF:
> + return vfio_iommu_type1_enable_iopf(iommu);
> default:
> return -ENOTTY;
> }
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 8ce36c1d53ca..5497036bebdc 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1208,6 +1208,12 @@ struct vfio_iommu_type1_dirty_bitmap_get {
>
> #define VFIO_IOMMU_DIRTY_PAGES _IO(VFIO_TYPE, VFIO_BASE + 17)
>
> +/*
> + * IOCTL to enable IOPF for the container.
> + * Called right after VFIO_SET_IOMMU.
> + */
> +#define VFIO_IOMMU_ENABLE_IOPF _IO(VFIO_TYPE, VFIO_BASE + 18)


We have extensions and flags and capabilities, and nothing tells the
user whether this feature is available to them without trying it?


> +
> /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>
> /*


2021-05-21 09:56:53

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 7/8] vfio/type1: Add selective DMA faulting support

On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:19 +0800
> Shenming Lu <[email protected]> wrote:
>
>> Some devices only allow selective DMA faulting. Similar to the selective
>> dirty page tracking, the vendor driver can call vfio_pin_pages() to
>> indicate the non-faultable scope, we add a new struct vfio_range to
>> record it, then when the IOPF handler receives any page request out
>> of the scope, we can directly return with an invalid response.
>
> Seems like this highlights a deficiency in the design, that the user
> can't specify mappings as iopf enabled or disabled. Also, if the
> vendor driver has pinned pages within the range, shouldn't that prevent
> them from faulting in the first place? Why do we need yet more
> tracking structures? Pages pinned by the vendor driver need to count
> against the user's locked memory limits regardless of iopf. Thanks,

Currently we only have a vfio_pfn struct to track the external pinned pages
(single page granularity), so I add a vfio_range struct for efficient lookup.

Yeah, by this patch, for the non-pinned scope, we can directly return INVALID,
but for the pinned(non-faultable) scope, tracking the pinned range doesn't seem
to help more...

Thanks,
Shenming

>
> Alex
>
> .
>

2021-05-21 09:58:14

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough

Hi Alex,

On 2021/5/19 2:57, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:12 +0800
> Shenming Lu <[email protected]> wrote:
>
>> Hi,
>>
>> Requesting for your comments and suggestions. :-)
>>
>> The static pinning and mapping problem in VFIO and possible solutions
>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
>> Page Fault support for VFIO devices. Different from those relatively
>> complicated software approaches such as presenting a vIOMMU that provides
>> the DMA buffer information (might include para-virtualized optimizations),
>> IOPF mainly depends on the hardware faulting capability, such as the PCIe
>> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
>> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
>> support for VFIO passthrough based on the IOPF part of SVA in this series.
>
> The SVA proposals are being reworked to make use of a new IOASID
> object, it's not clear to me that this shouldn't also make use of that
> work as it does a significant expansion of the type1 IOMMU with fault
> handlers that would duplicate new work using that new model.

It seems that the IOASID extension for guest SVA would not affect this series,
will we do any host-guest IOASID translation in the VFIO fault handler?

>
>> We have measured its performance with UADK [4] (passthrough an accelerator
>> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
>>
>> Run hisi_sec_test...
>> - with varying sending times and message lengths
>> - with/without IOPF enabled (speed slowdown)
>>
>> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
>> slowdown (num of faults)
>> times VFIO IOPF host SVA
>> 1 63.4% (518) 82.8% (512)
>> 100 22.9% (1058) 47.9% (1024)
>> 1000 2.6% (1071) 8.5% (1024)
>>
>> when msg_len = 10MB (and PREMAP_LEN = 512):
>> slowdown (num of faults)
>> times VFIO IOPF
>> 1 32.6% (13)
>> 100 3.5% (26)
>> 1000 1.6% (26)
>
> It seems like this is only an example that you can make a benchmark
> show anything you want. The best results would be to pre-map
> everything, which is what we have without this series. What is an
> acceptable overhead to incur to avoid page pinning? What if userspace
> had more fine grained control over which mappings were available for
> faulting and which were statically mapped? I don't really see what
> sense the pre-mapping range makes. If I assume the user is QEMU in a
> non-vIOMMU configuration, pre-mapping the beginning of each RAM section
> doesn't make any logical sense relative to device DMA.

As you said in Patch 4, we can introduce full end-to-end functionality
before trying to improve performance, and I will drop the pre-mapping patch
in the current stage...

Is there a need that userspace wants more fine grained control over which
mappings are available for faulting? If so, we may evolve the MAP ioctl
to support for specifying the faulting range.

As for the overhead of IOPF, it is unavoidable if enabling on-demand paging
(and page faults occur almost only when first accessing), and it seems that
there is a large optimization space compared to CPU page faulting.

Thanks,
Shenming

>
> Comments per patch to follow. Thanks,
>
> Alex
>
>
>> History:
>>
>> v2 -> v3
>> - Nit fixes.
>> - No reason to disable reporting the unrecoverable faults. (baolu)
>> - Maintain a global IOPF enabled group list.
>> - Split the pre-mapping optimization to be a separate patch.
>> - Add selective faulting support (use vfio_pin_pages to indicate the
>> non-faultable scope and add a new struct vfio_range to record it,
>> untested). (Kevin)
>>
>> v1 -> v2
>> - Numerous improvements following the suggestions. Thanks a lot to all
>> of you.
>>
>> Note that PRI is not supported at the moment since there is no hardware.
>>
>> Links:
>> [1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS,
>> 2016.
>> [2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer Tracking
>> for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020.
>> [3] https://patchwork.kernel.org/project/linux-arm-kernel/cover/[email protected]/
>> [4] https://github.com/Linaro/uadk
>>
>> Thanks,
>> Shenming
>>
>>
>> Shenming Lu (8):
>> iommu: Evolve the device fault reporting framework
>> vfio/type1: Add a page fault handler
>> vfio/type1: Add an MMU notifier to avoid pinning
>> vfio/type1: Pre-map more pages than requested in the IOPF handling
>> vfio/type1: VFIO_IOMMU_ENABLE_IOPF
>> vfio/type1: No need to statically pin and map if IOPF enabled
>> vfio/type1: Add selective DMA faulting support
>> vfio: Add nested IOPF support
>>
>> .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 3 +-
>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 +-
>> drivers/iommu/iommu.c | 56 +-
>> drivers/vfio/vfio.c | 85 +-
>> drivers/vfio/vfio_iommu_type1.c | 1000 ++++++++++++++++-
>> include/linux/iommu.h | 19 +-
>> include/linux/vfio.h | 13 +
>> include/uapi/linux/iommu.h | 4 +
>> include/uapi/linux/vfio.h | 6 +
>> 9 files changed, 1181 insertions(+), 23 deletions(-)
>>
>
> .
>

2021-05-21 18:23:40

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 4/8] vfio/type1: Pre-map more pages than requested in the IOPF handling

On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:16 +0800
> Shenming Lu <[email protected]> wrote:
>
>> To optimize for fewer page fault handlings, we can pre-map more pages
>> than requested at once.
>>
>> Note that IOPF_PREMAP_LEN is just an arbitrary value for now, which we
>> could try further tuning.
>
> I'd prefer that the series introduced full end-to-end functionality
> before trying to improve performance. The pre-map value seems
> arbitrary here and as noted in the previous patch, the IOMMU API does
> not guarantee unmaps of ranges smaller than the original mapping. This
> would need to map with single page granularity in order to guarantee
> page granularity at the mmu notifier when the IOMMU supports
> superpages. Thanks,

Yeah, I will drop this patch in the current stage.

Thanks,
Shenming

>
> Alex
>
> .
>

2021-05-21 20:08:01

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/8] vfio/type1: Add an MMU notifier to avoid pinning

On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:15 +0800
> Shenming Lu <[email protected]> wrote:
>
>> To avoid pinning pages when they are mapped in IOMMU page tables, we
>> add an MMU notifier to tell the addresses which are no longer valid
>> and try to unmap them.
>>
>> Signed-off-by: Shenming Lu <[email protected]>
>> ---
>> drivers/vfio/vfio_iommu_type1.c | 112 +++++++++++++++++++++++++++++++-
>> 1 file changed, 109 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index ab0ff60ee207..1cb9d1f2717b 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -40,6 +40,7 @@
>> #include <linux/notifier.h>
>> #include <linux/dma-iommu.h>
>> #include <linux/irqdomain.h>
>> +#include <linux/mmu_notifier.h>
>>
>> #define DRIVER_VERSION "0.2"
>> #define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
>> @@ -69,6 +70,7 @@ struct vfio_iommu {
>> struct mutex lock;
>> struct rb_root dma_list;
>> struct blocking_notifier_head notifier;
>> + struct mmu_notifier mn;
>> unsigned int dma_avail;
>> unsigned int vaddr_invalid_count;
>> uint64_t pgsize_bitmap;
>> @@ -1204,6 +1206,72 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>> return unlocked;
>> }
>>
>> +/* Unmap the IOPF mapped pages in the specified range. */
>> +static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu,
>> + struct vfio_dma *dma,
>> + dma_addr_t start, dma_addr_t end)
>> +{
>> + struct iommu_iotlb_gather *gathers;
>> + struct vfio_domain *d;
>> + int i, num_domains = 0;
>> +
>> + list_for_each_entry(d, &iommu->domain_list, next)
>> + num_domains++;
>> +
>> + gathers = kzalloc(sizeof(*gathers) * num_domains, GFP_KERNEL);
>> + if (gathers) {
>> + for (i = 0; i < num_domains; i++)
>> + iommu_iotlb_gather_init(&gathers[i]);
>> + }
>
>
> If we're always serialized on iommu->lock, would it make sense to have
> gathers pre-allocated on the vfio_iommu object?

Yeah, we can do it.

>
>> +
>> + while (start < end) {
>> + unsigned long bit_offset;
>> + size_t len;
>> +
>> + bit_offset = (start - dma->iova) >> PAGE_SHIFT;
>> +
>> + for (len = 0; start + len < end; len += PAGE_SIZE) {
>> + if (!IOPF_MAPPED_BITMAP_GET(dma,
>> + bit_offset + (len >> PAGE_SHIFT)))
>> + break;
>
>
> There are bitmap helpers for this, find_first_bit(),
> find_next_zero_bit().

Thanks for the hint. :-)

>
>
>> + }
>> +
>> + if (len) {
>> + i = 0;
>> + list_for_each_entry(d, &iommu->domain_list, next) {
>> + size_t unmapped;
>> +
>> + if (gathers)
>> + unmapped = iommu_unmap_fast(d->domain,
>> + start, len,
>> + &gathers[i++]);
>> + else
>> + unmapped = iommu_unmap(d->domain,
>> + start, len);
>> +
>> + if (WARN_ON(unmapped != len))
>
> The IOMMU API does not guarantee arbitrary unmap sizes, this will
> trigger and this exit path is wrong. If we've already unmapped the
> IOMMU, shouldn't we proceed with @unmapped rather than @len so the
> device can re-fault the extra mappings? Otherwise the IOMMU state
> doesn't match the iopf bitmap state.

OK, I will correct it.

And can we assume that the @unmapped values (returned from iommu_unmap)
of all domains are the same (since all domains share the same iopf_mapped_bitmap)?

>
>> + goto out;
>> + }
>> +
>> + bitmap_clear(dma->iopf_mapped_bitmap,
>> + bit_offset, len >> PAGE_SHIFT);
>> +
>> + cond_resched();
>> + }
>> +
>> + start += (len + PAGE_SIZE);
>> + }
>> +
>> +out:
>> + if (gathers) {
>> + i = 0;
>> + list_for_each_entry(d, &iommu->domain_list, next)
>> + iommu_iotlb_sync(d->domain, &gathers[i++]);
>> +
>> + kfree(gathers);
>> + }
>> +}
>> +
>> static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>> {
>> WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
>> @@ -3197,17 +3265,18 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
>>
>> vaddr = iova - dma->iova + dma->vaddr;
>>
>> - if (vfio_pin_page_external(dma, vaddr, &pfn, true))
>> + if (vfio_pin_page_external(dma, vaddr, &pfn, false))
>> goto out_invalid;
>>
>> if (vfio_iommu_map(iommu, iova, pfn, 1, dma->prot)) {
>> - if (put_pfn(pfn, dma->prot))
>> - vfio_lock_acct(dma, -1, true);
>> + put_pfn(pfn, dma->prot);
>> goto out_invalid;
>> }
>>
>> bitmap_set(dma->iopf_mapped_bitmap, bit_offset, 1);
>>
>> + put_pfn(pfn, dma->prot);
>> +
>> out_success:
>> status = IOMMU_PAGE_RESP_SUCCESS;
>>
>> @@ -3220,6 +3289,43 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
>> return 0;
>> }
>>
>> +static void mn_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
>> + unsigned long start, unsigned long end)
>> +{
>> + struct vfio_iommu *iommu = container_of(mn, struct vfio_iommu, mn);
>> + struct rb_node *n;
>> + int ret;
>> +
>> + mutex_lock(&iommu->lock);
>> +
>> + ret = vfio_wait_all_valid(iommu);
>> + if (WARN_ON(ret < 0))
>> + return;
>
> Is WARN_ON sufficient for this error condition? We've been told to
> evacuate a range of mm, the device still has DMA access, we've removed
> page pinning. This seems like a BUG_ON condition to me, we can't allow
> the system to continue in any way with pages getting unmapped from
> under the device.

Make sense.

>
>> +
>> + for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
>> + struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
>> + unsigned long start_n, end_n;
>> +
>> + if (end <= dma->vaddr || start >= dma->vaddr + dma->size)
>> + continue;
>> +
>> + start_n = ALIGN_DOWN(max_t(unsigned long, start, dma->vaddr),
>> + PAGE_SIZE);
>> + end_n = ALIGN(min_t(unsigned long, end, dma->vaddr + dma->size),
>> + PAGE_SIZE);
>> +
>> + vfio_unmap_partial_iopf(iommu, dma,
>> + start_n - dma->vaddr + dma->iova,
>> + end_n - dma->vaddr + dma->iova);
>> + }
>> +
>> + mutex_unlock(&iommu->lock);
>> +}
>> +
>> +static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
>> + .invalidate_range = mn_invalidate_range,
>> +};
>> +
>> static long vfio_iommu_type1_ioctl(void *iommu_data,
>> unsigned int cmd, unsigned long arg)
>> {
>
> Again, this patch series is difficult to follow because we're
> introducing dead code until the mmu notifier actually has a path to be
> registered.

Sorry again for this, I will be careful for the sequence of the series later.

> We shouldn't be taking any faults until iopf is enabled,
> so it seems like we can add more of the core support alongside this
> code.

Could you be more specific about this? :-)

We can add a requirement that the VFIO_IOMMU_ENABLE_IOPF ioctl (in Patch 5)
must be called right after VFIO_SET_IOMMU, so that IOPF is enabled before
setting up any DMA mapping. Is this sufficient?

Thanks,
Shenming

>
> .
>

2021-05-21 20:08:04

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 6/8] vfio/type1: No need to statically pin and map if IOPF enabled

On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:18 +0800
> Shenming Lu <[email protected]> wrote:
>
>> If IOPF enabled for the VFIO container, there is no need to statically
>> pin and map the entire DMA range, we can do it on demand. And unmap
>> according to the IOPF mapped bitmap when removing vfio_dma.
>>
>> Note that we still mark all pages dirty even if IOPF enabled, we may
>> add IOPF-based fine grained dirty tracking support in the future.
>>
>> Signed-off-by: Shenming Lu <[email protected]>
>> ---
>> drivers/vfio/vfio_iommu_type1.c | 38 +++++++++++++++++++++++++++------
>> 1 file changed, 32 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 7df5711e743a..dcc93c3b258c 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -175,6 +175,7 @@ struct vfio_iopf_group {
>> #define IOPF_MAPPED_BITMAP_GET(dma, i) \
>> ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG] \
>> >> ((i) % BITS_PER_LONG)) & 0x1)
>> +#define IOPF_MAPPED_BITMAP_BYTES(n) DIRTY_BITMAP_BYTES(n)
>>
>> #define WAITED 1
>>
>> @@ -959,7 +960,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>> * already pinned and accounted. Accouting should be done if there is no
>> * iommu capable domain in the container.
>> */
>> - do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
>> + do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
>> + iommu->iopf_enabled;
>>
>> for (i = 0; i < npage; i++) {
>> struct vfio_pfn *vpfn;
>> @@ -1048,7 +1050,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
>>
>> mutex_lock(&iommu->lock);
>>
>> - do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
>> + do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
>> + iommu->iopf_enabled;
>
> pin/unpin are actually still pinning pages, why does iopf exempt them
> from accounting?

If iopf_enabled is true, do_accounting will be true too, we will account
the external pinned pages?

>
>
>> for (i = 0; i < npage; i++) {
>> struct vfio_dma *dma;
>> dma_addr_t iova;
>> @@ -1169,7 +1172,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>> if (!dma->size)
>> return 0;
>>
>> - if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
>> + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
>> return 0;
>>
>> /*
>> @@ -1306,11 +1309,20 @@ static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu,
>> }
>> }
>>
>> +static void vfio_dma_clean_iopf(struct vfio_iommu *iommu, struct vfio_dma *dma)
>> +{
>> + vfio_unmap_partial_iopf(iommu, dma, dma->iova, dma->iova + dma->size);
>> +
>> + kfree(dma->iopf_mapped_bitmap);
>> +}
>> +
>> static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>> {
>> WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
>> vfio_unmap_unpin(iommu, dma, true);
>> vfio_unlink_dma(iommu, dma);
>> + if (iommu->iopf_enabled)
>> + vfio_dma_clean_iopf(iommu, dma);
>> put_task_struct(dma->task);
>> vfio_dma_bitmap_free(dma);
>> if (dma->vaddr_invalid) {
>> @@ -1359,7 +1371,8 @@ static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
>> * mark all pages dirty if any IOMMU capable device is not able
>> * to report dirty pages and all pages are pinned and mapped.
>> */
>> - if (iommu->num_non_pinned_groups && dma->iommu_mapped)
>> + if (iommu->num_non_pinned_groups &&
>> + (dma->iommu_mapped || iommu->iopf_enabled))
>> bitmap_set(dma->bitmap, 0, nbits);
>
> This seems like really poor integration of iopf into dirty page
> tracking. I'd expect dirty logging to flush the mapped pages and
> write faults to mark pages dirty. Shouldn't the fault handler also
> provide only the access faulted, so for example a read fault wouldn't
> mark the page dirty?
I just want to keep the behavior here as before, if IOPF enabled, we
will still mark all pages dirty.

We can distinguish between write and read faults in the fault handler,
so there is a way to add IOPF-based fine grained dirty tracking support...
But I am not sure whether there is a need to implement this, we can
consider this in the future?

>
>>
>> if (shift) {
>> @@ -1772,6 +1785,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>> goto out_unlock;
>> }
>>
>> + if (iommu->iopf_enabled) {
>> + dma->iopf_mapped_bitmap = kvzalloc(IOPF_MAPPED_BITMAP_BYTES(
>> + size >> PAGE_SHIFT), GFP_KERNEL);
>> + if (!dma->iopf_mapped_bitmap) {
>> + ret = -ENOMEM;
>> + kfree(dma);
>> + goto out_unlock;
>> + }
>
>
> So we're assuming nothing can fault and therefore nothing can reference
> the iopf_mapped_bitmap until this point in the series?

I will move this to the front of this series.

Thanks,
Shenming

>
>
>> + }
>> +
>> iommu->dma_avail--;
>> dma->iova = iova;
>> dma->vaddr = vaddr;
>> @@ -1811,8 +1834,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>> /* Insert zero-sized and grow as we map chunks of it */
>> vfio_link_dma(iommu, dma);
>>
>> - /* Don't pin and map if container doesn't contain IOMMU capable domain*/
>> - if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
>> + /*
>> + * Don't pin and map if container doesn't contain IOMMU capable domain,
>> + * or IOPF enabled for the container.
>> + */
>> + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
>> dma->size = size;
>> else
>> ret = vfio_pin_map_dma(iommu, dma, size);
>
> .
>

2021-05-21 20:08:10

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF

On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:17 +0800
> Shenming Lu <[email protected]> wrote:
>
>> Since enabling IOPF for devices may lead to a slow ramp up of performance,
>> we add an ioctl VFIO_IOMMU_ENABLE_IOPF to make it configurable. And the
>> IOPF enabling of a VFIO device includes setting IOMMU_DEV_FEAT_IOPF and
>> registering the VFIO IOPF handler.
>>
>> Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may be
>> inflight page faults when disabling.
>>
>> Signed-off-by: Shenming Lu <[email protected]>
>> ---
>> drivers/vfio/vfio_iommu_type1.c | 223 +++++++++++++++++++++++++++++++-
>> include/uapi/linux/vfio.h | 6 +
>> 2 files changed, 226 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 01e296c6dc9e..7df5711e743a 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -71,6 +71,7 @@ struct vfio_iommu {
>> struct rb_root dma_list;
>> struct blocking_notifier_head notifier;
>> struct mmu_notifier mn;
>> + struct mm_struct *mm;
>
> We currently have no requirement that a single mm is used for all
> container mappings. Does enabling IOPF impose such a requirement?
> Shouldn't MAP/UNMAP enforce that?

Did you mean that there is a possibility that each vfio_dma in a
container has a different mm? If so, could we register one MMU
notifier for each vfio_dma in the MAP ioctl?

>
>> unsigned int dma_avail;
>> unsigned int vaddr_invalid_count;
>> uint64_t pgsize_bitmap;
>> @@ -81,6 +82,7 @@ struct vfio_iommu {
>> bool dirty_page_tracking;
>> bool pinned_page_dirty_scope;
>> bool container_open;
>> + bool iopf_enabled;
>> };
>>
>> struct vfio_domain {
>> @@ -461,6 +463,38 @@ vfio_find_iopf_group(struct iommu_group *iommu_group)
>> return node ? iopf_group : NULL;
>> }
>>
>> +static void vfio_link_iopf_group(struct vfio_iopf_group *new)
>> +{
>> + struct rb_node **link, *parent = NULL;
>> + struct vfio_iopf_group *iopf_group;
>> +
>> + mutex_lock(&iopf_group_list_lock);
>> +
>> + link = &iopf_group_list.rb_node;
>> +
>> + while (*link) {
>> + parent = *link;
>> + iopf_group = rb_entry(parent, struct vfio_iopf_group, node);
>> +
>> + if (new->iommu_group < iopf_group->iommu_group)
>> + link = &(*link)->rb_left;
>> + else
>> + link = &(*link)->rb_right;
>> + }
>> +
>> + rb_link_node(&new->node, parent, link);
>> + rb_insert_color(&new->node, &iopf_group_list);
>> +
>> + mutex_unlock(&iopf_group_list_lock);
>> +}
>> +
>> +static void vfio_unlink_iopf_group(struct vfio_iopf_group *old)
>> +{
>> + mutex_lock(&iopf_group_list_lock);
>> + rb_erase(&old->node, &iopf_group_list);
>> + mutex_unlock(&iopf_group_list_lock);
>> +}
>> +
>> static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>> {
>> struct mm_struct *mm;
>> @@ -2363,6 +2397,68 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
>> list_splice_tail(iova_copy, iova);
>> }
>>
>> +static int vfio_dev_domian_nested(struct device *dev, int *nested)
>
>
> s/domian/domain/
>
>
>> +{
>> + struct iommu_domain *domain;
>> +
>> + domain = iommu_get_domain_for_dev(dev);
>> + if (!domain)
>> + return -ENODEV;
>> +
>> + return iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, nested);
>
>
> Wouldn't this be easier to use if it returned bool, false on -errno?

Make sense.

>
>
>> +}
>> +
>> +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data);
>> +
>> +static int dev_enable_iopf(struct device *dev, void *data)
>> +{
>> + int *enabled_dev_cnt = data;
>> + int nested;
>> + u32 flags;
>> + int ret;
>> +
>> + ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);
>> + if (ret)
>> + return ret;
>> +
>> + ret = vfio_dev_domian_nested(dev, &nested);
>> + if (ret)
>> + goto out_disable;
>> +
>> + if (nested)
>> + flags = FAULT_REPORT_NESTED_L2;
>> + else
>> + flags = FAULT_REPORT_FLAT;
>> +
>> + ret = iommu_register_device_fault_handler(dev,
>> + vfio_iommu_type1_dma_map_iopf, flags, dev);
>> + if (ret)
>> + goto out_disable;
>> +
>> + (*enabled_dev_cnt)++;
>> + return 0;
>> +
>> +out_disable:
>> + iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF);
>> + return ret;
>> +}
>> +
>> +static int dev_disable_iopf(struct device *dev, void *data)
>> +{
>> + int *enabled_dev_cnt = data;
>> +
>> + if (enabled_dev_cnt && *enabled_dev_cnt <= 0)
>> + return -1;
>
>
> Use an errno.>
>
>> +
>> + WARN_ON(iommu_unregister_device_fault_handler(dev));
>> + WARN_ON(iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF));
>> +
>> + if (enabled_dev_cnt)
>> + (*enabled_dev_cnt)--;
>
>
> Why do we need these counters?

We use this counter to record the number of IOPF enabled devices, and if
any device fails to be enabled, we will exit the loop (three nested loop)
and go to the unwind process, which will disable the first @enabled_dev_cnt
devices.

>
>> +
>> + return 0;
>> +}
>> +
>> static int vfio_iommu_type1_attach_group(void *iommu_data,
>> struct iommu_group *iommu_group)
>> {
>> @@ -2376,6 +2472,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>> struct iommu_domain_geometry geo;
>> LIST_HEAD(iova_copy);
>> LIST_HEAD(group_resv_regions);
>> + int iopf_enabled_dev_cnt = 0;
>> + struct vfio_iopf_group *iopf_group = NULL;
>>
>> mutex_lock(&iommu->lock);
>>
>> @@ -2453,6 +2551,24 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>> if (ret)
>> goto out_domain;
>>
>> + if (iommu->iopf_enabled) {
>> + ret = iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
>> + dev_enable_iopf);
>> + if (ret)
>> + goto out_detach;
>> +
>> + iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
>> + if (!iopf_group) {
>> + ret = -ENOMEM;
>> + goto out_detach;
>> + }
>> +
>> + iopf_group->iommu_group = iommu_group;
>> + iopf_group->iommu = iommu;
>> +
>> + vfio_link_iopf_group(iopf_group);
>
> This seems backwards, once we enable iopf we can take a fault, so the
> structure to lookup the data for the device should be setup first. I'm
> still not convinced this iopf_group rb tree is a good solution to get
> from the device to the iommu either. vfio-core could traverse dev ->
> iommu_group -> vfio_group -> container -> iommu_data.

Make sense.

>
>
>> + }
>> +
>> /* Get aperture info */
>> iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo);
>>
>> @@ -2534,9 +2650,11 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>> vfio_test_domain_fgsp(domain);
>>
>> /* replay mappings on new domains */
>> - ret = vfio_iommu_replay(iommu, domain);
>> - if (ret)
>> - goto out_detach;
>> + if (!iommu->iopf_enabled) {
>> + ret = vfio_iommu_replay(iommu, domain);
>> + if (ret)
>> + goto out_detach;
>> + }
>
> Test in vfio_iommu_replay()?

Not yet, I will do more tests later.

>
>>
>> if (resv_msi) {
>> ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
>> @@ -2567,6 +2685,15 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>> iommu_domain_free(domain->domain);
>> vfio_iommu_iova_free(&iova_copy);
>> vfio_iommu_resv_free(&group_resv_regions);
>> + if (iommu->iopf_enabled) {
>> + if (iopf_group) {
>> + vfio_unlink_iopf_group(iopf_group);
>> + kfree(iopf_group);
>> + }
>> +
>> + iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
>> + dev_disable_iopf);
>
> Backwards, disable fault reporting before unlinking lookup.

Make sense.

>
>> + }
>> out_free:
>> kfree(domain);
>> kfree(group);
>> @@ -2728,6 +2855,19 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>> if (!group)
>> continue;
>>
>> + if (iommu->iopf_enabled) {
>> + struct vfio_iopf_group *iopf_group;
>> +
>> + iopf_group = vfio_find_iopf_group(iommu_group);
>> + if (!WARN_ON(!iopf_group)) {
>> + vfio_unlink_iopf_group(iopf_group);
>> + kfree(iopf_group);
>> + }
>> +
>> + iommu_group_for_each_dev(iommu_group, NULL,
>> + dev_disable_iopf);
>
>
> Backwards. Also unregistering the fault handler can fail if there are
> pending faults. It appears the fault handler and iopf lookup could also
> race with this function, ex. fault handler gets an iopf object before
> it's removed from the rb-tree, blocks trying to acquire iommu->lock,
> disable_iopf fails, detach proceeds, fault handler has use after free
> error.

We have WARN_ON for the failure of the unregistering.

Yeah, it seems that using vfio_iopf_group is not a good choice...

>
>
>> + }
>> +
>> vfio_iommu_detach_group(domain, group);
>> update_dirty_scope = !group->pinned_page_dirty_scope;
>> list_del(&group->next);
>> @@ -2846,6 +2986,11 @@ static void vfio_iommu_type1_release(void *iommu_data)
>>
>> vfio_iommu_iova_free(&iommu->iova_list);
>>
>> + if (iommu->iopf_enabled) {
>> + mmu_notifier_unregister(&iommu->mn, iommu->mm);
>> + mmdrop(iommu->mm);
>> + }
>> +
>> kfree(iommu);
>> }
>>
>> @@ -3441,6 +3586,76 @@ static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
>> .invalidate_range = mn_invalidate_range,
>> };
>>
>> +static int vfio_iommu_type1_enable_iopf(struct vfio_iommu *iommu)
>> +{
>> + struct vfio_domain *d;
>> + struct vfio_group *g;
>> + struct vfio_iopf_group *iopf_group;
>> + int enabled_dev_cnt = 0;
>> + int ret;
>> +
>> + if (!current->mm)
>> + return -ENODEV;
>> +
>> + mutex_lock(&iommu->lock);
>> +
>> + mmgrab(current->mm);
>
>
> As above, this is a new and undocumented requirement.
>
>
>> + iommu->mm = current->mm;
>> + iommu->mn.ops = &vfio_iommu_type1_mn_ops;
>> + ret = mmu_notifier_register(&iommu->mn, current->mm);
>> + if (ret)
>> + goto out_drop;
>> +
>> + list_for_each_entry(d, &iommu->domain_list, next) {
>> + list_for_each_entry(g, &d->group_list, next) {
>> + ret = iommu_group_for_each_dev(g->iommu_group,
>> + &enabled_dev_cnt, dev_enable_iopf);
>> + if (ret)
>> + goto out_unwind;
>> +
>> + iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
>> + if (!iopf_group) {
>> + ret = -ENOMEM;
>> + goto out_unwind;
>> + }
>> +
>> + iopf_group->iommu_group = g->iommu_group;
>> + iopf_group->iommu = iommu;
>> +
>> + vfio_link_iopf_group(iopf_group);
>> + }
>> + }
>> +
>> + iommu->iopf_enabled = true;
>> + goto out_unlock;
>> +
>> +out_unwind:
>> + list_for_each_entry(d, &iommu->domain_list, next) {
>> + list_for_each_entry(g, &d->group_list, next) {
>> + iopf_group = vfio_find_iopf_group(g->iommu_group);
>> + if (iopf_group) {
>> + vfio_unlink_iopf_group(iopf_group);
>> + kfree(iopf_group);
>> + }
>> +
>> + if (iommu_group_for_each_dev(g->iommu_group,
>> + &enabled_dev_cnt, dev_disable_iopf))
>> + goto out_unregister;
>
>
> This seems broken, we break out of the unwind if any group_for_each
> fails??

The iommu_group_for_each_dev function will return a negative value if
we have disabled all previously enabled devices (@enabled_dev_cnt).

>
>> + }
>> + }
>> +
>> +out_unregister:
>> + mmu_notifier_unregister(&iommu->mn, current->mm);
>> +
>> +out_drop:
>> + iommu->mm = NULL;
>> + mmdrop(current->mm);
>> +
>> +out_unlock:
>> + mutex_unlock(&iommu->lock);
>> + return ret;
>> +}
>
>
> Is there an assumption that the user does this before performing any
> mapping? I don't see where current mappings would be converted which
> means we'll have pinned pages leaked and vfio_dma objects without a
> pinned page bitmap.

Yeah, we have an assumption that this ioctl (ENABLE_IOPF) must be called
before any MAP/UNMAP ioctl...

I will document these later. :-)

>
>> +
>> static long vfio_iommu_type1_ioctl(void *iommu_data,
>> unsigned int cmd, unsigned long arg)
>> {
>> @@ -3457,6 +3672,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>> return vfio_iommu_type1_unmap_dma(iommu, arg);
>> case VFIO_IOMMU_DIRTY_PAGES:
>> return vfio_iommu_type1_dirty_pages(iommu, arg);
>> + case VFIO_IOMMU_ENABLE_IOPF:
>> + return vfio_iommu_type1_enable_iopf(iommu);
>> default:
>> return -ENOTTY;
>> }
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 8ce36c1d53ca..5497036bebdc 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -1208,6 +1208,12 @@ struct vfio_iommu_type1_dirty_bitmap_get {
>>
>> #define VFIO_IOMMU_DIRTY_PAGES _IO(VFIO_TYPE, VFIO_BASE + 17)
>>
>> +/*
>> + * IOCTL to enable IOPF for the container.
>> + * Called right after VFIO_SET_IOMMU.
>> + */
>> +#define VFIO_IOMMU_ENABLE_IOPF _IO(VFIO_TYPE, VFIO_BASE + 18)
>
>
> We have extensions and flags and capabilities, and nothing tells the
> user whether this feature is available to them without trying it?

Yeah, we can have a capability for IOPF.
It is rough now... I will try to extend it later. :-)

Thanks,
Shenming

>
>
>> +
>> /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>>
>> /*
>
> .
>

2021-05-21 20:08:57

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support

On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:20 +0800
> Shenming Lu <[email protected]> wrote:
>
>> To set up nested mode, drivers such as vfio_pci need to register a
>> handler to receive stage/level 1 faults from the IOMMU, but since
>> currently each device can only have one iommu dev fault handler,
>> and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF),
>> we choose to update the registered handler (a consolidated one) via
>> flags (set FAULT_REPORT_NESTED_L1), and further deliver the received
>> stage 1 faults in the handler to the guest through a newly added
>> vfio_device_ops callback.
>
> Are there proposed in-kernel drivers that would use any of these
> symbols?

I hope that such as Eric's SMMUv3 Nested Stage Setup series [1] can
use these symbols to consolidate the two page fault handlers into one.

[1] https://patchwork.kernel.org/project/kvm/cover/[email protected]/

>
>> Signed-off-by: Shenming Lu <[email protected]>
>> ---
>> drivers/vfio/vfio.c | 81 +++++++++++++++++++++++++++++++++
>> drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++-
>> include/linux/vfio.h | 12 +++++
>> 3 files changed, 141 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>> index 44c8dfabf7de..4245f15914bf 100644
>> --- a/drivers/vfio/vfio.c
>> +++ b/drivers/vfio/vfio.c
>> @@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
>> }
>> EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
>>
>> +/*
>> + * Register/Update the VFIO IOPF handler to receive
>> + * nested stage/level 1 faults.
>> + */
>> +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
>> +{
>> + struct vfio_container *container;
>> + struct vfio_group *group;
>> + struct vfio_iommu_driver *driver;
>> + int ret;
>> +
>> + if (!dev)
>> + return -EINVAL;
>> +
>> + group = vfio_group_get_from_dev(dev);
>> + if (!group)
>> + return -ENODEV;
>> +
>> + ret = vfio_group_add_container_user(group);
>> + if (ret)
>> + goto out;
>> +
>> + container = group->container;
>> + driver = container->iommu_driver;
>> + if (likely(driver && driver->ops->register_handler))
>> + ret = driver->ops->register_handler(container->iommu_data, dev);
>> + else
>> + ret = -ENOTTY;
>> +
>> + vfio_group_try_dissolve_container(group);
>> +
>> +out:
>> + vfio_group_put(group);
>> + return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
>> +
>> +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
>> +{
>> + struct vfio_container *container;
>> + struct vfio_group *group;
>> + struct vfio_iommu_driver *driver;
>> + int ret;
>> +
>> + if (!dev)
>> + return -EINVAL;
>> +
>> + group = vfio_group_get_from_dev(dev);
>> + if (!group)
>> + return -ENODEV;
>> +
>> + ret = vfio_group_add_container_user(group);
>> + if (ret)
>> + goto out;
>> +
>> + container = group->container;
>> + driver = container->iommu_driver;
>> + if (likely(driver && driver->ops->unregister_handler))
>> + ret = driver->ops->unregister_handler(container->iommu_data, dev);
>> + else
>> + ret = -ENOTTY;
>> +
>> + vfio_group_try_dissolve_container(group);
>> +
>> +out:
>> + vfio_group_put(group);
>> + return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested);
>> +
>> +int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault *fault)
>> +{
>> + struct vfio_device *device = dev_get_drvdata(dev);
>> +
>> + if (unlikely(!device->ops->transfer))
>> + return -EOPNOTSUPP;
>> +
>> + return device->ops->transfer(device->device_data, fault);
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_transfer_iommu_fault);
>> +
>> /**
>> * Module/class support
>> */
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index ba2b5a1cf6e9..9d1adeddb303 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -3821,13 +3821,32 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
>> struct vfio_batch batch;
>> struct vfio_range *range;
>> dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
>> - int access_flags = 0;
>> + int access_flags = 0, nested;
>> size_t premap_len, map_len, mapped_len = 0;
>> unsigned long bit_offset, vaddr, pfn, i, npages;
>> int ret;
>> enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
>> struct iommu_page_response resp = {0};
>>
>> + if (vfio_dev_domian_nested(dev, &nested))
>> + return -ENODEV;
>> +
>> + /*
>> + * When configured in nested mode, further deliver the
>> + * stage/level 1 faults to the guest.
>> + */
>> + if (nested) {
>> + bool l2;
>> +
>> + if (fault->type == IOMMU_FAULT_PAGE_REQ)
>> + l2 = fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2;
>> + if (fault->type == IOMMU_FAULT_DMA_UNRECOV)
>> + l2 = fault->event.flags & IOMMU_FAULT_UNRECOV_L2;
>> +
>> + if (!l2)
>> + return vfio_transfer_iommu_fault(dev, fault);
>> + }
>> +
>> if (fault->type != IOMMU_FAULT_PAGE_REQ)
>> return -EOPNOTSUPP;
>>
>> @@ -4201,6 +4220,32 @@ static void vfio_iommu_type1_notify(void *iommu_data,
>> wake_up_all(&iommu->vaddr_wait);
>> }
>>
>> +static int vfio_iommu_type1_register_handler(void *iommu_data,
>> + struct device *dev)
>> +{
>> + struct vfio_iommu *iommu = iommu_data;
>> +
>> + if (iommu->iopf_enabled)
>> + return iommu_update_device_fault_handler(dev, ~0,
>> + FAULT_REPORT_NESTED_L1);
>> + else
>> + return iommu_register_device_fault_handler(dev,
>> + vfio_iommu_type1_dma_map_iopf,
>> + FAULT_REPORT_NESTED_L1, dev);
>> +}
>> +
>> +static int vfio_iommu_type1_unregister_handler(void *iommu_data,
>> + struct device *dev)
>> +{
>> + struct vfio_iommu *iommu = iommu_data;
>> +
>> + if (iommu->iopf_enabled)
>> + return iommu_update_device_fault_handler(dev,
>> + ~FAULT_REPORT_NESTED_L1, 0);
>> + else
>> + return iommu_unregister_device_fault_handler(dev);
>> +}
>
>
> The path through vfio to register this is pretty ugly, but I don't see
> any reason for the the update interfaces here, the previously
> registered handler just changes its behavior.

Yeah, this seems not an elegant way...

If IOPF(L2) enabled, the fault handler has already been registered, so for
nested mode setup, we only need to change the flags of the handler in the
IOMMU driver to receive L1 faults.
(assume that L1 IOPF is configured after L2 IOPF)

Currently each device can only have one iommu dev fault handler, and L1
and L2 IOPF are configured separately in nested mode, I am also wondering
that is there a better solution for this.

Thanks,
Shenming

>
>
>> +
>> static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>> .name = "vfio-iommu-type1",
>> .owner = THIS_MODULE,
>> @@ -4216,6 +4261,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>> .dma_rw = vfio_iommu_type1_dma_rw,
>> .group_iommu_domain = vfio_iommu_type1_group_iommu_domain,
>> .notify = vfio_iommu_type1_notify,
>> + .register_handler = vfio_iommu_type1_register_handler,
>> + .unregister_handler = vfio_iommu_type1_unregister_handler,
>> };
>>
>> static int __init vfio_iommu_type1_init(void)
>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>> index a7b426d579df..4621d8f0395d 100644
>> --- a/include/linux/vfio.h
>> +++ b/include/linux/vfio.h
>> @@ -29,6 +29,8 @@
>> * @match: Optional device name match callback (return: 0 for no-match, >0 for
>> * match, -errno for abort (ex. match with insufficient or incorrect
>> * additional args)
>> + * @transfer: Optional. Transfer the received stage/level 1 faults to the guest
>> + * for nested mode.
>> */
>> struct vfio_device_ops {
>> char *name;
>> @@ -43,6 +45,7 @@ struct vfio_device_ops {
>> int (*mmap)(void *device_data, struct vm_area_struct *vma);
>> void (*request)(void *device_data, unsigned int count);
>> int (*match)(void *device_data, char *buf);
>> + int (*transfer)(void *device_data, struct iommu_fault *fault);
>> };
>>
>> extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
>> @@ -100,6 +103,10 @@ struct vfio_iommu_driver_ops {
>> struct iommu_group *group);
>> void (*notify)(void *iommu_data,
>> enum vfio_iommu_notify_type event);
>> + int (*register_handler)(void *iommu_data,
>> + struct device *dev);
>> + int (*unregister_handler)(void *iommu_data,
>> + struct device *dev);
>> };
>>
>> extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
>> @@ -161,6 +168,11 @@ extern int vfio_unregister_notifier(struct device *dev,
>> struct kvm;
>> extern void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm);
>>
>> +extern int vfio_iommu_dev_fault_handler_register_nested(struct device *dev);
>> +extern int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev);
>> +extern int vfio_transfer_iommu_fault(struct device *dev,
>> + struct iommu_fault *fault);
>> +
>> /*
>> * Sub-module helpers
>> */
>
> .
>

2021-05-24 13:13:19

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support

On 2021/5/21 15:59, Shenming Lu wrote:
> On 2021/5/19 2:58, Alex Williamson wrote:
>> On Fri, 9 Apr 2021 11:44:20 +0800
>> Shenming Lu <[email protected]> wrote:
>>
>>> To set up nested mode, drivers such as vfio_pci need to register a
>>> handler to receive stage/level 1 faults from the IOMMU, but since
>>> currently each device can only have one iommu dev fault handler,
>>> and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF),
>>> we choose to update the registered handler (a consolidated one) via
>>> flags (set FAULT_REPORT_NESTED_L1), and further deliver the received
>>> stage 1 faults in the handler to the guest through a newly added
>>> vfio_device_ops callback.
>>
>> Are there proposed in-kernel drivers that would use any of these
>> symbols?
>
> I hope that such as Eric's SMMUv3 Nested Stage Setup series [1] can
> use these symbols to consolidate the two page fault handlers into one.
>
> [1] https://patchwork.kernel.org/project/kvm/cover/[email protected]/
>
>>
>>> Signed-off-by: Shenming Lu <[email protected]>
>>> ---
>>> drivers/vfio/vfio.c | 81 +++++++++++++++++++++++++++++++++
>>> drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++-
>>> include/linux/vfio.h | 12 +++++
>>> 3 files changed, 141 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>> index 44c8dfabf7de..4245f15914bf 100644
>>> --- a/drivers/vfio/vfio.c
>>> +++ b/drivers/vfio/vfio.c
>>> @@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
>>> }
>>> EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
>>>
>>> +/*
>>> + * Register/Update the VFIO IOPF handler to receive
>>> + * nested stage/level 1 faults.
>>> + */
>>> +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
>>> +{
>>> + struct vfio_container *container;
>>> + struct vfio_group *group;
>>> + struct vfio_iommu_driver *driver;
>>> + int ret;
>>> +
>>> + if (!dev)
>>> + return -EINVAL;
>>> +
>>> + group = vfio_group_get_from_dev(dev);
>>> + if (!group)
>>> + return -ENODEV;
>>> +
>>> + ret = vfio_group_add_container_user(group);
>>> + if (ret)
>>> + goto out;
>>> +
>>> + container = group->container;
>>> + driver = container->iommu_driver;
>>> + if (likely(driver && driver->ops->register_handler))
>>> + ret = driver->ops->register_handler(container->iommu_data, dev);
>>> + else
>>> + ret = -ENOTTY;
>>> +
>>> + vfio_group_try_dissolve_container(group);
>>> +
>>> +out:
>>> + vfio_group_put(group);
>>> + return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
>>> +
>>> +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
>>> +{
>>> + struct vfio_container *container;
>>> + struct vfio_group *group;
>>> + struct vfio_iommu_driver *driver;
>>> + int ret;
>>> +
>>> + if (!dev)
>>> + return -EINVAL;
>>> +
>>> + group = vfio_group_get_from_dev(dev);
>>> + if (!group)
>>> + return -ENODEV;
>>> +
>>> + ret = vfio_group_add_container_user(group);
>>> + if (ret)
>>> + goto out;
>>> +
>>> + container = group->container;
>>> + driver = container->iommu_driver;
>>> + if (likely(driver && driver->ops->unregister_handler))
>>> + ret = driver->ops->unregister_handler(container->iommu_data, dev);
>>> + else
>>> + ret = -ENOTTY;
>>> +
>>> + vfio_group_try_dissolve_container(group);
>>> +
>>> +out:
>>> + vfio_group_put(group);
>>> + return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested);
>>> +
>>> +int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault *fault)
>>> +{
>>> + struct vfio_device *device = dev_get_drvdata(dev);
>>> +
>>> + if (unlikely(!device->ops->transfer))
>>> + return -EOPNOTSUPP;
>>> +
>>> + return device->ops->transfer(device->device_data, fault);
>>> +}
>>> +EXPORT_SYMBOL_GPL(vfio_transfer_iommu_fault);
>>> +
>>> /**
>>> * Module/class support
>>> */
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index ba2b5a1cf6e9..9d1adeddb303 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -3821,13 +3821,32 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
>>> struct vfio_batch batch;
>>> struct vfio_range *range;
>>> dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
>>> - int access_flags = 0;
>>> + int access_flags = 0, nested;
>>> size_t premap_len, map_len, mapped_len = 0;
>>> unsigned long bit_offset, vaddr, pfn, i, npages;
>>> int ret;
>>> enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
>>> struct iommu_page_response resp = {0};
>>>
>>> + if (vfio_dev_domian_nested(dev, &nested))
>>> + return -ENODEV;
>>> +
>>> + /*
>>> + * When configured in nested mode, further deliver the
>>> + * stage/level 1 faults to the guest.
>>> + */
>>> + if (nested) {
>>> + bool l2;
>>> +
>>> + if (fault->type == IOMMU_FAULT_PAGE_REQ)
>>> + l2 = fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2;
>>> + if (fault->type == IOMMU_FAULT_DMA_UNRECOV)
>>> + l2 = fault->event.flags & IOMMU_FAULT_UNRECOV_L2;
>>> +
>>> + if (!l2)
>>> + return vfio_transfer_iommu_fault(dev, fault);
>>> + }
>>> +
>>> if (fault->type != IOMMU_FAULT_PAGE_REQ)
>>> return -EOPNOTSUPP;
>>>
>>> @@ -4201,6 +4220,32 @@ static void vfio_iommu_type1_notify(void *iommu_data,
>>> wake_up_all(&iommu->vaddr_wait);
>>> }
>>>
>>> +static int vfio_iommu_type1_register_handler(void *iommu_data,
>>> + struct device *dev)
>>> +{
>>> + struct vfio_iommu *iommu = iommu_data;
>>> +
>>> + if (iommu->iopf_enabled)
>>> + return iommu_update_device_fault_handler(dev, ~0,
>>> + FAULT_REPORT_NESTED_L1);
>>> + else
>>> + return iommu_register_device_fault_handler(dev,
>>> + vfio_iommu_type1_dma_map_iopf,
>>> + FAULT_REPORT_NESTED_L1, dev);
>>> +}
>>> +
>>> +static int vfio_iommu_type1_unregister_handler(void *iommu_data,
>>> + struct device *dev)
>>> +{
>>> + struct vfio_iommu *iommu = iommu_data;
>>> +
>>> + if (iommu->iopf_enabled)
>>> + return iommu_update_device_fault_handler(dev,
>>> + ~FAULT_REPORT_NESTED_L1, 0);
>>> + else
>>> + return iommu_unregister_device_fault_handler(dev);
>>> +}
>>
>>
>> The path through vfio to register this is pretty ugly, but I don't see
>> any reason for the the update interfaces here, the previously
>> registered handler just changes its behavior.
>
> Yeah, this seems not an elegant way...
>
> If IOPF(L2) enabled, the fault handler has already been registered, so for
> nested mode setup, we only need to change the flags of the handler in the
> IOMMU driver to receive L1 faults.
> (assume that L1 IOPF is configured after L2 IOPF)
>
> Currently each device can only have one iommu dev fault handler, and L1
> and L2 IOPF are configured separately in nested mode, I am also wondering
> that is there a better solution for this.

Hi Alex, Eric,

Let me simply add, maybe there is another way for this:
Would it be better to set host IOPF enabled (L2 faults) in the VFIO_IOMMU_MAP_DMA
ioctl (no need to add a new ioctl, and we can specify whether this mapping is IOPF
available or statically pinned), and set guest IOPF enabled (L1 faults) in the
VFIO_IOMMU_SET_PASID_TABLE (from Eric's series) ioctl?
And we have no requirement for the sequence of these two ioctls. The first called
one will register the handler, and the later one will just update the handler...

Thanks,
Shenming

>
> Thanks,
> Shenming
>
>>
>>
>>> +
>>> static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>>> .name = "vfio-iommu-type1",
>>> .owner = THIS_MODULE,
>>> @@ -4216,6 +4261,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>>> .dma_rw = vfio_iommu_type1_dma_rw,
>>> .group_iommu_domain = vfio_iommu_type1_group_iommu_domain,
>>> .notify = vfio_iommu_type1_notify,
>>> + .register_handler = vfio_iommu_type1_register_handler,
>>> + .unregister_handler = vfio_iommu_type1_unregister_handler,
>>> };
>>>
>>> static int __init vfio_iommu_type1_init(void)
>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>> index a7b426d579df..4621d8f0395d 100644
>>> --- a/include/linux/vfio.h
>>> +++ b/include/linux/vfio.h
>>> @@ -29,6 +29,8 @@
>>> * @match: Optional device name match callback (return: 0 for no-match, >0 for
>>> * match, -errno for abort (ex. match with insufficient or incorrect
>>> * additional args)
>>> + * @transfer: Optional. Transfer the received stage/level 1 faults to the guest
>>> + * for nested mode.
>>> */
>>> struct vfio_device_ops {
>>> char *name;
>>> @@ -43,6 +45,7 @@ struct vfio_device_ops {
>>> int (*mmap)(void *device_data, struct vm_area_struct *vma);
>>> void (*request)(void *device_data, unsigned int count);
>>> int (*match)(void *device_data, char *buf);
>>> + int (*transfer)(void *device_data, struct iommu_fault *fault);
>>> };
>>>
>>> extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
>>> @@ -100,6 +103,10 @@ struct vfio_iommu_driver_ops {
>>> struct iommu_group *group);
>>> void (*notify)(void *iommu_data,
>>> enum vfio_iommu_notify_type event);
>>> + int (*register_handler)(void *iommu_data,
>>> + struct device *dev);
>>> + int (*unregister_handler)(void *iommu_data,
>>> + struct device *dev);
>>> };
>>>
>>> extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
>>> @@ -161,6 +168,11 @@ extern int vfio_unregister_notifier(struct device *dev,
>>> struct kvm;
>>> extern void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm);
>>>
>>> +extern int vfio_iommu_dev_fault_handler_register_nested(struct device *dev);
>>> +extern int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev);
>>> +extern int vfio_transfer_iommu_fault(struct device *dev,
>>> + struct iommu_fault *fault);
>>> +
>>> /*
>>> * Sub-module helpers
>>> */
>>
>> .
>>

2021-05-24 22:13:23

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF

On Fri, 21 May 2021 14:38:25 +0800
Shenming Lu <[email protected]> wrote:

> On 2021/5/19 2:58, Alex Williamson wrote:
> > On Fri, 9 Apr 2021 11:44:17 +0800
> > Shenming Lu <[email protected]> wrote:
> >
> >> Since enabling IOPF for devices may lead to a slow ramp up of performance,
> >> we add an ioctl VFIO_IOMMU_ENABLE_IOPF to make it configurable. And the
> >> IOPF enabling of a VFIO device includes setting IOMMU_DEV_FEAT_IOPF and
> >> registering the VFIO IOPF handler.
> >>
> >> Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may be
> >> inflight page faults when disabling.
> >>
> >> Signed-off-by: Shenming Lu <[email protected]>
> >> ---
> >> drivers/vfio/vfio_iommu_type1.c | 223 +++++++++++++++++++++++++++++++-
> >> include/uapi/linux/vfio.h | 6 +
> >> 2 files changed, 226 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index 01e296c6dc9e..7df5711e743a 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -71,6 +71,7 @@ struct vfio_iommu {
> >> struct rb_root dma_list;
> >> struct blocking_notifier_head notifier;
> >> struct mmu_notifier mn;
> >> + struct mm_struct *mm;
> >
> > We currently have no requirement that a single mm is used for all
> > container mappings. Does enabling IOPF impose such a requirement?
> > Shouldn't MAP/UNMAP enforce that?
>
> Did you mean that there is a possibility that each vfio_dma in a
> container has a different mm? If so, could we register one MMU
> notifier for each vfio_dma in the MAP ioctl?

We don't prevent it currently. There needs to be some balance,
typically we'll have one mm, so would it make sense to have potentially
thousands of mmu notifiers registered against the same mm?

> >> unsigned int dma_avail;
> >> unsigned int vaddr_invalid_count;
> >> uint64_t pgsize_bitmap;
> >> @@ -81,6 +82,7 @@ struct vfio_iommu {
> >> bool dirty_page_tracking;
> >> bool pinned_page_dirty_scope;
> >> bool container_open;
> >> + bool iopf_enabled;
> >> };
> >>
> >> struct vfio_domain {
> >> @@ -461,6 +463,38 @@ vfio_find_iopf_group(struct iommu_group *iommu_group)
> >> return node ? iopf_group : NULL;
> >> }
> >>
> >> +static void vfio_link_iopf_group(struct vfio_iopf_group *new)
> >> +{
> >> + struct rb_node **link, *parent = NULL;
> >> + struct vfio_iopf_group *iopf_group;
> >> +
> >> + mutex_lock(&iopf_group_list_lock);
> >> +
> >> + link = &iopf_group_list.rb_node;
> >> +
> >> + while (*link) {
> >> + parent = *link;
> >> + iopf_group = rb_entry(parent, struct vfio_iopf_group, node);
> >> +
> >> + if (new->iommu_group < iopf_group->iommu_group)
> >> + link = &(*link)->rb_left;
> >> + else
> >> + link = &(*link)->rb_right;
> >> + }
> >> +
> >> + rb_link_node(&new->node, parent, link);
> >> + rb_insert_color(&new->node, &iopf_group_list);
> >> +
> >> + mutex_unlock(&iopf_group_list_lock);
> >> +}
> >> +
> >> +static void vfio_unlink_iopf_group(struct vfio_iopf_group *old)
> >> +{
> >> + mutex_lock(&iopf_group_list_lock);
> >> + rb_erase(&old->node, &iopf_group_list);
> >> + mutex_unlock(&iopf_group_list_lock);
> >> +}
> >> +
> >> static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
> >> {
> >> struct mm_struct *mm;
> >> @@ -2363,6 +2397,68 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
> >> list_splice_tail(iova_copy, iova);
> >> }
> >>
> >> +static int vfio_dev_domian_nested(struct device *dev, int *nested)
> >
> >
> > s/domian/domain/
> >
> >
> >> +{
> >> + struct iommu_domain *domain;
> >> +
> >> + domain = iommu_get_domain_for_dev(dev);
> >> + if (!domain)
> >> + return -ENODEV;
> >> +
> >> + return iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, nested);
> >
> >
> > Wouldn't this be easier to use if it returned bool, false on -errno?
>
> Make sense.
>
> >
> >
> >> +}
> >> +
> >> +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data);
> >> +
> >> +static int dev_enable_iopf(struct device *dev, void *data)
> >> +{
> >> + int *enabled_dev_cnt = data;
> >> + int nested;
> >> + u32 flags;
> >> + int ret;
> >> +
> >> + ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);
> >> + if (ret)
> >> + return ret;
> >> +
> >> + ret = vfio_dev_domian_nested(dev, &nested);
> >> + if (ret)
> >> + goto out_disable;
> >> +
> >> + if (nested)
> >> + flags = FAULT_REPORT_NESTED_L2;
> >> + else
> >> + flags = FAULT_REPORT_FLAT;
> >> +
> >> + ret = iommu_register_device_fault_handler(dev,
> >> + vfio_iommu_type1_dma_map_iopf, flags, dev);
> >> + if (ret)
> >> + goto out_disable;
> >> +
> >> + (*enabled_dev_cnt)++;
> >> + return 0;
> >> +
> >> +out_disable:
> >> + iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF);
> >> + return ret;
> >> +}
> >> +
> >> +static int dev_disable_iopf(struct device *dev, void *data)
> >> +{
> >> + int *enabled_dev_cnt = data;
> >> +
> >> + if (enabled_dev_cnt && *enabled_dev_cnt <= 0)
> >> + return -1;
> >
> >
> > Use an errno.>
> >
> >> +
> >> + WARN_ON(iommu_unregister_device_fault_handler(dev));
> >> + WARN_ON(iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF));
> >> +
> >> + if (enabled_dev_cnt)
> >> + (*enabled_dev_cnt)--;
> >
> >
> > Why do we need these counters?
>
> We use this counter to record the number of IOPF enabled devices, and if
> any device fails to be enabled, we will exit the loop (three nested loop)
> and go to the unwind process, which will disable the first @enabled_dev_cnt
> devices.
>
> >
> >> +
> >> + return 0;
> >> +}
> >> +
> >> static int vfio_iommu_type1_attach_group(void *iommu_data,
> >> struct iommu_group *iommu_group)
> >> {
> >> @@ -2376,6 +2472,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >> struct iommu_domain_geometry geo;
> >> LIST_HEAD(iova_copy);
> >> LIST_HEAD(group_resv_regions);
> >> + int iopf_enabled_dev_cnt = 0;
> >> + struct vfio_iopf_group *iopf_group = NULL;
> >>
> >> mutex_lock(&iommu->lock);
> >>
> >> @@ -2453,6 +2551,24 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >> if (ret)
> >> goto out_domain;
> >>
> >> + if (iommu->iopf_enabled) {
> >> + ret = iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
> >> + dev_enable_iopf);
> >> + if (ret)
> >> + goto out_detach;
> >> +
> >> + iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
> >> + if (!iopf_group) {
> >> + ret = -ENOMEM;
> >> + goto out_detach;
> >> + }
> >> +
> >> + iopf_group->iommu_group = iommu_group;
> >> + iopf_group->iommu = iommu;
> >> +
> >> + vfio_link_iopf_group(iopf_group);
> >
> > This seems backwards, once we enable iopf we can take a fault, so the
> > structure to lookup the data for the device should be setup first. I'm
> > still not convinced this iopf_group rb tree is a good solution to get
> > from the device to the iommu either. vfio-core could traverse dev ->
> > iommu_group -> vfio_group -> container -> iommu_data.
>
> Make sense.
>
> >
> >
> >> + }
> >> +
> >> /* Get aperture info */
> >> iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo);
> >>
> >> @@ -2534,9 +2650,11 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >> vfio_test_domain_fgsp(domain);
> >>
> >> /* replay mappings on new domains */
> >> - ret = vfio_iommu_replay(iommu, domain);
> >> - if (ret)
> >> - goto out_detach;
> >> + if (!iommu->iopf_enabled) {
> >> + ret = vfio_iommu_replay(iommu, domain);
> >> + if (ret)
> >> + goto out_detach;
> >> + }
> >
> > Test in vfio_iommu_replay()?
>
> Not yet, I will do more tests later.
>
> >
> >>
> >> if (resv_msi) {
> >> ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
> >> @@ -2567,6 +2685,15 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >> iommu_domain_free(domain->domain);
> >> vfio_iommu_iova_free(&iova_copy);
> >> vfio_iommu_resv_free(&group_resv_regions);
> >> + if (iommu->iopf_enabled) {
> >> + if (iopf_group) {
> >> + vfio_unlink_iopf_group(iopf_group);
> >> + kfree(iopf_group);
> >> + }
> >> +
> >> + iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
> >> + dev_disable_iopf);
> >
> > Backwards, disable fault reporting before unlinking lookup.
>
> Make sense.
>
> >
> >> + }
> >> out_free:
> >> kfree(domain);
> >> kfree(group);
> >> @@ -2728,6 +2855,19 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
> >> if (!group)
> >> continue;
> >>
> >> + if (iommu->iopf_enabled) {
> >> + struct vfio_iopf_group *iopf_group;
> >> +
> >> + iopf_group = vfio_find_iopf_group(iommu_group);
> >> + if (!WARN_ON(!iopf_group)) {
> >> + vfio_unlink_iopf_group(iopf_group);
> >> + kfree(iopf_group);
> >> + }
> >> +
> >> + iommu_group_for_each_dev(iommu_group, NULL,
> >> + dev_disable_iopf);
> >
> >
> > Backwards. Also unregistering the fault handler can fail if there are
> > pending faults. It appears the fault handler and iopf lookup could also
> > race with this function, ex. fault handler gets an iopf object before
> > it's removed from the rb-tree, blocks trying to acquire iommu->lock,
> > disable_iopf fails, detach proceeds, fault handler has use after free
> > error.
>
> We have WARN_ON for the failure of the unregistering.
>
> Yeah, it seems that using vfio_iopf_group is not a good choice...
>
> >
> >
> >> + }
> >> +
> >> vfio_iommu_detach_group(domain, group);
> >> update_dirty_scope = !group->pinned_page_dirty_scope;
> >> list_del(&group->next);
> >> @@ -2846,6 +2986,11 @@ static void vfio_iommu_type1_release(void *iommu_data)
> >>
> >> vfio_iommu_iova_free(&iommu->iova_list);
> >>
> >> + if (iommu->iopf_enabled) {
> >> + mmu_notifier_unregister(&iommu->mn, iommu->mm);
> >> + mmdrop(iommu->mm);
> >> + }
> >> +
> >> kfree(iommu);
> >> }
> >>
> >> @@ -3441,6 +3586,76 @@ static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
> >> .invalidate_range = mn_invalidate_range,
> >> };
> >>
> >> +static int vfio_iommu_type1_enable_iopf(struct vfio_iommu *iommu)
> >> +{
> >> + struct vfio_domain *d;
> >> + struct vfio_group *g;
> >> + struct vfio_iopf_group *iopf_group;
> >> + int enabled_dev_cnt = 0;
> >> + int ret;
> >> +
> >> + if (!current->mm)
> >> + return -ENODEV;
> >> +
> >> + mutex_lock(&iommu->lock);
> >> +
> >> + mmgrab(current->mm);
> >
> >
> > As above, this is a new and undocumented requirement.
> >
> >
> >> + iommu->mm = current->mm;
> >> + iommu->mn.ops = &vfio_iommu_type1_mn_ops;
> >> + ret = mmu_notifier_register(&iommu->mn, current->mm);
> >> + if (ret)
> >> + goto out_drop;
> >> +
> >> + list_for_each_entry(d, &iommu->domain_list, next) {
> >> + list_for_each_entry(g, &d->group_list, next) {
> >> + ret = iommu_group_for_each_dev(g->iommu_group,
> >> + &enabled_dev_cnt, dev_enable_iopf);
> >> + if (ret)
> >> + goto out_unwind;
> >> +
> >> + iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
> >> + if (!iopf_group) {
> >> + ret = -ENOMEM;
> >> + goto out_unwind;
> >> + }
> >> +
> >> + iopf_group->iommu_group = g->iommu_group;
> >> + iopf_group->iommu = iommu;
> >> +
> >> + vfio_link_iopf_group(iopf_group);
> >> + }
> >> + }
> >> +
> >> + iommu->iopf_enabled = true;
> >> + goto out_unlock;
> >> +
> >> +out_unwind:
> >> + list_for_each_entry(d, &iommu->domain_list, next) {
> >> + list_for_each_entry(g, &d->group_list, next) {
> >> + iopf_group = vfio_find_iopf_group(g->iommu_group);
> >> + if (iopf_group) {
> >> + vfio_unlink_iopf_group(iopf_group);
> >> + kfree(iopf_group);
> >> + }
> >> +
> >> + if (iommu_group_for_each_dev(g->iommu_group,
> >> + &enabled_dev_cnt, dev_disable_iopf))
> >> + goto out_unregister;
> >
> >
> > This seems broken, we break out of the unwind if any group_for_each
> > fails??
>
> The iommu_group_for_each_dev function will return a negative value if
> we have disabled all previously enabled devices (@enabled_dev_cnt).

What's the harm in calling disable IOPF on a device where it's already
disabled? This is why I'm not sure the value of keeping a counter. It
also presumes that for_each_dev iterates in the same order every time.

> >> + }
> >> + }
> >> +
> >> +out_unregister:
> >> + mmu_notifier_unregister(&iommu->mn, current->mm);
> >> +
> >> +out_drop:
> >> + iommu->mm = NULL;
> >> + mmdrop(current->mm);
> >> +
> >> +out_unlock:
> >> + mutex_unlock(&iommu->lock);
> >> + return ret;
> >> +}
> >
> >
> > Is there an assumption that the user does this before performing any
> > mapping? I don't see where current mappings would be converted which
> > means we'll have pinned pages leaked and vfio_dma objects without a
> > pinned page bitmap.
>
> Yeah, we have an assumption that this ioctl (ENABLE_IOPF) must be called
> before any MAP/UNMAP ioctl...
>
> I will document these later. :-)


That can't be fixed with documentation, if it's a requirement, the code
needs to enforce it. But also why is it a requirement? Theoretically
we could unpin and de-account existing mappings and allow the mmu
notifier to handle these as well, right? Thanks,

Alex

2021-05-24 22:15:46

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough

On Fri, 21 May 2021 14:37:21 +0800
Shenming Lu <[email protected]> wrote:

> Hi Alex,
>
> On 2021/5/19 2:57, Alex Williamson wrote:
> > On Fri, 9 Apr 2021 11:44:12 +0800
> > Shenming Lu <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> Requesting for your comments and suggestions. :-)
> >>
> >> The static pinning and mapping problem in VFIO and possible solutions
> >> have been discussed a lot [1, 2]. One of the solutions is to add I/O
> >> Page Fault support for VFIO devices. Different from those relatively
> >> complicated software approaches such as presenting a vIOMMU that provides
> >> the DMA buffer information (might include para-virtualized optimizations),
> >> IOPF mainly depends on the hardware faulting capability, such as the PCIe
> >> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
> >> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
> >> support for VFIO passthrough based on the IOPF part of SVA in this series.
> >
> > The SVA proposals are being reworked to make use of a new IOASID
> > object, it's not clear to me that this shouldn't also make use of that
> > work as it does a significant expansion of the type1 IOMMU with fault
> > handlers that would duplicate new work using that new model.
>
> It seems that the IOASID extension for guest SVA would not affect this series,
> will we do any host-guest IOASID translation in the VFIO fault handler?

Surely it will, we don't currently have any IOMMU fault handling or
forwarding of IOMMU faults through to the vfio bus driver, both of
those would be included in an IOASID implementation. I think Jason's
vision is to use IOASID to deprecate type1 for all use cases, so even
if we were to choose to implement IOPF in type1 we should agree on
common interfaces with IOASID.

> >> We have measured its performance with UADK [4] (passthrough an accelerator
> >> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
> >>
> >> Run hisi_sec_test...
> >> - with varying sending times and message lengths
> >> - with/without IOPF enabled (speed slowdown)
> >>
> >> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
> >> slowdown (num of faults)
> >> times VFIO IOPF host SVA
> >> 1 63.4% (518) 82.8% (512)
> >> 100 22.9% (1058) 47.9% (1024)
> >> 1000 2.6% (1071) 8.5% (1024)
> >>
> >> when msg_len = 10MB (and PREMAP_LEN = 512):
> >> slowdown (num of faults)
> >> times VFIO IOPF
> >> 1 32.6% (13)
> >> 100 3.5% (26)
> >> 1000 1.6% (26)
> >
> > It seems like this is only an example that you can make a benchmark
> > show anything you want. The best results would be to pre-map
> > everything, which is what we have without this series. What is an
> > acceptable overhead to incur to avoid page pinning? What if userspace
> > had more fine grained control over which mappings were available for
> > faulting and which were statically mapped? I don't really see what
> > sense the pre-mapping range makes. If I assume the user is QEMU in a
> > non-vIOMMU configuration, pre-mapping the beginning of each RAM section
> > doesn't make any logical sense relative to device DMA.
>
> As you said in Patch 4, we can introduce full end-to-end functionality
> before trying to improve performance, and I will drop the pre-mapping patch
> in the current stage...
>
> Is there a need that userspace wants more fine grained control over which
> mappings are available for faulting? If so, we may evolve the MAP ioctl
> to support for specifying the faulting range.

You're essentially enabling this for a vfio bus driver via patch 7/8,
pinning for selective DMA faulting. How would a driver in userspace
make equivalent requests? In the case of performance, the user could
mlock the page but they have no mechanism here to pre-fault it. Should
they?

> As for the overhead of IOPF, it is unavoidable if enabling on-demand paging
> (and page faults occur almost only when first accessing), and it seems that
> there is a large optimization space compared to CPU page faulting.

Yes, there's of course going to be overhead in terms of latency for the
page faults. My point was more that when a host is not under memory
pressure we should trend towards the performance of pinned, static
mappings and we should be able to create arbitrarily large pre-fault
behavior to show that. But I think what we really want to enable via
IOPF is density, right? Therefore how many more assigned device guests
can you run on a host with IOPF? How does the slope, plateau, and
inflection point of their aggregate throughput compare to static
pinning? VM startup time is probably also a useful metric, ie. trading
device latency for startup latency. Thanks,

Alex

2021-05-25 00:02:48

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support

On Mon, 24 May 2021 21:11:11 +0800
Shenming Lu <[email protected]> wrote:

> On 2021/5/21 15:59, Shenming Lu wrote:
> > On 2021/5/19 2:58, Alex Williamson wrote:
> >> On Fri, 9 Apr 2021 11:44:20 +0800
> >> Shenming Lu <[email protected]> wrote:
> >>
> >>> To set up nested mode, drivers such as vfio_pci need to register a
> >>> handler to receive stage/level 1 faults from the IOMMU, but since
> >>> currently each device can only have one iommu dev fault handler,
> >>> and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF),
> >>> we choose to update the registered handler (a consolidated one) via
> >>> flags (set FAULT_REPORT_NESTED_L1), and further deliver the received
> >>> stage 1 faults in the handler to the guest through a newly added
> >>> vfio_device_ops callback.
> >>
> >> Are there proposed in-kernel drivers that would use any of these
> >> symbols?
> >
> > I hope that such as Eric's SMMUv3 Nested Stage Setup series [1] can
> > use these symbols to consolidate the two page fault handlers into one.
> >
> > [1] https://patchwork.kernel.org/project/kvm/cover/[email protected]/
> >
> >>
> >>> Signed-off-by: Shenming Lu <[email protected]>
> >>> ---
> >>> drivers/vfio/vfio.c | 81 +++++++++++++++++++++++++++++++++
> >>> drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++-
> >>> include/linux/vfio.h | 12 +++++
> >>> 3 files changed, 141 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >>> index 44c8dfabf7de..4245f15914bf 100644
> >>> --- a/drivers/vfio/vfio.c
> >>> +++ b/drivers/vfio/vfio.c
> >>> @@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
> >>> }
> >>> EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
> >>>
> >>> +/*
> >>> + * Register/Update the VFIO IOPF handler to receive
> >>> + * nested stage/level 1 faults.
> >>> + */
> >>> +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
> >>> +{
> >>> + struct vfio_container *container;
> >>> + struct vfio_group *group;
> >>> + struct vfio_iommu_driver *driver;
> >>> + int ret;
> >>> +
> >>> + if (!dev)
> >>> + return -EINVAL;
> >>> +
> >>> + group = vfio_group_get_from_dev(dev);
> >>> + if (!group)
> >>> + return -ENODEV;
> >>> +
> >>> + ret = vfio_group_add_container_user(group);
> >>> + if (ret)
> >>> + goto out;
> >>> +
> >>> + container = group->container;
> >>> + driver = container->iommu_driver;
> >>> + if (likely(driver && driver->ops->register_handler))
> >>> + ret = driver->ops->register_handler(container->iommu_data, dev);
> >>> + else
> >>> + ret = -ENOTTY;
> >>> +
> >>> + vfio_group_try_dissolve_container(group);
> >>> +
> >>> +out:
> >>> + vfio_group_put(group);
> >>> + return ret;
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
> >>> +
> >>> +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
> >>> +{
> >>> + struct vfio_container *container;
> >>> + struct vfio_group *group;
> >>> + struct vfio_iommu_driver *driver;
> >>> + int ret;
> >>> +
> >>> + if (!dev)
> >>> + return -EINVAL;
> >>> +
> >>> + group = vfio_group_get_from_dev(dev);
> >>> + if (!group)
> >>> + return -ENODEV;
> >>> +
> >>> + ret = vfio_group_add_container_user(group);
> >>> + if (ret)
> >>> + goto out;
> >>> +
> >>> + container = group->container;
> >>> + driver = container->iommu_driver;
> >>> + if (likely(driver && driver->ops->unregister_handler))
> >>> + ret = driver->ops->unregister_handler(container->iommu_data, dev);
> >>> + else
> >>> + ret = -ENOTTY;
> >>> +
> >>> + vfio_group_try_dissolve_container(group);
> >>> +
> >>> +out:
> >>> + vfio_group_put(group);
> >>> + return ret;
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested);
> >>> +
> >>> +int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault *fault)
> >>> +{
> >>> + struct vfio_device *device = dev_get_drvdata(dev);
> >>> +
> >>> + if (unlikely(!device->ops->transfer))
> >>> + return -EOPNOTSUPP;
> >>> +
> >>> + return device->ops->transfer(device->device_data, fault);
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(vfio_transfer_iommu_fault);
> >>> +
> >>> /**
> >>> * Module/class support
> >>> */
> >>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >>> index ba2b5a1cf6e9..9d1adeddb303 100644
> >>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>> @@ -3821,13 +3821,32 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
> >>> struct vfio_batch batch;
> >>> struct vfio_range *range;
> >>> dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
> >>> - int access_flags = 0;
> >>> + int access_flags = 0, nested;
> >>> size_t premap_len, map_len, mapped_len = 0;
> >>> unsigned long bit_offset, vaddr, pfn, i, npages;
> >>> int ret;
> >>> enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
> >>> struct iommu_page_response resp = {0};
> >>>
> >>> + if (vfio_dev_domian_nested(dev, &nested))
> >>> + return -ENODEV;
> >>> +
> >>> + /*
> >>> + * When configured in nested mode, further deliver the
> >>> + * stage/level 1 faults to the guest.
> >>> + */
> >>> + if (nested) {
> >>> + bool l2;
> >>> +
> >>> + if (fault->type == IOMMU_FAULT_PAGE_REQ)
> >>> + l2 = fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2;
> >>> + if (fault->type == IOMMU_FAULT_DMA_UNRECOV)
> >>> + l2 = fault->event.flags & IOMMU_FAULT_UNRECOV_L2;
> >>> +
> >>> + if (!l2)
> >>> + return vfio_transfer_iommu_fault(dev, fault);
> >>> + }
> >>> +
> >>> if (fault->type != IOMMU_FAULT_PAGE_REQ)
> >>> return -EOPNOTSUPP;
> >>>
> >>> @@ -4201,6 +4220,32 @@ static void vfio_iommu_type1_notify(void *iommu_data,
> >>> wake_up_all(&iommu->vaddr_wait);
> >>> }
> >>>
> >>> +static int vfio_iommu_type1_register_handler(void *iommu_data,
> >>> + struct device *dev)
> >>> +{
> >>> + struct vfio_iommu *iommu = iommu_data;
> >>> +
> >>> + if (iommu->iopf_enabled)
> >>> + return iommu_update_device_fault_handler(dev, ~0,
> >>> + FAULT_REPORT_NESTED_L1);
> >>> + else
> >>> + return iommu_register_device_fault_handler(dev,
> >>> + vfio_iommu_type1_dma_map_iopf,
> >>> + FAULT_REPORT_NESTED_L1, dev);
> >>> +}
> >>> +
> >>> +static int vfio_iommu_type1_unregister_handler(void *iommu_data,
> >>> + struct device *dev)
> >>> +{
> >>> + struct vfio_iommu *iommu = iommu_data;
> >>> +
> >>> + if (iommu->iopf_enabled)
> >>> + return iommu_update_device_fault_handler(dev,
> >>> + ~FAULT_REPORT_NESTED_L1, 0);
> >>> + else
> >>> + return iommu_unregister_device_fault_handler(dev);
> >>> +}
> >>
> >>
> >> The path through vfio to register this is pretty ugly, but I don't see
> >> any reason for the the update interfaces here, the previously
> >> registered handler just changes its behavior.
> >
> > Yeah, this seems not an elegant way...
> >
> > If IOPF(L2) enabled, the fault handler has already been registered, so for
> > nested mode setup, we only need to change the flags of the handler in the
> > IOMMU driver to receive L1 faults.
> > (assume that L1 IOPF is configured after L2 IOPF)
> >
> > Currently each device can only have one iommu dev fault handler, and L1
> > and L2 IOPF are configured separately in nested mode, I am also wondering
> > that is there a better solution for this.

I haven't fully read all the references, but who imposes the fact that
there's only one fault handler per device? If type1 could register one
handler and the vfio-pci bus driver another for the other level, would
we need this path through vfio-core?

> Let me simply add, maybe there is another way for this:
> Would it be better to set host IOPF enabled (L2 faults) in the VFIO_IOMMU_MAP_DMA
> ioctl (no need to add a new ioctl, and we can specify whether this mapping is IOPF
> available or statically pinned), and set guest IOPF enabled (L1 faults) in the
> VFIO_IOMMU_SET_PASID_TABLE (from Eric's series) ioctl?
> And we have no requirement for the sequence of these two ioctls. The first called
> one will register the handler, and the later one will just update the handler...

This is looking more and more like it belongs with the IOASID work. I
think Eric has shifted his focus there too. Thanks,

Alex

2021-05-27 11:17:48

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF

On 2021/5/25 6:11, Alex Williamson wrote:
> On Fri, 21 May 2021 14:38:25 +0800
> Shenming Lu <[email protected]> wrote:
>
>> On 2021/5/19 2:58, Alex Williamson wrote:
>>> On Fri, 9 Apr 2021 11:44:17 +0800
>>> Shenming Lu <[email protected]> wrote:
>>>
>>>> Since enabling IOPF for devices may lead to a slow ramp up of performance,
>>>> we add an ioctl VFIO_IOMMU_ENABLE_IOPF to make it configurable. And the
>>>> IOPF enabling of a VFIO device includes setting IOMMU_DEV_FEAT_IOPF and
>>>> registering the VFIO IOPF handler.
>>>>
>>>> Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may be
>>>> inflight page faults when disabling.
>>>>
>>>> Signed-off-by: Shenming Lu <[email protected]>
>>>> ---
>>>> drivers/vfio/vfio_iommu_type1.c | 223 +++++++++++++++++++++++++++++++-
>>>> include/uapi/linux/vfio.h | 6 +
>>>> 2 files changed, 226 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>> index 01e296c6dc9e..7df5711e743a 100644
>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>> @@ -71,6 +71,7 @@ struct vfio_iommu {
>>>> struct rb_root dma_list;
>>>> struct blocking_notifier_head notifier;
>>>> struct mmu_notifier mn;
>>>> + struct mm_struct *mm;
>>>
>>> We currently have no requirement that a single mm is used for all
>>> container mappings. Does enabling IOPF impose such a requirement?
>>> Shouldn't MAP/UNMAP enforce that?
>>
>> Did you mean that there is a possibility that each vfio_dma in a
>> container has a different mm? If so, could we register one MMU
>> notifier for each vfio_dma in the MAP ioctl?
>
> We don't prevent it currently. There needs to be some balance,
> typically we'll have one mm, so would it make sense to have potentially
> thousands of mmu notifiers registered against the same mm?

If we have one MMU notifier per vfio_dma, there is no need to search the
iommu->dma_list when receiving an address invalidation event in
mn_invalidate_range().
Or could we divide the vfio_dma ranges into parts with different mm, and
each part would have one MMU notifier?

>
>>>> unsigned int dma_avail;
>>>> unsigned int vaddr_invalid_count;
>>>> uint64_t pgsize_bitmap;
>>>> @@ -81,6 +82,7 @@ struct vfio_iommu {
>>>> bool dirty_page_tracking;
>>>> bool pinned_page_dirty_scope;
>>>> bool container_open;
>>>> + bool iopf_enabled;
>>>> };
>>>>
>>>> struct vfio_domain {
>>>> @@ -461,6 +463,38 @@ vfio_find_iopf_group(struct iommu_group *iommu_group)
>>>> return node ? iopf_group : NULL;
>>>> }
>>>>
>>>> +static void vfio_link_iopf_group(struct vfio_iopf_group *new)
>>>> +{
>>>> + struct rb_node **link, *parent = NULL;
>>>> + struct vfio_iopf_group *iopf_group;
>>>> +
>>>> + mutex_lock(&iopf_group_list_lock);
>>>> +
>>>> + link = &iopf_group_list.rb_node;
>>>> +
>>>> + while (*link) {
>>>> + parent = *link;
>>>> + iopf_group = rb_entry(parent, struct vfio_iopf_group, node);
>>>> +
>>>> + if (new->iommu_group < iopf_group->iommu_group)
>>>> + link = &(*link)->rb_left;
>>>> + else
>>>> + link = &(*link)->rb_right;
>>>> + }
>>>> +
>>>> + rb_link_node(&new->node, parent, link);
>>>> + rb_insert_color(&new->node, &iopf_group_list);
>>>> +
>>>> + mutex_unlock(&iopf_group_list_lock);
>>>> +}
>>>> +
>>>> +static void vfio_unlink_iopf_group(struct vfio_iopf_group *old)
>>>> +{
>>>> + mutex_lock(&iopf_group_list_lock);
>>>> + rb_erase(&old->node, &iopf_group_list);
>>>> + mutex_unlock(&iopf_group_list_lock);
>>>> +}
>>>> +
>>>> static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>>> {
>>>> struct mm_struct *mm;
>>>> @@ -2363,6 +2397,68 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
>>>> list_splice_tail(iova_copy, iova);
>>>> }
>>>>
>>>> +static int vfio_dev_domian_nested(struct device *dev, int *nested)
>>>
>>>
>>> s/domian/domain/
>>>
>>>
>>>> +{
>>>> + struct iommu_domain *domain;
>>>> +
>>>> + domain = iommu_get_domain_for_dev(dev);
>>>> + if (!domain)
>>>> + return -ENODEV;
>>>> +
>>>> + return iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, nested);
>>>
>>>
>>> Wouldn't this be easier to use if it returned bool, false on -errno?
>>
>> Make sense.
>>
>>>
>>>
>>>> +}
>>>> +
>>>> +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data);
>>>> +
>>>> +static int dev_enable_iopf(struct device *dev, void *data)
>>>> +{
>>>> + int *enabled_dev_cnt = data;
>>>> + int nested;
>>>> + u32 flags;
>>>> + int ret;
>>>> +
>>>> + ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);
>>>> + if (ret)
>>>> + return ret;
>>>> +
>>>> + ret = vfio_dev_domian_nested(dev, &nested);
>>>> + if (ret)
>>>> + goto out_disable;
>>>> +
>>>> + if (nested)
>>>> + flags = FAULT_REPORT_NESTED_L2;
>>>> + else
>>>> + flags = FAULT_REPORT_FLAT;
>>>> +
>>>> + ret = iommu_register_device_fault_handler(dev,
>>>> + vfio_iommu_type1_dma_map_iopf, flags, dev);
>>>> + if (ret)
>>>> + goto out_disable;
>>>> +
>>>> + (*enabled_dev_cnt)++;
>>>> + return 0;
>>>> +
>>>> +out_disable:
>>>> + iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF);
>>>> + return ret;
>>>> +}
>>>> +
>>>> +static int dev_disable_iopf(struct device *dev, void *data)
>>>> +{
>>>> + int *enabled_dev_cnt = data;
>>>> +
>>>> + if (enabled_dev_cnt && *enabled_dev_cnt <= 0)
>>>> + return -1;
>>>
>>>
>>> Use an errno.>
>>>
>>>> +
>>>> + WARN_ON(iommu_unregister_device_fault_handler(dev));
>>>> + WARN_ON(iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF));
>>>> +
>>>> + if (enabled_dev_cnt)
>>>> + (*enabled_dev_cnt)--;
>>>
>>>
>>> Why do we need these counters?
>>
>> We use this counter to record the number of IOPF enabled devices, and if
>> any device fails to be enabled, we will exit the loop (three nested loop)
>> and go to the unwind process, which will disable the first @enabled_dev_cnt
>> devices.
>>
>>>
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> static int vfio_iommu_type1_attach_group(void *iommu_data,
>>>> struct iommu_group *iommu_group)
>>>> {
>>>> @@ -2376,6 +2472,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>>> struct iommu_domain_geometry geo;
>>>> LIST_HEAD(iova_copy);
>>>> LIST_HEAD(group_resv_regions);
>>>> + int iopf_enabled_dev_cnt = 0;
>>>> + struct vfio_iopf_group *iopf_group = NULL;
>>>>
>>>> mutex_lock(&iommu->lock);
>>>>
>>>> @@ -2453,6 +2551,24 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>>> if (ret)
>>>> goto out_domain;
>>>>
>>>> + if (iommu->iopf_enabled) {
>>>> + ret = iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
>>>> + dev_enable_iopf);
>>>> + if (ret)
>>>> + goto out_detach;
>>>> +
>>>> + iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
>>>> + if (!iopf_group) {
>>>> + ret = -ENOMEM;
>>>> + goto out_detach;
>>>> + }
>>>> +
>>>> + iopf_group->iommu_group = iommu_group;
>>>> + iopf_group->iommu = iommu;
>>>> +
>>>> + vfio_link_iopf_group(iopf_group);
>>>
>>> This seems backwards, once we enable iopf we can take a fault, so the
>>> structure to lookup the data for the device should be setup first. I'm
>>> still not convinced this iopf_group rb tree is a good solution to get
>>> from the device to the iommu either. vfio-core could traverse dev ->
>>> iommu_group -> vfio_group -> container -> iommu_data.
>>
>> Make sense.
>>
>>>
>>>
>>>> + }
>>>> +
>>>> /* Get aperture info */
>>>> iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo);
>>>>
>>>> @@ -2534,9 +2650,11 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>>> vfio_test_domain_fgsp(domain);
>>>>
>>>> /* replay mappings on new domains */
>>>> - ret = vfio_iommu_replay(iommu, domain);
>>>> - if (ret)
>>>> - goto out_detach;
>>>> + if (!iommu->iopf_enabled) {
>>>> + ret = vfio_iommu_replay(iommu, domain);
>>>> + if (ret)
>>>> + goto out_detach;
>>>> + }
>>>
>>> Test in vfio_iommu_replay()?
>>
>> Not yet, I will do more tests later.
>>
>>>
>>>>
>>>> if (resv_msi) {
>>>> ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
>>>> @@ -2567,6 +2685,15 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>>> iommu_domain_free(domain->domain);
>>>> vfio_iommu_iova_free(&iova_copy);
>>>> vfio_iommu_resv_free(&group_resv_regions);
>>>> + if (iommu->iopf_enabled) {
>>>> + if (iopf_group) {
>>>> + vfio_unlink_iopf_group(iopf_group);
>>>> + kfree(iopf_group);
>>>> + }
>>>> +
>>>> + iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
>>>> + dev_disable_iopf);
>>>
>>> Backwards, disable fault reporting before unlinking lookup.
>>
>> Make sense.
>>
>>>
>>>> + }
>>>> out_free:
>>>> kfree(domain);
>>>> kfree(group);
>>>> @@ -2728,6 +2855,19 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>>>> if (!group)
>>>> continue;
>>>>
>>>> + if (iommu->iopf_enabled) {
>>>> + struct vfio_iopf_group *iopf_group;
>>>> +
>>>> + iopf_group = vfio_find_iopf_group(iommu_group);
>>>> + if (!WARN_ON(!iopf_group)) {
>>>> + vfio_unlink_iopf_group(iopf_group);
>>>> + kfree(iopf_group);
>>>> + }
>>>> +
>>>> + iommu_group_for_each_dev(iommu_group, NULL,
>>>> + dev_disable_iopf);
>>>
>>>
>>> Backwards. Also unregistering the fault handler can fail if there are
>>> pending faults. It appears the fault handler and iopf lookup could also
>>> race with this function, ex. fault handler gets an iopf object before
>>> it's removed from the rb-tree, blocks trying to acquire iommu->lock,
>>> disable_iopf fails, detach proceeds, fault handler has use after free
>>> error.
>>
>> We have WARN_ON for the failure of the unregistering.
>>
>> Yeah, it seems that using vfio_iopf_group is not a good choice...
>>
>>>
>>>
>>>> + }
>>>> +
>>>> vfio_iommu_detach_group(domain, group);
>>>> update_dirty_scope = !group->pinned_page_dirty_scope;
>>>> list_del(&group->next);
>>>> @@ -2846,6 +2986,11 @@ static void vfio_iommu_type1_release(void *iommu_data)
>>>>
>>>> vfio_iommu_iova_free(&iommu->iova_list);
>>>>
>>>> + if (iommu->iopf_enabled) {
>>>> + mmu_notifier_unregister(&iommu->mn, iommu->mm);
>>>> + mmdrop(iommu->mm);
>>>> + }
>>>> +
>>>> kfree(iommu);
>>>> }
>>>>
>>>> @@ -3441,6 +3586,76 @@ static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
>>>> .invalidate_range = mn_invalidate_range,
>>>> };
>>>>
>>>> +static int vfio_iommu_type1_enable_iopf(struct vfio_iommu *iommu)
>>>> +{
>>>> + struct vfio_domain *d;
>>>> + struct vfio_group *g;
>>>> + struct vfio_iopf_group *iopf_group;
>>>> + int enabled_dev_cnt = 0;
>>>> + int ret;
>>>> +
>>>> + if (!current->mm)
>>>> + return -ENODEV;
>>>> +
>>>> + mutex_lock(&iommu->lock);
>>>> +
>>>> + mmgrab(current->mm);
>>>
>>>
>>> As above, this is a new and undocumented requirement.
>>>
>>>
>>>> + iommu->mm = current->mm;
>>>> + iommu->mn.ops = &vfio_iommu_type1_mn_ops;
>>>> + ret = mmu_notifier_register(&iommu->mn, current->mm);
>>>> + if (ret)
>>>> + goto out_drop;
>>>> +
>>>> + list_for_each_entry(d, &iommu->domain_list, next) {
>>>> + list_for_each_entry(g, &d->group_list, next) {
>>>> + ret = iommu_group_for_each_dev(g->iommu_group,
>>>> + &enabled_dev_cnt, dev_enable_iopf);
>>>> + if (ret)
>>>> + goto out_unwind;
>>>> +
>>>> + iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
>>>> + if (!iopf_group) {
>>>> + ret = -ENOMEM;
>>>> + goto out_unwind;
>>>> + }
>>>> +
>>>> + iopf_group->iommu_group = g->iommu_group;
>>>> + iopf_group->iommu = iommu;
>>>> +
>>>> + vfio_link_iopf_group(iopf_group);
>>>> + }
>>>> + }
>>>> +
>>>> + iommu->iopf_enabled = true;
>>>> + goto out_unlock;
>>>> +
>>>> +out_unwind:
>>>> + list_for_each_entry(d, &iommu->domain_list, next) {
>>>> + list_for_each_entry(g, &d->group_list, next) {
>>>> + iopf_group = vfio_find_iopf_group(g->iommu_group);
>>>> + if (iopf_group) {
>>>> + vfio_unlink_iopf_group(iopf_group);
>>>> + kfree(iopf_group);
>>>> + }
>>>> +
>>>> + if (iommu_group_for_each_dev(g->iommu_group,
>>>> + &enabled_dev_cnt, dev_disable_iopf))
>>>> + goto out_unregister;
>>>
>>>
>>> This seems broken, we break out of the unwind if any group_for_each
>>> fails??
>>
>> The iommu_group_for_each_dev function will return a negative value if
>> we have disabled all previously enabled devices (@enabled_dev_cnt).
>
> What's the harm in calling disable IOPF on a device where it's already
> disabled? This is why I'm not sure the value of keeping a counter. It
> also presumes that for_each_dev iterates in the same order every time.

In the current implementation (can only have one iommu dev fault handler),
there is no harm in calling disable IOPF on a device where it's already
disabled, but if someone else has already enabled IOPF on the device,
the IOPF enabling here would fail, and goto the unwind process, then
if we disable all the devices regardless of enabled or disabled, we
may disable the devices which were enabled by others.

And if we could have more than one handler per dev in the future, the
iommu interfaces could be re-designed to be harmless for this...

Besides, if we can't assume that for_each_dev iterates in the same order
every time, we may need to keep more than a counter...

>
>>>> + }
>>>> + }
>>>> +
>>>> +out_unregister:
>>>> + mmu_notifier_unregister(&iommu->mn, current->mm);
>>>> +
>>>> +out_drop:
>>>> + iommu->mm = NULL;
>>>> + mmdrop(current->mm);
>>>> +
>>>> +out_unlock:
>>>> + mutex_unlock(&iommu->lock);
>>>> + return ret;
>>>> +}
>>>
>>>
>>> Is there an assumption that the user does this before performing any
>>> mapping? I don't see where current mappings would be converted which
>>> means we'll have pinned pages leaked and vfio_dma objects without a
>>> pinned page bitmap.
>>
>> Yeah, we have an assumption that this ioctl (ENABLE_IOPF) must be called
>> before any MAP/UNMAP ioctl...
>>
>> I will document these later. :-)
>
>
> That can't be fixed with documentation, if it's a requirement, the code
> needs to enforce it. But also why is it a requirement? Theoretically
> we could unpin and de-account existing mappings and allow the mmu
> notifier to handle these as well, right? Thanks,

Make sense. It is not a necessary requirement.

Thanks,
Shenming

>
> Alex
>
> .
>

2021-05-27 11:27:14

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough

On 2021/5/25 6:11, Alex Williamson wrote:
> On Fri, 21 May 2021 14:37:21 +0800
> Shenming Lu <[email protected]> wrote:
>
>> Hi Alex,
>>
>> On 2021/5/19 2:57, Alex Williamson wrote:
>>> On Fri, 9 Apr 2021 11:44:12 +0800
>>> Shenming Lu <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Requesting for your comments and suggestions. :-)
>>>>
>>>> The static pinning and mapping problem in VFIO and possible solutions
>>>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
>>>> Page Fault support for VFIO devices. Different from those relatively
>>>> complicated software approaches such as presenting a vIOMMU that provides
>>>> the DMA buffer information (might include para-virtualized optimizations),
>>>> IOPF mainly depends on the hardware faulting capability, such as the PCIe
>>>> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
>>>> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
>>>> support for VFIO passthrough based on the IOPF part of SVA in this series.
>>>
>>> The SVA proposals are being reworked to make use of a new IOASID
>>> object, it's not clear to me that this shouldn't also make use of that
>>> work as it does a significant expansion of the type1 IOMMU with fault
>>> handlers that would duplicate new work using that new model.
>>
>> It seems that the IOASID extension for guest SVA would not affect this series,
>> will we do any host-guest IOASID translation in the VFIO fault handler?
>
> Surely it will, we don't currently have any IOMMU fault handling or
> forwarding of IOMMU faults through to the vfio bus driver, both of
> those would be included in an IOASID implementation. I think Jason's
> vision is to use IOASID to deprecate type1 for all use cases, so even
> if we were to choose to implement IOPF in type1 we should agree on
> common interfaces with IOASID.

Yeah, the guest IOPF(L1) handling may include the host-guest IOASID translation,
which can be placed in the IOASID layer (in fact it can be placed in many places
such as the vfio pci driver since it don't really handle the fault event, it just
transfers the event to the vIOMMU).
But the host IOPF(L2) has no relationship with IOASID at all, it needs to have a
knowledge of the vfio_dma ranges.
Could we add the host IOPF support to type1 first (integrate it within the MAP ioctl)?
And we may migrate the generic iommu controls (such as MAP/UNMAP...) from type1 to
IOASID in the future (it seems to be a huge work, I will be very happy if I could
help this)... :-)

>
>>>> We have measured its performance with UADK [4] (passthrough an accelerator
>>>> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
>>>>
>>>> Run hisi_sec_test...
>>>> - with varying sending times and message lengths
>>>> - with/without IOPF enabled (speed slowdown)
>>>>
>>>> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
>>>> slowdown (num of faults)
>>>> times VFIO IOPF host SVA
>>>> 1 63.4% (518) 82.8% (512)
>>>> 100 22.9% (1058) 47.9% (1024)
>>>> 1000 2.6% (1071) 8.5% (1024)
>>>>
>>>> when msg_len = 10MB (and PREMAP_LEN = 512):
>>>> slowdown (num of faults)
>>>> times VFIO IOPF
>>>> 1 32.6% (13)
>>>> 100 3.5% (26)
>>>> 1000 1.6% (26)
>>>
>>> It seems like this is only an example that you can make a benchmark
>>> show anything you want. The best results would be to pre-map
>>> everything, which is what we have without this series. What is an
>>> acceptable overhead to incur to avoid page pinning? What if userspace
>>> had more fine grained control over which mappings were available for
>>> faulting and which were statically mapped? I don't really see what
>>> sense the pre-mapping range makes. If I assume the user is QEMU in a
>>> non-vIOMMU configuration, pre-mapping the beginning of each RAM section
>>> doesn't make any logical sense relative to device DMA.
>>
>> As you said in Patch 4, we can introduce full end-to-end functionality
>> before trying to improve performance, and I will drop the pre-mapping patch
>> in the current stage...
>>
>> Is there a need that userspace wants more fine grained control over which
>> mappings are available for faulting? If so, we may evolve the MAP ioctl
>> to support for specifying the faulting range.
>
> You're essentially enabling this for a vfio bus driver via patch 7/8,
> pinning for selective DMA faulting. How would a driver in userspace
> make equivalent requests? In the case of performance, the user could
> mlock the page but they have no mechanism here to pre-fault it. Should
> they?

Make sense.

It seems that we should additionally iommu_map the pages which are IOPF
enabled and pinned in vfio_iommu_type1_pin_pages, and there is no need
to add more tracking structures in Patch 7...

>
>> As for the overhead of IOPF, it is unavoidable if enabling on-demand paging
>> (and page faults occur almost only when first accessing), and it seems that
>> there is a large optimization space compared to CPU page faulting.
>
> Yes, there's of course going to be overhead in terms of latency for the
> page faults. My point was more that when a host is not under memory
> pressure we should trend towards the performance of pinned, static
> mappings and we should be able to create arbitrarily large pre-fault
> behavior to show that.

Make sense.

> But I think what we really want to enable via IOPF is density, right?

density? Did you mean the proportion of the IOPF enabled mappings?

> Therefore how many more assigned device guests
> can you run on a host with IOPF?> How does the slope, plateau, and
> inflection point of their aggregate throughput compare to static
> pinning? VM startup time is probably also a useful metric, ie. trading
> device latency for startup latency. Thanks,

Yeah, these are what we have to consider and test later. :-)
And the slope, plateau, and inflection point of the aggregate throughput
may depend on the specific device driver's behavior (such as whether they
reuse the DMA buffer)...

Thanks,
Shenming

>
> Alex
>
> .
>

2021-05-27 20:47:53

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support

On 2021/5/25 6:11, Alex Williamson wrote:
> On Mon, 24 May 2021 21:11:11 +0800
> Shenming Lu <[email protected]> wrote:
>
>> On 2021/5/21 15:59, Shenming Lu wrote:
>>> On 2021/5/19 2:58, Alex Williamson wrote:
>>>> On Fri, 9 Apr 2021 11:44:20 +0800
>>>> Shenming Lu <[email protected]> wrote:
>>>>
>>>>> To set up nested mode, drivers such as vfio_pci need to register a
>>>>> handler to receive stage/level 1 faults from the IOMMU, but since
>>>>> currently each device can only have one iommu dev fault handler,
>>>>> and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF),
>>>>> we choose to update the registered handler (a consolidated one) via
>>>>> flags (set FAULT_REPORT_NESTED_L1), and further deliver the received
>>>>> stage 1 faults in the handler to the guest through a newly added
>>>>> vfio_device_ops callback.
>>>>
>>>> Are there proposed in-kernel drivers that would use any of these
>>>> symbols?
>>>
>>> I hope that such as Eric's SMMUv3 Nested Stage Setup series [1] can
>>> use these symbols to consolidate the two page fault handlers into one.
>>>
>>> [1] https://patchwork.kernel.org/project/kvm/cover/[email protected]/
>>>
>>>>
>>>>> Signed-off-by: Shenming Lu <[email protected]>
>>>>> ---
>>>>> drivers/vfio/vfio.c | 81 +++++++++++++++++++++++++++++++++
>>>>> drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++-
>>>>> include/linux/vfio.h | 12 +++++
>>>>> 3 files changed, 141 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>>> index 44c8dfabf7de..4245f15914bf 100644
>>>>> --- a/drivers/vfio/vfio.c
>>>>> +++ b/drivers/vfio/vfio.c
>>>>> @@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
>>>>> }
>>>>> EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
>>>>>
>>>>> +/*
>>>>> + * Register/Update the VFIO IOPF handler to receive
>>>>> + * nested stage/level 1 faults.
>>>>> + */
>>>>> +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
>>>>> +{
>>>>> + struct vfio_container *container;
>>>>> + struct vfio_group *group;
>>>>> + struct vfio_iommu_driver *driver;
>>>>> + int ret;
>>>>> +
>>>>> + if (!dev)
>>>>> + return -EINVAL;
>>>>> +
>>>>> + group = vfio_group_get_from_dev(dev);
>>>>> + if (!group)
>>>>> + return -ENODEV;
>>>>> +
>>>>> + ret = vfio_group_add_container_user(group);
>>>>> + if (ret)
>>>>> + goto out;
>>>>> +
>>>>> + container = group->container;
>>>>> + driver = container->iommu_driver;
>>>>> + if (likely(driver && driver->ops->register_handler))
>>>>> + ret = driver->ops->register_handler(container->iommu_data, dev);
>>>>> + else
>>>>> + ret = -ENOTTY;
>>>>> +
>>>>> + vfio_group_try_dissolve_container(group);
>>>>> +
>>>>> +out:
>>>>> + vfio_group_put(group);
>>>>> + return ret;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
>>>>> +
>>>>> +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
>>>>> +{
>>>>> + struct vfio_container *container;
>>>>> + struct vfio_group *group;
>>>>> + struct vfio_iommu_driver *driver;
>>>>> + int ret;
>>>>> +
>>>>> + if (!dev)
>>>>> + return -EINVAL;
>>>>> +
>>>>> + group = vfio_group_get_from_dev(dev);
>>>>> + if (!group)
>>>>> + return -ENODEV;
>>>>> +
>>>>> + ret = vfio_group_add_container_user(group);
>>>>> + if (ret)
>>>>> + goto out;
>>>>> +
>>>>> + container = group->container;
>>>>> + driver = container->iommu_driver;
>>>>> + if (likely(driver && driver->ops->unregister_handler))
>>>>> + ret = driver->ops->unregister_handler(container->iommu_data, dev);
>>>>> + else
>>>>> + ret = -ENOTTY;
>>>>> +
>>>>> + vfio_group_try_dissolve_container(group);
>>>>> +
>>>>> +out:
>>>>> + vfio_group_put(group);
>>>>> + return ret;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested);
>>>>> +
>>>>> +int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault *fault)
>>>>> +{
>>>>> + struct vfio_device *device = dev_get_drvdata(dev);
>>>>> +
>>>>> + if (unlikely(!device->ops->transfer))
>>>>> + return -EOPNOTSUPP;
>>>>> +
>>>>> + return device->ops->transfer(device->device_data, fault);
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(vfio_transfer_iommu_fault);
>>>>> +
>>>>> /**
>>>>> * Module/class support
>>>>> */
>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>> index ba2b5a1cf6e9..9d1adeddb303 100644
>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>> @@ -3821,13 +3821,32 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
>>>>> struct vfio_batch batch;
>>>>> struct vfio_range *range;
>>>>> dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
>>>>> - int access_flags = 0;
>>>>> + int access_flags = 0, nested;
>>>>> size_t premap_len, map_len, mapped_len = 0;
>>>>> unsigned long bit_offset, vaddr, pfn, i, npages;
>>>>> int ret;
>>>>> enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
>>>>> struct iommu_page_response resp = {0};
>>>>>
>>>>> + if (vfio_dev_domian_nested(dev, &nested))
>>>>> + return -ENODEV;
>>>>> +
>>>>> + /*
>>>>> + * When configured in nested mode, further deliver the
>>>>> + * stage/level 1 faults to the guest.
>>>>> + */
>>>>> + if (nested) {
>>>>> + bool l2;
>>>>> +
>>>>> + if (fault->type == IOMMU_FAULT_PAGE_REQ)
>>>>> + l2 = fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2;
>>>>> + if (fault->type == IOMMU_FAULT_DMA_UNRECOV)
>>>>> + l2 = fault->event.flags & IOMMU_FAULT_UNRECOV_L2;
>>>>> +
>>>>> + if (!l2)
>>>>> + return vfio_transfer_iommu_fault(dev, fault);
>>>>> + }
>>>>> +
>>>>> if (fault->type != IOMMU_FAULT_PAGE_REQ)
>>>>> return -EOPNOTSUPP;
>>>>>
>>>>> @@ -4201,6 +4220,32 @@ static void vfio_iommu_type1_notify(void *iommu_data,
>>>>> wake_up_all(&iommu->vaddr_wait);
>>>>> }
>>>>>
>>>>> +static int vfio_iommu_type1_register_handler(void *iommu_data,
>>>>> + struct device *dev)
>>>>> +{
>>>>> + struct vfio_iommu *iommu = iommu_data;
>>>>> +
>>>>> + if (iommu->iopf_enabled)
>>>>> + return iommu_update_device_fault_handler(dev, ~0,
>>>>> + FAULT_REPORT_NESTED_L1);
>>>>> + else
>>>>> + return iommu_register_device_fault_handler(dev,
>>>>> + vfio_iommu_type1_dma_map_iopf,
>>>>> + FAULT_REPORT_NESTED_L1, dev);
>>>>> +}
>>>>> +
>>>>> +static int vfio_iommu_type1_unregister_handler(void *iommu_data,
>>>>> + struct device *dev)
>>>>> +{
>>>>> + struct vfio_iommu *iommu = iommu_data;
>>>>> +
>>>>> + if (iommu->iopf_enabled)
>>>>> + return iommu_update_device_fault_handler(dev,
>>>>> + ~FAULT_REPORT_NESTED_L1, 0);
>>>>> + else
>>>>> + return iommu_unregister_device_fault_handler(dev);
>>>>> +}
>>>>
>>>>
>>>> The path through vfio to register this is pretty ugly, but I don't see
>>>> any reason for the the update interfaces here, the previously
>>>> registered handler just changes its behavior.
>>>
>>> Yeah, this seems not an elegant way...
>>>
>>> If IOPF(L2) enabled, the fault handler has already been registered, so for
>>> nested mode setup, we only need to change the flags of the handler in the
>>> IOMMU driver to receive L1 faults.
>>> (assume that L1 IOPF is configured after L2 IOPF)
>>>
>>> Currently each device can only have one iommu dev fault handler, and L1
>>> and L2 IOPF are configured separately in nested mode, I am also wondering
>>> that is there a better solution for this.
>
> I haven't fully read all the references, but who imposes the fact that
> there's only one fault handler per device? If type1 could register one
> handler and the vfio-pci bus driver another for the other level, would
> we need this path through vfio-core?

If we could register more than one handler per device, things would become
much more simple, and the path through vfio-core would not be needed.

Hi Baolu,
Is there any restriction for having more than one handler per device?

>
>> Let me simply add, maybe there is another way for this:
>> Would it be better to set host IOPF enabled (L2 faults) in the VFIO_IOMMU_MAP_DMA
>> ioctl (no need to add a new ioctl, and we can specify whether this mapping is IOPF
>> available or statically pinned), and set guest IOPF enabled (L1 faults) in the
>> VFIO_IOMMU_SET_PASID_TABLE (from Eric's series) ioctl?
>> And we have no requirement for the sequence of these two ioctls. The first called
>> one will register the handler, and the later one will just update the handler...
>
> This is looking more and more like it belongs with the IOASID work. I
> think Eric has shifted his focus there too. Thanks,

I will pay more attention to the IOASID work.

Thanks,
Shenming

>
> Alex
>
> .
>

2021-05-27 20:48:27

by Baolu Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support

Hi Shenming and Alex,

On 5/27/21 7:03 PM, Shenming Lu wrote:
>> I haven't fully read all the references, but who imposes the fact that
>> there's only one fault handler per device? If type1 could register one
>> handler and the vfio-pci bus driver another for the other level, would
>> we need this path through vfio-core?
> If we could register more than one handler per device, things would become
> much more simple, and the path through vfio-core would not be needed.
>
> Hi Baolu,
> Is there any restriction for having more than one handler per device?
>

Currently, each device could only have one fault handler. But one device
might consume multiple page tables. From this point of view, it's more
reasonable to have one handler per page table.

Best regards,
baolu

2021-06-01 04:38:03

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support

On 2021/5/27 19:18, Lu Baolu wrote:
> Hi Shenming and Alex,
>
> On 5/27/21 7:03 PM, Shenming Lu wrote:
>>> I haven't fully read all the references, but who imposes the fact that
>>> there's only one fault handler per device?  If type1 could register one
>>> handler and the vfio-pci bus driver another for the other level, would
>>> we need this path through vfio-core?
>> If we could register more than one handler per device, things would become
>> much more simple, and the path through vfio-core would not be needed.
>>
>> Hi Baolu,
>> Is there any restriction for having more than one handler per device?
>>
>
> Currently, each device could only have one fault handler. But one device
> might consume multiple page tables. From this point of view, it's more
> reasonable to have one handler per page table.

Sounds good to me. I have pointed it out in the IOASID uAPI proposal. :-)

Thanks,
Shenming

>
> Best regards,
> baolu
> .