On Mon, 14 Mar 2022 15:44:34 -0400
Matthew Rosato <[email protected]> wrote:
> s390x will introduce a new IOMMU domain type where the mappings are
> managed by KVM rather than in response to userspace mapping ioctls. Allow
> for specifying this type on the VFIO_SET_IOMMU ioctl and triggering the
> appropriate iommu interface for overriding the default domain.
>
> Signed-off-by: Matthew Rosato <[email protected]>
> ---
> drivers/vfio/vfio_iommu_type1.c | 12 +++++++++++-
> include/uapi/linux/vfio.h | 6 ++++++
> 2 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 9394aa9444c1..0bec97077d61 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -77,6 +77,7 @@ struct vfio_iommu {
> bool nesting;
> bool dirty_page_tracking;
> bool container_open;
> + bool kvm;
> struct list_head emulated_iommu_groups;
> };
>
> @@ -2203,7 +2204,12 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> goto out_free_group;
>
> ret = -EIO;
> - domain->domain = iommu_domain_alloc(bus);
> +
> + if (iommu->kvm)
> + domain->domain = iommu_domain_alloc_type(bus, IOMMU_DOMAIN_KVM);
> + else
> + domain->domain = iommu_domain_alloc(bus);
> +
> if (!domain->domain)
> goto out_free_domain;
>
> @@ -2552,6 +2558,9 @@ static void *vfio_iommu_type1_open(unsigned long arg)
> case VFIO_TYPE1v2_IOMMU:
> iommu->v2 = true;
> break;
> + case VFIO_KVM_IOMMU:
> + iommu->kvm = true;
> + break;
> default:
> kfree(iommu);
> return ERR_PTR(-EINVAL);
> @@ -2637,6 +2646,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
> case VFIO_TYPE1_NESTING_IOMMU:
> case VFIO_UNMAP_ALL:
> case VFIO_UPDATE_VADDR:
> + case VFIO_KVM_IOMMU:
> return 1;
> case VFIO_DMA_CC_IOMMU:
> if (!iommu)
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ef33ea002b0b..666edb6957ac 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -52,6 +52,12 @@
> /* Supports the vaddr flag for DMA map and unmap */
> #define VFIO_UPDATE_VADDR 10
>
> +/*
> + * The KVM_IOMMU type implies that the hypervisor will control the mappings
> + * rather than userspace
> + */
> +#define VFIO_KVM_IOMMU 11
Then why is this hosted in the type1 code that exposes a wide variety
of userspace interfaces? Thanks,
Alex
On Mon, Mar 14, 2022 at 04:50:33PM -0600, Alex Williamson wrote:
> > +/*
> > + * The KVM_IOMMU type implies that the hypervisor will control the mappings
> > + * rather than userspace
> > + */
> > +#define VFIO_KVM_IOMMU 11
>
> Then why is this hosted in the type1 code that exposes a wide variety
> of userspace interfaces? Thanks,
It is really badly named; this is the root level of a 2-stage nested
IO page table, and this approach needed a special flag to distinguish
the setup from the normal iommu_domain.
If we do try to stick this into VFIO it should probably use the
VFIO_TYPE1_NESTING_IOMMU instead - however, we would like to delete
that flag entirely as it was never fully implemented, was never used,
and isn't part of what we are proposing for IOMMU nesting on ARM
anyhow. (So far I've found nobody who can explain what the plan here was.)
This is why I said the second level should be an explicit iommu_domain
all on its own that is explicitly coupled to the KVM to read the page
tables, if necessary.
But I'm not sure that reading the userspace io page tables with KVM is
even the best thing to do - the iommu driver already has the pinned
memory, it would be faster and more modular to traverse the io page
tables through the pfns in the root iommu_domain than by having KVM do
the translations. Let's see what Matthew says.
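To make the alternative concrete: the idea is that a guest IOVA can be resolved by walking the guest IO page table and then reusing the pins the root (stage-1) iommu_domain already holds, rather than asking KVM for a gpa->hpa translation. A minimal standalone sketch of that lookup, where struct root_domain, the arrays, and all names are hypothetical stand-ins and not kernel API:

```c
#include <stdint.h>

#define GFN_INVALID ((uint64_t)-1)
#define NPAGES 16

/*
 * Hypothetical model of the root (stage-1) iommu_domain: it already
 * holds the pinned guest memory, so it can answer gfn -> host pfn
 * directly without a KVM translation callback.
 */
struct root_domain {
	uint64_t pfn[NPAGES];	/* host pfn for each pinned guest frame */
};

/*
 * Resolve a guest IOVA page to a host pfn: walk the guest IO page
 * table (here a flat iova-pfn -> gfn array), then reuse the existing
 * pin recorded in the root domain.
 */
static uint64_t iova_to_host_pfn(const struct root_domain *root,
				 const uint64_t *guest_iopt,
				 uint64_t iova_pfn)
{
	uint64_t gfn = guest_iopt[iova_pfn];	/* step 1: guest table walk */

	if (gfn == GFN_INVALID || gfn >= NPAGES)
		return GFN_INVALID;
	return root->pfn[gfn];			/* step 2: reuse the pin */
}
```

The point of the sketch is only that step 2 never leaves the iommu layer; the real series would walk hardware-format tables instead of a flat array.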
Jason
On 3/14/22 7:18 PM, Jason Gunthorpe wrote:
> On Mon, Mar 14, 2022 at 04:50:33PM -0600, Alex Williamson wrote:
>
>>> +/*
>>> + * The KVM_IOMMU type implies that the hypervisor will control the mappings
>>> + * rather than userspace
>>> + */
>>> +#define VFIO_KVM_IOMMU 11
>>
>> Then why is this hosted in the type1 code that exposes a wide variety
>> of userspace interfaces? Thanks,
>
> It is really badly named; this is the root level of a 2-stage nested
> IO page table, and this approach needed a special flag to distinguish
> the setup from the normal iommu_domain.
^^ Yes, this.
>
> If we do try to stick this into VFIO it should probably use the
> VFIO_TYPE1_NESTING_IOMMU instead - however, we would like to delete
> that flag entirely as it was never fully implemented, was never used,
> and isn't part of what we are proposing for IOMMU nesting on ARM
> anyhow. (So far I've found nobody who can explain what the plan here was.)
>
I'm open to suggestions on how better to tie this into vfio. The
scenario basically plays out that:
1) the iommu will be domain_alloc'd once VFIO_SET_IOMMU is issued -- so
at that time (or earlier) we have to make the decision on whether to use
the standard IOMMU or this alternate KVM/nested IOMMU.
2) At the time VFIO_SET_IOMMU is received, we have not yet associated
the vfio group with a KVM, so we can't (today) use this as an indicator
to guess which IOMMU strategy to use.
3) Ideally, even if we changed QEMU vfio to make the KVM association
earlier, it would be nice to still be able to indicate that we want to
use the standard iommu/type1 despite a KVM association existing (e.g.
backwards compatibility with older QEMU that lacks 'interpretation'
support, nested virtualization scenarios).
> This is why I said the second level should be an explicit iommu_domain
> all on its own that is explicitly coupled to the KVM to read the page
> tables, if necessary.
Maybe I misunderstood this. Are you proposing 2 layers of IOMMU that
interact with each other within host kernel space?
A second level runs the guest tables and pins the appropriate pieces from
the guest to get the resulting phys_addr(s), which are then passed to a
first level via iommu map (or unmap)?
>
> But I'm not sure that reading the userspace io page tables with KVM is
> even the best thing to do - the iommu driver already has the pinned
> memory, it would be faster and more modular to traverse the io page
> tables through the pfns in the root iommu_domain than by having KVM do
> the translations. Let's see what Matthew says.
OK, you lost me a bit here. And this may be associated with the above.
So, what the current implementation does is read the guest DMA
tables (which we must pin the first time we access them) and then map
the PTEs of the associated guest DMA entries into the associated host
DMA table (so, again, pin and place the address, or unpin and
invalidate). Basically we are shadowing the first-level DMA table as a
copy of the second-level DMA table, with the host address(es) of the
pinned guest page(s).
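That shadow operation can be modeled as a tiny standalone sketch; pin_guest_page() and the fixed host offset below are hypothetical stand-ins for the real pin step, not kernel code:

```c
#include <stdint.h>

#define NENT		8
#define PTE_INVALID	0

/*
 * Stand-in for the pin step: the real code pins the guest page and
 * obtains its host address; here it is a fixed fake translation.
 */
static uint64_t pin_guest_page(uint64_t guest_pte)
{
	return guest_pte + 0x1000000;	/* pretend host offset */
}

/*
 * Shadow one guest DMA table entry into the host DMA table:
 * pin the guest page and place its host address.
 */
static void shadow_map(uint64_t *host_tbl, const uint64_t *guest_tbl,
		       int idx)
{
	host_tbl[idx] = pin_guest_page(guest_tbl[idx]);
}

/* Invalidate: unpin the guest page (elided) and clear the shadow PTE. */
static void shadow_unmap(uint64_t *host_tbl, int idx)
{
	host_tbl[idx] = PTE_INVALID;
}
```

The open question in the thread is only who performs this walk: the iommu driver itself against pfns it already holds, or KVM doing the guest-physical-to-host translation.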
I'm unclear where you are proposing the pinning be done if not by the
iommu domain traversing the tables to perform the 'shadow' operation.