> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, March 15, 2022 7:18 AM
>
> On Mon, Mar 14, 2022 at 04:50:33PM -0600, Alex Williamson wrote:
>
> > > +/*
> > > + * The KVM_IOMMU type implies that the hypervisor will control the mappings
> > > + * rather than userspace
> > > + */
> > > +#define VFIO_KVM_IOMMU 11
> >
> > Then why is this hosted in the type1 code that exposes a wide variety
> > of userspace interfaces? Thanks,
>
> It is really badly named; this is the root level of a 2-stage nested
> IO page table, and this approach needed a special flag to distinguish
> the setup from the normal iommu_domain.
>
> If we do try to stick this into VFIO it should probably use the
> VFIO_TYPE1_NESTING_IOMMU instead - however, we would like to delete
> that flag entirely as it was never fully implemented, was never used,
> and isn't part of what we are proposing for IOMMU nesting on ARM
> anyhow. (So far I've found nobody to explain what the plan here was..)
>
> This is why I said the second level should be an explicit iommu_domain
> all on its own that is explicitly coupled to the KVM to read the page
> tables, if necessary.
>
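(To make the shape of that concrete: a minimal standalone sketch, where
everything except the opaque iommu_domain and kvm declarations is a
hypothetical name, not existing kernel code.)

struct iommu_domain;                    /* opaque kernel type */
struct kvm;                             /* opaque kernel type */

struct nested_parent {
        struct iommu_domain *s2_domain; /* explicit 2nd-level (GPA->HPA) domain */
        struct kvm *kvm;                /* coupling to KVM, used only if the
                                         * guest tables must be read through
                                         * KVM; possibly unnecessary if the
                                         * pinned pfns held by s2_domain are
                                         * walked instead (see below) */
};
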
> But I'm not sure that reading the userspace io page tables with KVM is
> even the best thing to do - the iommu driver already has the pinned
> memory; it would be faster and more modular to traverse the io page
> tables through the pfns in the root iommu_domain than to have KVM do
> the translations. Let's see what Matthew says.
>
Reading this thread, this sounds like an optimization of software nesting.
If that is the case, does it make more sense to complete the basic form
of software nesting first and then add this optimization on top?
The basic form would allow userspace to create a special domain type
which points to a user/guest page table (like hardware nesting) but
doesn't install the user page table into the IOMMU hardware (unlike
hardware nesting). When it receives an invalidation command from
userspace, the iommu driver walks the user page table (1st level) and
the parent page table (2nd level) to generate shadow mappings for the
invalidated range in the non-nested hardware page table of this
special domain type.
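As a rough standalone sketch of that walk (flat single-level tables and
made-up names throughout, not real driver code), assuming the parent
domain already holds pinned host pfns for the guest physical space:

#include <stdint.h>
#include <stddef.h>

#define SPAGE_SHIFT 12
#define SPAGE_SIZE  (1UL << SPAGE_SHIFT)

struct soft_nested_domain {
        uint64_t *guest_table;  /* 1st level: gpa the guest mapped, per iova page */
        uint64_t *parent_pfns;  /* 2nd level: gpa-indexed host pfns, already pinned */
        uint64_t *hw_table;     /* non-nested table actually programmed into HW */
        size_t    nr_pages;     /* size of all three tables, in pages (simplification) */
};

/* Re-shadow [iova, iova + len) after an invalidation request. */
int soft_nested_invalidate(struct soft_nested_domain *d,
                           uint64_t iova, uint64_t len)
{
        for (uint64_t off = 0; off < len; off += SPAGE_SIZE) {
                size_t idx = (iova + off) >> SPAGE_SHIFT;
                uint64_t gpa, pfn;

                if (idx >= d->nr_pages)
                        return -1;

                /* 1st-level walk: what does the guest currently map at this iova? */
                gpa = d->guest_table[idx];
                if (!gpa) {
                        d->hw_table[idx] = 0;   /* guest unmapped this iova */
                        continue;
                }

                /* 2nd-level walk: resolve the gpa via the parent's pinned pfns */
                if ((gpa >> SPAGE_SHIFT) >= d->nr_pages)
                        return -1;
                pfn = d->parent_pfns[gpa >> SPAGE_SHIFT];
                if (!pfn)
                        return -1;              /* gpa not mapped in the parent */

                d->hw_table[idx] = pfn << SPAGE_SHIFT;
        }
        return 0;
}
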
Once that works, what this series does only changes how the
invalidation command is triggered. Previously the iommu driver received
the invalidation command from QEMU (via iommufd uAPI), whereas now it
receives the command from KVM (via iommufd kAPI) upon interception of
RPCIT. From that angle, once the connection between iommufd and the
kvm fd is established, there is no direct talk between the iommu
driver and KVM at all.
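Sketching that difference (both entry points below are made-up names,
not existing iommufd interfaces): the iommu-side work stays identical,
only the caller changes.

#include <stdint.h>

struct soft_nested_domain;              /* from the sketch above */
int soft_nested_invalidate(struct soft_nested_domain *d,
                           uint64_t iova, uint64_t len);

/* Basic form: userspace (e.g. QEMU) forwards the guest invalidation. */
int iommufd_invalidate_from_user(struct soft_nested_domain *d,
                                 uint64_t iova, uint64_t len)
{
        return soft_nested_invalidate(d, iova, len);
}

/* This series: KVM intercepts RPCIT and calls in-kernel into iommufd. */
int iommufd_invalidate_from_kvm(struct soft_nested_domain *d,
                                uint64_t iova, uint64_t len)
{
        return soft_nested_invalidate(d, iova, len);
}
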
Thanks
Kevin
On 3/15/22 3:57 AM, Tian, Kevin wrote:
>> From: Jason Gunthorpe <[email protected]>
>> Sent: Tuesday, March 15, 2022 7:18 AM
>>
>> [...]
>
> Reading this thread it's sort of like an optimization to software nesting.
Yes, we want to avoid exiting to userspace for a very frequent
operation (RPCIT / updating the shadow mappings).
> If that is the case does it make more sense to complete the basic form
> of software nesting first and then adds this optimization?
>
> The basic form would allow the userspace to create a special domain
> type which points to a user/guest page table (like hardware nesting)
> but doesn't install the user page table to the IOMMU hardware (unlike
> hardware nesting). When receiving invalidate cmd from userspace
> the iommu driver walks the user page table (1st-level) and the parent
> page table (2nd-level) to generate a shadow mapping for the
> invalidated range in the non-nested hardware page table of this
> special domain type.
>
> Once that works what this series does just changes the matter of
> how the invalidate cmd is triggered. Previously iommu driver receives
> invalidate cmd from Qemu (via iommufd uAPI) while now receiving
> the cmd from kvm (via iommufd kAPI) upon interception of RPCIT.
> From this angle once the connection between iommufd and kvm fd
> is established there is even no direct talk between iommu driver and
> kvm.
But something somewhere still needs to be responsible for
pinning/unpinning the guest table entries upon each RPCIT
interception. E.g. the RPCIT intercept can happen because the guest
wants to invalidate some old mappings or has generated some new
mappings over a range, so we must shadow the new mappings (by pinning
the guest entries and placing them in the host hardware table) and
drop the invalidated ones (by unpinning them and clearing their
entries in the host hardware table).
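As a standalone sketch of that responsibility (guest_page_pin,
guest_page_unpin and every other name here are hypothetical stand-ins,
not existing KVM or iommu interfaces):

#include <stdint.h>
#include <stddef.h>

#define RPAGE_SHIFT 12
#define RPAGE_SIZE  (1UL << RPAGE_SHIFT)

struct kvm;                             /* opaque kernel type */

/* Hypothetical refcounted pin/unpin of one guest page, by guest-physical address. */
int  guest_page_pin(struct kvm *kvm, uint64_t gpa, uint64_t *host_pfn);
void guest_page_unpin(struct kvm *kvm, uint64_t gpa);

struct rpcit_domain {
        struct kvm *kvm;
        uint64_t   *guest_table;        /* 1st level: gpa per iova page */
        uint64_t   *pinned_gpa;         /* gpa we currently hold a pin on, per iova page */
        uint64_t   *hw_table;           /* entries programmed into the hardware table */
        size_t      nr_pages;
};

/* Called on each RPCIT interception, for the refreshed/invalidated range. */
int rpcit_reshadow(struct rpcit_domain *d, uint64_t iova, uint64_t len)
{
        for (uint64_t off = 0; off < len; off += RPAGE_SIZE) {
                size_t idx = (iova + off) >> RPAGE_SHIFT;
                uint64_t gpa, pfn;

                if (idx >= d->nr_pages)
                        return -1;
                gpa = d->guest_table[idx];

                if (!gpa) {
                        /* Guest invalidated this entry: clear the HW entry, drop the pin. */
                        if (d->pinned_gpa[idx]) {
                                guest_page_unpin(d->kvm, d->pinned_gpa[idx]);
                                d->pinned_gpa[idx] = 0;
                        }
                        d->hw_table[idx] = 0;
                        continue;
                }

                /* New or changed mapping: pin the guest page, then release the old pin. */
                if (guest_page_pin(d->kvm, gpa, &pfn))
                        return -1;
                if (d->pinned_gpa[idx])
                        guest_page_unpin(d->kvm, d->pinned_gpa[idx]);
                d->pinned_gpa[idx] = gpa;
                d->hw_table[idx]   = pfn << RPAGE_SHIFT;
        }
        return 0;
}
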
On 3/15/22 10:17 AM, Matthew Rosato wrote:
> On 3/15/22 3:57 AM, Tian, Kevin wrote:
>>> From: Jason Gunthorpe <[email protected]>
>>> Sent: Tuesday, March 15, 2022 7:18 AM
>>>
>>> [...]
>>
>> Reading this thread it's sort of like an optimization to software
>> nesting.
>
> Yes, we want to avoid breaking to userspace for a very frequent
> operation (RPCIT / updating shadow mappings)
>
>> If that is the case does it make more sense to complete the basic form
>> of software nesting first and then adds this optimization?
>>
>> The basic form would allow the userspace to create a special domain
>> type which points to a user/guest page table (like hardware nesting)
>> but doesn't install the user page table to the IOMMU hardware (unlike
>> hardware nesting). When receiving invalidate cmd from userspace
>> the iommu driver walks the user page table (1st-level) and the parent
>> page table (2nd-level) to generate a shadow mapping for the
>> invalidated range in the non-nested hardware page table of this
>> special domain type.
>>
>> Once that works what this series does just changes the matter of
>> how the invalidate cmd is triggered. Previously iommu driver receives
>> invalidate cmd from Qemu (via iommufd uAPI) while now receiving
>> the cmd from kvm (via iommufd kAPI) upon interception of RPCIT.
>> From this angle once the connection between iommufd and kvm fd
>> is established there is even no direct talk between iommu driver and
>> kvm.
>
> But something somewhere still needs to be responsible for
> pinning/unpinning of the guest table entries upon each RPCIT
> interception. e.g. the RPCIT intercept can happen because the guest
> wants to invalidate some old mappings or has generated some new mappings
> over a range, so we must shadow the new mappings (by pinning the guest
> entries and placing them in the host hardware table / unpinning
> invalidated ones and clearing their entry in the host hardware table).
>
OK, this got clarified by Jason in another thread: what I was missing
here was the assumption that the 1st-level has already mapped and
pinned all of guest physical address space; in that case there's no
need to invoke pin/unpin operations against KVM from within the iommu
domain. (This series as-is does not pin all of the guest physical
address space; it pins/unpins on demand at RPCIT time.)
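For contrast with the on-demand pinning this series does, a sketch of
the pre-pinned assumption (same hypothetical guest_page_pin helper and
made-up names as above): everything is pinned once at setup, so
RPCIT-time re-shadowing never pins or unpins.

#include <stdint.h>
#include <stddef.h>

#define PPAGE_SHIFT 12

struct kvm;                             /* opaque kernel type */

/* Hypothetical pin of one guest page, by guest-physical address. */
int guest_page_pin(struct kvm *kvm, uint64_t gpa, uint64_t *host_pfn);

struct prepinned_parent {
        struct kvm *kvm;
        uint64_t   *gpa_to_pfn;         /* filled once here, only read afterwards */
        size_t      nr_pages;           /* guest physical address space, in pages */
};

/* Pin and record all of guest memory up front, at domain setup time. */
int prepin_guest_memory(struct prepinned_parent *p)
{
        for (size_t i = 0; i < p->nr_pages; i++) {
                uint64_t gpa = (uint64_t)i << PPAGE_SHIFT;

                if (guest_page_pin(p->kvm, gpa, &p->gpa_to_pfn[i]))
                        return -1;      /* caller unwinds pages pinned so far */
        }
        return 0;
}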