Hi,
The static pinning and mapping problem in VFIO and possible solutions
have been discussed a lot [1, 2]. One of the solutions is to add I/O
page fault support for VFIO devices. Unlike the relatively complicated
software approaches, such as presenting a vIOMMU that provides the DMA
buffer information (possibly with para-virtualized optimizations), IOPF
mainly depends on the hardware faulting capability, such as the PCIe PRI
extension or the Arm SMMU stall model. What's more, IOPF support in the
IOMMU driver is being implemented as part of SVA [3]. So should we
consider adding IOPF support for VFIO passthrough based on the IOPF part
of SVA at this point?
We have implemented a basic demo for only one stage of translation (GPA
-> HPA in virtualization; note that it can be configured at either stage),
and tested it on a Hisilicon Kunpeng920 board. The nested mode is more
complicated since VFIO only handles the second-stage page faults (same as
the non-nested case), while the first-stage page faults need to be further
delivered to the guest, which is being implemented in [4] on ARM. My
thought on this is to report the page faults to VFIO regardless of the
stage at which they occurred (trying to carry the stage information), and
handle them respectively according to the configured mode in VFIO. Or the
IOMMU driver might evolve to support more...
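To make the faulting path concrete, below is a simplified sketch of the
kind of fault handler this series adds to the type1 backend (illustrative
only; vfio_find_dma() already exists in vfio_iommu_type1.c, while the
iopf_mapped_bitmap field and the vfio_iopf_pin_map_page() helper are
placeholder names for what the patches introduce):

/*
 * Simplified sketch, not the exact code in the patches.
 */
static int vfio_iommu_type1_iopf_handler(struct iommu_fault *fault, void *data)
{
	struct vfio_iommu *iommu = data;
	dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
	struct vfio_dma *dma;
	unsigned long bit;
	int ret = -EINVAL;

	if (fault->type != IOMMU_FAULT_PAGE_REQ)
		return -EOPNOTSUPP;

	mutex_lock(&iommu->lock);

	/* Find the userspace DMA mapping that contains the faulting IOVA. */
	dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
	if (dma) {
		bit = (iova - dma->iova) >> PAGE_SHIFT;
		if (test_bit(bit, dma->iopf_mapped_bitmap)) {
			/* Raced with another fault; the page is already mapped. */
			ret = 0;
		} else {
			/* Pin the backing user page and map it in the IOMMU. */
			ret = vfio_iopf_pin_map_page(iommu, dma, iova);
			if (!ret)
				bitmap_set(dma->iopf_mapped_bitmap, bit, 1);
		}
	}

	mutex_unlock(&iommu->lock);

	/* A complete handler also sends a page response to retry/abort the DMA. */
	return ret;
}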
Might TODO:
- Optimize the faulting path, and measure the performance (it might still
be a big issue).
- Add support for PRI.
- Add an MMU notifier to avoid pinning.
- Add support for the nested mode.
...
Any comments and suggestions are very welcome. :-)
Links:
[1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS,
2016.
[2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer Tracking
for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020.
[3] iommu: I/O page faults for SMMUv3:
https://patchwork.kernel.org/project/linux-arm-kernel/cover/[email protected]/
[4] SMMUv3 Nested Stage Setup (VFIO part):
https://patchwork.kernel.org/project/kvm/cover/[email protected]/
Thanks,
Shenming
Shenming Lu (4):
vfio/type1: Add a bitmap to track IOPF mapped pages
vfio: Add a page fault handler
vfio: Try to enable IOPF for VFIO devices
vfio: Allow to pin and map dynamically
drivers/vfio/vfio.c | 75 ++++++++++++++++++
drivers/vfio/vfio_iommu_type1.c | 131 +++++++++++++++++++++++++++++++-
include/linux/vfio.h | 6 ++
3 files changed, 211 insertions(+), 1 deletion(-)
--
2.19.1
On Mon, 25 Jan 2021 17:03:58 +0800
Shenming Lu <[email protected]> wrote:
> Hi,
>
> The static pinning and mapping problem in VFIO and possible solutions
> have been discussed a lot [1, 2]. One of the solutions is to add I/O
> page fault support for VFIO devices. Different from those relatively
> complicated software approaches such as presenting a vIOMMU that provides
> the DMA buffer information (might include para-virtualized optimizations),
> IOPF mainly depends on the hardware faulting capability, such as the PCIe
> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
> the IOMMU driver is being implemented in SVA [3]. So do we consider to
> add IOPF support for VFIO passthrough based on the IOPF part of SVA at
> present?
>
> We have implemented a basic demo only for one stage of translation (GPA
> -> HPA in virtualization, note that it can be configured at either stage),
> and tested on Hisilicon Kunpeng920 board. The nested mode is more complicated
> since VFIO only handles the second stage page faults (same as the non-nested
> case), while the first stage page faults need to be further delivered to
> the guest, which is being implemented in [4] on ARM. My thought on this
> is to report the page faults to VFIO regardless of the occured stage (try
> to carry the stage information), and handle respectively according to the
> configured mode in VFIO. Or the IOMMU driver might evolve to support more...
>
> Might TODO:
> - Optimize the faulting path, and measure the performance (it might still
> be a big issue).
> - Add support for PRI.
> - Add a MMU notifier to avoid pinning.
> - Add support for the nested mode.
> ...
>
> Any comments and suggestions are very welcome. :-)
I expect performance to be pretty bad here; the lookup involved per
fault is excessive. There are cases where a user is not going to be
willing to have a slow ramp-up of performance for their devices as they
fault in pages, so we might need to consider making this configurable
through the vfio interface. Our page mapping also only grows here;
should mappings expire, or do we need a least-recently-mapped tracker to
avoid exceeding the user's locked memory limit? How does a user know
what to set for a locked memory limit? The behavior here would lead to
cases where an idle system might be ok, but as soon as load increases
with more in-flight DMA, we start seeing "unpredictable" I/O faults from
the user perspective. Seems like there are lots of outstanding
considerations, and I'd also like to hear from the SVA folks about how
this meshes with their work. Thanks,
Alex
On 2021/1/30 6:57, Alex Williamson wrote:
> On Mon, 25 Jan 2021 17:03:58 +0800
> Shenming Lu <[email protected]> wrote:
>
>> Hi,
>>
>> The static pinning and mapping problem in VFIO and possible solutions
>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
>> page fault support for VFIO devices. Different from those relatively
>> complicated software approaches such as presenting a vIOMMU that provides
>> the DMA buffer information (might include para-virtualized optimizations),
>> IOPF mainly depends on the hardware faulting capability, such as the PCIe
>> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
>> the IOMMU driver is being implemented in SVA [3]. So do we consider to
>> add IOPF support for VFIO passthrough based on the IOPF part of SVA at
>> present?
>>
>> We have implemented a basic demo only for one stage of translation (GPA
>> -> HPA in virtualization, note that it can be configured at either stage),
>> and tested on Hisilicon Kunpeng920 board. The nested mode is more complicated
>> since VFIO only handles the second stage page faults (same as the non-nested
>> case), while the first stage page faults need to be further delivered to
>> the guest, which is being implemented in [4] on ARM. My thought on this
>> is to report the page faults to VFIO regardless of the occured stage (try
>> to carry the stage information), and handle respectively according to the
>> configured mode in VFIO. Or the IOMMU driver might evolve to support more...
>>
>> Might TODO:
>> - Optimize the faulting path, and measure the performance (it might still
>> be a big issue).
>> - Add support for PRI.
>> - Add a MMU notifier to avoid pinning.
>> - Add support for the nested mode.
>> ...
>>
>> Any comments and suggestions are very welcome. :-)
>
> I expect performance to be pretty bad here, the lookup involved per
> fault is excessive.
We might consider pre-pinning more pages as a further optimization.
> There are cases where a user is not going to be
> willing to have a slow ramp up of performance for their devices as they
> fault in pages, so we might need to considering making this
> configurable through the vfio interface.
Yeah, makes sense. I will try to implement this: maybe add an ioctl called
VFIO_IOMMU_ENABLE_IOPF for the Type1 VFIO IOMMU...
> Our page mapping also only
> grows here, should mappings expire or do we need a least recently
> mapped tracker to avoid exceeding the user's locked memory limit? How
> does a user know what to set for a locked memory limit?
Yeah, we can add an LRU (mapped) tracker to release pages when a memory
limit is exceeded, maybe with a thread to periodically check this.
And as for the memory limit, maybe we could give the user some levels
(10% (default)/30%/50%/70%/unlimited of the total user memory (mapping size))
to choose from via the VFIO_IOMMU_ENABLE_IOPF ioctl...
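Just to illustrate the direction (purely a sketch; neither the ioctl
number nor the structure layout is decided), the uapi might look
something like:

/* Hypothetical uapi sketch for the idea above; all values are placeholders. */
struct vfio_iommu_type1_enable_iopf {
	__u32	argsz;
	__u32	flags;
	/*
	 * Upper bound on IOPF-mapped memory as a percentage of the total
	 * mapping size: e.g. 10 (default), 30, 50, 70, or 0 for unlimited.
	 */
	__u32	memory_limit;
};
#define VFIO_IOMMU_ENABLE_IOPF	_IO(VFIO_TYPE, VFIO_BASE + 20)	/* number TBD */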
> The behavior
> here would lead to cases where an idle system might be ok, but as soon
> as load increases with more inflight DMA, we start seeing
> "unpredictable" I/O faults from the user perspective.
"unpredictable" I/O faults? We might see more problems after more testing...
Thanks,
Shenming
> Seems like there
> are lots of outstanding considerations and I'd also like to hear from
> the SVA folks about how this meshes with their work. Thanks,
>
> Alex
>
> .
>
> From: Alex Williamson <[email protected]>
> Sent: Saturday, January 30, 2021 6:58 AM
>
> On Mon, 25 Jan 2021 17:03:58 +0800
> Shenming Lu <[email protected]> wrote:
>
> > Hi,
> >
> > The static pinning and mapping problem in VFIO and possible solutions
> > have been discussed a lot [1, 2]. One of the solutions is to add I/O
> > page fault support for VFIO devices. Different from those relatively
> > complicated software approaches such as presenting a vIOMMU that
> provides
> > the DMA buffer information (might include para-virtualized optimizations),
> > IOPF mainly depends on the hardware faulting capability, such as the PCIe
> > PRI extension or Arm SMMU stall model. What's more, the IOPF support in
> > the IOMMU driver is being implemented in SVA [3]. So do we consider to
> > add IOPF support for VFIO passthrough based on the IOPF part of SVA at
> > present?
> >
> > We have implemented a basic demo only for one stage of translation (GPA
> > -> HPA in virtualization, note that it can be configured at either stage),
> > and tested on Hisilicon Kunpeng920 board. The nested mode is more
> complicated
> > since VFIO only handles the second stage page faults (same as the non-
> nested
> > case), while the first stage page faults need to be further delivered to
> > the guest, which is being implemented in [4] on ARM. My thought on this
> > is to report the page faults to VFIO regardless of the occured stage (try
> > to carry the stage information), and handle respectively according to the
> > configured mode in VFIO. Or the IOMMU driver might evolve to support
> more...
> >
> > Might TODO:
> > - Optimize the faulting path, and measure the performance (it might still
> > be a big issue).
> > - Add support for PRI.
> > - Add a MMU notifier to avoid pinning.
> > - Add support for the nested mode.
> > ...
> >
> > Any comments and suggestions are very welcome. :-)
>
> I expect performance to be pretty bad here, the lookup involved per
> fault is excessive. There are cases where a user is not going to be
> willing to have a slow ramp up of performance for their devices as they
> fault in pages, so we might need to considering making this
> configurable through the vfio interface. Our page mapping also only
There is another factor to be considered. The presence of
IOMMU_DEV_FEAT_IOPF just indicates the device's capability of triggering
I/O page faults through the IOMMU, but does not necessarily mean that
the device can tolerate I/O page faults for arbitrary DMA requests. In
reality, many devices allow I/O faulting only in selective contexts.
However, there is no standard way (e.g. PCISIG) for the device to report
whether arbitrary I/O faults are allowed. Then we may have to maintain
device-specific knowledge in software, e.g. in an opt-in table listing
devices which allow arbitrary faults. For devices which only support
selective faulting, a mediator (either through vendor extensions on
vfio-pci-core or an mdev wrapper) might be necessary to help lock down
non-faultable mappings and then enable faulting on the remaining mappings.
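As a rough illustration of the opt-in table idea (made-up code, the IDs
and helper name are placeholders):

/* Made-up sketch of an opt-in table; real IDs would come from vendors. */
static const struct pci_device_id vfio_arbitrary_iopf_ids[] = {
	/* devices known to tolerate I/O page faults on any DMA request */
	{ PCI_DEVICE(0x1234, 0x5678) },	/* placeholder vendor/device */
	{ }
};

static bool vfio_dev_tolerates_arbitrary_iopf(struct pci_dev *pdev)
{
	return pci_match_id(vfio_arbitrary_iopf_ids, pdev) != NULL;
}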
> grows here, should mappings expire or do we need a least recently
> mapped tracker to avoid exceeding the user's locked memory limit? How
> does a user know what to set for a locked memory limit? The behavior
> here would lead to cases where an idle system might be ok, but as soon
> as load increases with more inflight DMA, we start seeing
> "unpredictable" I/O faults from the user perspective. Seems like there
> are lots of outstanding considerations and I'd also like to hear from
> the SVA folks about how this meshes with their work. Thanks,
>
The main overlap between this feature and SVA is the IOPF reporting
framework, which currently still has a gap in supporting both in nested
mode, as discussed here:
https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/
Once that gap is resolved in the future, the VFIO fault handler just
adopts different actions according to the fault level: 1st-level faults
are forwarded to userspace through the vSVA path, while 2nd-level faults
are fixed (or warned about if not intended) by VFIO itself through the
IOMMU mapping interface.
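In other words, something along these lines (a sketch only; the L1 flag
and both helpers are assumptions, since the nested reporting interface
doesn't exist yet):

/* Sketch only: the L1 flag and both helpers below are assumptions. */
static int vfio_iommu_nested_fault_handler(struct iommu_fault *fault, void *data)
{
	struct vfio_iommu *iommu = data;

	if (fault->type != IOMMU_FAULT_PAGE_REQ)
		return -EOPNOTSUPP;

	if (fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L1)	/* hypothetical */
		/* 1st-level fault: forward to userspace through the vSVA path. */
		return vfio_iommu_deliver_fault_to_user(iommu, fault);

	/* 2nd-level fault: fix it up through the IOMMU mapping interface. */
	return vfio_iommu_handle_s2_fault(iommu, fault);
}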
Thanks
Kevin
On 2021/2/1 15:56, Tian, Kevin wrote:
>> From: Alex Williamson <[email protected]>
>> Sent: Saturday, January 30, 2021 6:58 AM
>>
>> On Mon, 25 Jan 2021 17:03:58 +0800
>> Shenming Lu <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> The static pinning and mapping problem in VFIO and possible solutions
>>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
>>> page fault support for VFIO devices. Different from those relatively
>>> complicated software approaches such as presenting a vIOMMU that
>> provides
>>> the DMA buffer information (might include para-virtualized optimizations),
>>> IOPF mainly depends on the hardware faulting capability, such as the PCIe
>>> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
>>> the IOMMU driver is being implemented in SVA [3]. So do we consider to
>>> add IOPF support for VFIO passthrough based on the IOPF part of SVA at
>>> present?
>>>
>>> We have implemented a basic demo only for one stage of translation (GPA
>>> -> HPA in virtualization, note that it can be configured at either stage),
>>> and tested on Hisilicon Kunpeng920 board. The nested mode is more
>> complicated
>>> since VFIO only handles the second stage page faults (same as the non-
>> nested
>>> case), while the first stage page faults need to be further delivered to
>>> the guest, which is being implemented in [4] on ARM. My thought on this
>>> is to report the page faults to VFIO regardless of the occured stage (try
>>> to carry the stage information), and handle respectively according to the
>>> configured mode in VFIO. Or the IOMMU driver might evolve to support
>> more...
>>>
>>> Might TODO:
>>> - Optimize the faulting path, and measure the performance (it might still
>>> be a big issue).
>>> - Add support for PRI.
>>> - Add a MMU notifier to avoid pinning.
>>> - Add support for the nested mode.
>>> ...
>>>
>>> Any comments and suggestions are very welcome. :-)
>>
>> I expect performance to be pretty bad here, the lookup involved per
>> fault is excessive. There are cases where a user is not going to be
>> willing to have a slow ramp up of performance for their devices as they
>> fault in pages, so we might need to considering making this
>> configurable through the vfio interface. Our page mapping also only
>
> There is another factor to be considered. The presence of IOMMU_
> DEV_FEAT_IOPF just indicates the device capability of triggering I/O
> page fault through the IOMMU, but not exactly means that the device
> can tolerate I/O page fault for arbitrary DMA requests.
Yes, so I add an iopf_enabled field in VFIO to indicate the whole-path
faulting capability, and set it to true after registering a VFIO page
fault handler.
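Roughly like this (simplified; IOMMU_DEV_FEAT_IOPF comes from the SVA
series [3], and the handler/field names here are just for illustration):

/* Simplified enabling flow; names are illustrative. */
static int vfio_enable_iopf(struct vfio_device *device)
{
	struct device *dev = device->dev;
	int ret;

	ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);
	if (ret)
		return ret;

	ret = iommu_register_device_fault_handler(dev, vfio_iommu_dev_fault_handler,
						   device);
	if (ret) {
		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF);
		return ret;
	}

	device->iopf_enabled = true;
	return 0;
}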
> In reality, many
> devices allow I/O faulting only in selective contexts. However, there
> is no standard way (e.g. PCISIG) for the device to report whether
> arbitrary I/O fault is allowed. Then we may have to maintain device
> specific knowledge in software, e.g. in an opt-in table to list devices
> which allows arbitrary faults. For devices which only support selective
> faulting, a mediator (either through vendor extensions on vfio-pci-core
> or a mdev wrapper) might be necessary to help lock down non-faultable
> mappings and then enable faulting on the rest mappings.
For devices which only support selective faulting, couldn't they tell the
IOMMU driver and let it filter out non-faultable faults? Or am I getting
it wrong?
>
>> grows here, should mappings expire or do we need a least recently
>> mapped tracker to avoid exceeding the user's locked memory limit? How
>> does a user know what to set for a locked memory limit? The behavior
>> here would lead to cases where an idle system might be ok, but as soon
>> as load increases with more inflight DMA, we start seeing
>> "unpredictable" I/O faults from the user perspective. Seems like there
>> are lots of outstanding considerations and I'd also like to hear from
>> the SVA folks about how this meshes with their work. Thanks,
>>
>
> The main overlap between this feature and SVA is the IOPF reporting
> framework, which currently still has gap to support both in nested
> mode, as discussed here:
>
> https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/
>
> Once that gap is resolved in the future, the VFIO fault handler just
> adopts different actions according to the fault-level: 1st level faults
> are forwarded to userspace thru the vSVA path while 2nd-level faults
> are fixed (or warned if not intended) by VFIO itself thru the IOMMU
> mapping interface.
If I understand you correctly: from the perspective of VFIO, we first need
to set FEAT_IOPF, and then register VFIO's own handler with a flag to
indicate FLAT or NESTED and which level is concerned, so that the VFIO
handler can handle the page faults directly according to the carried
level information.
Is there any plan for evolving (implementing) the IOMMU driver to support
this? Or could we help with this? :-)
Thanks,
Shenming
>
> Thanks
> Kevin
>
> From: Shenming Lu <[email protected]>
> Sent: Tuesday, February 2, 2021 2:42 PM
>
> On 2021/2/1 15:56, Tian, Kevin wrote:
> >> From: Alex Williamson <[email protected]>
> >> Sent: Saturday, January 30, 2021 6:58 AM
> >>
> >> On Mon, 25 Jan 2021 17:03:58 +0800
> >> Shenming Lu <[email protected]> wrote:
> >>
> >>> Hi,
> >>>
> >>> The static pinning and mapping problem in VFIO and possible solutions
> >>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
> >>> page fault support for VFIO devices. Different from those relatively
> >>> complicated software approaches such as presenting a vIOMMU that
> >> provides
> >>> the DMA buffer information (might include para-virtualized
> optimizations),
> >>> IOPF mainly depends on the hardware faulting capability, such as the
> PCIe
> >>> PRI extension or Arm SMMU stall model. What's more, the IOPF support
> in
> >>> the IOMMU driver is being implemented in SVA [3]. So do we consider to
> >>> add IOPF support for VFIO passthrough based on the IOPF part of SVA at
> >>> present?
> >>>
> >>> We have implemented a basic demo only for one stage of translation
> (GPA
> >>> -> HPA in virtualization, note that it can be configured at either stage),
> >>> and tested on Hisilicon Kunpeng920 board. The nested mode is more
> >> complicated
> >>> since VFIO only handles the second stage page faults (same as the non-
> >> nested
> >>> case), while the first stage page faults need to be further delivered to
> >>> the guest, which is being implemented in [4] on ARM. My thought on this
> >>> is to report the page faults to VFIO regardless of the occured stage (try
> >>> to carry the stage information), and handle respectively according to the
> >>> configured mode in VFIO. Or the IOMMU driver might evolve to support
> >> more...
> >>>
> >>> Might TODO:
> >>> - Optimize the faulting path, and measure the performance (it might still
> >>> be a big issue).
> >>> - Add support for PRI.
> >>> - Add a MMU notifier to avoid pinning.
> >>> - Add support for the nested mode.
> >>> ...
> >>>
> >>> Any comments and suggestions are very welcome. :-)
> >>
> >> I expect performance to be pretty bad here, the lookup involved per
> >> fault is excessive. There are cases where a user is not going to be
> >> willing to have a slow ramp up of performance for their devices as they
> >> fault in pages, so we might need to considering making this
> >> configurable through the vfio interface. Our page mapping also only
> >
> > There is another factor to be considered. The presence of IOMMU_
> > DEV_FEAT_IOPF just indicates the device capability of triggering I/O
> > page fault through the IOMMU, but not exactly means that the device
> > can tolerate I/O page fault for arbitrary DMA requests.
>
> Yes, so I add a iopf_enabled field in VFIO to indicate the whole path faulting
> capability and set it to true after registering a VFIO page fault handler.
>
> > In reality, many
> > devices allow I/O faulting only in selective contexts. However, there
> > is no standard way (e.g. PCISIG) for the device to report whether
> > arbitrary I/O fault is allowed. Then we may have to maintain device
> > specific knowledge in software, e.g. in an opt-in table to list devices
> > which allows arbitrary faults. For devices which only support selective
> > faulting, a mediator (either through vendor extensions on vfio-pci-core
> > or a mdev wrapper) might be necessary to help lock down non-faultable
> > mappings and then enable faulting on the rest mappings.
>
> For devices which only support selective faulting, they could tell it to the
> IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?
Not exactly to the IOMMU driver. There is already a vfio_pin_pages() for
selective page-pinning. The matter is that 'they' implies some
device-specific logic to decide which pages must be pinned, and such
knowledge is outside of VFIO.
From an enabling p.o.v. we could possibly do it in a phased approach:
first handle devices which tolerate arbitrary DMA faults, and then extend
to devices with selective faulting. The former is simpler, but with one
main open question: whether we want to maintain such device IDs in a
static table in VFIO or rely on some hints from other components (e.g.
the PF driver in the VF assignment case). Let's see what Alex thinks
about it.
>
> >
> >> grows here, should mappings expire or do we need a least recently
> >> mapped tracker to avoid exceeding the user's locked memory limit? How
> >> does a user know what to set for a locked memory limit? The behavior
> >> here would lead to cases where an idle system might be ok, but as soon
> >> as load increases with more inflight DMA, we start seeing
> >> "unpredictable" I/O faults from the user perspective. Seems like there
> >> are lots of outstanding considerations and I'd also like to hear from
> >> the SVA folks about how this meshes with their work. Thanks,
> >>
> >
> > The main overlap between this feature and SVA is the IOPF reporting
> > framework, which currently still has gap to support both in nested
> > mode, as discussed here:
> >
> > https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/
> >
> > Once that gap is resolved in the future, the VFIO fault handler just
> > adopts different actions according to the fault-level: 1st level faults
> > are forwarded to userspace thru the vSVA path while 2nd-level faults
> > are fixed (or warned if not intended) by VFIO itself thru the IOMMU
> > mapping interface.
>
> I understand what you mean is:
> From the perspective of VFIO, first, we need to set FEAT_IOPF, and then
> regster its
> own handler with a flag to indicate FLAT or NESTED and which level is
> concerned,
> thus the VFIO handler can handle the page faults directly according to the
> carried
> level information.
>
> Is there any plan for evolving(implementing) the IOMMU driver to support
> this? Or
> could we help this? :-)
>
Yes, it's in the plan but just hasn't happened yet. We are still focusing
on the guest SVA part, thus only the 1st-level page fault (+Yi/Jacob).
You are always welcome to collaborate/help if you have time. :-)
Thanks
Kevin
Hi,
On Thu, Feb 04, 2021 at 06:52:10AM +0000, Tian, Kevin wrote:
> > >>> The static pinning and mapping problem in VFIO and possible solutions
> > >>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
> > >>> page fault support for VFIO devices. Different from those relatively
> > >>> complicated software approaches such as presenting a vIOMMU that
> > >> provides
> > >>> the DMA buffer information (might include para-virtualized
> > optimizations),
I'm curious about the performance difference between this and the
map/unmap vIOMMU, as well as the coIOMMU. This is probably a lot faster
but those don't depend on IOPF, which is a pretty rare feature at the
moment.
[...]
> > > In reality, many
> > > devices allow I/O faulting only in selective contexts. However, there
> > > is no standard way (e.g. PCISIG) for the device to report whether
> > > arbitrary I/O fault is allowed. Then we may have to maintain device
> > > specific knowledge in software, e.g. in an opt-in table to list devices
> > > which allows arbitrary faults. For devices which only support selective
> > > faulting, a mediator (either through vendor extensions on vfio-pci-core
> > > or a mdev wrapper) might be necessary to help lock down non-faultable
> > > mappings and then enable faulting on the rest mappings.
> >
> > For devices which only support selective faulting, they could tell it to the
> > IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?
>
> Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
> selectively page-pinning. The matter is that 'they' imply some device
> specific logic to decide which pages must be pinned and such knowledge
> is outside of VFIO.
>
> From enabling p.o.v we could possibly do it in phased approach. First
> handles devices which tolerate arbitrary DMA faults, and then extends
> to devices with selective-faulting. The former is simpler, but with one
> main open whether we want to maintain such device IDs in a static
> table in VFIO or rely on some hints from other components (e.g. PF
> driver in VF assignment case). Let's see how Alex thinks about it.
Do you think selective-faulting will be the norm, or only a problem for
initial IOPF implementations? To me it's the selective-faulting kind of
device that will be the odd one out, but that's pure speculation. Either
way maintaining a device list seems like a pain.
[...]
> Yes, it's in plan but just not happened yet. We are still focusing on guest
> SVA part thus only the 1st-level page fault (+Yi/Jacob). It's always welcomed
> to collaborate/help if you have time. ????
By the way, the current fault report API is missing a way to invalidate
partial faults: when the IOMMU device's PRI queue overflows, it may
auto-respond to page request groups that were already partially reported
by the IOMMU driver. Upon detecting an overflow, the IOMMU driver needs to
tell all fault consumers to discard their partial groups.
iopf_queue_discard_partial() [1] does this for the internal IOPF handler,
but we have nothing for the lower-level fault handler at the moment. And
it gets more complicated when injecting IOPFs into guests: we'd need a
mechanism to recall partial groups all the way through kernel->userspace
and userspace->guest.
Shenming suggests [2] also using the IOPF handler for IOPFs managed by
device drivers. It's worth considering in my opinion, because we could
hold partial groups within the kernel and only report full groups to
device drivers (and guests). In addition we'd consolidate tracking of
IOPFs, since it is currently done both in iommu_report_device_fault() and
in the IOPF handler.
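For instance, the IOPF layer could queue requests until the last one in
the group arrives, something like this (rough sketch only;
iopf_report_group() is a made-up name):

/* Rough sketch: hold partial groups and only report complete ones. */
struct iopf_pending_fault {
	struct iommu_fault_page_request prm;
	struct list_head list;
};

static int iopf_queue_or_report(struct device *dev, struct iommu_fault *fault,
				struct list_head *partial_list)
{
	struct iopf_pending_fault *pf;

	if (fault->type != IOMMU_FAULT_PAGE_REQ)
		return -EOPNOTSUPP;

	pf = kzalloc(sizeof(*pf), GFP_KERNEL);
	if (!pf)
		return -ENOMEM;

	pf->prm = fault->prm;
	list_add_tail(&pf->list, partial_list);

	/* On PRI queue overflow, partial_list can simply be flushed. */
	if (!(fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE))
		return 0;

	/* Last request of the group: report the whole group downstream. */
	return iopf_report_group(dev, partial_list);	/* made-up name */
}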
Note that I plan to upstream the IOPF patch [1] as is because it was
already in good shape for 5.12, and consolidating the fault handler will
require some thinking.
Thanks,
Jean
[1] https://lore.kernel.org/linux-iommu/[email protected]/
[2] https://lore.kernel.org/linux-iommu/[email protected]/
> From: Jean-Philippe Brucker <[email protected]>
> Sent: Friday, February 5, 2021 6:37 PM
>
> Hi,
>
> On Thu, Feb 04, 2021 at 06:52:10AM +0000, Tian, Kevin wrote:
> > > >>> The static pinning and mapping problem in VFIO and possible
> solutions
> > > >>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
> > > >>> page fault support for VFIO devices. Different from those relatively
> > > >>> complicated software approaches such as presenting a vIOMMU that
> > > >> provides
> > > >>> the DMA buffer information (might include para-virtualized
> > > optimizations),
>
> I'm curious about the performance difference between this and the
> map/unmap vIOMMU, as well as the coIOMMU. This is probably a lot faster
> but those don't depend on IOPF which is a pretty rare feature at the
> moment.
>
> [...]
> > > > In reality, many
> > > > devices allow I/O faulting only in selective contexts. However, there
> > > > is no standard way (e.g. PCISIG) for the device to report whether
> > > > arbitrary I/O fault is allowed. Then we may have to maintain device
> > > > specific knowledge in software, e.g. in an opt-in table to list devices
> > > > which allows arbitrary faults. For devices which only support selective
> > > > faulting, a mediator (either through vendor extensions on vfio-pci-core
> > > > or a mdev wrapper) might be necessary to help lock down non-
> faultable
> > > > mappings and then enable faulting on the rest mappings.
> > >
> > > For devices which only support selective faulting, they could tell it to the
> > > IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?
> >
> > Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
> > selectively page-pinning. The matter is that 'they' imply some device
> > specific logic to decide which pages must be pinned and such knowledge
> > is outside of VFIO.
> >
> > From enabling p.o.v we could possibly do it in phased approach. First
> > handles devices which tolerate arbitrary DMA faults, and then extends
> > to devices with selective-faulting. The former is simpler, but with one
> > main open whether we want to maintain such device IDs in a static
> > table in VFIO or rely on some hints from other components (e.g. PF
> > driver in VF assignment case). Let's see how Alex thinks about it.
>
> Do you think selective-faulting will be the norm, or only a problem for
> initial IOPF implementations? To me it's the selective-faulting kind of
> device that will be the odd one out, but that's pure speculation. Either
> way maintaining a device list seems like a pain.
I would think it will be the norm for quite some time (e.g. multiple
years), as from what I have learned, turning a complex accelerator into
an implementation tolerating arbitrary DMA faults is very complex (in
every critical path) and not cost-effective (tracking in-flight
requests). It might be OK for some purpose-built devices in specific
usages, but for most it has to be an evolving path toward the
100%-faultable goal...
>
> [...]
> > Yes, it's in plan but just not happened yet. We are still focusing on guest
> > SVA part thus only the 1st-level page fault (+Yi/Jacob). It's always
> welcomed
> > to collaborate/help if you have time. ????
>
> By the way the current fault report API is missing a way to invalidate
> partial faults: when the IOMMU device's PRI queue overflows, it may
> auto-respond to page request groups that were already partially reported
> by the IOMMU driver. Upon detecting an overflow, the IOMMU driver needs
> to
> tell all fault consumers to discard their partial groups.
> iopf_queue_discard_partial() [1] does this for the internal IOPF handler
> but we have nothing for the lower-level fault handler at the moment. And
> it gets more complicated when injecting IOPFs to guests, we'd need a
> mechanism to recall partial groups all the way through kernel->userspace
> and userspace->guest.
I don't know how to recall partial groups through emulated vIOMMUs
(at least for virtual VT-d). Possibly it could be supported by virtio-iommu.
But in any case I consider it more of an optimization than a functional
requirement (and it could be avoided with Shenming's suggestion below).
>
> Shenming suggests [2] to also use the IOPF handler for IOPFs managed by
> device drivers. It's worth considering in my opinion because we could hold
> partial groups within the kernel and only report full groups to device
> drivers (and guests). In addition we'd consolidate tracking of IOPFs,
> since they're done both by iommu_report_device_fault() and the IOPF
> handler at the moment.
I also think it's the right thing to do. Conceptually, w/ or w/o
DEV_FEAT_IOPF just reflects how IOPFs are delivered to the system
software. In the end IOPFs are all about permission violations in the
IOMMU page tables, thus we should try to reuse/consolidate the IOMMU
fault reporting stack as much as possible.
>
> Note that I plan to upstream the IOPF patch [1] as is because it was
> already in good shape for 5.12, and consolidating the fault handler will
> require some thinking.
This plan makes sense.
>
> Thanks,
> Jean
>
>
> [1] https://lore.kernel.org/linux-iommu/20210127154322.3959196-7-jean-
> [email protected]/
> [2] https://lore.kernel.org/linux-iommu/f79f06be-e46b-a65a-3951-
> [email protected]/
Thanks
Kevin
On 2021/2/7 16:20, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker <[email protected]>
>> Sent: Friday, February 5, 2021 6:37 PM
>>
>> Hi,
>>
>> On Thu, Feb 04, 2021 at 06:52:10AM +0000, Tian, Kevin wrote:
>>>>>>> The static pinning and mapping problem in VFIO and possible
>> solutions
>>>>>>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
>>>>>>> page fault support for VFIO devices. Different from those relatively
>>>>>>> complicated software approaches such as presenting a vIOMMU that
>>>>>> provides
>>>>>>> the DMA buffer information (might include para-virtualized
>>>> optimizations),
>>
>> I'm curious about the performance difference between this and the
>> map/unmap vIOMMU, as well as the coIOMMU. This is probably a lot faster
>> but those don't depend on IOPF which is a pretty rare feature at the
>> moment.
Yeah, I will give the performance data later.
>>
>> [...]
>>>>> In reality, many
>>>>> devices allow I/O faulting only in selective contexts. However, there
>>>>> is no standard way (e.g. PCISIG) for the device to report whether
>>>>> arbitrary I/O fault is allowed. Then we may have to maintain device
>>>>> specific knowledge in software, e.g. in an opt-in table to list devices
>>>>> which allows arbitrary faults. For devices which only support selective
>>>>> faulting, a mediator (either through vendor extensions on vfio-pci-core
>>>>> or a mdev wrapper) might be necessary to help lock down non-
>> faultable
>>>>> mappings and then enable faulting on the rest mappings.
>>>>
>>>> For devices which only support selective faulting, they could tell it to the
>>>> IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?
>>>
>>> Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
>>> selectively page-pinning. The matter is that 'they' imply some device
>>> specific logic to decide which pages must be pinned and such knowledge
>>> is outside of VFIO.
>>>
>>> From enabling p.o.v we could possibly do it in phased approach. First
>>> handles devices which tolerate arbitrary DMA faults, and then extends
>>> to devices with selective-faulting. The former is simpler, but with one
>>> main open whether we want to maintain such device IDs in a static
>>> table in VFIO or rely on some hints from other components (e.g. PF
>>> driver in VF assignment case). Let's see how Alex thinks about it.
>>
>> Do you think selective-faulting will be the norm, or only a problem for
>> initial IOPF implementations? To me it's the selective-faulting kind of
>> device that will be the odd one out, but that's pure speculation. Either
>> way maintaining a device list seems like a pain.
>
> I would think it's norm for quite some time (e.g. multiple years), as from
> what I learned turning a complex accelerator to an implementation
> tolerating arbitrary DMA fault is way complex (in every critical path) and
> not cost effective (tracking in-fly requests). It might be OK for some
> purposely-built devices in specific usage but for most it has to be an
> evolving path toward the 100%-faultable goal...
>
>>
>> [...]
>>> Yes, it's in plan but just not happened yet. We are still focusing on guest
>>> SVA part thus only the 1st-level page fault (+Yi/Jacob). It's always
>> welcomed
>>> to collaborate/help if you have time. ????
>>
>> By the way the current fault report API is missing a way to invalidate
>> partial faults: when the IOMMU device's PRI queue overflows, it may
>> auto-respond to page request groups that were already partially reported
>> by the IOMMU driver. Upon detecting an overflow, the IOMMU driver needs
>> to
>> tell all fault consumers to discard their partial groups.
>> iopf_queue_discard_partial() [1] does this for the internal IOPF handler
>> but we have nothing for the lower-level fault handler at the moment. And
>> it gets more complicated when injecting IOPFs to guests, we'd need a
>> mechanism to recall partial groups all the way through kernel->userspace
>> and userspace->guest.
>
> I didn't know how to recall partial groups through emulated vIOMMUs
> (at least for virtual VT-d). Possibly it could be supported by virtio-iommu.
> But in any case I consider it more like an optimization instead of a functional
> requirement (and could be avoided in below Shenming's suggestion).
>
>>
>> Shenming suggests [2] to also use the IOPF handler for IOPFs managed by
>> device drivers. It's worth considering in my opinion because we could hold
>> partial groups within the kernel and only report full groups to device
>> drivers (and guests). In addition we'd consolidate tracking of IOPFs,
>> since they're done both by iommu_report_device_fault() and the IOPF
>> handler at the moment.
>
> I also think it's the right thing to do. In concept w/ or w/o DEV_FEAT_IOPF
> just reflects how IOPFs are delivered to the system software. In the end
> IOPFs are all about permission violations in the IOMMU page tables thus
> we should try to reuse/consolidate the IOMMU fault reporting stack as
> much as possible.
>
>>
>> Note that I plan to upstream the IOPF patch [1] as is because it was
>> already in good shape for 5.12, and consolidating the fault handler will
>> require some thinking.
>
> This plan makes sense.
Yeah, maybe this problem could be left for the implementation of a
device (VFIO) specific fault handler... :-)
Thanks,
Shenming
>
>>
>> Thanks,
>> Jean
>>
>>
>> [1] https://lore.kernel.org/linux-iommu/20210127154322.3959196-7-jean-
>> [email protected]/
>> [2] https://lore.kernel.org/linux-iommu/f79f06be-e46b-a65a-3951-
>> [email protected]/
>
> Thanks
> Kevin
>
> From: Tian, Kevin <[email protected]>
> Sent: Thursday, February 4, 2021 2:52 PM
>
> > From: Shenming Lu <[email protected]>
> > Sent: Tuesday, February 2, 2021 2:42 PM
> >
> > On 2021/2/1 15:56, Tian, Kevin wrote:
> > >> From: Alex Williamson <[email protected]>
> > >> Sent: Saturday, January 30, 2021 6:58 AM
> > >>
> > >> On Mon, 25 Jan 2021 17:03:58 +0800
> > >> Shenming Lu <[email protected]> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> The static pinning and mapping problem in VFIO and possible
> solutions
> > >>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
> > >>> page fault support for VFIO devices. Different from those relatively
> > >>> complicated software approaches such as presenting a vIOMMU that
> > >> provides
> > >>> the DMA buffer information (might include para-virtualized
> > optimizations),
> > >>> IOPF mainly depends on the hardware faulting capability, such as the
> > PCIe
> > >>> PRI extension or Arm SMMU stall model. What's more, the IOPF
> support
> > in
> > >>> the IOMMU driver is being implemented in SVA [3]. So do we
> consider to
> > >>> add IOPF support for VFIO passthrough based on the IOPF part of SVA
> at
> > >>> present?
> > >>>
> > >>> We have implemented a basic demo only for one stage of translation
> > (GPA
> > >>> -> HPA in virtualization, note that it can be configured at either stage),
> > >>> and tested on Hisilicon Kunpeng920 board. The nested mode is more
> > >> complicated
> > >>> since VFIO only handles the second stage page faults (same as the
> non-
> > >> nested
> > >>> case), while the first stage page faults need to be further delivered to
> > >>> the guest, which is being implemented in [4] on ARM. My thought on
> this
> > >>> is to report the page faults to VFIO regardless of the occured stage
> (try
> > >>> to carry the stage information), and handle respectively according to
> the
> > >>> configured mode in VFIO. Or the IOMMU driver might evolve to
> support
> > >> more...
> > >>>
> > >>> Might TODO:
> > >>> - Optimize the faulting path, and measure the performance (it might
> still
> > >>> be a big issue).
> > >>> - Add support for PRI.
> > >>> - Add a MMU notifier to avoid pinning.
> > >>> - Add support for the nested mode.
> > >>> ...
> > >>>
> > >>> Any comments and suggestions are very welcome. :-)
> > >>
> > >> I expect performance to be pretty bad here, the lookup involved per
> > >> fault is excessive. There are cases where a user is not going to be
> > >> willing to have a slow ramp up of performance for their devices as they
> > >> fault in pages, so we might need to considering making this
> > >> configurable through the vfio interface. Our page mapping also only
> > >
> > > There is another factor to be considered. The presence of IOMMU_
> > > DEV_FEAT_IOPF just indicates the device capability of triggering I/O
> > > page fault through the IOMMU, but not exactly means that the device
> > > can tolerate I/O page fault for arbitrary DMA requests.
> >
> > Yes, so I add a iopf_enabled field in VFIO to indicate the whole path
> faulting
> > capability and set it to true after registering a VFIO page fault handler.
> >
> > > In reality, many
> > > devices allow I/O faulting only in selective contexts. However, there
> > > is no standard way (e.g. PCISIG) for the device to report whether
> > > arbitrary I/O fault is allowed. Then we may have to maintain device
> > > specific knowledge in software, e.g. in an opt-in table to list devices
> > > which allows arbitrary faults. For devices which only support selective
> > > faulting, a mediator (either through vendor extensions on vfio-pci-core
> > > or a mdev wrapper) might be necessary to help lock down non-faultable
> > > mappings and then enable faulting on the rest mappings.
> >
> > For devices which only support selective faulting, they could tell it to the
> > IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?
>
> Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
> selectively page-pinning. The matter is that 'they' imply some device
> specific logic to decide which pages must be pinned and such knowledge
> is outside of VFIO.
>
> From enabling p.o.v we could possibly do it in phased approach. First
> handles devices which tolerate arbitrary DMA faults, and then extends
> to devices with selective-faulting. The former is simpler, but with one
> main open whether we want to maintain such device IDs in a static
> table in VFIO or rely on some hints from other components (e.g. PF
> driver in VF assignment case). Let's see how Alex thinks about it.
>
> >
> > >
> > >> grows here, should mappings expire or do we need a least recently
> > >> mapped tracker to avoid exceeding the user's locked memory limit?
> How
> > >> does a user know what to set for a locked memory limit? The behavior
> > >> here would lead to cases where an idle system might be ok, but as
> soon
> > >> as load increases with more inflight DMA, we start seeing
> > >> "unpredictable" I/O faults from the user perspective. Seems like there
> > >> are lots of outstanding considerations and I'd also like to hear from
> > >> the SVA folks about how this meshes with their work. Thanks,
> > >>
> > >
> > > The main overlap between this feature and SVA is the IOPF reporting
> > > framework, which currently still has gap to support both in nested
> > > mode, as discussed here:
> > >
> > > https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/
> > >
> > > Once that gap is resolved in the future, the VFIO fault handler just
> > > adopts different actions according to the fault-level: 1st level faults
> > > are forwarded to userspace thru the vSVA path while 2nd-level faults
> > > are fixed (or warned if not intended) by VFIO itself thru the IOMMU
> > > mapping interface.
> >
> > I understand what you mean is:
> > From the perspective of VFIO, first, we need to set FEAT_IOPF, and then
> > regster its
> > own handler with a flag to indicate FLAT or NESTED and which level is
> > concerned,
> > thus the VFIO handler can handle the page faults directly according to the
> > carried
> > level information.
> >
> > Is there any plan for evolving(implementing) the IOMMU driver to
> support
> > this? Or
> > could we help this? :-)
> >
>
> Yes, it's in plan but just not happened yet. We are still focusing on guest
> SVA part thus only the 1st-level page fault (+Yi/Jacob). It's always welcomed
> to collaborate/help if you have time. ??
Yeah, I saw Eric's page fault support patch is listed as a reference. BTW,
one thing to clarify: currently only one IOMMU fault handler is supported
for a single device. So the fault handler added in this series should be
consolidated with the one added in Eric's series.
Regards,
Yi Liu
> Thanks
> Kevin
On 2021/2/9 19:06, Liu, Yi L wrote:
>> From: Tian, Kevin <[email protected]>
>> Sent: Thursday, February 4, 2021 2:52 PM
>>
>>> From: Shenming Lu <[email protected]>
>>> Sent: Tuesday, February 2, 2021 2:42 PM
>>>
>>> On 2021/2/1 15:56, Tian, Kevin wrote:
>>>>> From: Alex Williamson <[email protected]>
>>>>> Sent: Saturday, January 30, 2021 6:58 AM
>>>>>
>>>>> On Mon, 25 Jan 2021 17:03:58 +0800
>>>>> Shenming Lu <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The static pinning and mapping problem in VFIO and possible
>> solutions
>>>>>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
>>>>>> page fault support for VFIO devices. Different from those relatively
>>>>>> complicated software approaches such as presenting a vIOMMU that
>>>>> provides
>>>>>> the DMA buffer information (might include para-virtualized
>>> optimizations),
>>>>>> IOPF mainly depends on the hardware faulting capability, such as the
>>> PCIe
>>>>>> PRI extension or Arm SMMU stall model. What's more, the IOPF
>> support
>>> in
>>>>>> the IOMMU driver is being implemented in SVA [3]. So do we
>> consider to
>>>>>> add IOPF support for VFIO passthrough based on the IOPF part of SVA
>> at
>>>>>> present?
>>>>>>
>>>>>> We have implemented a basic demo only for one stage of translation
>>> (GPA
>>>>>> -> HPA in virtualization, note that it can be configured at either stage),
>>>>>> and tested on Hisilicon Kunpeng920 board. The nested mode is more
>>>>> complicated
>>>>>> since VFIO only handles the second stage page faults (same as the
>> non-
>>>>> nested
>>>>>> case), while the first stage page faults need to be further delivered to
>>>>>> the guest, which is being implemented in [4] on ARM. My thought on
>> this
>>>>>> is to report the page faults to VFIO regardless of the occured stage
>> (try
>>>>>> to carry the stage information), and handle respectively according to
>> the
>>>>>> configured mode in VFIO. Or the IOMMU driver might evolve to
>> support
>>>>> more...
>>>>>>
>>>>>> Might TODO:
>>>>>> - Optimize the faulting path, and measure the performance (it might
>> still
>>>>>> be a big issue).
>>>>>> - Add support for PRI.
>>>>>> - Add a MMU notifier to avoid pinning.
>>>>>> - Add support for the nested mode.
>>>>>> ...
>>>>>>
>>>>>> Any comments and suggestions are very welcome. :-)
>>>>>
>>>>> I expect performance to be pretty bad here, the lookup involved per
>>>>> fault is excessive. There are cases where a user is not going to be
>>>>> willing to have a slow ramp up of performance for their devices as they
>>>>> fault in pages, so we might need to considering making this
>>>>> configurable through the vfio interface. Our page mapping also only
>>>>
>>>> There is another factor to be considered. The presence of IOMMU_
>>>> DEV_FEAT_IOPF just indicates the device capability of triggering I/O
>>>> page fault through the IOMMU, but not exactly means that the device
>>>> can tolerate I/O page fault for arbitrary DMA requests.
>>>
>>> Yes, so I add a iopf_enabled field in VFIO to indicate the whole path
>> faulting
>>> capability and set it to true after registering a VFIO page fault handler.
>>>
>>>> In reality, many
>>>> devices allow I/O faulting only in selective contexts. However, there
>>>> is no standard way (e.g. PCISIG) for the device to report whether
>>>> arbitrary I/O fault is allowed. Then we may have to maintain device
>>>> specific knowledge in software, e.g. in an opt-in table to list devices
>>>> which allows arbitrary faults. For devices which only support selective
>>>> faulting, a mediator (either through vendor extensions on vfio-pci-core
>>>> or a mdev wrapper) might be necessary to help lock down non-faultable
>>>> mappings and then enable faulting on the rest mappings.
>>>
>>> For devices which only support selective faulting, they could tell it to the
>>> IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?
>>
>> Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
>> selectively page-pinning. The matter is that 'they' imply some device
>> specific logic to decide which pages must be pinned and such knowledge
>> is outside of VFIO.
>>
>> From enabling p.o.v we could possibly do it in phased approach. First
>> handles devices which tolerate arbitrary DMA faults, and then extends
>> to devices with selective-faulting. The former is simpler, but with one
>> main open whether we want to maintain such device IDs in a static
>> table in VFIO or rely on some hints from other components (e.g. PF
>> driver in VF assignment case). Let's see how Alex thinks about it.
>>
>>>
>>>>
>>>>> grows here, should mappings expire or do we need a least recently
>>>>> mapped tracker to avoid exceeding the user's locked memory limit?
>> How
>>>>> does a user know what to set for a locked memory limit? The behavior
>>>>> here would lead to cases where an idle system might be ok, but as
>> soon
>>>>> as load increases with more inflight DMA, we start seeing
>>>>> "unpredictable" I/O faults from the user perspective. Seems like there
>>>>> are lots of outstanding considerations and I'd also like to hear from
>>>>> the SVA folks about how this meshes with their work. Thanks,
>>>>>
>>>>
>>>> The main overlap between this feature and SVA is the IOPF reporting
>>>> framework, which currently still has gap to support both in nested
>>>> mode, as discussed here:
>>>>
>>>> https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/
>>>>
>>>> Once that gap is resolved in the future, the VFIO fault handler just
>>>> adopts different actions according to the fault-level: 1st level faults
>>>> are forwarded to userspace thru the vSVA path while 2nd-level faults
>>>> are fixed (or warned if not intended) by VFIO itself thru the IOMMU
>>>> mapping interface.
>>>
>>> I understand what you mean is:
>>> From the perspective of VFIO, first, we need to set FEAT_IOPF, and then
>>> regster its
>>> own handler with a flag to indicate FLAT or NESTED and which level is
>>> concerned,
>>> thus the VFIO handler can handle the page faults directly according to the
>>> carried
>>> level information.
>>>
>>> Is there any plan for evolving(implementing) the IOMMU driver to
>> support
>>> this? Or
>>> could we help this? :-)
>>>
>>
>> Yes, it's in plan but just not happened yet. We are still focusing on guest
>> SVA part thus only the 1st-level page fault (+Yi/Jacob). It's always welcomed
>> to collaborate/help if you have time. ??
>
> yeah, I saw Eric's page fault support patch is listed as reference. BTW.
> one thing needs to clarify, currently only one iommu fault handler supported
> for a single device. So for the fault handler added in this series, it should
> be consolidated with the one added in Eric's series.
Yeah, they could be combined. And maybe we could register the handler in
this series with the IOMMU driver, and when it receives page faults,
further call the handler in Eric's series (maybe implemented as a callback
in vfio_device_ops) if the fault occurs at the 1st level.
We have to communicate with Eric about this. :-)
Thanks,
Shenming
On 2021/2/4 14:52, Tian, Kevin wrote:
>>> In reality, many
>>> devices allow I/O faulting only in selective contexts. However, there
>>> is no standard way (e.g. PCISIG) for the device to report whether
>>> arbitrary I/O fault is allowed. Then we may have to maintain device
>>> specific knowledge in software, e.g. in an opt-in table to list devices
>>> which allows arbitrary faults. For devices which only support selective
>>> faulting, a mediator (either through vendor extensions on vfio-pci-core
>>> or a mdev wrapper) might be necessary to help lock down non-faultable
>>> mappings and then enable faulting on the rest mappings.
>>
>> For devices which only support selective faulting, they could tell it to the
>> IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?
>
> Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
> selectively page-pinning. The matter is that 'they' imply some device
> specific logic to decide which pages must be pinned and such knowledge
> is outside of VFIO.
>
> From enabling p.o.v we could possibly do it in phased approach. First
> handles devices which tolerate arbitrary DMA faults, and then extends
> to devices with selective-faulting. The former is simpler, but with one
> main open whether we want to maintain such device IDs in a static
> table in VFIO or rely on some hints from other components (e.g. PF
> driver in VF assignment case). Let's see how Alex thinks about it.
Hi Kevin,
You mentioned selective-faulting some time ago. I still have some doubts
about it:
There is already a vfio_pin_pages() which is used for limiting the IOMMU
group dirty scope to pinned pages; could it also be used to indicate that
the faultable scope is limited to the pinned pages, and that the rest of
the mappings are non-faultable and should be pinned and mapped
immediately? But that seems a little weird and not exactly what you
meant... I would be grateful if you could explain further. :-)
Thanks,
Shenming
> From: Shenming Lu <[email protected]>
> Sent: Thursday, March 18, 2021 3:53 PM
>
> On 2021/2/4 14:52, Tian, Kevin wrote:>>> In reality, many
> >>> devices allow I/O faulting only in selective contexts. However, there
> >>> is no standard way (e.g. PCISIG) for the device to report whether
> >>> arbitrary I/O fault is allowed. Then we may have to maintain device
> >>> specific knowledge in software, e.g. in an opt-in table to list devices
> >>> which allows arbitrary faults. For devices which only support selective
> >>> faulting, a mediator (either through vendor extensions on vfio-pci-core
> >>> or a mdev wrapper) might be necessary to help lock down non-faultable
> >>> mappings and then enable faulting on the rest mappings.
> >>
> >> For devices which only support selective faulting, they could tell it to the
> >> IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?
> >
> > Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
> > selectively page-pinning. The matter is that 'they' imply some device
> > specific logic to decide which pages must be pinned and such knowledge
> > is outside of VFIO.
> >
> > From enabling p.o.v we could possibly do it in phased approach. First
> > handles devices which tolerate arbitrary DMA faults, and then extends
> > to devices with selective-faulting. The former is simpler, but with one
> > main open whether we want to maintain such device IDs in a static
> > table in VFIO or rely on some hints from other components (e.g. PF
> > driver in VF assignment case). Let's see how Alex thinks about it.
>
> Hi Kevin,
>
> You mentioned selective-faulting some time ago. I still have some doubt
> about it:
> There is already a vfio_pin_pages() which is used for limiting the IOMMU
> group dirty scope to pinned pages, could it also be used for indicating
> the faultable scope is limited to the pinned pages and the rest mappings
> is non-faultable that should be pinned and mapped immediately? But it
> seems to be a little weird and not exactly to what you meant... I will
> be grateful if you can help to explain further. :-)
>
The opposite, i.e. the vendor driver uses vfio_pin_pages() to lock down
pages that are not faultable (based on its specific knowledge), and then
the rest of the memory becomes faultable.
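For illustration only, a vendor driver (or mdev wrapper) might do
something like the following for its non-faultable ranges, using the
existing mdev-era vfio_pin_pages() interface (the helper name and the
ring-buffer example are made up):

/* Illustration: lock down a non-faultable range, e.g. a descriptor ring. */
static int lock_down_nonfaultable_range(struct device *dev,
					unsigned long iova_pfn, int npages)
{
	unsigned long *user_pfns, *phys_pfns;
	int i, ret;

	user_pfns = kcalloc(npages, sizeof(*user_pfns), GFP_KERNEL);
	phys_pfns = kcalloc(npages, sizeof(*phys_pfns), GFP_KERNEL);
	if (!user_pfns || !phys_pfns) {
		ret = -ENOMEM;
		goto out;
	}

	for (i = 0; i < npages; i++)
		user_pfns[i] = iova_pfn + i;

	/* Pinned pages stay resident and mapped; everything else may fault. */
	ret = vfio_pin_pages(dev, user_pfns, npages,
			     IOMMU_READ | IOMMU_WRITE, phys_pfns);
	if (ret >= 0)
		ret = (ret == npages) ? 0 : -EFAULT;
out:
	kfree(user_pfns);
	kfree(phys_pfns);
	return ret;
}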
Thanks
Kevin
On 2021/3/18 17:07, Tian, Kevin wrote:
>> From: Shenming Lu <[email protected]>
>> Sent: Thursday, March 18, 2021 3:53 PM
>>
>> On 2021/2/4 14:52, Tian, Kevin wrote:>>> In reality, many
>>>>> devices allow I/O faulting only in selective contexts. However, there
>>>>> is no standard way (e.g. PCISIG) for the device to report whether
>>>>> arbitrary I/O fault is allowed. Then we may have to maintain device
>>>>> specific knowledge in software, e.g. in an opt-in table to list devices
>>>>> which allows arbitrary faults. For devices which only support selective
>>>>> faulting, a mediator (either through vendor extensions on vfio-pci-core
>>>>> or a mdev wrapper) might be necessary to help lock down non-faultable
>>>>> mappings and then enable faulting on the rest mappings.
>>>>
>>>> For devices which only support selective faulting, they could tell it to the
>>>> IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?
>>>
>>> Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
>>> selectively page-pinning. The matter is that 'they' imply some device
>>> specific logic to decide which pages must be pinned and such knowledge
>>> is outside of VFIO.
>>>
>>> From enabling p.o.v we could possibly do it in phased approach. First
>>> handles devices which tolerate arbitrary DMA faults, and then extends
>>> to devices with selective-faulting. The former is simpler, but with one
>>> main open whether we want to maintain such device IDs in a static
>>> table in VFIO or rely on some hints from other components (e.g. PF
>>> driver in VF assignment case). Let's see how Alex thinks about it.
>>
>> Hi Kevin,
>>
>> You mentioned selective-faulting some time ago. I still have some doubt
>> about it:
>> There is already a vfio_pin_pages() which is used for limiting the IOMMU
>> group dirty scope to pinned pages, could it also be used for indicating
>> the faultable scope is limited to the pinned pages and the rest mappings
>> is non-faultable that should be pinned and mapped immediately? But it
>> seems to be a little weird and not exactly to what you meant... I will
>> be grateful if you can help to explain further. :-)
>>
>
> The opposite, i.e. the vendor driver uses vfio_pin_pages to lock down
> pages that are not faultable (based on its specific knowledge) and then
> the rest memory becomes faultable.
Ahh...
Thus, from the perspective of the VFIO IOMMU driver, if IOPF is enabled
for such a device, only the page faults within the pinned range are valid
in the registered iommu fault handler...
I have another question here: for IOMMU-backed devices, everything is
already pinned and mapped when attaching, so is there a need to call
vfio_pin_pages() to lock down pages for them? Did I miss something?...
Thanks,
Shenming
>
> Thanks
> Kevin
>
> From: Shenming Lu <[email protected]>
> Sent: Thursday, March 18, 2021 7:54 PM
>
> On 2021/3/18 17:07, Tian, Kevin wrote:
> >> From: Shenming Lu <[email protected]>
> >> Sent: Thursday, March 18, 2021 3:53 PM
> >>
> >> On 2021/2/4 14:52, Tian, Kevin wrote:
> >>>>> In reality, many
> >>>>> devices allow I/O faulting only in selective contexts. However, there
> >>>>> is no standard way (e.g. PCISIG) for the device to report whether
> >>>>> arbitrary I/O fault is allowed. Then we may have to maintain device
> >>>>> specific knowledge in software, e.g. in an opt-in table to list devices
> >>>>> which allows arbitrary faults. For devices which only support selective
> >>>>> faulting, a mediator (either through vendor extensions on vfio-pci-core
> >>>>> or a mdev wrapper) might be necessary to help lock down non-
> faultable
> >>>>> mappings and then enable faulting on the rest mappings.
> >>>>
> >>>> For devices which only support selective faulting, they could tell it to the
> >>>> IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?
> >>>
> >>> Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
> >>> selectively page-pinning. The matter is that 'they' imply some device
> >>> specific logic to decide which pages must be pinned and such knowledge
> >>> is outside of VFIO.
> >>>
> >>> From enabling p.o.v we could possibly do it in phased approach. First
> >>> handles devices which tolerate arbitrary DMA faults, and then extends
> >>> to devices with selective-faulting. The former is simpler, but with one
> >>> main open whether we want to maintain such device IDs in a static
> >>> table in VFIO or rely on some hints from other components (e.g. PF
> >>> driver in VF assignment case). Let's see how Alex thinks about it.
> >>
> >> Hi Kevin,
> >>
> >> You mentioned selective-faulting some time ago. I still have some doubt
> >> about it:
> >> There is already a vfio_pin_pages() which is used for limiting the IOMMU
> >> group dirty scope to pinned pages, could it also be used for indicating
> >> the faultable scope is limited to the pinned pages and the rest mappings
> >> is non-faultable that should be pinned and mapped immediately? But it
> >> seems to be a little weird and not exactly to what you meant... I will
> >> be grateful if you can help to explain further. :-)
> >>
> >
> > The opposite, i.e. the vendor driver uses vfio_pin_pages to lock down
> > pages that are not faultable (based on its specific knowledge) and then
> > the rest memory becomes faultable.
>
> Ahh...
> Thus, from the perspective of VFIO IOMMU, if IOPF enabled for such device,
> only the page faults within the pinned range are valid in the registered
> iommu fault handler...
> I have another question here, for the IOMMU backed devices, they are
> already
> all pinned and mapped when attaching, is there a need to call
> vfio_pin_pages()
> to lock down pages for them? Did I miss something?...
>
If a device is marked as supporting I/O page fault (fully or selectively),
there should be no pinning at attach or DMA_MAP time (as this series
supposes). Then, for devices with selective-faulting, the vendor driver
will lock down the pages which are not faultable at run-time,
e.g. when intercepting guest registration of a ring buffer...
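Roughly like the below; again just a sketch, not the code in this series.
vfio_map_dma_sketch() and the iopf_capable flag are invented here, while
vfio_link_dma() and vfio_pin_map_dma() are the existing static helpers in
drivers/vfio/vfio_iommu_type1.c.

/*
 * Illustrative split at VFIO_IOMMU_MAP_DMA time: an IOPF-capable device
 * (fully or selectively faulting) gets no up-front pinning; a legacy
 * device keeps the pin-everything behaviour.
 */
static int vfio_map_dma_sketch(struct vfio_iommu *iommu,
			       struct vfio_dma *dma, size_t size,
			       bool iopf_capable)
{
	if (iopf_capable) {
		/*
		 * Bookkeeping only: record the IOVA range so the IOMMU
		 * fault handler can pin and map pages on demand later.
		 */
		vfio_link_dma(iommu, dma);
		return 0;
	}

	/* Legacy behaviour: pin and map the whole range right away. */
	return vfio_pin_map_dma(iommu, dma, size);
}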
Thanks
Kevin
On 2021/3/18 20:32, Tian, Kevin wrote:
>> From: Shenming Lu <[email protected]>
>> Sent: Thursday, March 18, 2021 7:54 PM
>>
>> On 2021/3/18 17:07, Tian, Kevin wrote:
>>>> From: Shenming Lu <[email protected]>
>>>> Sent: Thursday, March 18, 2021 3:53 PM
>>>>
>>>> On 2021/2/4 14:52, Tian, Kevin wrote:
>>>>>>> In reality, many
>>>>>>> devices allow I/O faulting only in selective contexts. However, there
>>>>>>> is no standard way (e.g. PCISIG) for the device to report whether
>>>>>>> arbitrary I/O fault is allowed. Then we may have to maintain device
>>>>>>> specific knowledge in software, e.g. in an opt-in table to list devices
>>>>>>> which allows arbitrary faults. For devices which only support selective
>>>>>>> faulting, a mediator (either through vendor extensions on vfio-pci-core
>>>>>>> or a mdev wrapper) might be necessary to help lock down non-
>> faultable
>>>>>>> mappings and then enable faulting on the rest mappings.
>>>>>>
>>>>>> For devices which only support selective faulting, they could tell it to the
>>>>>> IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?
>>>>>
>>>>> Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
>>>>> selectively page-pinning. The matter is that 'they' imply some device
>>>>> specific logic to decide which pages must be pinned and such knowledge
>>>>> is outside of VFIO.
>>>>>
>>>>> From enabling p.o.v we could possibly do it in phased approach. First
>>>>> handles devices which tolerate arbitrary DMA faults, and then extends
>>>>> to devices with selective-faulting. The former is simpler, but with one
>>>>> main open whether we want to maintain such device IDs in a static
>>>>> table in VFIO or rely on some hints from other components (e.g. PF
>>>>> driver in VF assignment case). Let's see how Alex thinks about it.
>>>>
>>>> Hi Kevin,
>>>>
>>>> You mentioned selective-faulting some time ago. I still have some doubt
>>>> about it:
>>>> There is already a vfio_pin_pages() which is used for limiting the IOMMU
>>>> group dirty scope to pinned pages, could it also be used for indicating
>>>> the faultable scope is limited to the pinned pages and the rest mappings
>>>> is non-faultable that should be pinned and mapped immediately? But it
>>>> seems to be a little weird and not exactly to what you meant... I will
>>>> be grateful if you can help to explain further. :-)
>>>>
>>>
>>> The opposite, i.e. the vendor driver uses vfio_pin_pages to lock down
>>> pages that are not faultable (based on its specific knowledge) and then
>>> the rest memory becomes faultable.
>>
>> Ahh...
>> Thus, from the perspective of VFIO IOMMU, if IOPF enabled for such device,
>> only the page faults within the pinned range are valid in the registered
>> iommu fault handler...
>> I have another question here, for the IOMMU backed devices, they are
>> already
>> all pinned and mapped when attaching, is there a need to call
>> vfio_pin_pages()
>> to lock down pages for them? Did I miss something?...
>>
>
> If a device is marked as supporting I/O page fault (fully or selectively),
> there should be no pinning at attach or DMA_MAP time (suppose as
> this series does). Then for devices with selective-faulting its vendor
> driver will lock down the pages which are not faultable at run-time,
> e.g. when intercepting guest registration of a ring buffer...
Got it. Thanks a lot for this! :-)
Shenming
>
> Thanks
> Kevin
>
On 3/18/21 7:53 PM, Shenming Lu wrote:
> On 2021/3/18 17:07, Tian, Kevin wrote:
>>> From: Shenming Lu <[email protected]>
>>> Sent: Thursday, March 18, 2021 3:53 PM
>>>
>>> On 2021/2/4 14:52, Tian, Kevin wrote:
>>>>>> In reality, many
>>>>>> devices allow I/O faulting only in selective contexts. However, there
>>>>>> is no standard way (e.g. PCISIG) for the device to report whether
>>>>>> arbitrary I/O fault is allowed. Then we may have to maintain device
>>>>>> specific knowledge in software, e.g. in an opt-in table to list devices
>>>>>> which allows arbitrary faults. For devices which only support selective
>>>>>> faulting, a mediator (either through vendor extensions on vfio-pci-core
>>>>>> or a mdev wrapper) might be necessary to help lock down non-faultable
>>>>>> mappings and then enable faulting on the rest mappings.
>>>>>
>>>>> For devices which only support selective faulting, they could tell it to the
>>>>> IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?
>>>>
>>>> Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
>>>> selectively page-pinning. The matter is that 'they' imply some device
>>>> specific logic to decide which pages must be pinned and such knowledge
>>>> is outside of VFIO.
>>>>
>>>> From enabling p.o.v we could possibly do it in phased approach. First
>>>> handles devices which tolerate arbitrary DMA faults, and then extends
>>>> to devices with selective-faulting. The former is simpler, but with one
>>>> main open whether we want to maintain such device IDs in a static
>>>> table in VFIO or rely on some hints from other components (e.g. PF
>>>> driver in VF assignment case). Let's see how Alex thinks about it.
>>>
>>> Hi Kevin,
>>>
>>> You mentioned selective-faulting some time ago. I still have some doubt
>>> about it:
>>> There is already a vfio_pin_pages() which is used for limiting the IOMMU
>>> group dirty scope to pinned pages, could it also be used for indicating
>>> the faultable scope is limited to the pinned pages and the rest mappings
>>> is non-faultable that should be pinned and mapped immediately? But it
>>> seems to be a little weird and not exactly to what you meant... I will
>>> be grateful if you can help to explain further. :-)
>>>
>>
>> The opposite, i.e. the vendor driver uses vfio_pin_pages to lock down
>> pages that are not faultable (based on its specific knowledge) and then
>> the rest memory becomes faultable.
>
> Ahh...
> Thus, from the perspective of VFIO IOMMU, if IOPF enabled for such device,
> only the page faults within the pinned range are valid in the registered
> iommu fault handler...
Isn't it the opposite? The pinned pages will never generate any page faults.
I might be missing some context here.
> I have another question here, for the IOMMU backed devices, they are already
> all pinned and mapped when attaching, is there a need to call vfio_pin_pages()
> to lock down pages for them? Did I miss something?...
Best regards,
baolu
Hi Baolu,
On 2021/3/19 8:33, Lu Baolu wrote:
> On 3/18/21 7:53 PM, Shenming Lu wrote:
>> On 2021/3/18 17:07, Tian, Kevin wrote:
>>>> From: Shenming Lu <[email protected]>
>>>> Sent: Thursday, March 18, 2021 3:53 PM
>>>>
>>>> On 2021/2/4 14:52, Tian, Kevin wrote:
>>>>>>> In reality, many
>>>>>>> devices allow I/O faulting only in selective contexts. However, there
>>>>>>> is no standard way (e.g. PCISIG) for the device to report whether
>>>>>>> arbitrary I/O fault is allowed. Then we may have to maintain device
>>>>>>> specific knowledge in software, e.g. in an opt-in table to list devices
>>>>>>> which allows arbitrary faults. For devices which only support selective
>>>>>>> faulting, a mediator (either through vendor extensions on vfio-pci-core
>>>>>>> or a mdev wrapper) might be necessary to help lock down non-faultable
>>>>>>> mappings and then enable faulting on the rest mappings.
>>>>>>
>>>>>> For devices which only support selective faulting, they could tell it to the
>>>>>> IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?
>>>>>
>>>>> Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
>>>>> selectively page-pinning. The matter is that 'they' imply some device
>>>>> specific logic to decide which pages must be pinned and such knowledge
>>>>> is outside of VFIO.
>>>>>
>>>>> From enabling p.o.v we could possibly do it in phased approach. First
>>>>> handles devices which tolerate arbitrary DMA faults, and then extends
>>>>> to devices with selective-faulting. The former is simpler, but with one
>>>>> main open whether we want to maintain such device IDs in a static
>>>>> table in VFIO or rely on some hints from other components (e.g. PF
>>>>> driver in VF assignment case). Let's see how Alex thinks about it.
>>>>
>>>> Hi Kevin,
>>>>
>>>> You mentioned selective-faulting some time ago. I still have some doubt
>>>> about it:
>>>> There is already a vfio_pin_pages() which is used for limiting the IOMMU
>>>> group dirty scope to pinned pages, could it also be used for indicating
>>>> the faultable scope is limited to the pinned pages and the rest mappings
>>>> is non-faultable that should be pinned and mapped immediately? But it
>>>> seems to be a little weird and not exactly to what you meant... I will
>>>> be grateful if you can help to explain further. :-)
>>>>
>>>
>>> The opposite, i.e. the vendor driver uses vfio_pin_pages to lock down
>>> pages that are not faultable (based on its specific knowledge) and then
>>> the rest memory becomes faultable.
>>
>> Ahh...
>> Thus, from the perspective of VFIO IOMMU, if IOPF enabled for such device,
>> only the page faults within the pinned range are valid in the registered
>> iommu fault handler...
>
> Isn't it opposite? The pinned pages will never generate any page faults.
> I might miss some contexts here.
It seems that vfio_pin_pages() just pins some pages and records the pinned
scope in the pfn_list of the vfio_dma. No mapping is established, so we can
still get page faults (see the sketch below).
IIUC, vfio_pin_pages() is used to
1. pin pages for non-IOMMU-backed devices.
2. mark the dirty scope for both non-IOMMU-backed and IOMMU-backed devices.
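For reference, the per-page record kept by the type1 backend looks roughly
like the below (paraphrased from drivers/vfio/vfio_iommu_type1.c around
this time; please check the source for the exact definition).

/*
 * Roughly what vfio_pin_pages() records for each externally pinned page:
 * it resolves and pins the user page and inserts one of these into the
 * owning vfio_dma's pfn_list rb-tree.  It does not call iommu_map(), so
 * an IOPF-capable device can still fault on a pinned-but-unmapped page.
 */
struct vfio_pfn {
	struct rb_node	node;		/* linked into vfio_dma->pfn_list */
	dma_addr_t	iova;		/* device (IOVA) address */
	unsigned long	pfn;		/* host page frame number */
	unsigned int	ref_count;	/* times this page has been pinned */
};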
Thanks,
Keqian
>
>> I have another question here, for the IOMMU backed devices, they are already
>> all pinned and mapped when attaching, is there a need to call vfio_pin_pages()
>> to lock down pages for them? Did I miss something?...
>
> Best regards,
> baolu
> .
>
On 3/19/21 9:30 AM, Keqian Zhu wrote:
> Hi Baolu,
>
> On 2021/3/19 8:33, Lu Baolu wrote:
>> On 3/18/21 7:53 PM, Shenming Lu wrote:
>>> On 2021/3/18 17:07, Tian, Kevin wrote:
>>>>> From: Shenming Lu<[email protected]>
>>>>> Sent: Thursday, March 18, 2021 3:53 PM
>>>>>
>>>>> On 2021/2/4 14:52, Tian, Kevin wrote:
>>>>>>>> In reality, many
>>>>>>>> devices allow I/O faulting only in selective contexts. However, there
>>>>>>>> is no standard way (e.g. PCISIG) for the device to report whether
>>>>>>>> arbitrary I/O fault is allowed. Then we may have to maintain device
>>>>>>>> specific knowledge in software, e.g. in an opt-in table to list devices
>>>>>>>> which allows arbitrary faults. For devices which only support selective
>>>>>>>> faulting, a mediator (either through vendor extensions on vfio-pci-core
>>>>>>>> or a mdev wrapper) might be necessary to help lock down non-faultable
>>>>>>>> mappings and then enable faulting on the rest mappings.
>>>>>>> For devices which only support selective faulting, they could tell it to the
>>>>>>> IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?
>>>>>> Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
>>>>>> selectively page-pinning. The matter is that 'they' imply some device
>>>>>> specific logic to decide which pages must be pinned and such knowledge
>>>>>> is outside of VFIO.
>>>>>>
>>>>>> From enabling p.o.v we could possibly do it in phased approach. First
>>>>>> handles devices which tolerate arbitrary DMA faults, and then extends
>>>>>> to devices with selective-faulting. The former is simpler, but with one
>>>>>> main open whether we want to maintain such device IDs in a static
>>>>>> table in VFIO or rely on some hints from other components (e.g. PF
>>>>>> driver in VF assignment case). Let's see how Alex thinks about it.
>>>>> Hi Kevin,
>>>>>
>>>>> You mentioned selective-faulting some time ago. I still have some doubt
>>>>> about it:
>>>>> There is already a vfio_pin_pages() which is used for limiting the IOMMU
>>>>> group dirty scope to pinned pages, could it also be used for indicating
>>>>> the faultable scope is limited to the pinned pages and the rest mappings
>>>>> is non-faultable that should be pinned and mapped immediately? But it
>>>>> seems to be a little weird and not exactly to what you meant... I will
>>>>> be grateful if you can help to explain further.:-)
>>>>>
>>>> The opposite, i.e. the vendor driver uses vfio_pin_pages to lock down
>>>> pages that are not faultable (based on its specific knowledge) and then
>>>> the rest memory becomes faultable.
>>> Ahh...
>>> Thus, from the perspective of VFIO IOMMU, if IOPF enabled for such device,
>>> only the page faults within the pinned range are valid in the registered
>>> iommu fault handler...
>> Isn't it opposite? The pinned pages will never generate any page faults.
>> I might miss some contexts here.
> It seems that vfio_pin_pages() just pin some pages and record the pinned scope to pfn_list of vfio_dma.
> No mapping is established, so we still has page faults.
Makes sense. Thanks a lot for the explanation.
>
> IIUC, vfio_pin_pages() is used to
> 1. pin pages for non-iommu backed devices.
> 2. mark dirty scope for non-iommu backed devices and iommu backed devices.
Best regards,
baolu