2022-02-26 01:46:54

by Joao Martins

Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

On 2/23/22 01:03, Jason Gunthorpe wrote:
> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>> If by conclusion you mean the whole thing to be merged, how can the work be
>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>>>> in terms of direction...
>>>
>>> I think go ahead and build it on top of iommufd, start working out the
>>> API details, etc. I think once the direction is concluded the new APIs
>>> will go forward.
>>>
>> /me nods, will do. Looking at your repository it is looking good.
>
> I would like to come with some plan for dirty tracking on iommufd and
> combine that with a plan for dirty tracking inside the new migration
> drivers.
>
I had a few things going on on my end over the past weeks, but it is
getting a bit better now and I will be coming back to this topic. I hope/want
to give you a more concrete update/feedback over the coming week or two wrt
dirty-tracking+iommufd+amd.

So far, I am not particularly concerned that this will affect the overall iommufd
design. The main thing is really lookups to get the vendor IOPTE, on top of what might
be an iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling the tracking,
I'll be simplifying the interface in the other type1 series I had and making it
a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
enable/disable it over a domain. Perhaps the same trick could be expanded to other
features to have a bit more control on what userspace is allowed to use. I think
this just needs to set/clear a feature bit in the domain, for VFIO or userspace
to have full control of dirty tracking during the different stages of migration.
In all of the IOMMU implementations/manuals I read, it means setting a protection
domain descriptor flag: on AMD it is a 2-bit field in the DTE; on Intel likewise, but
in the PASID table and only for scalable-mode PTEs; on SMMUv3.2 there's an equivalent
(albeit past work also had it always-on).
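
In sketch form, the two primitives I have in mind would be roughly as below
(names and signatures are purely illustrative, nothing is settled):

/*
 * The toggle sets/clears the protection domain descriptor flag (DTE bits on
 * AMD, scalable-mode PASID-table bits on Intel, CD.HD/STE.S2HD on SMMUv3.2);
 * the sync walks the IOPTEs and marshals the dirty bits into @bitmap.
 */
int iommu_set_feature(struct iommu_domain *domain, unsigned int feature,
		      bool enable);
int iommu_sync_dirty_bitmap(struct iommu_domain *domain, unsigned long iova,
			    size_t size, unsigned long *bitmap);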

Provided the iommufd /separately/ allows finer granularity on what we can
do with page tables. Thus the VMM can demote/promote the IOPTEs to a lower/higher
page size at will, as separate operations, before and after migration respectively.
That logic would probably be better placed in separate iommufd ioctls(), as it's
going to be expensive.


>> I, too, have been wondering what that is going to look like -- and how do we
>> convey the setup of dirty tracking versus the steering of it.
>
> What I suggested was to just split them.
>
> Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
> would be on a per-domain basis, not on the ioas.
>
> Some ioctl toward the vfio device will turn on the device's tracker.
>
In the activation/fetching-data of either tracker I see some things in common in
terms of UAPI, the difference being whether a device or a list of devices is passed
as an argument to the existing dirty-track vfio ioctls(). (At least that's how I am reading
your suggestion)

Albeit perhaps the main difference is going to be that one needs to set up the
hardware interface with the device tracker, and how we carry the regions of memory that
we want tracked i.e. the GPA/IOVA ranges that the device should track. The tracking-GPA
space is not linear GPA space sadly. But at the same time perhaps the internal VFIO API
between core-VFIO and vendor-VFIO is just reading the @dma ranges we mapped.

In the IOMMU this is sort of cheap and 'stateless', but setting up the
device tracker might mean giving all the IOVA ranges to the PF (once?).
Perhaps leave it to the vendor driver to pick when to register the IOVA space to
be tracked, or perhaps do it when you turn on the device's tracker. But in all cases,
the driver needs some form of awareness of those ranges and to convey them to the PF
for tracking purposes.

> Userspace has to iterate and query each enabled tracker. This is not
> so bad because the time to make the system call is going to be tiny
> compared to the time to marshal XXGB of dirty bits.
>
Yeap. The high cost is scanning vendor-iommu IOPTEs and marshaling them into a bitmap,
followed by a smaller cost of copying back to userspace (which KVM does too, when it's using
a bitmap, same as VFIO today). Maybe this could be optimized to either avoid the copy
(gup as you mentioned earlier in the thread), or to copy only based on the number of
leading zeroes in the input bitmap (from the PF), within some threshold.

> This makes qemu more complicated because it has to decide what
> trackers to turn on, but that is also the point because we do want
> userspace to be able to decide.
>
If the interface is extended to pass a device or an array of devices (if I understood
you correctly), it would free/simplify VFIO from having to concatenate potentially
different devices' bitmaps into one. Albeit it would require optimizing the marshalling a bit
more to avoid performing too much copying.

> The other idea that has some possible interest is to allow the
> trackers to dump their dirty bits into the existing kvm tracker, then
> userspace just does a single kvm centric dirty pass.

That would probably limit some of the more modern options like ring-based dirty tracking,
as the KVM dirty bitmap is mutually exclusive with the KVM dirty ring. Or at least, it
would require KVM to always use a bitmap during migration/dirty-rate-estimation in
the presence of vfio/vdpa devices. It's a nice idea, though. It would require making
core-iommu aware of the bitmap as external storage for tracking (that is, not iommufd,
as it's a module).

Although IOMMUs sharing page tables with the CPU would probably align better with
reusing KVM's existing dirty bitmap interface. But I haven't given much thought to that one.
I just remember reading something in the IOMMU manual about this (ISTR that iommu_v2 was
made for that purpose).


2022-02-26 02:06:45

by Jason Gunthorpe

Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
> On 2/23/22 01:03, Jason Gunthorpe wrote:
> > On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
> >>>> If by conclusion you mean the whole thing to be merged, how can the work be
> >>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
> >>>> in terms of direction...
> >>>
> >>> I think go ahead and build it on top of iommufd, start working out the
> >>> API details, etc. I think once the direction is concluded the new APIs
> >>> will go forward.
> >>>
> >> /me nods, will do. Looking at your repository it is looking good.
> >
> > I would like to come with some plan for dirty tracking on iommufd and
> > combine that with a plan for dirty tracking inside the new migration
> > drivers.
> >
> I had a few things going on my end over the past weeks, albeit it is
> getting a bit better now and I will be coming back to this topic. I hope/want
> to give you a more concrete update/feedback over the coming week or two wrt
> to dirty-tracking+iommufd+amd.
>
> So far, I am not particularly concerned that this will affect overall iommufd
> design. The main thing is really lookups to get vendor iopte, upon on what might
> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling
> the tracking,

I'm not very keen on these multiplexer interfaces. I think you should
just add a new ops to the new iommu_domain_ops 'set_dirty_tracking'
'read_dirty_bits'

NULL op means not supported.

IMHO we don't need a kapi wrapper if only iommufd is going to call the
op.
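
i.e. something along these lines (illustrative only; the existing members of
the struct are elided):

struct iommu_domain_ops {
	...
	/* NULL means dirty tracking is not supported by this driver */
	int (*set_dirty_tracking)(struct iommu_domain *domain, bool enable);
	int (*read_dirty_bits)(struct iommu_domain *domain, unsigned long iova,
			       size_t size, unsigned long *bitmap);
};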

> I'll be simplifying the interface in the other type1 series I had and making it
> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
> enable/disable over a domain. Perhaps same trick could be expanded to other
> features to have a bit more control on what userspace is allowed to use. I think
> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
> to have full control during the different stages of migration of dirty tracking.
> In all of the IOMMU implementations/manuals I read it means setting a protection
> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
> (albeit past work had also it always-on).
>
> Provided the iommufd does /separately/ more finer granularity on what we can
> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
> at will as separate operations, before and after migration respectivally. That logic
> would probably be better to be in separate iommufd ioctls(), as that it's going to be
> expensive.

This all sounds right to me

Questions I have:
- Do we need ranges for some reason? You mentioned ARM SMMU wants
ranges? how/what/why?

- What about the unmap and read dirty without races operation that
vfio has?

> >> I, too, have been wondering what that is going to look like -- and how do we
> >> convey the setup of dirty tracking versus the steering of it.
> >
> > What I suggested was to just split them.
> >
> > Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
> > would be on a per-domain basis, not on the ioas.
> >
> > Some ioctl toward the vfio device will turn on the device's tracker.
> >
> In the activation/fetching-data of either trackers I see some things in common in
> terms of UAPI with the difference that whether a device or a list of devices are passed on
> as an argument of exiting dirty-track vfio ioctls(). (At least that's how I am reading
> your suggestion)

I was thinking a VFIO_DEVICE ioctl located on the device FD
implemented in the end VFIO driver (like mlx5_vfio). No lists..

As you say the driver should just take in the request to set dirty
tracking and take care of it somehow. There is no value the core VFIO
code can add here.
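
Roughly something like the below on the device FD (struct layout, flag names
and ioctl number are made up here, just to show the shape); the variant driver
(e.g. mlx5 vfio) implements it however its hardware needs:

struct vfio_device_dirty_tracking {
	__u32 argsz;
	__u32 flags;
#define VFIO_DEVICE_DIRTY_TRACKING_START	(1 << 0)
#define VFIO_DEVICE_DIRTY_TRACKING_STOP		(1 << 1)
};
#define VFIO_DEVICE_DIRTY_TRACKING	_IO(VFIO_TYPE, VFIO_BASE + 21) /* number TBD */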

> Albeit perhaps the main difference is going to be that one needs to
> setup with hardware interface with the device tracker and how we
> carry the regions of memory that want to be tracked i.e. GPA/IOVA
> ranges that the device should track. The tracking-GPA space is not
> linear GPA space sadly. But at the same point perhaps the internal
> VFIO API between core-VFIO and vendor-VFIO is just reading the @dma
> ranges we mapped.

Yes, this is a point that needs some answering. One option is to pass
in the tracking range list from userspace. Another is to query it in
the driver from the currently mapped areas in IOAS.

I know devices have limitations here in terms of how many/how big the
ranges can be, and devices probably can't track dynamic changes.

> In IOMMU this is sort of cheap and 'stateless', but on the setup of the
> device tracker might mean giving all the IOVA ranges to the PF (once?).
> Perhaps leaving to the vendor driver to pick when to register the IOVA space to
> be tracked, or perhaps when you turn on the device's tracker. But on all cases,
> the driver needs some form of awareness of and convey that to the PF for
> tracking purposes.

Yes, this is right

> Yeap. The high cost is scanning vendor-iommu ioptes and marshaling to a bitmap,
> following by a smaller cost copying back to userspace (which KVM does too, when it's using
> a bitmap, same as VFIO today). Maybe could be optimized to either avoid the copy
> (gup as you mentioned earlier in the thread), or just copying based on the input bitmap
> (from PF) number of leading zeroes within some threshold.

What I would probably strive for is an API that deliberately OR's in
the dirty bits. So GUP and kmap a 4k page then call the driver to 'or
in your dirty data', then do the next page. etc. That is 134M of IOVA
per chunk which seems OK.
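
Roughly this flow (pin_user_bitmap_page() and ->or_in_dirty() are invented
names just to illustrate; one 4K bitmap page covers 32768 4K IOVA pages,
hence the ~134M per chunk above):

#define DIRTY_CHUNK_IOVA	(PAGE_SIZE * BITS_PER_BYTE * PAGE_SIZE)

static int or_in_dirty_chunks(struct iommu_domain *domain,
			      void __user *user_bitmap,
			      unsigned long iova, size_t size)
{
	unsigned long cur, end = iova + size;
	int ret = 0;

	for (cur = iova; cur < end; cur += DIRTY_CHUNK_IOVA) {
		/* gup the 4K of the user bitmap that covers this IOVA chunk */
		struct page *page = pin_user_bitmap_page(user_bitmap, cur - iova);
		void *kaddr = kmap(page);

		/* the driver ORs its dirty bits for this chunk into the page */
		ret = domain->ops->or_in_dirty(domain, cur,
					       min_t(unsigned long,
						     DIRTY_CHUNK_IOVA, end - cur),
					       kaddr);
		kunmap(page);
		unpin_user_page(page);
		if (ret)
			break;
	}
	return ret;
}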

> > This makes qemu more complicated because it has to decide what
> > trackers to turn on, but that is also the point because we do want
> > userspace to be able to decide.
> >
> If the interface wants extending to pass a device or an array of devices (if I understood
> you correctly), it would free/simplify VFIO from having to concatenate potentially
> different devices bitmaps into one. Albeit would require optimizing the marshalling a bit
> more to avoid performing too much copying.

Yes. Currently VFIO maintains its own bitmap so it also saves that
memory by keeping the dirty bits stored in the IOPTEs until read out.

> > The other idea that has some possible interest is to allow the
> > trackers to dump their dirty bits into the existing kvm tracker, then
> > userspace just does a single kvm centric dirty pass.
>
> That would probably limit certain more modern options of ring based dirty tracking,
> as that kvm dirty bitmap is mutually exclusive with kvm dirty ring. Or at least,
> would require KVM to always use a bitmap during migration/dirty-rate-estimation with
> the presence of vfio/vdpa devices. It's a nice idea, though. It would require making
> core-iommu aware of bitmap as external storage for tracking (that is not iommufd as
> it's a module).

Yes, I don't know enough about kvm to say if that is a great idea or
not. The fact the CPUs seem to be going to logging instead of bitmaps
suggests it isn't. I don't think DMA devices can work effectively with
logging..

Jason

2022-02-28 17:39:36

by Joao Martins

Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

On 2/25/22 20:44, Jason Gunthorpe wrote:
> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>>> If by conclusion you mean the whole thing to be merged, how can the work be
>>>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>>>>>> in terms of direction...
>>>>>
>>>>> I think go ahead and build it on top of iommufd, start working out the
>>>>> API details, etc. I think once the direction is concluded the new APIs
>>>>> will go forward.
>>>>>
>>>> /me nods, will do. Looking at your repository it is looking good.
>>>
>>> I would like to come with some plan for dirty tracking on iommufd and
>>> combine that with a plan for dirty tracking inside the new migration
>>> drivers.
>>>
>> I had a few things going on my end over the past weeks, albeit it is
>> getting a bit better now and I will be coming back to this topic. I hope/want
>> to give you a more concrete update/feedback over the coming week or two wrt
>> to dirty-tracking+iommufd+amd.
>>
>> So far, I am not particularly concerned that this will affect overall iommufd
>> design. The main thing is really lookups to get vendor iopte, upon on what might
>> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling
>> the tracking,
>
> I'm not very keen on these multiplexer interfaces. I think you should
> just add a new ops to the new iommu_domain_ops 'set_dirty_tracking'
> 'read_dirty_bits'
>
> NULL op means not supported.
>
> IMHO we don't need a kapi wrapper if only iommufd is going to call the
> op.
>

OK, good point.

Even without a kapi wrapper I am still wondering whether the iommu op needs to
be something like a generic iommu feature toggle (e.g. .set_feature()), rather
than one that sits "hardcoded" as set_dirty(). Unless dirty is about the only
feature we will change out-of-band in the protection domain right now.
I guess I can stay with set_ad_tracking/set_dirty_tracking and, should the need
arise, we can expand to a generic .set_feature(dom, IOMMU_DIRTY | IOMMU_ACCESS).

Regarding the dirty 'data', that's one I am wondering about. I called it 'sync'
because the sync() doesn't only read, but also "writes" to the page table to update
the dirty bit (followed by an IOTLB flush). And that's about what VFIO's current interface
does (i.e. there's only a GET_BITMAP ioctl) with no explicit interface to clear.

And TBH, whether we need a separate clear op isn't immediately obvious: reading
the access/dirty bit is cheap for the IOMMU; the problem OTOH is the IO page table
walk, which is expensive in software. The clear-dirty part, though, is precise on what
it wants to clear (in principle cheaper on the IO page table walk, as you just iterate over
the sets of bits to clear), but then it incurs a DMA perf hit given that we need to flush the
IOTLBs. Given the IOTLB flush is batched (over the course of the dirty updates), it isn't
immediately clear that this is a problem in terms of total overall ioctl cost. Hence my
thinking of merging both into one sync_dirty_bitmap() as opposed to the more KVM-style
get_dirty_bitmap() and clear_dirty_bitmap().
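
i.e. roughly a single op of this shape (illustrative), where the caller batches
the IOTLB flush via the gather:

int (*sync_dirty_bitmap)(struct iommu_domain *domain, unsigned long iova,
			 size_t size, unsigned long *bitmap,
			 struct iommu_iotlb_gather *gather);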

Hopefully future IOMMUs would just get a new DTE/CD/etc field for stashing a set of
PFNs (for a bitmap) that the IOMMU would use to set the dirty bits there. That is,
rather than forcing software to split/merge page tables for better granularity and having
to flush IOTLBs for the dirty bits to be written again :(

>> I'll be simplifying the interface in the other type1 series I had and making it
>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
>> enable/disable over a domain. Perhaps same trick could be expanded to other
>> features to have a bit more control on what userspace is allowed to use. I think
>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
>> to have full control during the different stages of migration of dirty tracking.
>> In all of the IOMMU implementations/manuals I read it means setting a protection
>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
>> (albeit past work had also it always-on).
>>
>> Provided the iommufd does /separately/ more finer granularity on what we can
>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
>> at will as separate operations, before and after migration respectivally. That logic
>> would probably be better to be in separate iommufd ioctls(), as that it's going to be
>> expensive.
>
> This all sounds right to me
>
> Questions I have:
> - Do we need ranges for some reason? You mentioned ARM SMMU wants
> ranges? how/what/why?
>
Ignore that. I got misled by the implementation, and when I read the spec
I realized that the implementation was doing the same thing I was doing,
i.e. enabling the dirty bit in the protection domain at start rather than
dynamically toggling it. So ARM is similar to Intel/AMD in that you set the CD.HD
bit in the context descriptor to enable dirty bits, or STE.S2HD in the
stream table entry for the stage-2 equivalent. Nothing here is on a per-range
basis. And the ranges were only used by the implementation for the eager
splitting/merging of IO page table levels.

> - What about the unmap and read dirty without races operation that
> vfio has?
>
I am afraid that might need a new unmap iommu op that reads the dirty bit
after clearing the page table entry. It's marshalling the bits from the
IOPTEs into a bitmap, as opposed to some logic added on top. As far as I
looked, for AMD this isn't difficult to add (same for Intel); it can
*I think* reuse all of the unmap code.
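
Roughly an op of this shape (name made up): the same walk as unmap, but each
IOPTE's dirty bit is transferred into @bitmap before the entry is cleared:

size_t (*unmap_read_dirty)(struct iommu_domain *domain, unsigned long iova,
			   size_t size, struct iommu_iotlb_gather *gather,
			   unsigned long *bitmap);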

>>>> I, too, have been wondering what that is going to look like -- and how do we
>>>> convey the setup of dirty tracking versus the steering of it.
>>>
>>> What I suggested was to just split them.
>>>
>>> Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
>>> would be on a per-domain basis, not on the ioas.
>>>
>>> Some ioctl toward the vfio device will turn on the device's tracker.
>>>
>> In the activation/fetching-data of either trackers I see some things in common in
>> terms of UAPI with the difference that whether a device or a list of devices are passed on
>> as an argument of exiting dirty-track vfio ioctls(). (At least that's how I am reading
>> your suggestion)
>
> I was thinking a VFIO_DEVICE ioctl located on the device FD
> implemented in the end VFIO driver (like mlx5_vfio). No lists..
>
Interesting. I was assuming that we wanted to keep the same ioctl()
interface for dirty-tracking, hence me mentioning the device/device-list on
top of the DIRTY ioctl. But on second thought it doesn't make sense,
given that we query a container and we want to move away from vfio for
iopgtbl-related work and instead have the VMM access iommufd directly. So
placing more dependency on that ioctl wouldn't align with that. I suppose it
makes sense that this is on a per-vfio-device basis, hence a vfio-device ioctl().

> As you say the driver should just take in the request to set dirty
> tracking and take core of it somehow. There is no value the core VFIO
> code can add here.
>
>> Albeit perhaps the main difference is going to be that one needs to
>> setup with hardware interface with the device tracker and how we
>> carry the regions of memory that want to be tracked i.e. GPA/IOVA
>> ranges that the device should track. The tracking-GPA space is not
>> linear GPA space sadly. But at the same point perhaps the internal
>> VFIO API between core-VFIO and vendor-VFIO is just reading the @dma
>> ranges we mapped.
>
> Yes, this is a point that needs some answering. One option is to pass
> in the tracking range list from userspace. Another is to query it in
> the driver from the currently mapped areas in IOAS.
>
I sort of like the second, given that we de-duplicate info that is already
tracked by the IOAS -- it would be mostly internal, and then it would be a
matter of when this device tracker gets set up, and whether we should
differentiate tracker "start"/"stop" vs "setup"/"teardown".

> I know devices have limitations here in terms of how many/how big the
> ranges can be, and devices probably can't track dynamic changes.
>
/me nods

Should this be some sort of capability perhaps? Given that this may limit
how many concurrent VFs can be migrated and how many ranges it can store,
for example.

(interestingly and speaking of VF capabilities, it's yet another thing to
tackle in the migration stream between src and dst hosts)

>> In IOMMU this is sort of cheap and 'stateless', but on the setup of the
>> device tracker might mean giving all the IOVA ranges to the PF (once?).
>> Perhaps leaving to the vendor driver to pick when to register the IOVA space to
>> be tracked, or perhaps when you turn on the device's tracker. But on all cases,
>> the driver needs some form of awareness of and convey that to the PF for
>> tracking purposes.
>
> Yes, this is right
>
>> Yeap. The high cost is scanning vendor-iommu ioptes and marshaling to a bitmap,
>> following by a smaller cost copying back to userspace (which KVM does too, when it's using
>> a bitmap, same as VFIO today). Maybe could be optimized to either avoid the copy
>> (gup as you mentioned earlier in the thread), or just copying based on the input bitmap
>> (from PF) number of leading zeroes within some threshold.
>
> What I would probably strive for is an API that deliberately OR's in
> the dirty bits. So GUP and kmap a 4k page then call the driver to 'or
> in your dirty data', then do the next page. etc. That is 134M of IOVA
> per chunk which seems OK.
>
Yeap, sort of what I was aiming for.

>>> This makes qemu more complicated because it has to decide what
>>> trackers to turn on, but that is also the point because we do want
>>> userspace to be able to decide.
>>>
>> If the interface wants extending to pass a device or an array of devices (if I understood
>> you correctly), it would free/simplify VFIO from having to concatenate potentially
>> different devices bitmaps into one. Albeit would require optimizing the marshalling a bit
>> more to avoid performing too much copying.
>
> Yes. Currently VFIO maintains its own bitmap so it also saves that
> memory by keeping the dirty bits stored in the IOPTEs until read out.
>
/me nods with the iommu, yes.

2022-03-11 22:03:23

by Joao Martins

Subject: iommufd(+vfio-compat) dirty tracking (Was: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver)

On 2/28/22 13:01, Joao Martins wrote:
> On 2/25/22 20:44, Jason Gunthorpe wrote:
>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>> I'll be simplifying the interface in the other type1 series I had and making it
>>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
>>> enable/disable over a domain. Perhaps same trick could be expanded to other
>>> features to have a bit more control on what userspace is allowed to use. I think
>>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
>>> to have full control during the different stages of migration of dirty tracking.
>>> In all of the IOMMU implementations/manuals I read it means setting a protection
>>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
>>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
>>> (albeit past work had also it always-on).
>>>
>>> Provided the iommufd does /separately/ more finer granularity on what we can
>>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
>>> at will as separate operations, before and after migration respectivally. That logic
>>> would probably be better to be in separate iommufd ioctls(), as that it's going to be
>>> expensive.
>>
>> This all sounds right to me
>>
>> Questions I have:
>> - Do we need ranges for some reason? You mentioned ARM SMMU wants
>> ranges? how/what/why?
>>
> Ignore that. I got mislead by the implementation and when I read the SDM
> I realized that the implementation was doing the same thing I was doing
> i.e. enabling dirty-bit in the protection domain at start rather than
> dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD
> bit in the context descriptor to enable dirty bits or the STE.S2HD in the
> stream table entry for the stage2 equivalent. Nothing here is per-range
> basis. And the ranges was only used by the implementation for the eager
> splitting/merging of IO page table levels.
>
>> - What about the unmap and read dirty without races operation that
>> vfio has?
>>
> I am afraid that might need a new unmap iommu op that reads the dirty bit
> after clearing the page table entry. It's marshalling the bits from
> iopte into a bitmap as opposed to some logic added on top. As far as I
> looked for AMD this isn't difficult to add, (same for Intel) it can
> *I think* reuse all of the unmap code.
>

OK, made some progress.

It's a WIP (here be dragons!) and still missing things e.g. iommufd selftests,
revising locking, bugs, and more -- works with my emulated qemu patches which
is a good sign. But hopefully starts some sort of skeleton of what we were
talking about in the above thread.

The bigger TODO, though, is the equivalent UAPI for IOMMUFD; I started with
the vfio-compat one as it was easier given there's existing userspace to work
with (Qemu). To be fair the API is not that "far" from what IOMMUFD would need for
steering the dirty tracking, reading-and-clearing the dirty bits, and unmap-and-get-dirty.
But as we discussed earlier, the one gap of the VFIO dirty API is that it lacks controls
for upgrading/downgrading area/IOPTE sizes, which is where IOMMUFD would most
likely shine. When that latter part is done we can probably adopt an eager-split
approach inside vfio-compat.

Additionally I also sort of want to skeleton ARM and Intel to see how it looks.
Some of the commits made notes of some of the research I did, so *I think* the APIs
introduced capture the h/w semantics for all three IOMMUs supporting dirty
tracking.

I am storing my dirty-tracking bits here:

https://github.com/jpemartins/linux iommufd

It's a version of it rebased on top of your iommufd branch.

2022-03-16 03:18:12

by Jason Gunthorpe

Subject: Re: iommufd(+vfio-compat) dirty tracking (Was: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver)

On Fri, Mar 11, 2022 at 01:51:32PM +0000, Joao Martins wrote:
> On 2/28/22 13:01, Joao Martins wrote:
> > On 2/25/22 20:44, Jason Gunthorpe wrote:
> >> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
> >>> On 2/23/22 01:03, Jason Gunthorpe wrote:
> >>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
> >>> I'll be simplifying the interface in the other type1 series I had and making it
> >>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
> >>> enable/disable over a domain. Perhaps same trick could be expanded to other
> >>> features to have a bit more control on what userspace is allowed to use. I think
> >>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
> >>> to have full control during the different stages of migration of dirty tracking.
> >>> In all of the IOMMU implementations/manuals I read it means setting a protection
> >>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
> >>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
> >>> (albeit past work had also it always-on).
> >>>
> >>> Provided the iommufd does /separately/ more finer granularity on what we can
> >>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
> >>> at will as separate operations, before and after migration respectivally. That logic
> >>> would probably be better to be in separate iommufd ioctls(), as that it's going to be
> >>> expensive.
> >>
> >> This all sounds right to me
> >>
> >> Questions I have:
> >> - Do we need ranges for some reason? You mentioned ARM SMMU wants
> >> ranges? how/what/why?
> >>
> > Ignore that. I got mislead by the implementation and when I read the SDM
> > I realized that the implementation was doing the same thing I was doing
> > i.e. enabling dirty-bit in the protection domain at start rather than
> > dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD
> > bit in the context descriptor to enable dirty bits or the STE.S2HD in the
> > stream table entry for the stage2 equivalent. Nothing here is per-range
> > basis. And the ranges was only used by the implementation for the eager
> > splitting/merging of IO page table levels.
> >
> >> - What about the unmap and read dirty without races operation that
> >> vfio has?
> >>
> > I am afraid that might need a new unmap iommu op that reads the dirty bit
> > after clearing the page table entry. It's marshalling the bits from
> > iopte into a bitmap as opposed to some logic added on top. As far as I
> > looked for AMD this isn't difficult to add, (same for Intel) it can
> > *I think* reuse all of the unmap code.
> >
>
> OK, made some progress.

Nice!


> It's a WIP (here be dragons!) and still missing things e.g. iommufd selftests,
> revising locking, bugs, and more -- works with my emulated qemu patches which
> is a good sign. But hopefully starts some sort of skeleton of what we were
> talking about in the above thread.

I'm a bit bogged with the coming merge window right now, but will have
more to say in a bit

> The bigger TODO, though is the equivalent UAPI for IOMMUFD; I started with
> the vfio-compat one as it was easier provided there's existing userspace to work
> with (Qemu). To be fair the API is not that "far" from what would be IOMMUFD onto
> steering the dirty tracking, read-clear the dirty bits, unmap and
> get dirty.

I think this is fine to start experimenting with

> But as we discussed earlier, the one gap of VFIO dirty API is that
> it lacks controls for upgrading/downgrading area/IOPTE sizes which
> is where IOMMUFD would most likely shine. When that latter part is
> done we can probably adopt an eager-split approach inside
> vfio-compat.

I think the native API should be new ioctls that operate on the
hw_pagetable object to:
- enable/disable dirty tracking
- read&clear a bitmap from a range
- read&unmap a bitmap from a range
- Manipulate IOPTE sizes
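
Just to sketch the shape of it (none of these names or layouts are final):

struct iommu_hwpt_set_dirty_tracking {
	__u32 size;
	__u32 flags;		/* enable/disable */
	__u32 hwpt_id;
	__u32 __reserved;
};

struct iommu_hwpt_get_dirty_bitmap {
	__u32 size;
	__u32 hwpt_id;
	__u64 iova;
	__u64 length;
	__u64 page_size;
	__u64 data;		/* user bitmap pointer, read & clear */
};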

As you say it should not be much distance from the VFIO compat stuff

Most probably I would say to leave dirty tracking out of the type1 api
and compat for it. Maybe we can make some limited cases work back
compat, like the whole ioas supports iommu dirty tracking or
something..

Need to understand if it is worthwhile - remember, to use any of this
you need a qemu that is updated to the v2 migration interface,
so there is little practical value in back compat to old qemu if we
expect qemu will use the native interface anyhow.

> Additionally I also sort of want to skeleton ARM and Intel to see how it looks.
> Some of the commits made notes of some of research I made, so *I think* the APIs
> introduced capture h/w semantics for all the three IOMMUs supporting dirty
> tracking.

I think the primitives are pretty basic, I'd be surprised if there is
something different :)

Though something to be thinking about is how this works with nested
translation and other advanced features.

Thanks,
Jason

2022-03-16 23:22:41

by Joao Martins

Subject: Re: iommufd(+vfio-compat) dirty tracking

On 3/15/22 19:29, Jason Gunthorpe wrote:
> On Fri, Mar 11, 2022 at 01:51:32PM +0000, Joao Martins wrote:
>> On 2/28/22 13:01, Joao Martins wrote:
>>> On 2/25/22 20:44, Jason Gunthorpe wrote:
>>>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>>>>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>> I'll be simplifying the interface in the other type1 series I had and making it
>>>>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
>>>>> enable/disable over a domain. Perhaps same trick could be expanded to other
>>>>> features to have a bit more control on what userspace is allowed to use. I think
>>>>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
>>>>> to have full control during the different stages of migration of dirty tracking.
>>>>> In all of the IOMMU implementations/manuals I read it means setting a protection
>>>>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
>>>>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
>>>>> (albeit past work had also it always-on).
>>>>>
>>>>> Provided the iommufd does /separately/ more finer granularity on what we can
>>>>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
>>>>> at will as separate operations, before and after migration respectivally. That logic
>>>>> would probably be better to be in separate iommufd ioctls(), as that it's going to be
>>>>> expensive.
>>>>
>>>> This all sounds right to me
>>>>
>>>> Questions I have:
>>>> - Do we need ranges for some reason? You mentioned ARM SMMU wants
>>>> ranges? how/what/why?
>>>>
>>> Ignore that. I got mislead by the implementation and when I read the SDM
>>> I realized that the implementation was doing the same thing I was doing
>>> i.e. enabling dirty-bit in the protection domain at start rather than
>>> dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD
>>> bit in the context descriptor to enable dirty bits or the STE.S2HD in the
>>> stream table entry for the stage2 equivalent. Nothing here is per-range
>>> basis. And the ranges was only used by the implementation for the eager
>>> splitting/merging of IO page table levels.
>>>
>>>> - What about the unmap and read dirty without races operation that
>>>> vfio has?
>>>>
>>> I am afraid that might need a new unmap iommu op that reads the dirty bit
>>> after clearing the page table entry. It's marshalling the bits from
>>> iopte into a bitmap as opposed to some logic added on top. As far as I
>>> looked for AMD this isn't difficult to add, (same for Intel) it can
>>> *I think* reuse all of the unmap code.
>>>
>>
>> OK, made some progress.
>
> Nice!
>
>
>> It's a WIP (here be dragons!) and still missing things e.g. iommufd selftests,
>> revising locking, bugs, and more -- works with my emulated qemu patches which
>> is a good sign. But hopefully starts some sort of skeleton of what we were
>> talking about in the above thread.
>
> I'm a bit bogged with the coming merge window right now, but will have
> more to say in a bit
>

Thanks! Take your time :)

>> The bigger TODO, though is the equivalent UAPI for IOMMUFD; I started with
>> the vfio-compat one as it was easier provided there's existing userspace to work
>> with (Qemu). To be fair the API is not that "far" from what would be IOMMUFD onto
>> steering the dirty tracking, read-clear the dirty bits, unmap and
>> get dirty.
>
> I think this is fine to start experimenting with
>
>> But as we discussed earlier, the one gap of VFIO dirty API is that
>> it lacks controls for upgrading/downgrading area/IOPTE sizes which
>> is where IOMMUFD would most likely shine. When that latter part is
>> done we can probably adopt an eager-split approach inside
>> vfio-compat.
>
> I think the native API should be new ioctls that operate on the
> hw_pagetable object to:
> - enable/disable dirty tracking
> - read&clear a bitmap from a range
> - read&unmap a bitmap from a range
> - Manipulate IOPTE sizes
>
Yeap -- heh nice, it is roughly what I was into already. I will take the
opportunity to do the zerocopy/gup of the bitmap when writing the
iommufd native ioctls. The vfio-compat is still copying.

Other things on my mind for the iommufd native interface are:

1) The IOPTE split *could* happen while we read the
dirty bits at a pgsize != the args' pgsize. The
read_and_clear_dirty() is already expensive, but I wonder about the idea
of taking advantage of the IOTLB flush needed for dirty-clearing to also
update the IOPTEs on the next memory-request/translation-request. Albeit
IOPTE collapse still needs to be routed via a specific ioctl() to promote
to a higher page size (should the migration fail or something).

(I referred to this in the past as lazy-split, in contrast to explicit IOPTE
size manipulation iommufd ioctls i.e. {manual,eager}-split)

2) I was thinking of new io_pagetable APIs instead of hw_pagetable, but if you're
sort of seeing this as a hw_pagetable object primitive, then I am
wondering why? Albeit at the end of the day we operate on the iopt inside the hw_pagetable.

Note: IIUC hw_pagetable is supposed to model direct manipulation of an
iommu_domain, which dirty tracking fits into, but wouldn't that be applicable
to _MAP and _UNMAP too, maybe?

> As you say it should not be much distance from the VFIO compat stuff
>
> Most probably I would say to leave dirty tracking out of the type1 api
> and compat for it. Maybe we can make some limited cases work back
> compat, like the whole ioas supports iommu dirty tracking or
> something..
>
> Need to understand if it is wortwhile - remember to use any of this
> you need a qemu that is updated to the v2 migration interface,
> so there is little practical value in back compat to old qemu if we
> expect qemu will use the native interface anyhow.
>

Perhaps the value is really just *how much* applications need to get rewritten
to adjust to iommufd versus iteratively adding small (new) parts of it.
migration-v2 didn't turn into a complete rewrite of the initial vfio part, iiuc.
Albeit I suspect the compat was a struggle to keep, and it might not be what we
desire if the semantics diverge too much.

e.g. if the semantics are about the same, new VMMs could use the new iommufd
features, such as the IOPTE size change ioctls(), via the VFIO_IOAS (if available)
while leaving vfio-compat somewhat usable with the existing dirty ioctls. Then slowly
move to the newer iommufd (and less vfio-iommu-type1) until the point you totally deprecate
it. Even without IOPTE size modification in vfio-compat you would still mark as dirty
a lot less than the current logic does (which effectively reports the whole bitmap as dirty).

I guess at the end of the day it is the maintenance cost -- fortunately the
APIs are looking similar and most of vfio-compat is dealing with userspace
args/validation.
args/validation.

Anyhow just some thoughts, open to anything really :)

>> Additionally I also sort of want to skeleton ARM and Intel to see how it looks.
>> Some of the commits made notes of some of research I made, so *I think* the APIs
>> introduced capture h/w semantics for all the three IOMMUs supporting dirty
>> tracking.
>
> I think the primitives are pretty basic, I'd be surprised if there is
> something different :)
>
Yeah, they are very basic.

My main unknown in the rework was how the hw protection domain context can deal with
dynamically switching dirty tracking on and off. But I think I answered that myself, at
least from grokking the specs and from experimentation.

> Though things to be thinking about are how does this work with nested
> translation and other advanced features
/me nods

There are usually some flags/bits on the 'stage 2' domain, at least on ARM (S2CD) and Intel
(second-level table). But for AMD, you usually control via the vIOMMU what features the hw
exposes[*], and I think there's an extra bit to set to disable vIOMMU dirty tracking from
the VMM. But I still think the only thing needed is first-stage tracking, as that's what
matters for L0 live migration.

[*] AMD vIOMMU also offloads some of the command processing, not just the page tables. Not
sure if ARM/Intel have an equivalent.

2022-03-17 00:41:54

by Joao Martins

Subject: Re: iommufd(+vfio-compat) dirty tracking

On 3/16/22 16:36, Joao Martins wrote:
> On 3/15/22 19:29, Jason Gunthorpe wrote:
>> On Fri, Mar 11, 2022 at 01:51:32PM +0000, Joao Martins wrote:
>>> On 2/28/22 13:01, Joao Martins wrote:
>>>> On 2/25/22 20:44, Jason Gunthorpe wrote:
>>>>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>>>>>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>>>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>> Questions I have:
>>>>> - Do we need ranges for some reason? You mentioned ARM SMMU wants
>>>>> ranges? how/what/why?
>>>>>

An amendment here.

Sigh, ARM turns out to be slightly more unique compared to x86. I am re-reviewing
the ARM side, and apparently you have two controls: one is a 'feature bit'
just like x86, and another is a modifier (ARM-only).

The Context Descriptor (CD) is the equivalent of the AMD DTE or the Intel context
descriptor for second-level. That's the top-level enabler for what is actually a *second*
modifier bit per PTE (or per TTD, for more accurate terminology), the so-called
DBM (dirty-bit-modifier). The latter, when set, changes the meaning of the PTE's
read/write access flag AP[2].

If you have CD.HD enabled (aka HTTU is enabled) *and* PTE.DBM set, then a
transition in the SMMU from "writeable-clean" to "written" means that the
access bits go from "read-only" (AP[2] = 1) to "read/write" (AP[2] = 0)
if-and-only-if PTE.DBM = 1 (and it does not generate a permission IO page fault
like it normally would with DBM = 0). Same thing for stage 2, except that
the access bits are reversed (S2AP[1] is set when "written" and cleared
when "writeable", when DBM is also set).

Now you could say that this allows you to control it on a per-range basis.
Gah, no; a per-PTE basis is more accurate.

And in practice I suppose that means that switching SMMU dirty-tracking
on/off *dynamically* means not only setting CD.HD but also walking the
page tables and atomically setting/clearing both DBM and AP[2].
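
To make the encoding concrete, for stage 1 it would be something like this
(bit positions per the VMSAv8-64 descriptor format; the helper name is made up):

#define PTE_DBM		(1ULL << 51)	/* dirty-bit-modifier */
#define PTE_AP2_RDONLY	(1ULL << 7)	/* AP[2]: write access disabled */

/* writeable-clean: DBM set, AP[2] set; written: DBM set, AP[2] cleared by HW */
static inline bool s1_pte_hw_dirty(u64 pte)
{
	return (pte & PTE_DBM) && !(pte & PTE_AP2_RDONLY);
}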

References:

DDI0487H, Table D5-30 Data access permissions
SMMU 3.2 spec, 3.13.3 Dirty flag hardware update

2022-03-19 05:51:42

by Joao Martins

Subject: Re: iommufd(+vfio-compat) dirty tracking

On 3/16/22 20:37, Joao Martins wrote:
> On 3/16/22 16:36, Joao Martins wrote:
>> On 3/15/22 19:29, Jason Gunthorpe wrote:
>>> On Fri, Mar 11, 2022 at 01:51:32PM +0000, Joao Martins wrote:
>>>> On 2/28/22 13:01, Joao Martins wrote:
>>>>> On 2/25/22 20:44, Jason Gunthorpe wrote:
>>>>>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>>>>>>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>>>>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>>> Questions I have:
>>>>>> - Do we need ranges for some reason? You mentioned ARM SMMU wants
>>>>>> ranges? how/what/why?
>>>>>>
>
> An amend here.
>
> Sigh, ARM turns out is slightly more unique compared to x86. As I am re-reviewing
> the ARM side. Apparently you have two controls: one is a 'feature bit'
> just like x86 and another is a modifier (arm-only).
>
> The Context descriptor (CD) equivalent to AMD DTEs or Intel context descriptor
> equivalent for second-level. That's the top-level enabler to actually a *second*
> modifier bit per-PTE (or per-TTD for more accurate terminology) which is the so
> called DBM (dirty-bit-modifier). The latter when set, changes the meaning of
> read/write access-flags of the PTE AP[2].
>
> If you have CD.HD enabled (aka HTTU is enabled) *and* PTE.DBM set, then a
> transition in the SMMU from "writable Clean" to "written" means that the the
> access bits go from "read-only" (AP[2] = 1) to "read/write" (AP[2] = 0)
> if-and-only-if PTE.DBM = 1 (and does not generate a permission IO page fault
> like it normally would be with DBM = 0). Same thing for stage-2, except that
> the access-bits are reversed (S2AP[1] is set when "written" and it's cleared
> when it's "writable" (when DBM is also set).
>
> Now you could say that this allows you to control on a per-range basis.
> Gah, no, more like a per-PTE basis is more accurate.
>
> And in practice I suppose that means that dynamically switching on/off SMMU
> dirty-tracking *dynamically* means not only setting CD.HD but also walking the
> page tables, and atomically setting/clearing both the DBM and AP[2].
>
> References:
>
> DDI0487H, Table D5-30 Data access permissions
> SMMU 3.2 spec, 3.13.3 Dirty flag hardware update

I updated my branch and added an SMMUv3 implementation of the whole thing (loosely based
on the past work) and adjusted the 'set tracking' structure to cover the slightly
different h/w construct described above. At the high level there's a 'set_dirty_tracking_range'
API, which is internal to iommufd obviously. The UAPI won't change ofc.
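
For reference, the internal hook is roughly of this shape (illustrative, naming
not final), the range letting the driver walk and flip DBM/AP[2] only where needed:

int (*set_dirty_tracking_range)(struct iommu_domain *domain,
				unsigned long iova, size_t size,
				bool enable);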

It's only compile-tested sadly, as I have no SMMUv3.2 hardware, and to have this SMMUv3
DBM/HTTU support there are some requirements on the processor that I am not sure can be
fully emulated. The Intel iommu implementation follows the same model as AMD, and I will get
to that next, with a compliant iommu emulation too.

But now I will be focusing on the hw_pagetable UAPI part. I understand your thinking of
tying the dirty tracking APIs to the hw_pagetable obj as opposed to the IOAS.