Subject: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

Hi,

This series attempts to add vfio live migration support for
HiSilicon ACC VF devices. HiSilicon ACC VF device MMIO space
includes both the functional register space and migration
control register space. As discussed in RFCv1[0], this may create
security issues as these regions get shared between the Guest
driver and the migration driver. Based on the feedback, we tried
to address those concerns in this version.

This is now based on the new vfio-pci-core framework proposal[1].
Understand that the framework proposal is still under discussion,
but really appreciate any feedback on the approach taken here
to mitigate the security risks.

The major changes from v1 are,

 -Adds a new vendor-specific vfio_pci driver(hisi-acc-vfio-pci)
  for HiSilicon ACC VF devices based on the new vfio-pci-core
  framework proposal.

 -Since HiSilicon ACC VF device MMIO space contains both the
  functional register space and migration control register space,
  override the vfio_device_ops ioctl method to report only the
  functional space to VMs.

 -For a successful migration, we still need access to VF dev
  functional register space mainly to read the status registers.
  But accessing these while the Guest vCPUs are running may leave
a security hole. To avoid any potential security issues, we
map/unmap the MMIO regions on a need basis and is safe to do so.
  (Please see hisi_acc_vf_ioremap/unmap() fns in patch #4).
 
 -Dropped debugfs support for now.
 -Uses common QM functions for mailbox access(patch #3).

This is now sanity tested on HiSilicon platforms that support these
ACC devices.

From v1:
 -In order to ensure the compatibility of the devices before and
after the migration, the device compatibility information check
will be performed in the Pre-copy stage. If the check fails,
  an error will be returned and the source VM will exit the migration.
 -After the compatibility check is passed, it will enter the
  Stop-and-copy stage. At this time, all the live migration data
will be copied and saved to the VF device of the destination.
  Finally, the VF device of the destination will be started.

Thanks,
Shameer
[0] https://lore.kernel.org/lkml/[email protected]/
[1] https://lore.kernel.org/lkml/[email protected]/

Longfang Liu (2):
crypto: hisilicon/qm - Export mailbox functions for common use
hisi_acc_vfio_pci: Add support for vfio live migration

Shameer Kolothum (2):
hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices
hisi_acc_vfio_pci: Override ioctl method to limit BAR2 region size

drivers/crypto/hisilicon/qm.c | 8 +-
drivers/crypto/hisilicon/qm.h | 4 +
drivers/vfio/pci/Kconfig | 13 +
drivers/vfio/pci/Makefile | 2 +
drivers/vfio/pci/hisi_acc_vfio_pci.c | 1155 ++++++++++++++++++++++++++
drivers/vfio/pci/hisi_acc_vfio_pci.h | 144 ++++
6 files changed, 1323 insertions(+), 3 deletions(-)
create mode 100644 drivers/vfio/pci/hisi_acc_vfio_pci.c
create mode 100644 drivers/vfio/pci/hisi_acc_vfio_pci.h

--
2.17.1


Subject: RE: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver



> -----Original Message-----
> From: Jason Gunthorpe [mailto:[email protected]]
> Sent: 02 February 2022 13:15
> To: Shameerali Kolothum Thodi <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; Linuxarm <[email protected]>; liulongfang
> <[email protected]>; Zengtao (B) <[email protected]>;
> yuzenghui <[email protected]>; Jonathan Cameron
> <[email protected]>; Wangzhou (B) <[email protected]>
> Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
>
> On Fri, Jul 02, 2021 at 10:58:45AM +0100, Shameer Kolothum wrote:
> > This series attempts to add vfio live migration support for
> > HiSilicon ACC VF devices. HiSilicon ACC VF device MMIO space
> > includes both the functional register space and migration
> > control register space. As discussed in RFCv1[0], this may create
> > security issues as these regions get shared between the Guest
> > driver and the migration driver. Based on the feedback, we tried
> > to address those concerns in this version.
> >
> > This is now based on the new vfio-pci-core framework proposal[1].
> > Understand that the framework proposal is still under discussion,
> > but really appreciate any feedback on the approach taken here
> > to mitigate the security risks.
>
> Hi, can you look at the v6 proposal for the mlx5 implementation of the
> migration API and see if it meets hisilicon acc's needs as well?
>
> https://lore.kernel.org/all/[email protected]/

Yes, I saw that one. Thanks for that and is now looking into it.

>
> There are few topics to consider:
> - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make
> sense for this driver?

I think it will be STOP_COPY only for now. We might have PRECOPY feature once
we have the SMMUv3 HTTU support in future.

>
> I see pf_qm_state_pre_save() but didn't understand why it wanted to
> send the first 32 bytes in the PRECOPY mode? It is fine, but it
> will add some complexity to continue to do this.

That was mainly to do a quick verification between src and dst compatibility
before we start saving the state. I think probably we can delay that check
for later.

> - I think we discussed the P2P implementation and decided it would
> work for this device? Can you re-read and confirm?

In our case these devices are Integrated End Point devices and doesn't have
P2P DMA capability. Hence the FSM arcs will be limited to STOP_COPY feature
I guess. Also, since we cannot guarantee a NDMA state in STOP, my
assumption currently is the onus of making sure that no MMIO access happens
in STOP is on the user. Is that a valid assumption?

> - Are the arcs we defined going to work here as well? The current
> implementation in hisi_acc_vf_set_device_state() is very far away
> from what the v1 protocol is, so I'm having a hard time guessing,
> but..

Right. The FSM has changed a couple of times since we posted this.
I am going to rebase all that now.

> RESUMING -> STOP
> Probably vf_qm_state_resume()
>
> RUNNING -> STOP
> vf_qm_fun_restart() - that is oddly named..
>
> STOP -> RESUMING
> Seems to be a nop (likely a bug)
>
> STOP -> RUNNING
> Not implemented currenty? (also a bug)
>
> STOP -> STOP_COPY
> pf_qm_state_pre_save / vf_qm_state_save
>
> STOP_COPY -> STOP
> NOP

I will check and verify this.

> And the modification for the P2P/NO DMA is presumably just
> fun_restart too since stopping the device and stopping DMA are
> going to be the same thing here?

Yes, in our case stopping device and stopping DMA are effectively the
same thing.

>
> The mlx5 implementation linked above is a full example you can cut and
> paste from for how to implement the state function and the how to do
> the data transfer. The f_ops read/write implementation for acc looks
> trivial as it only streams the fixed size and pre-allocated 'struct
> acc_vf_data'
>
> It looks like it would be a short path to implement our v2 proposal
> and remove a lot of driver code, as we saw in mlx5.
>

Ok. These are the git repo I am using for the rework,
https://github.com/jgunthorpe/qemu/commits/vfio_migration_v2
https://github.com/jgunthorpe/linux/tree/vfio_migration_v2

Please let me know if the above are not up to date.

Also, just noted that my quick prototype is now failing
with below error,

" Error: VFIO device doesn't support migration"

Do we need to set the below before the feature query?
Or am I using a wrong Qemu/kernel repo?

--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -488,6 +488,7 @@ static int vfio_migration_query_flags(VFIODevice
*vbasedev, uint64_t *mig_flags)
struct vfio_device_feature_migration *mig = (void *)feature->data;

feature->argsz = sizeof(buf);
+ feature->flags = VFIO_DEVICE_FEATURE_MIGRATION | VFIO_DEVICE_FEATURE_GET;
if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature) != 0)
return -EOPNOTSUPP;

Thanks,
Shameer

2022-02-03 13:31:03

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

On Wed, 2 Feb 2022 14:34:52 +0000
Shameerali Kolothum Thodi <[email protected]> wrote:

> > -----Original Message-----
> > From: Jason Gunthorpe [mailto:[email protected]]
>
> >
> > I see pf_qm_state_pre_save() but didn't understand why it wanted to
> > send the first 32 bytes in the PRECOPY mode? It is fine, but it
> > will add some complexity to continue to do this.
>
> That was mainly to do a quick verification between src and dst compatibility
> before we start saving the state. I think probably we can delay that check
> for later.

In the v1 migration scheme, this was considered good practice. It
shouldn't be limited to PRECOPY, as there's no requirement to use
PRECOPY, but the earlier in the migration process that we can trigger a
device or data stream compatibility fault, the better. TBH, even in
the case where a device doesn't support live dirty tracking for a
PRECOPY phase, using it for compatibility testing continues to seem
like good practice. Thanks,

Alex

2022-02-03 16:45:37

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote:
> On 2/2/22 17:03, Jason Gunthorpe wrote:

> > how to integrate that with the iommufd work, which I hope will allow
> > that series, and the other IOMMU drivers that can support this to be
> > merged..
>
> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done
> there, but TBH I am not up to speed on iommu-fd yet so I missed something
> obvious for sure. When you say 'integrate that with the iommufd' can you
> expand on that?

The general idea is that iommufd is the place to put all the iommu
driver uAPI for consumption by userspace. The IOMMU feature of dirty
tracking would belong there.

So, some kind of API needs to be designed to meet the needs of the
IOMMU drivers.

> Did you meant to use interface in the link, or perhaps VFIO would use an iommufd
> /internally/ but still export the same UAPI as VFIO dirty tracking ioctls() (even if it's
> not that efficient with a lot of bitmap copying). And optionally give a iommu_fd for the
> VMM to scan iommu pagetables itself and see what was marked dirty or
> not?

iommufd and VFIO container's don't co-exist, either iommufd is
providing the IOMMU interface, or the current type 1 code - not both
together.

iommfd current approach presents the same ABI as the type1 container
as compatability, and it is a possible direction to provide the
iommu_domain stored dirty bits through that compat API.

But, as you say, it looks unnatural and inefficient when the domain
itself is storing the dirty bits inside the IOPTE.

It need some study I haven't got into yet :)

> My over-simplistic/naive view was that the proposal in the link
> above sounded a lot simpler. While iommu-fd had more longevity for
> many other usecases outside dirty tracking, no?

I'd prefer we don't continue to hack on the type1 code if iommufd is
expected to take over in this role - especially for a multi-vendor
feature like dirty tracking.

It is actually a pretty complicated topic because migration capable
PCI devices are also include their own dirty tracking HW, all this
needs to be harmonized somehow. VFIO proposed to squash everything
into the container code, but I've been mulling about having iommufd
only do system iommu and push the PCI device internal tracking over to
VFIO.

> I have a PoC-ish using the interface in the link, with AMD IOMMU
> dirty bit supported (including Qemu emulated amd-iommu for folks
> lacking the hardware). Albeit the eager-spliting + collapsing of
> IOMMU hugepages is not yet done there, and I wanted to play around
> the emulated intel-iommu SLADS from specs looks quite similar. Happy
> to join existing effort anyways.

This sounds great, I would love to bring the AMD IOMMU along with a
dirty tracking implementation! Can you share some patches so we can
see what the HW implementation looks like?

Jason

2022-02-05 10:37:03

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

On Fri, Feb 04, 2022 at 07:53:12PM +0000, Joao Martins wrote:
> On 2/3/22 15:18, Jason Gunthorpe wrote:
> > On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote:
> >> On 2/2/22 17:03, Jason Gunthorpe wrote:
> >>> how to integrate that with the iommufd work, which I hope will allow
> >>> that series, and the other IOMMU drivers that can support this to be
> >>> merged..
> >>
> >> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done
> >> there, but TBH I am not up to speed on iommu-fd yet so I missed something
> >> obvious for sure. When you say 'integrate that with the iommufd' can you
> >> expand on that?
> >
> > The general idea is that iommufd is the place to put all the iommu
> > driver uAPI for consumption by userspace. The IOMMU feature of dirty
> > tracking would belong there.
> >
> > So, some kind of API needs to be designed to meet the needs of the
> > IOMMU drivers.
> >
> /me nods
>
> I am gonna assume below is the most up-to-date to iommufd (as you pointed
> out in another thread IIRC):
>
> https://github.com/jgunthorpe/linux iommufd
>
> Let me know if it's not :)

The iommufd part is pretty good, but there is hacky patch to hook it
into vfio that isn't there, if you want to actually try it.

> > But, as you say, it looks unnatural and inefficient when the domain
> > itself is storing the dirty bits inside the IOPTE.
>
> How much is this already represented as the io-pgtable in IOMMU internal kAPI
> (if we exclude the UAPI portion of iommufd for now) ? FWIW, that is today
> used by the AMD IOMMU and ARM IOMMUs. Albeit, not Intel :(

Which are you looking at? AFACIT there is no diry page support in
iommu_ops ?

> then potentially VMM/process can more efficiently scan the dirtied
> set? But if some layer needs to somehow mediate between the vendor
> IOPTE representation and an UAPI IOPTE representation, to be able to
> make that delegation to userspace ... then maybe both might be
> inefficient? I didn't see how iommu-fd would abstract the IOPTEs
> lookup as far as I glanced through the code, perhaps that's another
> ioctl().

It is based around the same model as VFIO container - map/unmap of
user address space into the IOPTEs and the user space doesn't see
anything resembling a 'pte' - at least for kernel owned IO page
tables.

User space page tables will not be abstracted and the userspace must
know the direct HW format of the IOMMU they is being used.

> But what strikes /specifically/ on the dirty bit feature is that it looks
> simpler with the current VFIO, the heavy lifting seems to be
> mostly on the IOMMU vendor. The proposed API above for VFIO looking at
> the container (small changes), and IOMMU vendor would do most of it:

It is basically the same, almost certainly the user API in iommufd
will be some 'get dirty bits' and 'unmap and give me the dirty bits'
just like vfio has.

The tricky details are around how do you manage this when the system
may have multiple things invovled capable, or not, of actualy doing
dirty tracking.

> At the same time, what particularly scares me perf-wise (for the
> device being migrated) ... is the fact that we need to dynamically
> split and collapse page tables to increase the granularity of which
> we track. In the above interface it splits/collapses when you turn
> on/off the dirty tracking (respectively). That's *probably* where we
> need more flexibility, not sure.

For sure that is a particularly big adventure in the iommu driver..

> Do you have thoughts on what such device-dirty interface could look like?
> (Perhaps too early to poke while the FSM/UAPI is being worked out)

I've been thinking the same general read-and-clear of a dirty
bitmap. It matches nicely the the KVM interface.

> I was wondering if container has a dirty scan/sync callback funnelled
> by a vendor IOMMU ops implemented (as Shameerali patches proposed),

Yes, this is almost certainly how the in-kernel parts will look

> and vfio vendor driver provides one per device.

But this is less clear..

> Or propagate the dirty tracking API to vendor vfio driver[*].
> [*] considering the device may choose where to place its tracking storage, and
> which scheme (bitmap, ring, etc) it might be.

This has been my thinking, yes

> The reporting of the dirtying, though, looks hazzy to achieve if you
> try to make it uniform even to userspace. Perhaps with iommu-fd
> you're thinking to mmap() the dirty region back to userspace, or an
> iommu-fd ioctl() updates the PTEs, while letting the kernel clear
> the dirty status via the mmap() object. And that would be the common
> API regardless of dirty-hw scheme. Anyway, just thinking out loud.

My general thinking has be that iommufd would control only the system
IOMMU hardware. The FD interface directly exposes the iommu_domain as
a manipulable object, so I'd imagine making userspace have a simple
1:1 connection to the iommu_ops of a single iommu_domain.

Doing this avoids all the weirdo questions about what do you do if
there is non-uniformity in the iommu_domain's.

Keeping with that theme the vfio_device would provide a similar
interface, on its own device FD.

I don't know if mmap should be involed here, the dirty bitmaps are not
so big, I suspect a simple get_user_pages_fast() would be entirely OK.

> > VFIO proposed to squash everything
> > into the container code, but I've been mulling about having iommufd
> > only do system iommu and push the PCI device internal tracking over to
> > VFIO.
> >
>
> Seems to me that the juicy part falls mostly in IOMMU vendor code, I am
> not sure yet how much one can we 'offload' to a generic layer, at least
> compared with this other proposal.

Yes, I expect there is very little generic code here if we go this
way. The generic layer is just marshalling the ioctl(s) to the iommu
drivers. Certainly not providing storage or anything/

> Give me some time (few days only, as I gotta sort some things) and I'll
> respond here as follow up with link to a branch with the WIP/PoC patches.

Great!

> 3) Dirty bit is sticky, hardware never clears it. Reading the access/dirty
> bit is cheap, clearing them is 'expensive' because one needs to flush
> IOTLB as the IOMMU hardware may cache the bits in the IOTLB as a result
> of an address-translation/io-page-walk. Even though the IOMMU uses interlocked
> operations to actually update the Access/Dirty bit in concurrency with
> the CPU. The AMD manuals are a tad misleading as they talk about marking
> non-present, but that would be catastrophic for migration as it would
> mean a DMA target abort for the PCI device, unless I missed something obvious.
> In any case, this means that the dirty bit *clearing* needs to be
> batched as much as possible, to amortize the cost of flushing the IOTLB.
> This is the same for Intel *IIUC*.

You have to mark it as non-present to do the final read out if
something unmaps while the tracker is on - eg emulating a viommu or
something. Then you mark non-present, flush the iotlb and read back
the dirty bit.

Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and
then read and clear them.

> 4) Adjust the granularity of pagetables in place:
> [This item wasn't done, but it is generic to any IOMMU because it
> is mostly the ability to split existing IO pages in place.]

This seems like it would be some interesting amount of driver work,
but yes it could be a generic new iommu_domina op.

> 4.b) Optionally starting dirtying earlier (at provisioning) and let
> userspace dynamically split pages. This is to hopefully minimize the
> IOTLB miss we induce ourselves in item 4.a) if we were to do eagerly.
> So dirty tracking would be enabled at creation of the protection domain
> after the vfio container is set up, and we would use pages dirtied
> as a indication of what needs to be splited. Problem is for IO page
> sizes bigger than 1G, which might unnecessarily lead to marking too
> much as dirty early on; but at least it's better than transferring the
> whole set.

I'm not sure running with dirty tracking permanently on would be good
for guest performance either.

I'd suspect you'd be better to have a warm up period where you track
dirtys and split down pages.

It is interesting, this is a possible reason why device dirty tracking
might actually perfom better because it can operate at a different
granularity from the system iommu without disrupting the guest DMA
performance.

Jason

2022-02-11 21:53:53

by Joao Martins

[permalink] [raw]
Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

On 2/11/22 17:49, Jason Gunthorpe wrote:
> On Fri, Feb 11, 2022 at 05:28:22PM +0000, Joao Martins wrote:
>>> It is basically the same, almost certainly the user API in iommufd
>>> will be some 'get dirty bits' and 'unmap and give me the dirty bits'
>>> just like vfio has.
>>>
>>
>> The 'unmap and give dirty bits' looks to be something TBD even in a VFIO
>> migration flow.
>
> It is essential to implement any kind of viommu behavior where
> map/unmap is occuring while the dirty tracking is running. It should
> never make a difference except in some ugly edge cases where the dma
> and unmap are racing.
>
/me nods

>> supposed to be happening (excluding P2P)? So perhaps the unmap is
>> unneeded after quiescing the VF.
>
> Yes, you don't need to unmap for migration, a simple fully synchronous
> read and clear is sufficient. But that final read, while DMA is quite,
> must obtain the latest dirty bit data.
>

The final read doesn't need special logic in VFIO/IOMMU IUC
(maybe only in the VMM)

Anyways, the above paragraph matches my understanding.

>> We have a bitmap based interface in KVM, but there's also a recent ring
>> interface for dirty tracking, which has some probably more determinism than
>> a big bitmap. And if we look at hardware, AMD needs to scan NPT pagetables
>> and breaking its entries on-demand IIRC, whereas Intel resembles something
>> closer to a 512 entries 'ring' with VMX PML, which tells what has been
>> dirtied.
>
> KVM has an advantage that it can throttle the rate of dirty generation
> so it can rely on logging. The IOMMU can't throttle DMA, so we are
> stuck with a bitmap
>
Yeap, sadly :(

>>> I don't know if mmap should be involed here, the dirty bitmaps are not
>>> so big, I suspect a simple get_user_pages_fast() would be entirely OK.
>>>
>> Considering that is 32MB of a bitmap per TB maybe it is cheap.
>
> Rigt. GUP fasting a couple huge pages is nothing compared to scanning
> 1TB of IO page table.
>
... With 4K PTEs which is even more ludricous expensive. Well yeah, a
lesser concern the mangling of the bitmap, when you put that way heh

>>> You have to mark it as non-present to do the final read out if
>>> something unmaps while the tracker is on - eg emulating a viommu or
>>> something. Then you mark non-present, flush the iotlb and read back
>>> the dirty bit.
>>>
>> You would be surprised that AMD IOMMUs have also an accelerated vIOMMU
>> too :) without needing VMM intervention (that's also not supported
>> in Linux).
>
> I'm sure, but dirty tracking has to happen on the kernel owned page
> table, not the user owned one I think..
>
The plumbing for the hw-accelerated vIOMMU is a little different that
a regular vIOMMU, at least IIUC host does not take an active part in the
GVA -> GPA translation. Suravee's preso explains it nicely, if you don't
have time to fiddle with the SDM:

https://static.sched.com/hosted_files/kvmforum2021/da/vIOMMU%20KVM%20Forum%202021%20-%20v4.pdf


>>> Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and
>>> then read and clear them.
>>>
>> It's the other way around AIUI. The dirty bits are sticky, so you flush
>> the IOTLB after clearing as means to notify the IOMMU to set the dirty bits
>> again on the next memory transaction (or ATS translation).
>
> I guess it depends on how the HW works, if it writes the dirty bit
> back to ram instantly on the first dirty, or if it only writes the
> dirty bit when flushing the iotlb.
>
The manual says roughly that the update is visible to CPU as soon as the
it updates. Particularly from the IOMMU SDM:

"Note that the setting of accessed and dirty status bits in the page tables is visible to
both the CPU and the peripheral when sharing guest page tables. The IOMMU interlocked
operations to update A and D bits must be 64-bit operations and naturally aligned on a
64-bit boundary"

And ...

1. Decodes the read and write intent from the memory access.
2. If P=0 in the page descriptor, fail the access.
3. Compare the A & D bits in the descriptor with the read and write intent in the request.
4. If the A or D bits need to be updated in the descriptor:
* Start atomic operation.
* Read the descriptor as a 64-bit access.
* If the descriptor no longer appears to require an update, release the atomic lock with
no further action and continue to step 5.
* Calculate the new A & D bits.
* Write the descriptor as a 64-bit access.
* End atomic operation.
5. Continue to the next stage of translation or to the memory access.

> In any case you have to synchronize with the HW in some way to ensure
> that all dirty bits are 'pulled back' into system memory to resolve
> races (ie you need the iommu HW to release and the CPU to acquire on
> the dirty data) before trying to read them, at least for the final
> quieced system read.
>
/me nods

>>> This seems like it would be some interesting amount of driver work,
>>> but yes it could be a generic new iommu_domina op.
>>
>> I am slightly at odds that .split and .collapse at .switch() are enough.
>> But, with iommu if we are working on top of an IOMMU domain object and
>> .split and .collapse are iommu_ops perhaps that looks to be enough
>> flexibility to give userspace the ability to decide what it wants to
>> split, if it starts eargerly/warming-up tracking dirty pages.
>>
>> The split and collapsing is something I wanted to work on next, to get
>> to a stage closer to that of an RFC on the AMD side.
>
> split/collapse seems kind of orthogonal to me it doesn't really
> connect to dirty tracking other than being mostly useful during dirty
> tracking.
>
> And I wonder how hard split is when trying to atomically preserve any
> dirty bit..
>
Would would it be hard? The D bit is supposed to be replicated when you
split to smaller page size.

>> Hmmm, judging how the IOMMU works I am not sure this is particularly
>> affecting DMA performance (not sure yet about RDMA, it's something I
>> curious to see how it gets to perform with 4K IOPTEs, and with dirty
>> tracking always enabled). Considering how the bits are sticky, and
>> unless CPU clears it, it's short of a nop? Unless of course the checking
>> for A^D during an atomic memory transaction is expensive. Needs some
>> performance testing nonetheless.
>
> If you leave the bits all dirty then why do it? The point is to see
> the dirties, which means the iommu is generating a workload of dirty
> cachelines while operating it didn't do before.
>
My thinking was that if it's dirtied and in the IOTLB most likely the
descriptor in the IOTLB is cached. And if you need to do a IOMMU page walk
to resolve an IOVA, perhaps the check for the A & D bits needing
to be updated is probably the least problem in this path. Naturally, if
it's not split, you have a much higher chance (e.g. with 1GB entries) to stay
in the IOTLB and just compare two bits *prior* to consider starting
the atomic operation to update the descriptor.

>> I forgot to mention, but the early enablement of IOMMU dirty tracking
>> was also meant to fully know since guest creation what needs to be
>> sent to the destination. Otherwise, wouldn't we need to send the whole
>> pinned set to destination, if we only start tracking dirty pages during
>> migration?
>
> ? At the start of migration you have to send everything. Dirty
> tracking is intended to allow the VM to be stopped and then have only
> a small set of data to xfer.
>
Right, that's how it works today.

This is just preemptive longterm thinking about the overal problem space (probably
unnecessary noise at this stage). Particularly whenever I need to migrate 1 to 2TB VMs.
Particular that the stage *prior* to precopy takes way too long to transfer the whole
memory. So I was thinking say only transfer the pages that are populated[*] in the
second-stage page tables (for the CPU) coupled with IOMMU tracking from the beginning
(prior to vcpus even entering). That could probably decrease 1024 1GB Dirtied IOVA
entries, to maybe only dirty a smaller subset, saving a whole bootload of time.

[*] VMs without VFIO would be even easier as the first stage page tables are non-present.

>> Also, this is probably a differentiator for iommufd, if we were to provide
>> split and collapse semantics to IOMMU domain objects that userspace can use.
>> That would get more freedom, to switch dirty-tracking, and then do the warm
>> up thingie and piggy back on what it wants to split before migration.
>> perhaps the switch() should get some flag to pick where to split, I guess.
>
> Yes, right. Split/collapse should be completely seperate from dirty
> tracking.

Yeap.

I wonder if we could start progressing the dirty tracking as a first initial series and
then have the split + collapse handling as a second part? That would be quite
nice to get me going! :D

Joao

2022-02-12 06:30:34

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

On Fri, Feb 11, 2022 at 09:43:56PM +0000, Joao Martins wrote:
> The plumbing for the hw-accelerated vIOMMU is a little different that
> a regular vIOMMU, at least IIUC host does not take an active part in the
> GVA -> GPA translation. Suravee's preso explains it nicely, if you don't
> have time to fiddle with the SDM:
>
> https://static.sched.com/hosted_files/kvmforum2021/da/vIOMMU%20KVM%20Forum%202021%20-%20v4.pdf

That looks like the same nesting everyone else is talking about.

I've been refering to this as 'user space page tables' to avoid a lot
of ambiguity becuase what it fundamentally does is allow userspace to
directly manage an IO page table - building the IO PTEs, issue
invalidations, etc.

To be secure the userspace page table is nested under a kernel owned
page table and all the IO PTE offsets/etc are understood to be within
the address space of the kernel owned page table.

When combined with KVM the userspace is just inside a VM.

> 1. Decodes the read and write intent from the memory access.
> 2. If P=0 in the page descriptor, fail the access.
> 3. Compare the A & D bits in the descriptor with the read and write intent in the request.
> 4. If the A or D bits need to be updated in the descriptor:

Ah, so the dirty update is actually atomic on the first write before
any DMA happens - and I suppose all of this happens when the entry is
first loaded into the IOTLB.

So the flush is to allow the IOTLB to see the cleared D bit..

> > split/collapse seems kind of orthogonal to me it doesn't really
> > connect to dirty tracking other than being mostly useful during dirty
> > tracking.
> >
> > And I wonder how hard split is when trying to atomically preserve any
> > dirty bit..
> >
> Would would it be hard? The D bit is supposed to be replicated when you
> split to smaller page size.

I guess it depends on how the 'acquire' is done, as the CPU has to
atomically replace a large entry with a pointer to a small entry,
flush the IOTLB then 'acquire' the dirty bit. If the dirty bit is set
in the old entry then it has to sprinkle it into the new entries with
atomics.

> This is just preemptive longterm thinking about the overal problem
> space (probably unnecessary noise at this stage). Particularly
> whenever I need to migrate 1 to 2TB VMs. Particular that the stage
> *prior* to precopy takes way too long to transfer the whole
> memory. So I was thinking say only transfer the pages that are
> populated[*] in the second-stage page tables (for the CPU) coupled
> with IOMMU tracking from the beginning (prior to vcpus even
> entering). That could probably decrease 1024 1GB Dirtied IOVA
> entries, to maybe only dirty a smaller subset, saving a whole
> bootload of time.

Oh, you want to effectively optimize zero page detection..

> I wonder if we could start progressing the dirty tracking as a first initial series and
> then have the split + collapse handling as a second part? That would be quite
> nice to get me going! :D

I think so, and I think we should. It is such a big problem space, it
needs to get broken up.

Jason

2022-02-13 14:40:13

by Joao Martins

[permalink] [raw]
Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

On 2/4/22 23:07, Jason Gunthorpe wrote:
> On Fri, Feb 04, 2022 at 07:53:12PM +0000, Joao Martins wrote:
>> On 2/3/22 15:18, Jason Gunthorpe wrote:
>>> On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote:
>>>> On 2/2/22 17:03, Jason Gunthorpe wrote:
>>>>> how to integrate that with the iommufd work, which I hope will allow
>>>>> that series, and the other IOMMU drivers that can support this to be
>>>>> merged..
>>>>
>>>> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done
>>>> there, but TBH I am not up to speed on iommu-fd yet so I missed something
>>>> obvious for sure. When you say 'integrate that with the iommufd' can you
>>>> expand on that?
>>>
>>> The general idea is that iommufd is the place to put all the iommu
>>> driver uAPI for consumption by userspace. The IOMMU feature of dirty
>>> tracking would belong there.
>>>
>>> So, some kind of API needs to be designed to meet the needs of the
>>> IOMMU drivers.
>>>
>> /me nods
>>
>> I am gonna assume below is the most up-to-date to iommufd (as you pointed
>> out in another thread IIRC):
>>
>> https://github.com/jgunthorpe/linux iommufd
>>
>> Let me know if it's not :)
>
> The iommufd part is pretty good, but there is hacky patch to hook it
> into vfio that isn't there, if you want to actually try it.
>
OK -- Thanks for the heads up.

>>> But, as you say, it looks unnatural and inefficient when the domain
>>> itself is storing the dirty bits inside the IOPTE.
>>
>> How much is this already represented as the io-pgtable in IOMMU internal kAPI
>> (if we exclude the UAPI portion of iommufd for now) ? FWIW, that is today
>> used by the AMD IOMMU and ARM IOMMUs. Albeit, not Intel :(
>
> Which are you looking at? AFACIT there is no diry page support in
> iommu_ops ?
>
There is not, indeed. But how you manage IO page tables felt *somewhat
hinting* to io-pgtable -- but maybe it's a stretch. The dirty
page support builds on top of that, as ARM IOMMUs use iommu io-pgtable.

In the series in the link, it pretty much boils down to
essentially 3 iommu_ops introduced:

* support_dirty_log(domain)
- Checks whether a given IOMMU domain supports dirty tracking
* switch_dirty_log(domain, enable, iova, size, prot)
- Enables/Disables IOMMU dirty tracking on an IOVA range
(The range I suppose it's because ARM supports range tracking, contrary
to Intel and AMD which affect all the protection domain pg tables)
* sync_dirty_log(domain, iova, size, &bitmap, ...)

The IO pgtables, have an interesting set of APIs strictly for page table
manipulation, and they are sort of optional (so far ARM and AMD). Those
include, map, unmap, iova_to_pfn translation. So, the dirty log supports
for ARM builds on that and adds split, merge (for splitting and collapsing
IO pagtables), and finally sync_dirty_log() which is where we get the
bitmap of dirtied pages *updated*.

But well, at the end of the day for an IOMMU driver the domain ops are
the important stuff, maybe IO pgtable framework isn't as critical
(Intel for example, doesn't use that at all).

>> then potentially VMM/process can more efficiently scan the dirtied
>> set? But if some layer needs to somehow mediate between the vendor
>> IOPTE representation and an UAPI IOPTE representation, to be able to
>> make that delegation to userspace ... then maybe both might be
>> inefficient? I didn't see how iommu-fd would abstract the IOPTEs
>> lookup as far as I glanced through the code, perhaps that's another
>> ioctl().
>
> It is based around the same model as VFIO container - map/unmap of
> user address space into the IOPTEs and the user space doesn't see
> anything resembling a 'pte' - at least for kernel owned IO page
> tables.
>
Ah! I misinterpreted what you were saying.

> User space page tables will not be abstracted and the userspace must
> know the direct HW format of the IOMMU they is being used.
>
That's countering earlier sentence? Because hw format (for me at least)
means PTE and protection domain config format too. And if iommufd
abstracts the HW format, modelling after the IOMMU domain and its ops,
then it's abstracting userspace from those details e.g. it works over
IOVAs, but not over its vendor representation of how that IOVA is set up.

I am probably being dense.

>> But what strikes /specifically/ on the dirty bit feature is that it looks
>> simpler with the current VFIO, the heavy lifting seems to be
>> mostly on the IOMMU vendor. The proposed API above for VFIO looking at
>> the container (small changes), and IOMMU vendor would do most of it:
>
> It is basically the same, almost certainly the user API in iommufd
> will be some 'get dirty bits' and 'unmap and give me the dirty bits'
> just like vfio has.
>

The 'unmap and give dirty bits' looks to be something TBD even in a VFIO
migration flow. Considering that we pause vCPUs, and we quiesce the
VFs, I guess we are mostly supposed to have a guarantee that no DMA is
supposed to be happening (excluding P2P)? So perhaps the unmap is
unneeded after quiescing the VF. Which probably is important
migration-wise considering the unmap is really what's expensive (e.g.
needs IOTLB flush) and that would add up between stop copy and the time it
would resume on the destination, if we need to wait to unmap the whole IOVA
from source.

> The tricky details are around how do you manage this when the system
> may have multiple things invovled capable, or not, of actualy doing
> dirty tracking.
>
Yeap.

>> At the same time, what particularly scares me perf-wise (for the
>> device being migrated) ... is the fact that we need to dynamically
>> split and collapse page tables to increase the granularity of which
>> we track. In the above interface it splits/collapses when you turn
>> on/off the dirty tracking (respectively). That's *probably* where we
>> need more flexibility, not sure.
>
> For sure that is a particularly big adventure in the iommu driver..
>
Yeah.

Me playing around was with VFIO IOMMU hugepages disabled, and this
was something I am planning on working when I get back to this.

>> Do you have thoughts on what such device-dirty interface could look like?
>> (Perhaps too early to poke while the FSM/UAPI is being worked out)
>
> I've been thinking the same general read-and-clear of a dirty
> bitmap. It matches nicely the the KVM interface.
>

A bitmap I think it is a good start.

We have a bitmap based interface in KVM, but there's also a recent ring
interface for dirty tracking, which has some probably more determinism than
a big bitmap. And if we look at hardware, AMD needs to scan NPT pagetables
and breaking its entries on-demand IIRC, whereas Intel resembles something
closer to a 512 entries 'ring' with VMX PML, which tells what has been
dirtied.

>> I was wondering if container has a dirty scan/sync callback funnelled
>> by a vendor IOMMU ops implemented (as Shameerali patches proposed),
>
> Yes, this is almost certainly how the in-kernel parts will look
>
/me nods

>> and vfio vendor driver provides one per device.
>
> But this is less clear..
>
>> Or propagate the dirty tracking API to vendor vfio driver[*].
>> [*] considering the device may choose where to place its tracking storage, and
>> which scheme (bitmap, ring, etc) it might be.
>
> This has been my thinking, yes
>
/me nods

>> The reporting of the dirtying, though, looks hazzy to achieve if you
>> try to make it uniform even to userspace. Perhaps with iommu-fd
>> you're thinking to mmap() the dirty region back to userspace, or an
>> iommu-fd ioctl() updates the PTEs, while letting the kernel clear
>> the dirty status via the mmap() object. And that would be the common
>> API regardless of dirty-hw scheme. Anyway, just thinking out loud.
>
> My general thinking has be that iommufd would control only the system
> IOMMU hardware. The FD interface directly exposes the iommu_domain as
> a manipulable object, so I'd imagine making userspace have a simple
> 1:1 connection to the iommu_ops of a single iommu_domain.
>
> Doing this avoids all the weirdo questions about what do you do if
> there is non-uniformity in the iommu_domain's.
>
> Keeping with that theme the vfio_device would provide a similar
> interface, on its own device FD.
>
> I don't know if mmap should be involed here, the dirty bitmaps are not
> so big, I suspect a simple get_user_pages_fast() would be entirely OK.
>
Considering that is 32MB of a bitmap per TB maybe it is cheap.

>>> VFIO proposed to squash everything
>>> into the container code, but I've been mulling about having iommufd
>>> only do system iommu and push the PCI device internal tracking over to
>>> VFIO.
>>>
>>
>> Seems to me that the juicy part falls mostly in IOMMU vendor code, I am
>> not sure yet how much one can we 'offload' to a generic layer, at least
>> compared with this other proposal.
>
> Yes, I expect there is very little generic code here if we go this
> way. The generic layer is just marshalling the ioctl(s) to the iommu
> drivers. Certainly not providing storage or anything/
>
Gotcha. Perhaps I need to sort out how this would work with iommufd.

But I guess we can hash out how the iommu ops would look like?

We seem to be on the same page, at least that's the feeling I get.

>> Give me some time (few days only, as I gotta sort some things) and I'll
>> respond here as follow up with link to a branch with the WIP/PoC patches.
>
> Great!
>
Here it is. "A few days" turn into a week sorry :/

https://github.com/jpemartins/qemu amd-iommu-hdsup-wip
https://github.com/jpemartins/linux amd-vfio-iommu-hdsup-wip

Note, it is an early PoC. I still need to get the split/collapse thing going,
and fix the FIXMEs there, and have a second good look at the iommu page tables.

The Qemu link above has one patch to get device-dirty bitmaps via HMP to be able to
measure the rate by hw dirties and another patch for emulated amd iommu support too,
should you lack the hardware to play (I usually do on both, more for people to be
able to coverage test it). I sadly cannot do the end-to-end migration test.

>> 3) Dirty bit is sticky, hardware never clears it. Reading the access/dirty
>> bit is cheap, clearing them is 'expensive' because one needs to flush
>> IOTLB as the IOMMU hardware may cache the bits in the IOTLB as a result
>> of an address-translation/io-page-walk. Even though the IOMMU uses interlocked
>> operations to actually update the Access/Dirty bit in concurrency with
>> the CPU. The AMD manuals are a tad misleading as they talk about marking
>> non-present, but that would be catastrophic for migration as it would
>> mean a DMA target abort for the PCI device, unless I missed something obvious.
>> In any case, this means that the dirty bit *clearing* needs to be
>> batched as much as possible, to amortize the cost of flushing the IOTLB.
>> This is the same for Intel *IIUC*.
>
> You have to mark it as non-present to do the final read out if
> something unmaps while the tracker is on - eg emulating a viommu or
> something. Then you mark non-present, flush the iotlb and read back
> the dirty bit.
>
You would be surprised that AMD IOMMUs have also an accelerated vIOMMU
too :) without needing VMM intervention (that's also not supported
in Linux).

But the mark as NP for viommu is something I haven't investigated.

> Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and
> then read and clear them.
>
It's the other way around AIUI. The dirty bits are sticky, so you flush
the IOTLB after clearing as means to notify the IOMMU to set the dirty bits
again on the next memory transaction (or ATS translation).

I am not entirely sure we need to unmap + mark non-present for non-viommu
That would actually mean something is not properly quiscieing the VF DMA.
Maybe we should .. to gate whether if we should actually continue with LM
if something kept doing DMA when it shouldn't have.

Also, considering the dirty-bitmap check happens on very-very long IOVA extents --
some measurements over how long the scans need to be done. Mostly
validated how much a much NIC was dirtying with traditional networking tools.

>> 4) Adjust the granularity of pagetables in place:
>> [This item wasn't done, but it is generic to any IOMMU because it
>> is mostly the ability to split existing IO pages in place.]
>
> This seems like it would be some interesting amount of driver work,
> but yes it could be a generic new iommu_domina op.
>
I am slightly at odds that .split and .collapse at .switch() are enough.
But, with iommu if we are working on top of an IOMMU domain object and
.split and .collapse are iommu_ops perhaps that looks to be enough
flexibility to give userspace the ability to decide what it wants to
split, if it starts eargerly/warming-up tracking dirty pages.

The split and collapsing is something I wanted to work on next, to get
to a stage closer to that of an RFC on the AMD side.

>> 4.b) Optionally starting dirtying earlier (at provisioning) and let
>> userspace dynamically split pages. This is to hopefully minimize the
>> IOTLB miss we induce ourselves in item 4.a) if we were to do eagerly.
>> So dirty tracking would be enabled at creation of the protection domain
>> after the vfio container is set up, and we would use pages dirtied
>> as a indication of what needs to be splited. Problem is for IO page
>> sizes bigger than 1G, which might unnecessarily lead to marking too
>> much as dirty early on; but at least it's better than transferring the
>> whole set.
>
> I'm not sure running with dirty tracking permanently on would be good
> for guest performance either.
>
Hmmm, judging how the IOMMU works I am not sure this is particularly
affecting DMA performance (not sure yet about RDMA, it's something I
curious to see how it gets to perform with 4K IOPTEs, and with dirty
tracking always enabled). Considering how the bits are sticky, and
unless CPU clears it, it's short of a nop? Unless of course the checking
for A^D during an atomic memory transaction is expensive. Needs some
performance testing nonetheless.

The IOTLB flushes pre/after clearing, tough, would visibly hurt DMA
performance :(

I forgot to mention, but the early enablement of IOMMU dirty tracking
was also meant to fully know since guest creation what needs to be
sent to the destination. Otherwise, wouldn't we need to send the whole
pinned set to destination, if we only start tracking dirty pages during
migration?

> I'd suspect you'd be better to have a warm up period where you track
> dirtys and split down pages.
>
Yeah, but right now the splitting is done when switching on and off
dirty tracking, for the entirety of page tables.

And well at least on AMD, the flag is attached to a protection domain.
For SMMUv3. it appears that it's per IOVA range if I read the original code
correctly (which is interesting contrast to AMD/Intel)

Also, this is probably a differentiator for iommufd, if we were to provide
split and collapse semantics to IOMMU domain objects that userspace can use.
That would get more freedom, to switch dirty-tracking, and then do the warm
up thingie and piggy back on what it wants to split before migration.
perhaps the switch() should get some flag to pick where to split, I guess.


> It is interesting, this is a possible reason why device dirty tracking
> might actually perfom better because it can operate at a different
> granularity from the system iommu without disrupting the guest DMA
> performance.

I agree -- and the other advantage is that you don't depend on platform
support for dirty tracking, which is largely nonexistent.

Albeit I personally like to have the IOMMU support because it also
frees endpoints to do rather expensive dirty tracking, which is a rather
even more exclusive feature to have. It's trading off max performance for
LM commodity over, where DMA performance can be less (momentarily).

Joao

2022-02-14 16:11:34

by Joao Martins

[permalink] [raw]
Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

On 2/12/22 00:01, Jason Gunthorpe wrote:
> On Fri, Feb 11, 2022 at 09:43:56PM +0000, Joao Martins wrote:
>> 1. Decodes the read and write intent from the memory access.
>> 2. If P=0 in the page descriptor, fail the access.
>> 3. Compare the A & D bits in the descriptor with the read and write intent in the request.
>> 4. If the A or D bits need to be updated in the descriptor:
>
> Ah, so the dirty update is actually atomic on the first write before
> any DMA happens - and I suppose all of this happens when the entry is
> first loaded into the IOTLB.
>
Intel's equivalent feature suggests me that this works the same way there.

ARM update of ioptes looks to be slightly different[*] but this FEAT_BBM
in SMMUv3.2 makes it work in similar way to Intel/AMD. But I could be
misunderstanding something there.

[*] apparently we need to write an invalid entry first, invalidate the {IO}TLB
and then write the new valid entry. Not sure I understood correctly that this
is the 'break-before-make' thingie.

> So the flush is to allow the IOTLB to see the cleared D bit..
>
Right.

>>> split/collapse seems kind of orthogonal to me it doesn't really
>>> connect to dirty tracking other than being mostly useful during dirty
>>> tracking.
>>>
>>> And I wonder how hard split is when trying to atomically preserve any
>>> dirty bit..
>>>
>> Would would it be hard? The D bit is supposed to be replicated when you
>> split to smaller page size.
>
> I guess it depends on how the 'acquire' is done, as the CPU has to
> atomically replace a large entry with a pointer to a small entry,
> flush the IOTLB then 'acquire' the dirty bit. If the dirty bit is set
> in the old entry then it has to sprinkle it into the new entries with
> atomics.
>
ISTR some mention to what we are chatting here in the IOMMU SDM:

When a non-default page size is used , software must OR the Dirty bits in
all of the replicated host PTEs used to map the page. The IOMMU does not
guarantee the Dirty bits are set in all of the replicated PTEs. Any portion
of the page may have been written even if the Dirty bit is set in only one
of the replicated PTEs.

>> I wonder if we could start progressing the dirty tracking as a first initial series and
>> then have the split + collapse handling as a second part? That would be quite
>> nice to get me going! :D
>
> I think so, and I think we should. It is such a big problem space, it
> needs to get broken up.

OK, cool! I'll stick with the same (slimmed down) IOMMU+VFIO interface as proposed in the
past except with the x86 support only[*]. And we poke holes there I guess.

[*] I might include Intel too, albeit emulated only.

2022-02-15 16:41:35

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

On Tue, Feb 15, 2022 at 04:00:35PM +0000, Joao Martins wrote:
> On 2/14/22 14:06, Jason Gunthorpe wrote:
> > On Mon, Feb 14, 2022 at 01:34:15PM +0000, Joao Martins wrote:
> >
> >> [*] apparently we need to write an invalid entry first, invalidate the {IO}TLB
> >> and then write the new valid entry. Not sure I understood correctly that this
> >> is the 'break-before-make' thingie.
> >
> > Doesn't that explode if the invalid entry is DMA'd to?
> >
> Yes, IIUC. Also, the manual has this note:

Heh, sounds like "this doesn't work" to me :)

> > Like I said, I'd prefer we not build more on the VFIO type 1 code
> > until we have a conclusion for iommufd..
> >
>
> I didn't quite understand what you mean by conclusion.

If people are dead-set against doing iommufd, then lets abandon the
idea and go back to hacking up vfio.

> If by conclusion you mean the whole thing to be merged, how can the work be
> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
> in terms of direction...

I think go ahead and build it on top of iommufd, start working out the
API details, etc. I think once the direction is concluded the new APIs
will go forward.

> > While returning the dirty data looks straight forward, it is hard to
> > see an obvious path to enabling and controlling the system iommu the
> > way vfio is now.
>
> It seems strange to have a whole UAPI for userspace [*] meant to
> return dirty data to userspace, when dirty right now means the whole
> pinned page set and so copying the whole guest ...

Yes, the whole thing is only partially implemented, and doesn't have
any in-kernel user. It is another place holder for an implementation
to come someday.

> Hence my thinking was that the patches /if small/ would let us see how dirty
> tracking might work for iommu kAPI (and iommufd) too.

It could be tried, but I think if you go into there you will find it
quickly turns quite complicated to address all the edge cases. Eg what
do you do if you have a mdev present after you turn on system
tracking? What if the mdev is using a PASID? What about hotplug of new
VFIO devices?

Remember, dirty tracking for vfio is totally useless without also
having vfio device migration. Do you already have a migration capable
device to use with this?

> Would it be better to do more iterative steps (when possible) as opposed to
> scratch and rebuild VFIO type1 IOMMU handling?

Possibly, but every thing that gets added has to be carried over to
the new code too, and energy has to be expended trying to figure out
how the half implemented stuff should work while finishing it.

At the very least we must decide what to do with device-provided dirty
tracking before the VFIO type1 stuff can be altered to use the system
IOMMU.

This is very much like the migration FSM, the only appeal is the
existing qemu implementation of the protocol.

Jason

2022-02-22 12:43:01

by Joao Martins

[permalink] [raw]
Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

On 2/15/22 16:21, Jason Gunthorpe wrote:
> On Tue, Feb 15, 2022 at 04:00:35PM +0000, Joao Martins wrote:
>> On 2/14/22 14:06, Jason Gunthorpe wrote:
>>> On Mon, Feb 14, 2022 at 01:34:15PM +0000, Joao Martins wrote:
>>>
>>>> [*] apparently we need to write an invalid entry first, invalidate the {IO}TLB
>>>> and then write the new valid entry. Not sure I understood correctly that this
>>>> is the 'break-before-make' thingie.
>>>
>>> Doesn't that explode if the invalid entry is DMA'd to?
>>>
>> Yes, IIUC. Also, the manual has this note:
>
> Heh, sounds like "this doesn't work" to me :)
>
Yeah, but I remember reading in manual that HTTUD (what ARM calls it for dirty
tracking, albeit DBM is another term for the same thing) requires FEAT_BBM
which avoids us to play the above games. So, supposedly, we can "just"
use atomics with IOPTE changes and IOTLB flush. Not if we need the latter
flush before or after on smmuv3.

>>> Like I said, I'd prefer we not build more on the VFIO type 1 code
>>> until we have a conclusion for iommufd..
>>>
>>
>> I didn't quite understand what you mean by conclusion.
>
> If people are dead-set against doing iommufd, then lets abandon the
> idea and go back to hacking up vfio.
>
Heh, I was under the impression everybody was investing so much *because*
that direction was set onto iommufd direction.

>> If by conclusion you mean the whole thing to be merged, how can the work be
>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>> in terms of direction...
>
> I think go ahead and build it on top of iommufd, start working out the
> API details, etc. I think once the direction is concluded the new APIs
> will go forward.
>
/me nods, will do. Looking at your repository it is looking good.

>>> While returning the dirty data looks straight forward, it is hard to
>>> see an obvious path to enabling and controlling the system iommu the
>>> way vfio is now.
>>
>> It seems strange to have a whole UAPI for userspace [*] meant to
>> return dirty data to userspace, when dirty right now means the whole
>> pinned page set and so copying the whole guest ...
>
> Yes, the whole thing is only partially implemented, and doesn't have
> any in-kernel user. It is another place holder for an implementation
> to come someday.
>
Yeap, seems like.

>> Hence my thinking was that the patches /if small/ would let us see how dirty
>> tracking might work for iommu kAPI (and iommufd) too.
>
> It could be tried, but I think if you go into there you will find it
> quickly turns quite complicated to address all the edge cases. Eg what
> do you do if you have a mdev present after you turn on system
> tracking? What if the mdev is using a PASID?
> What about hotplug of new
> VFIO devices?
>
> Remember, dirty tracking for vfio is totally useless without also
> having vfio device migration.

Oh yes -- I am definitely aware. IOMMU/Device Dirty tracking is useless
if we can't do the device part first. But if quiescing DMA and saving
state are two hard requirements that are mandatory for a live migrateable
VF, having dirty tracking in the devices I suspect might be more rare.
So perhaps people will look at IOMMUs as a commodity-workaround to avoid
a whole bunch of hardware logic for dirty tracking, even bearing what it
entails for DMA performance (hisilicon might be an example).

> Do you already have a migration capable
> device to use with this?
>
Not yet, but soon I hope.

>> Would it be better to do more iterative steps (when possible) as opposed to
>> scratch and rebuild VFIO type1 IOMMU handling?
>
> Possibly, but every thing that gets added has to be carried over to
> the new code too, and energy has to be expended trying to figure out
> how the half implemented stuff should work while finishing it.
>
/me nods I understand

> At the very least we must decide what to do with device-provided dirty
> tracking before the VFIO type1 stuff can be altered to use the system
> IOMMU.
>
I, too, have been wondering what that is going to look like -- and how do we
convey the setup of dirty tracking versus the steering of it.

> This is very much like the migration FSM, the only appeal is the
> existing qemu implementation of the protocol.

Yeah.

2022-02-23 02:34:25

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:

> > If people are dead-set against doing iommufd, then lets abandon the
> > idea and go back to hacking up vfio.
> >
> Heh, I was under the impression everybody was investing so much *because*
> that direction was set onto iommufd direction.

Such is the hope :)

> >> If by conclusion you mean the whole thing to be merged, how can the work be
> >> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
> >> in terms of direction...
> >
> > I think go ahead and build it on top of iommufd, start working out the
> > API details, etc. I think once the direction is concluded the new APIs
> > will go forward.
> >
> /me nods, will do. Looking at your repository it is looking good.

I would like to come with some plan for dirty tracking on iommufd and
combine that with a plan for dirty tracking inside the new migration
drivers.

> Oh yes -- I am definitely aware. IOMMU/Device Dirty tracking is useless
> if we can't do the device part first. But if quiescing DMA and saving
> state are two hard requirements that are mandatory for a live migrateable
> VF, having dirty tracking in the devices I suspect might be more
> rare.

So far all but one of the live migration devices I know about can do
dirty tracking internally..

> So perhaps people will look at IOMMUs as a commodity-workaround to avoid
> a whole bunch of hardware logic for dirty tracking, even bearing what it
> entails for DMA performance (hisilicon might be an example).

I do expect this will be true.

> > At the very least we must decide what to do with device-provided dirty
> > tracking before the VFIO type1 stuff can be altered to use the system
> > IOMMU.
>
> I, too, have been wondering what that is going to look like -- and how do we
> convey the setup of dirty tracking versus the steering of it.

What I suggested was to just split them.

Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
would be on a per-domain basis, not on the ioas.

Some ioctl toward the vfio device will turn on the device's tracker.

Userspace has to iterate and query each enabled tracker. This is not
so bad because the time to make the system call is going to be tiny
compared to the time to marshal XXGB of dirty bits.

This makes qemu more complicated because it has to decide what
trackers to turn on, but that is also the point because we do want
userspace to be able to decide.

The other idea that has some possible interest is to allow the
trackers to dump their dirty bits into the existing kvm tracker, then
userspace just does a single kvm centric dirty pass.

Jason

2022-02-28 21:02:16

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

On Mon, Feb 28, 2022 at 01:01:39PM +0000, Joao Martins wrote:
> On 2/25/22 20:44, Jason Gunthorpe wrote:
> > On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
> >> On 2/23/22 01:03, Jason Gunthorpe wrote:
> >>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
> >>>>>> If by conclusion you mean the whole thing to be merged, how can the work be
> >>>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
> >>>>>> in terms of direction...
> >>>>>
> >>>>> I think go ahead and build it on top of iommufd, start working out the
> >>>>> API details, etc. I think once the direction is concluded the new APIs
> >>>>> will go forward.
> >>>>>
> >>>> /me nods, will do. Looking at your repository it is looking good.
> >>>
> >>> I would like to come with some plan for dirty tracking on iommufd and
> >>> combine that with a plan for dirty tracking inside the new migration
> >>> drivers.
> >>>
> >> I had a few things going on my end over the past weeks, albeit it is
> >> getting a bit better now and I will be coming back to this topic. I hope/want
> >> to give you a more concrete update/feedback over the coming week or two wrt
> >> to dirty-tracking+iommufd+amd.
> >>
> >> So far, I am not particularly concerned that this will affect overall iommufd
> >> design. The main thing is really lookups to get vendor iopte, upon on what might
> >> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling
> >> the tracking,
> >
> > I'm not very keen on these multiplexer interfaces. I think you should
> > just add a new ops to the new iommu_domain_ops 'set_dirty_tracking'
> > 'read_dirty_bits'
> >
> > NULL op means not supported.
> >
> > IMHO we don't need a kapi wrapper if only iommufd is going to call the
> > op.
> >
>
> OK, good point.
>
> Even without a kapi wrapper I am still wondering whether the iommu op needs to
> be something like a generic iommu feature toggling (e.g. .set_feature()), rather
> than one that sits "hardcoded" as set_dirty(). Unless dirty right now is about
> the only feature we will change out-of-band in the protection-domain.
> I guess I can stay with set_ad_tracking/set_dirty_tracking and if should
> need arise we will expand with a generic .set_feature(dom, IOMMU_DIRTY | IOMMU_ACCESS).

I just generally dislike multiplexers like this. We are already
calling through a function pointer struct, why should the driver
implement another switch/case just to find out which function pointer
the caller really ment to use? It doesn't make things faster, it
doesn't make things smaller, it doesn't use less LOC. Why do it?

> Regarding the dirty 'data' that's one that I am wondering about. I called it 'sync'
> because the sync() doesn't only read, but also "writes" to root pagetable to update
> the dirty bit (and then IOTLB flush). And that's about what VFIO current interface
> does (i.e. there's only a GET_BITMAP in the ioctl) and no explicit interface to clear.

'read and clear' is what I'd suggest

> And TBH, the question on whether we need a clear op isn't immediately obvious: reading
> the access/dirty bit is cheap for the IOMMU, the problem OTOH is the expensive
> io page table walk thus expensive in sw. The clear-dirty part, though, is precise on what
> it wants to clear (in principle cheaper on io-page-table walk as you just iterate over
> sets of bits to clear) but then it incurs a DMA perf hit given that we need to flush the
> IOTLBs. Given the IOTLB flush is batched (over a course of a dirty updates) perhaps this
> isn't immediately clear that is a problem in terms of total overall ioctl cost. Hence my
> thinking in merging both in one sync_dirty_bitmap() as opposed to more KVM-style of
> get_dirty_bitmap() and clear_dirty_ditmap().

Yes, I wouldn't split them.

> > Questions I have:
> > - Do we need ranges for some reason? You mentioned ARM SMMU wants
> > ranges? how/what/why?
> >
> Ignore that. I got mislead by the implementation and when I read the SDM
> I realized that the implementation was doing the same thing I was
> doing

Ok

> > - What about the unmap and read dirty without races operation that
> > vfio has?
> >
> I am afraid that might need a new unmap iommu op that reads the dirty bit
> after clearing the page table entry. It's marshalling the bits from
> iopte into a bitmap as opposed to some logic added on top. As far as I
> looked for AMD this isn't difficult to add, (same for Intel) it can
> *I think* reuse all of the unmap code.

Ok. It feels necessary to be complete

> > Yes, this is a point that needs some answering. One option is to pass
> > in the tracking range list from userspace. Another is to query it in
> > the driver from the currently mapped areas in IOAS.
> >
> I sort of like the second given that we de-duplicate info that is already
> tracked by IOAS -- it would be mostly internal and then it would be a
> matter of when does this device tracker is set up, and whether we should
> differentiate tracker "start"/"stop" vs "setup"/"teardown".

One problem with this is that devices that don't support dynamic
tracking are stuck in vIOMMU cases where the IOAS will have some tiny
set of all memory mapped.

So I'd probably say qemu should forward the entire guest CPU memory
space, as well as any future memory hotplug area, as ranges and not
rely on the IOAS already mapping anything.

> > I know devices have limitations here in terms of how many/how big the
> > ranges can be, and devices probably can't track dynamic changes.
> >
> /me nods
>
> Should this be some sort of capability perhaps? Given that this may limit
> how many concurrent VFs can be migrated and how much ranges it can store,
> for example.
>
> (interestingly and speaking of VF capabilities, it's yet another thing to
> tackle in the migration stream between src and dst hosts)

I don't know what to do with these limitations right now..

Jason

2022-03-01 16:58:14

by Joao Martins

[permalink] [raw]
Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

On 3/1/22 13:54, Jason Gunthorpe wrote:
> On Tue, Mar 01, 2022 at 01:06:30PM +0000, Joao Martins wrote:
>
>> I concur with you in the above but I don't mean it like a multiplexer.
>> Rather, mimicking the general nature of feature bits in the protection domain
>> (or the hw pagetable abstraction). So hypothetically for every bit ... if you
>> wanted to create yet another op that just flips a bit of the
>> DTEs/PASID-table/CD it would be an excessive use of callbacks to get
>> to in the iommu_domain_ops if they all set do the same thing.
>> Right now it's only Dirty (Access bit I don't see immediate use for it
>> right now) bits, but could there be more? Perhaps this is something to
>> think about.
>
> Not sure it matters :)
>
Hehe, most likely it doesn't :)

>>> One problem with this is that devices that don't support dynamic
>>> tracking are stuck in vIOMMU cases where the IOAS will have some tiny
>>> set of all memory mapped.
>>>
>> Sorry to be pedantic, when you say 'dynamic tracking' for you it just means
>> that you have no limitation of ranges and fw/hw can cope with being fed of
>> 'new-ranges' to enable dirty-tracking.
>
> Yes, the ranges can change once the tracking starts, like the normal
> IOMMU can do
>
> We are looking at devices where the HW can track a range at the start
> of tracking and that range cannot change over time.
>
> So, we need to track the whole guest memory and some extra for memory
> hot plug, not just the currently viommu mapped things.

I'm aware of the guest memmap accounting foo, so I was already factoring
that in the equation. I was just mainly grokking at hardware limitations
you are coming from to be captured in these future VFIO extensions.

Tracking the max possible GPA (as you hinted earlier) could solve the two natures
of write-tracking, also given that the IOMMU ultimately protects what's really
allowed to be DMA-written to. Unless such tracking-bitmap is placed on device
memory, where space probably needs to be more carefully managed.

Anyways, thanks for the insights!

2022-03-01 18:29:32

by Joao Martins

[permalink] [raw]
Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

On 2/28/22 21:01, Jason Gunthorpe wrote:
> On Mon, Feb 28, 2022 at 01:01:39PM +0000, Joao Martins wrote:
>> On 2/25/22 20:44, Jason Gunthorpe wrote:
>>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>>>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>>>>> If by conclusion you mean the whole thing to be merged, how can the work be
>>>>>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>>>>>>>> in terms of direction...
>>>>>>>
>>>>>>> I think go ahead and build it on top of iommufd, start working out the
>>>>>>> API details, etc. I think once the direction is concluded the new APIs
>>>>>>> will go forward.
>>>>>>>
>>>>>> /me nods, will do. Looking at your repository it is looking good.
>>>>>
>>>>> I would like to come with some plan for dirty tracking on iommufd and
>>>>> combine that with a plan for dirty tracking inside the new migration
>>>>> drivers.
>>>>>
>>>> I had a few things going on my end over the past weeks, albeit it is
>>>> getting a bit better now and I will be coming back to this topic. I hope/want
>>>> to give you a more concrete update/feedback over the coming week or two wrt
>>>> to dirty-tracking+iommufd+amd.
>>>>
>>>> So far, I am not particularly concerned that this will affect overall iommufd
>>>> design. The main thing is really lookups to get vendor iopte, upon on what might
>>>> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling
>>>> the tracking,
>>>
>>> I'm not very keen on these multiplexer interfaces. I think you should
>>> just add a new ops to the new iommu_domain_ops 'set_dirty_tracking'
>>> 'read_dirty_bits'
>>>
>>> NULL op means not supported.
>>>
>>> IMHO we don't need a kapi wrapper if only iommufd is going to call the
>>> op.
>>>
>>
>> OK, good point.
>>
>> Even without a kapi wrapper I am still wondering whether the iommu op needs to
>> be something like a generic iommu feature toggling (e.g. .set_feature()), rather
>> than one that sits "hardcoded" as set_dirty(). Unless dirty right now is about
>> the only feature we will change out-of-band in the protection-domain.
>> I guess I can stay with set_ad_tracking/set_dirty_tracking and if should
>> need arise we will expand with a generic .set_feature(dom, IOMMU_DIRTY | IOMMU_ACCESS).
>
> I just generally dislike multiplexers like this. We are already
> calling through a function pointer struct, why should the driver
> implement another switch/case just to find out which function pointer
> the caller really ment to use? It doesn't make things faster, it
> doesn't make things smaller, it doesn't use less LOC. Why do it?
>
I concur with you in the above but I don't mean it like a multiplexer.
Rather, mimicking the general nature of feature bits in the protection domain
(or the hw pagetable abstraction). So hypothetically for every bit ... if you
wanted to create yet another op that just flips a bit of the
DTEs/PASID-table/CD it would be an excessive use of callbacks to get
to in the iommu_domain_ops if they all set do the same thing.
Right now it's only Dirty (Access bit I don't see immediate use for it
right now) bits, but could there be more? Perhaps this is something to
think about.

Anyways, I am anyways sticking with plain ops function.

>> Regarding the dirty 'data' that's one that I am wondering about. I called it 'sync'
>> because the sync() doesn't only read, but also "writes" to root pagetable to update
>> the dirty bit (and then IOTLB flush). And that's about what VFIO current interface
>> does (i.e. there's only a GET_BITMAP in the ioctl) and no explicit interface to clear.
>
> 'read and clear' is what I'd suggest
>
/me nods


>>> - What about the unmap and read dirty without races operation that
>>> vfio has?
>>>
>> I am afraid that might need a new unmap iommu op that reads the dirty bit
>> after clearing the page table entry. It's marshalling the bits from
>> iopte into a bitmap as opposed to some logic added on top. As far as I
>> looked for AMD this isn't difficult to add, (same for Intel) it can
>> *I think* reuse all of the unmap code.
>
> Ok. It feels necessary to be complete
>
Yes, I guess it's mandatory if we want to fully implement vfio-compat dirty
tracking IOCTLs, too.

>>> Yes, this is a point that needs some answering. One option is to pass
>>> in the tracking range list from userspace. Another is to query it in
>>> the driver from the currently mapped areas in IOAS.
>>>
>> I sort of like the second given that we de-duplicate info that is already
>> tracked by IOAS -- it would be mostly internal and then it would be a
>> matter of when does this device tracker is set up, and whether we should
>> differentiate tracker "start"/"stop" vs "setup"/"teardown".
>
> One problem with this is that devices that don't support dynamic
> tracking are stuck in vIOMMU cases where the IOAS will have some tiny
> set of all memory mapped.
>
Sorry to be pedantic, when you say 'dynamic tracking' for you it just means
that you have no limitation of ranges and fw/hw can cope with being fed of
'new-ranges' to enable dirty-tracking. Or does it mean something else at
hw level? Like dirty bitmap being pulled out of device memory, and hw keeps
its own state of dirtying and sw 'just' needs to sync the bitmap every scan.