VFIO "driver" development has moved to a publicly accessible repository
on github:
git://github.com/pugs/vfio-linux-2.6.git
This is a clone of the Linux-2.6 tree with all VFIO changes on the vfio
branch (which is the default). There is a tag 'vfio-v6' marking the latest
"release" of VFIO.
In addition, I am open-sourcing my user level code which uses VFIO.
It is a simple UDP/IP/Ethernet stack supporting 3 different VFIO-based
hardware drivers. This code is available at:
git://github.com/pugs/vfio-user-level-drivers.git
On Mon, 2010-11-22 at 15:21 -0800, Tom Lyon wrote:
> VFIO "driver" development has moved to a publicly accessible repository
> on github:
>
> git://github.com/pugs/vfio-linux-2.6.git
>
> This is a clone of the Linux-2.6 tree with all VFIO changes on the vfio
> branch (which is the default). There is a tag 'vfio-v6' marking the latest
> "release" of VFIO.
>
> In addition, I am open-sourcing my user level code which uses VFIO.
> It is a simple UDP/IP/Ethernet stack supporting 3 different VFIO based
> hardware drivers. This code is available at:
>
> git://github.com/pugs/vfio-user-level-drivers.git
So I do have some concerns about this...
So first, before I go into the meat of my issues, let's just drop a
quick one about the interface: why netlink? I find it horrible
myself... it just confuses everything and adds overhead. ioctls would have
been a better choice imho.
Now, my actual issues, which in fact extend to the whole "generic" iommu
APIs that have been added to drivers/pci for "domains", and that in
turn "stain" VFIO in ways that I'm not sure I can use on POWER...
I would appreciate your input on what you think is the best way for me to
solve some of these "mismatches" between our HW and this design.
Basically, the whole iommu domain stuff has been entirely designed
around the idea that you can create those "domains" which are each an
entire address space, and put devices in there.
This is sadly not how the IBM iommus work on POWER today...
I have currently one "shared" DMA address space (per host bridge), but I
can assign regions of it to different devices (and I have limited
filtering capabilities so basically, a bus per region, a device per
region or a function per region).
That means essentially that I cannot just create a mapping for the DMA
addresses I want, but instead, need to have some kind of "allocator" for
DMA translations (which we have in the kernel, ie, dma_map/unmap use a
bitmap allocator).
I generally have 2 regions per device: one in 32-bit space of quite
limited size (sometimes as small as a 128M window) and one in 64-bit
space that I can make quite large if I need to, enough to map all of
memory if that's really desired, using large pages or something like
that.
Now that has various consequences vs. the interfaces between iommu
domains and qemu, and VFIO:
- I don't quite see how I can translate the concept of domains and
attaching devices to such domains. The basic idea won't work. The
domains in my case are essentially pre-existing, not created on-the-fly,
and may contain multiple devices, though I suppose I can assume for now that
we only support KVM pass-through with 1 device == 1 domain.
I don't know how to sort that one out if the userspace or kvm code
assumes it can put multiple devices in one domain and they start to
magically share the translations...
Not sure what the right approach here is. I could make the "Linux"
domain some artificial SW construct that contains a list of the real
iommus it's "bound" to and establish translations in all of them... but
that isn't very efficient. If the guest kernel explicitly uses some
iommu PV ops targeting a device, I need to only set up translations for
-that- device, not everything in the "domain".
- The code in virt/kvm/iommu.c that assumes it can map the entire guest
memory 1:1 in the IOMMU is just not usable for us that way. We -might-
be able to do that for 64-bit capable devices as we can create quite
large regions in the 64-bit space, but at the very least we need some
kind of offset, and the guest must know about it...
- Similar deal with all the code that currently assumes it can pick up a
"virtual" address and create a mapping from that. Either we provide an
allocator, or if we want to keep the flexibility of userspace/kvm
choosing the virtual addresses (preferable), we need to convey some
"ranges" information down to the user.
- Finally, my guests are always paravirt. There are well-defined H-calls
for inserting/removing DMA translations and we're implementing these
since existing kernels already know how to use them. That means that
overall, I might simply not need to use any of the above.
IE. I could have my own infrastructure for iommu, my H-calls populating
the target iommu directly from the kernel (kvm) or qemu (via ioctls in
the non-kvm case). Might be the best option ... but that would mean
somewhat disentangling VFIO from uiommu...
Any suggestions? Great ideas?
Cheers,
Ben.
On Monday, December 20, 2010 09:37:33 pm Benjamin Herrenschmidt wrote:
> Hi Tom, just wrote that to linux-pci in reply to your VFIO announce,
> but your email bounced. Alex gave me your ieee one instead, I'm sending
> this copy to you, please feel free to reply on the list !
>
> Cheers,
> Ben.
>
> On Tue, 2010-12-21 at 16:29 +1100, Benjamin Herrenschmidt wrote:
> > On Mon, 2010-11-22 at 15:21 -0800, Tom Lyon wrote:
> > > VFIO "driver" development has moved to a publicly accessible repository
> > > on github:
> > >
> > > git://github.com/pugs/vfio-linux-2.6.git
> > >
> > > This is a clone of the Linux-2.6 tree with all VFIO changes on the vfio
> > > branch (which is the default). There is a tag 'vfio-v6' marking the
> > > latest "release" of VFIO.
> > >
> > > In addition, I am open-sourcing my user level code which uses VFIO.
> > > It is a simple UDP/IP/Ethernet stack supporting 3 different VFIO based
> > > hardware drivers. This code is available at:
> > >
> > > git://github.com/pugs/vfio-user-level-drivers.git
> >
> > So I do have some concerns about this...
> >
> > So first, before I go into the meat of my issues, let's just drop a
> > quick one about the interface: why netlink? I find it horrible
> > myself... it just confuses everything and adds overhead. ioctls would have
> > been a better choice imho.
> >
> > Now, my actual issues, which in fact extend to the whole "generic" iommu
> > APIs that have been added to drivers/pci for "domains", and that in
> > turn "stain" VFIO in ways that I'm not sure I can use on POWER...
> >
> > I would appreciate your input on what you think is the best way for me to
> > solve some of these "mismatches" between our HW and this design.
> >
> > Basically, the whole iommu domain stuff has been entirely designed
> > around the idea that you can create those "domains" which are each an
> > entire address space, and put devices in there.
> >
> > This is sadly not how the IBM iommus work on POWER today...
> >
> > I have currently one "shared" DMA address space (per host bridge), but I
> > can assign regions of it to different devices (and I have limited
> > filtering capabilities so basically, a bus per region, a device per
> > region or a function per region).
> >
> > That means essentially that I cannot just create a mapping for the DMA
> > addresses I want, but instead, need to have some kind of "allocator" for
> > DMA translations (which we have in the kernel, ie, dma_map/unmap use a
> > bitmap allocator).
> >
> > I generally have 2 regions per device: one in 32-bit space of quite
> > limited size (sometimes as small as a 128M window) and one in 64-bit
> > space that I can make quite large if I need to, enough to map all of
> > memory if that's really desired, using large pages or something like
> > that.
> >
> > Now that has various consequences vs. the interfaces between iommu
> > domains and qemu, and VFIO:
> >
> > - I don't quite see how I can translate the concept of domains and
> > attaching devices to such domains. The basic idea won't work. The
> > domains in my case are essentially pre-existing, not created on-the-fly,
> > and may contain multiple devices, though I suppose I can assume for now that
> > we only support KVM pass-through with 1 device == 1 domain.
> >
> > I don't know how to sort that one out if the userspace or kvm code
> > assumes it can put multiple devices in one domain and they start to
> > magically share the translations...
> >
> > Not sure what the right approach here is. I could make the "Linux"
> > domain some artificial SW construct that contains a list of the real
> > iommus it's "bound" to and establish translations in all of them... but
> > that isn't very efficient. If the guest kernel explicitly uses some
> > iommu PV ops targeting a device, I need to only set up translations for
> > -that- device, not everything in the "domain".
> >
> > - The code in virt/kvm/iommu.c that assumes it can map the entire guest
> > memory 1:1 in the IOMMU is just not usable for us that way. We -might-
> > be able to do that for 64-bit capable devices as we can create quite
> > large regions in the 64-bit space, but at the very least we need some
> > kind of offset, and the guest must know about it...
> >
> > - Similar deal with all the code that currently assumes it can pick up a
> > "virtual" address and create a mapping from that. Either we provide an
> > allocator, or if we want to keep the flexibility of userspace/kvm
> > choosing the virtual addresses (preferable), we need to convey some
> > "ranges" information down to the user.
> >
> > - Finally, my guests are always paravirt. There are well-defined H-calls
> > for inserting/removing DMA translations and we're implementing these
> > since existing kernels already know how to use them. That means that
> > overall, I might simply not need to use any of the above.
> >
> > IE. I could have my own infrastructure for iommu, my H-calls populating
> > the target iommu directly from the kernel (kvm) or qemu (via ioctls in
> > the non-kvm case). Might be the best option ... but that would mean
> > somewhat disentangling VFIO from uiommu...
> >
> > Any suggestions ? Great ideas ?
Ben - I don't have any good news for you.
DMA remappers like those on Power and Sparc have been around forever; the new
thing about Intel/AMD iommus is the per-device address spaces and the
protection inherent in having separate mappings for each device. If one is to
trust a user-level app or virtual machine to program DMA registers directly,
then you really need per-device translation.
That said, early versions of VFIO had a mapping mode that used the normal DMA
API instead of the iommu/uiommu api and assumed that the user was trusted, but
that wasn't interesting for the long term.
So if you want safe device assignment you're going to need hardware help.
> >
> > Cheers,
> > Ben.
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2010-12-21 at 16:29 +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2010-11-22 at 15:21 -0800, Tom Lyon wrote:
> > VFIO "driver" development has moved to a publicly accessible repository
> > on github:
> >
> > git://github.com/pugs/vfio-linux-2.6.git
> >
> > This is a clone of the Linux-2.6 tree with all VFIO changes on the vfio
> > branch (which is the default). There is a tag 'vfio-v6' marking the latest
> > "release" of VFIO.
> >
> > In addition, I am open-sourcing my user level code which uses VFIO.
> > It is a simple UDP/IP/Ethernet stack supporting 3 different VFIO based
> > hardware drivers. This code is available at:
> >
> > git://github.com/pugs/vfio-user-level-drivers.git
>
> So I do have some concerns about this...
>
> So first, before I go into the meat of my issues, let's just drop a
> quick one about the interface: why netlink? I find it horrible
> myself... it just confuses everything and adds overhead. ioctls would have
> been a better choice imho.
I mentioned on irc, but I'll repeat here, ioctls are used for all the
basic interactions. IIRC, netlink is only used for error handling,
which I haven't tried to link into qemu yet.
> Now, my actual issues, which in fact extend to the whole "generic" iommu
> APIs that have been added to drivers/pci for "domains", and that in
> turn "stain" VFIO in ways that I'm not sure I can use on POWER...
>
> I would appreciate your input on what you think is the best way for me to
> solve some of these "mismatches" between our HW and this design.
>
> Basically, the whole iommu domain stuff has been entirely designed
> around the idea that you can create those "domains" which are each an
> entire address space, and put devices in there.
>
> This is sadly not how the IBM iommus work on POWER today...
>
> I have currently one "shared" DMA address space (per host bridge), but I
> can assign regions of it to different devices (and I have limited
> filtering capabilities so basically, a bus per region, a device per
> region or a function per region).
>
> That means essentially that I cannot just create a mapping for the DMA
> addresses I want, but instead, need to have some kind of "allocator" for
> DMA translations (which we have in the kernel, ie, dma_map/unmap use a
> bitmap allocator).
>
> I generally have 2 regions per device: one in 32-bit space of quite
> limited size (sometimes as small as a 128M window) and one in 64-bit
> space that I can make quite large if I need to, enough to map all of
> memory if that's really desired, using large pages or something like
> that.
>
> Now that has various consequences vs. the interfaces between iommu
> domains and qemu, and VFIO:
>
> - I don't quite see how I can translate the concept of domains and
> attaching devices to such domains. The basic idea won't work. The
> domains in my case are essentially pre-existing, not created on-the-fly,
> and may contain multiple devices, though I suppose I can assume for now that
> we only support KVM pass-through with 1 device == 1 domain.
Yep, your hardware may impose some usage restrictions and the iommu
domain may just be an index to the pre-existing context.
> I don't know how to sort that one out if the userspace or kvm code
> assumes it can put multiple devices in one domain and they start to
> magically share the translations...
I misremembered how I have this wired in the vfio qemu side when we
spoke on irc. Userspace gets to make the decision which device is
associated to which iommu, and the command line options (when invoked
through libvirt) allows any combination of putting all the devices into
separate or shared iommu domains. I expect that on x86 libvirt might
want to use a single domain for all passthrough devices for a given
guest (to reduce iommu tlb pressure), but it would also be entirely
valid to put each in its own, which is the behavior you'd want. From
the qemu command line (not using vfiofd & uiommufd), each device gets
its own domain.
> Not sure what the right approach here is. I could make the "Linux"
> domain some artificial SW construct that contains a list of the real
> iommus it's "bound" to and establish translations in all of them... but
> that isn't very efficient. If the guest kernel explicitly uses some
> iommu PV ops targeting a device, I need to only set up translations for
> -that- device, not everything in the "domain".
>
> - The code in virt/kvm/iommu.c that assumes it can map the entire guest
> memory 1:1 in the IOMMU is just not usable for us that way. We -might-
> be able to do that for 64-bit capable devices as we can create quite
> large regions in the 64-bit space, but at the very least we need some
> kind of offset, and the guest must know about it...
Note that this is used for current qemu-kvm device assignment, not vfio
assignment. For x86 qemu vfio we register each virtual to guest
physical mapping using VFIO_DMA_MAP_IOVA. Since you only have a portion
of the address space available and plan to trap existing guest
hypercalls to create mappings, you'll want to set these up and tear them
down dynamically. x86 is expecting non-iommu aware guests, so we need
to handle the entire guest address space, which maps nicely with how the
current generation of x86 iommus work.
> - Similar deal with all the code that currently assumes it can pick up a
> "virtual" address and create a mapping from that. Either we provide an
> allocator, or if we want to keep the flexibility of userspace/kvm
> choosing the virtual addresses (preferable), we need to convey some
> "ranges" information down to the user.
>
> - Finally, my guests are always paravirt. There are well-defined H-calls
> for inserting/removing DMA translations and we're implementing these
> since existing kernels already know how to use them. That means that
> overall, I might simply not need to use any of the above.
>
> IE. I could have my own infrastructure for iommu, my H-calls populating
> the target iommu directly from the kernel (kvm) or qemu (via ioctls in
> the non-kvm case). Might be the best option ... but that would mean
> somewhat disentangling VFIO from uiommu...
>
> Any suggestions ? Great ideas ?
It seems like vfio could still work for you. You have a restricted
iommu address space, but you can also expect your guests to make use of
hcalls to set up mappings, which gives you a way to use your resources
more sparingly. So I could imagine a model where your hcalls end up
calling into qemu vfio userspace, which does the guest physical to host
virtual translation (or some kind of allocator function). That calls
the VFIO_DMA_MAP_IOVA ioctl to map or unmap the region. You then
need to implement an iommu interface in the host that performs the hand
waving of inserting that mapping into the translation for the device.
You probably still want something like the uiommu interface and
VFIO_DOMAIN_SET call to create a context for a device for the security
restrictions that Tom mentions, even if the mapping back to hardware
page tables is less direct than it is on x86.
Alex
On Tue, 2010-12-21 at 11:48 -0800, Tom Lyon wrote:
>
> Ben - I don't have any good news for you.
>
> DMA remappers like on Power and Sparc have been around forever, the new thing
> about Intel/AMD iommus is the per-device address spaces and the protection
> inherent in having separate mappings for each device. If one is to trust a
> user level app or virtual machine to program DMA registers directly, then you
> really need per device translation.
Right, and we had that for a while too on our PCIe variants :-)
IE. We have a single address space, -but- that address space is divided
into windows that have an individual filter on the transaction requester
IDs (which I can configure to filter a full bus, a full device, or
pretty much per function). I have a pile of such windows (depending on
the exact chipset, up to 256 today).
So essentially, each device -does- have separate mappings, though those are
limited to a "window" of the address space which is typically going to
be around 256M (or smaller) in 32-bit space (but can be much larger in
64-bit space, depending on how much physically contiguous space we can
spare for the translation table itself).
Now, it doesn't do multi-level translations. So KVM guests (or userspace
applications) will not directly modify the translation table. That does
mean map/unmap "ioctls" for userspace. In the KVM case, hypercalls.
This is not a huge deal for us right now as our operating environment is
already paravirtualized (for running under pHyp aka PowerVM aka IBM
proprietary hypervisor). So we just implement the same hypercalls in KVM
and existing kernels will "just work". Not as efficient as direct access
into a multi level page table but still better than nothing :-)
> That said, early versions of VFIO had a mapping mode that used the normal DMA
> API instead of the iommu/uiommu api and assumed that the user was trusted, but
> that wasn't interesting for the long term.
>
> So if you want safe device assignment you're going to need hardware help.
Well, there are going to be some amount of changes in future HW but
that's not something we can count on today and we have to support
existing machines. That said, as I wrote above, I -do- have per-device
assignment, however, I don't get to give an entire 64-bit address space
to each of them, only a "window" in a single address space, so I need
somewhat to convey those boundaries to userspace.
There's also a mismatch with the concept of creating an iommu domain,
and then attaching devices to it (which kvm intends to exploit, Alex was
explaining that his plan is to put all devices in a partition inside the
same domain). In our case, the domains are pretty much pre-existing and
tied to each device. But this is more an API mismatch specific to
uiommu.
Cheers,
Ben.
> It seems like vfio could still work for you. You have a restricted
> iommu address space, but you can also expect your guests to make use of
> hcalls to setup mappings, which gives you a way to use your resources
> more sparingly. So I could imagine a model where your hcalls end up
> calling into qemu vfio userspace, which does the guest physical to host
> virtual translation (or some kind of allocator function). That calls
> the vfio VFIO_DMA_MAP_IOVA ioctl to map or unmap the region.
That would be way too much overhead ... ie we could do it today as a
proof of concept, but ultimately, we want the H-calls to be done in real
mode and directly populate the TCE table (iommu translation table).
IE. We have a guest on one side doing H_PUT_TCE, giving us a value and a
table index, so all we really need is to "validate" that index, translate
the GPA to an HPA, and write to the real TCE table. This is a very hot
code path, especially in networking, so I'd like as much as possible to
do it in the kernel in real mode.
We essentially have to do a partition switch when exiting from the guest
into the host linux, so that's costly. Anything we can handle in "real
mode" (ie, MMU off, right when taking the "interrupt" from the guest as
the result of the hcall for example) will be a win.
When we eventually implement non-pinned user memory, things will be a
bit more nasty I suppose. We can try to "pin" the pages at H_PUT_TCE
time, but that means either doing an exit to linux, or trying to walk
the sparse memmap and do a speculative page reference all in real
mode... not impossible but nasty (since we can't use the vmemmap region
without the MMU on).
But for now, our user memory is all pinned huge pages, so we can have a
nice fast path there.
> You then
> need to implement an iommu interface in the host that performs the hand
> waving of inserting that mapping into the translation for the device.
> You probably still want something like the uiommu interface and
> VFIO_DOMAIN_SET call to create a context for a device for the security
> restrictions that Tom mentions, even if the mapping back to hardware
> page tables is less direct than it is on x86.
Well, yes and no...
The HW has additional fancy isolation features, for example, MMIOs are
also split into domains associated with the MMU window etc... this is so
that the HW can immediately isolate a device on error, making it less
likely for corrupted data to propagate in the system and allowing for
generally more reliable error recovery mechanisms.
That means that in the end, I have a certain amount of "domains"
grouping those MMIO, DMA regions, etc... generally containing one device
each, but the way I see things, all those domains are pre-existing. They
are setup in the host, with or without KVM, when PCIe is enumerated (or
on hotplug).
IE. The host linux without KVM benefits from that isolation as well in
terms of added reliability and recovery services (as it does today under
pHyp).
KVM guests then are purely a matter of making such pre-existing domains
accessible to a guest.
I don't think KVM (or VFIO for that matter) should be involved in the
creation and configuration of those domains, it's a tricky exercise
already due to the MMIO domain thing coupled with funny HW limitations,
and I'd rather keep that totally orthogonal from the act of mapping
those into KVM guests.
Cheers,
Ben.
> Alex
>