LinuxLists.cc - Re: [PATCH] iommu: making IOMMU sysfs nodes API public

2013-03-18 03:52:42

Subject: Re: [PATCH] iommu: making IOMMU sysfs nodes API public

On 20/02/13 15:33, Alex Williamson wrote:
> On Wed, 2013-02-20 at 15:20 +1100, Alexey Kardashevskiy wrote:
>> On 20/02/13 14:47, Alex Williamson wrote:
>>> On Wed, 2013-02-20 at 13:31 +1100, Alexey Kardashevskiy wrote:
>>>> On 20/02/13 07:11, Alex Williamson wrote:
>>>>> On Tue, 2013-02-19 at 18:38 +1100, David Gibson wrote:
>>>>>> On Mon, Feb 18, 2013 at 10:24:00PM -0700, Alex Williamson wrote:
>>>>>>> On Mon, 2013-02-18 at 17:15 +1100, Alexey Kardashevskiy wrote:
>>>>>>>> On 13/02/13 04:15, Alex Williamson wrote:
>>>>>>>>> On Wed, 2013-02-13 at 01:42 +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>> On 12/02/13 16:07, Alex Williamson wrote:
>>>>>>>>>>> On Tue, 2013-02-12 at 15:06 +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>>>> Having this patch in a tree, adding new nodes in sysfs
>>>>>>>>>>>> for IOMMU groups is going to be easier.
>>>>>>>>>>>>
>>>>>>>>>>>> The first candidate for this change is a "dma-window-size"
>>>>>>>>>>>> property which tells a size of a DMA window of the specific
>>>>>>>>>>>> IOMMU group which can be used later for locked pages accounting.
>>>>>>>>>>>
>>>>>>>>>>> I'm still churning on this one; I'm nervous this would basically creat
>>>>>>>>>>> a /proc free-for-all under /sys/kernel/iommu_group/$GROUP/ where any
>>>>>>>>>>> iommu driver can add random attributes. That can get ugly for
>>>>>>>>>>> userspace.
>>>>>>>>>>
>>>>>>>>>> Is not it exactly what sysfs is for (unlike /proc)? :)
>>>>>>>>>
>>>>>>>>> Um, I hope it's a little more thought out than /proc.
>>>>>>>>>
>>>>>>>>>>> On the other hand, for the application of userspace knowing how much
>>>>>>>>>>> memory to lock for vfio use of a group, it's an appealing location to
>>>>>>>>>>> get that information. Something like libvirt would already be poking
>>>>>>>>>>> around here to figure out which devices to bind. Page limits need to be
>>>>>>>>>>> setup prior to use through vfio, so sysfs is more convenient than
>>>>>>>>>>> through vfio ioctls.
>>>>>>>>>>
>>>>>>>>>> True. DMA window properties do not change since boot so sysfs is the right
>>>>>>>>>> place to expose them.
>>>>>>>>>>
>>>>>>>>>>> But then is dma-window-size just a vfio requirement leaking over into
>>>>>>>>>>> iommu groups? Can we allow iommu driver based attributes without giving
>>>>>>>>>>> up control of the namespace? Thanks,
>>>>>>>>>>
>>>>>>>>>> Who are you asking these questions? :)
>>>>>>>>>
>>>>>>>>> Anyone, including you. Rather than dropping misc files in sysfs to
>>>>>>>>> describe things about the group, I think the better solution in your
>>>>>>>>> case might be a link from the group to an existing sysfs directory
>>>>>>>>> describing the PE. I believe your PE is rooted in a PCI bridge, so that
>>>>>>>>> presumably already has a representation in sysfs. Can the aperture size
>>>>>>>>> be determined from something in sysfs for that bridge already? I'm just
>>>>>>>>> not ready to create a grab bag of sysfs entries for a group yet.
>>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>>
>>>>>>>> At the moment there is no information neither in sysfs nor
>>>>>>>> /proc/device-tree about the dma-window. And adding a sysfs entry per PE
>>>>>>>> (powerpc partitionable end-point which is often a PHB but not always) just
>>>>>>>> for VFIO is quite heavy.
>>>>>>>
>>>>>>> How do you learn the window size and PE extents in the host kernel?
>>>>>>>
>>>>>>>> We could add a ppc64 subfolder under /sys/kernel/iommu/xxx/ and put the
>>>>>>>> "dma-window" property there. And replace it with a symlink when and if we
>>>>>>>> add something for PE later. Would work?
>>>>>>
>>>>>> Fwiw, I'd suggest a subfolder named for the type of IOMMU, rather than
>>>>>> "ppc64".
>>>>>>
>>>>>>> To be clear, you're suggesting /sys/kernel/iommu_groups/$GROUP/xxx/,
>>>>>>> right? A subfolder really only limits the scope of the mess, so it's
>>>>>>> not much improvement. What does the interface look like to make those
>>>>>>> subfolders?
>>>>>>>
>>>>>>> The problem we're trying to solve is this call flow:
>>>>>>>
>>>>>>> containerfd = open("/dev/vfio/vfio");
>>>>>>> ioctl(containerfd, VFIO_GET_API_VERSION);
>>>>>>> ioctl(containerfd, VFIO_CHECK_EXTENSION, ...);
>>>>>>> groupfd = open("/dev/vfio/$GROUP");
>>>>>>> ioctl(groupfd, VFIO_GROUP_GET_STATUS);
>>>>>>> ioctl(groupfd, VFIO_GROUP_SET_CONTAINER, &containerfd);
>>>>>>>
>>>>>>> You wanted to lock all the memory for the DMA window here, before we can
>>>>>>> call VFIO_IOMMU_GET_INFO, but does it need to happen there? We still
>>>>>>> have a MAP_DMA hook. We could do it all on the first mapping.
>>>>>>
>>>>>> MAP_DMA isn't quite enough, since the guest can also directly cause
>>>>>> mappings using hypercalls directly implemented in KVM. I think it
>>>>>> would be feasible to lock on the first mapping (either via MAP_DMA, or
>>>>>> H_PUT_TCE) though it would be a bit ugly and require that the first
>>>>>> H_PUT_TCE always bounce out to virtual mode (Alexey, correct me if I'm
>>>>>> wrong here). IIRC there is also a call to bind the vfio container to
>>>>>> a (qemu assigned) LIOBN, before the guest can use H_PUT_TCE directly,
>>>>>> so that might be another place we could do the lock.
>>>>>
>>>>> Somehow hypercall mappings have to be gated by the userspace setup,
>>>>> right?
>>>>
>>>>
>>>> There is a KVM ioctl (and a KVM capability) which hooks LIOBN (PCI bus ID)
>>>> with IOMMU ID. It basically creates an entry in the list of all LIOBNs and
>>>> when TCE call occurs, the host finds correct IOMMU group to pass this call to.
>>>>
>>>> It happens from spapr_register_vfio_container() in QEMU, i.e. after getting
>>>> DMA window properties but only if the host supports real mode TCE handling.
>>>>
>>>>
>>>>>>> It also
>>>>>>> has a flags field that could augment the behavior to trigger page
>>>>>>> locking.
>>>>>>
>>>>>> I don't see how the flags help us - we can't have userspace choose to
>>>>>> skip the locked memory accounting. Or are you suggesting a flag to
>>>>>> open the container in some sort of dummy mode where only GET_INFO is
>>>>>> possible, then re-open with the full locking?
>>>>>
>>>>> Sort of, I don't think it needs to be re-opened, but we had previously
>>>>> talked about some kind of enable and disable ioctl. "enable" would be
>>>>> the logical place to lock pages, but then we probably got stuck in
>>>>> questions around what it means to enable an iommu generically.
>>>>
>>>> The other question is if a container is ready to work if I add just one
>>>> group? What happens when I add another one (not supported on ppc64 but
>>>> still)?
>>>
>>> This is also the problem with exposing a dma window under the group in
>>> sysfs. Do I require the ability to lock the sum of the window, the
>>> largest window, what? If we rely on the ioctls, userspace can figure
>>> out that they can't be combined and know it's the sum. I'm not sure
>>> what your plans are around hotplug of a PHB though.
>>>
>>>> Having "enable" method and disabling new attachments when it is
>>>> "enabled" would keep my brain calm :)
>>>
>>> Now I'm not sure whether you're for or against it ;)
>>
>>
>> I am for introducing enable() ioctls :) Or even "lock" ioctl.
>>
>>
>>>>> So what
>>>>> if instead of a separate enable ioctl we had a flag on DMA_MAP that was
>>>>> defined as SET_WINDOW where iova and size are passed and specify the
>>>>> portion of the DMA window that userspace intends to use and which is
>>>>> therefore locked. If you don't support subwindows, fine, just fail it.
>>>>> You could have a matching PUT_WINDOW on DMA_UNMAP if you wanted.
>>>>
>>>> DMA_MAP which does not do "map" but does "lock" or "set window"?
>>>> enable()/disable() look better.
>>>
>>> Sure, this is why we have a modular iommu interface, spapr can create an
>>> enable ioctl if necessary. I think there are ways to use the
>>> DMA_MAP/UNMAP ioctl in ways that aren't a complete kludge though.
>>>
>>>>>>> Adding the window size to sysfs seems more readily convenient,
>>>>>>> but is it so hard for userspace to open the files and call a couple
>>>>>>> ioctls to get far enough to call IOMMU_GET_INFO? I'm unconvinced the
>>>>>>> clutter in sysfs more than just a quick fix. Thanks,
>>>>>>
>>>>>> And finally, as Alexey points out, isn't the point here so we know how
>>>>>> much rlimit to give qemu? Using ioctls we'd need a special tool just
>>>>>> to check the dma window sizes, which seems a bit hideous.
>>>>>
>>>>> Is it more hideous that using iommu groups to report a vfio imposed
>>>>> restriction? Are a couple open files and a handful of ioctls worse than
>>>>> code to parse directory entries and the future maintenance of an
>>>>> unrestricted grab bag of sysfs entries?
>>>>
>>>> At the moment DMA32 window properties are static. So I can easily get rid
>>>> of VFIO_IOMMU_SPAPR_TCE_GET_INFO and be happy.
>>>
>>> Like, for instance, every PE always gets 512MB DMA window, fixed base
>>> address, not configurable, end of story?
>>
>>
>> Almost :) 1GB, starting at 0 (sometime at 2GB). Multiple PCI domains are
>> supported on ppc64 so it does not make a problem as bus address spaces are
>> separated. But yes, not flexible at all.
>
> Statements like "at the moment...", "[but] sometimes at..." make me
> think it's best to keep the GET_INFO call.
>
>>>> Ah, anyway, how do you see these ioctls to work on a user machine?
>>>> A separate tool which takes an iommu id, returns DMA window size and
>>>> adjusts rlimit?
>>>
>>> Sure, we need something that provides the function of libvirt and
>>> unbinds devices from host drivers, re-binds them to vfio-pci. That tool
>>> needs to have permissions to manipulate groups, so we're just talking
>>> about whether it's stepping over the line for it to open the group and a
>>> container, associate them, and probe the iommu info vs reading a sysfs
>>> file. Thanks,
>>
>> So the Tool is going to be a part of libvirt but not kernel or qemu, right?
>> Then implementing "LOCK" (and call it after GET_INFO in QEMU and not call
>> it from the Tool) should work fine.
>
> Right, a probe tool would check the value, close the files and set the
> locked page limit for qemu, which would take the next step to trigger
> the in-kernel accounting. Thanks,

Continuing the discussion :)
In meanwhile I added/tested VFIO_IOMMU_ENABLE and VFIO_IOMMU_DISABLE like
that (will repost the patch later, may be this week, only few changes there):

+ case VFIO_IOMMU_ENABLE: {
+ mutex_lock(&container->lock);
+ ret = tce_iommu_enable(container);
+ mutex_unlock(&container->lock);
+
+ return ret;
+ }
+ case VFIO_IOMMU_DISABLE: {
+ mutex_lock(&container->lock);
+ tce_iommu_disable(container);
+ mutex_unlock(&container->lock);
+
+ return 0;
+ }

and defined them as (not arch specific):

+/* IOCTLs to enable/disable IOMMU container usage */
+#define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15)
+#define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)

should be ok, right?

There is another question. It is possible to compile
vfio_iommu_spapr_tce as a module. How/when is it supposed to be loaded?
A user may not want to do "modprobe vfio_iommu_spapr_tce" manually.

--
Alexey

2013-03-19 02:40:58

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH] iommu: making IOMMU sysfs nodes API public

On Mon, 2013-03-18 at 14:53 +1100, Alexey Kardashevskiy wrote:
> On 20/02/13 15:33, Alex Williamson wrote:
> > On Wed, 2013-02-20 at 15:20 +1100, Alexey Kardashevskiy wrote:
> >> On 20/02/13 14:47, Alex Williamson wrote:
> >>> On Wed, 2013-02-20 at 13:31 +1100, Alexey Kardashevskiy wrote:
> >>>> On 20/02/13 07:11, Alex Williamson wrote:
> >>>>> On Tue, 2013-02-19 at 18:38 +1100, David Gibson wrote:
> >>>>>> On Mon, Feb 18, 2013 at 10:24:00PM -0700, Alex Williamson wrote:
> >>>>>>> On Mon, 2013-02-18 at 17:15 +1100, Alexey Kardashevskiy wrote:
> >>>>>>>> On 13/02/13 04:15, Alex Williamson wrote:
> >>>>>>>>> On Wed, 2013-02-13 at 01:42 +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>> On 12/02/13 16:07, Alex Williamson wrote:
> >>>>>>>>>>> On Tue, 2013-02-12 at 15:06 +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>>>> Having this patch in a tree, adding new nodes in sysfs
> >>>>>>>>>>>> for IOMMU groups is going to be easier.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The first candidate for this change is a "dma-window-size"
> >>>>>>>>>>>> property which tells a size of a DMA window of the specific
> >>>>>>>>>>>> IOMMU group which can be used later for locked pages accounting.
> >>>>>>>>>>>
> >>>>>>>>>>> I'm still churning on this one; I'm nervous this would basically creat
> >>>>>>>>>>> a /proc free-for-all under /sys/kernel/iommu_group/$GROUP/ where any
> >>>>>>>>>>> iommu driver can add random attributes. That can get ugly for
> >>>>>>>>>>> userspace.
> >>>>>>>>>>
> >>>>>>>>>> Is not it exactly what sysfs is for (unlike /proc)? :)
> >>>>>>>>>
> >>>>>>>>> Um, I hope it's a little more thought out than /proc.
> >>>>>>>>>
> >>>>>>>>>>> On the other hand, for the application of userspace knowing how much
> >>>>>>>>>>> memory to lock for vfio use of a group, it's an appealing location to
> >>>>>>>>>>> get that information. Something like libvirt would already be poking
> >>>>>>>>>>> around here to figure out which devices to bind. Page limits need to be
> >>>>>>>>>>> setup prior to use through vfio, so sysfs is more convenient than
> >>>>>>>>>>> through vfio ioctls.
> >>>>>>>>>>
> >>>>>>>>>> True. DMA window properties do not change since boot so sysfs is the right
> >>>>>>>>>> place to expose them.
> >>>>>>>>>>
> >>>>>>>>>>> But then is dma-window-size just a vfio requirement leaking over into
> >>>>>>>>>>> iommu groups? Can we allow iommu driver based attributes without giving
> >>>>>>>>>>> up control of the namespace? Thanks,
> >>>>>>>>>>
> >>>>>>>>>> Who are you asking these questions? :)
> >>>>>>>>>
> >>>>>>>>> Anyone, including you. Rather than dropping misc files in sysfs to
> >>>>>>>>> describe things about the group, I think the better solution in your
> >>>>>>>>> case might be a link from the group to an existing sysfs directory
> >>>>>>>>> describing the PE. I believe your PE is rooted in a PCI bridge, so that
> >>>>>>>>> presumably already has a representation in sysfs. Can the aperture size
> >>>>>>>>> be determined from something in sysfs for that bridge already? I'm just
> >>>>>>>>> not ready to create a grab bag of sysfs entries for a group yet.
> >>>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> At the moment there is no information neither in sysfs nor
> >>>>>>>> /proc/device-tree about the dma-window. And adding a sysfs entry per PE
> >>>>>>>> (powerpc partitionable end-point which is often a PHB but not always) just
> >>>>>>>> for VFIO is quite heavy.
> >>>>>>>
> >>>>>>> How do you learn the window size and PE extents in the host kernel?
> >>>>>>>
> >>>>>>>> We could add a ppc64 subfolder under /sys/kernel/iommu/xxx/ and put the
> >>>>>>>> "dma-window" property there. And replace it with a symlink when and if we
> >>>>>>>> add something for PE later. Would work?
> >>>>>>
> >>>>>> Fwiw, I'd suggest a subfolder named for the type of IOMMU, rather than
> >>>>>> "ppc64".
> >>>>>>
> >>>>>>> To be clear, you're suggesting /sys/kernel/iommu_groups/$GROUP/xxx/,
> >>>>>>> right? A subfolder really only limits the scope of the mess, so it's
> >>>>>>> not much improvement. What does the interface look like to make those
> >>>>>>> subfolders?
> >>>>>>>
> >>>>>>> The problem we're trying to solve is this call flow:
> >>>>>>>
> >>>>>>> containerfd = open("/dev/vfio/vfio");
> >>>>>>> ioctl(containerfd, VFIO_GET_API_VERSION);
> >>>>>>> ioctl(containerfd, VFIO_CHECK_EXTENSION, ...);
> >>>>>>> groupfd = open("/dev/vfio/$GROUP");
> >>>>>>> ioctl(groupfd, VFIO_GROUP_GET_STATUS);
> >>>>>>> ioctl(groupfd, VFIO_GROUP_SET_CONTAINER, &containerfd);
> >>>>>>>
> >>>>>>> You wanted to lock all the memory for the DMA window here, before we can
> >>>>>>> call VFIO_IOMMU_GET_INFO, but does it need to happen there? We still
> >>>>>>> have a MAP_DMA hook. We could do it all on the first mapping.
> >>>>>>
> >>>>>> MAP_DMA isn't quite enough, since the guest can also directly cause
> >>>>>> mappings using hypercalls directly implemented in KVM. I think it
> >>>>>> would be feasible to lock on the first mapping (either via MAP_DMA, or
> >>>>>> H_PUT_TCE) though it would be a bit ugly and require that the first
> >>>>>> H_PUT_TCE always bounce out to virtual mode (Alexey, correct me if I'm
> >>>>>> wrong here). IIRC there is also a call to bind the vfio container to
> >>>>>> a (qemu assigned) LIOBN, before the guest can use H_PUT_TCE directly,
> >>>>>> so that might be another place we could do the lock.
> >>>>>
> >>>>> Somehow hypercall mappings have to be gated by the userspace setup,
> >>>>> right?
> >>>>
> >>>>
> >>>> There is a KVM ioctl (and a KVM capability) which hooks LIOBN (PCI bus ID)
> >>>> with IOMMU ID. It basically creates an entry in the list of all LIOBNs and
> >>>> when TCE call occurs, the host finds correct IOMMU group to pass this call to.
> >>>>
> >>>> It happens from spapr_register_vfio_container() in QEMU, i.e. after getting
> >>>> DMA window properties but only if the host supports real mode TCE handling.
> >>>>
> >>>>
> >>>>>>> It also
> >>>>>>> has a flags field that could augment the behavior to trigger page
> >>>>>>> locking.
> >>>>>>
> >>>>>> I don't see how the flags help us - we can't have userspace choose to
> >>>>>> skip the locked memory accounting. Or are you suggesting a flag to
> >>>>>> open the container in some sort of dummy mode where only GET_INFO is
> >>>>>> possible, then re-open with the full locking?
> >>>>>
> >>>>> Sort of, I don't think it needs to be re-opened, but we had previously
> >>>>> talked about some kind of enable and disable ioctl. "enable" would be
> >>>>> the logical place to lock pages, but then we probably got stuck in
> >>>>> questions around what it means to enable an iommu generically.
> >>>>
> >>>> The other question is if a container is ready to work if I add just one
> >>>> group? What happens when I add another one (not supported on ppc64 but
> >>>> still)?
> >>>
> >>> This is also the problem with exposing a dma window under the group in
> >>> sysfs. Do I require the ability to lock the sum of the window, the
> >>> largest window, what? If we rely on the ioctls, userspace can figure
> >>> out that they can't be combined and know it's the sum. I'm not sure
> >>> what your plans are around hotplug of a PHB though.
> >>>
> >>>> Having "enable" method and disabling new attachments when it is
> >>>> "enabled" would keep my brain calm :)
> >>>
> >>> Now I'm not sure whether you're for or against it ;)
> >>
> >>
> >> I am for introducing enable() ioctls :) Or even "lock" ioctl.
> >>
> >>
> >>>>> So what
> >>>>> if instead of a separate enable ioctl we had a flag on DMA_MAP that was
> >>>>> defined as SET_WINDOW where iova and size are passed and specify the
> >>>>> portion of the DMA window that userspace intends to use and which is
> >>>>> therefore locked. If you don't support subwindows, fine, just fail it.
> >>>>> You could have a matching PUT_WINDOW on DMA_UNMAP if you wanted.
> >>>>
> >>>> DMA_MAP which does not do "map" but does "lock" or "set window"?
> >>>> enable()/disable() look better.
> >>>
> >>> Sure, this is why we have a modular iommu interface, spapr can create an
> >>> enable ioctl if necessary. I think there are ways to use the
> >>> DMA_MAP/UNMAP ioctl in ways that aren't a complete kludge though.
> >>>
> >>>>>>> Adding the window size to sysfs seems more readily convenient,
> >>>>>>> but is it so hard for userspace to open the files and call a couple
> >>>>>>> ioctls to get far enough to call IOMMU_GET_INFO? I'm unconvinced the
> >>>>>>> clutter in sysfs more than just a quick fix. Thanks,
> >>>>>>
> >>>>>> And finally, as Alexey points out, isn't the point here so we know how
> >>>>>> much rlimit to give qemu? Using ioctls we'd need a special tool just
> >>>>>> to check the dma window sizes, which seems a bit hideous.
> >>>>>
> >>>>> Is it more hideous that using iommu groups to report a vfio imposed
> >>>>> restriction? Are a couple open files and a handful of ioctls worse than
> >>>>> code to parse directory entries and the future maintenance of an
> >>>>> unrestricted grab bag of sysfs entries?
> >>>>
> >>>> At the moment DMA32 window properties are static. So I can easily get rid
> >>>> of VFIO_IOMMU_SPAPR_TCE_GET_INFO and be happy.
> >>>
> >>> Like, for instance, every PE always gets 512MB DMA window, fixed base
> >>> address, not configurable, end of story?
> >>
> >>
> >> Almost :) 1GB, starting at 0 (sometime at 2GB). Multiple PCI domains are
> >> supported on ppc64 so it does not make a problem as bus address spaces are
> >> separated. But yes, not flexible at all.
> >
> > Statements like "at the moment...", "[but] sometimes at..." make me
> > think it's best to keep the GET_INFO call.
> >
> >>>> Ah, anyway, how do you see these ioctls to work on a user machine?
> >>>> A separate tool which takes an iommu id, returns DMA window size and
> >>>> adjusts rlimit?
> >>>
> >>> Sure, we need something that provides the function of libvirt and
> >>> unbinds devices from host drivers, re-binds them to vfio-pci. That tool
> >>> needs to have permissions to manipulate groups, so we're just talking
> >>> about whether it's stepping over the line for it to open the group and a
> >>> container, associate them, and probe the iommu info vs reading a sysfs
> >>> file. Thanks,
> >>
> >> So the Tool is going to be a part of libvirt but not kernel or qemu, right?
> >> Then implementing "LOCK" (and call it after GET_INFO in QEMU and not call
> >> it from the Tool) should work fine.
> >
> > Right, a probe tool would check the value, close the files and set the
> > locked page limit for qemu, which would take the next step to trigger
> > the in-kernel accounting. Thanks,
>
> Continuing the discussion :)
> In meanwhile I added/tested VFIO_IOMMU_ENABLE and VFIO_IOMMU_DISABLE like
> that (will repost the patch later, may be this week, only few changes there):
>
>
> + case VFIO_IOMMU_ENABLE: {
> + mutex_lock(&container->lock);
> + ret = tce_iommu_enable(container);
> + mutex_unlock(&container->lock);
> +
> + return ret;
> + }
> + case VFIO_IOMMU_DISABLE: {
> + mutex_lock(&container->lock);
> + tce_iommu_disable(container);
> + mutex_unlock(&container->lock);
> +
> + return 0;
> + }
>
> and defined them as (not arch specific):
>
> +/* IOCTLs to enable/disable IOMMU container usage */
> +#define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15)
> +#define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
>
>
> should be ok, right?

Seems fine. Can you envision any parameters to either of these in the
future? Be sure to document in the header and maybe add a SPAPR section
to Documentation/vfio.txt making note of the call flow for this IOMMU
backend.

> There is another question. It is possible to compile
> vfio_iommu_spapr_tce as a module. How/when is it supposed to be loaded?
> A user may not want to do "modprobe vfio_iommu_spapr_tce" manually.

See the request_module() call in drivers/vfio/vfio.c. I'd think you'd
just add another for vfio_iommu_spapr_tce. Thanks,

Alex

2013-03-19 07:14:36

by Alexey Kardashevskiy

[permalink] [raw]

Subject: [PATCH] vfio powerpc: implement IOMMU driver for VFIO

VFIO implements platform independent stuff such as
a PCI driver, BAR access (via read/write on a file descriptor
or direct mapping when possible) and IRQ signaling.

The platform dependent part includes IOMMU initialization
and handling. This patch implements an IOMMU driver for VFIO
which does mapping/unmapping pages for the guest IO and
provides information about DMA window (required by a POWERPC
guest).

The counterpart in QEMU is required to support this functionality.

Changelog:
* documentation updated
* containter enable/disable ioctls added
* request_module(spapr_iommu) added
* various locks fixed

Cc: David Gibson <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Documentation/vfio.txt | 63 ++++++
drivers/vfio/Kconfig | 6 +
drivers/vfio/Makefile | 1 +
drivers/vfio/vfio.c | 1 +
drivers/vfio/vfio_iommu_spapr_tce.c | 365 +++++++++++++++++++++++++++++++++++
include/uapi/linux/vfio.h | 38 ++++
6 files changed, 474 insertions(+)
create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 8eda363..351f7c9 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -283,6 +283,69 @@ a direct pass through for VFIO_DEVICE_* ioctls. The read/write/mmap
interfaces implement the device region access defined by the device's
own VFIO_DEVICE_GET_REGION_INFO ioctl.

+
+PPC64 sPAPR implementation note
+-------------------------------------------------------------------------------
+
+This implementation has some specifics:
+
+1) Only one IOMMU group per container is supported as an IOMMU group
+represents the minimal entity which isolation can be guaranteed for and
+groups are allocated statically, one per a Partitionable Endpoint (PE)
+(PE is often a PCI domain but not always).
+
+2) The hardware supports so called DMA windows - the PCI address range
+within which DMA transfer is allowed, any attempt to access address space
+out of the window leads to the whole PE isolation.
+
+3) PPC64 guests are paravirtualized but not fully emulated. There is an API
+to map/unmap pages for DMA, and it normally maps 1..32 pages per call and
+currently there is no way to reduce the number of calls. In order to make things
+faster, the map/unmap handling has been implented in real mode which provides
+an excellent performance which has limitations such as inability to do
+locked pages accounting in real time.
+
+So 3 additional ioctls have been added:
+
+ VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
+ of the DMA window on the PCI bus.
+
+ VFIO_IOMMU_ENABLE - enables the container. The locked pages accounting
+ is done at this point. This lets user first to know what
+ the DMA window is and adjust rlimit before doing any real job.
+
+ VFIO_IOMMU_DISABLE - disables the container.
+
+
+The code flow from the example above should be slightly changed:
+
+ .....
+ /* Add the group to the container */
+ ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
+
+ /* Enable the IOMMU model we want */
+ ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU)
+
+ /* Get addition sPAPR IOMMU info */
+ vfio_iommu_spapr_tce_info spapr_iommu_info;
+ ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &spapr_iommu_info);
+
+ if (ioctl(container, VFIO_IOMMU_ENABLE))
+ /* Cannot enable container, may be low rlimit */
+
+ /* Allocate some space and setup a DMA mapping */
+ dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+
+ dma_map.size = 1024 * 1024;
+ dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
+ dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+
+ /* Check here is .iova/.size are within DMA window from spapr_iommu_info */
+
+ ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
+ .....
+
-------------------------------------------------------------------------------

[1] VFIO was originally an acronym for "Virtual Function I/O" in its
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 7cd5dec..b464687 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1
depends on VFIO
default n

+config VFIO_IOMMU_SPAPR_TCE
+ tristate
+ depends on VFIO && SPAPR_TCE_IOMMU
+ default n
+
menuconfig VFIO
tristate "VFIO Non-Privileged userspace driver framework"
depends on IOMMU_API
select VFIO_IOMMU_TYPE1 if X86
+ select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
help
VFIO provides a framework for secure userspace device drivers.
See Documentation/vfio.txt for more details.
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 2398d4a..72bfabc 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,3 +1,4 @@
obj-$(CONFIG_VFIO) += vfio.o
obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
+obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
obj-$(CONFIG_VFIO_PCI) += pci/
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 12c264d..004033d 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1376,6 +1376,7 @@ static int __init vfio_init(void)
* drivers.
*/
request_module_nowait("vfio_iommu_type1");
+ request_module_nowait("vfio_iommu_spapr");

return 0;

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
new file mode 100644
index 0000000..22ba0b5
--- /dev/null
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -0,0 +1,365 @@
+/*
+ * VFIO: IOMMU DMA mapping support for TCE on POWER
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Author: Alexey Kardashevskiy <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio_iommu_type1.c:
+ * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
+ * Author: Alex Williamson <[email protected]>
+ */
+
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/err.h>
+#include <linux/vfio.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
+
+#define DRIVER_VERSION "0.1"
+#define DRIVER_AUTHOR "[email protected]"
+#define DRIVER_DESC "VFIO IOMMU SPAPR TCE"
+
+static void tce_iommu_detach_group(void *iommu_data,
+ struct iommu_group *iommu_group);
+
+/*
+ * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
+ *
+ * This code handles mapping and unmapping of user data buffers
+ * into DMA'ble space using the IOMMU
+ */
+
+/*
+ * The container descriptor supports only a single group per container.
+ * Required by the API as the container is not supplied with the IOMMU group
+ * at the moment of initialization.
+ */
+struct tce_container {
+ struct mutex lock;
+ struct iommu_table *tbl;
+ bool enabled;
+};
+
+static int tce_iommu_enable(struct tce_container *container)
+{
+ int ret = 0;
+ unsigned long locked, lock_limit, npages;
+ struct iommu_table *tbl = container->tbl;
+
+ if (!container->tbl)
+ return -ENXIO;
+
+ if (!current->mm)
+ return -ESRCH; /* process exited */
+
+ mutex_lock(&container->lock);
+ if (container->enabled) {
+ mutex_unlock(&container->lock);
+ return -EBUSY;
+ }
+
+ /*
+ * Accounting for locked pages.
+ *
+ * On sPAPR platform, IOMMU translation table contains
+ * an entry per 4K page. Every map/unmap request is sent
+ * by the guest via hypercall and supposed to be handled
+ * quickly, ecpesially in real mode (if such option is
+ * supported and enabled).
+ * As handling the accounting in real time is rather
+ * impossible, it may require switching to virtual mode
+ * which is quite expensive operation.
+ * As the whole window can be actuall mapped on high
+ * performance guests and we do not want such guests
+ * to fail, we do the accounting as a part of IOMMU
+ * enablement process.
+ */
+ down_write(&current->mm->mmap_sem);
+ npages = (tbl->it_size << IOMMU_PAGE_SHIFT) >> PAGE_SHIFT;
+ locked = current->mm->locked_vm + npages;
+ lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+ if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+ pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
+ rlimit(RLIMIT_MEMLOCK));
+ ret = -ENOMEM;
+ } else {
+
+ current->mm->locked_vm += npages;
+ container->enabled = true;
+ }
+ up_write(&current->mm->mmap_sem);
+
+ mutex_unlock(&container->lock);
+
+ return ret;
+}
+
+static void tce_iommu_disable(struct tce_container *container)
+{
+ mutex_lock(&container->lock);
+
+ if (container->enabled && container->tbl) {
+ if (current->mm) {
+ down_write(&current->mm->mmap_sem);
+ current->mm->locked_vm -= (container->tbl->it_size <<
+ IOMMU_PAGE_SHIFT) >> PAGE_SHIFT;
+ up_write(&current->mm->mmap_sem);
+ }
+ container->enabled = false;
+ }
+
+ mutex_unlock(&container->lock);
+}
+
+static void *tce_iommu_open(unsigned long arg)
+{
+ struct tce_container *container;
+
+ if (arg != VFIO_SPAPR_TCE_IOMMU) {
+ pr_err("tce_vfio: Wrong IOMMU type\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+ container = kzalloc(sizeof(*container), GFP_KERNEL);
+ if (!container)
+ return ERR_PTR(-ENOMEM);
+
+ mutex_init(&container->lock);
+
+ return container;
+}
+
+static void tce_iommu_release(void *iommu_data)
+{
+ struct tce_container *container = iommu_data;
+
+ WARN_ON(container->tbl && !container->tbl->it_group);
+ tce_iommu_disable(container);
+
+ if (container->tbl && container->tbl->it_group)
+ tce_iommu_detach_group(iommu_data, container->tbl->it_group);
+
+ mutex_destroy(&container->lock);
+
+ kfree(container);
+}
+
+static long tce_iommu_ioctl(void *iommu_data,
+ unsigned int cmd, unsigned long arg)
+{
+ struct tce_container *container = iommu_data;
+ unsigned long minsz;
+ long ret;
+
+ switch (cmd) {
+ case VFIO_CHECK_EXTENSION:
+ return (arg == VFIO_SPAPR_TCE_IOMMU) ? 1 : 0;
+
+ case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
+ struct vfio_iommu_spapr_tce_info info;
+ struct iommu_table *tbl = container->tbl;
+
+ if (WARN_ON(!tbl))
+ return -ENXIO;
+
+ minsz = offsetofend(struct vfio_iommu_spapr_tce_info,
+ dma32_window_size);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (info.argsz < minsz)
+ return -EINVAL;
+
+ info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT;
+ info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT;
+ info.flags = 0;
+
+ if (copy_to_user((void __user *)arg, &info, minsz))
+ return -EFAULT;
+
+ return 0;
+ }
+ case VFIO_IOMMU_MAP_DMA: {
+ vfio_iommu_spapr_tce_dma_map param;
+ struct iommu_table *tbl = container->tbl;
+ unsigned long tce;
+
+ if (WARN_ON(!tbl))
+ return -ENXIO;
+
+ minsz = offsetofend(vfio_iommu_spapr_tce_dma_map, size);
+
+ if (copy_from_user(&param, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (param.argsz < minsz)
+ return -EINVAL;
+
+ if (param.flags & ~(VFIO_DMA_MAP_FLAG_READ |
+ VFIO_DMA_MAP_FLAG_WRITE))
+ return -EINVAL;
+
+ if ((param.size & ~IOMMU_PAGE_MASK) ||
+ (param.vaddr & ~IOMMU_PAGE_MASK))
+ return -EINVAL;
+
+ /* TODO: support multiple TCEs */
+ if (param.size != IOMMU_PAGE_SIZE) {
+ pr_err("VFIO map on POWER supports only %lu page size\n",
+ IOMMU_PAGE_SIZE);
+ return -EINVAL;
+ }
+
+ /* iova is checked by the IOMMU API */
+ tce = param.vaddr;
+ if (param.flags & VFIO_DMA_MAP_FLAG_READ)
+ tce |= TCE_PCI_READ;
+ if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
+ tce |= TCE_PCI_WRITE;
+
+ ret = iommu_tce_put_param_check(tbl, param.iova, tce);
+ if (ret)
+ return ret;
+
+ ret = iommu_put_tce_user_mode(tbl,
+ param.iova >> IOMMU_PAGE_SHIFT, tce);
+ iommu_flush_tce(tbl);
+
+ return ret;
+ }
+ case VFIO_IOMMU_UNMAP_DMA: {
+ vfio_iommu_spapr_tce_dma_unmap param;
+ struct iommu_table *tbl = container->tbl;
+
+ if (WARN_ON(!tbl))
+ return -ENXIO;
+
+ minsz = offsetofend(vfio_iommu_spapr_tce_dma_unmap, size);
+
+ if (copy_from_user(&param, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (param.argsz < minsz)
+ return -EINVAL;
+
+ /* No flag is supported now */
+ if (param.flags)
+ return -EINVAL;
+
+ if (param.size & ~IOMMU_PAGE_MASK)
+ return -EINVAL;
+
+ ret = iommu_tce_clear_param_check(tbl, param.iova, 0,
+ param.size >> IOMMU_PAGE_SHIFT);
+ if (ret)
+ return ret;
+
+ ret = iommu_clear_tces_and_put_pages(tbl,
+ param.iova >> IOMMU_PAGE_SHIFT,
+ param.size >> IOMMU_PAGE_SHIFT);
+ iommu_flush_tce(tbl);
+
+ return ret;
+ }
+ case VFIO_IOMMU_ENABLE: {
+ return tce_iommu_enable(container);
+ }
+ case VFIO_IOMMU_DISABLE: {
+ tce_iommu_disable(container);
+ return 0;
+ }
+ }
+
+ return -ENOTTY;
+}
+
+static int tce_iommu_attach_group(void *iommu_data,
+ struct iommu_group *iommu_group)
+{
+ int ret;
+ struct tce_container *container = iommu_data;
+ struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+
+ BUG_ON(!tbl);
+ mutex_lock(&container->lock);
+ pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
+ iommu_group_id(iommu_group), iommu_group);
+ if (container->tbl) {
+ pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
+ iommu_group_id(container->tbl->it_group),
+ iommu_group_id(iommu_group));
+ mutex_unlock(&container->lock);
+ return -EBUSY;
+ }
+
+ ret = iommu_take_ownership(tbl);
+ if (!ret)
+ container->tbl = tbl;
+
+ mutex_unlock(&container->lock);
+
+ return ret;
+}
+
+static void tce_iommu_detach_group(void *iommu_data,
+ struct iommu_group *iommu_group)
+{
+ struct tce_container *container = iommu_data;
+ struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+
+ BUG_ON(!tbl);
+ mutex_lock(&container->lock);
+ if (tbl != container->tbl) {
+ pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
+ iommu_group_id(iommu_group),
+ iommu_group_id(tbl->it_group));
+ } else if (container->enabled) {
+ pr_err("tce_vfio: detaching group #%u from enabled container\n",
+ iommu_group_id(tbl->it_group));
+ } else {
+
+ pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
+ iommu_group_id(iommu_group), iommu_group);
+
+ container->tbl = NULL;
+ iommu_release_ownership(tbl);
+ }
+ mutex_unlock(&container->lock);
+}
+
+const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
+ .name = "iommu-vfio-powerpc",
+ .owner = THIS_MODULE,
+ .open = tce_iommu_open,
+ .release = tce_iommu_release,
+ .ioctl = tce_iommu_ioctl,
+ .attach_group = tce_iommu_attach_group,
+ .detach_group = tce_iommu_detach_group,
+};
+
+static int __init tce_iommu_init(void)
+{
+ return vfio_register_iommu_driver(&tce_iommu_driver_ops);
+}
+
+static void __exit tce_iommu_cleanup(void)
+{
+ vfio_unregister_iommu_driver(&tce_iommu_driver_ops);
+}
+
+module_init(tce_iommu_init);
+module_exit(tce_iommu_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 4758d1b..38ecccf 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -22,6 +22,7 @@
/* Extensions */

#define VFIO_TYPE1_IOMMU 1
+#define VFIO_SPAPR_TCE_IOMMU 2

/*
* The IOCTL interface is designed for extensibility by embedding the
@@ -365,4 +366,41 @@ struct vfio_iommu_type1_dma_unmap {

#define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)

+/*
+ * IOCTLs to enable/disable IOMMU container usage.
+ * No parameters are supported.
+ */
+#define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15)
+#define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
+
+/* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
+
+/*
+ * The SPAPR TCE info struct provides the information about the PCI bus
+ * address ranges available for DMA, these values are programmed into
+ * the hardware so the guest has to know that information.
+ *
+ * The DMA 32 bit window start is an absolute PCI bus address.
+ * The IOVA address passed via map/unmap ioctls are absolute PCI bus
+ * addresses too so the window works as a filter rather than an offset
+ * for IOVA addresses.
+ *
+ * A flag will need to be added if other page sizes are supported,
+ * so as defined here, it is always 4k.
+ */
+struct vfio_iommu_spapr_tce_info {
+ __u32 argsz;
+ __u32 flags; /* reserved for future use */
+ __u32 dma32_window_start; /* 32 bit window start (bytes) */
+ __u32 dma32_window_size; /* 32 bit window size (bytes) */
+};
+
+#define VFIO_IOMMU_SPAPR_TCE_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
+
+/* Reuse type1 map/unmap structs as they are the same at the moment */
+typedef struct vfio_iommu_type1_dma_map vfio_iommu_spapr_tce_dma_map;
+typedef struct vfio_iommu_type1_dma_unmap vfio_iommu_spapr_tce_dma_unmap;
+
+/* ***************************************************************** */
+
#endif /* _UAPIVFIO_H */
--
1.7.10.4

2013-03-20 20:45:38

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH] vfio powerpc: implement IOMMU driver for VFIO

On Tue, 2013-03-19 at 18:08 +1100, Alexey Kardashevskiy wrote:
> VFIO implements platform independent stuff such as
> a PCI driver, BAR access (via read/write on a file descriptor
> or direct mapping when possible) and IRQ signaling.
>
> The platform dependent part includes IOMMU initialization
> and handling. This patch implements an IOMMU driver for VFIO
> which does mapping/unmapping pages for the guest IO and
> provides information about DMA window (required by a POWERPC
> guest).
>
> The counterpart in QEMU is required to support this functionality.
>
> Changelog:
> * documentation updated
> * containter enable/disable ioctls added
> * request_module(spapr_iommu) added
> * various locks fixed
>
> Cc: David Gibson <[email protected]>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---

Looking pretty good. There's one problem with the detach_group,
otherwise just some trivial comments below. What's the status of the
tce code that this depends on? Thanks,

Alex

> Documentation/vfio.txt | 63 ++++++
> drivers/vfio/Kconfig | 6 +
> drivers/vfio/Makefile | 1 +
> drivers/vfio/vfio.c | 1 +
> drivers/vfio/vfio_iommu_spapr_tce.c | 365 +++++++++++++++++++++++++++++++++++
> include/uapi/linux/vfio.h | 38 ++++
> 6 files changed, 474 insertions(+)
> create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
>
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> index 8eda363..351f7c9 100644
> --- a/Documentation/vfio.txt
> +++ b/Documentation/vfio.txt
> @@ -283,6 +283,69 @@ a direct pass through for VFIO_DEVICE_* ioctls. The read/write/mmap
> interfaces implement the device region access defined by the device's
> own VFIO_DEVICE_GET_REGION_INFO ioctl.
>
> +
> +PPC64 sPAPR implementation note
> +-------------------------------------------------------------------------------
> +
> +This implementation has some specifics:
> +
> +1) Only one IOMMU group per container is supported as an IOMMU group
> +represents the minimal entity which isolation can be guaranteed for and
> +groups are allocated statically, one per a Partitionable Endpoint (PE)
> +(PE is often a PCI domain but not always).
> +
> +2) The hardware supports so called DMA windows - the PCI address range
> +within which DMA transfer is allowed, any attempt to access address space
> +out of the window leads to the whole PE isolation.
> +
> +3) PPC64 guests are paravirtualized but not fully emulated. There is an API
> +to map/unmap pages for DMA, and it normally maps 1..32 pages per call and
> +currently there is no way to reduce the number of calls. In order to make things
> +faster, the map/unmap handling has been implented in real mode which provides

s/implented/implemented/

> +an excellent performance which has limitations such as inability to do
> +locked pages accounting in real time.
> +
> +So 3 additional ioctls have been added:
> +
> + VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
> + of the DMA window on the PCI bus.
> +
> + VFIO_IOMMU_ENABLE - enables the container. The locked pages accounting
> + is done at this point. This lets user first to know what
> + the DMA window is and adjust rlimit before doing any real job.
> +
> + VFIO_IOMMU_DISABLE - disables the container.
> +
> +
> +The code flow from the example above should be slightly changed:
> +
> + .....
> + /* Add the group to the container */
> + ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
> +
> + /* Enable the IOMMU model we want */
> + ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU)
> +
> + /* Get addition sPAPR IOMMU info */
> + vfio_iommu_spapr_tce_info spapr_iommu_info;
> + ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &spapr_iommu_info);
> +
> + if (ioctl(container, VFIO_IOMMU_ENABLE))
> + /* Cannot enable container, may be low rlimit */
> +
> + /* Allocate some space and setup a DMA mapping */
> + dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
> + MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
> +
> + dma_map.size = 1024 * 1024;
> + dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
> + dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
> +
> + /* Check here is .iova/.size are within DMA window from spapr_iommu_info */
> +
> + ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
> + .....
> +

Awesome, thanks for the docs update

> -------------------------------------------------------------------------------
>
> [1] VFIO was originally an acronym for "Virtual Function I/O" in its
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index 7cd5dec..b464687 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1
> depends on VFIO
> default n
>
> +config VFIO_IOMMU_SPAPR_TCE
> + tristate
> + depends on VFIO && SPAPR_TCE_IOMMU
> + default n
> +
> menuconfig VFIO
> tristate "VFIO Non-Privileged userspace driver framework"
> depends on IOMMU_API
> select VFIO_IOMMU_TYPE1 if X86
> + select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
> help
> VFIO provides a framework for secure userspace device drivers.
> See Documentation/vfio.txt for more details.
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 2398d4a..72bfabc 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -1,3 +1,4 @@
> obj-$(CONFIG_VFIO) += vfio.o
> obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
> +obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
> obj-$(CONFIG_VFIO_PCI) += pci/
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 12c264d..004033d 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1376,6 +1376,7 @@ static int __init vfio_init(void)
> * drivers.
> */
> request_module_nowait("vfio_iommu_type1");
> + request_module_nowait("vfio_iommu_spapr");
>
> return 0;
>
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> new file mode 100644
> index 0000000..22ba0b5
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -0,0 +1,365 @@
> +/*
> + * VFIO: IOMMU DMA mapping support for TCE on POWER
> + *
> + * Copyright (C) 2013 IBM Corp. All rights reserved.
> + * Author: Alexey Kardashevskiy <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio_iommu_type1.c:
> + * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
> + * Author: Alex Williamson <[email protected]>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/err.h>
> +#include <linux/vfio.h>
> +#include <asm/iommu.h>
> +#include <asm/tce.h>
> +
> +#define DRIVER_VERSION "0.1"
> +#define DRIVER_AUTHOR "[email protected]"
> +#define DRIVER_DESC "VFIO IOMMU SPAPR TCE"
> +
> +static void tce_iommu_detach_group(void *iommu_data,
> + struct iommu_group *iommu_group);
> +
> +/*
> + * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
> + *
> + * This code handles mapping and unmapping of user data buffers
> + * into DMA'ble space using the IOMMU
> + */
> +
> +/*
> + * The container descriptor supports only a single group per container.
> + * Required by the API as the container is not supplied with the IOMMU group
> + * at the moment of initialization.
> + */
> +struct tce_container {
> + struct mutex lock;
> + struct iommu_table *tbl;
> + bool enabled;
> +};
> +
> +static int tce_iommu_enable(struct tce_container *container)
> +{
> + int ret = 0;
> + unsigned long locked, lock_limit, npages;
> + struct iommu_table *tbl = container->tbl;
> +
> + if (!container->tbl)
> + return -ENXIO;
> +
> + if (!current->mm)
> + return -ESRCH; /* process exited */
> +
> + mutex_lock(&container->lock);
> + if (container->enabled) {
> + mutex_unlock(&container->lock);
> + return -EBUSY;
> + }
> +
> + /*
> + * Accounting for locked pages.
> + *
> + * On sPAPR platform, IOMMU translation table contains
> + * an entry per 4K page. Every map/unmap request is sent
> + * by the guest via hypercall and supposed to be handled
> + * quickly, ecpesially in real mode (if such option is

s/ecpesially/especially/

> + * supported and enabled).
> + * As handling the accounting in real time is rather
> + * impossible, it may require switching to virtual mode
> + * which is quite expensive operation.
> + * As the whole window can be actuall mapped on high

s/actuall/actually/

> + * performance guests and we do not want such guests
> + * to fail, we do the accounting as a part of IOMMU
> + * enablement process.
> + */
> + down_write(&current->mm->mmap_sem);
> + npages = (tbl->it_size << IOMMU_PAGE_SHIFT) >> PAGE_SHIFT;
> + locked = current->mm->locked_vm + npages;
> + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> + pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
> + rlimit(RLIMIT_MEMLOCK));
> + ret = -ENOMEM;
> + } else {
> +
> + current->mm->locked_vm += npages;
> + container->enabled = true;
> + }
> + up_write(&current->mm->mmap_sem);
> +
> + mutex_unlock(&container->lock);
> +
> + return ret;
> +}
> +
> +static void tce_iommu_disable(struct tce_container *container)
> +{
> + mutex_lock(&container->lock);
> +
> + if (container->enabled && container->tbl) {
> + if (current->mm) {
> + down_write(&current->mm->mmap_sem);
> + current->mm->locked_vm -= (container->tbl->it_size <<
> + IOMMU_PAGE_SHIFT) >> PAGE_SHIFT;
> + up_write(&current->mm->mmap_sem);
> + }
> + container->enabled = false;
> + }
> +
> + mutex_unlock(&container->lock);
> +}
> +
> +static void *tce_iommu_open(unsigned long arg)
> +{
> + struct tce_container *container;
> +
> + if (arg != VFIO_SPAPR_TCE_IOMMU) {
> + pr_err("tce_vfio: Wrong IOMMU type\n");
> + return ERR_PTR(-EINVAL);
> + }
> +
> + container = kzalloc(sizeof(*container), GFP_KERNEL);
> + if (!container)
> + return ERR_PTR(-ENOMEM);
> +
> + mutex_init(&container->lock);
> +
> + return container;
> +}
> +
> +static void tce_iommu_release(void *iommu_data)
> +{
> + struct tce_container *container = iommu_data;
> +
> + WARN_ON(container->tbl && !container->tbl->it_group);
> + tce_iommu_disable(container);
> +
> + if (container->tbl && container->tbl->it_group)
> + tce_iommu_detach_group(iommu_data, container->tbl->it_group);
> +
> + mutex_destroy(&container->lock);
> +
> + kfree(container);
> +}
> +
> +static long tce_iommu_ioctl(void *iommu_data,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct tce_container *container = iommu_data;
> + unsigned long minsz;
> + long ret;
> +
> + switch (cmd) {
> + case VFIO_CHECK_EXTENSION:
> + return (arg == VFIO_SPAPR_TCE_IOMMU) ? 1 : 0;
> +
> + case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
> + struct vfio_iommu_spapr_tce_info info;
> + struct iommu_table *tbl = container->tbl;
> +
> + if (WARN_ON(!tbl))
> + return -ENXIO;
> +
> + minsz = offsetofend(struct vfio_iommu_spapr_tce_info,
> + dma32_window_size);
> +
> + if (copy_from_user(&info, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (info.argsz < minsz)
> + return -EINVAL;
> +
> + info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT;
> + info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT;
> + info.flags = 0;
> +
> + if (copy_to_user((void __user *)arg, &info, minsz))
> + return -EFAULT;
> +
> + return 0;
> + }
> + case VFIO_IOMMU_MAP_DMA: {
> + vfio_iommu_spapr_tce_dma_map param;
> + struct iommu_table *tbl = container->tbl;
> + unsigned long tce;
> +
> + if (WARN_ON(!tbl))
> + return -ENXIO;
> +
> + minsz = offsetofend(vfio_iommu_spapr_tce_dma_map, size);
> +
> + if (copy_from_user(&param, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (param.argsz < minsz)
> + return -EINVAL;
> +
> + if (param.flags & ~(VFIO_DMA_MAP_FLAG_READ |
> + VFIO_DMA_MAP_FLAG_WRITE))
> + return -EINVAL;
> +
> + if ((param.size & ~IOMMU_PAGE_MASK) ||
> + (param.vaddr & ~IOMMU_PAGE_MASK))
> + return -EINVAL;
> +
> + /* TODO: support multiple TCEs */
> + if (param.size != IOMMU_PAGE_SIZE) {
> + pr_err("VFIO map on POWER supports only %lu page size\n",
> + IOMMU_PAGE_SIZE);
> + return -EINVAL;
> + }
> +
> + /* iova is checked by the IOMMU API */
> + tce = param.vaddr;
> + if (param.flags & VFIO_DMA_MAP_FLAG_READ)
> + tce |= TCE_PCI_READ;
> + if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
> + tce |= TCE_PCI_WRITE;
> +
> + ret = iommu_tce_put_param_check(tbl, param.iova, tce);
> + if (ret)
> + return ret;
> +
> + ret = iommu_put_tce_user_mode(tbl,
> + param.iova >> IOMMU_PAGE_SHIFT, tce);
> + iommu_flush_tce(tbl);
> +
> + return ret;
> + }
> + case VFIO_IOMMU_UNMAP_DMA: {
> + vfio_iommu_spapr_tce_dma_unmap param;
> + struct iommu_table *tbl = container->tbl;
> +
> + if (WARN_ON(!tbl))
> + return -ENXIO;
> +
> + minsz = offsetofend(vfio_iommu_spapr_tce_dma_unmap, size);
> +
> + if (copy_from_user(&param, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (param.argsz < minsz)
> + return -EINVAL;
> +
> + /* No flag is supported now */
> + if (param.flags)
> + return -EINVAL;
> +
> + if (param.size & ~IOMMU_PAGE_MASK)
> + return -EINVAL;
> +
> + ret = iommu_tce_clear_param_check(tbl, param.iova, 0,
> + param.size >> IOMMU_PAGE_SHIFT);
> + if (ret)
> + return ret;
> +
> + ret = iommu_clear_tces_and_put_pages(tbl,
> + param.iova >> IOMMU_PAGE_SHIFT,
> + param.size >> IOMMU_PAGE_SHIFT);
> + iommu_flush_tce(tbl);
> +
> + return ret;
> + }
> + case VFIO_IOMMU_ENABLE: {
> + return tce_iommu_enable(container);
> + }
> + case VFIO_IOMMU_DISABLE: {
> + tce_iommu_disable(container);
> + return 0;
> + }

Unnecessary brackets

> + }
> +
> + return -ENOTTY;
> +}
> +
> +static int tce_iommu_attach_group(void *iommu_data,
> + struct iommu_group *iommu_group)
> +{
> + int ret;
> + struct tce_container *container = iommu_data;
> + struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
> +
> + BUG_ON(!tbl);
> + mutex_lock(&container->lock);
> + pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
> + iommu_group_id(iommu_group), iommu_group);
> + if (container->tbl) {
> + pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
> + iommu_group_id(container->tbl->it_group),
> + iommu_group_id(iommu_group));
> + mutex_unlock(&container->lock);
> + return -EBUSY;
> + }
> +
> + ret = iommu_take_ownership(tbl);
> + if (!ret)
> + container->tbl = tbl;
> +
> + mutex_unlock(&container->lock);
> +
> + return ret;
> +}
> +
> +static void tce_iommu_detach_group(void *iommu_data,
> + struct iommu_group *iommu_group)
> +{
> + struct tce_container *container = iommu_data;
> + struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
> +
> + BUG_ON(!tbl);
> + mutex_lock(&container->lock);
> + if (tbl != container->tbl) {
> + pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
> + iommu_group_id(iommu_group),
> + iommu_group_id(tbl->it_group));
> + } else if (container->enabled) {
> + pr_err("tce_vfio: detaching group #%u from enabled container\n",
> + iommu_group_id(tbl->it_group));

Hmm, something more than a pr_err needs to happen here. Wouldn't this
imply a disable and going back to an unprivileged container?

> + } else {
> +
> + pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
> + iommu_group_id(iommu_group), iommu_group);
> +
> + container->tbl = NULL;
> + iommu_release_ownership(tbl);
> + }
> + mutex_unlock(&container->lock);
> +}
> +
> +const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
> + .name = "iommu-vfio-powerpc",
> + .owner = THIS_MODULE,
> + .open = tce_iommu_open,
> + .release = tce_iommu_release,
> + .ioctl = tce_iommu_ioctl,
> + .attach_group = tce_iommu_attach_group,
> + .detach_group = tce_iommu_detach_group,
> +};
> +
> +static int __init tce_iommu_init(void)
> +{
> + return vfio_register_iommu_driver(&tce_iommu_driver_ops);
> +}
> +
> +static void __exit tce_iommu_cleanup(void)
> +{
> + vfio_unregister_iommu_driver(&tce_iommu_driver_ops);
> +}
> +
> +module_init(tce_iommu_init);
> +module_exit(tce_iommu_cleanup);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> +
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 4758d1b..38ecccf 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -22,6 +22,7 @@
> /* Extensions */
>
> #define VFIO_TYPE1_IOMMU 1
> +#define VFIO_SPAPR_TCE_IOMMU 2
>
> /*
> * The IOCTL interface is designed for extensibility by embedding the
> @@ -365,4 +366,41 @@ struct vfio_iommu_type1_dma_unmap {
>
> #define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
>
> +/*
> + * IOCTLs to enable/disable IOMMU container usage.
> + * No parameters are supported.
> + */
> +#define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15)
> +#define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
> +
> +/* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> +
> +/*
> + * The SPAPR TCE info struct provides the information about the PCI bus
> + * address ranges available for DMA, these values are programmed into
> + * the hardware so the guest has to know that information.
> + *
> + * The DMA 32 bit window start is an absolute PCI bus address.
> + * The IOVA address passed via map/unmap ioctls are absolute PCI bus
> + * addresses too so the window works as a filter rather than an offset
> + * for IOVA addresses.
> + *
> + * A flag will need to be added if other page sizes are supported,
> + * so as defined here, it is always 4k.
> + */
> +struct vfio_iommu_spapr_tce_info {
> + __u32 argsz;
> + __u32 flags; /* reserved for future use */
> + __u32 dma32_window_start; /* 32 bit window start (bytes) */
> + __u32 dma32_window_size; /* 32 bit window size (bytes) */
> +};
> +
> +#define VFIO_IOMMU_SPAPR_TCE_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> +
> +/* Reuse type1 map/unmap structs as they are the same at the moment */
> +typedef struct vfio_iommu_type1_dma_map vfio_iommu_spapr_tce_dma_map;
> +typedef struct vfio_iommu_type1_dma_unmap vfio_iommu_spapr_tce_dma_unmap;
> +
> +/* ***************************************************************** */
> +
> #endif /* _UAPIVFIO_H */

2013-03-21 00:56:13

by Alexey Kardashevskiy

[permalink] [raw]

Subject: Re: [PATCH] vfio powerpc: implement IOMMU driver for VFIO

On 21/03/13 07:45, Alex Williamson wrote:
> On Tue, 2013-03-19 at 18:08 +1100, Alexey Kardashevskiy wrote:
>> VFIO implements platform independent stuff such as
>> a PCI driver, BAR access (via read/write on a file descriptor
>> or direct mapping when possible) and IRQ signaling.
>>
>> The platform dependent part includes IOMMU initialization
>> and handling. This patch implements an IOMMU driver for VFIO
>> which does mapping/unmapping pages for the guest IO and
>> provides information about DMA window (required by a POWERPC
>> guest).
>>
>> The counterpart in QEMU is required to support this functionality.
>>
>> Changelog:
>> * documentation updated
>> * containter enable/disable ioctls added
>> * request_module(spapr_iommu) added
>> * various locks fixed
>>
>> Cc: David Gibson <[email protected]>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>
>
> Looking pretty good. There's one problem with the detach_group,
> otherwise just some trivial comments below. What's the status of the
> tce code that this depends on? Thanks,

It is done, I am just waiting till other patches of the series will be
reviewed (by guys) and fixed (by me) and then I'll post everything again.

[skipped]

>> +static int tce_iommu_enable(struct tce_container *container)
>> +{
>> + int ret = 0;
>> + unsigned long locked, lock_limit, npages;
>> + struct iommu_table *tbl = container->tbl;
>> +
>> + if (!container->tbl)
>> + return -ENXIO;
>> +
>> + if (!current->mm)
>> + return -ESRCH; /* process exited */
>> +
>> + mutex_lock(&container->lock);
>> + if (container->enabled) {
>> + mutex_unlock(&container->lock);
>> + return -EBUSY;
>> + }
>> +
>> + /*
>> + * Accounting for locked pages.
>> + *
>> + * On sPAPR platform, IOMMU translation table contains
>> + * an entry per 4K page. Every map/unmap request is sent
>> + * by the guest via hypercall and supposed to be handled
>> + * quickly, ecpesially in real mode (if such option is
>
> s/ecpesially/especially/

I replaced the whole text by the one written by Ben and David.

[skipped]

>> + }
>> +
>> + return -ENOTTY;
>> +}
>> +
>> +static int tce_iommu_attach_group(void *iommu_data,
>> + struct iommu_group *iommu_group)
>> +{
>> + int ret;
>> + struct tce_container *container = iommu_data;
>> + struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
>> +
>> + BUG_ON(!tbl);
>> + mutex_lock(&container->lock);
>> + pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
>> + iommu_group_id(iommu_group), iommu_group);
>> + if (container->tbl) {
>> + pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
>> + iommu_group_id(container->tbl->it_group),
>> + iommu_group_id(iommu_group));
>> + mutex_unlock(&container->lock);
>> + return -EBUSY;
>> + }
>> +
>> + ret = iommu_take_ownership(tbl);
>> + if (!ret)
>> + container->tbl = tbl;
>> +
>> + mutex_unlock(&container->lock);
>> +
>> + return ret;
>> +}
>> +
>> +static void tce_iommu_detach_group(void *iommu_data,
>> + struct iommu_group *iommu_group)
>> +{
>> + struct tce_container *container = iommu_data;
>> + struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
>> +
>> + BUG_ON(!tbl);
>> + mutex_lock(&container->lock);
>> + if (tbl != container->tbl) {
>> + pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
>> + iommu_group_id(iommu_group),
>> + iommu_group_id(tbl->it_group));
>> + } else if (container->enabled) {
>> + pr_err("tce_vfio: detaching group #%u from enabled container\n",
>> + iommu_group_id(tbl->it_group));
>
> Hmm, something more than a pr_err needs to happen here. Wouldn't this
> imply a disable and going back to an unprivileged container?

Ah. It does not return error code... Then something like that?

...
} else if (container->enabled) {
pr_warn("tce_vfio: detaching group #%u from enabled
container, forcing disable\n",
iommu_group_id(tbl->it_group));
tce_iommu_disable(container);
...

--
Alexey

2013-03-21 01:29:57

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH] vfio powerpc: implement IOMMU driver for VFIO

On Thu, 2013-03-21 at 11:57 +1100, Alexey Kardashevskiy wrote:
> On 21/03/13 07:45, Alex Williamson wrote:
> > On Tue, 2013-03-19 at 18:08 +1100, Alexey Kardashevskiy wrote:
> >> VFIO implements platform independent stuff such as
> >> a PCI driver, BAR access (via read/write on a file descriptor
> >> or direct mapping when possible) and IRQ signaling.
> >>
> >> The platform dependent part includes IOMMU initialization
> >> and handling. This patch implements an IOMMU driver for VFIO
> >> which does mapping/unmapping pages for the guest IO and
> >> provides information about DMA window (required by a POWERPC
> >> guest).
> >>
> >> The counterpart in QEMU is required to support this functionality.
> >>
> >> Changelog:
> >> * documentation updated
> >> * containter enable/disable ioctls added
> >> * request_module(spapr_iommu) added
> >> * various locks fixed
> >>
> >> Cc: David Gibson <[email protected]>
> >> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >> ---
> >
> >
> > Looking pretty good. There's one problem with the detach_group,
> > otherwise just some trivial comments below. What's the status of the
> > tce code that this depends on? Thanks,
>
>
> It is done, I am just waiting till other patches of the series will be
> reviewed (by guys) and fixed (by me) and then I'll post everything again.
>
> [skipped]
>
> >> +static int tce_iommu_enable(struct tce_container *container)
> >> +{
> >> + int ret = 0;
> >> + unsigned long locked, lock_limit, npages;
> >> + struct iommu_table *tbl = container->tbl;
> >> +
> >> + if (!container->tbl)
> >> + return -ENXIO;
> >> +
> >> + if (!current->mm)
> >> + return -ESRCH; /* process exited */
> >> +
> >> + mutex_lock(&container->lock);
> >> + if (container->enabled) {
> >> + mutex_unlock(&container->lock);
> >> + return -EBUSY;
> >> + }
> >> +
> >> + /*
> >> + * Accounting for locked pages.
> >> + *
> >> + * On sPAPR platform, IOMMU translation table contains
> >> + * an entry per 4K page. Every map/unmap request is sent
> >> + * by the guest via hypercall and supposed to be handled
> >> + * quickly, ecpesially in real mode (if such option is
> >
> > s/ecpesially/especially/
>
>
> I replaced the whole text by the one written by Ben and David.
>
> [skipped]
>
> >> + }
> >> +
> >> + return -ENOTTY;
> >> +}
> >> +
> >> +static int tce_iommu_attach_group(void *iommu_data,
> >> + struct iommu_group *iommu_group)
> >> +{
> >> + int ret;
> >> + struct tce_container *container = iommu_data;
> >> + struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
> >> +
> >> + BUG_ON(!tbl);
> >> + mutex_lock(&container->lock);
> >> + pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
> >> + iommu_group_id(iommu_group), iommu_group);
> >> + if (container->tbl) {
> >> + pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
> >> + iommu_group_id(container->tbl->it_group),
> >> + iommu_group_id(iommu_group));
> >> + mutex_unlock(&container->lock);
> >> + return -EBUSY;
> >> + }
> >> +
> >> + ret = iommu_take_ownership(tbl);
> >> + if (!ret)
> >> + container->tbl = tbl;
> >> +
> >> + mutex_unlock(&container->lock);
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +static void tce_iommu_detach_group(void *iommu_data,
> >> + struct iommu_group *iommu_group)
> >> +{
> >> + struct tce_container *container = iommu_data;
> >> + struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
> >> +
> >> + BUG_ON(!tbl);
> >> + mutex_lock(&container->lock);
> >> + if (tbl != container->tbl) {
> >> + pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
> >> + iommu_group_id(iommu_group),
> >> + iommu_group_id(tbl->it_group));
> >> + } else if (container->enabled) {
> >> + pr_err("tce_vfio: detaching group #%u from enabled container\n",
> >> + iommu_group_id(tbl->it_group));
> >
> > Hmm, something more than a pr_err needs to happen here. Wouldn't this
> > imply a disable and going back to an unprivileged container?
>
> Ah. It does not return error code... Then something like that?
>
> ...
> } else if (container->enabled) {
> pr_warn("tce_vfio: detaching group #%u from enabled
> container, forcing disable\n",
> iommu_group_id(tbl->it_group));
> tce_iommu_disable(container);
> ...

Yep, but of course it also needs to fall through and set tbl = NULL and
release ownership as well. BTW, don't you think the pr_debug in that
path is a little excessive? Likewise the one in the attach path.
Thanks,

Alex

2013-03-21 03:02:50

by David Gibson

[permalink] [raw]

Subject: Re: [PATCH] vfio powerpc: implement IOMMU driver for VFIO

On Wed, Mar 20, 2013 at 02:45:24PM -0600, Alex Williamson wrote:
> On Tue, 2013-03-19 at 18:08 +1100, Alexey Kardashevskiy wrote:
> > VFIO implements platform independent stuff such as
> > a PCI driver, BAR access (via read/write on a file descriptor
> > or direct mapping when possible) and IRQ signaling.
> >
> > The platform dependent part includes IOMMU initialization
> > and handling. This patch implements an IOMMU driver for VFIO
> > which does mapping/unmapping pages for the guest IO and
> > provides information about DMA window (required by a POWERPC
> > guest).
> >
> > The counterpart in QEMU is required to support this functionality.
> >
> > Changelog:
> > * documentation updated
> > * containter enable/disable ioctls added
> > * request_module(spapr_iommu) added
> > * various locks fixed
> >
> > Cc: David Gibson <[email protected]>
> > Signed-off-by: Alexey Kardashevskiy <[email protected]>
> > ---
>
>
> Looking pretty good. There's one problem with the detach_group,
> otherwise just some trivial comments below. What's the status of the
> tce code that this depends on? Thanks,

[snip]
> > +static void tce_iommu_detach_group(void *iommu_data,
> > + struct iommu_group *iommu_group)
> > +{
> > + struct tce_container *container = iommu_data;
> > + struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
> > +
> > + BUG_ON(!tbl);
> > + mutex_lock(&container->lock);
> > + if (tbl != container->tbl) {
> > + pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
> > + iommu_group_id(iommu_group),
> > + iommu_group_id(tbl->it_group));
> > + } else if (container->enabled) {
> > + pr_err("tce_vfio: detaching group #%u from enabled container\n",
> > + iommu_group_id(tbl->it_group));
>
> Hmm, something more than a pr_err needs to happen here. Wouldn't this
> imply a disable and going back to an unprivileged container?

Uh, no. I think the idea here is that we use the enable/disable
semantic to address some other potential problems. Specifically,
sidestepping the problems of what happens if you change the
container's capabilities by adding/removing groups while in the middle
of using it. So the point is that the detach fails when the group is
enabled, rather than implicitly doing anything.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

Attachments:

(No filename) (2.41 kB)
signature.asc (198.00 B)
Digital signature Download all attachments

2013-03-21 03:16:34

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH] vfio powerpc: implement IOMMU driver for VFIO

On Thu, 2013-03-21 at 12:55 +1100, David Gibson wrote:
> On Wed, Mar 20, 2013 at 02:45:24PM -0600, Alex Williamson wrote:
> > On Tue, 2013-03-19 at 18:08 +1100, Alexey Kardashevskiy wrote:
> > > VFIO implements platform independent stuff such as
> > > a PCI driver, BAR access (via read/write on a file descriptor
> > > or direct mapping when possible) and IRQ signaling.
> > >
> > > The platform dependent part includes IOMMU initialization
> > > and handling. This patch implements an IOMMU driver for VFIO
> > > which does mapping/unmapping pages for the guest IO and
> > > provides information about DMA window (required by a POWERPC
> > > guest).
> > >
> > > The counterpart in QEMU is required to support this functionality.
> > >
> > > Changelog:
> > > * documentation updated
> > > * containter enable/disable ioctls added
> > > * request_module(spapr_iommu) added
> > > * various locks fixed
> > >
> > > Cc: David Gibson <[email protected]>
> > > Signed-off-by: Alexey Kardashevskiy <[email protected]>
> > > ---
> >
> >
> > Looking pretty good. There's one problem with the detach_group,
> > otherwise just some trivial comments below. What's the status of the
> > tce code that this depends on? Thanks,
>
> [snip]
> > > +static void tce_iommu_detach_group(void *iommu_data,
> > > + struct iommu_group *iommu_group)
> > > +{
> > > + struct tce_container *container = iommu_data;
> > > + struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
> > > +
> > > + BUG_ON(!tbl);
> > > + mutex_lock(&container->lock);
> > > + if (tbl != container->tbl) {
> > > + pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
> > > + iommu_group_id(iommu_group),
> > > + iommu_group_id(tbl->it_group));
> > > + } else if (container->enabled) {
> > > + pr_err("tce_vfio: detaching group #%u from enabled container\n",
> > > + iommu_group_id(tbl->it_group));
> >
> > Hmm, something more than a pr_err needs to happen here. Wouldn't this
> > imply a disable and going back to an unprivileged container?
>
> Uh, no. I think the idea here is that we use the enable/disable
> semantic to address some other potential problems. Specifically,
> sidestepping the problems of what happens if you change the
> container's capabilities by adding/removing groups while in the middle
> of using it. So the point is that the detach fails when the group is
> enabled, rather than implicitly doing anything.

The function returns void. We're not failing the detach, just getting
into a broken state. This is only called to unwind attaching groups
when the iommu is set or if the user explicitly calls
GROUP_UNSET_CONTAINER. The former won't have had a chance to call
enable but the latter would need to be fixed. Thanks,

Alex

2013-03-25 02:20:20

by David Gibson

[permalink] [raw]

Subject: Re: [PATCH] vfio powerpc: implement IOMMU driver for VFIO

On Wed, Mar 20, 2013 at 09:16:24PM -0600, Alex Williamson wrote:
> On Thu, 2013-03-21 at 12:55 +1100, David Gibson wrote:
> > On Wed, Mar 20, 2013 at 02:45:24PM -0600, Alex Williamson wrote:
> > > On Tue, 2013-03-19 at 18:08 +1100, Alexey Kardashevskiy wrote:
> > > > VFIO implements platform independent stuff such as
> > > > a PCI driver, BAR access (via read/write on a file descriptor
> > > > or direct mapping when possible) and IRQ signaling.
> > > >
> > > > The platform dependent part includes IOMMU initialization
> > > > and handling. This patch implements an IOMMU driver for VFIO
> > > > which does mapping/unmapping pages for the guest IO and
> > > > provides information about DMA window (required by a POWERPC
> > > > guest).
> > > >
> > > > The counterpart in QEMU is required to support this functionality.
> > > >
> > > > Changelog:
> > > > * documentation updated
> > > > * containter enable/disable ioctls added
> > > > * request_module(spapr_iommu) added
> > > > * various locks fixed
> > > >
> > > > Cc: David Gibson <[email protected]>
> > > > Signed-off-by: Alexey Kardashevskiy <[email protected]>
> > > > ---
> > >
> > >
> > > Looking pretty good. There's one problem with the detach_group,
> > > otherwise just some trivial comments below. What's the status of the
> > > tce code that this depends on? Thanks,
> >
> > [snip]
> > > > +static void tce_iommu_detach_group(void *iommu_data,
> > > > + struct iommu_group *iommu_group)
> > > > +{
> > > > + struct tce_container *container = iommu_data;
> > > > + struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
> > > > +
> > > > + BUG_ON(!tbl);
> > > > + mutex_lock(&container->lock);
> > > > + if (tbl != container->tbl) {
> > > > + pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
> > > > + iommu_group_id(iommu_group),
> > > > + iommu_group_id(tbl->it_group));
> > > > + } else if (container->enabled) {
> > > > + pr_err("tce_vfio: detaching group #%u from enabled container\n",
> > > > + iommu_group_id(tbl->it_group));
> > >
> > > Hmm, something more than a pr_err needs to happen here. Wouldn't this
> > > imply a disable and going back to an unprivileged container?
> >
> > Uh, no. I think the idea here is that we use the enable/disable
> > semantic to address some other potential problems. Specifically,
> > sidestepping the problems of what happens if you change the
> > container's capabilities by adding/removing groups while in the middle
> > of using it. So the point is that the detach fails when the group is
> > enabled, rather than implicitly doing anything.
>
> The function returns void.

Uh, yeah, we should probably fix that.

> We're not failing the detach, just getting
> into a broken state. This is only called to unwind attaching groups
> when the iommu is set or if the user explicitly calls
> GROUP_UNSET_CONTAINER. The former won't have had a chance to call
> enable but the latter would need to be fixed.

Well, in the latter case it's essentialy a silent failure, which isn't
great.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

Attachments:

(No filename) (3.17 kB)
signature.asc (198.00 B)
Digital signature Download all attachments