2010-06-01 08:11:11

by Avi Kivity

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On 05/31/2010 08:10 PM, Michael S. Tsirkin wrote:
> On Mon, May 31, 2010 at 02:50:29PM +0300, Avi Kivity wrote:
>
>> On 05/30/2010 05:53 PM, Michael S. Tsirkin wrote:
>>
>>> So what I suggested is failing any kind of access until iommu
>>> is assigned.
>>>
>>>
>> So, the kernel driver must be aware of the iommu. In which case it may
>> as well program it.
>>
> It's a kernel driver anyway. Point is that
> the *device* driver is better off not programming iommu,
> this way we do not need to reprogram it for each device.
>

The device driver is in userspace. It can't program the iommu. What
the patch proposes is that userspace tells vfio about the needed
mappings, and vfio programs the iommu.
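
For illustration, the userspace side of such a flow might look roughly like
this (the ioctl name and struct fields are illustrative, not necessarily the
exact ones defined by the patch):

struct dma_map_request req = {
        .vaddr = (unsigned long)buf,    /* process virtual address */
        .iova  = iova,                  /* address the device will use */
        .size  = len,
};
ioctl(vfio_fd, VFIO_DMA_MAP, &req);     /* vfio pins the pages and programs
                                           the iommu with the mapping */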

--
error compiling committee.c: too many arguments to function


2010-06-01 10:00:09

by Michael S. Tsirkin

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Tue, Jun 01, 2010 at 11:10:45AM +0300, Avi Kivity wrote:
> On 05/31/2010 08:10 PM, Michael S. Tsirkin wrote:
>> On Mon, May 31, 2010 at 02:50:29PM +0300, Avi Kivity wrote:
>>
>>> On 05/30/2010 05:53 PM, Michael S. Tsirkin wrote:
>>>
>>>> So what I suggested is failing any kind of access until iommu
>>>> is assigned.
>>>>
>>>>
>>> So, the kernel driver must be aware of the iommu. In which case it may
>>> as well program it.
>>>
>> It's a kernel driver anyway. Point is that
>> the *device* driver is better off not programming iommu,
>> this way we do not need to reprogram it for each device.
>>
>
> The device driver is in userspace.

I mean the kernel driver that grants userspace the access.

> It can't program the iommu.
> What
> the patch proposes is that userspace tells vfio about the needed
> mappings, and vfio programs the iommu.

There seems to be some misunderstanding. The userspace interface
proposed forces a separate domain per device and forces userspace to
repeat iommu programming for each device. We are better off sharing a
domain between devices and programming the iommu once.

The natural way to do this is to have an iommu driver for programming
iommu.

This likely means we will have to pass the domain to 'vfio' or uio or
whatever the driver that gives userspace access to the device is called,
but this is only for security; there's no need to support programming the
iommu there.

And using this design means the uio framework changes
required would be minor, so we won't have to duplicate code.

> --
> error compiling committee.c: too many arguments to function

2010-06-01 10:29:04

by Avi Kivity

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:
>
>> It can't program the iommu.
>> What
>> the patch proposes is that userspace tells vfio about the needed
>> mappings, and vfio programs the iommu.
>>
> There seems to be some misunderstanding. The userspace interface
> proposed forces a separate domain per device and forces userspace to
> repeat iommu programming for each device. We are better off sharing a
> domain between devices and programming the iommu once.
>

iommufd = open(/dev/iommu);
ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)

?

If so, I agree.

> The natural way to do this is to have an iommu driver for programming
> iommu.
>
> This likely means we will have to pass the domain to 'vfio' or uio or
> whatever the driver that gives userspace the access to device is called,
> but this is only for security, there's no need to support programming
> iommu there.
>
> And using this design means the uio framework changes
> required would be minor, so we won't have to duplicate code.
>

Since vfio would be the only driver, there would be no duplication. But
a separate object for the iommu mapping is a good thing. Perhaps we can
even share it with vhost (without actually using the mmu, since vhost is
software only).

--
error compiling committee.c: too many arguments to function

2010-06-01 10:51:23

by Michael S. Tsirkin

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Tue, Jun 01, 2010 at 01:28:48PM +0300, Avi Kivity wrote:
> On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:
>>
>>> It can't program the iommu.
>>> What
>>> the patch proposes is that userspace tells vfio about the needed
>>> mappings, and vfio programs the iommu.
>>>
>> There seems to be some misunderstanding. The userspace interface
>> proposed forces a separate domain per device and forces userspace to
>> repeat iommu programming for each device. We are better off sharing a
>> domain between devices and programming the iommu once.
>>
>
> iommufd = open(/dev/iommu);
> ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
> ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)
>
> ?

Yes.

> If so, I agree.

Good.

>> The natural way to do this is to have an iommu driver for programming
>> iommu.
>>
>> This likely means we will have to pass the domain to 'vfio' or uio or
>> whatever the driver that gives userspace the access to device is called,
>> but this is only for security, there's no need to support programming
>> iommu there.
>>
>> And using this design means the uio framework changes
>> required would be minor, so we won't have to duplicate code.
>>
>
> Since vfio would be the only driver, there would be no duplication. But
> a separate object for the iommu mapping is a good thing. Perhaps we can
> even share it with vhost (without actually using the mmu, since vhost is
> software only).

Main difference is that vhost works fine with unlocked
memory, paging it in on demand. iommu needs to unmap
memory when it is swapped out or relocated.

> --
> error compiling committee.c: too many arguments to function

2010-06-01 12:44:28

by Avi Kivity

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote:
>
>> Since vfio would be the only driver, there would be no duplication. But
>> a separate object for the iommu mapping is a good thing. Perhaps we can
>> even share it with vhost (without actually using the mmu, since vhost is
>> software only).
>>
> Main difference is that vhost works fine with unlocked
> memory, paging it in on demand. iommu needs to unmap
> memory when it is swapped out or relocated.
>
>

So you'd just take the memory map and not pin anything. This way you
can reuse the memory map.

But no, it doesn't handle the dirty bitmap, so no go.

--
error compiling committee.c: too many arguments to function

2010-06-01 21:30:14

by Tom Lyon

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Tuesday 01 June 2010 03:46:51 am Michael S. Tsirkin wrote:
> On Tue, Jun 01, 2010 at 01:28:48PM +0300, Avi Kivity wrote:
> > On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:
> >>
> >>> It can't program the iommu.
> >>> What
> >>> the patch proposes is that userspace tells vfio about the needed
> >>> mappings, and vfio programs the iommu.
> >>>
> >> There seems to be some misunderstanding. The userspace interface
> >> proposed forces a separate domain per device and forces userspace to
> >> repeat iommu programming for each device. We are better off sharing a
> >> domain between devices and programming the iommu once.
> >>
> >
> > iommufd = open(/dev/iommu);
> > ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
> > ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)
> >
> > ?
>
> Yes.
>
> > If so, I agree.
>
> Good.

I'm not really opposed to multiple devices per domain, but let me point out how I
ended up here. First, the driver has two ways of mapping pages, one based on the
iommu api and one based on the dma_map_sg api. With the latter, the system
already allocates a domain per device and there's no way to control it. This was
presumably done to help isolation between drivers. If there are multiple drivers
in the user level, do we not want the same isolation to apply to them?
Also, domains are not a very scarce resource: my little Core i5 has 256,
and the Intel architecture goes to 64K.
And then there's the fact that it is possible to have multiple disjoint iommus on a system,
so it may not even be possible to bring 2 devices under one domain.

Given all that, I am inclined to leave it alone until someone has a real problem.
Note that not sharing iommu domains doesn't mean you can't share device memory,
just that you have to do multiple mappings.
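
For reference, the two kernel-side mapping paths being compared look roughly
like this (a simplified sketch; the exact signatures have varied between
kernel versions):

/* 1. Explicit iommu api: the caller owns the domain and the iova layout */
domain = iommu_domain_alloc();
iommu_attach_device(domain, &pdev->dev);
iommu_map(domain, iova, page_to_phys(page), size, IOMMU_READ | IOMMU_WRITE);

/* 2. dma_map_sg: the dma api picks the domain and the bus addresses itself */
nents_mapped = dma_map_sg(&pdev->dev, sglist, nents, DMA_BIDIRECTIONAL);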

2010-06-02 03:00:57

by Avi Kivity

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On 06/02/2010 12:26 AM, Tom Lyon wrote:
>
> I'm not really opposed to multiple devices per domain, but let me point out how I
> ended up here. First, the driver has two ways of mapping pages, one based on the
> iommu api and one based on the dma_map_sg api. With the latter, the system
> already allocates a domain per device and there's no way to control it. This was
> presumably done to help isolation between drivers. If there are multiple drivers
> in the user level, do we not want the same isolation to apply to them?
>

In the case of kvm, we don't want isolation between devices, because
that doesn't happen on real hardware. So if the guest programs devices
to dma to each other, we want that to succeed.

> Also, domains are not a very scarce resource - my little core i5 has 256,
> and the intel architecture goes to 64K.
>

But there is a 0.2% of mapped memory per domain cost for the page
tables. For the kvm use case, that could be significant since a guest
may have large amounts of memory and large numbers of assigned devices.

> And then there's the fact that it is possible to have multiple disjoint iommus on a system,
> so it may not even be possible to bring 2 devices under one domain.
>

That's indeed a deficiency.

> Given all that, I am inclined to leave it alone until someone has a real problem.
> Note that not sharing iommu domains doesn't mean you can't share device memory,
> just that you have to do multiple mappings
>

I think we do have a real problem (though a mild one).

The only issue I see with deferring the solution is that the API becomes
gnarly; both the kernel and userspace will have to support both APIs
forever. Perhaps we can implement the new API but defer the actual
sharing until later; I don't know how much work this saves. Or Alex/Chris
can pitch in and help.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-06-02 04:30:26

by Alex Williamson

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Tue, 2010-06-01 at 13:28 +0300, Avi Kivity wrote:
> On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:
> >
> >> It can't program the iommu.
> >> What
> >> the patch proposes is that userspace tells vfio about the needed
> >> mappings, and vfio programs the iommu.
> >>
> > There seems to be some misunderstanding. The userspace interface
> > proposed forces a separate domain per device and forces userspace to
> > repeat iommu programming for each device. We are better off sharing a
> > domain between devices and programming the iommu once.
> >
>
> iommufd = open(/dev/iommu);
> ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
> ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)

It seems part of the annoyance of the current KVM device assignment is
that we have multiple files open, we mmap here, read there, write over
there, maybe, if it's not emulated. I quite like Tom's approach that we
have one-stop shopping with /dev/vfio<n>, including config space
emulation so each driver doesn't have to try to write its own. So
continuing with that, shouldn't we be able to add a GET_IOMMU/SET_IOMMU
ioctl to vfio so that after we set up one device we can bind the next to
the same domain?

Alex

2010-06-02 05:03:04

by Tom Lyon

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Tuesday 01 June 2010 09:29:47 pm Alex Williamson wrote:
> On Tue, 2010-06-01 at 13:28 +0300, Avi Kivity wrote:
> > On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:
> > >
> > >> It can't program the iommu.
> > >> What
> > >> the patch proposes is that userspace tells vfio about the needed
> > >> mappings, and vfio programs the iommu.
> > >>
> > > There seems to be some misunderstanding. The userspace interface
> > > proposed forces a separate domain per device and forces userspace to
> > > repeat iommu programming for each device. We are better off sharing a
> > > domain between devices and programming the iommu once.
> > >
> >
> > iommufd = open(/dev/iommu);
> > ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
> > ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)
>
> It seems part of the annoyance of the current KVM device assignment is
> that we have multiple files open, we mmap here, read there, write over
> there, maybe, if it's not emulated. I quite like Tom's approach that we
> have one stop shopping with /dev/vfio<n>, including config space
> emulation so each driver doesn't have to try to write their own. So
> continuing with that, shouldn't we be able to add a GET_IOMMU/SET_IOMMU
> ioctl to vfio so that after we setup one device we can bind the next to
> the same domain?

This is just what I was thinking. But rather than a get/set, just use two fds.

ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2);

This may fail if there are really 2 different IOMMUs, so user code must be
prepared for failure. In addition, this is strictly upwards compatible with
what is there now, so maybe we can add it later.

2010-06-02 05:08:57

by Avi Kivity

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On 06/02/2010 07:59 AM, Tom Lyon wrote:
>
> This is just what I was thinking. But rather than a get/set, just use two fds.
>
> ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2);
>
> This may fail if there are really 2 different IOMMUs, so user code must be
> prepared for failure. In addition, this is strictly upwards compatible with
> what is there now, so maybe we can add it later.
>
>

What happens if one of the fds is later closed?

I don't like this conceptually. There is a 1:n relationship between the
memory map and the devices. Ignoring it will cause the API to have
warts. It's more straightforward to have an object to represent the
memory mapping (and talk to the iommus), and have devices bind to this
object.
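
Roughly (names invented for the example):

map = open("/dev/iommu", O_RDWR);       /* the memory-map object */
ioctl(map, IOMMU_MAP_RANGE, &range);    /* mappings are described once, here */
ioctl(vfio_fd1, VFIO_SET_IOMMU, map);   /* devices bind to the object; */
ioctl(vfio_fd2, VFIO_SET_IOMMU, map);   /* closing either device fd leaves the
                                           map (and the other device) intact */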


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-06-02 05:30:10

by Chris Wright

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

* Avi Kivity ([email protected]) wrote:
> On 06/02/2010 12:26 AM, Tom Lyon wrote:
> >
> >I'm not really opposed to multiple devices per domain, but let me point out how I
> >ended up here. First, the driver has two ways of mapping pages, one based on the
> >iommu api and one based on the dma_map_sg api. With the latter, the system
> >already allocates a domain per device and there's no way to control it. This was
> >presumably done to help isolation between drivers. If there are multiple drivers
> >in the user level, do we not want the same isolation to apply to them?
>
> In the case of kvm, we don't want isolation between devices, because
> that doesn't happen on real hardware.

Sure it does. That's exactly what happens when there's an iommu
involved with bare metal.

> So if the guest programs
> devices to dma to each other, we want that to succeed.

And it will as long as ATS is enabled (this is a basic requirement
for PCIe peer-to-peer traffic to succeed with an iommu involved on
bare metal).

That's how things currently are, i.e. we put all devices belonging to a
single guest in the same domain. However, it can be useful to put each
device belonging to a guest in a unique domain. Especially as qemu
grows support for iommu emulation, and guest OSes begin to understand
how to use a hw iommu.

> >Also, domains are not a very scarce resource - my little core i5 has 256,
> >and the intel architecture goes to 64K.
>
> But there is a 0.2% of mapped memory per domain cost for the page
> tables. For the kvm use case, that could be significant since a
> guest may have large amounts of memory and large numbers of assigned
> devices.
>
> >And then there's the fact that it is possible to have multiple disjoint iommus on a system,
> >so it may not even be possible to bring 2 devices under one domain.
>
> That's indeed a deficiency.

Not sure it's a deficiency. Typically to share page table mappings
across multiple iommu's you just have to do update/invalidate to each
hw iommu that is sharing the mapping. Alternatively, you can use more
memory and build/maintain identical mappings (as Tom alludes to below).

> >Given all that, I am inclined to leave it alone until someone has a real problem.
> >Note that not sharing iommu domains doesn't mean you can't share device memory,
> >just that you have to do multiple mappings
>
> I think we do have a real problem (though a mild one).
>
> The only issue I see with deferring the solution is that the API
> becomes gnarly; both the kernel and userspace will have to support
> both APIs forever. Perhaps we can implement the new API but defer
> the actual sharing until later, don't know how much work this saves.
> Or Alex/Chris can pitch in and help.

It really shouldn't be that complicated to create the API to allow for
flexible device <-> domain mappings, so I agree, it makes sense to do it
right up front.

thanks,
-chris

2010-06-02 05:40:29

by Avi Kivity

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On 06/02/2010 08:29 AM, Chris Wright wrote:
> * Avi Kivity ([email protected]) wrote:
>
>> On 06/02/2010 12:26 AM, Tom Lyon wrote:
>>
>>> I'm not really opposed to multiple devices per domain, but let me point out how I
>>> ended up here. First, the driver has two ways of mapping pages, one based on the
>>> iommu api and one based on the dma_map_sg api. With the latter, the system
>>> already allocates a domain per device and there's no way to control it. This was
>>> presumably done to help isolation between drivers. If there are multiple drivers
>>> in the user level, do we not want the same isolation to apply to them?
>>>
>> In the case of kvm, we don't want isolation between devices, because
>> that doesn't happen on real hardware.
>>
> Sure it does. That's exactly what happens when there's an iommu
> involved with bare metal.
>

But we are emulating a machine without an iommu.

When we emulate a machine with an iommu, then yes, we'll want to use as
many domains as the guest does.

>> So if the guest programs
>> devices to dma to each other, we want that to succeed.
>>
> And it will as long as ATS is enabled (this is a basic requirement
> for PCIe peer-to-peer traffic to succeed with an iommu involved on
> bare metal).
>
> That's how things currently are, i.e. we put all devices belonging to a
> single guest in the same domain. However, it can be useful to put each
> device belonging to a guest in a unique domain. Especially as qemu
> grows support for iommu emulation, and guest OSes begin to understand
> how to use a hw iommu.
>

Right, we need to keep flexibility.

>>> And then there's the fact that it is possible to have multiple disjoint iommus on a system,
>>> so it may not even be possible to bring 2 devices under one domain.
>>>
>> That's indeed a deficiency.
>>
> Not sure it's a deficiency. Typically to share page table mappings
> across multiple iommu's you just have to do update/invalidate to each
> hw iommu that is sharing the mapping. Alternatively, you can use more
> memory and build/maintain identical mappings (as Tom alludes to below).
>

Sharing the page tables is just an optimization; I was worried about
devices in separate domains not talking to each other. If ATS fixes
that, great.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-06-02 09:42:05

by Joerg Roedel

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Tue, Jun 01, 2010 at 12:55:32PM +0300, Michael S. Tsirkin wrote:

> There seems to be some misunderstanding. The userspace interface
> proposed forces a separate domain per device and forces userspace to
> repeat iommu programming for each device. We are better off sharing a
> domain between devices and programming the iommu once.
>
> The natural way to do this is to have an iommu driver for programming
> iommu.

IMO a separate iommu-userspace driver is a nightmare for a userspace
interface. It is just too complicated to use. We can solve the problem
of multiple devices-per-domain with an ioctl which allows binding one
uio-device to the address space of another. That's much simpler.

Joerg

2010-06-02 09:45:29

by Joerg Roedel

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Tue, Jun 01, 2010 at 03:41:55PM +0300, Avi Kivity wrote:
> On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote:

>> Main difference is that vhost works fine with unlocked
>> memory, paging it in on demand. iommu needs to unmap
>> memory when it is swapped out or relocated.
>>
> So you'd just take the memory map and not pin anything. This way you
> can reuse the memory map.
>
> But no, it doesn't handle the dirty bitmap, so no go.

IOMMU mapped memory can not be swapped out because we can't do demand
paging on io-page-faults with current devices. We have to pin _all_
userspace memory that is mapped into an IOMMU domain.

Joerg

2010-06-02 09:50:07

by Avi Kivity

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On 06/02/2010 12:45 PM, Joerg Roedel wrote:
> On Tue, Jun 01, 2010 at 03:41:55PM +0300, Avi Kivity wrote:
>
>> On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote:
>>
>
>>> Main difference is that vhost works fine with unlocked
>>> memory, paging it in on demand. iommu needs to unmap
>>> memory when it is swapped out or relocated.
>>>
>>>
>> So you'd just take the memory map and not pin anything. This way you
>> can reuse the memory map.
>>
>> But no, it doesn't handle the dirty bitmap, so no go.
>>
> IOMMU mapped memory can not be swapped out because we can't do demand
> paging on io-page-faults with current devices. We have to pin _all_
> userspace memory that is mapped into an IOMMU domain.
>

vhost doesn't pin memory.

What I proposed is to describe the memory map using an object (fd), and
pass it around to clients that use it: kvm, vhost, vfio. That way you
maintain the memory map in a central location and broadcast changes to
clients. Only a vfio client would result in memory being pinned.

It can still work, but the interface needs to be extended to include
dirty bitmap logging.

--
error compiling committee.c: too many arguments to function

2010-06-02 09:50:51

by Avi Kivity

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On 06/02/2010 12:42 PM, Joerg Roedel wrote:
> On Tue, Jun 01, 2010 at 12:55:32PM +0300, Michael S. Tsirkin wrote:
>
>
>> There seems to be some misunderstanding. The userspace interface
>> proposed forces a separate domain per device and forces userspace to
>> repeat iommu programming for each device. We are better off sharing a
>> domain between devices and programming the iommu once.
>>
>> The natural way to do this is to have an iommu driver for programming
>> iommu.
>>
> IMO a seperate iommu-userspace driver is a nightmare for a userspace
> interface. It is just too complicated to use. We can solve the problem
> of multiple devices-per-domain with an ioctl which allows binding one
> uio-device to the address-space on another. Thats much simpler.
>

This is non-trivial with hotplug.

--
error compiling committee.c: too many arguments to function

2010-06-02 09:53:56

by Joerg Roedel

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Tue, Jun 01, 2010 at 09:59:40PM -0700, Tom Lyon wrote:
> This is just what I was thinking. But rather than a get/set, just use two fds.
>
> ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2);
>
> This may fail if there are really 2 different IOMMUs, so user code must be
> prepared for failure. In addition, this is strictly upwards compatible with
> what is there now, so maybe we can add it later.

How can this fail with multiple IOMMUs? This should be handled
transparently by the IOMMU driver.

Joerg

2010-06-02 09:57:39

by Michael S. Tsirkin

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote:
> On Tue, Jun 01, 2010 at 12:55:32PM +0300, Michael S. Tsirkin wrote:
>
> > There seems to be some misunderstanding. The userspace interface
> > proposed forces a separate domain per device and forces userspace to
> > repeat iommu programming for each device. We are better off sharing a
> > domain between devices and programming the iommu once.
> >
> > The natural way to do this is to have an iommu driver for programming
> > iommu.
>
> IMO a seperate iommu-userspace driver is a nightmare for a userspace
> interface. It is just too complicated to use.

One advantage would be that we can reuse the uio framework
for the devices themselves. So an existing app can just program
an iommu for DMA and keep using uio for interrupts and access.

> We can solve the problem
> of multiple devices-per-domain with an ioctl which allows binding one
> uio-device to the address-space on another.

This would imply switching an iommu domain for a device while
it could potentially be doing DMA. No idea whether this can be done
in a safe manner.
Forcing iommu assignment to be done as a first step seems much saner.


> Thats much simpler.
>
> Joerg


So instead of
dev = open();
ioctl(dev, ASSIGN, iommu)
mmap

and if we forget the ioctl, mmap will fail
we have

dev = open();
if (ndevices > 0)
ioctl(devices[0], ASSIGN, dev)
mmap

And if we forget the ioctl we get errors from the device.
Seems more complicated to me.


There will also always be the confusion: which device's address space
are we modifying? With a separate driver for the iommu, we can safely
check that the binding is done correctly.

--
MST

2010-06-02 10:04:08

by Joerg Roedel

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 12:49:28PM +0300, Avi Kivity wrote:
> On 06/02/2010 12:45 PM, Joerg Roedel wrote:
>> IOMMU mapped memory can not be swapped out because we can't do demand
>> paging on io-page-faults with current devices. We have to pin _all_
>> userspace memory that is mapped into an IOMMU domain.
>
> vhost doesn't pin memory.
>
> What I proposed is to describe the memory map using an object (fd), and
> pass it around to clients that use it: kvm, vhost, vfio. That way you
> maintain the memory map in a central location and broadcast changes to
> clients. Only a vfio client would result in memory being pinned.

Ah ok, so it's only about the database which keeps the mapping
information.

> It can still work, but the interface needs to be extended to include
> dirty bitmap logging.

That's hard to do. I am not sure about VT-d, but the AMD IOMMU has no
dirty-bits in the page-table. And without demand-paging we can't really
tell what pages a device has written to. The only choice is to mark all
IOMMU-mapped pages dirty as long as they are mapped.

Joerg

2010-06-02 10:14:29

by Michael S. Tsirkin

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 12:04:04PM +0200, Joerg Roedel wrote:
> On Wed, Jun 02, 2010 at 12:49:28PM +0300, Avi Kivity wrote:
> > On 06/02/2010 12:45 PM, Joerg Roedel wrote:
> >> IOMMU mapped memory can not be swapped out because we can't do demand
> >> paging on io-page-faults with current devices. We have to pin _all_
> >> userspace memory that is mapped into an IOMMU domain.
> >
> > vhost doesn't pin memory.
> >
> > What I proposed is to describe the memory map using an object (fd), and
> > pass it around to clients that use it: kvm, vhost, vfio. That way you
> > maintain the memory map in a central location and broadcast changes to
> > clients. Only a vfio client would result in memory being pinned.
>
> Ah ok, so its only about the database which keeps the mapping
> information.
>
> > It can still work, but the interface needs to be extended to include
> > dirty bitmap logging.
>
> Thats hard to do. I am not sure about VT-d but the AMD IOMMU has no
> dirty-bits in the page-table. And without demand-paging we can't really
> tell what pages a device has written to. The only choice is to mark all
> IOMMU-mapped pages dirty as long as they are mapped.
>
> Joerg

Or mark them dirty when they are unmapped.

--
MST

2010-06-02 10:19:43

by Joerg Roedel

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 12:53:12PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote:

> > IMO a seperate iommu-userspace driver is a nightmare for a userspace
> > interface. It is just too complicated to use.
>
> One advantage would be that we can reuse the uio framework
> for the devices themselves. So an existing app can just program
> an iommu for DMA and keep using uio for interrupts and access.

The driver is called UIO and not U-INTR-MMIO ;-) So I think handling
IOMMU mappings belongs there.

> > We can solve the problem
> > of multiple devices-per-domain with an ioctl which allows binding one
> > uio-device to the address-space on another.
>
> This would imply switching an iommu domain for a device while
> it could potentially be doing DMA. No idea whether this can be done
> in a safe manner.

It can. The worst thing that can happen is an io-page-fault.

> Forcing iommu assignment to be done as a first step seems much saner.

If we force it, there is no reason not to do it implicitly.

We can do something like this then:

dev1 = open();
ioctl(dev1, IOMMU_MAP, ...);    /* creates IOMMU domain and assigns dev1 to it */

dev2 = open();
ioctl(dev2, IOMMU_MAP, ...);

/* Now dev1 and dev2 are in separate domains */

ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mappings for dev2 and assigns
                                   it to the same domain as dev1. Domain has a
                                   refcount of two now */

close(dev1); /* domain refcount goes down to one */
close(dev2); /* domain refcount is zero and domain gets destroyed */


Joerg

2010-06-02 10:20:04

by Michael S. Tsirkin

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 11:45:27AM +0200, Joerg Roedel wrote:
> On Tue, Jun 01, 2010 at 03:41:55PM +0300, Avi Kivity wrote:
> > On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote:
>
> >> Main difference is that vhost works fine with unlocked
> >> memory, paging it in on demand. iommu needs to unmap
> >> memory when it is swapped out or relocated.
> >>
> > So you'd just take the memory map and not pin anything. This way you
> > can reuse the memory map.
> >
> > But no, it doesn't handle the dirty bitmap, so no go.
>
> IOMMU mapped memory can not be swapped out because we can't do demand
> paging on io-page-faults with current devices. We have to pin _all_
> userspace memory that is mapped into an IOMMU domain.
>
> Joerg


One of the issues I see with the current patch is that
it uses the mlock rlimit to do this pinning. So this wastes the rlimit
for an app that did mlockall already, and also consumes
this resource transparently, so an app might call mlock
on a small buffer and be surprised that it fails.
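
The accounting in question is roughly the usual locked-memory pattern (a
sketch of how such pinning is typically charged, not the exact code from
the patch):

down_write(&current->mm->mmap_sem);
locked = current->mm->locked_vm + npages;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
if (locked > lock_limit && !capable(CAP_IPC_LOCK))
        ret = -ENOMEM;          /* shares one budget with mlock()/mlockall() */
else
        current->mm->locked_vm = locked;
up_write(&current->mm->mmap_sem);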

Using mmu notifiers might help?


--
MST

2010-06-02 10:26:16

by Michael S. Tsirkin

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote:
> On Wed, Jun 02, 2010 at 12:53:12PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote:
>
> > > IMO a seperate iommu-userspace driver is a nightmare for a userspace
> > > interface. It is just too complicated to use.
> >
> > One advantage would be that we can reuse the uio framework
> > for the devices themselves. So an existing app can just program
> > an iommu for DMA and keep using uio for interrupts and access.
>
> The driver is called UIO and not U-INTR-MMIO ;-) So I think handling
> IOMMU mappings belongs there.
>
> > > We can solve the problem
> > > of multiple devices-per-domain with an ioctl which allows binding one
> > > uio-device to the address-space on another.
> >
> > This would imply switching an iommu domain for a device while
> > it could potentially be doing DMA. No idea whether this can be done
> > in a safe manner.
>
> It can. The worst thing that can happen is an io-page-fault.

devices might not be able to recover from this.

> > Forcing iommu assignment to be done as a first step seems much saner.
>
> If we force it, there is no reason why not doing it implicitly.

What you describe below does 3 ioctls for what can be done with 1.

> We can do something like this then:
>
> dev1 = open();
> ioctl(dev1, IOMMU_MAP, ...); /* creates IOMMU domain and assigns dev1 to
> it*/
>
> dev2 = open();
> ioctl(dev2, IOMMU_MAP, ...);
>
> /* Now dev1 and dev2 are in seperate domains */
>
> ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
> assigns it to the same domain as
> dev1. Domain has a refcount of two
> now */

Or maybe it destroys mapping for dev1?
How do you remember?

> close(dev1); /* domain refcount goes down to one */
> close(dev2); /* domain refcount is zero and domain gets destroyed */
>
>
> Joerg

Also, no way to unshare? That seems limiting.

--
MST

2010-06-02 10:26:34

by Joerg Roedel

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 01:15:34PM +0300, Michael S. Tsirkin wrote:
> One of the issues I see with the current patch is that
> it uses the mlock rlimit to do this pinning. So this wastes the rlimit
> for an app that did mlockall already, and also consumes
> this resource transparently, so an app might call mlock
> on a small buffer and be surprised that it fails.
>
> Using mmu notifiers might help?

MMU notifiers are problematic because they are designed for situations
where we can do demand paging. The invalidate_range_start and
invalidate_range_end functions are not only called on munmap, they also
run when mprotect is called (in which case we don't want to tear down
iommu mappings). So what may happen with mmu notifiers is that we
accidentally tear down iommu mappings. With demand-paging this is no
problem because the io-ptes could be re-faulted. But that does not work
here.
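
(For context, the hooks in question look roughly like this; details
abbreviated:)

static const struct mmu_notifier_ops my_notifier_ops = {
        /* called around any range invalidation: not only munmap, but also
           mprotect, migration, etc. - the callback can't tell which */
        .invalidate_range_start = my_invalidate_range_start,
        .invalidate_range_end   = my_invalidate_range_end,
};
static struct mmu_notifier my_notifier = { .ops = &my_notifier_ops };

mmu_notifier_register(&my_notifier, current->mm);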

Joerg

2010-06-02 10:35:19

by Joerg Roedel

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 01:21:44PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote:

> > It can. The worst thing that can happen is an io-page-fault.
>
> devices might not be able to recover from this.

With the userspace interface a process can create io-page-faults
anyway if it wants. We can't protect against this. And the process is
also responsible for not tearing down iommu-mappings that are currently
in use.

> What you describe below does 3 ioctls for what can be done with 1.

The second IOMMU_MAP ioctl is just to show that existing mappings would
be destroyed if the device is assigned to another address space. Not
strictly necessary. So we have two ioctls but save one call to create
the iommu-domain.

> > ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
> > assigns it to the same domain as
> > dev1. Domain has a refcount of two
> > now */
>
> Or maybe it destroys mapping for dev1?
> How do you remember?

Because we express here that "dev2 shares the iommu mappings of dev1".
That's easy to remember.

> Also, no way to unshare? That seems limiting.

Just left out for simplicity reasons. An IOMMU_UNBIND (no IOMMU_UNSHARE
because that would require a second parameter) ioctl is certainly also
required.

Joerg

2010-06-02 10:43:17

by Michael S. Tsirkin

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 12:35:16PM +0200, Joerg Roedel wrote:
> On Wed, Jun 02, 2010 at 01:21:44PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote:
>
> > > It can. The worst thing that can happen is an io-page-fault.
> >
> > devices might not be able to recover from this.
>
> With the userspace interface a process can create io-page-faults
> anyway if it wants. We can't protect us from this.

We could fail all operations until an iommu is bound.
This will help catch bugs with access before setup. We can not do this
if a domain is bound by default.

> And the process is
> also responsible to not tear down iommu-mappings that are currently in
> use.
>
> > What you describe below does 3 ioctls for what can be done with 1.
>
> The second IOMMU_MAP ioctl is just to show that existing mappings would
> be destroyed if the device is assigned to another address space. Not
> strictly necessary. So we have two ioctls but save one call to create
> the iommu-domain.

With 10 devices you have 10 extra ioctls.

> > > ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
> > > assigns it to the same domain as
> > > dev1. Domain has a refcount of two
> > > now */
> >
> > Or maybe it destroys mapping for dev1?
> > How do you remember?
>
> Because we express here that "dev2 shares the iommu mappings of dev1".
> Thats easy to remember.

They both share the mappings. Which one gets its iommu mapping
destroyed (breaking the device if it is now doing DMA)?

> > Also, no way to unshare? That seems limiting.
>
> Just left out for simplicity reasons. An IOMMU_UNBIND (no IOMMU_UNSHARE
> because that would require a second parameter) ioctl is certainly also
> required.
>
> Joerg



--
MST

2010-06-02 10:49:16

by Michael S. Tsirkin

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote:
> On Wed, Jun 02, 2010 at 12:53:12PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote:
>
> > > IMO a seperate iommu-userspace driver is a nightmare for a userspace
> > > interface. It is just too complicated to use.
> >
> > One advantage would be that we can reuse the uio framework
> > for the devices themselves. So an existing app can just program
> > an iommu for DMA and keep using uio for interrupts and access.
>
> The driver is called UIO and not U-INTR-MMIO ;-) So I think handling
> IOMMU mappings belongs there.

Maybe it could be put there, but the patch posted did not use uio.
And one of the reasons is that the uio framework provides for
device access and interrupts but not for programming memory mappings.

Solutions (besides giving up on uio completely)
could include extending the framework in some way
(which was tried, but the result was not pretty) or adding
a separate driver for iommu and binding to that.

--
MST

2010-06-02 11:12:27

by Joerg Roedel

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 01:38:28PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 02, 2010 at 12:35:16PM +0200, Joerg Roedel wrote:

> > With the userspace interface a process can create io-page-faults
> > anyway if it wants. We can't protect us from this.
>
> We could fail all operations until an iommu is bound. This will help
> catch bugs with access before setup. We can not do this if a domain is
> bound by default.

Even if it is bound to a domain the userspace driver could program the
device to do dma to unmapped regions causing io-page-faults. The kernel
can't do anything about it.

> > The second IOMMU_MAP ioctl is just to show that existing mappings would
> > be destroyed if the device is assigned to another address space. Not
> > strictly necessary. So we have two ioctls but save one call to create
> > the iommu-domain.
>
> With 10 devices you have 10 extra ioctls.

And this works implicitly with your proposal? Remember that we still
need to be able to provide separate mappings for each device to support
IOMMU emulation for the guest. I think my proposal does not have any
extra costs.

> > Because we express here that "dev2 shares the iommu mappings of dev1".
> > Thats easy to remember.
>
> they both share the mappings. which one gets the iommu
> destroyed (breaking the device if it is now doing DMA)?

As I wrote the domain has a reference count and is destroyed only when
it goes down to zero. This does not happen as long as a device is bound
to it.

Joerg

2010-06-02 11:22:03

by Avi Kivity

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On 06/02/2010 01:04 PM, Joerg Roedel wrote:
> On Wed, Jun 02, 2010 at 12:49:28PM +0300, Avi Kivity wrote:
>
>> On 06/02/2010 12:45 PM, Joerg Roedel wrote:
>>
>>> IOMMU mapped memory can not be swapped out because we can't do demand
>>> paging on io-page-faults with current devices. We have to pin _all_
>>> userspace memory that is mapped into an IOMMU domain.
>>>
>> vhost doesn't pin memory.
>>
>> What I proposed is to describe the memory map using an object (fd), and
>> pass it around to clients that use it: kvm, vhost, vfio. That way you
>> maintain the memory map in a central location and broadcast changes to
>> clients. Only a vfio client would result in memory being pinned.
>>
> Ah ok, so its only about the database which keeps the mapping
> information.
>

Yes.

>
>> It can still work, but the interface needs to be extended to include
>> dirty bitmap logging.
>>
> Thats hard to do. I am not sure about VT-d but the AMD IOMMU has no
> dirty-bits in the page-table. And without demand-paging we can't really
> tell what pages a device has written to. The only choice is to mark all
> IOMMU-mapped pages dirty as long as they are mapped.
>
>

The interface would only work for clients which support it: kvm, vhost,
and iommu/devices with restartable dma.

Note dirty logging is not very interesting for vfio anyway, since you
can't live migrate with assigned devices.

--
error compiling committee.c: too many arguments to function

2010-06-02 11:25:52

by Michael S. Tsirkin

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:
> On Wed, Jun 02, 2010 at 01:38:28PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 02, 2010 at 12:35:16PM +0200, Joerg Roedel wrote:
>
> > > With the userspace interface a process can create io-page-faults
> > > anyway if it wants. We can't protect us from this.
> >
> > We could fail all operations until an iommu is bound. This will help
> > catch bugs with access before setup. We can not do this if a domain is
> > bound by default.
>
> Even if it is bound to a domain the userspace driver could program the
> device to do dma to unmapped regions causing io-page-faults. The kernel
> can't do anything about it.

It can always corrupt its own memory directly as well :)
But that is not a reason to skip detecting errors where we can,
or to avoid making APIs hard to misuse.

> > > The second IOMMU_MAP ioctl is just to show that existing mappings would
> > > be destroyed if the device is assigned to another address space. Not
> > > strictly necessary. So we have two ioctls but save one call to create
> > > the iommu-domain.
> >
> > With 10 devices you have 10 extra ioctls.
>
> And this works implicitly with your proposal?

Yes. So you do:
iommu = open
ioctl(dev1, BIND, iommu)
ioctl(dev2, BIND, iommu)
ioctl(dev3, BIND, iommu)
ioctl(dev4, BIND, iommu)

No need to add a SHARE ioctl.


> Remember that we still
> need to be able to provide seperate mappings for each device to support
> IOMMU emulation for the guest.

Generally not true. E.g. the guest can enable iommu passthrough
or have a domain per group of devices.

> I think my proposal does not have any
> extra costs.

With my proposal we have 1 ioctl per device + 1 per domain.
With yours we have 2 ioctls per device if the iommu is shared
and 1 if it is not shared.

As current apps share the iommu, it seems to make sense
to optimize for that.

> > > Because we express here that "dev2 shares the iommu mappings of dev1".
> > > Thats easy to remember.
> >
> > they both share the mappings. which one gets the iommu
> > destroyed (breaking the device if it is now doing DMA)?
>
> As I wrote the domain has a reference count and is destroyed only when
> it goes down to zero. This does not happen as long as a device is bound
> to it.
>
> Joerg

We were talking about an UNSHARE ioctl:
ioctl(dev1, UNSHARE, dev2)
Does it change the domain for dev1 or dev2?
If you make a mistake you get a hard-to-debug bug.

--
MST

2010-06-02 12:19:30

by Joerg Roedel

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:

> > Even if it is bound to a domain the userspace driver could program the
> > device to do dma to unmapped regions causing io-page-faults. The kernel
> > can't do anything about it.
>
> It can always corrupt its own memory directly as well :)
> But that is not a reason not to detect errors if we can,
> and not to make APIs hard to misuse.

Changing the domain of a device while dma can happen is the same type of
bug as unmapping potential dma target addresses. We can't catch this
kind of misuse.

> > > With 10 devices you have 10 extra ioctls.
> >
> > And this works implicitly with your proposal?
>
> Yes. so you do:
> iommu = open
> ioctl(dev1, BIND, iommu)
> ioctl(dev2, BIND, iommu)
> ioctl(dev3, BIND, iommu)
> ioctl(dev4, BIND, iommu)
>
> No need to add a SHARE ioctl.

In my proposal this looks like:


dev1 = open();
ioctl(dev2, SHARE, dev1);
ioctl(dev3, SHARE, dev1);
ioctl(dev4, SHARE, dev1);

So we actually save an ioctl.

> > Remember that we still need to be able to provide seperate mappings
> > for each device to support IOMMU emulation for the guest.
>
> Generally not true. E.g. guest can enable iommu passthrough
> or have domain per a group of devices.

What I meant was that there may be multiple io-address spaces
necessary for one process. I didn't want to say that every device
_needs_ to have its own address space.

> > As I wrote the domain has a reference count and is destroyed only when
> > it goes down to zero. This does not happen as long as a device is bound
> > to it.
> >
> > Joerg
>
> We were talking about UNSHARE ioctl:
> ioctl(dev1, UNSHARE, dev2)
> Does it change the domain for dev1 or dev2?
> If you make a mistake you get a hard to debug bug.

As I already wrote we would have an UNBIND ioctl which just removes a
device from its current domain. UNBIND is better than UNSHARE for
exactly the reason you pointed out above. I thought I stated that
already.

Joerg

2010-06-02 12:25:28

by Avi Kivity

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On 06/02/2010 03:19 PM, Joerg Roedel wrote:
>
>> Yes. so you do:
>> iommu = open
>> ioctl(dev1, BIND, iommu)
>> ioctl(dev2, BIND, iommu)
>> ioctl(dev3, BIND, iommu)
>> ioctl(dev4, BIND, iommu)
>>
>> No need to add a SHARE ioctl.
>>
> In my proposal this looks like:
>
>
> dev1 = open();
> ioctl(dev2, SHARE, dev1);
> ioctl(dev3, SHARE, dev1);
> ioctl(dev4, SHARE, dev1);
>
> So we actually save an ioctl.
>

The problem with this is that it is asymmetric: dev1 is treated
differently from dev[234]. It's an unintuitive API.

--
error compiling committee.c: too many arguments to function

2010-06-02 12:38:45

by Michael S. Tsirkin

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 02:19:28PM +0200, Joerg Roedel wrote:
> On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:
>
> > > Even if it is bound to a domain the userspace driver could program the
> > > device to do dma to unmapped regions causing io-page-faults. The kernel
> > > can't do anything about it.
> >
> > It can always corrupt its own memory directly as well :)
> > But that is not a reason not to detect errors if we can,
> > and not to make APIs hard to misuse.
>
> Changing the domain of a device while dma can happen is the same type of
> bug as unmapping potential dma target addresses. We can't catch this
> kind of misuse.

You normally need the device mapped to start DMA.
SHARE makes this bug more likely, as it allows
switching domains: the mmap could be done before switching.

> > > > With 10 devices you have 10 extra ioctls.
> > >
> > > And this works implicitly with your proposal?
> >
> > Yes. so you do:
> > iommu = open
> > ioctl(dev1, BIND, iommu)
> > ioctl(dev2, BIND, iommu)
> > ioctl(dev3, BIND, iommu)
> > ioctl(dev4, BIND, iommu)
> >
> > No need to add a SHARE ioctl.
>
> In my proposal this looks like:
>
>
> dev1 = open();
> ioctl(dev2, SHARE, dev1);
> ioctl(dev3, SHARE, dev1);
> ioctl(dev4, SHARE, dev1);
>
> So we actually save an ioctl.

I thought we had a BIND ioctl?

> > > Remember that we still need to be able to provide seperate mappings
> > > for each device to support IOMMU emulation for the guest.
> >
> > Generally not true. E.g. guest can enable iommu passthrough
> > or have domain per a group of devices.
>
> What I meant was that there may me multiple io-addresses spaces
> necessary for one process. I didn't want to say that every device
> _needs_ to have its own address space.
>
> > > As I wrote the domain has a reference count and is destroyed only when
> > > it goes down to zero. This does not happen as long as a device is bound
> > > to it.
> > >
> > > Joerg
> >
> > We were talking about UNSHARE ioctl:
> > ioctl(dev1, UNSHARE, dev2)
> > Does it change the domain for dev1 or dev2?
> > If you make a mistake you get a hard to debug bug.
>
> As I already wrote we would have an UNBIND ioctl which just removes a
> device from its current domain. UNBIND is better than UNSHARE for
> exactly the reason you pointed out above. I thought I stated that
> already.
>
> Joerg

You undo SHARE with UNBIND?

2010-06-02 12:50:53

by Joerg Roedel

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 03:25:11PM +0300, Avi Kivity wrote:
> On 06/02/2010 03:19 PM, Joerg Roedel wrote:
>>
>>> Yes. so you do:
>>> iommu = open
>>> ioctl(dev1, BIND, iommu)
>>> ioctl(dev2, BIND, iommu)
>>> ioctl(dev3, BIND, iommu)
>>> ioctl(dev4, BIND, iommu)
>>>
>>> No need to add a SHARE ioctl.
>>>
>> In my proposal this looks like:
>>
>>
>> dev1 = open();
>> ioctl(dev2, SHARE, dev1);
>> ioctl(dev3, SHARE, dev1);
>> ioctl(dev4, SHARE, dev1);
>>
>> So we actually save an ioctl.
>>
>
> The problem with this is that it is assymetric, dev1 is treated
> differently from dev[234]. It's an unintuitive API.

It's by far more unintuitive that a process needs to explicitly bind a
device to an iommu domain before it can do anything with it. If it's
required anyway, the binding can happen implicitly. We could allow a
no-op 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.

Note that this way of handling userspace iommu mappings is also a lot
simpler for most use-cases outside of KVM. If a developer wants to write
a userspace driver, all they need to do is:

dev = open();
ioctl(dev, MAP, ...);
/* use device with mappings */
close(dev);

Which is much easier than having to create a domain explicitly.

Joerg

2010-06-02 13:02:40

by Joerg Roedel

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 03:34:17PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 02, 2010 at 02:19:28PM +0200, Joerg Roedel wrote:

> you normally need device mapped to start DMA.
> SHARE makes this bug more likely as you allow
> switching domains: mmap could be done before switching.

We need to support domain switching anyway for iommu emulation in a
guest. So if you consider this to be a problem (I don't) it will not go
away with your proposal.

> > dev1 = open();
> > ioctl(dev2, SHARE, dev1);
> > ioctl(dev3, SHARE, dev1);
> > ioctl(dev4, SHARE, dev1);
> >
> > So we actually save an ioctl.
>
> I thought we had a BIND ioctl?

I can't remember a BIND ioctl in my proposal. I remember an UNBIND, but
that's bad naming, as you pointed out below. See my statement on this
below too.

> You undo SHARE with UNBIND?

That's bad naming, agreed. Let's keep UNSHARE. The point is, we only need
one parameter to do this, which removes any ambiguity:

ioctl(dev1, UNSHARE);

Joerg

2010-06-02 13:06:39

by Avi Kivity

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On 06/02/2010 03:50 PM, Joerg Roedel wrote:
>
>> The problem with this is that it is assymetric, dev1 is treated
>> differently from dev[234]. It's an unintuitive API.
>>
> Its by far more unintuitive that a process needs to explicitly bind a
> device to an iommu domain before it can do anything with it.

I don't really care about the iommu domain. It's a side effect. The
kernel takes care of it. I'm only worried about the API.

We have a memory map that is (often) the same for a set of devices. If
you were coding a non-kernel interface, how would you code it?

struct memory_map;
void memory_map_init(struct memory_map *mm, ...);
struct device;
void device_set_memory_map(struct device *device, struct memory_map *mm);

or

struct device;
void device_init_memory_map(struct device *dev, ...);
void device_clone_memory_map(struct device *dev, struct device *other);

I wouldn't even think of the second one personally.

> If its
> required anyway the binding can happen implicitly. We could allow to do
> a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.
>

It's still special. You define the memory map only for the first
device. You have to make sure dev1 doesn't go away while sharing it.

> Note that this way of handling userspace iommu mappings is also a lot
> simpler for most use-cases outside of KVM. If a developer wants to write
> a userspace driver all it needs to do is:
>
> dev = open();
> ioctl(dev, MAP, ...);
> /* use device with mappings */
> close(dev);
>
> Which is much easier than the need to create a domain explicitly.
>

mm = open()
ioctl(mm, MAP, ...)
dev = open();
ioctl(dev, BIND, mm);
...
close(mm);
close(dev);

so yes, more work, but once you have multiple devices which come and go
dynamically things become simpler. The map object has global lifetime
(you can even construct it if you don't assign any devices), the devices
attach to it, memory hotplug updates the memory map but doesn't touch
devices.
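
E.g. on memory hotplug only the map object is touched (same invented names as
in the sketch above):

ioctl(mm, MAP, &new_region);    /* every device bound to mm sees the new
                                   region; no per-device ioctls are needed */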

--
error compiling committee.c: too many arguments to function

2010-06-02 13:21:56

by Michael S. Tsirkin

Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 02:50:50PM +0200, Joerg Roedel wrote:
> On Wed, Jun 02, 2010 at 03:25:11PM +0300, Avi Kivity wrote:
> > On 06/02/2010 03:19 PM, Joerg Roedel wrote:
> >>
> >>> Yes. so you do:
> >>> iommu = open
> >>> ioctl(dev1, BIND, iommu)
> >>> ioctl(dev2, BIND, iommu)
> >>> ioctl(dev3, BIND, iommu)
> >>> ioctl(dev4, BIND, iommu)
> >>>
> >>> No need to add a SHARE ioctl.
> >>>
> >> In my proposal this looks like:
> >>
> >>
> >> dev1 = open();
> >> ioctl(dev2, SHARE, dev1);
> >> ioctl(dev3, SHARE, dev1);
> >> ioctl(dev4, SHARE, dev1);
> >>
> >> So we actually save an ioctl.
> >>
> >
> > The problem with this is that it is assymetric, dev1 is treated
> > differently from dev[234]. It's an unintuitive API.
>
> Its by far more unintuitive that a process needs to explicitly bind a
> device to an iommu domain before it can do anything with it.

The reason it is more intuitive is that it is harder to get it
wrong. If you swap iommu and device in the call, you get BADF
so you know you made a mistake. We can even make it work
both ways if we wanted to. With ioctl(dev1, BIND, dev2)
it breaks silently.


> If its
> required anyway the binding can happen implicitly. We could allow to do
> a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.

And then when we assign meaning to it we find that half the apps
are broken because they did not call this ioctl.

> Note that this way of handling userspace iommu mappings is also a lot
> simpler for most use-cases outside of KVM. If a developer wants to write
> a userspace driver all it needs to do is:
>
> dev = open();
> ioctl(dev, MAP, ...);
> /* use device with mappings */
> close(dev);
>
> Which is much easier than the need to create a domain explicitly.
>
> Joerg

This simple scenario ignores all the real-life corner cases.
For example, with an explicit iommu open and bind, the application
can naturally detect that:
- we have run out of iommu domains
- iommu is unsupported
- iommu is in use by another, incompatible device
- device is in bad state
because each is a separate operation, which makes it easy to produce
meaningful errors.
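
For instance (pseudocode again; the device node names and the exact errors
are only illustrative):

iommu = open("/dev/iommu", O_RDWR);
if (iommu < 0)
        /* iommu unsupported, or we have run out of iommu domains */;

dev = open("/dev/vfio0", O_RDWR);
if (dev < 0)
        /* device absent or in a bad state */;

if (ioctl(dev, BIND, iommu) < 0)
        /* this device is incompatible with the chosen domain */;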

Another interesting thing that a separate iommu device supports is when
application A controls the iommu and application B
controls the device. This might be good to e.g. improve security
(B is run by root, A is unprivileged and passes commands to/from B
over a pipe).

This is not possible when same fd is used for iommu and device.

--
MST

2010-06-02 13:53:19

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 04:06:21PM +0300, Avi Kivity wrote:
> On 06/02/2010 03:50 PM, Joerg Roedel wrote:

>> Its by far more unintuitive that a process needs to explicitly bind a
>> device to an iommu domain before it can do anything with it.
>
> I don't really care about the iommu domain. It's a side effect. The
> kernel takes care of it. I'm only worried about the API.

The proposed memory-map object is nothing but a userspace
abstraction of an iommu domain.

> We have a memory map that is (often) the same for a set of devices. If
> you were coding a non-kernel interface, how would you code it?
>
> struct memory_map;
> void memory_map_init(struct memory_map *mm, ...);
> struct device;
> void device_set_memory_map(struct device *device, struct memory_map *mm);
>
> or
>
> struct device;
> void device_init_memory_map(struct device *dev, ...);
> void device_clone_memory_map(struct device *dev, struct device *other);
>
> I wouldn't even think of the second one personally.

Right, a kernel-interface would be designed the first way. The IOMMU-API
is actually designed in this manner. But I still think we should keep it
simpler for userspace.

>> If its required anyway the binding can happen implicitly. We could
>> allow to do a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.
>
> It's still special. You define the memory map only for the first
> device. You have to make sure dev1 doesn't go away while sharing it.

Must be a misunderstanding. In my proposal the domain is not owned by
one device. It is owned by all devices that share it and will only
vanish if all devices that use it are unbound (which happens when the file
descriptor is closed, for example).
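
In kernel terms the shared domain would simply be reference counted by its
users, something like this (a sketch only, names made up, not code from any
posted patch):

struct shared_domain {
        struct iommu_domain *domain;
        struct kref users;              /* one reference per bound device */
};

static void shared_domain_release(struct kref *kref)
{
        struct shared_domain *sd = container_of(kref, struct shared_domain, users);

        iommu_domain_free(sd->domain);
        kfree(sd);
}

/* on SHARE/bind:    kref_get(&sd->users);
 * on unbind/close:  kref_put(&sd->users, shared_domain_release); */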

> so yes, more work, but once you have multiple devices which come and go
> dynamically things become simpler. The map object has global lifetime
> (you can even construct it if you don't assign any devices), the devices
> attach to it, memory hotplug updates the memory map but doesn't touch
> devices.

I still think a userspace interface should be as simple as possible. But
since both ways will work I am not really opposed to Michael's proposal.
I just think it's overkill for the common non-kvm use case (a userspace
device driver).

Joerg

2010-06-02 14:01:57

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 04:17:19PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 02, 2010 at 02:50:50PM +0200, Joerg Roedel wrote:
> > On Wed, Jun 02, 2010 at 03:25:11PM +0300, Avi Kivity wrote:
> > > On 06/02/2010 03:19 PM, Joerg Roedel wrote:

>
> > If its
> > required anyway the binding can happen implicitly. We could allow to do
> > a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.
>
> And then when we assign meaning to it we find that half the apps
> are broken because they did not call this ioctl.

The meaning is already assigned, and changing it means changing the
userspace ABI, which is a no-go.

> This simple scenario ignores all the real-life corner cases.
> For example, with an explicit iommu open and bind application
> can naturally detect that:
> - we have run out of iommu domains

ioctl(dev, MAP, ...) will fail in this case.

> - iommu is unsupported

Is best checked by open() anyway because userspace can't do anything
with the device before it is bound to a domain.

> - iommu is in use by another, incompatible device

How should this happen?

> - device is in bad state

How is this checked with your proposal, and why can it not be detected
with mine?

> because each is a separate operation, so it is easy to produce meaningful
> errors.

Ok, this is true.

> Another interesting thing that a separate iommu device supports is when
> application A controls the iommu and application B
> controls the device.

Until Linux becomes a micro-kernel the IOMMU itself will _never_ be
controlled by an application.

> This might be good to e.g. improve security (B is run by root, A is
> unpriveledged and passes commands to/from B over a pipe).

Micro-kernel arguments. I hope a userspace controlled IOMMU in Linux
will never happen ;-)

Joerg

2010-06-02 16:54:01

by Chris Wright

[permalink] [raw]
Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

* Avi Kivity ([email protected]) wrote:
> The interface would only work for clients which support it: kvm,
> vhost, and iommu/devices with restartable dma.

BTW, there is no such thing as restartable dma. There is a provision in
new specs (read: no real hardware) that allows a device to request pages
before using them. So it's akin to demand paging, but the demand is an
explicit request rather than a page fault.

thanks,
-chris

2010-06-02 17:46:51

by Chris Wright

[permalink] [raw]
Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

* Joerg Roedel ([email protected]) wrote:
> On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:
>
> > > Even if it is bound to a domain the userspace driver could program the
> > > device to do dma to unmapped regions causing io-page-faults. The kernel
> > > can't do anything about it.
> >
> > It can always corrupt its own memory directly as well :)
> > But that is not a reason not to detect errors if we can,
> > and not to make APIs hard to misuse.
>
> Changing the domain of a device while dma can happen is the same type of
> bug as unmapping potential dma target addresses. We can't catch this
> kind of misuse.
>
> > > > With 10 devices you have 10 extra ioctls.
> > >
> > > And this works implicitly with your proposal?
> >
> > Yes. so you do:
> > iommu = open
> > ioctl(dev1, BIND, iommu)
> > ioctl(dev2, BIND, iommu)
> > ioctl(dev3, BIND, iommu)
> > ioctl(dev4, BIND, iommu)
> >
> > No need to add a SHARE ioctl.
>
> In my proposal this looks like:
>
>
> dev1 = open();
> ioctl(dev2, SHARE, dev1);
> ioctl(dev3, SHARE, dev1);
> ioctl(dev4, SHARE, dev1);
>
> So we actually save an ioctl.

This is not a hot path, so saving an ioctl shouldn't be a consideration.
The only important consideration is a good API. I may have lost context here,
but the SHARE API is limited to the vfio fd. The BIND API expects a new
iommu object. Are there other uses for this object? Tom's current vfio
driver exposes a dma mapping interface, would the iommu object expose
one as well? The current interface is a device-specific DMA interface for
host device drivers (typically mapping in-flight dma buffers) and an
IOMMU-specific interface for assigned devices (typically mapping the entire
virtual address space).

thanks,
-chris

2010-06-02 18:12:38

by Tom Lyon

[permalink] [raw]
Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wednesday 02 June 2010 10:46:15 am Chris Wright wrote:
> * Joerg Roedel ([email protected]) wrote:
> > On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote:
> > > On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:
> >
> > > > Even if it is bound to a domain the userspace driver could program the
> > > > device to do dma to unmapped regions causing io-page-faults. The kernel
> > > > can't do anything about it.
> > >
> > > It can always corrupt its own memory directly as well :)
> > > But that is not a reason not to detect errors if we can,
> > > and not to make APIs hard to misuse.
> >
> > Changing the domain of a device while dma can happen is the same type of
> > bug as unmapping potential dma target addresses. We can't catch this
> > kind of misuse.
> >
> > > > > With 10 devices you have 10 extra ioctls.
> > > >
> > > > And this works implicitly with your proposal?
> > >
> > > Yes. so you do:
> > > iommu = open
> > > ioctl(dev1, BIND, iommu)
> > > ioctl(dev2, BIND, iommu)
> > > ioctl(dev3, BIND, iommu)
> > > ioctl(dev4, BIND, iommu)
> > >
> > > No need to add a SHARE ioctl.
> >
> > In my proposal this looks like:
> >
> >
> > dev1 = open();
> > ioctl(dev2, SHARE, dev1);
> > ioctl(dev3, SHARE, dev1);
> > ioctl(dev4, SHARE, dev1);
> >
> > So we actually save an ioctl.
>
> This is not any hot path, so saving an ioctl shouldn't be a consideration.
> Only important consideration is a good API. I may have lost context here,
> but the SHARE API is limited to the vfio fd. The BIND API expects a new
> iommu object. Are there other uses for this object? Tom's current vfio
> driver exposes a dma mapping interface, would the iommu object expose
> one as well? Current interface is device specific DMA interface for
> host device drivers typically mapping in-flight dma buffers, and IOMMU
> specific interface for assigned devices typically mapping entire virtual
> address space.

Actually, it's a domain object - which may be usable among iommus (Joerg?).
However, you can't really do the dma mapping with just the domain because
every device supports a different size address space as a master, i.e.,
the dma_mask.

And I don't know how kvm would deal with devices with varying dma mask support,
or why they'd be in the same domain.

2010-06-02 19:46:33

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jun 02, 2010 at 11:09:17AM -0700, Tom Lyon wrote:
> On Wednesday 02 June 2010 10:46:15 am Chris Wright wrote:

> > This is not any hot path, so saving an ioctl shouldn't be a consideration.
> > Only important consideration is a good API. I may have lost context here,
> > but the SHARE API is limited to the vfio fd. The BIND API expects a new
> > iommu object. Are there other uses for this object? Tom's current vfio
> > driver exposes a dma mapping interface, would the iommu object expose
> > one as well? Current interface is device specific DMA interface for
> > host device drivers typically mapping in-flight dma buffers, and IOMMU
> > specific interface for assigned devices typically mapping entire virtual
> > address space.
>
> Actually, it a domain object - which may be usable among iommus (Joerg?).

Yes, this 'iommu' thing would be a domain object. But sharing among
iommus is not necessary because the fact that there are multiple iommus
in the system is hidden by the iommu drivers. This fact is not even
exposed by the iommu-api. This makes protection domains system global.

> However, you can't really do the dma mapping with just the domain because
> every device supports a different size address space as a master, i.e.,
> the dma_mask.

The dma_mask has to be handled by the device driver. With the
iommu-mapping interface the driver can specify the target io-address and
has to consider the dma_mask for that too.
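
For example, a userspace driver could simply refuse io-addresses its device
cannot master (a sketch; MAP_IOVA is the ioctl discussed in this thread, the
struct layout and variable names are made up):

/* dma_mask of the device, e.g. 0xffffffffULL for a 32-bit DMA master */
unsigned long long dma_mask = 0xffffffffULL;

if (iova + size - 1 > dma_mask)
        return -1;                      /* pick a lower iova instead */

struct { unsigned long long vaddr, iova, size; } req = {
        (unsigned long long)(unsigned long)buf, iova, size,
};
ioctl(devfd, VFIO_DMA_MAP_IOVA, &req);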

> And I don't know how kvm would deal with devices with varying dma mask support,
> or why they'd be in the same domain.

KVM does not care about these masks. This is the business of the guest
device drivers.

Joerg

2010-06-03 06:23:53

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On 06/02/2010 08:46 PM, Chris Wright wrote:
> The BIND API expects a new
> iommu object. Are there other uses for this object?

Both kvm and vhost use similar memory maps, so they could use the new
object (without invoking the iommu unless they want dma).

> Tom's current vfio
> driver exposes a dma mapping interface, would the iommu object expose
> one as well? Current interface is device specific DMA interface for
> host device drivers typically mapping in-flight dma buffers, and IOMMU
> specific interface for assigned devices typically mapping entire virtual
> address space.
>

A per-request mapping sounds like a device API since it would only
affect that device (whereas the address space API affects multiple devices).

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-06-03 21:45:00

by Tom Lyon

[permalink] [raw]
Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

OK, in the interest of making progress, I am about to embark on the following:

1. Create a user-iommu-domain driver - opening it will give a new empty domain.
Ultimately this can also populate sysfs with the state of its world, which would
be a good addition to the base iommu stuff.
If someone closes the fd while in use, the domain stays valid anyway until users
drop off.

2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver. Require that
a domain be set before using the VFIO_DMA_MAP_IOVA ioctl (this is the one
that KVM wants). However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one
which uses the dma_sg interface, which has no explicit control of domains. I
intend to keep it the way it is, but expect that only non-hypervisor programs
would want to use it (a rough sketch of the combined flow follows this list).

3. Clean up the docs and other nits that folks have found.
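
A rough sketch of how points 1 and 2 might look from userspace (the device
node names, ioctl constants and struct layout here are all illustrative,
nothing is a final ABI):

domfd = open("/dev/uiommu", O_RDWR);            /* 1. new empty domain */

devfd = open("/dev/vfio0", O_RDWR);
ioctl(devfd, DOMAIN_SET, &domfd);               /* 2. bind before DMA setup */

struct { unsigned long long vaddr, iova, size; } dma = {
        (unsigned long long)(unsigned long)buf, iova, len,
};
ioctl(devfd, VFIO_DMA_MAP_IOVA, &dma);          /* map through the domain */

/* ... use the device ... */

ioctl(devfd, DOMAIN_UNSET, 0);
close(devfd);
close(domfd);                                   /* domain stays valid until
                                                   the last user drops off */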

Comments?

2010-06-06 09:59:26

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Thu, Jun 03, 2010 at 02:41:38PM -0700, Tom Lyon wrote:
> OK, in the interest of making progress, I am about to embark on the following:
>
> 1. Create a user-iommu-domain driver - opening it will give a new empty domain.
> Ultimately this can also populate sysfs with the state of its world, which would
> also be a good addition to the base iommu stuff.
> If someone closes the fd while in use, the domain stays valid anyway until users
> drop off.
>
> 2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver. Require that
> a domain be set before using the VFIO_DMA_MAP_IOVA ioctl

Require domain to be set before you allow any access to the device:
mmap, write, read. IMO this is the only safe way to make sure userspace
does not corrupt memory, and this removes the need to special-case
MSI memory, play with bus master enable and hope it can be cleared without
reset, etc.
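
In other words, every file operation on the vfio fd would start with the same
check, roughly like this (kernel-side sketch only; the structure and helper
names are made up, this is not code from the patch):

static int vfio_mmap(struct file *filep, struct vm_area_struct *vma)
{
        struct vfio_dev *vdev = filep->private_data;

        if (!vdev->domain)              /* no iommu domain bound yet */
                return -EINVAL;

        return vfio_do_mmap(vdev, vma); /* the normal BAR mapping path */
}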

> (this is the one
> that KVM wants).

Not sure I understand. I think that MAP should be done on the domain,
not the device; this handles pinning pages correctly, and
this way you don't need any special checks.

> However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one
> which uses the dma_sg interface which has no expicit control of domains. I
> intend to keep it the way it is, but expect only non-hypervisor programs would
> want to use it.

If we support MAP_IOVA, why is MAP_ANYWHERE useful? Can't
non-hypervisors just pick an address?
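
For a plain userspace driver, "picking an address" can be as trivial as
reusing the buffer's virtual address as the iova (a sketch; the struct layout
is made up, and the value still has to fit the device's dma_mask):

struct { unsigned long long vaddr, iova, size; } dma = {
        (unsigned long long)(unsigned long)buf,
        (unsigned long long)(unsigned long)buf, /* iova == vaddr */
        len,
};
ioctl(devfd, VFIO_DMA_MAP_IOVA, &dma);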

> 3. Clean up the docs and other nits that folks have found.
>
> Comments?

2010-06-06 13:44:52

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On 06/02/2010 07:53 PM, Chris Wright wrote:
> * Avi Kivity ([email protected]) wrote:
>
>> The interface would only work for clients which support it: kvm,
>> vhost, and iommu/devices with restartable dma.
>>
> BTW, there is no such thing as restartable dma. There is a provision in
> new specs (read: no real hardware) that allows a device to request pages
> before using them. So it's akin to demand paging, but the demand is an
> explicit request rather than a page fault.
>

Thanks for the correction.

--
error compiling committee.c: too many arguments to function

2010-06-07 19:04:29

by Tom Lyon

[permalink] [raw]
Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Sunday 06 June 2010 02:54:51 am Michael S. Tsirkin wrote:
> On Thu, Jun 03, 2010 at 02:41:38PM -0700, Tom Lyon wrote:
> > OK, in the interest of making progress, I am about to embark on the following:
> >
> > 1. Create a user-iommu-domain driver - opening it will give a new empty domain.
> > Ultimately this can also populate sysfs with the state of its world, which would
> > also be a good addition to the base iommu stuff.
> > If someone closes the fd while in use, the domain stays valid anyway until users
> > drop off.
> >
> > 2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver. Require that
> > a domain be set before using the VFIO_DMA_MAP_IOVA ioctl
>
> Require domain to be set before you allow any access to the device:
> mmap, write, read. IMO this is the only safe way to make sure userspace
> does not corrupt memory, and this removes the need to special-case
> MSI memory, play with bus master enable and hope it can be cleared without
> reset, etc.

Michael - the light bulb finally lit for me and I now understand what you've been
saying the past few weeks. Of course you're right - we need iommu set before any
register access. I had thought that was done by default but now I see that the
dma_map_sg routine only attaches to the iommu on demand.

So I will torpedo the MAP_ANYWHERE stuff. I'd like to keep the MAP_IOVA ioctl
with the vfio fd so that the user can still do everything with one fd. I'm thinking the
fd opens and iommu bindings could be done in a wrapper program before spinning
off the program that runs the user driver.
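
A sketch of that split (paths and ioctl names are illustrative; the point is
only that already-bound fds survive fork/exec):

/* launcher, possibly privileged: open and bind everything up front */
domfd = open("/dev/uiommu", O_RDWR);
devfd = open("/dev/vfio0", O_RDWR);
ioctl(devfd, DOMAIN_SET, &domfd);

if (fork() == 0) {
        char fdarg[16];

        snprintf(fdarg, sizeof(fdarg), "%d", devfd);
        /* the driver inherits the already-bound fd (no O_CLOEXEC set) */
        execl("./userdriver", "userdriver", fdarg, (char *)NULL);
}
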
>
> > (this is the one
> > that KVM wants).
>
> Not sure I understand. I think that MAP should be done on the domain,
> not the device, this handles pinning pages correctly and
> this way you don't need any special checks.
>
> > However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one
> > which uses the dma_sg interface which has no expicit control of domains. I
> > intend to keep it the way it is, but expect only non-hypervisor programs would
> > want to use it.
>
> If we support MAP_IOVA, why is MAP_ANYWHERE useful? Can't
> non-hypervisors just pick an address?
>
> > 3. Clean up the docs and other nits that folks have found.
> >
> > Comments?
>

2010-06-08 21:27:39

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

On Mon, Jun 07, 2010 at 12:01:04PM -0700, Tom Lyon wrote:
> On Sunday 06 June 2010 02:54:51 am Michael S. Tsirkin wrote:
> > On Thu, Jun 03, 2010 at 02:41:38PM -0700, Tom Lyon wrote:
> > > OK, in the interest of making progress, I am about to embark on the following:
> > >
> > > 1. Create a user-iommu-domain driver - opening it will give a new empty domain.
> > > Ultimately this can also populate sysfs with the state of its world, which would
> > > also be a good addition to the base iommu stuff.
> > > If someone closes the fd while in use, the domain stays valid anyway until users
> > > drop off.
> > >
> > > 2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver. Require that
> > > a domain be set before using the VFIO_DMA_MAP_IOVA ioctl
> >
> > Require domain to be set before you allow any access to the device:
> > mmap, write, read. IMO this is the only safe way to make sure userspace
> > does not corrupt memory, and this removes the need to special-case
> > MSI memory, play with bus master enable and hope it can be cleared without
> > reset, etc.
>
> Michael - the light bulb finally lit for me and I now understand what you've been
> saying the past few weeks. Of course you're right - we need iommu set before any
> register access. I had thought that was done by default but now I see that the
> dma_map_sg routine only attaches to the iommu on demand.
>
> So I will torpedo the MAP_ANYWHERE stuff. I'd like to keep the MAP_IOVA ioctl
> with the vfio fd so that the user can still do everything with one fd. I'm thinking the
> fd opens and iommu bindings could be done in a program before spinning out the
> program with the user driver.

This would kind of break Avi's idea that mappings are programmed
at the domain and shared by multiple devices, wouldn't it?

> >
> > > (this is the one
> > > that KVM wants).
> >
> > Not sure I understand. I think that MAP should be done on the domain,
> > not the device, this handles pinning pages correctly and
> > this way you don't need any special checks.
> >
> > > However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one
> > > which uses the dma_sg interface which has no expicit control of domains. I
> > > intend to keep it the way it is, but expect only non-hypervisor programs would
> > > want to use it.
> >
> > If we support MAP_IOVA, why is MAP_ANYWHERE useful? Can't
> > non-hypervisors just pick an address?
> >
> > > 3. Clean up the docs and other nits that folks have found.
> > >
> > > Comments?
> >
>