2015-11-08 10:37:58

by Michael S. Tsirkin

Subject: Re: [PATCH v3 0/3] virtio DMA API core stuff

On Thu, Oct 29, 2015 at 05:18:56PM +0100, David Woodhouse wrote:
> On Thu, 2015-10-29 at 11:01 +0200, Michael S. Tsirkin wrote:
> >
> > Example: you have a mix of assigned devices and virtio devices. You
> > don't trust your assigned device vendor not to corrupt your memory so
> > you want to limit the damage your assigned device can do to your
> > guest,
> > so you use an IOMMU for that. Thus existing iommu=pt within guest is
> > out.
> >
> > But you trust your hypervisor (you have no choice anyway),
> > and you don't want the overhead of tweaking IOMMU
> > on data path for virtio. Thus iommu=on is out too.
>
> That's not at all special for virtio or guest VMs. Even with real
> hardware, we might want performance from *some* devices, and security
> from others. See the DMA_ATTR_IOMMU_BYPASS which is currently being
> discussed.

Right. So let's wait for that discussion to play out?

> But of course the easy answer in *your* case it just to ask the
> hypervisor not to put the virtio devices behind an IOMMU at all. Which
> we were planning to remain the default behaviour.

One can't do this for x86 ATM, can one?

> In all cases, the DMA API shall do the right thing.

I have no problem with that. For example, can we teach
the DMA API on intel x86 to use PT for virtio by default?
That would allow merging Andy's patches with
full compatibility with old guests and hosts.

> --
> dwmw2
>
>



--
MST


2015-11-08 11:49:54

by Jörg Rödel

Subject: Re: [PATCH v3 0/3] virtio DMA API core stuff

On Sun, Nov 08, 2015 at 12:37:47PM +0200, Michael S. Tsirkin wrote:
> I have no problem with that. For example, can we teach
> the DMA API on intel x86 to use PT for virtio by default?
> That would allow merging Andy's patches with
> full compatibility with old guests and hosts.

Well, the only incompatibility comes from an experimental qemu feature,
more specifically from a bug in that feature's implementation. So why
should we work around that in the kernel? I think it is not too hard to
fix qemu to generate a correct DMAR table which excludes the virtio
devices from iommu translation.


Joerg

2015-11-08 12:00:47

by David Woodhouse

Subject: Re: [PATCH v3 0/3] virtio DMA API core stuff

On Sun, 2015-11-08 at 12:37 +0200, Michael S. Tsirkin wrote:
> On Thu, Oct 29, 2015 at 05:18:56PM +0100, David Woodhouse wrote:
> > On Thu, 2015-10-29 at 11:01 +0200, Michael S. Tsirkin wrote:
> > >
> > > But you trust your hypervisor (you have no choice anyway),
> > > and you don't want the overhead of tweaking IOMMU
> > > on data path for virtio. Thus iommu=on is out too.
> >
> > That's not at all special for virtio or guest VMs. Even with real
> > hardware, we might want performance from *some* devices, and security
> > from others. See the DMA_ATTR_IOMMU_BYPASS which is currently being
> > discussed.
>
> Right. So let's wait for that discussion to play out?

That discussion is purely about a requested optimisation. This one is
about correctness.

> > But of course the easy answer in *your* case it just to ask the
> > hypervisor not to put the virtio devices behind an IOMMU at all. Which
> > we were planning to remain the default behaviour.
>
> One can't do this for x86 ATM, can one?

The converse is true, in fact — currently, there's no way to tell 
qemu-system-x86 that you *do* want it to put the virtio devices behind
the emulated IOMMU, as it has no support for that.

Which is a bit sad really, since the DMAR table that qemu advertises to
the guest does *tell* the guest that the virtio devices are behind the
emulated IOMMU.

In the short term, we'll be fixing the DMAR table, and still not
actually making it possible to put the virtio devices behind the
emulated IOMMU.

In the fullness of time, however, we *will* be fixing the qemu IOMMU
code so that it can translate for virtio devices — and for assigned
physical devices, which I believe are also broken at the moment when
qemu emulates an IOMMU.

> > In all cases, the DMA API shall do the right thing.
>
> I have no problem with that. For example, can we teach
> the DMA API on intel x86 to use PT for virtio by default?
> That would allow merging Andy's patches with
> full compatibility with old guests and hosts.

A quirk so that we *notice* the bug in the existing qemu DMAR table,
and disbelieve what it says about the virtio devices?

Alternatively, we could just recognise that the emulated IOMMU support
in qemu is an experimental feature and doesn't work right, yet. Are
people really using it in anger?

If we do want to do a quirk, then we should make sure it gets it right
for assigned devices too.

To start with, do you want to try to express the criteria for "the DMAR
table lies and <this> device is actually untranslated" in a form of
prose which could reasonably be translated into code?
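
For the sake of argument, here is one way that prose might look once
translated into code -- a self-contained userspace model with invented
names (dev_actually_untranslated, the host_assigned flag, and the
platform_has_buggy_dmar input are all hypothetical), not actual
intel-iommu quirk code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PCI vendor ID used by legacy virtio devices (Red Hat / Qumranet). */
#define PCI_VENDOR_ID_REDHAT_QUMRANET 0x1af4

struct guest_pci_dev {
    uint16_t vendor;
    bool     host_assigned; /* hypothetical: an assigned physical device */
};

/* One formulation of "the DMAR table lies and this device is actually
 * untranslated": on a platform whose DMAR table is known to be wrong
 * (the buggy qemu), distrust its claims for legacy virtio devices and
 * for assigned devices; believe it everywhere else. */
static bool dev_actually_untranslated(const struct guest_pci_dev *dev,
                                      bool platform_has_buggy_dmar)
{
    if (!platform_has_buggy_dmar)
        return false;
    if (dev->vendor == PCI_VENDOR_ID_REDHAT_QUMRANET)
        return true;
    return dev->host_assigned;
}
```

The hard part is the last two inputs: the guest has no reliable way to
recognize either the buggy platform or an assigned device from inside.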

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation



2015-11-10 15:02:28

by Michael S. Tsirkin

Subject: Re: [PATCH v3 0/3] virtio DMA API core stuff

On Sun, Nov 08, 2015 at 12:49:46PM +0100, Joerg Roedel wrote:
> On Sun, Nov 08, 2015 at 12:37:47PM +0200, Michael S. Tsirkin wrote:
> > I have no problem with that. For example, can we teach
> > the DMA API on intel x86 to use PT for virtio by default?
> > That would allow merging Andy's patches with
> > full compatibility with old guests and hosts.
>
> Well, the only incompatibility comes from an experimental qemu feature,
> more explicitly from a bug in that features implementation. So why
> should we work around that in the kernel? I think it is not too hard to
> fix qemu to generate a correct DMAR table which excludes the virtio
> devices from iommu translation.
>
>
> Joerg

It's not that easy - you'd have to dedicate some buses
for iommu bypass, and teach management tools to only put
virtio there - but it's possible.

This will absolutely address guests that don't need to set up IOMMU for
virtio devices, and virtio that bypasses the IOMMU.

But the problem is that we do want to *allow* guests
to set up IOMMU for virtio devices.
In that case, these are two other usecases:

A- monolithic virtio within QEMU:
iommu only needed for VFIO ->
guest should always use iommu=pt
iommu=on works but is just useless overhead.

B- modular out of process virtio outside QEMU:
iommu needed for VFIO or kernel driver ->
guest should use iommu=pt or iommu=on
depending on security/performance requirements

Note that there could easily be a mix of these in the same system.

So for these cases we do need QEMU to specify to guest that IOMMU covers
the virtio devices. Also, once one does this, the default on linux is
iommu=on and not pt, which works but ATM is very slow.

This poses three problems:

1. How do we address the different needs of A and B?
One way would be for virtio to pass the information to guest
using some virtio specific way, and have drivers
specify what kind of DMA access they want.

2. (Kind of a subset of 1) Once we do allow IOMMU, how do we make sure most guests
use the more sensible iommu=pt?

3. Once we do allow IOMMU, how can we keep existing guests working in this configuration?
Creating different hypervisor configurations depending on the guest is very nasty.
Again, one way would be some virtio specific interface.
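
For illustration only, such a virtio-specific interface could be a
feature bit the device offers and the driver acks -- no such bit existed
in the virtio spec at this point, and the name and bit number below are
invented:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented feature bit: "this device's DMA really goes through the
 * platform IOMMU".  Name and bit number are illustrative only. */
#define VIRTIO_F_PLATFORM_IOMMU (1ULL << 33)

/* Driver-side decision: use translated (DMA API) addresses only when
 * both sides agreed the IOMMU applies to this device. */
static bool virtio_should_use_dma_api(uint64_t device_features,
                                      uint64_t driver_features)
{
    return ((device_features & driver_features) & VIRTIO_F_PLATFORM_IOMMU) != 0;
}
```

Feature negotiation would also answer problem 3: an old guest never acks
the bit, so the device keeps bypassing the IOMMU for it.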

I'd rather we figured the answers to this before merging Andy's patches
because I'm concerned that instead of 1 broken configuration
(virtio always bypasses IOMMU) we'll get two bad configurations
(in the second one, virtio uses the slow default with no
gain in security).

Suggestions welcome.

--
MST

2015-11-10 18:54:55

by Andy Lutomirski

Subject: Re: [PATCH v3 0/3] virtio DMA API core stuff

On Nov 10, 2015 7:02 AM, "Michael S. Tsirkin" <[email protected]> wrote:
>
> On Sun, Nov 08, 2015 at 12:49:46PM +0100, Joerg Roedel wrote:
> > On Sun, Nov 08, 2015 at 12:37:47PM +0200, Michael S. Tsirkin wrote:
> > > I have no problem with that. For example, can we teach
> > > the DMA API on intel x86 to use PT for virtio by default?
> > > That would allow merging Andy's patches with
> > > full compatibility with old guests and hosts.
> >
> > Well, the only incompatibility comes from an experimental qemu feature,
> > more explicitly from a bug in that features implementation. So why
> > should we work around that in the kernel? I think it is not too hard to
> > fix qemu to generate a correct DMAR table which excludes the virtio
> > devices from iommu translation.
> >
> >
> > Joerg
>
> It's not that easy - you'd have to dedicate some buses
> for iommu bypass, and teach management tools to only put
> virtio there - but it's possible.
>
> This will absolutely address guests that don't need to set up IOMMU for
> virtio devices, and virtio that bypasses the IOMMU.
>
> But the problem is that we do want to *allow* guests
> to set up IOMMU for virtio devices.
> In that case, these are two other usecases:
>
> A- monolitic virtio within QEMU:
> iommu only needed for VFIO ->
> guest should always use iommu=pt
> iommu=on works but is just useless overhead.
>
> B- modular out of process virtio outside QEMU:
> iommu needed for VFIO or kernel driver ->
> guest should use iommu=pt or iommu=on
> depending on security/performance requirements
>
> Note that there could easily be a mix of these in the same system.
>
> So for these cases we do need QEMU to specify to guest that IOMMU covers
> the virtio devices. Also, once one does this, the default on linux is
> iommu=on and not pt, which works but ATM is very slow.
>
> This poses three problems:
>
> 1. How do we address the different needs of A and B?
> One way would be for virtio to pass the information to guest
> using some virtio specific way, and have drivers
> specify what kind of DMA access they want.
>
> 2. (Kind of a subset of 1) once we do allow IOMMU, how do we make sure most guests
> use the more sensible iommu=pt.
>
> 3. Once we do allow IOMMU, how can we keep existing guests work in this configuration?
> Creating different hypervisor configurations depending on guest is very nasty.
> Again, one way would be some virtio specific interface.
>
> I'd rather we figured the answers to this before merging Andy's patches
> because I'm concerned that instead of 1 broken configuration
> (virtio always bypasses IOMMU) we'll get two bad configurations
> (in the second one, virtio uses the slow default with no
> gain in security).
>
> Suggestions wellcome.

I think there's still no downside of using my patches, even on x86.

Old kernels on new QEMU work unless IOMMU is enabled on the host. I
think that's the best we can possibly do.

New kernels work at full speed on old QEMU.

New kernels with new QEMU and iommu enabled work slower. Even newer
kernels with default passthrough work at full speed, and there's no
obvious downside to the existence of kernels with just my patches.

--Andy

>
> --
> MST

2015-11-11 10:05:19

by Michael S. Tsirkin

Subject: Re: [PATCH v3 0/3] virtio DMA API core stuff

On Tue, Nov 10, 2015 at 10:54:21AM -0800, Andy Lutomirski wrote:
> On Nov 10, 2015 7:02 AM, "Michael S. Tsirkin" <[email protected]> wrote:
> >
> > On Sun, Nov 08, 2015 at 12:49:46PM +0100, Joerg Roedel wrote:
> > > On Sun, Nov 08, 2015 at 12:37:47PM +0200, Michael S. Tsirkin wrote:
> > > > I have no problem with that. For example, can we teach
> > > > the DMA API on intel x86 to use PT for virtio by default?
> > > > That would allow merging Andy's patches with
> > > > full compatibility with old guests and hosts.
> > >
> > > Well, the only incompatibility comes from an experimental qemu feature,
> > > more explicitly from a bug in that features implementation. So why
> > > should we work around that in the kernel? I think it is not too hard to
> > > fix qemu to generate a correct DMAR table which excludes the virtio
> > > devices from iommu translation.
> > >
> > >
> > > Joerg
> >
> > It's not that easy - you'd have to dedicate some buses
> > for iommu bypass, and teach management tools to only put
> > virtio there - but it's possible.
> >
> > This will absolutely address guests that don't need to set up IOMMU for
> > virtio devices, and virtio that bypasses the IOMMU.
> >
> > But the problem is that we do want to *allow* guests
> > to set up IOMMU for virtio devices.
> > In that case, these are two other usecases:
> >
> > A- monolitic virtio within QEMU:
> > iommu only needed for VFIO ->
> > guest should always use iommu=pt
> > iommu=on works but is just useless overhead.
> >
> > B- modular out of process virtio outside QEMU:
> > iommu needed for VFIO or kernel driver ->
> > guest should use iommu=pt or iommu=on
> > depending on security/performance requirements
> >
> > Note that there could easily be a mix of these in the same system.
> >
> > So for these cases we do need QEMU to specify to guest that IOMMU covers
> > the virtio devices. Also, once one does this, the default on linux is
> > iommu=on and not pt, which works but ATM is very slow.
> >
> > This poses three problems:
> >
> > 1. How do we address the different needs of A and B?
> > One way would be for virtio to pass the information to guest
> > using some virtio specific way, and have drivers
> > specify what kind of DMA access they want.
> >
> > 2. (Kind of a subset of 1) once we do allow IOMMU, how do we make sure most guests
> > use the more sensible iommu=pt.
> >
> > 3. Once we do allow IOMMU, how can we keep existing guests work in this configuration?
> > Creating different hypervisor configurations depending on guest is very nasty.
> > Again, one way would be some virtio specific interface.
> >
> > I'd rather we figured the answers to this before merging Andy's patches
> > because I'm concerned that instead of 1 broken configuration
> > (virtio always bypasses IOMMU) we'll get two bad configurations
> > (in the second one, virtio uses the slow default with no
> > gain in security).
> >
> > Suggestions wellcome.
>
> I think there's still no downside of using my patches, even on x86.
>
> Old kernels on new QEMU work unless IOMMU is enabled on the host. I
> think that's the best we can possibly do.
> New kernels work at full speed on old QEMU.

Only if IOMMU is disabled, right?

> New kernels with new QEMU and iommu enabled work slower. Even newer
> kernels with default passthrough work at full speed, and there's no
> obvious downside to the existence of kernels with just my patches.
>
> --Andy
>

I tried to explain the possible downside. Let me try again. Imagine
that guest kernel notifies hypervisor that it wants IOMMU to actually
work. This will make old kernel on new QEMU work even with IOMMU
enabled on host - better than "the best we can do" that you described
above. Specifically, QEMU will assume that if it didn't get
notification, it's an old kernel so it should ignore the IOMMU.

But if we apply your patches this trick won't work.

Without implementing it all, I think the easiest incremental step would
be to teach linux to make passthrough the default when running as a
guest on top of QEMU, with your patches on top. If someone specifies
non-passthrough on the command line it'll still be broken,
but not too bad.
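
As a sketch of that incremental step (a userspace model;
pick_iommu_default and its inputs are invented stand-ins for the x86
IOMMU init logic and the CPUID hypervisor bit):

```c
#include <assert.h>
#include <stdbool.h>

enum iommu_default { IOMMU_DEFAULT_ON, IOMMU_DEFAULT_PT };

/* Proposed policy: default to passthrough when the CPUID hypervisor
 * bit says we are a guest; an explicit iommu= choice on the command
 * line still wins (and explicit non-passthrough stays broken). */
static enum iommu_default pick_iommu_default(bool running_as_guest,
                                             bool cmdline_set,
                                             enum iommu_default cmdline_val)
{
    if (cmdline_set)
        return cmdline_val;
    return running_as_guest ? IOMMU_DEFAULT_PT : IOMMU_DEFAULT_ON;
}
```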


> >
> > --
> > MST

2015-11-11 15:56:25

by Andy Lutomirski

Subject: Re: [PATCH v3 0/3] virtio DMA API core stuff

On Wed, Nov 11, 2015 at 2:05 AM, Michael S. Tsirkin <[email protected]> wrote:
> On Tue, Nov 10, 2015 at 10:54:21AM -0800, Andy Lutomirski wrote:
>> On Nov 10, 2015 7:02 AM, "Michael S. Tsirkin" <[email protected]> wrote:
>> >
>> > On Sun, Nov 08, 2015 at 12:49:46PM +0100, Joerg Roedel wrote:
>> > > On Sun, Nov 08, 2015 at 12:37:47PM +0200, Michael S. Tsirkin wrote:
>> > > > I have no problem with that. For example, can we teach
>> > > > the DMA API on intel x86 to use PT for virtio by default?
>> > > > That would allow merging Andy's patches with
>> > > > full compatibility with old guests and hosts.
>> > >
>> > > Well, the only incompatibility comes from an experimental qemu feature,
>> > > more explicitly from a bug in that features implementation. So why
>> > > should we work around that in the kernel? I think it is not too hard to
>> > > fix qemu to generate a correct DMAR table which excludes the virtio
>> > > devices from iommu translation.
>> > >
>> > >
>> > > Joerg
>> >
>> > It's not that easy - you'd have to dedicate some buses
>> > for iommu bypass, and teach management tools to only put
>> > virtio there - but it's possible.
>> >
>> > This will absolutely address guests that don't need to set up IOMMU for
>> > virtio devices, and virtio that bypasses the IOMMU.
>> >
>> > But the problem is that we do want to *allow* guests
>> > to set up IOMMU for virtio devices.
>> > In that case, these are two other usecases:
>> >
>> > A- monolitic virtio within QEMU:
>> > iommu only needed for VFIO ->
>> > guest should always use iommu=pt
>> > iommu=on works but is just useless overhead.
>> >
>> > B- modular out of process virtio outside QEMU:
>> > iommu needed for VFIO or kernel driver ->
>> > guest should use iommu=pt or iommu=on
>> > depending on security/performance requirements
>> >
>> > Note that there could easily be a mix of these in the same system.
>> >
>> > So for these cases we do need QEMU to specify to guest that IOMMU covers
>> > the virtio devices. Also, once one does this, the default on linux is
>> > iommu=on and not pt, which works but ATM is very slow.
>> >
>> > This poses three problems:
>> >
>> > 1. How do we address the different needs of A and B?
>> > One way would be for virtio to pass the information to guest
>> > using some virtio specific way, and have drivers
>> > specify what kind of DMA access they want.
>> >
>> > 2. (Kind of a subset of 1) once we do allow IOMMU, how do we make sure most guests
>> > use the more sensible iommu=pt.
>> >
>> > 3. Once we do allow IOMMU, how can we keep existing guests work in this configuration?
>> > Creating different hypervisor configurations depending on guest is very nasty.
>> > Again, one way would be some virtio specific interface.
>> >
>> > I'd rather we figured the answers to this before merging Andy's patches
>> > because I'm concerned that instead of 1 broken configuration
>> > (virtio always bypasses IOMMU) we'll get two bad configurations
>> > (in the second one, virtio uses the slow default with no
>> > gain in security).
>> >
>> > Suggestions wellcome.
>>
>> I think there's still no downside of using my patches, even on x86.
>>
>> Old kernels on new QEMU work unless IOMMU is enabled on the host. I
>> think that's the best we can possibly do.
>> New kernels work at full speed on old QEMU.
>
> Only if IOMMU is disabled, right?
>
>> New kernels with new QEMU and iommu enabled work slower. Even newer
>> kernels with default passthrough work at full speed, and there's no
>> obvious downside to the existence of kernels with just my patches.
>>
>> --Andy
>>
>
> I tried to explain the possible downside. Let me try again. Imagine
> that guest kernel notifies hypervisor that it wants IOMMU to actually
> work. This will make old kernel on new QEMU work even with IOMMU
> enabled on host - better than "the best we can do" that you described
> above. Specifically, QEMU will assume that if it didn't get
> notification, it's an old kernel so it should ignore the IOMMU.

Can you flesh out this trick?

On x86 IIUC the IOMMU more-or-less defaults to passthrough. If the
kernel wants, it can switch it to a non-passthrough mode. My patches
cause the virtio driver to do exactly this, except that the host
implementation doesn't actually exist yet, so the patches will instead
have no particular effect.

On powerpc and sparc, we *already* screwed up. The host already tells
the guest that there's an IOMMU and that it's *enabled* because those
platforms don't have selective IOMMU coverage the way that x86 does.
So we need to work around it.

I think that, if we want fancy virt-friendly IOMMU stuff like you're
talking about, then the right thing to do is to create a virtio bus
instead of pretending to be PCI. That bus could have a virtio IOMMU
and its own cross-platform enumeration mechanism for devices on the
bus, and everything would be peachy.

In the mean time, there are existing mechanisms by which every PCI
driver is supposed to notify the host/platform of how it intends to
map DMA memory, and virtio gets it wrong.

>
> But if we apply your patches this trick won't work.
>

I still don't understand what trick. If we want virtio devices to be
assignable, then they should be translated through the IOMMU, and the
DMA API is the right interface for that.

> Without implementing it all, I think the easiest incremental step would
> be to teach linux to make passthrough the default when running as a
> guest on top of QEMU, put your patches on top. If someone specifies
> non passthrough on command line it'll still be broken,
> but not too bad.

Can powerpc and sparc do exact 1:1 passthrough for a given device? If
so, that might be a reasonable way forward. After all, if a new
powerpc kernel asks for exact passthrough (dma addr = phys addr with
no offset at all), then old QEMU will just ignore it and therefore
accidentally get it right. Ben?

--Andy

2015-11-11 22:30:37

by David Woodhouse

Subject: Re: [PATCH v3 0/3] virtio DMA API core stuff

On Wed, 2015-11-11 at 07:56 -0800, Andy Lutomirski wrote:
>
> Can you flesh out this trick?
>
> On x86 IIUC the IOMMU more-or-less defaults to passthrough. If the
> kernel wants, it can switch it to a non-passthrough mode. My patches
> cause the virtio driver to do exactly this, except that the host
> implementation doesn't actually exist yet, so the patches will instead
> have no particular effect.

At some level, yes — we're compatible with a 1982 IBM PC and thus the
IOMMU is entirely disabled at boot until the kernel turns it on —
except in TXT mode where we abandon that compatibility.

But no, the virtio driver has *nothing* to do with switching the device
out of passthrough mode. It is either in passthrough mode, or it isn't.

If the VMM *doesn't* expose an IOMMU to the guest, obviously the
devices are in passthrough mode. If the guest kernel doesn't have IOMMU
support enabled, then obviously the devices are in passthrough mode.
And if the ACPI tables exposed to the guest kernel *tell* it that the
virtio devices are not actually behind the IOMMU (which qemu gets
wrong), then it'll be in passthrough mode.

If the IOMMU is exposed, and enabled, and telling the guest kernel that
it *does* cover the virtio devices, then those virtio devices will
*not* be in passthrough mode.

You choosing to use the DMA API in the virtio device drivers instead of
being buggy, has nothing to do with whether it's actually in
passthrough mode or not. Whether it's in passthrough mode or not, using
the DMA API is technically the right thing to do — because it should
either *do* the translation, or return a 1:1 mapped IOVA, as
appropriate.
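
That symmetry can be modelled in a few lines -- a toy stand-in for the
real mapping call, with an invented fixed-offset "translation" in place
of a page-table walk:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;

/* Toy backend: either an identity map or a "translation" (a fixed IOVA
 * offset standing in for a real page-table walk). */
struct dma_backend {
    int      translated; /* is the device behind an enabled IOMMU? */
    uint64_t iova_base;  /* invented: where translated mappings start */
};

/* The driver calls this the same way in both cases -- it neither knows
 * nor cares whether translation actually happens. */
static dma_addr_t toy_map_single(const struct dma_backend *be, uint64_t phys)
{
    return be->translated ? be->iova_base + phys : phys;
}
```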


> On powerpc and sparc, we *already* screwed up. The host already tells
> the guest that there's an IOMMU and that it's *enabled* because those
> platforms don't have selective IOMMU coverage the way that x86 does.
> So we need to work around it.

No, we need it on x86 too because once we fix the virtio device driver
bug and make it start using the DMA API, then we start to trip up on
the qemu bug where it lies about which devices are covered by the
IOMMU.

Of course, we still have that same qemu bug w.r.t. assigned devices,
which it *also* claims are behind its IOMMU when they're not...

> I think that, if we want fancy virt-friendly IOMMU stuff like you're
> talking about, then the right thing to do is to create a virtio bus
> instead of pretending to be PCI. That bus could have a virtio IOMMU
> and its own cross-platform enumeration mechanism for devices on the
> bus, and everything would be peachy.

That doesn't really help very much for the x86 case where the problem
is compatibility with *existing* (arguably broken) qemu
implementations.

Having said that, if this were real hardware I'd just be blacklisting
it and saying "Another BIOS with broken DMAR tables --> IOMMU
completely disabled". So perhaps we should just do that.


> I still don't understand what trick. If we want virtio devices to be
> assignable, then they should be translated through the IOMMU, and the
> DMA API is the right interface for that.

The DMA API is the right interface *regardless* of whether there's
actual translation to be done. The device driver itself should not be
involved in any way with that decision.

When you want to access MMIO, you use ioremap() and writel() instead of
doing random crap for yourself. When you want DMA, you use the DMA API
to get a bus address for your device *even* if you expect there to be
no IOMMU and you expect it to precisely match the physical address. No
excuses.

--
dwmw2




2015-11-12 11:10:04

by Michael S. Tsirkin

Subject: Re: [PATCH v3 0/3] virtio DMA API core stuff

On Wed, Nov 11, 2015 at 11:30:27PM +0100, David Woodhouse wrote:
> On Wed, 2015-11-11 at 07:56 -0800, Andy Lutomirski wrote:
> >
> > Can you flesh out this trick?
> >
> > On x86 IIUC the IOMMU more-or-less defaults to passthrough. If the
> > kernel wants, it can switch it to a non-passthrough mode. My patches
> > cause the virtio driver to do exactly this, except that the host
> > implementation doesn't actually exist yet, so the patches will instead
> > have no particular effect.
>
> At some level, yes — we're compatible with a 1982 IBM PC and thus the
> IOMMU is entirely disabled at boot until the kernel turns it on —
> except in TXT mode where we abandon that compatibility.
>
> But no, the virtio driver has *nothing* to do with switching the device
> out of passthrough mode. It is either in passthrough mode, or it isn't.
>
> If the VMM *doesn't* expose an IOMMU to the guest, obviously the
> devices are in passthrough mode. If the guest kernel doesn't have IOMMU
> support enabled, then obviously the devices are in passthrough mode.
> And if the ACPI tables exposed to the guest kernel *tell* it that the
> virtio devices are not actually behind the IOMMU (which qemu gets
> wrong), then it'll be in passthrough mode.
>
> If the IOMMU is exposed, and enabled, and telling the guest kernel that
> it *does* cover the virtio devices, then those virtio devices will
> *not* be in passthrough mode.

This we need to fix. Because in most configurations if you are
using kernel drivers, then you don't want IOMMU with virtio,
but if you are using VFIO then you do.

Intel's IOMMU can be programmed to still
do a kind of passthrough (1:1) mapping; it's
just a matter of doing this for virtio devices
when not using VFIO.
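
The 1:1 programming referred to here amounts to filling the page table
with identity entries, roughly (a toy model, not the intel-iommu code):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u

/* Toy page table: slot i holds the physical page mapped at IOVA page i.
 * Identity entries make translation a no-op, so the device behaves as
 * if it were in passthrough mode even with the IOMMU enabled. */
static void identity_map_range(uint64_t *pt, uint64_t start, uint64_t end)
{
    for (uint64_t a = start; a < end; a += TOY_PAGE_SIZE)
        pt[a / TOY_PAGE_SIZE] = a;
}
```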

> You choosing to use the DMA API in the virtio device drivers instead of
> being buggy, has nothing to do with whether it's actually in
> passthrough mode or not. Whether it's in passthrough mode or not, using
> the DMA API is technically the right thing to do — because it should
> either *do* the translation, or return a 1:1 mapped IOVA, as
> appropriate.

Right, but first we need to actually make the DMA API do the right thing
at least on x86, ppc and arm.

> > On powerpc and sparc, we *already* screwed up. The host already tells
> > the guest that there's an IOMMU and that it's *enabled* because those
> > platforms don't have selective IOMMU coverage the way that x86 does.
> > So we need to work around it.
>
> No, we need it on x86 too because once we fix the virtio device driver
> bug and make it start using the DMA API, then we start to trip up on
> the qemu bug where it lies about which devices are covered by the
> IOMMU.
>
> Of course, we still have that same qemu bug w.r.t. assigned devices,
> which it *also* claims are behind its IOMMU when they're not...

I'm not worried about qemu bugs that much. I am interested in being
able to use both VFIO and kernel drivers with virtio devices with good
performance and without tweaking kernel parameters.


> > I think that, if we want fancy virt-friendly IOMMU stuff like you're
> > talking about, then the right thing to do is to create a virtio bus
> > instead of pretending to be PCI. That bus could have a virtio IOMMU
> > and its own cross-platform enumeration mechanism for devices on the
> > bus, and everything would be peachy.
>
> That doesn't really help very much for the x86 case where the problem
> is compatibility with *existing* (arguably broken) qemu
> implementations.
>
> Having said that, if this were real hardware I'd just be blacklisting
> it and saying "Another BIOS with broken DMAR tables --> IOMMU
> completely disabled". So perhaps we should just do that.
>

Yes, once there is new QEMU where virtio is covered by the IOMMU,
that would be one way to address existing QEMU bugs.

> > I still don't understand what trick. If we want virtio devices to be
> > assignable, then they should be translated through the IOMMU, and the
> > DMA API is the right interface for that.
>
> The DMA API is the right interface *regardless* of whether there's
> actual translation to be done. The device driver itself should not be
> involved in any way with that decision.

With virt, each device can have different privileges:
some are part of the hypervisor, so with a kernel driver,
trying to get protection from them using an IOMMU which is also
part of the hypervisor makes no sense - but when using a
userspace driver, getting protection from the userspace
driver does make sense. Others are real devices, so
getting protection from them makes some sense.

Which is which? It's easiest for the device driver itself to
gain that knowledge. Please note this is *not* the same
question as whether a specific device is covered by an IOMMU.

> When you want to access MMIO, you use ioremap() and writel() instead of
> doing random crap for yourself. When you want DMA, you use the DMA API
> to get a bus address for your device *even* if you expect there to be
> no IOMMU and you expect it to precisely match the physical address. No
> excuses.

No problem, but the fact remains that virtio does need
per-device control over whether it's passthrough or not.

Forget the bugs, that's not the issue - the issue is
that it's sometimes part of the hypervisor and
sometimes isn't.

We just can't say it's always not a part of the hypervisor so you always
want maximum protection - that drops performance to the floor.

Linux doesn't seem to support that usecase at the moment. If this is a
generic problem then we need to teach Linux to solve it, but if virtio
is unique in this requirement, then we should just keep doing virtio
specific things to solve it.


> --
> dwmw2
>
>

2015-11-12 12:18:23

by David Woodhouse

Subject: Re: [PATCH v3 0/3] virtio DMA API core stuff

On Thu, 2015-11-12 at 13:09 +0200, Michael S. Tsirkin wrote:
> On Wed, Nov 11, 2015 at 11:30:27PM +0100, David Woodhouse wrote:
> >
> > If the IOMMU is exposed, and enabled, and telling the guest kernel that
> > it *does* cover the virtio devices, then those virtio devices will
> > *not* be in passthrough mode.
>
> This we need to fix. Because in most configurations if you are
> using kernel drivers, then you don't want IOMMU with virtio,
> but if you are using VFIO then you do.

This is *absolutely* not specific to virtio. There are *plenty* of
other users (especially networking) where we only really care about the
existence of the IOMMU for VFIO purposes and assigning devices to
guests, and we are willing to dispense with the protection that it
offers for native in-kernel drivers. For that, boot with iommu=pt.

There is no way, currently, to enable the passthrough mode on a per-
device basis. Although it has been discussed right here, very recently.

Let's not conflate those issues.

> > You choosing to use the DMA API in the virtio device drivers instead of
> > being buggy, has nothing to do with whether it's actually in
> > passthrough mode or not. Whether it's in passthrough mode or not, using
> > the DMA API is technically the right thing to do — because it should
> > either *do* the translation, or return a 1:1 mapped IOVA, as
> > appropriate.
>
> Right, but first we need to actually make the DMA API do the right thing
> at least on x86, ppc and arm.

It already does the right thing on x86, modulo BIOS bugs (including the
qemu ACPI table bug, which you said you're not too worried about).

> I'm not worried about qemu bugs that much.  I am interested in being
> able to use both VFIO and kernel drivers with virtio devices with good
> performance and without tweaking kernel parameters.

OK, then you are interested in the semi-orthogonal discussion about
DMA_ATTR_IOMMU_BYPASS. Either way, device drivers SHALL use the DMA
API.


> > Having said that, if this were real hardware I'd just be blacklisting
> > it and saying "Another BIOS with broken DMAR tables --> IOMMU
> > completely disabled". So perhaps we should just do that.
> >
> Yes, once there is new QEMU where virtio is covered by the IOMMU,
> that would be one way to address existing QEMU bugs.

No, that's not required. All that's required is to fix the currently-
broken ACPI table so that it *admits* that the virtio devices aren't
covered by the IOMMU. And I've never waited for a fix to be available
before blacklisting *other* broken firmwares...

The only reason I'm holding off for now is because ARM and PPC also
need a quirk for their platform code to realise that certain devices
actually *aren't* covered by the IOMMU, and I might be able to just use
the same thing and still enable the IOMMU in the offending qemu
versions.

Although as noted, it would need to cover assigned devices as well as
virtio — qemu currently lies to us and tells us that the emulated IOMMU
in the guest does cover *those* too.

> With virt, each device can have different privileges:
> some are part of the hypervisor, so with a kernel driver,
> trying to get protection from them using an IOMMU which is
> also part of the hypervisor makes no sense
> - but when using a
> userspace driver, getting protection *from* the userspace
> driver does make sense. Others are real devices, so
> getting protection from them makes some sense.
>
> Which is which? It's easiest for the device driver itself to
> gain that knowledge. Please note this is *not* the same
> question as whether a specific device is covered by an IOMMU.

OK. How does your device driver know whether the virtio PCI device it's
talking to is actually implemented by the hypervisor, or whether it's
one of the real PCI implementations that apparently exist?

> Linux doesn't seem to support that usecase at the moment, if this is a
> generic problem then we need to teach Linux to solve it, but if virtio
> is unique in this requirement, then we should just keep doing virtio
> specific things to solve it.

It is a generic problem. There is a discussion elsewhere about how (or
indeed whether) to solve it. It absolutely isn't virtio-specific, and
we absolutely shouldn't be doing virtio-specific things to solve it.

Nothing excuses just eschewing the correct DMA API. That's just broken,
and only ever worked in conjunction with *other* bugs elsewhere in the
platform.


--
dwmw2



2015-11-22 13:06:15

by Marcel Apfelbaum

Subject: Re: [PATCH v3 0/3] virtio DMA API core stuff

On 11/08/2015 01:49 PM, Joerg Roedel wrote:
> On Sun, Nov 08, 2015 at 12:37:47PM +0200, Michael S. Tsirkin wrote:
>> I have no problem with that. For example, can we teach
>> the DMA API on intel x86 to use PT for virtio by default?
>> That would allow merging Andy's patches with
>> full compatibility with old guests and hosts.
>
> Well, the only incompatibility comes from an experimental qemu feature,
> more explicitly from a bug in that feature's implementation. So why
> should we work around that in the kernel? I think it is not too hard to
> fix qemu to generate a correct DMAR table which excludes the virtio
> devices from iommu translation.

Hi,

I tried to generate a DMAR table that excludes some devices from
IOMMU translation, however it does not help.

The reason is, as far as I understand, that the Linux kernel does
not allow any device to be outside an IOMMU scope if the
iommu kernel option is activated.

Does anybody know if this is "by design" or simply an uncommon configuration?
(some devices in an IOMMU scope, while others outside *any* IOMMU scope)

Thanks,
Marcel

>
>
> Joerg
>

2015-11-22 15:54:32

by David Woodhouse

Subject: Re: [PATCH v3 0/3] virtio DMA API core stuff

On Sun, 2015-11-22 at 15:06 +0200, Marcel Apfelbaum wrote:
>
>
> I tried to generate a DMAR table that excludes some devices from
> IOMMU translation, however it does not help.
>
> The reason is, as far as I understand, that Linux kernel does
> not allow any device being outside an IOMMU scope if the
> iommu kernel option is activated.
>
> Does anybody know if it is "by design" or is simply an uncommon
> configuration?
> (some devices in an IOMMU scope, while others outside *any* IOMMU
> scope)

That's a kernel bug in the way it handles per-device DMA operations. Or
more to the point, in the way it doesn't — the non-translated devices
end up being pointed to the intel_dma_ops despite the fact they
shouldn't be. I'm working on that...

--
dwmw2



2015-11-22 17:04:46

by Marcel Apfelbaum

Subject: Re: [PATCH v3 0/3] virtio DMA API core stuff

On 11/22/2015 05:54 PM, David Woodhouse wrote:
> On Sun, 2015-11-22 at 15:06 +0200, Marcel Apfelbaum wrote:
>>
>>
>> I tried to generate a DMAR table that excludes some devices from
>> IOMMU translation, however it does not help.
>>
>> The reason is, as far as I understand, that Linux kernel does
>> not allow any device being outside an IOMMU scope if the
>> iommu kernel option is activated.
>>
>> Does anybody know if it is "by design" or is simply an uncommon
>> configuration?
>> (some devices in an IOMMU scope, while others outside *any* IOMMU
>> scope)
>
> That's a kernel bug in the way it handles per-device DMA operations. Or
> more to the point, in the way it doesn't — the non-translated devices
> end up being pointed to the intel_dma_ops despite the fact they
> shouldn't be. I'm working on that...
>

Hi David,
Thank you for the fast response.

Sadly I am not familiar enough with the DMA/IOMMU code to contribute
a sane idea, but I'll gladly test it.
If you lack the time but have an idea to share, I can give it a try, though.

Thanks,
Marcel

2015-11-22 22:11:17

by Michael S. Tsirkin

Subject: Re: [PATCH v3 0/3] virtio DMA API core stuff

On Sun, Nov 22, 2015 at 03:54:21PM +0000, David Woodhouse wrote:
> On Sun, 2015-11-22 at 15:06 +0200, Marcel Apfelbaum wrote:
> >
> >
> > I tried to generate a DMAR table that excludes some devices from
> > IOMMU translation, however it does not help.
> >
> > The reason is, as far as I understand, that Linux kernel does
> > not allow any device being outside an IOMMU scope if the
> > iommu kernel option is activated.
> >
> > Does anybody know if it is "by design" or is simply an uncommon
> > configuration?
> > (some devices in an IOMMU scope, while others outside *any* IOMMU
> > scope)
>
> That's a kernel bug in the way it handles per-device DMA operations. Or
> more to the point, in the way it doesn't — the non-translated devices
> end up being pointed to the intel_dma_ops despite the fact they
> shouldn't be. I'm working on that...
>
> --
> dwmw2
>

Interesting. This seems to imply such configurations aren't
common, so I wonder whether other guest OS-es treat them
correctly.

If many of them don't, we probably shouldn't use this in QEMU:
we care about guests actually working :)

--
MST