This adds a hook which a platform can define in order to allow it to
force the use of the DMA API for all virtio devices even if they don't
have the VIRTIO_F_IOMMU_PLATFORM flag set. We want to use this to do
bounce-buffering of data on the new secure pSeries platform, currently
under development, where a KVM host cannot access all of the memory
space of a secure KVM guest. The host can only access the pages which
the guest has explicitly requested to be shared with the host, thus
the virtio implementation in the guest has to copy data to and from
shared pages.
With this hook, the platform code in the secure guest can force the
use of swiotlb for virtio buffers, with a back-end for swiotlb which
will use a pool of pre-allocated shared pages. Thus all data being
sent or received by virtio devices will be copied through pages which
the host has access to.
Signed-off-by: Anshuman Khandual <[email protected]>
---
Changes in V2:
The arch callback has been enabled through a weak symbol definition
so that it is enabled only for those architectures subscribing to
this new framework. Clarified the patch description. The primary
objective for this RFC has been to get an in-principle agreement
on this approach.
Original V1:
Original RFC and discussions https://patchwork.kernel.org/patch/10324405/
arch/powerpc/include/asm/dma-mapping.h | 6 ++++++
arch/powerpc/platforms/pseries/iommu.c | 11 +++++++++++
drivers/virtio/virtio_ring.c | 10 ++++++++++
3 files changed, 27 insertions(+)
diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
index 8fa3945..056e578 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -115,4 +115,10 @@ extern u64 __dma_get_required_mask(struct device *dev);
#define ARCH_HAS_DMA_MMAP_COHERENT
#endif /* __KERNEL__ */
+
+#define platform_forces_virtio_dma platform_forces_virtio_dma
+
+struct virtio_device;
+
+extern bool platform_forces_virtio_dma(struct virtio_device *vdev);
#endif /* _ASM_DMA_MAPPING_H */
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 06f0296..a2ec15a 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -38,6 +38,7 @@
#include <linux/of.h>
#include <linux/iommu.h>
#include <linux/rculist.h>
+#include <linux/virtio.h>
#include <asm/io.h>
#include <asm/prom.h>
#include <asm/rtas.h>
@@ -1396,3 +1397,13 @@ static int __init disable_multitce(char *str)
__setup("multitce=", disable_multitce);
machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
+
+bool platform_forces_virtio_dma(struct virtio_device *vdev)
+{
+ /*
+ * On protected guest platforms, force virtio core to use DMA
+ * MAP API for all virtio devices. But there can also be some
+ * exceptions for individual devices like virtio balloon.
+ */
+ return (of_find_compatible_node(NULL, NULL, "ibm,ultravisor") != NULL);
+}
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 21d464a..47ea6c3 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -141,8 +141,18 @@ struct vring_virtqueue {
* unconditionally on data path.
*/
+#ifndef platform_forces_virtio_dma
+static inline bool platform_forces_virtio_dma(struct virtio_device *vdev)
+{
+ return false;
+}
+#endif
+
static bool vring_use_dma_api(struct virtio_device *vdev)
{
+ if (platform_forces_virtio_dma(vdev))
+ return true;
+
if (!virtio_has_iommu_quirk(vdev))
return true;
--
2.9.3
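(To make the bounce-buffering described in the patch concrete, here is a
minimal sketch of the idea, with a hypothetical alloc_shared_page() helper
standing in for the pre-allocated pool of shared pages; the real back-end
would sit behind swiotlb rather than being open-coded like this.)

#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <linux/mm.h>
#include <linux/string.h>

/* Hedged sketch only: copy outgoing data into a page the host can see. */
static dma_addr_t secure_bounce_map(const void *buf, size_t len)
{
	struct page *page = alloc_shared_page();	/* hypothetical pool */

	if (!page)
		return 0;	/* treat 0 as "mapping failed" in this sketch */

	memcpy(page_address(page), buf, len);	/* guest-private data -> shared page */
	return page_to_phys(page);		/* bus address the host can access */
}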
subj: s/virito/virtio/
On Tue, May 22, 2018 at 12:03:17PM +0530, Anshuman Khandual wrote:
> This adds a hook which a platform can define in order to allow it to
> force the use of the DMA API for all virtio devices even if they don't
> have the VIRTIO_F_IOMMU_PLATFORM flag set. We want to use this to do
> bounce-buffering of data on the new secure pSeries platform, currently
> under development, where a KVM host cannot access all of the memory
> space of a secure KVM guest. The host can only access the pages which
> the guest has explicitly requested to be shared with the host, thus
> the virtio implementation in the guest has to copy data to and from
> shared pages.
>
> With this hook, the platform code in the secure guest can force the
> use of swiotlb for virtio buffers, with a back-end for swiotlb which
> will use a pool of pre-allocated shared pages. Thus all data being
> sent or received by virtio devices will be copied through pages which
> the host has access to.
>
> Signed-off-by: Anshuman Khandual <[email protected]>
> ---
> Changes in V2:
>
> The arch callback has been enabled through an weak symbol defintion
> so that it is enabled only for those architectures subscribing to
> this new framework. Clarified the patch description. The primary
> objective for this RFC has been to get an in principle agreement
> on this approach.
>
> Original V1:
>
> Original RFC and discussions https://patchwork.kernel.org/patch/10324405/
I re-read that discussion and I'm still unclear on the
original question, since I got several apparently
conflicting answers.
I asked:
Why isn't setting VIRTIO_F_IOMMU_PLATFORM on the
hypervisor side sufficient?
> arch/powerpc/include/asm/dma-mapping.h | 6 ++++++
> arch/powerpc/platforms/pseries/iommu.c | 11 +++++++++++
> drivers/virtio/virtio_ring.c | 10 ++++++++++
> 3 files changed, 27 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
> index 8fa3945..056e578 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -115,4 +115,10 @@ extern u64 __dma_get_required_mask(struct device *dev);
> #define ARCH_HAS_DMA_MMAP_COHERENT
>
> #endif /* __KERNEL__ */
> +
> +#define platform_forces_virtio_dma platform_forces_virtio_dma
> +
> +struct virtio_device;
> +
> +extern bool platform_forces_virtio_dma(struct virtio_device *vdev);
> #endif /* _ASM_DMA_MAPPING_H */
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 06f0296..a2ec15a 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -38,6 +38,7 @@
> #include <linux/of.h>
> #include <linux/iommu.h>
> #include <linux/rculist.h>
> +#include <linux/virtio.h>
> #include <asm/io.h>
> #include <asm/prom.h>
> #include <asm/rtas.h>
> @@ -1396,3 +1397,13 @@ static int __init disable_multitce(char *str)
> __setup("multitce=", disable_multitce);
>
> machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
> +
> +bool platform_forces_virtio_dma(struct virtio_device *vdev)
> +{
> + /*
> + * On protected guest platforms, force virtio core to use DMA
> + * MAP API for all virtio devices. But there can also be some
> + * exceptions for individual devices like virtio balloon.
> + */
> + return (of_find_compatible_node(NULL, NULL, "ibm,ultravisor") != NULL);
> +}
Isn't this kind of slow? vring_use_dma_api is on
data path and supposed to be very fast.
On Wed, 2018-05-23 at 21:50 +0300, Michael S. Tsirkin wrote:
> I re-read that discussion and I'm still unclear on the
> original question, since I got several apparently
> conflicting answers.
>
> I asked:
>
> Why isn't setting VIRTIO_F_IOMMU_PLATFORM on the
> hypervisor side sufficient?
I thought I had replied to this...
There are a couple of reasons:
- First qemu doesn't know that the guest will switch to "secure mode"
in advance. There is no difference between a normal and a secure
partition until the partition does the magic UV call to "enter secure
mode" and qemu doesn't see any of it. So who can set the flag here ?
- Second, when using VIRTIO_F_IOMMU_PLATFORM, we also make qemu (or
vhost) go through the emulated MMIO for every access to the guest,
which adds additional overhead.
Cheers,
Ben.
On Thu, May 24, 2018 at 08:27:04AM +1000, Benjamin Herrenschmidt wrote:
> - First qemu doesn't know that the guest will switch to "secure mode"
> in advance. There is no difference between a normal and a secure
> partition until the partition does the magic UV call to "enter secure
> mode" and qemu doesn't see any of it. So who can set the flag here ?
>
> - Second, when using VIRTIO_F_IOMMU_PLATFORM, we also make qemu (or
> vhost) go through the emulated MMIO for every access to the guest,
> which adds additional overhead.
Also this whole scheme is simply the wrong way around. No driver should
opt out of the DMA API in general. For legacy reasons we might have to
opt out of the dma API for some virtio cases due to qemu bugs, but
this should never have been the default; it should have been a quirk
for the affected versions.
We need to fix this now and make the dma-ops bypass a quirk instead of
the default, which will also solve the power issue.
On Wed, May 23, 2018 at 09:50:02PM +0300, Michael S. Tsirkin wrote:
> subj: s/virito/virtio/
>
..snip..
> > machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
> > +
> > +bool platform_forces_virtio_dma(struct virtio_device *vdev)
> > +{
> > + /*
> > + * On protected guest platforms, force virtio core to use DMA
> > + * MAP API for all virtio devices. But there can also be some
> > + * exceptions for individual devices like virtio balloon.
> > + */
> > + return (of_find_compatible_node(NULL, NULL, "ibm,ultravisor") != NULL);
> > +}
>
> Isn't this kind of slow? vring_use_dma_api is on
> data path and supposed to be very fast.
Yes it is slow and not ideal. This won't be the final code. The final
code will cache the information in some global variable and use it
in this function.
However this code was added to the RFC to illustrate the idea that dma
operations are needed only in a secure/protected environment.
RP
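(A minimal sketch of the cached variant described above, with illustrative
names and initcall placement that are assumptions, not the posted patch:)

#include <linux/of.h>
#include <linux/virtio.h>
#include <asm/machdep.h>

static bool is_secure_guest __read_mostly;

/* Do the device-tree walk once at boot and cache the result. */
static int __init secure_guest_detect(void)
{
	struct device_node *np;

	np = of_find_compatible_node(NULL, NULL, "ibm,ultravisor");
	is_secure_guest = (np != NULL);
	of_node_put(np);
	return 0;
}
machine_arch_initcall(pseries, secure_guest_detect);

bool platform_forces_virtio_dma(struct virtio_device *vdev)
{
	return is_secure_guest;		/* O(1) on the data path */
}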
On Thu, May 24, 2018 at 08:27:04AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2018-05-23 at 21:50 +0300, Michael S. Tsirkin wrote:
>
> > I re-read that discussion and I'm still unclear on the
> > original question, since I got several apparently
> > conflicting answers.
> >
> > I asked:
> >
> > Why isn't setting VIRTIO_F_IOMMU_PLATFORM on the
> > hypervisor side sufficient?
>
> I thought I had replied to this...
>
> There are a couple of reasons:
>
> - First qemu doesn't know that the guest will switch to "secure mode"
> in advance. There is no difference between a normal and a secure
> partition until the partition does the magic UV call to "enter secure
> mode" and qemu doesn't see any of it. So who can set the flag here ?
Not sure I understand. Just set the flag e.g. on qemu command line.
I might be wrong, but these secure mode things usually
a. require hypervisor side tricks anyway
> - Second, when using VIRTIO_F_IOMMU_PLATFORM, we also make qemu (or
> vhost) go through the emulated MMIO for every access to the guest,
> which adds additional overhead.
>
> Cheers,
> Ben.
Well it's not supposed to be much slower for the static case.
vhost has a cache so should be fine.
A while ago Paolo implemented a translation cache which should be
perfect for this case - most of the code got merged but
never enabled because of stability issues.
If all else fails, we could teach QEMU to handle the no-iommu case
as if VIRTIO_F_IOMMU_PLATFORM was off.
On Tue, 2018-05-29 at 09:48 +1000, Benjamin Herrenschmidt wrote:
> > Well it's not supposed to be much slower for the static case.
> >
> > vhost has a cache so should be fine.
> >
> > A while ago Paolo implemented a translation cache which should be
> > perfect for this case - most of the code got merged but
> > never enabled because of stability issues.
> >
> > If all else fails, we could teach QEMU to handle the no-iommu case
> > as if VIRTIO_F_IOMMU_PLATFORM was off.
>
> Any serious reason why not just getting that 2 line patch allowing our
> arch code to force virtio to use the DMA API ?
>
> It's not particularly invasive and solves our problem rather nicely
> without adding overhead or additional knowledge to qemu/libvirt/mgmnt
> tools etc... that it doesn't need etc....
>
> The guest knows it's going secure so the guest arch code can do the
> right thing rather trivially.
>
> Long term we should probably make virtio always use the DMA API anyway,
> and interpose "1:1" dma_ops for the traditional virtio case, that would
> reduce code clutter significantly. In that case, it would become just a
> matter of having a platform hook to override the dma_ops used.
To elaborate a bit ....
What we are trying to solve here is entirely a guest problem; I don't
think involving qemu in the solution is the right thing to do.
The guest can only allow external parties (qemu, potentially PCI
devices, etc...) access to some restricted portions of memory (insecure
memory). Thus the guest needs to do some bounce buffering/swiotlb type
tricks.
This is completely orthogonal to whether there is an actual iommu
between the guest and the device (or emulated device/virtio).
This is why I think the solution should reside in the guest kernel, by
proper manipulation (by the arch code) of the dma ops.
I don't think forcing the addition of an emulated iommu in the middle
just to work around the fact that virtio "cheats" and doesn't use the
dma API unless there is one, is the right "fix".
The right long term fix is to always use the DMA API, reducing code
path etc... and just have a single point where virtio can "choose"
alternate DMA ops (via an arch hook to deal with our case).
In the meantime, having the hook we propose gets us going, but if you
agree with the approach, we should also work on the long term approach.
Cheers,
Ben.
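(A hedged sketch of the guest-side arch hook being described here; the hook
name, is_secure_guest() and pseries_secure_dma_ops are made-up placeholders,
and the ops returned would be whatever swiotlb-backed dma_map_ops the
platform provides for bouncing through shared pages:)

#include <linux/dma-mapping.h>
#include <linux/virtio.h>

const struct dma_map_ops *arch_virtio_dma_ops(struct virtio_device *vdev)
{
	if (is_secure_guest())			/* assumed helper */
		return &pseries_secure_dma_ops;	/* bounce via the shared pool */

	return NULL;	/* no override: virtio keeps its default behaviour */
}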
On Fri, 2018-05-25 at 20:45 +0300, Michael S. Tsirkin wrote:
> On Thu, May 24, 2018 at 08:27:04AM +1000, Benjamin Herrenschmidt wrote:
> > On Wed, 2018-05-23 at 21:50 +0300, Michael S. Tsirkin wrote:
> >
> > > I re-read that discussion and I'm still unclear on the
> > > original question, since I got several apparently
> > > conflicting answers.
> > >
> > > I asked:
> > >
> > > Why isn't setting VIRTIO_F_IOMMU_PLATFORM on the
> > > hypervisor side sufficient?
> >
> > I thought I had replied to this...
> >
> > There are a couple of reasons:
> >
> > - First qemu doesn't know that the guest will switch to "secure mode"
> > in advance. There is no difference between a normal and a secure
> > partition until the partition does the magic UV call to "enter secure
> > mode" and qemu doesn't see any of it. So who can set the flag here ?
>
> Not sure I understand. Just set the flag e.g. on qemu command line.
> I might be wrong, but these secure mode things usually
> a. require hypervisor side tricks anyway
The way our secure mode architecture is designed, there is no need at
this point for any knowledge at the qemu level whatsoever. Well at least
until we do migration but that's a different kettle of fish. In any
case, the guest starts normally (which means as a non-secure guest, and
thus expects normal virtio, our FW today doesn't handle
VIRTIO_F_IOMMU_PLATFORM, though granted, we can fix this), and later
that guest issues some special Ultravisor call that turns it into a
secure guest.
There is some involvement of the hypervisor, but not qemu at this
stage. We would very much like to avoid that, as it would be a hassle
for users to have to use different libvirt options etc... because the guest
might turn itself into a secure VM.
> > - Second, when using VIRTIO_F_IOMMU_PLATFORM, we also make qemu (or
> > vhost) go through the emulated MMIO for every access to the guest,
> > which adds additional overhead.
> >
> > Cheers,
> > Ben.
>
> Well it's not supposed to be much slower for the static case.
>
> vhost has a cache so should be fine.
>
> A while ago Paolo implemented a translation cache which should be
> perfect for this case - most of the code got merged but
> never enabled because of stability issues.
>
> If all else fails, we could teach QEMU to handle the no-iommu case
> as if VIRTIO_F_IOMMU_PLATFORM was off.
Any serious reason not to just take that 2-line patch allowing our
arch code to force virtio to use the DMA API?
It's not particularly invasive and solves our problem rather nicely
without adding overhead or additional knowledge to qemu/libvirt/management
tools etc... that they don't need.
The guest knows it's going secure so the guest arch code can do the
right thing rather trivially.
Long term we should probably make virtio always use the DMA API anyway,
and interpose "1:1" dma_ops for the traditional virtio case, that would
reduce code clutter significantly. In that case, it would become just a
matter of having a platform hook to override the dma_ops used.
Cheers,
Ben.
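(A rough sketch of what such interposed "1:1" dma_ops could look like; only
map_page is shown, and the symbol names are illustrative rather than existing
kernel interfaces:)

#include <linux/dma-mapping.h>
#include <linux/io.h>

static dma_addr_t virtio_direct_map_page(struct device *dev, struct page *page,
					 unsigned long offset, size_t size,
					 enum dma_data_direction dir,
					 unsigned long attrs)
{
	return page_to_phys(page) + offset;	/* identity: guest-physical address */
}

static const struct dma_map_ops virtio_direct_dma_ops = {
	.map_page	= virtio_direct_map_page,
	/* .unmap_page, .map_sg, .alloc, .free, ... would be similar trivial ops */
};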
On Tue, May 29, 2018 at 09:56:24AM +1000, Benjamin Herrenschmidt wrote:
> I don't think forcing the addition of an emulated iommu in the middle
> just to work around the fact that virtio "cheats" and doesn't use the
> dma API unless there is one, is the right "fix".
Agreed.
> The right long term fix is to always use the DMA API, reducing code
> path etc... and just have a single point where virtio can "chose"
> alternate DMA ops (via an arch hook to deal with our case).
Also agreed.
When Andi added vring_use_dma_api it was marked as temporary.
So I'd much rather move to blacklisting platforms that need this
hack now than adding another exception.
And then once we have the blacklist move it to a quirk in the arch
code that just forces dma_direct_ops as the per-device dma ops.
I don't really think this is a crazy long-term project, but something we
could do relatively quickly. Interestingly enough, the original commit
mentions PPC64 as a case where this quirk is needed.
On Tue, 2018-05-29 at 07:03 -0700, Christoph Hellwig wrote:
> On Tue, May 29, 2018 at 09:56:24AM +1000, Benjamin Herrenschmidt wrote:
> > I don't think forcing the addition of an emulated iommu in the middle
> > just to work around the fact that virtio "cheats" and doesn't use the
> > dma API unless there is one, is the right "fix".
>
> Agreed.
>
> > The right long term fix is to always use the DMA API, reducing code
> > path etc... and just have a single point where virtio can "chose"
> > alternate DMA ops (via an arch hook to deal with our case).
>
> Also agreed.
>
> When Andi added vring_use_dma_api it was marked as temporary.
>
> So I'd much rather move to blacklisting platforms that needs this
> hack now than adding another exception.
>
> And then once we have the blacklist move it to a quirk in the arch
> code that just forces dma_direct_ops as the per-device dma ops.
>
> I don't really think this is crazy long term, but something we could
> do relatively quickly. Interestingly enough the original commit
> mentions PPC64 as a case where this quirk is needed.
Not sure why, it's not so much a platform issue today. It's qemu itself
which by default bypasses any iommu. I suppose ppc64 stood out because
unlike x86 we always have an iommu by default.
Anyway, Anshuman, I think that's the right approach: first make virtio
always use the DMA API, with a quirk early on to override the ops.
Christoph: the overriding of the ops isn't a platform thing. It's a
qemu thing, i.e., from a Linux perspective, it's a feature of the
"device". So it should be done in virtio itself, not the platform code.
However, we do want the ability in platform code to force the bounce
buffering to solve our secure VM issue.
Cheers,
Ben.
On 05/24/2018 12:51 PM, Ram Pai wrote:
> On Wed, May 23, 2018 at 09:50:02PM +0300, Michael S. Tsirkin wrote:
>> subj: s/virito/virtio/
>>
> ..snip..
>>> machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
>>> +
>>> +bool platform_forces_virtio_dma(struct virtio_device *vdev)
>>> +{
>>> + /*
>>> + * On protected guest platforms, force virtio core to use DMA
>>> + * MAP API for all virtio devices. But there can also be some
>>> + * exceptions for individual devices like virtio balloon.
>>> + */
>>> + return (of_find_compatible_node(NULL, NULL, "ibm,ultravisor") != NULL);
>>> +}
>>
>> Isn't this kind of slow? vring_use_dma_api is on
>> data path and supposed to be very fast.
>
> Yes it is slow and not ideal. This won't be the final code. The final
> code will cache the information in some global variable and used
> in this function.
Right, this will be a simple check based on a global variable.
On Thu, May 31, 2018 at 09:09:24AM +0530, Anshuman Khandual wrote:
> On 05/24/2018 12:51 PM, Ram Pai wrote:
> > On Wed, May 23, 2018 at 09:50:02PM +0300, Michael S. Tsirkin wrote:
> >> subj: s/virito/virtio/
> >>
> > ..snip..
> >>> machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
> >>> +
> >>> +bool platform_forces_virtio_dma(struct virtio_device *vdev)
> >>> +{
> >>> + /*
> >>> + * On protected guest platforms, force virtio core to use DMA
> >>> + * MAP API for all virtio devices. But there can also be some
> >>> + * exceptions for individual devices like virtio balloon.
> >>> + */
> >>> + return (of_find_compatible_node(NULL, NULL, "ibm,ultravisor") != NULL);
> >>> +}
> >>
> >> Isn't this kind of slow? vring_use_dma_api is on
> >> data path and supposed to be very fast.
> >
> > Yes it is slow and not ideal. This won't be the final code. The final
> > code will cache the information in some global variable and used
> > in this function.
>
> Right this will be a simple check based on a global variable.
>
Please work on a long-term solution. Short-term needs can be served by
enabling the iommu platform in qemu.
--
MST
On Thu, May 24, 2018 at 08:27:04AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2018-05-23 at 21:50 +0300, Michael S. Tsirkin wrote:
>
> > I re-read that discussion and I'm still unclear on the
> > original question, since I got several apparently
> > conflicting answers.
> >
> > I asked:
> >
> > Why isn't setting VIRTIO_F_IOMMU_PLATFORM on the
> > hypervisor side sufficient?
>
> I thought I had replied to this...
>
> There are a couple of reasons:
>
> - First qemu doesn't know that the guest will switch to "secure mode"
> in advance. There is no difference between a normal and a secure
> partition until the partition does the magic UV call to "enter secure
> mode" and qemu doesn't see any of it. So who can set the flag here ?
This seems weird to me. As a rule HV calls should go through qemu -
or be allowed to go directly to KVM *by* qemu. We generally reserve
the latter for hot path things. Since this isn't a hot path, having
the call handled directly by the kernel seems wrong.
Unless a "UV call" is something different I don't know about.
> - Second, when using VIRTIO_F_IOMMU_PLATFORM, we also make qemu (or
> vhost) go through the emulated MMIO for every access to the guest,
> which adds additional overhead.
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Mon, 2018-06-04 at 18:57 +1000, David Gibson wrote:
>
> > - First qemu doesn't know that the guest will switch to "secure mode"
> > in advance. There is no difference between a normal and a secure
> > partition until the partition does the magic UV call to "enter secure
> > mode" and qemu doesn't see any of it. So who can set the flag here ?
>
> This seems weird to me. As a rule HV calls should go through qemu -
> or be allowed to go directly to KVM *by* qemu.
It's not an HV call, it's a UV call; qemu won't see it, and qemu isn't
trusted. Now the UV *will* reflect that to the HV via some synthesized
HV calls, and we *could* have those do a pass by qemu; however, so far,
our entire design doesn't rely on *any* qemu knowledge whatsoever and
it would be sad to add it just for that purpose.
Additionally, this is rather orthogonal, see my other email, the
problem we are trying to solve is *not* a qemu problem and it doesn't
make sense to leak that into qemu.
> We generally reserve
> the latter for hot path things. Since this isn't a hot path, having
> the call handled directly by the kernel seems wrong.
>
> Unless a "UV call" is something different I don't know about.
Yes, a UV call goes to the Ultravisor, not the Hypervisor. The
Hypervisor isn't trusted.
> > - Second, when using VIRTIO_F_IOMMU_PLATFORM, we also make qemu (or
> > vhost) go through the emulated MMIO for every access to the guest,
> > which adds additional overhead.
>
Ben.
On Thu, May 24, 2018 at 08:27:04AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2018-05-23 at 21:50 +0300, Michael S. Tsirkin wrote:
>
> > I re-read that discussion and I'm still unclear on the
> > original question, since I got several apparently
> > conflicting answers.
> >
> > I asked:
> >
> > Why isn't setting VIRTIO_F_IOMMU_PLATFORM on the
> > hypervisor side sufficient?
>
> I thought I had replied to this...
>
> There are a couple of reasons:
>
> - First qemu doesn't know that the guest will switch to "secure mode"
> in advance. There is no difference between a normal and a secure
> partition until the partition does the magic UV call to "enter secure
> mode" and qemu doesn't see any of it. So who can set the flag here ?
The user should set it. You just tell the user "to be able to use with
feature X, enable IOMMU".
> - Second, when using VIRTIO_F_IOMMU_PLATFORM, we also make qemu (or
> vhost) go through the emulated MMIO for every access to the guest,
> which adds additional overhead.
>
> Cheers,
> Ben.
There are several answers to this. One is that we are working hard to
make overhead small when the mappings are static (which they would be if
there's no actual IOMMU). So maybe, especially given you are using
a bounce buffer on top, it's not so bad - did you try to
benchmark?
Another is that given the basic functionality is in there, optimizations
can possibly wait until per-device quirks in DMA API are supported.
On Mon, Jun 04, 2018 at 07:48:54PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-06-04 at 18:57 +1000, David Gibson wrote:
> >
> > > - First qemu doesn't know that the guest will switch to "secure mode"
> > > in advance. There is no difference between a normal and a secure
> > > partition until the partition does the magic UV call to "enter secure
> > > mode" and qemu doesn't see any of it. So who can set the flag here ?
> >
> > This seems weird to me. As a rule HV calls should go through qemu -
> > or be allowed to go directly to KVM *by* qemu.
>
> It's not an HV call, it's a UV call, qemu won't see it, qemu isn't
> trusted. Now the UV *will* reflect that to the HV via some synthetized
> HV calls, and we *could* have those do a pass by qemu, however, so far,
> our entire design doesn't rely on *any* qemu knowledge whatsoever and
> it would be sad to add it just for that purpose.
It's a temporary work-around. I think that the long-term fix is to
support per-device quirks and have the DMA API DTRT for virtio.
> Additionally, this is rather orthogonal, see my other email, the
> problem we are trying to solve is *not* a qemu problem and it doesn't
> make sense to leak that into qemu.
>
> > We generally reserve
> > the latter for hot path things. Since this isn't a hot path, having
> > the call handled directly by the kernel seems wrong.
> >
> > Unless a "UV call" is something different I don't know about.
>
> Yes, a UV call goes to the Ultravisor, not the Hypervisor. The
> Hypervisor isn't trusted.
>
> > > - Second, when using VIRTIO_F_IOMMU_PLATFORM, we also make qemu (or
> > > vhost) go through the emulated MMIO for every access to the guest,
> > > which adds additional overhead.
> >
> Ben.
On Mon, Jun 04, 2018 at 03:43:09PM +0300, Michael S. Tsirkin wrote:
> Another is that given the basic functionality is in there, optimizations
> can possibly wait until per-device quirks in DMA API are supported.
We have had per-device dma_ops for quite a while.
On Mon, 2018-06-04 at 15:43 +0300, Michael S. Tsirkin wrote:
> On Thu, May 24, 2018 at 08:27:04AM +1000, Benjamin Herrenschmidt wrote:
> > On Wed, 2018-05-23 at 21:50 +0300, Michael S. Tsirkin wrote:
> >
> > > I re-read that discussion and I'm still unclear on the
> > > original question, since I got several apparently
> > > conflicting answers.
> > >
> > > I asked:
> > >
> > > Why isn't setting VIRTIO_F_IOMMU_PLATFORM on the
> > > hypervisor side sufficient?
> >
> > I thought I had replied to this...
> >
> > There are a couple of reasons:
> >
> > - First qemu doesn't know that the guest will switch to "secure mode"
> > in advance. There is no difference between a normal and a secure
> > partition until the partition does the magic UV call to "enter secure
> > mode" and qemu doesn't see any of it. So who can set the flag here ?
>
> The user should set it. You just tell user "to be able to use with
> feature X, enable IOMMU".
That's completely backwards. The user has no idea what that stuff is.
And it would have to percolate all the way up the management stack,
libvirt, kimchi, whatever else ... that's just nonsense.
Especially since, as I explained in my other email, this is *not* a
qemu problem and thus the solution shouldn't be messing around with
qemu.
>
> > - Second, when using VIRTIO_F_IOMMU_PLATFORM, we also make qemu (or
> > vhost) go through the emulated MMIO for every access to the guest,
> > which adds additional overhead.
> >
> > Cheers,
> > Ben.
>
> There are several answers to this. One is that we are working hard to
> make overhead small when the mappings are static (which they would be if
> there's no actual IOMMU). So maybe especially given you are using
> a bounce buffer on top it's not so bad - did you try to
> benchmark?
>
> Another is that given the basic functionality is in there, optimizations
> can possibly wait until per-device quirks in DMA API are supported.
The point is that requiring specific qemu command line arguments isn't
going to fly. We have additional problems due to the fact that our
firmware (SLOF) inside qemu doesn't currently deal with iommus etc...
though those can be fixed.
Overall, however, this seems to be the most convoluted way of achieving
things, requiring user intervention where none should be needed etc...
Again, what's wrong with a 2-line hook instead that solves it all and
completely avoids involving qemu?
Ben.
On Mon, 2018-06-04 at 05:55 -0700, Christoph Hellwig wrote:
> On Mon, Jun 04, 2018 at 03:43:09PM +0300, Michael S. Tsirkin wrote:
> > Another is that given the basic functionality is in there, optimizations
> > can possibly wait until per-device quirks in DMA API are supported.
>
> We have had per-device dma_ops for quite a while.
I've asked Anshuman to start with a patch that converts virtio to use
DMA ops always, along with an init quirk to hook up "direct" ops when
the IOMMU flag isn't set.
This will at least remove that horrid duplication of code path we have
in there.
Then we can just involve the arch in that init quirk so we can choose an
alternate set of ops when running a secure VM.
This is completely orthogonal to whether an iommu exist qemu side or
not, and should be entirely solved on the Linux side.
Cheers,
Ben.
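(A hedged sketch of the init-time quirk being outlined here: the ops are
chosen once per device, and the data path then always uses the DMA API.
virtio_direct_dma_ops and arch_virtio_dma_ops() are assumed names from the
earlier sketches in this thread, not existing interfaces:)

#include <linux/dma-mapping.h>
#include <linux/virtio.h>
#include <linux/virtio_config.h>

static void virtio_setup_dma_ops(struct virtio_device *vdev)
{
	const struct dma_map_ops *ops;

	if (virtio_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM))
		return;				/* keep the platform/IOMMU ops */

	/* Let the arch override (e.g. bounce buffering in a secure VM)... */
	ops = arch_virtio_dma_ops(vdev);
	if (!ops)
		ops = &virtio_direct_dma_ops;	/* ...otherwise plain 1:1 ops */

	set_dma_ops(vdev->dev.parent, ops);	/* DMA is done by the parent device */
}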
On Mon, Jun 04, 2018 at 11:11:52PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-06-04 at 15:43 +0300, Michael S. Tsirkin wrote:
> > On Thu, May 24, 2018 at 08:27:04AM +1000, Benjamin Herrenschmidt wrote:
> > > On Wed, 2018-05-23 at 21:50 +0300, Michael S. Tsirkin wrote:
> > >
> > > > I re-read that discussion and I'm still unclear on the
> > > > original question, since I got several apparently
> > > > conflicting answers.
> > > >
> > > > I asked:
> > > >
> > > > Why isn't setting VIRTIO_F_IOMMU_PLATFORM on the
> > > > hypervisor side sufficient?
> > >
> > > I thought I had replied to this...
> > >
> > > There are a couple of reasons:
> > >
> > > - First qemu doesn't know that the guest will switch to "secure mode"
> > > in advance. There is no difference between a normal and a secure
> > > partition until the partition does the magic UV call to "enter secure
> > > mode" and qemu doesn't see any of it. So who can set the flag here ?
> >
> > The user should set it. You just tell user "to be able to use with
> > feature X, enable IOMMU".
>
> That's completely backwards. The user has no idea what that stuff is.
> And it would have to percolate all the way up the management stack,
> libvirt, kimchi, whatever else ... that's just nonsense.
>
> Especially since, as I explained in my other email, this is *not* a
> qemu problem and thus the solution shouldn't be messing around with
> qemu.
virtio is implemented in qemu though. If you prefer to stick
all your code in either the guest or the UV, that's your decision,
but it looks like qemu could be helpful here.
For example, what if you have a guest that passes physical addresses
to qemu, bypassing swiotlb? Don't you want to detect
that and fail gracefully rather than crash the guest?
That's what VIRTIO_F_IOMMU_PLATFORM will do for you.
Still, that's the hypervisor's decision. What isn't up to the hypervisor is
the way we structure code. We made an early decision to merge a hack
with xen, amid discussion about how, with time, the DMA API would learn to
support per-device quirks and we'd be able to switch to that.
So let's do that now?
> >
> > > - Second, when using VIRTIO_F_IOMMU_PLATFORM, we also make qemu (or
> > > vhost) go through the emulated MMIO for every access to the guest,
> > > which adds additional overhead.
> > >
> > > Cheers,
> > > Ben.
> >
> > There are several answers to this. One is that we are working hard to
> > make overhead small when the mappings are static (which they would be if
> > there's no actual IOMMU). So maybe especially given you are using
> > a bounce buffer on top it's not so bad - did you try to
> > benchmark?
> >
> > Another is that given the basic functionality is in there, optimizations
> > can possibly wait until per-device quirks in DMA API are supported.
>
> The point is that requiring specific qemu command line arguments isn't
> going to fly. We have additional problems due to the fact that our
> firmware (SLOF) inside qemu doesn't currently deal with iommu's etc...
> though those can be fixed.
>
> Overall, however, this seems to be the most convoluted way of achieving
> things, require user interventions where none should be needed etc...
>
> Again, what's wrong with a 2 lines hook instead that solves it all and
> completely avoids involving qemu ?
>
> Ben.
That each platform wants to add hacks in this data path function.
On Mon, Jun 04, 2018 at 11:14:36PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-06-04 at 05:55 -0700, Christoph Hellwig wrote:
> > On Mon, Jun 04, 2018 at 03:43:09PM +0300, Michael S. Tsirkin wrote:
> > > Another is that given the basic functionality is in there, optimizations
> > > can possibly wait until per-device quirks in DMA API are supported.
> >
> > We have had per-device dma_ops for quite a while.
>
> I've asked Ansuman to start with a patch that converts virtio to use
> DMA ops always, along with an init quirk to hookup "direct" ops when
> the IOMMU flag isn't set.
>
> This will at least remove that horrid duplication of code path we have
> in there.
>
> Then we can just involve the arch in that init quirk so we can chose an
> alternate set of ops when running a secure VM.
>
> This is completely orthogonal to whether an iommu exist qemu side or
> not, and should be entirely solved on the Linux side.
>
> Cheers,
> Ben.
Sounds good to me.
--
MST
On Mon, 2018-06-04 at 19:21 +0300, Michael S. Tsirkin wrote:
>
> > > > - First qemu doesn't know that the guest will switch to "secure mode"
> > > > in advance. There is no difference between a normal and a secure
> > > > partition until the partition does the magic UV call to "enter secure
> > > > mode" and qemu doesn't see any of it. So who can set the flag here ?
> > >
> > > The user should set it. You just tell user "to be able to use with
> > > feature X, enable IOMMU".
> >
> > That's completely backwards. The user has no idea what that stuff is.
> > And it would have to percolate all the way up the management stack,
> > libvirt, kimchi, whatever else ... that's just nonsense.
> >
> > Especially since, as I explained in my other email, this is *not* a
> > qemu problem and thus the solution shouldn't be messing around with
> > qemu.
>
> virtio is implemented in qemu though. If you prefer to stick
> all your code in either guest or the UV that's your decision
> but it looks like qemu could be helpful here.
Sorry Michael, that doesn't click. Yes of course virtio is implemented
in qemu, but the problem we are trying to solve is *not* a qemu problem
(the fact that the Linux drivers bypass the DMA API is wrong, needs
fixing, and isnt a qemu problem). The fact that the secure guests need
bounce buffering is not a qemu problem either.
Whether qemu chose to use an iommu or not is, and should remain an
orthogonal problem.
Forcing qemu to use the iommu to work around a Linux-side lack of
proper use of the DMA API is not only just papering over the problem,
it's also forcing changes up 3 or 4 levels of the SW stack to create
that new option that no user will understand the meaning of and that
would otherwise be unnecessary.
> For example what if you have a guest that passes physical addresses
> to qemu bypassing swiotlb? Don't you want to detect
> that and fail gracefully rather than crash the guest?
A guest bug then ? Well it wouldn't so much crash as force the pages to
become encrypted and cause horrible ping/pong between qemu and the
guest (the secure pages aren't accessible to qemu directly).
> That's what VIRTIO_F_IOMMU_PLATFORM will do for you.
Again this is orthogonal. Using an iommu will indeed provide a modicum
of protection against buggy drivers, like it does on HW PCI platforms,
whether those guests are secure or not.
Note however that in practice, we tend to disable the iommu even on
real HW whenever we want performance (of course we can't for guests but
for bare metal systems we do; the added RAS isn't worth the performance
lost for very fast networking, for example).
> Still that's hypervisor's decision. What isn't up to the hypervisor is
> the way we structure code. We made an early decision to merge a hack
> with xen, among discussion about how with time DMA API will learn to
> support per-device quirks and we'll be able to switch to that.
> So let's do that now?
The DMA API itself isn't the one that needs to learn "per-device
quirks", it's just plumbing into arch backends. The "quirk" is at the
point of establishing the backend for a given device.
We can go a good way down that path simply by having virtio in Linux
install its own direct ops *itself* when VIRTIO_F_IOMMU_PLATFORM is not
set, and by removing all the special casing
in the rest of the driver.
Once that's done, we have a single point of establishing the dma ops,
we can quirk in there if needed, that's rather nicely contained, or put
an arch hook, or whatever is necessary.
I would like to keep, however, the ability to bypass the iommu for
performance reasons, and also because it's qemu's default mode of
operation and my secure guest has no clean way to force qemu to turn
the iommu on. The hypervisor *could* return something to qemu when the
guest switches to secure, as we do know that, and qemu could walk all of
its virtio devices as a result and "switch" them over, but that's
almost grosser from a qemu perspective.
.../...
> > The point is that requiring specific qemu command line arguments isn't
> > going to fly. We have additional problems due to the fact that our
> > firmware (SLOF) inside qemu doesn't currently deal with iommu's etc...
> > though those can be fixed.
> >
> > Overall, however, this seems to be the most convoluted way of achieving
> > things, require user interventions where none should be needed etc...
> >
> > Again, what's wrong with a 2 lines hook instead that solves it all and
> > completely avoids involving qemu ?
> >
> > Ben.
>
> That each platform wants to add hacks in this data path function.
Sure, then add a single platform hook and the platforms can do what
they want here.
But as I said, it should all be done at initialization time rather than
in the data path; on this we absolutely agree. We should just choose the
right set of dma_ops, and have the data path always use the DMA API.
Cheers,
Ben.
> > >
> > > > >
> > > > >
> > > > > > arch/powerpc/include/asm/dma-mapping.h | 6 ++++++
> > > > > > arch/powerpc/platforms/pseries/iommu.c | 11 +++++++++++
> > > > > > drivers/virtio/virtio_ring.c | 10 ++++++++++
> > > > > > 3 files changed, 27 insertions(+)
> > > > > >
> > > > > > diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
> > > > > > index 8fa3945..056e578 100644
> > > > > > --- a/arch/powerpc/include/asm/dma-mapping.h
> > > > > > +++ b/arch/powerpc/include/asm/dma-mapping.h
> > > > > > @@ -115,4 +115,10 @@ extern u64 __dma_get_required_mask(struct device *dev);
> > > > > > #define ARCH_HAS_DMA_MMAP_COHERENT
> > > > > >
> > > > > > #endif /* __KERNEL__ */
> > > > > > +
> > > > > > +#define platform_forces_virtio_dma platform_forces_virtio_dma
> > > > > > +
> > > > > > +struct virtio_device;
> > > > > > +
> > > > > > +extern bool platform_forces_virtio_dma(struct virtio_device *vdev);
> > > > > > #endif /* _ASM_DMA_MAPPING_H */
> > > > > > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > > > > > index 06f0296..a2ec15a 100644
> > > > > > --- a/arch/powerpc/platforms/pseries/iommu.c
> > > > > > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > > > > > @@ -38,6 +38,7 @@
> > > > > > #include <linux/of.h>
> > > > > > #include <linux/iommu.h>
> > > > > > #include <linux/rculist.h>
> > > > > > +#include <linux/virtio.h>
> > > > > > #include <asm/io.h>
> > > > > > #include <asm/prom.h>
> > > > > > #include <asm/rtas.h>
> > > > > > @@ -1396,3 +1397,13 @@ static int __init disable_multitce(char *str)
> > > > > > __setup("multitce=", disable_multitce);
> > > > > >
> > > > > > machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
> > > > > > +
> > > > > > +bool platform_forces_virtio_dma(struct virtio_device *vdev)
> > > > > > +{
> > > > > > + /*
> > > > > > + * On protected guest platforms, force virtio core to use DMA
> > > > > > + * MAP API for all virtio devices. But there can also be some
> > > > > > + * exceptions for individual devices like virtio balloon.
> > > > > > + */
> > > > > > + return (of_find_compatible_node(NULL, NULL, "ibm,ultravisor") != NULL);
> > > > > > +}
> > > > >
> > > > > Isn't this kind of slow? vring_use_dma_api is on
> > > > > data path and supposed to be very fast.
> > > > >
> > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > index 21d464a..47ea6c3 100644
> > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > @@ -141,8 +141,18 @@ struct vring_virtqueue {
> > > > > > * unconditionally on data path.
> > > > > > */
> > > > > >
> > > > > > +#ifndef platform_forces_virtio_dma
> > > > > > +static inline bool platform_forces_virtio_dma(struct virtio_device *vdev)
> > > > > > +{
> > > > > > + return false;
> > > > > > +}
> > > > > > +#endif
> > > > > > +
> > > > > > static bool vring_use_dma_api(struct virtio_device *vdev)
> > > > > > {
> > > > > > + if (platform_forces_virtio_dma(vdev))
> > > > > > + return true;
> > > > > > +
> > > > > > if (!virtio_has_iommu_quirk(vdev))
> > > > > > return true;
> > > > > >
> > > > > > --
> > > > > > 2.9.3
On Tue, Jun 05, 2018 at 09:26:56AM +1000, Benjamin Herrenschmidt wrote:
> I would like to keep however the ability to bypass the iommu for
> performance reasons
So that's easy, clear the IOMMU flag and this means "bypass the IOMMU".
--
MST
On Mon, Jun 04, 2018 at 07:48:54PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-06-04 at 18:57 +1000, David Gibson wrote:
> >
> > > - First qemu doesn't know that the guest will switch to "secure mode"
> > > in advance. There is no difference between a normal and a secure
> > > partition until the partition does the magic UV call to "enter secure
> > > mode" and qemu doesn't see any of it. So who can set the flag here ?
> >
> > This seems weird to me. As a rule HV calls should go through qemu -
> > or be allowed to go directly to KVM *by* qemu.
>
> It's not an HV call, it's a UV call, qemu won't see it, qemu isn't
> trusted. Now the UV *will* reflect that to the HV via some synthesized
> HV calls, and we *could* have those do a pass by qemu, however, so far,
> our entire design doesn't rely on *any* qemu knowledge whatsoever and
> it would be sad to add it just for that purpose.
>
> Additionally, this is rather orthogonal, see my other email, the
> problem we are trying to solve is *not* a qemu problem and it doesn't
> make sense to leak that into qemu.
>
> > We generally reserve
> > the latter for hot path things. Since this isn't a hot path, having
> > the call handled directly by the kernel seems wrong.
> >
> > Unless a "UV call" is something different I don't know about.
>
> Yes, a UV call goes to the Ultravisor, not the Hypervisor. The
> Hypervisor isn't trusted.
Ah, right. Is that implemented in the host kernel, or in something
further above?
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Tue, Jun 05, 2018 at 09:26:56AM +1000, Benjamin Herrenschmidt wrote:
> Sorry Michael, that doesn't click. Yes of course virtio is implemented
> in qemu, but the problem we are trying to solve is *not* a qemu problem
> (the fact that the Linux drivers bypass the DMA API is wrong, needs
> fixing, and isnt a qemu problem). The fact that the secure guests need
> bounce buffering is not a qemu problem either.
>
> Whether qemu chose to use an iommu or not is, and should remain an
> orthogonal problem.
Agreed. We have a problem with qemu (old qemu only?) punching a hole
into the VM abstraction by deciding that even if firmware tables
claim use of an IOMMU for a PCI bus it expects virtio to use physical
addresses. So far so bad. The answer to that should have been to
quirk the affected qemu versions and move on. Instead we now have
virtio not using the DMA API by default and are creating a worse problem.
Let's fix this issue ASAP and quirk the buggy implementations instead
of letting everyone else suffer.
> The DMA API itself isn't the one that needs to learn "per-device
> quirks", it's just plumbing into arch backends. The "quirk" is at the
> point of establishing the backend for a given device.
>
> We can go a good way down that path simply by having virtio in Linux
> start with putting *itself* its own direct ops in there when
> VIRTIO_F_IOMMU_PLATFORM is not set, and removing all the special casing
> in the rest of the driver.
Yes. And we have all the infrastructure for that now. A few RDMA
drivers quirk to virt_dma_ops, and virtio could quirk to dma_direct_ops
anytime now. In fact given how much time we are spending arguing here
I'm going to give it a spin today.
> Once that's done, we have a single point of establishing the dma ops,
> we can quirk in there if needed, that's rather nicely contained, or put
> an arch hook, or whatever is necessary.
Yes.
On Thu, May 31, 2018 at 08:43:58PM +0300, Michael S. Tsirkin wrote:
> Pls work on a long term solution. Short term needs can be served by
> enabling the iommu platform in qemu.
So, I spent some time looking at converting virtio to dma ops overrides,
and the current virtio spec, and the sad truth I have to tell is that
both the spec and the Linux implementation are completely and utterly fucked
up.
Both in the flag naming and the implementation there is an implication
of DMA API == IOMMU, which is fundamentally wrong.
The DMA API does a few different things:
a) address translation
This does include IOMMUs. But it also includes random offsets
between PCI bars and system memory that we see on various
platforms. Worse, some of these offsets might be based on
banks, e.g. on the broadcom bmips platform. It also deals
with bitmasks in physical addresses related to memory encryption
like AMD SEV. I'd be really curious how, for example, the
Intel virtio based NIC is going to work on any of those
platforms.
b) coherency
On many architectures DMA is not cache coherent, and we need
to invalidate and/or write back cache lines before doing
DMA. Again, I wonder how this is ever going to work with
hardware based virtio implementations. Even worse, I think this
is actually broken at least for VIVT caches, even for virtualized
implementations. E.g. a KVM guest is going to access memory
using different virtual addresses than qemu, vhost might throw
in another different address space.
c) bounce buffering
Many DMA implementations can not address all physical memory
due to addressing limitations. In such cases we copy the
DMA memory into a known addressable bounce buffer and DMA
from there.
d) flushing write combining buffers or similar
On some hardware platforms we need workarounds to e.g. read
from a certain mmio address to make sure DMA can actually
see memory written by the host.
All of this is bypassed by virtio by default despite generally being
platform issues, not particular to a given device.
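To spell out what "bypassed" means in code, here is a condensed
illustration, loosely modelled on vring_map_one_sg() in
drivers/virtio/virtio_ring.c (map_buf() is made up for this example and
ignores the xen special case):

#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <linux/virtio.h>
#include <linux/virtio_config.h>

static dma_addr_t map_buf(struct virtio_device *vdev, struct device *dma_dev,
			  void *buf, size_t len)
{
	if (!virtio_has_iommu_quirk(vdev))
		/* DMA API path: translation, coherency, bouncing handled */
		return dma_map_single(dma_dev, buf, len, DMA_TO_DEVICE);

	/* legacy path: assume guest physical addresses just work */
	return (dma_addr_t)virt_to_phys(buf);
}

Everything under a) to d) above only happens on the first branch.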
On Wed, Jun 06, 2018 at 10:23:06PM -0700, Christoph Hellwig wrote:
> On Thu, May 31, 2018 at 08:43:58PM +0300, Michael S. Tsirkin wrote:
> > Pls work on a long term solution. Short term needs can be served by
> > enabling the iommu platform in qemu.
>
> So, I spent some time looking at converting virtio to dma ops overrides,
> and the current virtio spec, and the sad through I have to tell is that
> both the spec and the Linux implementation are complete and utterly fucked
> up.
Let me restate it: DMA API has support for a wide range of hardware, and
hardware based virtio implementations likely won't benefit from all of
it.
And given virtio right now is optimized for specific workloads, improving
portability without regressing performance isn't easy.
I think it's unsurprising since it started as strictly a guest/host
mechanism. People did implement offloads on specific platforms though,
and they are known to work. To improve portability even further,
we might need to make spec and code changes.
I'm not really sympathetic to people complaining that they can't even
set a flag in qemu though. If that's the case the stack in question is
way too inflexible.
> Both in the flag naming and the implementation there is an implication
> of DMA API == IOMMU, which is fundamentally wrong.
Maybe we need to extend the meaning of PLATFORM_IOMMU or rename it.
It's possible that some setups will benefit from a more
fine-grained approach where some aspects of the DMA
API are bypassed, others aren't.
This seems to be what was being asked for in this thread,
with comments claiming IOMMU flag adds too much overhead.
> The DMA API does a few different things:
>
> a) address translation
>
> This does include IOMMUs. But it also includes random offsets
> between PCI bars and system memory that we see on various
> platforms.
I don't think you mean bars. That's unrelated to DMA.
> Worse so some of these offsets might be based on
> banks, e.g. on the broadcom bmips platform. It also deals
> with bitmask in physical addresses related to memory encryption
> like AMD SEV. I'd be really curious how for example the
> Intel virtio based NIC is going to work on any of those
> plaforms.
SEV guys report that they just set the iommu flag and then it all works.
I guess if there's translation we can think of this as a kind of iommu.
Maybe we should rename PLATFORM_IOMMU to PLATFORM_TRANSLATION?
And apparently some people complain that just setting that flag makes
qemu check translation on each access with an unacceptable performance
overhead. Forcing same behaviour for everyone on general principles
even without the flag is unlikely to make them happy.
> b) coherency
>
> On many architectures DMA is not cache coherent, and we need
> to invalidate and/or write back cache lines before doing
> DMA. Again, I wonder how this is every going to work with
> hardware based virtio implementations.
You mean dma_Xmb and friends?
There's a new feature VIRTIO_F_IO_BARRIER that's being proposed
for that.
> Even worse I think this
> is actually broken at least for VIVT event for virtualized
> implementations. E.g. a KVM guest is going to access memory
> using different virtual addresses than qemu, vhost might throw
> in another different address space.
I don't really know what VIVT is. Could you help me please?
> c) bounce buffering
>
> Many DMA implementations can not address all physical memory
> due to addressing limitations. In such cases we copy the
> DMA memory into a known addressable bounc buffer and DMA
> from there.
Don't do it then?
> d) flushing write combining buffers or similar
>
> On some hardware platforms we need workarounds to e.g. read
> from a certain mmio address to make sure DMA can actually
> see memory written by the host.
I guess it isn't an issue as long as WC isn't actually used.
It will become an issue when virtio spec adds some WC capability -
I suspect we can ignore this for now.
>
> All of this is bypassed by virtio by default despite generally being
> platform issues, not particular to a given device.
It's both a device and a platform issue. A PV device is often more like
another CPU than like a PCI device.
--
MST
On Thu, Jun 07, 2018 at 07:28:35PM +0300, Michael S. Tsirkin wrote:
> Let me restate it: DMA API has support for a wide range of hardware, and
> hardware based virtio implementations likely won't benefit from all of
> it.
That is completely wrong. All aspects of the DMA API are about the
system they are used in. The CPU, the PCIe root complex, interconnects.
All the issues I mentioned in my previous mail exist in real life
systems that you can plug virtio PCI or PCIe cards into.
> I'm not really sympathetic to people complaining that they can't even
> set a flag in qemu though. If that's the case the stack in question is
> way too inflexible.
The flag as defined in the spec is the wrong thing to set, because they
are not using an iommu. They probably don't even do any address
translation.
> > Both in the flag naming and the implementation there is an implication
> > of DMA API == IOMMU, which is fundamentally wrong.
>
> Maybe we need to extend the meaning of PLATFORM_IOMMU or rename it.
And the explanation.
>
> It's possible that some setups will benefit from a more
> fine-grained approach where some aspects of the DMA
> API are bypassed, others aren't.
Hell no. DMA API = abstraction for any possible platform wart.
We are not going to make this any more fine-grained. It is bad
enough that virtio already has a mode bypassing any of this,
we are not going to make even more of a mess of it.
> This seems to be what was being asked for in this thread,
> with comments claiming IOMMU flag adds too much overhead.
Right now it means implementing a virtual iommu, which I agree is
way too much overhead.
>
> > The DMA API does a few different things:
> >
> > a) address translation
> >
> > This does include IOMMUs. But it also includes random offsets
> > between PCI bars and system memory that we see on various
> > platforms.
>
> I don't think you mean bars. That's unrelated to DMA.
Of course it matters. If the device always needs an offset in
the DMA addresses it is completely related to DMA.
For some examples take a look at:
arch/x86/pci/sta2x11-fixup.c
arch/mips/include/asm/mach-ath25/dma-coherence.h
or anything setting dma_pfn_offset.
> > Worse so some of these offsets might be based on
> > banks, e.g. on the broadcom bmips platform. It also deals
> > with bitmask in physical addresses related to memory encryption
> > like AMD SEV. I'd be really curious how for example the
> > Intel virtio based NIC is going to work on any of those
> > plaforms.
>
> SEV guys report that they just set the iommu flag and then it all works.
> I guess if there's translation we can think of this as a kind of iommu.
> Maybe we should rename PLATFORM_IOMMU to PLARTFORM_TRANSLATION?
VIRTIO_F_BEHAVES_LIKE_A_REAL_PCI_DEVICE_DONT_TRY_TO_OUTSMART_ME
as said it's not just translations, it is cache coherence as well.
> And apparently some people complain that just setting that flag makes
> qemu check translation on each access with an unacceptable performance
> overhead. Forcing same behaviour for everyone on general principles
> even without the flag is unlikely to make them happy.
That sounds like a qemu implementation bug. If qemu knows that
guest physical address space == guest dma space there is no need to check.
> > b) coherency
> >
> > On many architectures DMA is not cache coherent, and we need
> > to invalidate and/or write back cache lines before doing
> > DMA. Again, I wonder how this is every going to work with
> > hardware based virtio implementations.
>
>
> You mean dma_Xmb and friends?
> There's a new feature VIRTIO_F_IO_BARRIER that's being proposed
> for that.
No. I mean the fact that PCI(e) devices often are not coherent with
the cache. So you need to write back the cpu cache before transferring
data to the device, and invalidate the cpu cache before transferring
data from the device. Plus additional workarounds for speculation.
Look at the implementations and comments around the dma_sync_*
calls.
>
> > Even worse I think this
> > is actually broken at least for VIVT event for virtualized
> > implementations. E.g. a KVM guest is going to access memory
> > using different virtual addresses than qemu, vhost might throw
> > in another different address space.
>
> I don't really know what VIVT is. Could you help me please?
Virtually indexed, virtually tagged. In short, you must do cache
maintenance based on the virtual address used to fill the cache.
>
> > c) bounce buffering
> >
> > Many DMA implementations can not address all physical memory
> > due to addressing limitations. In such cases we copy the
> > DMA memory into a known addressable bounc buffer and DMA
> > from there.
>
> Don't do it then?
Because for example your PCIe root complex only supports 32-bit
addressing, but the memory buffer is outside the addressing range.
> > d) flushing write combining buffers or similar
> >
> > On some hardware platforms we need workarounds to e.g. read
> > from a certain mmio address to make sure DMA can actually
> > see memory written by the host.
>
> I guess it isn't an issue as long as WC isn't actually used.
> It will become an issue when virtio spec adds some WC capability -
> I suspect we can ignore this for now.
This is write combining in the SoC with the root complex. Nothing
you can work around in the device or device driver.
On Thu, Jun 07, 2018 at 07:28:35PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 06, 2018 at 10:23:06PM -0700, Christoph Hellwig wrote:
> > On Thu, May 31, 2018 at 08:43:58PM +0300, Michael S. Tsirkin wrote:
> > > Pls work on a long term solution. Short term needs can be served by
> > > enabling the iommu platform in qemu.
> >
> > So, I spent some time looking at converting virtio to dma ops overrides,
> > and the current virtio spec, and the sad through I have to tell is that
> > both the spec and the Linux implementation are complete and utterly fucked
> > up.
>
> Let me restate it: DMA API has support for a wide range of hardware, and
> hardware based virtio implementations likely won't benefit from all of
> it.
>
> And given virtio right now is optimized for specific workloads, improving
> portability without regressing performance isn't easy.
>
> I think it's unsurprising since it started a strictly a guest/host
> mechanism. People did implement offloads on specific platforms though,
> and they are known to work. To improve portability even further,
> we might need to make spec and code changes.
>
> I'm not really sympathetic to people complaining that they can't even
> set a flag in qemu though. If that's the case the stack in question is
> way too inflexible.
We did consider your suggestion, but can't see how it will work.
Maybe you can guide us here.
In our case qemu has absolutely no idea if the VM will switch itself to
secure mode or not. It's a dynamic decision made entirely by the VM
through direct interaction with the hardware/firmware; no
qemu/hypervisor involved.
If the administrator, who invokes qemu, enables the flag, the DMA ops
associated with the virtio devices will be called, and hence will be
able to do the right things. Yes we might incur a performance hit due to
the IOMMU translations, but let's ignore that for now; the functionality
will work. Good till now.
However if the administrator
ignores/forgets/deliberately-decides/is-constrained to NOT enable the
flag, virtio will not be able to pass control to the DMA ops associated
with the virtio devices. Which means, we have no opportunity to share
the I/O buffers with the hypervisor/qemu.
How do you suggest we handle this case?
>
>
>
> > Both in the flag naming and the implementation there is an implication
> > of DMA API == IOMMU, which is fundamentally wrong.
>
> Maybe we need to extend the meaning of PLATFORM_IOMMU or rename it.
>
> It's possible that some setups will benefit from a more
> fine-grained approach where some aspects of the DMA
> API are bypassed, others aren't.
>
> This seems to be what was being asked for in this thread,
> with comments claiming IOMMU flag adds too much overhead.
>
>
> > The DMA API does a few different things:
> >
> > a) address translation
> >
> > This does include IOMMUs. But it also includes random offsets
> > between PCI bars and system memory that we see on various
> > platforms.
>
> I don't think you mean bars. That's unrelated to DMA.
>
> > Worse so some of these offsets might be based on
> > banks, e.g. on the broadcom bmips platform. It also deals
> > with bitmask in physical addresses related to memory encryption
> > like AMD SEV. I'd be really curious how for example the
> > Intel virtio based NIC is going to work on any of those
> > plaforms.
>
> SEV guys report that they just set the iommu flag and then it all works.
This is one of the fundamental differences between the SEV architecture and
the ultravisor architecture. In SEV, qemu is aware of SEV. In the
ultravisor architecture, only the VM that runs within qemu is aware of the
ultravisor; hypervisor/qemu/administrator are untrusted entities.
I hope we can make the virtio subsystem flexible enough to support various
security paradigms.
Apart from the above reason, Christoph and Ben point to so many other
reasons to make it flexible. So why not make it happen?
> I guess if there's translation we can think of this as a kind of iommu.
> Maybe we should rename PLATFORM_IOMMU to PLARTFORM_TRANSLATION?
>
> And apparently some people complain that just setting that flag makes
> qemu check translation on each access with an unacceptable performance
> overhead. Forcing same behaviour for everyone on general principles
> even without the flag is unlikely to make them happy.
>
> > b) coherency
> >
> > On many architectures DMA is not cache coherent, and we need
> > to invalidate and/or write back cache lines before doing
> > DMA. Again, I wonder how this is every going to work with
> > hardware based virtio implementations.
>
>
> You mean dma_Xmb and friends?
> There's a new feature VIRTIO_F_IO_BARRIER that's being proposed
> for that.
>
>
> > Even worse I think this
> > is actually broken at least for VIVT event for virtualized
> > implementations. E.g. a KVM guest is going to access memory
> > using different virtual addresses than qemu, vhost might throw
> > in another different address space.
>
> I don't really know what VIVT is. Could you help me please?
>
> > c) bounce buffering
> >
> > Many DMA implementations can not address all physical memory
> > due to addressing limitations. In such cases we copy the
> > DMA memory into a known addressable bounc buffer and DMA
> > from there.
>
> Don't do it then?
>
>
> > d) flushing write combining buffers or similar
> >
> > On some hardware platforms we need workarounds to e.g. read
> > from a certain mmio address to make sure DMA can actually
> > see memory written by the host.
>
> I guess it isn't an issue as long as WC isn't actually used.
> It will become an issue when virtio spec adds some WC capability -
> I suspect we can ignore this for now.
>
> >
> > All of this is bypassed by virtio by default despite generally being
> > platform issues, not particular to a given device.
>
> It's both a device and a platform issue. A PV device is often more like
> another CPU than like a PCI device.
>
>
>
> --
> MST
--
Ram Pai
On Sun, Jun 10, 2018 at 07:39:09PM -0700, Ram Pai wrote:
> On Thu, Jun 07, 2018 at 07:28:35PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 06, 2018 at 10:23:06PM -0700, Christoph Hellwig wrote:
> > > On Thu, May 31, 2018 at 08:43:58PM +0300, Michael S. Tsirkin wrote:
> > > > Pls work on a long term solution. Short term needs can be served by
> > > > enabling the iommu platform in qemu.
> > >
> > > So, I spent some time looking at converting virtio to dma ops overrides,
> > > and the current virtio spec, and the sad through I have to tell is that
> > > both the spec and the Linux implementation are complete and utterly fucked
> > > up.
> >
> > Let me restate it: DMA API has support for a wide range of hardware, and
> > hardware based virtio implementations likely won't benefit from all of
> > it.
> >
> > And given virtio right now is optimized for specific workloads, improving
> > portability without regressing performance isn't easy.
> >
> > I think it's unsurprising since it started a strictly a guest/host
> > mechanism. People did implement offloads on specific platforms though,
> > and they are known to work. To improve portability even further,
> > we might need to make spec and code changes.
> >
> > I'm not really sympathetic to people complaining that they can't even
> > set a flag in qemu though. If that's the case the stack in question is
> > way too inflexible.
>
> We did consider your suggestion. But can't see how it will work.
> Maybe you can guide us here.
>
> In our case qemu has absolutely no idea if the VM will switch itself to
> secure mode or not. Its a dynamic decision made entirely by the VM
> through direct interaction with the hardware/firmware; no
> qemu/hypervisor involved.
>
> If the administrator, who invokes qemu, enables the flag, the DMA ops
> associated with the virito devices will be called, and hence will be
> able to do the right things. Yes we might incur performance hit due to
> the IOMMU translations, but lets ignore that for now; the functionality
> will work. Good till now.
>
> However if the administrator
> ignores/forgets/deliberatey-decides/is-constrained to NOT enable the
> flag, virtio will not be able to pass control to the DMA ops associated
> with the virtio devices. Which means, we have no opportunity to share
> the I/O buffers with the hypervisor/qemu.
>
> How do you suggest, we handle this case?
As step 1, ignore it as a user error.
Further, you can for example add per-device quirks in virtio so it can be
switched to the dma api. Make extra decisions in platform code then.
> >
> >
> >
> > > Both in the flag naming and the implementation there is an implication
> > > of DMA API == IOMMU, which is fundamentally wrong.
> >
> > Maybe we need to extend the meaning of PLATFORM_IOMMU or rename it.
> >
> > It's possible that some setups will benefit from a more
> > fine-grained approach where some aspects of the DMA
> > API are bypassed, others aren't.
> >
> > This seems to be what was being asked for in this thread,
> > with comments claiming IOMMU flag adds too much overhead.
> >
> >
> > > The DMA API does a few different things:
> > >
> > > a) address translation
> > >
> > > This does include IOMMUs. But it also includes random offsets
> > > between PCI bars and system memory that we see on various
> > > platforms.
> >
> > I don't think you mean bars. That's unrelated to DMA.
> >
> > > Worse so some of these offsets might be based on
> > > banks, e.g. on the broadcom bmips platform. It also deals
> > > with bitmask in physical addresses related to memory encryption
> > > like AMD SEV. I'd be really curious how for example the
> > > Intel virtio based NIC is going to work on any of those
> > > plaforms.
> >
> > SEV guys report that they just set the iommu flag and then it all works.
>
> This is one of the fundamental difference between SEV architecture and
> the ultravisor architecture. In SEV, qemu is aware of SEV. In
> ultravisor architecture, only the VM that runs within qemu is aware of
> ultravisor; hypervisor/qemu/administrator are untrusted entities.
So one option is to teach qemu that it's on a platform with an
ultravisor; this might have more advantages.
> I hope, we can make virtio subsystem flexibe enough to support various
> security paradigms.
So if you are worried about qemu attacking guests, I see
more problems than just passing an incorrect iommu
flag.
> Apart from the above reason, Christoph and Ben point to so many other
> reasons to make it flexibe. So why not, make it happen?
>
I don't see a flexibility argument. I just don't think new platforms
should use workarounds that we put in place for old ones.
> > I guess if there's translation we can think of this as a kind of iommu.
> > Maybe we should rename PLATFORM_IOMMU to PLARTFORM_TRANSLATION?
> >
> > And apparently some people complain that just setting that flag makes
> > qemu check translation on each access with an unacceptable performance
> > overhead. Forcing same behaviour for everyone on general principles
> > even without the flag is unlikely to make them happy.
> >
> > > b) coherency
> > >
> > > On many architectures DMA is not cache coherent, and we need
> > > to invalidate and/or write back cache lines before doing
> > > DMA. Again, I wonder how this is every going to work with
> > > hardware based virtio implementations.
> >
> >
> > You mean dma_Xmb and friends?
> > There's a new feature VIRTIO_F_IO_BARRIER that's being proposed
> > for that.
> >
> >
> > > Even worse I think this
> > > is actually broken at least for VIVT event for virtualized
> > > implementations. E.g. a KVM guest is going to access memory
> > > using different virtual addresses than qemu, vhost might throw
> > > in another different address space.
> >
> > I don't really know what VIVT is. Could you help me please?
> >
> > > c) bounce buffering
> > >
> > > Many DMA implementations can not address all physical memory
> > > due to addressing limitations. In such cases we copy the
> > > DMA memory into a known addressable bounc buffer and DMA
> > > from there.
> >
> > Don't do it then?
> >
> >
> > > d) flushing write combining buffers or similar
> > >
> > > On some hardware platforms we need workarounds to e.g. read
> > > from a certain mmio address to make sure DMA can actually
> > > see memory written by the host.
> >
> > I guess it isn't an issue as long as WC isn't actually used.
> > It will become an issue when virtio spec adds some WC capability -
> > I suspect we can ignore this for now.
> >
> > >
> > > All of this is bypassed by virtio by default despite generally being
> > > platform issues, not particular to a given device.
> >
> > It's both a device and a platform issue. A PV device is often more like
> > another CPU than like a PCI device.
> >
> >
> >
> > --
> > MST
>
> --
> Ram Pai
On Sun, 2018-06-10 at 19:39 -0700, Ram Pai wrote:
>
> However if the administrator
> ignores/forgets/deliberatey-decides/is-constrained to NOT enable the
> flag, virtio will not be able to pass control to the DMA ops associated
> with the virtio devices. Which means, we have no opportunity to share
> the I/O buffers with the hypervisor/qemu.
>
> How do you suggest, we handle this case?
At the risk of repeating myself, let's just do the first pass which is
to switch virtio over to always using the DMA API in the actual data
flow code, with a hook at initialization time that replaces the DMA ops
with some home cooked "direct" ops in the case where the IOMMU flag
isn't set.
This will be equivalent to what we have today but avoids having 2
separate code paths all over the driver.
Then a second stage, I think, is to replace this "hook" so that the
architecture gets a say in the matter.
Basically, something like:
arch_virtio_update_dma_ops(pci_dev, qemu_direct_mode).
IE, virtio would tell the arch whether the "other side" is in fact QEMU
in a mode that bypasses the IOMMU and is cache coherent with the guest.
This is our legacy "qemu special" mode. If the performance is
sufficient we may want to deprecate it over time and have qemu enable
the iommu by default but we still need it.
A weak implementation of the above will be provided that just puts in
the direct ops when qemu_direct_mode is set, and thus provides today's
behaviour on any arch that doesn't override it. If the flag is not set,
the ops are left to whatever the arch PCI layer already set.
This will provide the opportunity for architectures that want to do
something special, such as in our case, when we want to force even the
"qemu_direct_mode" to go via bounce buffers, to put our own ops in
there, while retaining the legacy behaviour otherwise.
It also means that the "gunk" is entirely localized in that one
function, the rest of virtio just uses the DMA API normally.
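A minimal sketch of that weak default, with the same caveats as before
(arch_virtio_update_dma_ops() is the name proposed above;
virtio_legacy_dma_ops and the set_dma_ops() usage are assumptions, not
merged code):

#include <linux/compiler.h>
#include <linux/dma-mapping.h>
#include <linux/pci.h>

/* hypothetical virtio-local direct ops reproducing today's behaviour */
extern const struct dma_map_ops virtio_legacy_dma_ops;

void __weak arch_virtio_update_dma_ops(struct pci_dev *pci_dev,
				       bool qemu_direct_mode)
{
	/*
	 * Default: in "qemu special" mode install the direct ops,
	 * otherwise leave whatever the arch PCI layer already set.
	 * A secure guest arch would override this and install
	 * bounce-buffering ops even when qemu_direct_mode is set.
	 */
	if (qemu_direct_mode)
		set_dma_ops(&pci_dev->dev, &virtio_legacy_dma_ops);
}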
Christoph, are you actually hacking "stage 1" above already or should
we produce patches ?
Cheers,
Ben.
On Mon, 2018-06-11 at 06:28 +0300, Michael S. Tsirkin wrote:
>
> > However if the administrator
> > ignores/forgets/deliberatey-decides/is-constrained to NOT enable the
> > flag, virtio will not be able to pass control to the DMA ops associated
> > with the virtio devices. Which means, we have no opportunity to share
> > the I/O buffers with the hypervisor/qemu.
> >
> > How do you suggest, we handle this case?
>
> As step 1, ignore it as a user error.
Ugh ... not again. Ram, don't bring that subject back, we ALREADY
addressed it, and requiring the *user* to do special things is just
utterly and completely wrong.
The *user* has no bloody idea what that stuff is, will never know to
set whatever magic qemu flag etc... The user will just start a VM
normally and expect things to work. Requiring the *user* to know things
like that iommu virtio flag is complete nonsense.
If by "user" you mean libvirt, then you are now requesting about 4 or 5
different projects to be patched to add special cases for something
they know nothing about and is completely irrelevant, while it can be
entirely addressed with a 1-liner on the virtio kernel side to allow the
arch to plumb alternate DMA ops.
So for some reason you seem to be dead set on a path that leads to
a mountain of user pain, changes to many different projects and overall
havoc, while there is a much much simpler and elegant solution at hand
which I described (again) in the response to Ram I sent about 5mn ago.
> Further you can for example add per-device quirks in virtio so it can be
> switched to dma api. make extra decisions in platform code then.
>
> > >
> > >
> > >
> > > > Both in the flag naming and the implementation there is an implication
> > > > of DMA API == IOMMU, which is fundamentally wrong.
> > >
> > > Maybe we need to extend the meaning of PLATFORM_IOMMU or rename it.
> > >
> > > It's possible that some setups will benefit from a more
> > > fine-grained approach where some aspects of the DMA
> > > API are bypassed, others aren't.
> > >
> > > This seems to be what was being asked for in this thread,
> > > with comments claiming IOMMU flag adds too much overhead.
> > >
> > >
> > > > The DMA API does a few different things:
> > > >
> > > > a) address translation
> > > >
> > > > This does include IOMMUs. But it also includes random offsets
> > > > between PCI bars and system memory that we see on various
> > > > platforms.
> > >
> > > I don't think you mean bars. That's unrelated to DMA.
> > >
> > > > Worse so some of these offsets might be based on
> > > > banks, e.g. on the broadcom bmips platform. It also deals
> > > > with bitmask in physical addresses related to memory encryption
> > > > like AMD SEV. I'd be really curious how for example the
> > > > Intel virtio based NIC is going to work on any of those
> > > > plaforms.
> > >
> > > SEV guys report that they just set the iommu flag and then it all works.
> >
> > This is one of the fundamental difference between SEV architecture and
> > the ultravisor architecture. In SEV, qemu is aware of SEV. In
> > ultravisor architecture, only the VM that runs within qemu is aware of
> > ultravisor; hypervisor/qemu/administrator are untrusted entities.
>
> Spo one option is to teach qemu that it's on a platform with an
> ultravisor, this might have more advantages.
>
> > I hope, we can make virtio subsystem flexibe enough to support various
> > security paradigms.
>
> So if you are worried about qemu attacking guests, I see
> more problems than just passing an incorrect iommu
> flag.
>
>
> > Apart from the above reason, Christoph and Ben point to so many other
> > reasons to make it flexibe. So why not, make it happen?
> >
>
> I don't see a flexibility argument. I just don't think new platforms
> should use workarounds that we put in place for old ones.
>
>
> > > I guess if there's translation we can think of this as a kind of iommu.
> > > Maybe we should rename PLATFORM_IOMMU to PLARTFORM_TRANSLATION?
> > >
> > > And apparently some people complain that just setting that flag makes
> > > qemu check translation on each access with an unacceptable performance
> > > overhead. Forcing same behaviour for everyone on general principles
> > > even without the flag is unlikely to make them happy.
> > >
> > > > b) coherency
> > > >
> > > > On many architectures DMA is not cache coherent, and we need
> > > > to invalidate and/or write back cache lines before doing
> > > > DMA. Again, I wonder how this is every going to work with
> > > > hardware based virtio implementations.
> > >
> > >
> > > You mean dma_Xmb and friends?
> > > There's a new feature VIRTIO_F_IO_BARRIER that's being proposed
> > > for that.
> > >
> > >
> > > > Even worse I think this
> > > > is actually broken at least for VIVT event for virtualized
> > > > implementations. E.g. a KVM guest is going to access memory
> > > > using different virtual addresses than qemu, vhost might throw
> > > > in another different address space.
> > >
> > > I don't really know what VIVT is. Could you help me please?
> > >
> > > > c) bounce buffering
> > > >
> > > > Many DMA implementations can not address all physical memory
> > > > due to addressing limitations. In such cases we copy the
> > > > DMA memory into a known addressable bounc buffer and DMA
> > > > from there.
> > >
> > > Don't do it then?
> > >
> > >
> > > > d) flushing write combining buffers or similar
> > > >
> > > > On some hardware platforms we need workarounds to e.g. read
> > > > from a certain mmio address to make sure DMA can actually
> > > > see memory written by the host.
> > >
> > > I guess it isn't an issue as long as WC isn't actually used.
> > > It will become an issue when virtio spec adds some WC capability -
> > > I suspect we can ignore this for now.
> > >
> > > >
> > > > All of this is bypassed by virtio by default despite generally being
> > > > platform issues, not particular to a given device.
> > >
> > > It's both a device and a platform issue. A PV device is often more like
> > > another CPU than like a PCI device.
> > >
> > >
> > >
> > > --
> > > MST
> >
> > --
> > Ram Pai
On Mon, Jun 11, 2018 at 01:29:18PM +1000, Benjamin Herrenschmidt wrote:
> At the risk of repeating myself, let's just do the first pass which is
> to switch virtio over to always using the DMA API in the actual data
> flow code, with a hook at initialization time that replaces the DMA ops
> with some home cooked "direct" ops in the case where the IOMMU flag
> isn't set.
>
> This will be equivalent to what we have today but avoids having 2
> separate code path all over the driver.
>
> Then a second stage, I think, is to replace this "hook" so that the
> architecture gets a say in the matter.
I don't think we can actually use dma_direct_ops. It still allows
architectures to override parts of the dma setup, whereas virtio seems
to blindly assume phys == dma and no cache flushing.
I think the right way forward is to either add a new
VIRTIO_F_IS_PCI_DEVICE (or redefine the existing iommu flag if deemed
possible). And then make sure recent qemu always sets it.
On Wed, 2018-06-13 at 00:41 -0700, Christoph Hellwig wrote:
> On Mon, Jun 11, 2018 at 01:29:18PM +1000, Benjamin Herrenschmidt wrote:
> > At the risk of repeating myself, let's just do the first pass which is
> > to switch virtio over to always using the DMA API in the actual data
> > flow code, with a hook at initialization time that replaces the DMA ops
> > with some home cooked "direct" ops in the case where the IOMMU flag
> > isn't set.
> >
> > This will be equivalent to what we have today but avoids having 2
> > separate code path all over the driver.
> >
> > Then a second stage, I think, is to replace this "hook" so that the
> > architecture gets a say in the matter.
>
> I don't think we can actually use dma_direct_ops. It still allows
> architectures to override parts of the dma setup, which virtio seems
> to blindly assume phys == dma and not cache flushing.
By direct ops I didn't mean *the* dma_direct_ops but a virtio-local
variant that effectively reproduces the existing expectations (ie,
virtio-dma-legacy-ops or something).
> I think the right way forward is to either add a new
> VIRTIO_F_IS_PCI_DEVICE (or redefine the existing iommu flag if deemed
> possible). And then make sure recent qemu always sets it.
Ben.
On Wed, 2018-06-13 at 22:25 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2018-06-13 at 00:41 -0700, Christoph Hellwig wrote:
> > On Mon, Jun 11, 2018 at 01:29:18PM +1000, Benjamin Herrenschmidt wrote:
> > > At the risk of repeating myself, let's just do the first pass which is
> > > to switch virtio over to always using the DMA API in the actual data
> > > flow code, with a hook at initialization time that replaces the DMA ops
> > > with some home cooked "direct" ops in the case where the IOMMU flag
> > > isn't set.
> > >
> > > This will be equivalent to what we have today but avoids having 2
> > > separate code path all over the driver.
> > >
> > > Then a second stage, I think, is to replace this "hook" so that the
> > > architecture gets a say in the matter.
> >
> > I don't think we can actually use dma_direct_ops. It still allows
> > architectures to override parts of the dma setup, which virtio seems
> > to blindly assume phys == dma and not cache flushing.
>
> By direct ops I didn't mean *the* dma_direct_ops but a virtio-local
> variants that effectively reproduces the existing expectations (ie,
> virtio-dma-legacy-ops or something).
Actually ... the stuff in lib/dma-direct.c seems to be just it, no ?
There's no cache flushing and there's no architecture hooks that I can
see other than the AMD security stuff which is probably fine.
Or am I missing something ?
Cheers,
Ben.
>
> > I think the right way forward is to either add a new
> > VIRTIO_F_IS_PCI_DEVICE (or redefine the existing iommu flag if deemed
> > possible). And then make sure recent qemu always sets it.
>
> Ben.
On Thu, Jun 07, 2018 at 11:36:55PM -0700, Christoph Hellwig wrote:
> > This seems to be what was being asked for in this thread,
> > with comments claiming IOMMU flag adds too much overhead.
>
> Right now it means implementing a virtual iommu, which I agree is
> way too much overhead.
Well not really. The flag in question will have the desired effect without
a virtual iommu.
> > SEV guys report that they just set the iommu flag and then it all works.
> > I guess if there's translation we can think of this as a kind of iommu.
> > Maybe we should rename PLATFORM_IOMMU to PLARTFORM_TRANSLATION?
>
> VIRTIO_F_BEHAVES_LIKE_A_REAL_PCI_DEVICE_DONT_TRY_TO_OUTSMART_ME
>
> as said it's not just translations, it is cache coherence as well.
Well it's only for DMA. So maybe PLATFORM_DMA.
I suspect people will then come and complain that they
do *not* want cache coherence tricks because virtio is
running on a CPU, but we'll see.
> > And apparently some people complain that just setting that flag makes
> > qemu check translation on each access with an unacceptable performance
> > overhead. Forcing same behaviour for everyone on general principles
> > even without the flag is unlikely to make them happy.
>
> That sounds like a qemu implementation bug. If qemu knowns that
> guest physiscall == guest dma space there is no need to check.
Possibly. Or it's possible it's all just theoretical, no one
posted any numbers.
--
MST
On Wed, Jun 13, 2018 at 12:41:41AM -0700, Christoph Hellwig wrote:
> On Mon, Jun 11, 2018 at 01:29:18PM +1000, Benjamin Herrenschmidt wrote:
> > At the risk of repeating myself, let's just do the first pass which is
> > to switch virtio over to always using the DMA API in the actual data
> > flow code, with a hook at initialization time that replaces the DMA ops
> > with some home cooked "direct" ops in the case where the IOMMU flag
> > isn't set.
> >
> > This will be equivalent to what we have today but avoids having 2
> > separate code path all over the driver.
> >
> > Then a second stage, I think, is to replace this "hook" so that the
> > architecture gets a say in the matter.
>
> I don't think we can actually use dma_direct_ops. It still allows
> architectures to override parts of the dma setup, which virtio seems
> to blindly assume phys == dma and not cache flushing.
>
> I think the right way forward is to either add a new
> VIRTIO_F_IS_PCI_DEVICE (or redefine the existing iommu flag if deemed
> possible).
Given this is exactly what happens now, this seems possible, but maybe
we want a non-PCI specific name.
> And then make sure recent qemu always sets it.
I don't think that part is going to happen, sorry.
Hypervisors can set it when they *actually have* a real PCI device.
People emulate systems which have a bunch of overhead in the DMA API
which is required for real DMA. Your proposal would double that overhead
by first doing it in guest then re-doing it in host.
I don't think it's justified when 99% of the world doesn't need it.
--
MST
On Mon, Jun 11, 2018 at 01:29:18PM +1000, Benjamin Herrenschmidt wrote:
> On Sun, 2018-06-10 at 19:39 -0700, Ram Pai wrote:
> >
> > However if the administrator
> > ignores/forgets/deliberatey-decides/is-constrained to NOT enable the
> > flag, virtio will not be able to pass control to the DMA ops associated
> > with the virtio devices. Which means, we have no opportunity to share
> > the I/O buffers with the hypervisor/qemu.
> >
> > How do you suggest, we handle this case?
>
> At the risk of repeating myself, let's just do the first pass which is
> to switch virtio over to always using the DMA API in the actual data
> flow code, with a hook at initialization time that replaces the DMA ops
> with some home cooked "direct" ops in the case where the IOMMU flag
> isn't set.
I'm not sure I understand all of the details, will have to
see the patch, but superficially it sounds good to me.
> This will be equivalent to what we have today but avoids having 2
> separate code path all over the driver.
>
> Then a second stage, I think, is to replace this "hook" so that the
> architecture gets a say in the matter.
>
> Basically, something like:
>
> arch_virtio_update_dma_ops(pci_dev, qemu_direct_mode).
>
> IE, virtio would tell the arch whether the "other side" is in fact QEMU
> in a mode that bypasses the IOMMU and is cache coherent with the guest.
> This is our legacy "qemu special" mode. If the performance is
> sufficient we may want to deprecate it over time and have qemu enable
> the iommu by default but we still need it.
>
> A weak implementation of the above will be provied that just puts in
> the direct ops when qemu_direct_mode is set, and thus provides today's
> behaviour on any arch that doesn't override it. If the flag is not set,
> the ops are left to whatever the arch PCI layer already set.
>
> This will provide the opportunity for architectures that want to do
> something special, such as in our case, when we want to force even the
> "qemu_direct_mode" to go via bounce buffers, to put our own ops in
> there, while retaining the legacy behaviour otherwise.
>
> It also means that the "gunk" is entirely localized in that one
> function, the rest of virtio just uses the DMA API normally.
>
> Christoph, are you actually hacking "stage 1" above already or should
> we produce patches ?
>
> Cheers,
> Ben.
On Mon, Jun 11, 2018 at 01:34:50PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-06-11 at 06:28 +0300, Michael S. Tsirkin wrote:
> >
> > > However if the administrator
> > > ignores/forgets/deliberatey-decides/is-constrained to NOT enable the
> > > flag, virtio will not be able to pass control to the DMA ops associated
> > > with the virtio devices. Which means, we have no opportunity to share
> > > the I/O buffers with the hypervisor/qemu.
> > >
> > > How do you suggest, we handle this case?
> >
> > As step 1, ignore it as a user error.
>
> Ugh ... not again. Ram, don't bring that subject back we ALREADY
> addressed it, and requiring the *user* to do special things is just
> utterly and completely wrong.
>
> The *user* has no bloody idea what that stuff is, will never know to
> set whatver magic qemu flag etc... The user will just start a a VM
> normally and expect things to work. Requiring the *user* to know things
> like that iommu virtio flag is complete nonsense.
You should consider teaching QEMU that the platform has support for the
ultravisor thing so it can set flags for you.
That's true even if something else is done for virtio - if you don't
have the capability right now you will come to regret it. Host userspace
is just fundamentally in charge of control-path enumeration in the KVM
stack; I understand you want to take some of it away for security, but
trying to hide things from QEMU completely is a mistake IMHO.
Just my $.02.
> If by "user" you mean libvirt, then you are now requesting about 4 or 5
> different projects to be patched to add speical cases for something
> they know nothing about and is completely irrelevant, while it can be
> entirely addressed with a 1-liner in virtio kernel side to allow the
> arch to plumb alternate DMA ops.
>
> So for some reason you seem to be dead set on a path that leads to
> mountain of user pain, changes to many different projects and overall
> havok while there is a much much simpler and elegant solution at hand
> which I described (again) in the response to Ram I sent about 5mn ago.
What you sent to Ram sounds OK to me - I only hope we do all mean
the same thing because the 3 paragraphs above are all
about the libvirt/QEMU split mess which linux or virtio
should not really care about.
And I'd like to stress that the direct path isn't merely legacy.
It's a common sense approach that when you have a hypervisor doing
things like taking care of cache coherency, you do not want to do them
in the guest as well. And the same guest can use a pass-through device
where it does need to do all the coherency mess, so while it's quite
possible to build a platform-specific way to tell guests which devices
need which kind of treatment, people understandably chose to do it in a
portable way through the virtio device.
I understand you guys are working on a new project that is going to use
bounce buffers anyway, so what do you care about optimizations, but we
can't just slow everyone else down for your benefit.
> > Further you can for example add per-device quirks in virtio so it can be
> > switched to dma api. make extra decisions in platform code then.
Isn't the above exactly what you sent to Ram?
--
MST
On Wed, Jun 13, 2018 at 11:11:01PM +1000, Benjamin Herrenschmidt wrote:
> Actually ... the stuff in lib/dma-direct.c seems to be just it, no ?
>
> There's no cache flushing and there's no architecture hooks that I can
> see other than the AMD security stuff which is probably fine.
>
> Or am I missing something ?
You are missing the __phys_to_dma arch hook that allows architectures
to adjust the dma address. Various systems have offsets, or even
multiple banks with different offsets there. Most of them don't
use the dma-direct code yet (working on it), but there are a few
examples in the tree already.
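For reference, the sort of arch override in question looks roughly like
this (the offset and the snippet are invented for illustration; real
examples are the sta2x11 and ath25 code mentioned earlier in the thread):

#include <linux/types.h>

struct device;

/* example only: the bus sees RAM at a fixed offset from CPU physical */
#define EXAMPLE_BUS_DMA_OFFSET	0x20000000ULL

static inline dma_addr_t __phys_to_dma(struct device *dev, phys_addr_t paddr)
{
	return paddr + EXAMPLE_BUS_DMA_OFFSET;
}

static inline phys_addr_t __dma_to_phys(struct device *dev, dma_addr_t daddr)
{
	return daddr - EXAMPLE_BUS_DMA_OFFSET;
}

dma-direct honours exactly this kind of hook, which is why phys == dma
cannot be assumed on every platform.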
On Fri, 2018-06-15 at 02:16 -0700, Christoph Hellwig wrote:
> On Wed, Jun 13, 2018 at 11:11:01PM +1000, Benjamin Herrenschmidt wrote:
> > Actually ... the stuff in lib/dma-direct.c seems to be just it, no ?
> >
> > There's no cache flushing and there's no architecture hooks that I can
> > see other than the AMD security stuff which is probably fine.
> >
> > Or am I missing something ?
>
> You are missing the __phys_to_dma arch hook that allows architectures
> to adjust the dma address. Various systems have offsets, or even
> multiple banks with different offsets there. Most of them don't
> use the dma-direct code yet (working on it), but there are a few
> examples in the tree already.
Ok and on those systems, qemu will bypass said offset ? Maybe we could
just create a device DMA "flag" (or use the attributes) to instruct
dma-direct to not use that for legacy virtio ?
Cheers,
Ben.