2008-10-01 07:20:30

by Joerg Roedel

Subject: Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops

On Tue, Sep 30, 2008 at 10:44:01PM +0300, Muli Ben-Yehuda wrote:
> On Mon, Sep 29, 2008 at 03:33:11PM +0200, Joerg Roedel wrote:
>
> > > Nobody cares about the performance of dma_alloc_coherent. Only the
> > > performance of map_single/map_sg matters.
> > >
> > > I'm not sure how expensive the hypercalls are, but are they more
> > > expensive than bounce buffering, which copies lots of data for
> > > every I/O?
> >
> > I don't think that we can avoid bounce buffering into the guests at
> > all (with or without my idea of a paravirtualized IOMMU) when we
> > want to handle dma_masks and requests that cross guest physical
> > pages properly.
>
> It might be possible to have a per-device slow or fast path, where the
> fast path is for devices which have no DMA limitations (high-end
> devices generally don't) and the slow path is for devices which do.

This solves the problem with the DMA masks; a rough sketch of such a
per-device dispatch follows below. But what happens to requests that
cross guest page boundaries?
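
Just to illustrate what I understand you to mean, the dispatch could
look roughly like this. This is a sketch only; the ops structures and
the mask check are assumptions on my part, not code from this patch
set:

	#include <linux/dma-mapping.h>

	/*
	 * Hypothetical per-device fast/slow dispatch. Devices that can
	 * address all of memory take the direct fast path, devices with
	 * DMA limitations go through the hypercall-backed slow path.
	 */
	static const struct dma_mapping_ops *fast_ops;	/* direct, no hcalls */
	static const struct dma_mapping_ops *slow_ops;	/* hypercall-backed  */

	static const struct dma_mapping_ops *select_dma_ops(struct device *dev)
	{
		/* No addressing limitation: take the fast path. */
		if (dev->dma_mask && *dev->dma_mask == DMA_64BIT_MASK)
			return fast_ops;
		return slow_ops;
	}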

> > With mapping/unmapping through hypercalls we add the world-switch
> > overhead to the copy-overhead. We can't avoid this when we have no
> > hardware support at all. But already with older IOMMUs like Calgary
> > and GART we can at least avoid the world-switch. And since, for
> > example, every 64 bit capable AMD processor has a GART we can make
> > use of it.
>
> It should be possible to reduce the number and overhead of hypercalls
> to the point where their cost is immaterial. I think that's
> fundamentally a better approach.

Ok, we can batch map_sg mappings together and submit them in a single
hypercall, as sketched below. But I remember a paper of yours which
showed that most mappings cover only a single area. Are there other
ways to optimize this? I must say that reducing the number of
hypercalls was a central concern when I thought about my idea. If
there are better ways, I am all ears.
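
For the batching, something along these lines is what I have in mind.
All names here (pv_iommu_map_req, hypercall_iommu_map_batch,
PV_IOMMU_MAX_BATCH) are invented for illustration:

	#include <linux/scatterlist.h>

	/*
	 * Hypothetical batching of a whole sg list into one hypercall:
	 * one world switch per list instead of one per element.
	 */
	#define PV_IOMMU_MAX_BATCH	64

	struct pv_iommu_map_req {
		u64 gphys;	/* guest-physical address of the element */
		u64 len;	/* length in bytes */
	};

	static int pv_map_sg(struct device *dev, struct scatterlist *sglist,
			     int nelems, int direction)
	{
		struct pv_iommu_map_req reqs[PV_IOMMU_MAX_BATCH];
		struct scatterlist *sg;
		int i;

		if (nelems > PV_IOMMU_MAX_BATCH)
			return 0;	/* caller falls back to per-element mapping */

		for_each_sg(sglist, sg, nelems, i) {
			reqs[i].gphys = sg_phys(sg);
			reqs[i].len   = sg->length;
		}

		/* One world switch for the whole scatterlist. */
		return hypercall_iommu_map_batch(reqs, nelems);
	}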

Joerg

--
| AMD Saxony Limited Liability Company & Co. KG
Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
System | Register Court Dresden: HRA 4896
Research | General Partner authorized to represent:
Center | AMD Saxony LLC (Wilmington, Delaware, US)
| General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy


2008-10-03 08:40:14

by Muli Ben-Yehuda

Subject: Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops

On Wed, Oct 01, 2008 at 09:19:56AM +0200, Joerg Roedel wrote:

> > It might be possible to have a per-device slow or fast path, where
> > the fast path is for devices which have no DMA limitations
> > (high-end devices generally don't) and the slow path is for
> > devices which do.
>
> This solves the problem with the DMA masks. But what happens to
> requests that cross guest page boundaries?

I'm not sure I follow. If a buffer is contiguous in the guest space,
it will remain contiguous (i.e., be mapped contiguously) in the IOMMU
I/O address space, even if each I/O PTE ends up mapping a different
physical frame.
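
In sketch form (all function names invented, buffer assumed page
aligned):

	/*
	 * Why a guest-contiguous buffer stays contiguous for the device:
	 * one contiguous IOVA range is allocated, and each I/O PTE simply
	 * points at whatever machine frame backs that guest frame.
	 */
	static dma_addr_t map_guest_buffer(u64 gphys, size_t size)
	{
		dma_addr_t iova = alloc_iova_range(size);	/* contiguous I/O range */
		size_t off;

		for (off = 0; off < size; off += PAGE_SIZE) {
			/* Neighbouring guest frames may live in arbitrary machine frames. */
			u64 mfn = gfn_to_mfn((gphys + off) >> PAGE_SHIFT);
			set_io_pte(iova + off, mfn << PAGE_SHIFT);
		}

		return iova;	/* what the device uses; contiguous by construction */
	}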

> > > With mapping/unmapping through hypercalls we add the
> > > world-switch overhead to the copy-overhead. We can't avoid this
> > > when we have no hardware support at all. But already with older
> > > IOMMUs like Calgary and GART we can at least avoid the
> > > world-switch. And since, for example, every 64 bit capable AMD
> > > processor has a GART we can make use of it.
> >
> > It should be possible to reduce the number and overhead of
> > hypercalls to the point where their cost is immaterial. I think
> > that's fundamentally a better approach.
>
> Ok, we can batch map_sg mappings together and submit them in a
> single hypercall. But I remember a paper of yours which showed that
> most mappings cover only a single area.

I'm afraid that bit of the paper was poorly done (mea culpa). As far
as I can recall, the majority of dma_alloc_coherent and scatter-gather
list *element* mappings only map a single frame, but at the time we
didn't look at the average length of a scatter-gather list or at the
frequency of sg list mappings vs. single-page mappings. If the length
and frequency are high enough, and you map entire sg lists in a single
hcall or a single batch of hcalls, you might see a nice boost.

> Are there other ways to optimize this? I must say that reducing the
> number of hypercalls was a central concern when I thought about my
> idea. If there are better ways, I am all ears.

There were a number of ideas mentioned in our paper (for example,
switching drivers from the streaming DMA API to the persistent DMA
API, which would be a big help to the scheme you propose), and
Willmann, Rixner and Cox also addressed this problem[1]. Unfortunately
no implementations exist yet AFAIK.

[1] "Protection Strategies for Direct Access to Virtualized I/O
Devices", by Paul Willmann, Scott Rixner and Alan L. Cox, USENIX '08.

Cheers,
Muli
--
The First Workshop on I/O Virtualization (WIOV '08)
Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/
<->
SYSTOR 2009---The Israeli Experimental Systems Conference
http://www.haifa.il.ibm.com/conferences/systor2009/