Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances
From: James Bottomley
To: Grant Grundler
Cc: FUJITA Tomonori, linux-kernel@vger.kernel.org, mgross@linux.intel.com,
    linux-scsi@vger.kernel.org
Date: Thu, 05 Jun 2008 14:01:28 -0500
Message-Id: <1212692488.4241.8.camel@localhost.localdomain>
References: <20080604235053K.fujita.tomonori@lab.ntt.co.jp>
    <20080605235322L.fujita.tomonori@lab.ntt.co.jp>

On Thu, 2008-06-05 at 11:34 -0700, Grant Grundler wrote:
> On Thu, Jun 5, 2008 at 7:49 AM, FUJITA Tomonori wrote:
> ...
> >> You can easily emulate SSD drives by doing sequential 4K reads
> >> from a normal SATA HD. That should result in ~7-8K IOPS since the disk
> >> will recognize the sequential stream and read ahead. SAS/SCSI/FC will
> >> probably work the same way with different IOP rates.
> >
> > Yeah, probably right. I thought that 10GbE gives the IOMMU a heavier
> > workload than an SSD does, so I tried to emulate something like that.
>
> 10GbE might exercise a different code path. NICs typically use map_single

map_page, actually, but effectively the same thing. However, all they're
really doing is their own implementation of sg list mapping.

> and storage devices typically use map_sg. But they both exercise the same
> underlying resource management code since it's the same IOMMU they poke at.
>
> ...
> >> Sorry, I didn't see a replacement for the deferred_flush_tables.
> >> Mark Gross and I agree this substantially helps with unmap performance.
> >> See http://lkml.org/lkml/2008/3/3/373
> >
> > Yeah, I can add a nice trick that parisc's sba_iommu uses. I'll try it
> > next time.
> >
> > But it probably gives the bitmap method less gain than the RB tree,
> > since clearing the bitmap takes less time than updating the tree.
> >
> > The deferred_flush_tables mechanism also batches the TLB flushes. The
> > patch flushes the TLB only when it reaches the end of the bitmap (a
> > trick that some IOMMUs, like SPARC's, use).
>
> The batching of the TLB flushes is the key thing. I was being paranoid
> by not marking the resource free until after the TLB was flushed. If we
> know the allocation is going to be circular through the bitmap, flushing
> the TLB once per iteration through the bitmap should be sufficient, since
> we can guarantee the IO Pdir resource won't get re-used until a full
> cycle through the bitmap has been completed.

Not necessarily ... there's a safety vs. performance issue here. As long
as the IOTLB mapping persists, the device can use it to write to that
memory. If you fail to flush, you lose the ability to detect device DMA
after free (because the IOTLB entry may still be valid). On standard
systems this happens so infrequently as to be worth the tradeoff.
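To make that concrete, here's a rough userspace toy of the circular-bitmap
scheme with the flush deferred to the wrap point. Every name below is made
up for illustration; this is not the intel-iommu, sba_iommu or SPARC code,
just a sketch of the idea under discussion:

/*
 * Toy model: a circular bitmap of IOVA slots where freeing a slot does
 * NOT flush the IOTLB; the flush happens only when the allocation
 * cursor wraps back to the start, i.e. once per pass over the bitmap.
 */
#include <stdio.h>

#define MAP_BITS 64                     /* one bit per IOVA page */

static unsigned long long map;          /* 1 = slot in use */
static unsigned int next_bit;           /* circular allocation cursor */
static unsigned int flushes;            /* how often we hit the hardware */

static void iotlb_flush_all(void)       /* stand-in for the real flush */
{
        flushes++;
}

static int iova_alloc(void)
{
        unsigned int tried;

        for (tried = 0; tried < MAP_BITS; tried++) {
                unsigned int bit = next_bit;

                if (bit == 0)
                        /*
                         * About to start re-using slots from the bottom
                         * of the map: anything freed since the last wrap
                         * may still have a live IOTLB entry, so purge
                         * once here.
                         */
                        iotlb_flush_all();

                next_bit = (next_bit + 1) % MAP_BITS;

                if (!(map & (1ULL << bit))) {
                        map |= 1ULL << bit;
                        return (int)bit;
                }
        }
        return -1;                      /* map exhausted */
}

static void iova_free(unsigned int bit)
{
        /* Mark the slot free immediately; the IOTLB entry may stay stale. */
        map &= ~(1ULL << bit);
}

int main(void)
{
        int i;

        for (i = 0; i < 200; i++) {
                int bit = iova_alloc();

                if (bit >= 0)
                        iova_free((unsigned int)bit);
        }
        printf("%d map/unmap pairs, %u IOTLB flushes\n", i, flushes);
        return 0;
}

That prints "200 map/unmap pairs, 4 IOTLB flushes", which is the whole
attraction: most unmaps never touch the hardware. The cost is exactly the
window described above, because every freed slot keeps a potentially valid
IOTLB entry until the next wrap.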
However, in virtualised systems, which is what the Intel IOMMU is aimed
at, stale IOTLB entries can be used by malicious VMs to gain access to
memory outside their VM, so the Intel people at least need to say whether
they're willing to accept this speed-for-safety tradeoff.

James
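P.S. For completeness, a toy sketch of the two unmap policies being weighed
here: strict (flush before the IOVA slot can be re-used, which closes the
stale-entry window a malicious guest could exploit) versus deferred (batch
the flushes for speed). Again, every name is hypothetical; this is not the
actual intel-iommu interface:

#include <stdbool.h>
#include <stdio.h>

static bool strict_unmap = true;        /* would be a boot/module option */

static void iotlb_flush_all(void)
{
        puts("  IOTLB flushed");
}

static void iova_mark_free(unsigned long iova)
{
        printf("  IOVA %#lx free for immediate re-use\n", iova);
}

static void queue_deferred_free(unsigned long iova)
{
        /* A real driver would batch these and flush once per batch. */
        printf("  IOVA %#lx queued; stale IOTLB entry may persist\n", iova);
}

static void toy_iommu_unmap(unsigned long iova)
{
        printf("unmap %#lx (%s)\n", iova,
               strict_unmap ? "strict" : "deferred");
        if (strict_unmap) {
                iotlb_flush_all();      /* stale entry gone before re-use */
                iova_mark_free(iova);
        } else {
                queue_deferred_free(iova);
        }
}

int main(void)
{
        toy_iommu_unmap(0x10000UL);
        strict_unmap = false;           /* the fast-but-unsafe variant */
        toy_iommu_unmap(0x11000UL);
        return 0;
}

Strict mode costs an IOTLB flush on every unmap; deferred mode leaves the
window open until the batch is flushed. Which default is acceptable for
the virtualised case is exactly the question above.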