2008-06-04 14:47:35

by FUJITA Tomonori

Subject: Intel IOMMU (and IOMMU for Virtualization) performances

I resumed the work to make the IOMMU respect drivers' DMA alignment
(since I got a desktop box having VT-d). In short, some IOMMUs
allocate memory areas spanning driver's segment boundary limit (DMA
alignment). It forces drivers to have a workaround to split up scatter
entries into smaller chunks again. To remove such workarounds in
drivers, I modified several IOMMUs, X86_64 (Calgary and Gart), Alpha,
POWER, PARISC, IA64, SPARC64, and swiotlb.

Now I try to fix Intel IOMMU code, the free space management
algorithm.

The major difference between Intel IOMMU code and the others is Intel
IOMMU code uses Red Black tree to manage free space while the others
use bitmap (swiotlb is the only exception).

The Red Black tree method consumes less memory than the bitmap method,
but it incurs more overheads (the RB tree method needs to walk through
the tree, allocate a new item, and insert it every time it maps an
I/O address). Intel IOMMU (and IOMMUs for virtualization) needs
multiple IOMMU address spaces. That's why the Red Black tree method is
chosen, I guess.
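
To make the per-mapping cost difference concrete, here is a rough,
purely illustrative sketch (my names, not the mainline API): the bitmap
scheme only scans and sets bits, while the RB tree scheme has to
allocate a node and rebalance the tree on every map.

/* one bit per IOVA page; returns the first pfn of a free run, or -1 */
static long bitmap_map_pages(unsigned long *map, unsigned long map_bits,
			     unsigned long pages)
{
	unsigned long start = 0;

	while (start + pages <= map_bits) {
		unsigned long i = find_next_zero_bit(map, map_bits, start);
		unsigned long end = i + pages;

		if (end > map_bits)
			break;
		if (find_next_bit(map, end, i) >= end) {
			/* whole run [i, end) is free: mark it used */
			while (i < end)
				__set_bit(i++, map);
			return end - pages;
		}
		/* run was interrupted by a used bit; skip past it */
		start = find_next_bit(map, end, i) + 1;
	}
	return -1;
}

/*
 * The RB tree scheme (drivers/pci/iova.c) instead does, per mapping,
 * roughly: alloc_iova_mem() -> walk the tree for a hole -> rb insert,
 * and the reverse (find_iova, rb_erase, free) on every unmap.
 */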

Half a year ago, I tried to convert POWER IOMMU code to use the Red
Black method and saw performance drop:

http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-11/msg00650.html

So I tried to convert Intel IOMMU code to use the bitmap method to see
how much I can get.

I didn't see noticeable performance differences with 1GbE. So I tried
the modified driver of a SCSI HBA that just does memory accesses to
emulate the performances of SSD disk drives, 10GbE, Infiniband, etc.

I got the following results with one thread issuing 1KB I/Os:

IOPS (I/O per second)
IOMMU disabled 145253.1 (1.000)
RB tree (mainline) 118313.0 (0.814)
Bitmap 128954.1 (0.887)


I've attached the patch to modify Intel IOMMU code to use the bitmap
method but I have no intention of arguing that Intel IOMMU code
consumes more memory for better performance. :) I want to do more
performance tests with 10GbE (probably, I have to wait for a server
box having VT-d, which is not available on the market now).

As I said, what I want to do now is to make Intel IOMMU code respect
drivers' DMA alignment. Well, it's easier to do that if Intel IOMMU
uses the bitmap method since I can simply convert the IOMMU code to
use lib/iommu-helper but I can modify the RB tree method too.
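
For reference, respecting the alignment with lib/iommu-helper would
look roughly like the sketch below (based on how the already-converted
IOMMUs use it; the exact hook-up for intel-iommu would differ):

/*
 * Sketch only: derive the device's segment boundary and pass it to
 * iommu_area_alloc() so the returned run never crosses that boundary.
 */
static unsigned long alloc_iova_respecting_boundary(struct device *dev,
						    unsigned long *map,
						    unsigned long map_bits,
						    unsigned long start,
						    int pages)
{
	unsigned long boundary_size;

	boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
			      PAGE_SIZE_4K) >> PAGE_SHIFT_4K;

	return iommu_area_alloc(map, map_bits, start, pages,
				0, boundary_size, 0);
}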

I'm just interested in other people's opinions on IOMMU
implementations, performances, possible future changes for performance
improvement, etc.

For further information:

LSF'08 "Storage Track" summary by Grant Grundler:
http://iou.parisc-linux.org/lsf2008/SUMMARY-Storage.txt

My LSF'08 slides:
http://iou.parisc-linux.org/lsf2008/IO-DMA_Representations-fujita_tomonori.pdf


This patch is against the latest git tree (note that it just converts
Intel IOMMU code to use the bitmap. It doesn't make it respect
drivers' DMA alignment yet).

=
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index dcbec34..06d92d0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1553,6 +1553,9 @@ config DMAR
and include PCI device scope covered by these DMA
remapping devices.

+config IOMMU_HELPER
+ def_bool DMAR
+
config DMAR_GFX_WA
def_bool y
prompt "Support for Graphics workaround"
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 4d1ce2e..675beb6 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -24,7 +24,7 @@ obj-$(CONFIG_PCI_MSI) += msi.o
obj-$(CONFIG_HT_IRQ) += htirq.o

# Build Intel IOMMU support
-obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o
+obj-$(CONFIG_DMAR) += dmar.o intel-iommu.o

#
# Some architectures use the generic PCI setup functions
diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index f941f60..41ad545 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -26,7 +26,6 @@

#include <linux/pci.h>
#include <linux/dmar.h>
-#include "iova.h"
#include "intel-iommu.h"

#undef PREFIX
diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 66c0fd2..839363a 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -32,8 +32,7 @@
#include <linux/dmar.h>
#include <linux/dma-mapping.h>
#include <linux/mempool.h>
-#include <linux/timer.h>
-#include "iova.h"
+#include <linux/iommu-helper.h>
#include "intel-iommu.h"
#include <asm/proto.h> /* force_iommu in this header in x86-64*/
#include <asm/cacheflush.h>
@@ -51,33 +50,15 @@

#define DMAR_OPERATION_TIMEOUT ((cycles_t) tsc_khz*10*1000) /* 10sec */

-#define DOMAIN_MAX_ADDR(gaw) ((((u64)1) << gaw) - 1)
-
-
-static void flush_unmaps_timeout(unsigned long data);
+#define DMA_ERROR_CODE (~(dma_addr_t)0x0)

-DEFINE_TIMER(unmap_timer, flush_unmaps_timeout, 0, 0);
+#define DOMAIN_MAX_ADDR(gaw) ((((u64)1) << gaw) - 1)

static struct intel_iommu *g_iommus;

-#define HIGH_WATER_MARK 250
-struct deferred_flush_tables {
- int next;
- struct iova *iova[HIGH_WATER_MARK];
- struct dmar_domain *domain[HIGH_WATER_MARK];
-};
-
-static struct deferred_flush_tables *deferred_flush;
-
/* bitmap for indexing intel_iommus */
static int g_num_of_iommus;

-static DEFINE_SPINLOCK(async_umap_flush_lock);
-static LIST_HEAD(unmaps_to_do);
-
-static int timer_on;
-static long list_size;
-
static void domain_remove_dev_info(struct dmar_domain *domain);

static int dmar_disabled;
@@ -121,7 +102,6 @@ __setup("intel_iommu=", intel_iommu_setup);

static struct kmem_cache *iommu_domain_cache;
static struct kmem_cache *iommu_devinfo_cache;
-static struct kmem_cache *iommu_iova_cache;

static inline void *iommu_kmem_cache_alloc(struct kmem_cache *cachep)
{
@@ -175,16 +155,6 @@ static inline void free_devinfo_mem(void *vaddr)
kmem_cache_free(iommu_devinfo_cache, vaddr);
}

-struct iova *alloc_iova_mem(void)
-{
- return iommu_kmem_cache_alloc(iommu_iova_cache);
-}
-
-void free_iova_mem(struct iova *iova)
-{
- kmem_cache_free(iommu_iova_cache, iova);
-}
-
static inline void __iommu_flush_cache(
struct intel_iommu *iommu, void *addr, int size)
{
@@ -1124,29 +1094,39 @@ static void iommu_free_domain(struct dmar_domain *domain)
spin_unlock_irqrestore(&domain->iommu->lock, flags);
}

-static struct iova_domain reserved_iova_list;
+static unsigned long *reserved_it_map;
+static unsigned long reserved_it_size;
static struct lock_class_key reserved_alloc_key;
static struct lock_class_key reserved_rbtree_key;

+static void reserve_area(unsigned long *map, unsigned long start, unsigned long end)
+{
+ while (start <= end) {
+ __set_bit(start, map);
+ start++;
+ }
+}
+
static void dmar_init_reserved_ranges(void)
{
struct pci_dev *pdev = NULL;
- struct iova *iova;
int i;
u64 addr, size;

- init_iova_domain(&reserved_iova_list, DMA_32BIT_PFN);
+ reserved_it_size = 1UL << (32 - PAGE_SHIFT_4K);
+ reserved_it_map = kzalloc(reserved_it_size / BITS_PER_BYTE, GFP_ATOMIC);
+ BUG_ON(!reserved_it_map);

lockdep_set_class(&reserved_iova_list.iova_alloc_lock,
&reserved_alloc_key);
lockdep_set_class(&reserved_iova_list.iova_rbtree_lock,
&reserved_rbtree_key);

+ reserve_area(reserved_it_map, 0, IOVA_PFN(IOVA_START_ADDR));
+
/* IOAPIC ranges shouldn't be accessed by DMA */
- iova = reserve_iova(&reserved_iova_list, IOVA_PFN(IOAPIC_RANGE_START),
- IOVA_PFN(IOAPIC_RANGE_END));
- if (!iova)
- printk(KERN_ERR "Reserve IOAPIC range failed\n");
+ reserve_area(reserved_it_map, IOVA_PFN(IOAPIC_RANGE_START),
+ IOVA_PFN(IOAPIC_RANGE_END));

/* Reserve all PCI MMIO to avoid peer-to-peer access */
for_each_pci_dev(pdev) {
@@ -1160,18 +1140,10 @@ static void dmar_init_reserved_ranges(void)
addr &= PAGE_MASK_4K;
size = r->end - addr;
size = PAGE_ALIGN_4K(size);
- iova = reserve_iova(&reserved_iova_list, IOVA_PFN(addr),
- IOVA_PFN(size + addr) - 1);
- if (!iova)
- printk(KERN_ERR "Reserve iova failed\n");
+ reserve_area(reserved_it_map, IOVA_PFN(addr),
+ IOVA_PFN(size + addr) - 1);
}
}
-
-}
-
-static void domain_reserve_special_ranges(struct dmar_domain *domain)
-{
- copy_reserved_iova(&reserved_iova_list, &domain->iovad);
}

static inline int guestwidth_to_adjustwidth(int gaw)
@@ -1191,38 +1163,52 @@ static inline int guestwidth_to_adjustwidth(int gaw)
static int domain_init(struct dmar_domain *domain, int guest_width)
{
struct intel_iommu *iommu;
- int adjust_width, agaw;
+ int ret, adjust_width, agaw;
unsigned long sagaw;

- init_iova_domain(&domain->iovad, DMA_32BIT_PFN);
spin_lock_init(&domain->mapping_lock);

- domain_reserve_special_ranges(domain);
-
/* calculate AGAW */
iommu = domain->iommu;
+
if (guest_width > cap_mgaw(iommu->cap))
guest_width = cap_mgaw(iommu->cap);
domain->gaw = guest_width;
adjust_width = guestwidth_to_adjustwidth(guest_width);
agaw = width_to_agaw(adjust_width);
sagaw = cap_sagaw(iommu->cap);
+
+/* domain->it_size = 1UL << (guest_width - PAGE_SHIFT_4K); */
+ domain->it_size = 1UL << (32 - PAGE_SHIFT_4K);
+ domain->it_map = kzalloc(domain->it_size / BITS_PER_BYTE, GFP_ATOMIC);
+ if (!domain->it_map)
+ return -ENOMEM;
+
+ memcpy(domain->it_map, reserved_it_map, reserved_it_size / BITS_PER_BYTE);
+
if (!test_bit(agaw, &sagaw)) {
/* hardware doesn't support it, choose a bigger one */
pr_debug("IOMMU: hardware doesn't support agaw %d\n", agaw);
agaw = find_next_bit(&sagaw, 5, agaw);
- if (agaw >= 5)
- return -ENODEV;
+ if (agaw >= 5) {
+ ret = -ENODEV;
+ goto free_map;
+ }
}
domain->agaw = agaw;
INIT_LIST_HEAD(&domain->devices);

/* always allocate the top pgd */
domain->pgd = (struct dma_pte *)alloc_pgtable_page();
- if (!domain->pgd)
- return -ENOMEM;
+ if (!domain->pgd) {
+ ret = -ENOMEM;
+ goto free_map;
+ }
__iommu_flush_cache(iommu, domain->pgd, PAGE_SIZE_4K);
return 0;
+free_map:
+ kfree(domain->it_map);
+ return ret;
}

static void domain_exit(struct dmar_domain *domain)
@@ -1234,8 +1220,7 @@ static void domain_exit(struct dmar_domain *domain)
return;

domain_remove_dev_info(domain);
- /* destroy iovas */
- put_iova_domain(&domain->iovad);
+ kfree(domain->it_map);
end = DOMAIN_MAX_ADDR(domain->gaw);
end = end & (~PAGE_MASK_4K);

@@ -1597,12 +1582,9 @@ static int iommu_prepare_identity_map(struct pci_dev *pdev, u64 start, u64 end)
base = start & PAGE_MASK_4K;
size = end - base;
size = PAGE_ALIGN_4K(size);
- if (!reserve_iova(&domain->iovad, IOVA_PFN(base),
- IOVA_PFN(base + size) - 1)) {
- printk(KERN_ERR "IOMMU: reserve iova failed\n");
- ret = -ENOMEM;
- goto error;
- }
+
+ reserve_area(domain->it_map, IOVA_PFN(base),
+ IOVA_PFN(base + size) - 1);

pr_debug("Mapping reserved region %lx@%llx for %s\n",
size, base, pci_name(pdev));
@@ -1722,14 +1704,6 @@ int __init init_dmars(void)
goto error;
}

- deferred_flush = kzalloc(g_num_of_iommus *
- sizeof(struct deferred_flush_tables), GFP_KERNEL);
- if (!deferred_flush) {
- kfree(g_iommus);
- ret = -ENOMEM;
- goto error;
- }
-
i = 0;
for_each_drhd_unit(drhd) {
if (drhd->ignored)
@@ -1834,49 +1808,6 @@ static inline u64 aligned_size(u64 host_addr, size_t size)
return PAGE_ALIGN_4K(addr);
}

-struct iova *
-iommu_alloc_iova(struct dmar_domain *domain, size_t size, u64 end)
-{
- struct iova *piova;
-
- /* Make sure it's in range */
- end = min_t(u64, DOMAIN_MAX_ADDR(domain->gaw), end);
- if (!size || (IOVA_START_ADDR + size > end))
- return NULL;
-
- piova = alloc_iova(&domain->iovad,
- size >> PAGE_SHIFT_4K, IOVA_PFN(end), 1);
- return piova;
-}
-
-static struct iova *
-__intel_alloc_iova(struct device *dev, struct dmar_domain *domain,
- size_t size)
-{
- struct pci_dev *pdev = to_pci_dev(dev);
- struct iova *iova = NULL;
-
- if ((pdev->dma_mask <= DMA_32BIT_MASK) || (dmar_forcedac)) {
- iova = iommu_alloc_iova(domain, size, pdev->dma_mask);
- } else {
- /*
- * First try to allocate an io virtual address in
- * DMA_32BIT_MASK and if that fails then try allocating
- * from higher range
- */
- iova = iommu_alloc_iova(domain, size, DMA_32BIT_MASK);
- if (!iova)
- iova = iommu_alloc_iova(domain, size, pdev->dma_mask);
- }
-
- if (!iova) {
- printk(KERN_ERR"Allocating iova for %s failed", pci_name(pdev));
- return NULL;
- }
-
- return iova;
-}
-
static struct dmar_domain *
get_valid_domain_for_dev(struct pci_dev *pdev)
{
@@ -1905,15 +1836,53 @@ get_valid_domain_for_dev(struct pci_dev *pdev)
return domain;
}

+static unsigned long intel_iommu_area_alloc(struct dmar_domain *domain,
+ int pages)
+{
+ unsigned long start;
+ unsigned long flags;
+ int pass = 0;
+ unsigned long index, limit, boundary_size;
+
+ limit = domain->it_size;
+ start = domain->start;
+ boundary_size = 1UL << (32 - PAGE_SHIFT_4K);
+
+ spin_lock_irqsave(&domain->mapping_lock, flags);
+again:
+ index = iommu_area_alloc(domain->it_map, limit, start, pages,
+ 0, boundary_size, 0);
+ if (index == -1) {
+ if (!pass) {
+ if (!intel_iommu_strict)
+ iommu_flush_iotlb_dsi(domain->iommu,
+ domain->id, 0);
+ start = 0;
+ pass++;
+ goto again;
+ }
+
+ spin_unlock_irqrestore(&domain->mapping_lock, flags);
+ return DMA_ERROR_CODE;
+ }
+
+ domain->start = index + pages;
+
+ spin_unlock_irqrestore(&domain->mapping_lock, flags);
+
+ return index;
+}
+
static dma_addr_t
intel_map_single(struct device *hwdev, phys_addr_t paddr, size_t size, int dir)
{
struct pci_dev *pdev = to_pci_dev(hwdev);
struct dmar_domain *domain;
- unsigned long start_paddr;
- struct iova *iova;
+ unsigned long start_paddr, index;
int prot = 0;
int ret;
+ unsigned long flags;
+ int pages;

BUG_ON(dir == DMA_NONE);
if (pdev->dev.archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO)
@@ -1924,12 +1893,12 @@ intel_map_single(struct device *hwdev, phys_addr_t paddr, size_t size, int dir)
return 0;

size = aligned_size((u64)paddr, size);
+ pages = size >> PAGE_SHIFT_4K;
+ index = intel_iommu_area_alloc(domain, pages);
+ if (index == DMA_ERROR_CODE)
+ return 0;

- iova = __intel_alloc_iova(hwdev, domain, size);
- if (!iova)
- goto error;
-
- start_paddr = iova->pfn_lo << PAGE_SHIFT_4K;
+ start_paddr = index << PAGE_SHIFT_4K;

/*
* Check if DMAR supports zero-length reads on write only
@@ -1957,91 +1926,36 @@ intel_map_single(struct device *hwdev, phys_addr_t paddr, size_t size, int dir)

/* it's a non-present to present mapping */
ret = iommu_flush_iotlb_psi(domain->iommu, domain->id,
- start_paddr, size >> PAGE_SHIFT_4K, 1);
+ start_paddr, pages, 1);
if (ret)
iommu_flush_write_buffer(domain->iommu);

return (start_paddr + ((u64)paddr & (~PAGE_MASK_4K)));

error:
- if (iova)
- __free_iova(&domain->iovad, iova);
+ spin_lock_irqsave(&domain->mapping_lock, flags);
+ iommu_area_free(domain->it_map, index, pages);
+ spin_unlock_irqrestore(&domain->mapping_lock, flags);
+
printk(KERN_ERR"Device %s request: %lx@%llx dir %d --- failed\n",
pci_name(pdev), size, (u64)paddr, dir);
return 0;
}

-static void flush_unmaps(void)
-{
- int i, j;
-
- timer_on = 0;
-
- /* just flush them all */
- for (i = 0; i < g_num_of_iommus; i++) {
- if (deferred_flush[i].next) {
- iommu_flush_iotlb_global(&g_iommus[i], 0);
- for (j = 0; j < deferred_flush[i].next; j++) {
- __free_iova(&deferred_flush[i].domain[j]->iovad,
- deferred_flush[i].iova[j]);
- }
- deferred_flush[i].next = 0;
- }
- }
-
- list_size = 0;
-}
-
-static void flush_unmaps_timeout(unsigned long data)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&async_umap_flush_lock, flags);
- flush_unmaps();
- spin_unlock_irqrestore(&async_umap_flush_lock, flags);
-}
-
-static void add_unmap(struct dmar_domain *dom, struct iova *iova)
-{
- unsigned long flags;
- int next, iommu_id;
-
- spin_lock_irqsave(&async_umap_flush_lock, flags);
- if (list_size == HIGH_WATER_MARK)
- flush_unmaps();
-
- iommu_id = dom->iommu - g_iommus;
- next = deferred_flush[iommu_id].next;
- deferred_flush[iommu_id].domain[next] = dom;
- deferred_flush[iommu_id].iova[next] = iova;
- deferred_flush[iommu_id].next++;
-
- if (!timer_on) {
- mod_timer(&unmap_timer, jiffies + msecs_to_jiffies(10));
- timer_on = 1;
- }
- list_size++;
- spin_unlock_irqrestore(&async_umap_flush_lock, flags);
-}
-
static void intel_unmap_single(struct device *dev, dma_addr_t dev_addr,
size_t size, int dir)
{
struct pci_dev *pdev = to_pci_dev(dev);
struct dmar_domain *domain;
unsigned long start_addr;
- struct iova *iova;
+ unsigned long flags;

if (pdev->dev.archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO)
return;
domain = find_domain(pdev);
BUG_ON(!domain);

- iova = find_iova(&domain->iovad, IOVA_PFN(dev_addr));
- if (!iova)
- return;
-
- start_addr = iova->pfn_lo << PAGE_SHIFT_4K;
+ start_addr = dev_addr & ~(PAGE_SIZE_4K - 1);
size = aligned_size((u64)dev_addr, size);

pr_debug("Device %s unmapping: %lx@%llx\n",
@@ -2051,19 +1965,16 @@ static void intel_unmap_single(struct device *dev, dma_addr_t dev_addr,
dma_pte_clear_range(domain, start_addr, start_addr + size);
/* free page tables */
dma_pte_free_pagetable(domain, start_addr, start_addr + size);
- if (intel_iommu_strict) {
- if (iommu_flush_iotlb_psi(domain->iommu,
- domain->id, start_addr, size >> PAGE_SHIFT_4K, 0))
- iommu_flush_write_buffer(domain->iommu);
- /* free iova */
- __free_iova(&domain->iovad, iova);
- } else {
- add_unmap(domain, iova);
- /*
- * queue up the release of the unmap to save the 1/6th of the
- * cpu used up by the iotlb flush operation...
- */
- }
+
+ if (intel_iommu_strict && iommu_flush_iotlb_psi(domain->iommu,
+ domain->id, start_addr,
+ size >> PAGE_SHIFT_4K, 0))
+ iommu_flush_write_buffer(domain->iommu);
+
+ spin_lock_irqsave(&domain->mapping_lock, flags);
+ iommu_area_free(domain->it_map, start_addr >> PAGE_SHIFT_4K,
+ size >> PAGE_SHIFT_4K);
+ spin_unlock_irqrestore(&domain->mapping_lock, flags);
}

static void * intel_alloc_coherent(struct device *hwdev, size_t size,
@@ -2108,37 +2019,39 @@ static void intel_unmap_sg(struct device *hwdev, struct scatterlist *sglist,
struct pci_dev *pdev = to_pci_dev(hwdev);
struct dmar_domain *domain;
unsigned long start_addr;
- struct iova *iova;
size_t size = 0;
void *addr;
struct scatterlist *sg;
+ unsigned index;
+ unsigned long flags;

if (pdev->dev.archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO)
return;

domain = find_domain(pdev);

- iova = find_iova(&domain->iovad, IOVA_PFN(sglist[0].dma_address));
- if (!iova)
- return;
+ index = IOVA_PFN(sglist[0].dma_address);
+
for_each_sg(sglist, sg, nelems, i) {
addr = SG_ENT_VIRT_ADDRESS(sg);
size += aligned_size((u64)addr, sg->length);
}

- start_addr = iova->pfn_lo << PAGE_SHIFT_4K;
+ start_addr = index << PAGE_SHIFT_4K;

/* clear the whole page */
dma_pte_clear_range(domain, start_addr, start_addr + size);
/* free page tables */
dma_pte_free_pagetable(domain, start_addr, start_addr + size);

- if (iommu_flush_iotlb_psi(domain->iommu, domain->id, start_addr,
- size >> PAGE_SHIFT_4K, 0))
+ if (intel_iommu_strict && iommu_flush_iotlb_psi(domain->iommu,
+ domain->id, start_addr,
+ size >> PAGE_SHIFT_4K, 0))
iommu_flush_write_buffer(domain->iommu);

- /* free iova */
- __free_iova(&domain->iovad, iova);
+ spin_lock_irqsave(&domain->mapping_lock, flags);
+ iommu_area_free(domain->it_map, index, size >> PAGE_SHIFT_4K);
+ spin_unlock_irqrestore(&domain->mapping_lock, flags);
}

static int intel_nontranslate_map_sg(struct device *hddev,
@@ -2165,10 +2078,12 @@ static int intel_map_sg(struct device *hwdev, struct scatterlist *sglist,
size_t size = 0;
int prot = 0;
size_t offset = 0;
- struct iova *iova = NULL;
int ret;
struct scatterlist *sg;
unsigned long start_addr;
+ unsigned long index;
+ unsigned long flags;
+ int pages;

BUG_ON(dir == DMA_NONE);
if (pdev->dev.archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO)
@@ -2184,8 +2099,9 @@ static int intel_map_sg(struct device *hwdev, struct scatterlist *sglist,
size += aligned_size((u64)addr, sg->length);
}

- iova = __intel_alloc_iova(hwdev, domain, size);
- if (!iova) {
+ pages = size >> PAGE_SHIFT_4K;
+ index = intel_iommu_area_alloc(domain, pages);
+ if (index == DMA_ERROR_CODE) {
sglist->dma_length = 0;
return 0;
}
@@ -2200,7 +2116,7 @@ static int intel_map_sg(struct device *hwdev, struct scatterlist *sglist,
if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
prot |= DMA_PTE_WRITE;

- start_addr = iova->pfn_lo << PAGE_SHIFT_4K;
+ start_addr = index << PAGE_SHIFT_4K;
offset = 0;
for_each_sg(sglist, sg, nelems, i) {
addr = SG_ENT_VIRT_ADDRESS(sg);
@@ -2217,7 +2133,11 @@ static int intel_map_sg(struct device *hwdev, struct scatterlist *sglist,
dma_pte_free_pagetable(domain, start_addr,
start_addr + offset);
/* free iova */
- __free_iova(&domain->iovad, iova);
+ spin_lock_irqsave(&domain->mapping_lock, flags);
+
+ iommu_area_free(domain->it_map, index, pages);
+
+ spin_unlock_irqrestore(&domain->mapping_lock, flags);
return 0;
}
sg->dma_address = start_addr + offset +
@@ -2278,52 +2198,27 @@ static inline int iommu_devinfo_cache_init(void)
return ret;
}

-static inline int iommu_iova_cache_init(void)
-{
- int ret = 0;
-
- iommu_iova_cache = kmem_cache_create("iommu_iova",
- sizeof(struct iova),
- 0,
- SLAB_HWCACHE_ALIGN,
-
- NULL);
- if (!iommu_iova_cache) {
- printk(KERN_ERR "Couldn't create iova cache\n");
- ret = -ENOMEM;
- }
-
- return ret;
-}
-
static int __init iommu_init_mempool(void)
{
int ret;
- ret = iommu_iova_cache_init();
- if (ret)
- return ret;

ret = iommu_domain_cache_init();
if (ret)
- goto domain_error;
+ return -ENOMEM;

ret = iommu_devinfo_cache_init();
- if (!ret)
- return ret;
-
- kmem_cache_destroy(iommu_domain_cache);
-domain_error:
- kmem_cache_destroy(iommu_iova_cache);
+ if (ret) {
+ kmem_cache_destroy(iommu_domain_cache);
+ return -ENOMEM;
+ }

- return -ENOMEM;
+ return 0;
}

static void __init iommu_exit_mempool(void)
{
kmem_cache_destroy(iommu_devinfo_cache);
kmem_cache_destroy(iommu_domain_cache);
- kmem_cache_destroy(iommu_iova_cache);
-
}

void __init detect_intel_iommu(void)
@@ -2395,16 +2290,14 @@ int __init intel_iommu_init(void)
ret = init_dmars();
if (ret) {
printk(KERN_ERR "IOMMU: dmar init failed\n");
- put_iova_domain(&reserved_iova_list);
+ kfree(reserved_it_map);
iommu_exit_mempool();
return ret;
}
printk(KERN_INFO
- "PCI-DMA: Intel(R) Virtualization Technology for Directed I/O\n");
+ "PCI-DMA: Intel(R) Virtualization Technology for Directed I/O (bitmap)\n");

- init_timer(&unmap_timer);
force_iommu = 1;
dma_ops = &intel_dma_ops;
return 0;
}
-
diff --git a/drivers/pci/intel-iommu.h b/drivers/pci/intel-iommu.h
index afc0ad9..6102cfb 100644
--- a/drivers/pci/intel-iommu.h
+++ b/drivers/pci/intel-iommu.h
@@ -291,7 +291,10 @@ struct dmar_domain {
struct intel_iommu *iommu; /* back pointer to owning iommu */

struct list_head devices; /* all devices' list */
- struct iova_domain iovad; /* iova's that belong to this domain */
+
+ unsigned long it_size;
+ unsigned long *it_map;
+ unsigned long start;

struct dma_pte *pgd; /* virtual address */
spinlock_t mapping_lock; /* page table lock */


2008-06-04 17:25:23

by Andi Kleen

Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

FUJITA Tomonori <[email protected]> writes:
>
> I'm just interested in other people's opinions on IOMMU
> implementations, performances, possible future changes for performance
> improvement, etc.

I think using the bitmap is an excellent idea and your numbers look good.
Do you have numbers on the memory consumption too?
Trading some memory for performance is ok for something as performance critical
as the IOMMU.

-Andi

2008-06-04 18:06:54

by Grant Grundler

Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

On Wed, Jun 4, 2008 at 7:47 AM, FUJITA Tomonori
<[email protected]> wrote:
...
> Now I try to fix Intel IOMMU code, the free space management
> algorithm.
>
> The major difference between Intel IOMMU code and the others is Intel
> IOMMU code uses Red Black tree to manage free space while the others
> use bitmap (swiotlb is the only exception).
>
> The Red Black tree method consumes less memory than the bitmap method,
> but it incurs more overheads (the RB tree method needs to walk through
> the tree, allocate a new item, and insert it every time it maps an
> I/O address). Intel IOMMU (and IOMMUs for virtualization) needs
> multiple IOMMU address spaces. That's why the Red Black tree method is
> chosen, I guess.

It's possible to split up one flat address space and share the IOMMU
among several users. Each user gets her own segment of bitmap and
corresponding IO Pdir. So I don't see allocation policy as a strong reason
to use Red/Black Tree.

I suspect the R/B tree was chosen because they expected a host VM to allocate
one large "static" entry for a guest OS and the guest would manage that range
itself. R/B Tree seems like a very efficient way to handle that from the host
VM point of view.

> Half a year ago, I tried to convert POWER IOMMU code to use the Red
> Black method and saw performance drop:
>
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-11/msg00650.html
>
> So I tried to convert Intel IOMMU code to use the bitmap method to see
> how much I can get.
>
> I didn't see noticeable performance differences with 1GbE. So I tried
> the modified driver of a SCSI HBA that just does memory accesses to
> emulate the performances of SSD disk drives, 10GbE, Infiniband, etc.

You can easily emulate SSD drives by doing sequential 4K reads
from a normal SATA HD. That should result in ~7-8K IOPS since the disk
will recognize the sequential stream and read ahead. SAS/SCSI/FC will
probably work the same way with different IOP rates.

> I got the following results with one thread issuing 1KB I/Os:
>
> IOPS (I/O per second)
> IOMMU disabled 145253.1 (1.000)
> RB tree (mainline) 118313.0 (0.814)
> Bitmap 128954.1 (0.887)

Just to make this clear, this is a 10% performance difference.

But a second metric is more telling: CPU utilization.
How much time was spent in the IOMMU code for each
implementation with the same workload?

This isn't a demand for that information but just a request
to measure that in any future benchmarking.
oprofile or perfmon2 are the best tools to determine that.


> I've attached the patch to modify Intel IOMMU code to use the bitmap
> method but I have no intention of arguing that Intel IOMMU code
> consumes more memory for better performance. :) I want to do more
> performance tests with 10GbE (probably, I have to wait for a server
> box having VT-d, which is not available on the market now).


> As I said, what I want to do now is to make Intel IOMMU code respect
> drivers' DMA alignment. Well, it's easier to do that if Intel IOMMU
> uses the bitmap method since I can simply convert the IOMMU code to
> use lib/iommu-helper but I can modify the RB tree method too.

Just as important as the allocation data structure is the allocation policy.
The allocation policy will perform best if it matches the IO TLB
replacement implemented in the IOMMU HW. Thrashing the IO TLB
by allocating aliases to competing streams will hurt perf as well.
Obviously a single benchmark is unlikely to detect this.

> I'm just interested in other people's opinions on IOMMU
> implementations, performances, possible future changes for performance
> improvement, etc.
>
> For further information:
>
> LSF'08 "Storage Track" summary by Grant Grundler:
> http://iou.parisc-linux.org/lsf2008/SUMMARY-Storage.txt
>
> My LSF'08 slides:
> http://iou.parisc-linux.org/lsf2008/IO-DMA_Representations-fujita_tomonori.pdf

I personally found this to be one of the more interesting talks :)
Excellent work!


> This patch is against the latest git tree (note that it just converts
> Intel IOMMU code to use the bitmap. It doesn't make it respect
> drivers' DMA alignment yet).
...


> diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
> index f941f60..41ad545 100644
> --- a/drivers/pci/dmar.c
> +++ b/drivers/pci/dmar.c
> @@ -26,7 +26,6 @@
>
> #include <linux/pci.h>
> #include <linux/dmar.h>
> -#include "iova.h"
> #include "intel-iommu.h"
>
> #undef PREFIX
> diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
> index 66c0fd2..839363a 100644
> --- a/drivers/pci/intel-iommu.c
> +++ b/drivers/pci/intel-iommu.c
> @@ -32,8 +32,7 @@
> #include <linux/dmar.h>
> #include <linux/dma-mapping.h>
> #include <linux/mempool.h>
> -#include <linux/timer.h>
> -#include "iova.h"
> +#include <linux/iommu-helper.h>
> #include "intel-iommu.h"
> #include <asm/proto.h> /* force_iommu in this header in x86-64*/
> #include <asm/cacheflush.h>
> @@ -51,33 +50,15 @@
>
> #define DMAR_OPERATION_TIMEOUT ((cycles_t) tsc_khz*10*1000) /* 10sec */
>
> -#define DOMAIN_MAX_ADDR(gaw) ((((u64)1) << gaw) - 1)
> -
> -
> -static void flush_unmaps_timeout(unsigned long data);
> +#define DMA_ERROR_CODE (~(dma_addr_t)0x0)
>
> -DEFINE_TIMER(unmap_timer, flush_unmaps_timeout, 0, 0);
> +#define DOMAIN_MAX_ADDR(gaw) ((((u64)1) << gaw) - 1)
>
> static struct intel_iommu *g_iommus;
>
> -#define HIGH_WATER_MARK 250
> -struct deferred_flush_tables {
> - int next;
> - struct iova *iova[HIGH_WATER_MARK];
> - struct dmar_domain *domain[HIGH_WATER_MARK];
> -};

Sorry, I didn't see a replacement for the deferred_flush_tables.
Mark Gross and I agree this substantially helps with unmap performance.
See http://lkml.org/lkml/2008/3/3/373

...
> static void dmar_init_reserved_ranges(void)
> {
> struct pci_dev *pdev = NULL;
> - struct iova *iova;
> int i;
> u64 addr, size;
>
> - init_iova_domain(&reserved_iova_list, DMA_32BIT_PFN);
> + reserved_it_size = 1UL << (32 - PAGE_SHIFT_4K);

Can you make reserved_it_size a module or kernel parameter?

I've never been able to come up with a good heuristic
for determining the size of the IOVA space. It generally
does NOT need to map all of Host Physical RAM.
The actual requirement depends entirely on the workload,
type and number of IO devices installed. The problem is we
don't know any of those things until well after the IOMMU
is already needed.

> + reserved_it_map = kzalloc(reserved_it_size / BITS_PER_BYTE, GFP_ATOMIC);
> + BUG_ON(!reserved_it_map);
>
> lockdep_set_class(&reserved_iova_list.iova_alloc_lock,
> &reserved_alloc_key);
> lockdep_set_class(&reserved_iova_list.iova_rbtree_lock,
> &reserved_rbtree_key);
>
> + reserve_area(reserved_it_map, 0, IOVA_PFN(IOVA_START_ADDR));
> +
> /* IOAPIC ranges shouldn't be accessed by DMA */
> - iova = reserve_iova(&reserved_iova_list, IOVA_PFN(IOAPIC_RANGE_START),
> - IOVA_PFN(IOAPIC_RANGE_END));
> - if (!iova)
> - printk(KERN_ERR "Reserve IOAPIC range failed\n");
> + reserve_area(reserved_it_map, IOVA_PFN(IOAPIC_RANGE_START),
> + IOVA_PFN(IOAPIC_RANGE_END));
>
> /* Reserve all PCI MMIO to avoid peer-to-peer access */
> for_each_pci_dev(pdev) {

I didn't check....but reserving MMIO address space might be better done
by looking at MMIO ranges routed by all the top level PCI Host-bus controllers
(aka PCI-e Root ports). Maybe this is an idea for Mark Gross to implement.


> -static void domain_reserve_special_ranges(struct dmar_domain *domain)
> -{
> - copy_reserved_iova(&reserved_iova_list, &domain->iovad);
> }
>
> static inline int guestwidth_to_adjustwidth(int gaw)
> @@ -1191,38 +1163,52 @@ static inline int guestwidth_to_adjustwidth(int gaw)
> static int domain_init(struct dmar_domain *domain, int guest_width)
> {
> struct intel_iommu *iommu;
> - int adjust_width, agaw;
> + int ret, adjust_width, agaw;
> unsigned long sagaw;
>
> - init_iova_domain(&domain->iovad, DMA_32BIT_PFN);
> spin_lock_init(&domain->mapping_lock);
>
> - domain_reserve_special_ranges(domain);
> -
> /* calculate AGAW */
> iommu = domain->iommu;
> +
> if (guest_width > cap_mgaw(iommu->cap))
> guest_width = cap_mgaw(iommu->cap);
> domain->gaw = guest_width;
> adjust_width = guestwidth_to_adjustwidth(guest_width);
> agaw = width_to_agaw(adjust_width);
> sagaw = cap_sagaw(iommu->cap);
> +
> +/* domain->it_size = 1UL << (guest_width - PAGE_SHIFT_4K); */
> + domain->it_size = 1UL << (32 - PAGE_SHIFT_4K);

"32-PAGE_SHIFT_4K" expression is used in several places but I didn't see
an explanation of why 32. Can you add one someplace?

out of time...

thanks!
grant

2008-06-05 14:49:34

by FUJITA Tomonori

Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

On Wed, 04 Jun 2008 18:56:35 +0200
Andi Kleen <[email protected]> wrote:

> FUJITA Tomonori <[email protected]> writes:
> >
> > I'm just interested in other people's opinions on IOMMU
> > implementations, performances, possible future changes for performance
> > improvement, etc.
>
> I think using the bitmap is an excellent idea and your numbers look good.
> Do you have numbers on the memory consumption too?
> Trading some memory for performance is ok for something as performance critical
> as the IOMMU.

If we use 4GB virtual DMA address space (as the patch does), we need
128 KB for the bitmap for one domain.
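
(Spelled out: a 4GB = 2^32-byte IOVA space in 4KB pages is 2^20 pages;
at one bit per page that is 2^20 bits = 2^17 bytes = 128 KB per domain.)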

With the RB tree, the memory consumption depends on how many addresses
are mapped (it needs one entry per mapped address, though we could merge
multiple addresses).

2008-06-05 14:51:58

by FUJITA Tomonori

Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

On Wed, 4 Jun 2008 11:06:15 -0700
"Grant Grundler" <[email protected]> wrote:

> On Wed, Jun 4, 2008 at 7:47 AM, FUJITA Tomonori
> <[email protected]> wrote:
> ...
> > Now I try to fix Intel IOMMU code, the free space management
> > algorithm.
> >
> > The major difference between Intel IOMMU code and the others is Intel
> > IOMMU code uses Red Black tree to manage free space while the others
> > use bitmap (swiotlb is the only exception).
> >
> > The Red Black tree method consumes less memory than the bitmap method,
> > but it incurs more overheads (the RB tree method needs to walk through
> > the tree, allocate a new item, and insert it every time it maps an
> > I/O address). Intel IOMMU (and IOMMUs for virtualization) needs
> > multiple IOMMU address spaces. That's why the Red Black tree method is
> > chosen, I guess.
>
> It's possible to split up one flat address space and share the IOMMU
> among several users. Each user gets her own segment of bitmap and
> corresponding IO Pdir. So I don't see allocation policy as a strong reason
> to use Red/Black Tree.
>
> I suspect the R/B tree was chosen because they expected a host VM to allocate
> one large "static" entry for a guest OS and the guest would manage that range
> itself. R/B Tree seems like a very efficient way to handle that from the host
> VM point of view.

Good point, it would be more appropriate to evaluate the performance
of such a workload rather than simple I/Os.

I'm interested in what workload Intel people played with and why they
chose RB tree.


> > Half a year ago, I tried to convert POWER IOMMU code to use the Red
> > Black method and saw performance drop:
> >
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-11/msg00650.html
> >
> > So I tried to convert Intel IOMMU code to use the bitmap method to see
> > how much I can get.
> >
> > I didn't see noticeable performance differences with 1GbE. So I tried
> > the modified driver of a SCSI HBA that just does memory accesses to
> > emulate the performances of SSD disk drives, 10GbE, Infiniband, etc.
>
> You can easily emulate SSD drives by doing sequential 4K reads
> from a normal SATA HD. That should result in ~7-8K IOPS since the disk
> will recognize the sequential stream and read ahead. SAS/SCSI/FC will
> probably work the same way with different IOP rates.

Yeah, probably right. I thought that 10GbE gives the IOMMU more
workload than an SSD does and tried to emulate something like that.


> > I got the following results with one thread issuing 1KB I/Os:
> >
> > IOPS (I/O per second)
> > IOMMU disabled 145253.1 (1.000)
> > RB tree (mainline) 118313.0 (0.814)
> > Bitmap 128954.1 (0.887)
>
> Just to make this clear, this is a 10% performance difference.
>
> But a second metric is more telling: CPU utilization.
> How much time was spent in the IOMMU code for each
> implementation with the same workload?
>
> This isn't a demand for that information but just a request
> to measure that in any future benchmarking.
> oprofile or perfmon2 are the best tools to determine that.

OK, I'll try next time.


> > I've attached the patch to modify Intel IOMMU code to use the bitmap
> > method but I have no intention of arguing that Intel IOMMU code
> > consumes more memory for better performance. :) I want to do more
> > performance tests with 10GbE (probably, I have to wait for a server
> > box having VT-d, which is not available on the market now).
>
>
> > As I said, what I want to do now is to make Intel IOMMU code respect
> > drivers' DMA alignment. Well, it's easier to do that if Intel IOMMU
> > uses the bitmap method since I can simply convert the IOMMU code to
> > use lib/iommu-helper but I can modify the RB tree method too.
>
> Just as important as the allocation data structure is the allocation policy.
> The allocation policy will perform best if it matches the IO TLB
> replacement implemented in the IOMMU HW. Thrashing the IO TLB
> by allocating aliases to competing streams will hurt perf as well.
> Obviously a single benchmark is unlikely to detect this.

Agreed.


> > I'm just interested in other people's opinions on IOMMU
> > implementations, performances, possible future changes for performance
> > improvement, etc.
> >
> > For further information:
> >
> > LSF'08 "Storage Track" summary by Grant Grundler:
> > http://iou.parisc-linux.org/lsf2008/SUMMARY-Storage.txt
> >
> > My LSF'08 slides:
> > http://iou.parisc-linux.org/lsf2008/IO-DMA_Representations-fujita_tomonori.pdf
>
> I personally found this to be one of the more interesting talks :)
> Excellent work!

Thanks! I think that the summary is excellent too :)


> > This patch is against the latest git tree (note that it just converts
> > Intel IOMMU code to use the bitmap. It doesn't make it respect
> > drivers' DMA alignment yet).
> ...
>
>
> > diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
> > index f941f60..41ad545 100644
> > --- a/drivers/pci/dmar.c
> > +++ b/drivers/pci/dmar.c
> > @@ -26,7 +26,6 @@
> >
> > #include <linux/pci.h>
> > #include <linux/dmar.h>
> > -#include "iova.h"
> > #include "intel-iommu.h"
> >
> > #undef PREFIX
> > diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
> > index 66c0fd2..839363a 100644
> > --- a/drivers/pci/intel-iommu.c
> > +++ b/drivers/pci/intel-iommu.c
> > @@ -32,8 +32,7 @@
> > #include <linux/dmar.h>
> > #include <linux/dma-mapping.h>
> > #include <linux/mempool.h>
> > -#include <linux/timer.h>
> > -#include "iova.h"
> > +#include <linux/iommu-helper.h>
> > #include "intel-iommu.h"
> > #include <asm/proto.h> /* force_iommu in this header in x86-64*/
> > #include <asm/cacheflush.h>
> > @@ -51,33 +50,15 @@
> >
> > #define DMAR_OPERATION_TIMEOUT ((cycles_t) tsc_khz*10*1000) /* 10sec */
> >
> > -#define DOMAIN_MAX_ADDR(gaw) ((((u64)1) << gaw) - 1)
> > -
> > -
> > -static void flush_unmaps_timeout(unsigned long data);
> > +#define DMA_ERROR_CODE (~(dma_addr_t)0x0)
> >
> > -DEFINE_TIMER(unmap_timer, flush_unmaps_timeout, 0, 0);
> > +#define DOMAIN_MAX_ADDR(gaw) ((((u64)1) << gaw) - 1)
> >
> > static struct intel_iommu *g_iommus;
> >
> > -#define HIGH_WATER_MARK 250
> > -struct deferred_flush_tables {
> > - int next;
> > - struct iova *iova[HIGH_WATER_MARK];
> > - struct dmar_domain *domain[HIGH_WATER_MARK];
> > -};
>
> Sorry, I didn't see a replacement for the deferred_flush_tables.
> Mark Gross and I agree this substantially helps with unmap performance.
> See http://lkml.org/lkml/2008/3/3/373

Yeah, I can add a nice trick that parisc sba_iommu uses. I'll try next
time.

But it probably gives the bitmap method less gain than the RB tree
since clearing the bitmap takes less time than changing the tree.

The deferred_flush_tables also batches flushing TLB. The patch flushes
TLB only when it reaches the end of the bitmap (it's a trick that some
IOMMUs like SPARC does).
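
Condensed from intel_iommu_area_alloc() in the patch, the pattern is
(only a sketch of the control flow, error handling omitted):

	index = iommu_area_alloc(domain->it_map, limit, domain->start,
				 pages, 0, boundary_size, 0);
	if (index == -1) {
		/* wrapped around: flush the IOTLB once, then retry from 0 */
		if (!intel_iommu_strict)
			iommu_flush_iotlb_dsi(domain->iommu, domain->id, 0);
		index = iommu_area_alloc(domain->it_map, limit, 0, pages,
					 0, boundary_size, 0);
	}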


> ...
> > static void dmar_init_reserved_ranges(void)
> > {
> > struct pci_dev *pdev = NULL;
> > - struct iova *iova;
> > int i;
> > u64 addr, size;
> >
> > - init_iova_domain(&reserved_iova_list, DMA_32BIT_PFN);
> > + reserved_it_size = 1UL << (32 - PAGE_SHIFT_4K);
>
> Can you make reserved_it_size a module or kernel parameter?
>
> I've never been able to come up with a good heuristic
> for determining the size of the IOVA space. It generally
> does NOT need to map all of Host Physical RAM.
> The actual requirement depends entirely on the workload,
> type and number of IO devices installed. The problem is we
> don't know any of those things until well after the IOMMU
> is already needed.

Agreed. VT-d can handle DMA virtual address space larger than 32 bits
but it means that we need more memory for the bitmap. I think that the
majority of systems don't need DMA virtual address space larger than
32 bits. Making it a kernel parameter is a reasonable approach, I
think.
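
Purely as a sketch of what such a knob could look like (the parameter
name and range here are made up, following the existing intel_iommu=
boot option style):

/* hypothetical: "intel_iommu_iova_bits=28" => 256MB of IOVA space */
static int iova_space_bits = 32;	/* default: 4GB */

static int __init intel_iommu_iova_setup(char *str)
{
	int bits;

	if (get_option(&str, &bits) && bits >= 24 && bits <= 48)
		iova_space_bits = bits;
	return 1;
}
__setup("intel_iommu_iova_bits=", intel_iommu_iova_setup);

/* reserved_it_size would then be 1UL << (iova_space_bits - PAGE_SHIFT_4K) */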


> > + reserved_it_map = kzalloc(reserved_it_size / BITS_PER_BYTE, GFP_ATOMIC);
> > + BUG_ON(!reserved_it_map);
> >
> > lockdep_set_class(&reserved_iova_list.iova_alloc_lock,
> > &reserved_alloc_key);
> > lockdep_set_class(&reserved_iova_list.iova_rbtree_lock,
> > &reserved_rbtree_key);
> >
> > + reserve_area(reserved_it_map, 0, IOVA_PFN(IOVA_START_ADDR));
> > +
> > /* IOAPIC ranges shouldn't be accessed by DMA */
> > - iova = reserve_iova(&reserved_iova_list, IOVA_PFN(IOAPIC_RANGE_START),
> > - IOVA_PFN(IOAPIC_RANGE_END));
> > - if (!iova)
> > - printk(KERN_ERR "Reserve IOAPIC range failed\n");
> > + reserve_area(reserved_it_map, IOVA_PFN(IOAPIC_RANGE_START),
> > + IOVA_PFN(IOAPIC_RANGE_END));
> >
> > /* Reserve all PCI MMIO to avoid peer-to-peer access */
> > for_each_pci_dev(pdev) {
>
> I didn't check....but reserving MMIO address space might be better done
> by looking at MMIO ranges routed by all the top level PCI Host-bus controllers
> (aka PCI-e Root ports). Maybe this is an idea for Mark Gross to implement.

Calgary IOMMU code does something similar, but I'm not sure. Anyway,
as you said, it's something Mark might be interested in.


> > -static void domain_reserve_special_ranges(struct dmar_domain *domain)
> > -{
> > - copy_reserved_iova(&reserved_iova_list, &domain->iovad);
> > }
> >
> > static inline int guestwidth_to_adjustwidth(int gaw)
> > @@ -1191,38 +1163,52 @@ static inline int guestwidth_to_adjustwidth(int gaw)
> > static int domain_init(struct dmar_domain *domain, int guest_width)
> > {
> > struct intel_iommu *iommu;
> > - int adjust_width, agaw;
> > + int ret, adjust_width, agaw;
> > unsigned long sagaw;
> >
> > - init_iova_domain(&domain->iovad, DMA_32BIT_PFN);
> > spin_lock_init(&domain->mapping_lock);
> >
> > - domain_reserve_special_ranges(domain);
> > -
> > /* calculate AGAW */
> > iommu = domain->iommu;
> > +
> > if (guest_width > cap_mgaw(iommu->cap))
> > guest_width = cap_mgaw(iommu->cap);
> > domain->gaw = guest_width;
> > adjust_width = guestwidth_to_adjustwidth(guest_width);
> > agaw = width_to_agaw(adjust_width);
> > sagaw = cap_sagaw(iommu->cap);
> > +
> > +/* domain->it_size = 1UL << (guest_width - PAGE_SHIFT_4K); */
> > + domain->it_size = 1UL << (32 - PAGE_SHIFT_4K);
>
> "32-PAGE_SHIFT_4K" expression is used in several places but I didn't see
> an explanation of why 32. Can you add one someplace?

OK, I'll do that next time. Most of them are about the 4GB virtual address
space that the patch uses.


> out of time...

Thanks a lot! I didn't expect this patch to be reviewed. I really
appreciate it.

2008-06-05 18:35:30

by Grant Grundler

Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

On Thu, Jun 5, 2008 at 7:49 AM, FUJITA Tomonori
<[email protected]> wrote:
...
>> You can easily emulate SSD drives by doing sequential 4K reads
>> from a normal SATA HD. That should result in ~7-8K IOPS since the disk
>> will recognize the sequential stream and read ahead. SAS/SCSI/FC will
>> probably work the same way with different IOP rates.
>
> Yeah, probably right. I thought that 10GbE gives the IOMMU more
> workload than an SSD does and tried to emulate something like that.

10GbE might exercise a different code path. NICs typically use map_single
and storage devices typically use map_sg. But they both exercise the same
underlying resource management code since it's the same IOMMU they poke at.
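
(For anyone following along, the two entry points look roughly like this;
both end up in the same allocator. Names like buf, len, sglist and nents
are illustrative only:)

	dma_addr_t nic_style(struct pci_dev *pdev, void *buf, size_t len)
	{
		/* NIC-style: one linear buffer per packet */
		return dma_map_single(&pdev->dev, buf, len, DMA_TO_DEVICE);
	}

	int storage_style(struct pci_dev *pdev, struct scatterlist *sglist,
			  int nents)
	{
		/* storage-style: one scatterlist per request */
		return dma_map_sg(&pdev->dev, sglist, nents, DMA_FROM_DEVICE);
	}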

...
>> Sorry, I didn't see a replacement for the deferred_flush_tables.
>> Mark Gross and I agree this substantially helps with unmap performance.
>> See http://lkml.org/lkml/2008/3/3/373
>
> Yeah, I can add a nice trick that parisc sba_iommu uses. I'll try next
> time.
>
> But it probably gives the bitmap method less gain than the RB tree
> since clearing the bitmap takes less time than changing the tree.
>
> The deferred_flush_tables also batches flushing TLB. The patch flushes
> TLB only when it reaches the end of the bitmap (it's a trick that some
> IOMMUs like SPARC does).

The batching of the TLB flushes is the key thing. I was being paranoid
by not marking the resource free until after the TLB was flushed. If we
know the allocation is going to be circular through the bitmap, flushing
the TLB once per iteration through the bitmap should be sufficient since
we can guarantee the IO Pdir resource won't get re-used until a full
cycle through the bitmap has been completed.

I expect this will work for parisc too and I can test that. Funny that didn't
"click" with me when I originally wrote the parisc code. DaveM had even told
me the SPARC code was only flushing the IOTLB once per iteration.

...
> Agreed. VT-d can handle DMA virtual address space larger than 32 bits
> but it means that we need more memory for the bitmap. I think that the
> majority of systems don't need DMA virtual address space larger than
> 32 bits. Making it a kernel parameter is a reasonable approach, I
> think.

Agreed. It needs a reasonable default and a way to change it at runtime
for odd cases.

...
>> "32-PAGE_SHIFT_4K" expression is used in several places but I didn't see
>> an explanation of why 32. Can you add one someplace?
>
> OK, I'll do that next time. Most of them are about the 4GB virtual address
> space that the patch uses.

thanks! The comment should then explain why 4GB is "reasonable" (vs
1GB for example).

...
> Thanks a lot! I didn't expect this patch to be reviewed. I really
> appreciate it.

very welcome,
grant

2008-06-05 19:01:43

by James Bottomley

Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

On Thu, 2008-06-05 at 11:34 -0700, Grant Grundler wrote:
> On Thu, Jun 5, 2008 at 7:49 AM, FUJITA Tomonori
> <[email protected]> wrote:
> ...
> >> You can easily emulate SSD drives by doing sequential 4K reads
> >> from a normal SATA HD. That should result in ~7-8K IOPS since the disk
> >> will recognize the sequential stream and read ahead. SAS/SCSI/FC will
> >> probably work the same way with different IOP rates.
> >
> > Yeah, probably right. I thought that 10GbE gives the IOMMU more
> > workload than an SSD does and tried to emulate something like that.
>
> 10GbE might exercise a different code path. NICs typically use map_single

map_page, actually, but effectively the same thing. However, all
they're really doing is their own implementation of sg list mapping.

> and storage devices typically use map_sg. But they both exercise the same
> underlying resource management code since it's the same IOMMU they poke at.
>
> ...
> >> Sorry, I didn't see a replacement for the deferred_flush_tables.
> >> Mark Gross and I agree this substantially helps with unmap performance.
> >> See http://lkml.org/lkml/2008/3/3/373
> >
> > Yeah, I can add a nice trick that parisc sba_iommu uses. I'll try next
> > time.
> >
> > But it probably gives the bitmap method less gain than the RB tree
> > since clearing the bitmap takes less time than changing the tree.
> >
> > The deferred_flush_tables also batches flushing TLB. The patch flushes
> > TLB only when it reaches the end of the bitmap (it's a trick that some
> > IOMMUs like SPARC does).
>
> The batching of the TLB flushes is the key thing. I was being paranoid
> by not marking the resource free until after the TLB was flushed. If we
> know the allocation is going to be circular through the bitmap, flushing
> the TLB once per iteration through the bitmap should be sufficient since
> we can guarantee the IO Pdir resource won't get re-used until a full
> cycle through the bitmap has been completed.

Not necessarily ... there's a safety vs performance issue here. As long
as the iotlb mapping persists, the device can use it to write to the
memory. If you fail to flush, you lose the ability to detect device dma
after free (because the iotlb may still be valid). On standard systems,
this happens so infrequently as to be worth the tradeoff. However, in
virtualised systems, which is what the intel iommu is aimed at, stale
iotlb entries can be used by malicious VMs to gain access to memory
outside of their VM, so the intel people at least need to say whether
they're willing to accept this speed for safety tradeoff.

James

2008-06-05 22:03:44

by mark gross

Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

On Wed, Jun 04, 2008 at 11:47:01PM +0900, FUJITA Tomonori wrote:
> I resumed the work to make the IOMMU respect drivers' DMA alignment
> (since I got a desktop box having VT-d). In short, some IOMMUs
> allocate memory areas spanning driver's segment boundary limit (DMA
> alignment). It forces drivers to have a workaround to split up scatter
> entries into smaller chunks again. To remove such workarounds in
> drivers, I modified several IOMMUs, X86_64 (Calgary and Gart), Alpha,
> POWER, PARISC, IA64, SPARC64, and swiotlb.
>
> Now I try to fix Intel IOMMU code, the free space management
> algorithm.
>
> The major difference between Intel IOMMU code and the others is Intel
> IOMMU code uses Red Black tree to manage free space while the others
> use bitmap (swiotlb is the only exception).
>
> The Red Black tree method consumes less memory than the bitmap method,
> but it incurs more overheads (the RB tree method needs to walk through
> the tree, allocate a new item, and insert it every time it maps an
> I/O address). Intel IOMMU (and IOMMUs for virtualization) needs
> multiple IOMMU address spaces. That's why the Red Black tree method is
> chosen, I guess.
>
> Half a year ago, I tried to convert POWER IOMMU code to use the Red
> Black method and saw performance drop:
>
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-11/msg00650.html
>
> So I tried to convert Intel IOMMU code to use the bitmap method to see
> how much I can get.
>
> I didn't see noticeable performance differences with 1GbE. So I tried
> the modified driver of a SCSI HBA that just does memory accesses to
> emulate the performances of SSD disk drives, 10GbE, Infiniband, etc.
>
> I got the following results with one thread issuing 1KB I/Os:
>
> IOPS (I/O per second)
> IOMMU disabled 145253.1 (1.000)
> RB tree (mainline) 118313.0 (0.814)
> Bitmap 128954.1 (0.887)
>

FWIW: You'll see bigger deltas if you boot with intel_iommu=strict, but
those will be because of waiting on IOMMU hardware to flush caches and
may further hide the effects of going with a bitmap as opposed to an RB tree.

>
> I've attached the patch to modify Intel IOMMU code to use the bitmap
> method but I have no intention of arguing that Intel IOMMU code
> consumes more memory for better performance. :) I want to do more
> performance tests with 10GbE (probably, I have to wait for a server
> box having VT-d, which is not available on the market now).
>
> As I said, what I want to do now is to make Intel IOMMU code respect
> drivers' DMA alignment. Well, it's easier to do that if Intel IOMMU
> uses the bitmap method since I can simply convert the IOMMU code to
> use lib/iommu-helper but I can modify the RB tree method too.
>

I'm going to be out of contact for a few weeks but this work sounds
interesting.

> I'm just interested in other people's opinions on IOMMU
> implementations, performances, possible future changes for performance
> improvement, etc.
>
> For further information:
>
> LSF'08 "Storage Track" summary by Grant Grundler:
> http://iou.parisc-linux.org/lsf2008/SUMMARY-Storage.txt
>
> My LSF'08 slides:
> http://iou.parisc-linux.org/lsf2008/IO-DMA_Representations-fujita_tomonori.pdf
>
>
> This patch is against the latest git tree (note that it just converts
> Intel IOMMU code to use the bitmap. It doesn't make it respect
> drivers' DMA alignment yet).
>

I'll look closely at your patch later.

--mgross

2008-06-06 04:44:48

by FUJITA Tomonori

Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

On Thu, 5 Jun 2008 15:02:16 -0700
mark gross <[email protected]> wrote:

> On Wed, Jun 04, 2008 at 11:47:01PM +0900, FUJITA Tomonori wrote:
> > I resumed the work to make the IOMMU respect drivers' DMA alignment
> > (since I got a desktop box having VT-d). In short, some IOMMUs
> > allocate memory areas spanning driver's segment boundary limit (DMA
> > alignment). It forces drivers to have a workaround to split up scatter
> > entries into smaller chunks again. To remove such workarounds in
> > drivers, I modified several IOMMUs, X86_64 (Calgary and Gart), Alpha,
> > POWER, PARISC, IA64, SPARC64, and swiotlb.
> >
> > Now I try to fix Intel IOMMU code, the free space management
> > algorithm.
> >
> > The major difference between Intel IOMMU code and the others is Intel
> > IOMMU code uses Red Black tree to manage free space while the others
> > use bitmap (swiotlb is the only exception).
> >
> > The Red Black tree method consumes less memory than the bitmap method,
> > but it incurs more overheads (the RB tree method needs to walk through
> > the tree, allocate a new item, and insert it every time it maps an
> > I/O address). Intel IOMMU (and IOMMUs for virtualization) needs
> > multiple IOMMU address spaces. That's why the Red Black tree method is
> > chosen, I guess.
> >
> > Half a year ago, I tried to convert POWER IOMMU code to use the Red
> > Black method and saw performance drop:
> >
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-11/msg00650.html
> >
> > So I tried to convert Intel IOMMU code to use the bitmap method to see
> > how much I can get.
> >
> > I didn't see noticeable performance differences with 1GbE. So I tried
> > the modified driver of a SCSI HBA that just does memory accesses to
> > emulate the performances of SSD disk drives, 10GbE, Infiniband, etc.
> >
> > I got the following results with one thread issuing 1KB I/Os:
> >
> > IOPS (I/O per second)
> > IOMMU disabled 145253.1 (1.000)
> > RB tree (mainline) 118313.0 (0.814)
> > Bitmap 128954.1 (0.887)
> >
>
> FWIW: You'll see bigger deltas if you boot with intel_iommu=strict, but
> those will be because of waiting on IOMMU hardware to flush caches and
> may further hide the effects of going with a bitmap as opposed to an RB tree.

Yeah, I know. I'll test the 'intel_iommu=strict' option next time.

The patch also has an 'intel_iommu=strict' option. With it enabled, it
flushes the TLB cache every time dma_unmap_* is called, as the original
code does.


> > I've attached the patch to modify Intel IOMMU code to use the bitmap
> > method but I have no intention of arguing that Intel IOMMU code
> > consumes more memory for better performance. :) I want to do more
> > performance tests with 10GbE (probably, I have to wait for a server
> > box having VT-d, which is not available on the market now).
> >
> > As I said, what I want to do now is to make Intel IOMMU code respect
> > drivers' DMA alignment. Well, it's easier to do that if Intel IOMMU
> > uses the bitmap method since I can simply convert the IOMMU code to
> > use lib/iommu-helper but I can modify the RB tree method too.
> >
>
> I'm going to be out of contact for a few weeks but this work sounds
> interesting.

Why did you choose the RB tree instead of a traditional bitmap scheme
to manage free space?


> > I'm just interested in other people's opinions on IOMMU
> > implementations, performances, possible future changes for performance
> > improvement, etc.
> >
> > For further information:
> >
> > LSF'08 "Storage Track" summary by Grant Grundler:
> > http://iou.parisc-linux.org/lsf2008/SUMMARY-Storage.txt
> >
> > My LSF'08 slides:
> > http://iou.parisc-linux.org/lsf2008/IO-DMA_Representations-fujita_tomonori.pdf
> >
> >
> > This patch is against the latest git tree (note that it just converts
> > Intel IOMMU code to use the bitmap. It doesn't make it respect
> > drivers' DMA alignment yet).
> >
>
> I'll look closely at your patch later.

Thanks a lot!

2008-06-06 04:45:15

by FUJITA Tomonori

Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

On Thu, 05 Jun 2008 14:01:28 -0500
James Bottomley <[email protected]> wrote:

> On Thu, 2008-06-05 at 11:34 -0700, Grant Grundler wrote:
> > On Thu, Jun 5, 2008 at 7:49 AM, FUJITA Tomonori
> > <[email protected]> wrote:
> > ...
> > >> You can easily emulate SSD drives by doing sequential 4K reads
> > >> from a normal SATA HD. That should result in ~7-8K IOPS since the disk
> > >> will recognize the sequential stream and read ahead. SAS/SCSI/FC will
> > >> probably work the same way with different IOP rates.
> > >
> > > Yeah, probabaly right. I thought that 10GbE give the IOMMU more
> > > workloads than SSD does and tried to emulate something like that.
> >
> > 10GbE might exercise a different code path. NICs typically use map_single
>
> map_page, actually, but effectively the same thing. However, all
> they're really doing is their own implementation of sg list mapping.

Yeah, they are nearly the same. map_single allocates only one DMA address,
while map_sg allocates a DMA address again and again.
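
Roughly, the two call patterns look like this from the driver side
(kernel-module context assumed, error handling omitted; as noted above,
both end up in the same IOMMU space-management code underneath):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* NIC-style: one DMA address per packet buffer */
static dma_addr_t map_one_packet(struct device *dev, void *buf, size_t len)
{
	return dma_map_single(dev, buf, len, DMA_TO_DEVICE);
}

/* storage-style: one call maps the whole scatterlist for a request */
static int map_one_request(struct device *dev, struct scatterlist *sgl,
			   int nents)
{
	return dma_map_sg(dev, sgl, nents, DMA_FROM_DEVICE);
}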


> > and storage devices typically use map_sg. But they both exercise the same
> > underlying resource management code since it's the same IOMMU they poke at.
> >
> > ...
> > >> Sorry, I didn't see a replacement for the deferred_flush_tables.
> > >> Mark Gross and I agree this substantially helps with unmap performance.
> > >> See http://lkml.org/lkml/2008/3/3/373
> > >
> > > Yeah, I can add a nice trick in parisc sba_iommu uses. I'll try next
> > > time.
> > >
> > > But it probably gives the bitmap method less gain than the RB tree
> > > since clear the bitmap takes less time than changing the tree.
> > >
> > > The deferred_flush_tables also batches flushing TLB. The patch flushes
> > > TLB only when it reaches the end of the bitmap (it's a trick that some
> > > IOMMUs like SPARC does).
> >
> > The batching of the TLB flushes is the key thing. I was being paranoid
> > by not marking the resource free until after the TLB was flushed. If we
> > know the allocation is going to be circular through the bitmap, flushing
> > the TLB once per iteration through the bitmap should be sufficient since
> > we can guarantee the IO Pdir resource won't get re-used until a full
> > cycle through the bitmap has been completed.
>
> Not necessarily ... there's a safety vs performance issue here. As long
> as the iotlb mapping persists, the device can use it to write to the
> memory. If you fail to flush, you lose the ability to detect device dma
> after free (because the iotlb may still be valid). On standard systems,
> this happens so infrequently as to be worth the tradeoff. However, in
> virtualised systems, which is what the intel iommu is aimed at, stale
> iotlb entries can be used by malicious VMs to gain access to memory
> outside of their VM, so the intel people at least need to say whether
> they're willing to accept this speed for safety tradeoff.

Agreed.

The current Intel IOMMU scheme is a bit unbalanced. It invalidates the
translation table entries every time dma_unmap_* is called, but it
batches the TLB flushes. Then again, that's what most of Linux's IOMMU
code does.

I think that only the PARISC (and IA64, of course) IOMMUs batch the
invalidation of the translation table entries.

2008-06-06 05:48:35

by Grant Grundler

[permalink] [raw]
Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

On Thu, Jun 5, 2008 at 9:44 PM, FUJITA Tomonori
<[email protected]> wrote:
...
> The current Intel IOMMU scheme is a bit unbalanced. It invalidates the
> translation table every time dma_unmap_* is called, but it does the
> batching of the TLB flushes. But it's what the most of Linux's IOMMU
> code does.
>
> I think that only PARISC (and IA64, of course) IOMMUs do the batching
> of invalidating the translation table entries.

1/2 correct. PARISC and IA64 could be the same in this regard but are not.
See where sba_mark_invalid() is called in the respective sba_iommu.c.
PARISC invalidates the IO Pdir entry immediately but batches the
IO TLB shootdown and resource "free". IA64 could (and probably should)
do the same. Added Alex Williamson and Bjorn Helgaas to the CC list.
Not an urgent issue, though, unless they are doing perf measurements
with SSDs or other block devices with equivalent IOPS.
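
A self-contained toy model of that ordering (all names invented, nothing
taken from sba_iommu.c): the IO Pdir entry is cleared immediately, but
the IO TLB shootdown and the resource "free" are batched, and a range
only becomes reusable after the TLB has actually been flushed:

#include <stdio.h>

#define ENTRIES 1024
#define QDEPTH  32

static unsigned long pdir[ENTRIES];    /* toy IO page directory           */
static unsigned char in_use[ENTRIES];  /* toy resource bitmap             */
static unsigned long freeq[QDEPTH];    /* unmapped, waiting to be freed   */
static int nfree;

static void io_tlb_shootdown(void)
{
	printf("IO TLB shootdown covering %d deferred unmaps\n", nfree);
}

static void sba_unmap(unsigned long idx)
{
	pdir[idx] = 0;                 /* IO Pdir entry invalidated at once */
	freeq[nfree++] = idx;          /* but the resource stays "in use"   */

	if (nfree == QDEPTH) {
		io_tlb_shootdown();    /* flush first...                    */
		while (nfree)
			in_use[freeq[--nfree]] = 0;   /* ...then free      */
	}
}

int main(void)
{
	for (unsigned long i = 0; i < ENTRIES; i++) {
		in_use[i] = 1;
		pdir[i] = (i + 1) << 12;   /* pretend mapping */
	}
	for (unsigned long i = 0; i < 2 * QDEPTH; i++)
		sba_unmap(i);
	return 0;
}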

Since parisc-linux is unlikely to ever run VMs and the IOMMU has a
very limited number of IO TLB entries (about 8 or 16), I'm thinking the
batching is a waste of time and parisc should follow the SPARC behavior.
I'll chat more with jejb/kyle/willy about it sometime.

thanks,
grant

2008-06-06 20:23:43

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

On Thu, Jun 05, 2008 at 11:49:12PM +0900, FUJITA Tomonori wrote:

> > I didn't check....but reserving MMIO address space might be better
> > done by looking at MMIO ranges routed by all the top level PCI
> > Host-bus controllers (aka PCI-e Root ports). Maybe this is an idea
> > for Mark Gross to implement.
>
> Calgary IOMMU code does the similar stuff, but I'm not sure. Anyway,
> as you said, it's about what Mark might be interested in.

I'm not sure it's strictly necessary, but we observed Bad Things(TM)
happening when we gave out DMA addresses that corresponded to MMIO
regions and black-listing those regions seemed the safest thing to do.

Cheers,
Muli

2008-06-06 20:22:20

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

On Wed, Jun 04, 2008 at 11:06:15AM -0700, Grant Grundler wrote:
> On Wed, Jun 4, 2008 at 7:47 AM, FUJITA Tomonori
> <[email protected]> wrote:
> ...
> > Now I try to fix Intel IOMMU code, the free space management
> > algorithm.
> >
> > The major difference between Intel IOMMU code and the others is Intel
> > IOMMU code uses Red Black tree to manage free space while the others
> > use bitmap (swiotlb is the only exception).
> >
> > The Red Black tree method consumes less memory than the bitmap method,
> > but it incurs more overheads (the RB tree method needs to walk through
> > the tree, allocates a new item, and insert it every time it maps an
> > I/O address). Intel IOMMU (and IOMMUs for virtualization) needs
> > multiple IOMMU address spaces. That's why the Red Black tree method is
> > chosen, I guess.
>
> It's possible to split up one flat address space and share the IOMMU
> among several users. Each user gets her own segment of bitmap and
> corresponding IO Pdir. So I don't see allocation policy as a strong
> reason to use Red/Black Tree.

Do you mean multiple users sharing the same I/O address space (but
each user using a different segment), or multiple users, each with its
own I/O address space, but only using a specific segment of that
address space and using a single bitmap to represent free space in all
segments? If the former, then you are losing some of the benefit of
the IOMMU since all users can DMA to other users' areas (same I/O
address space). If the latter, having a bitmap per IO address space
seems simpler and would have the same memory consumption.
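
In toy terms (made-up structures, nothing from the actual driver), the
two layouts would look something like:

/* (a) one shared I/O address space; each user owns a segment of it */
struct shared_space {
	unsigned long *bitmap;        /* one bitmap for everybody         */
	unsigned long seg_start[8];   /* per-user segment base            */
	unsigned long seg_len[8];     /* per-user segment length          */
};

/* (b) one I/O address space per user, each with its own bitmap */
struct private_space {
	unsigned long *bitmap;        /* covers only this user's space    */
	unsigned long len;
};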

> > I got the following results with one thread issuing 1KB I/Os:
> >
> > IOPS (I/O per second)
> > IOMMU disabled 145253.1 (1.000)
> > RB tree (mainline) 118313.0 (0.814)
> > Bitmap 128954.1 (0.887)
>
> Just to make this clear, this is a 10% performance difference.
>
> But a second metric is more telling: CPU utilization. How much time
> was spent in the IOMMU code for each implementation with the same
> workload?
>
> This isn't a demand for that information but just a request to
> measure that in any future benchmarking. oprofile or perfmon2 are
> the best tools to determine that.

Agreed, CPU utilization would be very interesting here.

> Just as important as the allocation data structure is the allocation
> policy. The allocation policy will perform best if it matches the
> IO TLB replacement implemented in the IOMMU HW. Thrashing the IO TLB
> by allocating aliases to competing streams will hurt perf as well.
> Obviously a single benchmark is unlikely to detect this.

Is there a public description of the caching policies of currently
available VT-d hardware?

> I've never been able to come up with a good heuristic for
> determining the size of the IOVA space. It generally does NOT need
> to map all of Host Physical RAM. The actual requirement depends
> entirely on the workload, type and number of IO devices
> installed. The problem is we don't know any of those things until
> well after the IOMMU is already needed.

Why not do what hash-table implementations do: start small and resize
when we approach half-full?

Cheers,
Muli

2008-06-06 21:28:56

by Grant Grundler

[permalink] [raw]
Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

On Fri, Jun 6, 2008 at 1:21 PM, Muli Ben-Yehuda <[email protected]> wrote:
....
>> It's possible to split up one flat address space and share the IOMMU
>> among several users. Each user gets her own segment of bitmap and
>> corresponding IO Pdir. So I don't see allocation policy as a strong
>> reason to use Red/Black Tree.
>
> Do you mean multiple users sharing the same I/O address space (but
> each user using a different segment), or multiple users, each with its
> own I/O address space, but only using a specific segment of that
> address space and using a single bitmap to represent free space in all
> segments?

Yes, I meant the former.

> If the former, then you are losing some of the benefit of
> the IOMMU since all users can DMA to other users areas (same I/O
> address space). If the latter, having a bitmap per IO address space
> seems simpler and would have the same memory consumption.


Agreed. It's a trade off.

...
>> I've never been able to come up with a good heuristic for
>> determining the size of the IOVA space. It generally does NOT need
>> to map all of Host Physical RAM. The actual requirement depends
>> entirely on the workload, type and number of IO devices
>> installed. The problem is we don't know any of those things until
>> well after the IOMMU is already needed.
>
> Why not do what hash-tables implementation do, start small and resize
> when we approach half-full?

Historically the IOMMUs needed physically contiguous memory and
resizing essentially meant quiescing all DMA, moving the IO Pdir data
to the new bigger location, allocating a new bitmap and cloning the
state into that as well, and then resuming DMA operations. The DMA
quiesce requirement effectively meant a reboot. My understanding of
VT-d is that ranges can be added one range at a time, and thus it can
be resized easily. But it will mean more complex logic in the IOMMU bitmap
handling for a domain which owns multiple bitmaps and thus a slightly
higher CPU utilization cost. At least that's my guess. I'm not working
on any IOMMU code lately...

thanks,
grant

2008-06-06 21:38:22

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

On Fri, Jun 06, 2008 at 02:28:36PM -0700, Grant Grundler wrote:

> Historically the IOMMUs needed physically contiguous memory and
> resizing essentially meant quiescing all DMA, moving the IO Pdir
> data to the new bigger location, allocating a new bitmap and cloning
> the state into that as well, and then resuming DMA operations. The
> DMA quiesce requirement effectively meant a reboot. My understanding
> of Vt-d is the ranges can be added range at a time and thus can be
> easily resized.

VT-d uses a multi-level page-table, so there would be no problem
resizing.

> But it will mean more complex logic in the IOMMU bitmap handling for
> a domain which owns multiple bitmaps and thus a slightly higher CPU
> utilization cost. At least that's my guess. I'm not working on any
> IOMMU code lately...

The logic seems pretty simple to me: all we need to do is keep track
of how many entries are currently allocated in the bitmap vs. the size
of the bitmap. Once we get to the half-way point, double the size of
the bitmap. Having said that, I'm not sure it's worth it, and have no
plans to implement it :-)
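
A minimal user-space sketch of that bookkeeping (toy code, nothing from
the driver; a real implementation would also have to grow the IO page
table that the bitmap describes):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct iova_space {
	unsigned char *bitmap;   /* one byte per slot to keep the toy simple */
	size_t size;             /* number of slots                          */
	size_t used;             /* slots currently allocated                */
};

static int grow(struct iova_space *s)
{
	unsigned char *bigger = realloc(s->bitmap, s->size * 2);

	if (!bigger)
		return -1;
	memset(bigger + s->size, 0, s->size);
	s->bitmap = bigger;
	s->size *= 2;
	return 0;
}

static long alloc_slot(struct iova_space *s)
{
	size_t i;

	if (s->used * 2 >= s->size && grow(s) < 0)  /* half-full: double it */
		return -1;
	for (i = 0; i < s->size; i++) {
		if (!s->bitmap[i]) {
			s->bitmap[i] = 1;
			s->used++;
			return (long)i;
		}
	}
	return -1;
}

int main(void)
{
	struct iova_space s = { calloc(64, 1), 64, 0 };

	if (!s.bitmap)
		return 1;
	for (int i = 0; i < 200; i++)
		alloc_slot(&s);
	printf("bitmap grew to %zu slots, %zu in use\n", s.size, s.used);
	free(s.bitmap);
	return 0;
}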

Cheers,
Muli

2008-06-06 21:51:48

by Grant Grundler

[permalink] [raw]
Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

On Fri, Jun 6, 2008 at 2:36 PM, Muli Ben-Yehuda <[email protected]> wrote:
...
> The logic seems pretty simple to me, all we need to do is keep track
> of how many entries are currently allocated in the bitmap vs. the size
> of the bitmap. Once we get to the half-way point, double the size of
> the bitmap.

"half-full" would be one hueristic but others might be more suitable
(e.g. wait until it's
nearly full or bitmap search fails).

> Having said that, I'm not sure it's worth it, and have no
> plans to implement it :-)

The historical DMA mapping "failure mode" is a kernel panic. Resizing or
creating a new IOMMU page table would avoid a panic; my guess is people
would prefer _anything_ that avoids a panic. :) Especially in the context of
Virtual Machines and multiple guests.

And I'm not exposed to this problem (yet)... I expect Mark Gross (Intel)
is on the hook for this and has access to all the documentation.

I'll let him thank you for suggesting more work. :P

thanks,
grant

2008-06-09 08:19:02

by Andi Kleen

[permalink] [raw]
Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

"Grant Grundler" <[email protected]> writes:
>
> The historical DMA mapping "failure mode" is a kernel panic. Resizing or

Hasn't been for a long time, except in some extreme cases. All drivers
are expected to check return values for a long time now.

-Andi

2008-06-09 09:37:01

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

On Thu, 5 Jun 2008 22:48:13 -0700
"Grant Grundler" <[email protected]> wrote:

> On Thu, Jun 5, 2008 at 9:44 PM, FUJITA Tomonori
> <[email protected]> wrote:
> ...
> > The current Intel IOMMU scheme is a bit unbalanced. It invalidates the
> > translation table every time dma_unmap_* is called, but it does the
> > batching of the TLB flushes. But it's what the most of Linux's IOMMU
> > code does.
> >
> > I think that only PARISC (and IA64, of course) IOMMUs do the batching
> > of invalidating the translation table entries.
>
> 1/2 correct. PARISC and IA64 could be the same in this regard but are not.
> See where sba_mark_invalid() is called in the respective sba_iommu.c.
> PARISC invalidates the IO Pdir entry immediately but batches the
> IO TLB shootdown and resource "free". IA64 could (and probably should)
> do the same. Added Alex Williamson and Bjorn Helgaas to CC list.
> Not an urgent issue though unless they are doing perf measurements
> with SSDs or other block device with equivalent IOPS.

Oops, thanks.

It seems that IA64 batches sba_mark_invalid, sba_free_range, and the
TLB flush. IA64 and PARISC do look different in this regard.

2008-06-09 09:40:30

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

On Mon, 09 Jun 2008 10:17:11 +0200
Andi Kleen <[email protected]> wrote:

> "Grant Grundler" <[email protected]> writes:
> >
> > The historical DMA mapping "failure mode" is a kernel panic. Resizing or
>
> Hasn't been for a long time, except in some extreme cases. All drivers
> are expected to check return values for a long time now.

Agreed, but I think that lots of network drivers still assume that DMA
mapping always succeeds (they don't check return values).

Most of the SCSI drivers were fixed in this regard. Very old HBA
drivers (which are unlikely to be used with an IOMMU) just
crash. Recent HBA drivers can handle DMA mapping failures (as far
as I know, the only exception is cciss).

Old IOMMU code has the panic "failure mode", but new IOMMU code like
VT-d doesn't crash on failure. Now drivers need to handle DMA mapping
failures properly.
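
For reference, handling the failure in a driver's submit path is only a
couple of lines (sketch only; note that the exact dma_mapping_error()
signature has differed between kernel versions):

#include <linux/dma-mapping.h>
#include <linux/errno.h>

static int queue_buffer(struct device *dev, void *buf, size_t len)
{
	dma_addr_t addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

	if (dma_mapping_error(dev, addr))
		return -ENOMEM;   /* drop/requeue instead of crashing later */

	/* ... hand addr to the hardware ... */
	return 0;
}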

2008-06-09 10:10:54

by Andi Kleen

[permalink] [raw]
Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

On Mon, Jun 09, 2008 at 06:36:29PM +0900, FUJITA Tomonori wrote:
> On Mon, 09 Jun 2008 10:17:11 +0200
> Andi Kleen <[email protected]> wrote:
>
> > "Grant Grundler" <[email protected]> writes:
> > >
> > > The historical DMA mapping "failure mode" is a kernel panic. Resizing or
> >
> > Hasn't been for a long time, except in some extreme cases. All drivers
> > are expected to check return values for a long time now.
>
> Agreed, but I think that lots of network drivers still assume that DMA
> mapping always succeeds (they don't check return values).

Those should just be fixed.

You're right. I reviewed some new drivers and was surprised that they
even got that wrong (code reviewers were supposed to flag this in the
first place).

I would guess in practice the main offenders would be tg3.c/bnx2.c.
On the other hand, Intel IOMMU systems should usually use e1000e, which
checks.

-Andi

2008-06-23 17:54:33

by mark gross

[permalink] [raw]
Subject: Re: Intel IOMMU (and IOMMU for Virtualization) performances

On Fri, Jun 06, 2008 at 01:44:30PM +0900, FUJITA Tomonori wrote:
> On Thu, 5 Jun 2008 15:02:16 -0700
> mark gross <[email protected]> wrote:
>
> > On Wed, Jun 04, 2008 at 11:47:01PM +0900, FUJITA Tomonori wrote:
> > > I resumed the work to make the IOMMU respect drivers' DMA alignment
> > > (since I got a desktop box having VT-d). In short, some IOMMUs
> > > allocate memory areas spanning driver's segment boundary limit (DMA
> > > alignment). It forces drivers to have a workaround to split up scatter
> > > entries into smaller chunks again. To remove such work around in
> > > drivers, I modified several IOMMUs, X86_64 (Calgary and Gart), Alpha,
> > > POWER, PARISC, IA64, SPARC64, and swiotlb.
> > >
> > > Now I try to fix Intel IOMMU code, the free space management
> > > algorithm.
> > >
> > > The major difference between Intel IOMMU code and the others is Intel
> > > IOMMU code uses Red Black tree to manage free space while the others
> > > use bitmap (swiotlb is the only exception).
> > >
> > > The Red Black tree method consumes less memory than the bitmap method,
> > > but it incurs more overheads (the RB tree method needs to walk through
> > > the tree, allocates a new item, and insert it every time it maps an
> > > I/O address). Intel IOMMU (and IOMMUs for virtualization) needs
> > > multiple IOMMU address spaces. That's why the Red Black tree method is
> > > chosen, I guess.
> > >
> > > Half a year ago, I tried to convert POWER IOMMU code to use the Red
> > > Black method and saw performance drop:
> > >
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-11/msg00650.html
> > >
> > > So I tried to convert Intel IOMMU code to use the bitmap method to see
> > > how much I can get.
> > >
> > > I didn't see noticable performance differences with 1GbE. So I tried
> > > the modified driver of a SCSI HBA that just does memory accesses to
> > > emulate the performances of SSD disk drives, 10GbE, Infiniband, etc.
> > >
> > > I got the following results with one thread issuing 1KB I/Os:
> > >
> > > IOPS (I/O per second)
> > > IOMMU disabled 145253.1 (1.000)
> > > RB tree (mainline) 118313.0 (0.814)
> > > Bitmap 128954.1 (0.887)
> > >
> >
> > FWIW: You'll see bigger deltas if you boot with intel_iommu=strict, but
> > those will be because of waiting on IOMMU hardware to flush caches and
> > may further hide effects of gong with a bitmap as opposed to a RB tree.
>
> Yeah, I know. I'll test 'intel_iommu=strict' option next time.
>
> The patch also has 'intel_iommu=strict' option. Wiht it enabled, it
> flushes TLB cache every time dma_unmap_* is called as the original
> code does.
>
>
> > > I've attached the patch to modify Intel IOMMU code to use the bitmap
> > > method but I have no intention of arguing that Intel IOMMU code
> > > consumes more memory for better performance. :) I want to do more
> > > performance tests with 10GbE (probably, I have to wait for a server
> > > box having VT-d, which is not available on the market now).
> > >
> > > As I said, what I want to do now is to make Intel IOMMU code respect
> > > drivers' DMA alignment. Well, it's easier to do that if Intel IOMMU
> > > uses the bitmap method since I can simply convert the IOMMU code to
> > > use lib/iommu-helper but I can modify the RB tree method too.
> > >
> >
> > I'm going to be out of contact for a few weeks but this work sounds
> > interesting.
>
> Why did you choose the RB tree instead of a traditional bitmap scheme
> to manage free space?
>

I inherited this code, and I'm passing it on to David Woodhouse soon.
I don't know why RB was used over BM. I guess for scalability to many
10Gig IO devices, but that's just a guess.

>
> > > I'm just interested in other people's opinions on IOMMU
> > > implementations, performances, possible future changes for performance
> > > improvement, etc.
> > >
> > > For further information:
> > >
> > > LSF'08 "Storage Track" summary by Grant Grundler:
> > > http://iou.parisc-linux.org/lsf2008/SUMMARY-Storage.txt
> > >
> > > My LSF'08 slides:
> > > http://iou.parisc-linux.org/lsf2008/IO-DMA_Representations-fujita_tomonori.pdf
> > >
> > >
> > > Tis patch is against the latst git tree (note that it just converts
> > > Intel IOMMU code to use the bitmap. It doesn't make it respect
> > > drivers' DMA alignment yet).
> > >
> >
> > I'll look closely at your patch later.
>
> Thanks a lot!