2014-04-25 00:37:02

by Sumner, William

Subject: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

The following series implements a fix for a long-standing kdump problem with DMA:
when a kernel panics and boots into the kdump kernel, DMA that was started by
the panicked kernel is not stopped before the kdump kernel is booted, and the
kdump kernel then disables the IOMMU while this DMA continues.
This causes the IOMMU to stop translating the DMA addresses as IOVAs and to
begin treating them as physical memory addresses -- which causes the DMA to either:
1. generate DMAR errors,
2. generate PCI SERR errors, or
3. transfer data to or from incorrect areas of memory.
Often this causes the dump to fail.

This patch set modifies the behavior of the Intel iommu in the crashdump kernel:
1. to accept the iommu hardware in an active state,
2. to leave the current translations in place so that legacy DMA will continue
using its current buffers until the device drivers in the crashdump kernel
initialize their devices,
3. to use different portions of the iova address ranges for the device drivers
in the crashdump kernel than the iova ranges that were in use at the time
of the panic (a rough sketch of this flow follows the list).
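
Conceptually, the crashdump path added by this series boils down to the
following C-style sketch (illustrative only; the function names mirror those
introduced in patches 6-8, and error handling, locking, and the per-iommu
loop are omitted):

	if (is_kdump_kernel()) {
		/* 1. accept the iommu as-is: do not disable translation */

		/* 2. duplicate the panicked kernel's translation tables
		 *    and point this iommu's root entry at the copy
		 */
		intel_iommu_copy_translation_tables(drhd, &root_old_phys,
						    &root_new_virt,
						    g_num_of_iommus);
		iommu->root_entry = root_new_virt;

		/* 3. mark the old kernel's domain-ids as in-use so that
		 *    any newly created domain gets a fresh id
		 */
		intel_iommu_get_dids_from_old_kernel(iommu);

		/* 4. later, when a driver attaches a device, reuse the old
		 *    domain's page tables and reserve its in-flight IOVA
		 *    ranges so new allocations land elsewhere
		 */
	}

The actual changes to init_dmars(), get_domain_for_dev(), and
intel_iommu_init() are in patch 8.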

Advantages of this approach:
1. All manipulation of the IO-device is done by the Linux device-driver
for that device.
2. This approach behaves in a manner very similar to operation without an
active iommu.
3. Any activity between the IO-device and its RMRR areas is handled by the
device-driver in the same manner as during a non-kdump boot.
4. If an IO-device has no driver in the kdump kernel, it is simply left alone.
This supports the practice of creating a special kdump kernel without
drivers for any devices that are not required for taking a crashdump.
5. Minimal code changes to the existing mainline intel-iommu code.

Summary of changes in this patch set:
1. Updated to a more current top-of-tree and merged the code with the
large number of changes that were recently taken into intel-iommu.c
2. Returned to the structure of a patch-set
3. Enabled the intel-iommu driver to consist of multiple *.c files
by moving many of the #defines, prototypes, and inline functions
into a new file: intel-iommu-private.h (the first three patches implement
only this enhancement and could be applied independently of the last five)
4. Moved the new "crashdump fix" code into a new file: intel-iommu-kdump.c
5. Removed the pr_debug constructs from the new code that implements the
"crashdump fix" -- making the code much cleaner and easier to read.
6. Miscellaneous cleanups such as enum-values for return-codes.
7. Simplified the code that retrieves the values needed to initialize a new
domain by using the linked-list of previously-collected values
instead of stepping back into the tree of translation tables.

Bill Sumner (8):
Fix a few existing lines for checkpatch.pl
Consolidate lines for a new private header
Create intel-iommu-private.h
Update iommu_attach_domain() and its callers
Add Items needed for fixing crashdump to private header
Create intel-iommu-kdump.c
Add domain-id functions to intel-iommu-kdump.c
Add remaining changes to intel-iommu.c to fix crashdump

drivers/iommu/Makefile | 2 +-
drivers/iommu/intel-iommu-kdump.c | 629 ++++++++++++++++++++++++++++++++++++
drivers/iommu/intel-iommu-private.h | 457 ++++++++++++++++++++++++++
drivers/iommu/intel-iommu.c | 512 ++++++++---------------------
4 files changed, 1229 insertions(+), 371 deletions(-)
create mode 100644 drivers/iommu/intel-iommu-kdump.c
create mode 100644 drivers/iommu/intel-iommu-private.h

--
Bill Sumner <[email protected]>


2014-04-25 00:38:11

by Sumner, William

Subject: [PATCH 7/8] iommu/vt-d: Add domain-id functions to intel-iommu-kdump.c

Add interfaces for use when a new domain in the crashdump kernel needs
values from the panicked kernel's context entries.

Signed-off-by: Bill Sumner <[email protected]>
---
drivers/iommu/intel-iommu-kdump.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)

diff --git a/drivers/iommu/intel-iommu-kdump.c b/drivers/iommu/intel-iommu-kdump.c
index 4e653e048..f3c777e 100644
--- a/drivers/iommu/intel-iommu-kdump.c
+++ b/drivers/iommu/intel-iommu-kdump.c
@@ -53,6 +53,45 @@ static struct list_head *domain_values_list;


/* ========================================================================
+ * Interfaces for when a new domain in the crashdump kernel needs some
+ * values from the panicked kernel's context entries
+ * ------------------------------------------------------------------------
+ */
+
+
+struct domain_values_entry *intel_iommu_did_to_domain_values_entry(int did,
+ struct intel_iommu *iommu)
+{
+ struct domain_values_entry *dve; /* iterator */
+
+ list_for_each_entry(dve, &domain_values_list[iommu->seq_id], link)
+ if (dve->did == did)
+ return dve;
+ return NULL;
+}
+
+
+/* Mark domain-id's from old kernel as in-use on this iommu so that a new
+ * domain-id is allocated in the case where there is a device in the new kernel
+ * that was not in the old kernel -- and therefore a new domain-id is needed.
+ */
+int intel_iommu_get_dids_from_old_kernel(struct intel_iommu *iommu)
+{
+ struct domain_values_entry *dve; /* iterator */
+
+ pr_info("IOMMU:%d Domain ids from panicked kernel:\n", iommu->seq_id);
+
+ list_for_each_entry(dve, &domain_values_list[iommu->seq_id], link) {
+ set_bit(dve->did, iommu->domain_ids);
+ pr_info("DID did:%d(0x%4.4x)\n", dve->did, dve->did);
+ }
+
+ pr_info("----------------------------------------\n");
+ return 0;
+}
+
+
+/* ========================================================================
* Copy iommu translation tables from old kernel into new kernel.
* Entry to this set of functions is: intel_iommu_copy_translation_tables()
* ------------------------------------------------------------------------
--
Bill Sumner <[email protected]>

2014-04-25 00:38:27

by Sumner, William

Subject: [PATCH 8/8] iommu/vt-d: Changes to support kdump

Add "#include <linux/crash_dump.h>" to gain access to is_kdump_kernel()

Modify the operation of the following functions when called during crash dump:
device_to_domain_id
get_domain_for_dev
init_dmars
intel_iommu_init

Signed-off-by: Bill Sumner <[email protected]>
---
drivers/iommu/intel-iommu.c | 133 +++++++++++++++++++++++++++++++++++++++-----
1 file changed, 120 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 226c1d0..8b3e509 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -46,6 +46,7 @@
#include "irq_remapping.h"
#include "pci.h"
#include "intel-iommu-private.h"
+#include <linux/crash_dump.h>

static LIST_HEAD(dmar_atsr_units);
static LIST_HEAD(dmar_rmrr_units);
@@ -53,6 +54,7 @@ static LIST_HEAD(dmar_rmrr_units);
#define for_each_rmrr_units(rmrr) \
list_for_each_entry(rmrr, &dmar_rmrr_units, list)

+static int intel_iommu_in_crashdump;

static void __init check_tylersburg_isoch(void);

@@ -1796,6 +1798,24 @@ static void domain_remove_dev_info(struct dmar_domain *domain)
spin_unlock_irqrestore(&device_domain_lock, flags);
}

+#ifdef CONFIG_CRASH_DUMP
+static int device_to_domain_id(struct intel_iommu *iommu, u8 bus, u8 devfn)
+{
+ int did = -1; /* domain-id returned */
+ struct root_entry *root;
+ struct context_entry *context;
+ unsigned long flags;
+
+ spin_lock_irqsave(&iommu->lock, flags);
+ root = &iommu->root_entry[bus];
+ context = get_context_addr_from_root(root);
+ if (context && context_present(context+devfn))
+ did = context_get_did(context+devfn);
+ spin_unlock_irqrestore(&iommu->lock, flags);
+ return did;
+}
+#endif /* CONFIG_CRASH_DUMP */
+
/*
* find_domain
* Note: we use struct device->archdata.iommu stores the info
@@ -1880,6 +1900,9 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
unsigned long flags;
u8 bus, devfn, bridge_bus, bridge_devfn;
int did = -1; /* Default to "no domain_id supplied" */
+#ifdef CONFIG_CRASH_DUMP
+ struct domain_values_entry *dve = NULL;
+#endif /* CONFIG_CRASH_DUMP */

domain = find_domain(dev);
if (domain)
@@ -1920,6 +1943,24 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
domain = alloc_domain(false);
if (!domain)
goto error;
+
+#ifdef CONFIG_CRASH_DUMP
+ if (intel_iommu_in_crashdump) {
+ /*
+ * if this device had a did in the old kernel
+ * use its values instead of generating new ones
+ */
+ did = device_to_domain_id(iommu, bus, devfn);
+ if (did > 0 || (did == 0 && !cap_caching_mode(iommu->cap)))
+ dve = intel_iommu_did_to_domain_values_entry(did,
+ iommu);
+ if (dve)
+ gaw = dve->gaw;
+ else
+ did = -1;
+ }
+#endif /* CONFIG_CRASH_DUMP */
+
if (iommu_attach_domain(domain, iommu, did)) {
free_domain_mem(domain);
domain = NULL;
@@ -1929,6 +1970,18 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
if (domain_init(domain, gaw))
goto error;

+#ifdef CONFIG_CRASH_DUMP
+ if (intel_iommu_in_crashdump && dve) {
+
+ if (domain->pgd)
+ free_pgtable_page(domain->pgd);
+
+ domain->pgd = dve->pgd;
+
+ copy_reserved_iova(&dve->iovad, &domain->iovad);
+ }
+#endif /* CONFIG_CRASH_DUMP */
+
/* register pcie-to-pci device */
if (dev_tmp) {
domain = dmar_insert_dev_info(iommu, bridge_bus, bridge_devfn,
@@ -2331,6 +2384,10 @@ static int __init init_dmars(void)
struct device *dev;
struct intel_iommu *iommu;
int i, ret;
+#ifdef CONFIG_CRASH_DUMP
+ struct root_entry *root_old_phys;
+ struct root_entry *root_new_virt;
+#endif /* CONFIG_CRASH_DUMP */

/*
* for each drhd
@@ -2374,16 +2431,40 @@ static int __init init_dmars(void)
if (ret)
goto free_iommu;

- /*
- * TBD:
- * we could share the same root & context tables
- * among all IOMMU's. Need to Split it later.
- */
- ret = iommu_alloc_root_entry(iommu);
- if (ret) {
- printk(KERN_ERR "IOMMU: allocate root entry failed\n");
- goto free_iommu;
+#ifdef CONFIG_CRASH_DUMP
+ if (intel_iommu_in_crashdump) {
+ pr_info("IOMMU Copying translate tables from panicked kernel\n");
+ ret = intel_iommu_copy_translation_tables(drhd,
+ &root_old_phys, &root_new_virt,
+ g_num_of_iommus);
+ if (ret) {
+ pr_err("IOMMU: Copy translate tables failed\n");
+
+ /* Best to stop trying */
+ intel_iommu_in_crashdump = false;
+ goto error;
+ }
+ iommu->root_entry = root_new_virt;
+ pr_info("IOMMU: root_new_virt:0x%12.12llx phys:0x%12.12llx\n",
+ (u64)root_new_virt,
+ virt_to_phys(root_new_virt));
+ intel_iommu_get_dids_from_old_kernel(iommu);
+ } else {
+#endif /* CONFIG_CRASH_DUMP */
+ /*
+ * TBD:
+ * we could share the same root & context tables
+ * among all IOMMU's. Need to Split it later.
+ */
+ ret = iommu_alloc_root_entry(iommu);
+ if (ret) {
+ printk(KERN_ERR "IOMMU: allocate root entry failed\n");
+ goto free_iommu;
+ }
+#ifdef CONFIG_CRASH_DUMP
}
+#endif /* CONFIG_CRASH_DUMP */
+
if (!ecap_pass_through(iommu->ecap))
hw_pass_through = 0;
}
@@ -2442,6 +2523,16 @@ static int __init init_dmars(void)

check_tylersburg_isoch();

+#ifdef CONFIG_CRASH_DUMP
+ /*
+ * In the crashdump kernel: Skip setting-up new domains for
+ * si, rmrr, and the isa bus on the expectation that these
+ * translations were copied from the old kernel.
+ */
+ if (intel_iommu_in_crashdump)
+ goto skip_new_domains_for_si_rmrr_isa;
+#endif /* CONFIG_CRASH_DUMP */
+
/*
* If pass through is not set or not enabled, setup context entries for
* identity mappings for rmrr, gfx, and isa and may fall back to static
@@ -2482,6 +2573,10 @@ static int __init init_dmars(void)

iommu_prepare_isa();

+#ifdef CONFIG_CRASH_DUMP
+skip_new_domains_for_si_rmrr_isa:;
+#endif /* CONFIG_CRASH_DUMP */
+
/*
* for each drhd
* enable fault log
@@ -3624,12 +3719,24 @@ int __init intel_iommu_init(void)
goto out_free_dmar;
}

+#ifdef CONFIG_CRASH_DUMP
/*
- * Disable translation if already enabled prior to OS handover.
+ * If (This is the crash kernel)
+ * Set: copy iommu translate tables from old kernel
+ * Skip disabling the iommu hardware translations
*/
- for_each_active_iommu(iommu, drhd)
- if (iommu->gcmd & DMA_GCMD_TE)
- iommu_disable_translation(iommu);
+ if (is_kdump_kernel()) {
+ intel_iommu_in_crashdump = true;
+ pr_info("IOMMU intel_iommu_in_crashdump = true\n");
+ pr_info("IOMMU Skip disabling iommu hardware translations\n");
+ } else
+#endif /* CONFIG_CRASH_DUMP */
+ /*
+ * Disable translation if already enabled prior to OS handover.
+ */
+ for_each_active_iommu(iommu, drhd)
+ if (iommu->gcmd & DMA_GCMD_TE)
+ iommu_disable_translation(iommu);

if (dmar_dev_scope_init() < 0) {
if (force_on)
--
Bill Sumner <[email protected]>

2014-04-25 00:38:52

by Sumner, William

Subject: [PATCH 6/8] iommu/vt-d: Create intel-iommu-kdump.c

Create intel-iommu-kdump.c
Populate it with support functions to copy iommu translation tables
from the panicked kernel into the kdump kernel in the event of a crash.

Update Makefile to build intel-iommu-kdump.o

Signed-off-by: Bill Sumner <[email protected]>
---
drivers/iommu/Makefile | 2 +-
drivers/iommu/intel-iommu-kdump.c | 590 ++++++++++++++++++++++++++++++++++++++
2 files changed, 591 insertions(+), 1 deletion(-)
create mode 100644 drivers/iommu/intel-iommu-kdump.c

diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 5d58bf1..bd61452 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_AMD_IOMMU) += amd_iommu.o amd_iommu_init.o
obj-$(CONFIG_AMD_IOMMU_V2) += amd_iommu_v2.o
obj-$(CONFIG_ARM_SMMU) += arm-smmu.o
obj-$(CONFIG_DMAR_TABLE) += dmar.o
-obj-$(CONFIG_INTEL_IOMMU) += iova.o intel-iommu.o
+obj-$(CONFIG_INTEL_IOMMU) += iova.o intel-iommu.o intel-iommu-kdump.o
obj-$(CONFIG_IRQ_REMAP) += intel_irq_remapping.o irq_remapping.o
obj-$(CONFIG_OMAP_IOMMU) += omap-iommu.o
obj-$(CONFIG_OMAP_IOMMU) += omap-iommu2.o
diff --git a/drivers/iommu/intel-iommu-kdump.c b/drivers/iommu/intel-iommu-kdump.c
new file mode 100644
index 0000000..4e653e048
--- /dev/null
+++ b/drivers/iommu/intel-iommu-kdump.c
@@ -0,0 +1,590 @@
+/*
+ * Copyright (C) 2014 Hewlett-Packard Development Company, L.P.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * Copyright (C) 2014 Hewlett-Packard Development Company, L.P.
+ * Author: Bill Sumner <[email protected]>
+ */
+#include <linux/init.h>
+#include <linux/bitmap.h>
+#include <linux/debugfs.h>
+#include <linux/export.h>
+#include <linux/slab.h>
+#include <linux/irq.h>
+#include <linux/interrupt.h>
+#include <linux/spinlock.h>
+#include <linux/pci.h>
+#include <linux/dmar.h>
+#include <linux/dma-mapping.h>
+#include <linux/mempool.h>
+#include <linux/timer.h>
+#include <linux/iova.h>
+#include <linux/iommu.h>
+#include <linux/intel-iommu.h>
+#include <linux/syscore_ops.h>
+#include <linux/tboot.h>
+#include <linux/dmi.h>
+#include <linux/pci-ats.h>
+#include <linux/memblock.h>
+#include <asm/irq_remapping.h>
+#include <asm/cacheflush.h>
+#include <linux/iommu.h>
+
+#include "irq_remapping.h"
+#include "pci.h"
+#include "intel-iommu-private.h"
+#include <linux/crash_dump.h>
+
+#ifdef CONFIG_CRASH_DUMP
+
+
+/* Lists of domain_values_entry to hold domain values found during the copy.
+ * One list for each iommu in g_number_of_iommus.
+ */
+static struct list_head *domain_values_list;
+
+
+/* ========================================================================
+ * Copy iommu translation tables from old kernel into new kernel.
+ * Entry to this set of functions is: intel_iommu_copy_translation_tables()
+ * ------------------------------------------------------------------------
+ */
+#define RET_BADCOPY -1 /* Return-code: Cannot copy translate tables */
+
+/*
+ * Copy memory from a physically-addressed area into a virtually-addressed area
+ */
+static int oldcopy(void *to, void *from, int size)
+{
+ size_t ret = 0; /* Length copied */
+ unsigned long pfn; /* Page Frame Number */
+ char *buf = to; /* Adr(Output buffer) */
+ size_t csize = (size_t)size; /* Num(bytes to copy) */
+ unsigned long offset; /* Lower 12 bits of from */
+ int userbuf = 0; /* to is in kernel space */
+
+
+ pfn = ((unsigned long) from) >> VTD_PAGE_SHIFT;
+ offset = ((unsigned long) from) & (~VTD_PAGE_MASK);
+ ret = copy_oldmem_page(pfn, buf, csize, offset, userbuf);
+
+ return (int) ret;
+}
+
+
+
+/*
+ * Struct copy_page_addr_parms is used to allow copy_page_addr()
+ * to accumulate values across multiple calls and returns.
+ */
+struct copy_page_addr_parms {
+ u32 first; /* flag: first-time */
+ u32 last; /* flag: last-time */
+ u32 bus; /* last bus number we saw */
+ u32 devfn; /* last devfn we saw */
+ u32 shift; /* last shift we saw */
+ u64 pte; /* Page Table Entry */
+ u64 next_addr; /* next-expected page_addr */
+
+ u64 page_addr; /* page_addr accumulating size */
+ u64 page_size; /* page_size accumulated */
+
+ struct domain_values_entry *dve; /* to accumulate iova ranges */
+};
+
+/*
+ * constant for initializing instances of copy_page_addr_parms properly.
+ */
+static struct copy_page_addr_parms copy_page_addr_parms_init = {1, 0};
+
+
+
+/*
+ * Lowest-level function in the 'Copy Page Tables' set
+ * Called once for each page_addr present in an iommu page-address table.
+ *
+ * Because of the depth-first traversal of the page-tables by the
+ * higher-level functions that call 'copy_page_addr', all pages
+ * of a domain will be presented in ascending order of IO Virtual Address.
+ *
+ * This function accumulates each contiguous range of these IOVAs and
+ * reserves it within the proper domain in the crashdump kernel when a
+ * non-contiguous range is detected, as determined by any of the following:
+ * 1. a change in the bus or device owning the presented page
+ * 2. a change in the page-size of the presented page (parameter shift)
+ * 3. a change in the page-table entry of the presented page
+ * 4. a presented IOVA that does not match the expected next-page address
+ * 5. the 'last' flag is set, indicating that all IOVAs have been seen.
+ */
+static int copy_page_addr(u64 page_addr, u32 shift, u32 bus, u32 devfn,
+ u64 pte, struct domain_values_entry *dve,
+ void *parms)
+{
+ struct copy_page_addr_parms *ppap = parms;
+
+ u64 page_size = ((u64)1 << shift); /* page_size */
+ u64 pfn_lo; /* For reserving IOVA range */
+ u64 pfn_hi; /* For reserving IOVA range */
+ struct iova *iova_p; /* For reserving IOVA range */
+
+ if (!ppap) {
+ pr_err("ERROR: ppap is NULL: 0x%3.3x(%3.3d) DevFn: 0x%3.3x(%3.3d) Page: 0x%16.16llx Size: 0x%16.16llx(%lld)\n",
+ bus, bus, devfn, devfn, page_addr,
+ page_size, page_size);
+ return 0;
+ }
+
+ /* If (only extending current addr range) */
+ if (ppap->first == 0 &&
+ ppap->last == 0 &&
+ ppap->bus == bus &&
+ ppap->devfn == devfn &&
+ ppap->shift == shift &&
+ (ppap->pte & ~VTD_PAGE_MASK) == (pte & ~VTD_PAGE_MASK) &&
+ ppap->next_addr == page_addr) {
+
+ /* Update page size and next-expected address */
+ ppap->next_addr += page_size;
+ ppap->page_size += page_size;
+ return 0;
+ }
+
+ if (!ppap->first) {
+ /* Close-out the accumulated IOVA address range */
+
+ if (!ppap->dve) {
+ pr_err("%s ERROR: ppap->dve is NULL -- needed to reserve range for B:D:F=%2.2x:%2.2x:%1.1x\n",
+ __func__,
+ ppap->bus, ppap->devfn >> 3, ppap->devfn & 0x7);
+ return RET_BADCOPY;
+ }
+ pfn_lo = IOVA_PFN(ppap->page_addr);
+ pfn_hi = IOVA_PFN(ppap->page_addr + ppap->page_size);
+ iova_p = reserve_iova(&ppap->dve->iovad, pfn_lo, pfn_hi);
+ }
+
+ /* Prepare for a new IOVA address range */
+ ppap->first = 0; /* Not first-time anymore */
+ ppap->bus = bus;
+ ppap->devfn = devfn;
+ ppap->shift = shift;
+ ppap->pte = pte;
+ ppap->next_addr = page_addr + page_size; /* Next-expected page_addr */
+
+ ppap->page_addr = page_addr; /* Addr(new page) */
+ ppap->page_size = page_size; /* Size(new page) */
+
+ ppap->dve = dve; /* adr(device_values_entry for new range) */
+
+ return 0;
+}
+
+/*
+ * Recursive function to copy the tree of page tables (max 6 recursions)
+ * Parameter 'shift' controls the recursion
+ */
+static int copy_page_table(struct dma_pte **dma_pte_new_p,
+ struct dma_pte *dma_pte_phys,
+ u32 shift, u64 page_addr,
+ struct intel_iommu *iommu,
+ u32 bus, u32 devfn,
+ struct domain_values_entry *dve, void *ppap)
+{
+ int ret; /* Integer return code */
+ struct dma_pte *p; /* Physical adr(each entry) iterator */
+ struct dma_pte *pgt_new_virt; /* Adr(dma_pte in new kernel) */
+ struct dma_pte *dma_pte_next; /* Adr(next table down) */
+ u64 u; /* index(each entry in page_table) */
+
+
+ /* If (already done all levels -- problem) */
+ if (shift < 12) {
+ pr_err("ERROR %s shift < 12 %p\n", __func__, dma_pte_phys);
+ pr_err("shift %d, page_addr %16.16llu bus %3.3u devfn %3.3u\n",
+ shift, page_addr, bus, devfn);
+ return RET_BADCOPY;
+ }
+
+ /* allocate a page table in the new kernel
+ * copy contents from old kernel
+ * then update each entry in the table in the new kernel
+ */
+
+ pgt_new_virt = (struct dma_pte *)alloc_pgtable_page(iommu->node);
+ if (!pgt_new_virt)
+ return -ENOMEM;
+
+ ret = oldcopy(pgt_new_virt, dma_pte_phys, VTD_PAGE_SIZE);
+ if (ret <= 0)
+ return ret;
+
+ for (u = 0, p = pgt_new_virt; u < 512; u++, p++) {
+
+ if (((p->val & DMA_PTE_READ) == 0) &&
+ ((p->val & DMA_PTE_WRITE) == 0))
+ continue;
+
+ if (dma_pte_superpage(p) || (shift == 12)) {
+
+ ret = copy_page_addr(page_addr | (u << shift),
+ shift, bus, devfn, p->val, dve, ppap);
+ if (ret)
+ return ret;
+ continue;
+ }
+
+ ret = copy_page_table(&dma_pte_next,
+ (struct dma_pte *)(p->val & VTD_PAGE_MASK),
+ shift-9, page_addr | (u << shift),
+ iommu, bus, devfn, dve, ppap);
+ if (ret)
+ return ret;
+
+ p->val &= ~VTD_PAGE_MASK; /* Clear old and set new pgd */
+ p->val |= ((u64)dma_pte_next & VTD_PAGE_MASK);
+ }
+
+ *dma_pte_new_p = (struct dma_pte *)virt_to_phys(pgt_new_virt);
+ __iommu_flush_cache(iommu, pgt_new_virt, VTD_PAGE_SIZE);
+
+ return 0;
+}
+
+
+/*
+ * Called once for each context_entry found in a copied context_entry_table
+ * Each context_entry represents one PCIe device handled by the IOMMU.
+ *
+ * The 'domain_values_list' contains one 'domain_values_entry' for each
+ * unique domain-id found while copying the context entries for each iommu.
+ *
+ * The Intel-iommu spec. requires that every context_entry that contains
+ * the same domain-id point to the same set of page translation tables.
+ * The hardware uses this to improve the use of its translation cache.
+ * In order to insure that the copied translate tables abide by this
+ * requirement, this function keeps a list of domain-ids (dids) that
+ * have already been seen for this iommu. This function checks each entry
+ * already on the list for a domain-id that matches the domain-id in this
+ * context_entry. If found, this function places the address of the previous
+ * context's tree of page translation tables into this context_entry.
+ * If a matching previous entry is not found, a new 'domain_values_entry'
+ * structure is created for the domain-id in this context_entry and
+ * copy_page_table is called to duplicate its tree of page tables.
+ */
+
+enum returns_from_copy_context_entry {
+RET_CCE_NOT_PRESENT = 1,
+RET_CCE_NEW_PAGE_TABLES,
+RET_CCE_PASS_THROUGH_1,
+RET_CCE_PASS_THROUGH_2,
+RET_CCE_RESERVED_VALUE,
+RET_CCE_PREVIOUS_DID
+};
+static int copy_context_entry(struct intel_iommu *iommu, u32 bus, u32 devfn,
+ void *ppap, struct context_entry *ce)
+{
+ int ret = 0; /* Integer Return Code */
+ u32 shift = 0; /* bits to shift page_addr */
+ u64 page_addr = 0; /* Address of translated page */
+ struct dma_pte *pgt_old_phys; /* Adr(page_table in the old kernel) */
+ struct dma_pte *pgt_new_phys; /* Adr(page_table in the new kernel) */
+ unsigned long asr; /* New asr value for new context */
+ u8 t; /* Translation-type from context */
+ u8 aw; /* Address-width from context */
+ u32 aw_shift[8] = {
+ 12+9+9, /* [000b] 30-bit AGAW (2-level page table) */
+ 12+9+9+9, /* [001b] 39-bit AGAW (3-level page table) */
+ 12+9+9+9+9, /* [010b] 48-bit AGAW (4-level page table) */
+ 12+9+9+9+9+9, /* [011b] 57-bit AGAW (5-level page table) */
+ 12+9+9+9+9+9+9, /* [100b] 64-bit AGAW (6-level page table) */
+ 0, /* [101b] Reserved */
+ 0, /* [110b] Reserved */
+ 0, /* [111b] Reserved */
+ };
+
+ struct domain_values_entry *dve = NULL;
+
+
+ if (!context_get_p(ce)) { /* If (context not present) */
+ ret = RET_CCE_NOT_PRESENT; /* Skip it */
+ goto exit;
+ }
+
+ t = context_get_t(ce);
+
+ /* If we have seen this domain-id before on this iommu,
+ * give this context the same page-tables and we are done.
+ */
+ list_for_each_entry(dve, &domain_values_list[iommu->seq_id], link) {
+ if (dve->did == (int) context_get_did(ce)) {
+ switch (t) {
+ case 0: /* page tables */
+ case 1: /* page tables */
+ asr = virt_to_phys(dve->pgd) >> VTD_PAGE_SHIFT;
+ context_put_asr(ce, asr);
+ ret = RET_CCE_PREVIOUS_DID;
+ break;
+
+ case 2: /* Pass through */
+ if (dve->pgd == NULL)
+ ret = RET_CCE_PASS_THROUGH_2;
+ else
+ ret = RET_BADCOPY;
+ break;
+
+ default: /* Bad value of 't'*/
+ ret = RET_BADCOPY;
+ break;
+ }
+ goto exit;
+ }
+ }
+
+ /* Since we now know that this is a new domain-id for this iommu,
+ * create a new entry, add it to the list, and handle its
+ * page tables.
+ */
+
+ dve = kcalloc(1, sizeof(struct domain_values_entry), GFP_KERNEL);
+ if (!dve) {
+ ret = -ENOMEM;
+ goto exit;
+ }
+
+ dve->did = (int) context_get_did(ce);
+ dve->gaw = (int) agaw_to_width(context_get_aw(ce));
+ dve->pgd = NULL;
+ init_iova_domain(&dve->iovad, DMA_32BIT_PFN);
+
+ list_add(&dve->link, &domain_values_list[iommu->seq_id]);
+
+
+ if (t == 0 || t == 1) { /* If (context has page tables) */
+ aw = context_get_aw(ce);
+ shift = aw_shift[aw];
+
+ pgt_old_phys = (struct dma_pte *)(context_get_asr(ce) << 12);
+
+ ret = copy_page_table(&pgt_new_phys, pgt_old_phys,
+ shift-9, page_addr, iommu, bus, devfn, dve, ppap);
+
+ if (ret) /* if (problem) bail out */
+ goto exit;
+
+ asr = ((unsigned long)(pgt_new_phys)) >> VTD_PAGE_SHIFT;
+ context_put_asr(ce, asr);
+ dve->pgd = phys_to_virt((unsigned long)pgt_new_phys);
+ ret = RET_CCE_NEW_PAGE_TABLES;
+ goto exit;
+ }
+
+ if (t == 2) { /* If (Identity mapped pass-through) */
+ ret = RET_CCE_PASS_THROUGH_1; /* REVISIT: Skip for now */
+ goto exit;
+ }
+
+ ret = RET_CCE_RESERVED_VALUE; /* Else ce->t is a Reserved value */
+ /* Note fall-through */
+
+exit: /* all returns come through here to insure good clean-up */
+ return ret;
+}
+
+
+/*
+ * Called once for each context_entry_table found in the root_entry_table
+ */
+static int copy_context_entry_table(struct intel_iommu *iommu,
+ u32 bus, void *ppap,
+ struct context_entry **context_new_p,
+ struct context_entry *context_old_phys)
+{
+ int ret = 0; /* Integer return code */
+ struct context_entry *ce; /* Iterator */
+ struct context_entry *context_new_phys; /* adr(table in new kernel) */
+ struct context_entry *context_new_virt; /* adr(table in new kernel) */
+ u32 devfn = 0; /* PCI Device & function */
+
+ /* allocate a context-entry table in the new kernel
+ * copy contents from old kernel
+ * then update each entry in the table in the new kernel
+ */
+ context_new_virt =
+ (struct context_entry *)alloc_pgtable_page(iommu->node);
+ if (!context_new_virt)
+ return -ENOMEM;
+
+ context_new_phys =
+ (struct context_entry *)virt_to_phys(context_new_virt);
+
+ oldcopy(context_new_virt, context_old_phys, VTD_PAGE_SIZE);
+
+ for (devfn = 0, ce = context_new_virt; devfn < 256; devfn++, ce++) {
+
+ if (!context_get_p(ce)) /* If (context not present) */
+ continue; /* Skip it */
+
+ ret = copy_context_entry(iommu, bus, devfn, ppap, ce);
+ if (ret < 0) /* if (problem) */
+ return RET_BADCOPY;
+
+ switch (ret) {
+ case RET_CCE_NOT_PRESENT:
+ continue;
+ case RET_CCE_NEW_PAGE_TABLES:
+ continue;
+ case RET_CCE_PASS_THROUGH_1:
+ continue;
+ case RET_CCE_PASS_THROUGH_2:
+ continue;
+ case RET_CCE_RESERVED_VALUE:
+ return RET_BADCOPY;
+ case RET_CCE_PREVIOUS_DID:
+ continue;
+ default:
+ return RET_BADCOPY;
+ };
+ }
+
+ *context_new_p = context_new_phys;
+ __iommu_flush_cache(iommu, context_new_virt, VTD_PAGE_SIZE);
+ return 0;
+}
+
+
+/*
+ * Highest-level function in the 'copy translation tables' set of functions
+ */
+static int copy_root_entry_table(struct intel_iommu *iommu, void *ppap,
+ struct root_entry **root_new_virt_p,
+ struct root_entry *root_old_phys)
+{
+ int ret = 0; /* Integer return code */
+ u32 bus; /* Index: root-entry-table */
+ struct root_entry *re; /* Virt(iterator: new table) */
+ struct root_entry *root_new_virt; /* Virt(table in new kernel) */
+ struct context_entry *context_old_phys; /* Phys(context table entry) */
+ struct context_entry *context_new_phys; /* Phys(new context_entry) */
+
+ /*
+ * allocate a root-entry table in the new kernel
+ * copy contents from old kernel
+ * then update each entry in the table in the new kernel
+ */
+
+ root_new_virt = (struct root_entry *)alloc_pgtable_page(iommu->node);
+ if (!root_new_virt)
+ return -ENOMEM;
+
+ oldcopy(root_new_virt, root_old_phys, VTD_PAGE_SIZE);
+
+ for (bus = 0, re = root_new_virt; bus < 256; bus += 1, re += 1) {
+
+ if (!root_present(re))
+ continue;
+
+ context_old_phys = get_context_phys_from_root(re);
+
+ if (!context_old_phys)
+ continue;
+
+ ret = copy_context_entry_table(iommu, bus, ppap,
+ &context_new_phys,
+ context_old_phys);
+ if (ret)
+ return ret;
+
+ re->val &= ~VTD_PAGE_MASK;
+ set_root_value(re, (unsigned long)context_new_phys);
+ }
+
+ *root_new_virt_p = root_new_virt;
+ __iommu_flush_cache(iommu, root_new_virt, VTD_PAGE_SIZE);
+ return 0;
+}
+
+/*
+ * Interface to the "copy translation tables" set of functions
+ * from mainline code.
+ */
+int intel_iommu_copy_translation_tables(struct dmar_drhd_unit *drhd,
+ struct root_entry **root_old_phys_p,
+ struct root_entry **root_new_virt_p,
+ int g_num_of_iommus)
+{
+ struct intel_iommu *iommu; /* Virt(iommu hardware registers) */
+ unsigned long long q; /* quadword scratch */
+ struct root_entry *root_phys; /* Phys(table in old kernel) */
+ struct root_entry *root_new; /* Virt(table in new kernel) */
+ int ret = 0; /* Integer return code */
+ int i = 0; /* Loop index */
+
+ /* Structure so copy_page_addr() can accumulate things
+ * over multiple calls and returns
+ */
+ struct copy_page_addr_parms ppa_parms = copy_page_addr_parms_init;
+ struct copy_page_addr_parms *ppap = &ppa_parms;
+
+
+ iommu = drhd->iommu;
+ q = readq(iommu->reg + DMAR_RTADDR_REG);
+
+ if (!q)
+ return -1;
+
+ *root_old_phys_p = (struct root_entry *)q; /* Returned to caller */
+
+ /* If (list needs initializing) do it here */
+ if (!domain_values_list) {
+ domain_values_list =
+ kcalloc(g_num_of_iommus, sizeof(struct list_head),
+ GFP_KERNEL);
+
+ if (!domain_values_list) {
+ pr_err("Allocation failed for domain_values_list array\n");
+ return -ENOMEM;
+ }
+ for (i = 0; i < g_num_of_iommus; i++)
+ INIT_LIST_HEAD(&domain_values_list[i]);
+ }
+
+ /* Copy the root-entry table from the old kernel
+ * foreach context_entry_table in root_entry
+ * foreach context_entry in context_entry_table
+ * foreach level-1 page_table_entry in context_entry
+ * foreach level-2 page_table_entry in level 1 page_table_entry
+ * Above pattern continues up to 6 levels of page tables
+ * Sanity-check the entry
+ * Process the bus, devfn, page_address, page_size
+ */
+
+ root_phys = (struct root_entry *)q;
+ ret = copy_root_entry_table(iommu, ppap, &root_new, root_phys);
+ if (ret)
+ return ret;
+
+
+ ppa_parms.last = 1;
+ copy_page_addr(0, 0, 0, 0, 0, NULL, ppap);
+ *root_new_virt_p = root_new; /* Returned to caller */
+
+ /* The translation tables in the new kernel should now contain
+ * the same translations as the tables in the old kernel.
+ * This will allow us to update the iommu hdw to use the new tables.
+ *
+ * NOTE: Neither the iommu hardware nor the iommu->root_entry
+ * struct-value is updated herein.
+ * These are left for the caller to do.
+ */
+
+ return 0;
+}
+#endif /* CONFIG_CRASH_DUMP */
--
Bill Sumner <[email protected]>

2014-04-25 00:38:09

by Sumner, William

Subject: [PATCH 3/8] iommu/vt-d: Create intel-iommu-private.h

Move the single consolidated block of #defines, static inline functions, and
struct definitions from intel-iommu.c to intel-iommu-private.h, and replace
them with #include "intel-iommu-private.h".

This introduces no functional change from current behaviour.

Signed-off-by: Bill Sumner <[email protected]>
---
drivers/iommu/intel-iommu-private.h | 363 ++++++++++++++++++++++++++++++++++++
drivers/iommu/intel-iommu.c | 345 +---------------------------------
2 files changed, 364 insertions(+), 344 deletions(-)
create mode 100644 drivers/iommu/intel-iommu-private.h

diff --git a/drivers/iommu/intel-iommu-private.h b/drivers/iommu/intel-iommu-private.h
new file mode 100644
index 0000000..480399c
--- /dev/null
+++ b/drivers/iommu/intel-iommu-private.h
@@ -0,0 +1,363 @@
+/*
+ * Copyright (c) 2006, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * Copyright (C) 2006-2008 Intel Corporation
+ * Author: Ashok Raj <[email protected]>
+ * Author: Shaohua Li <[email protected]>
+ * Author: Anil S Keshavamurthy <[email protected]>
+ * Author: Fenghua Yu <[email protected]>
+ */
+
+
+#define ROOT_SIZE VTD_PAGE_SIZE
+#define CONTEXT_SIZE VTD_PAGE_SIZE
+
+#define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
+#define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
+#define IS_AZALIA(pdev) ((pdev)->vendor == 0x8086 && (pdev)->device == 0x3a3e)
+
+#define IOAPIC_RANGE_START (0xfee00000)
+#define IOAPIC_RANGE_END (0xfeefffff)
+#define IOVA_START_ADDR (0x1000)
+
+#define DEFAULT_DOMAIN_ADDRESS_WIDTH 48
+
+#define MAX_AGAW_WIDTH 64
+#define MAX_AGAW_PFN_WIDTH (MAX_AGAW_WIDTH - VTD_PAGE_SHIFT)
+
+#define __DOMAIN_MAX_PFN(gaw) ((((uint64_t)1) << (gaw-VTD_PAGE_SHIFT)) - 1)
+#define __DOMAIN_MAX_ADDR(gaw) ((((uint64_t)1) << gaw) - 1)
+
+/* We limit DOMAIN_MAX_PFN to fit in an unsigned long, and DOMAIN_MAX_ADDR
+ to match. That way, we can use 'unsigned long' for PFNs with impunity. */
+#define DOMAIN_MAX_PFN(gaw) ((unsigned long) min_t(uint64_t, \
+ __DOMAIN_MAX_PFN(gaw), (unsigned long)-1))
+#define DOMAIN_MAX_ADDR(gaw) (((uint64_t)__DOMAIN_MAX_PFN(gaw)) << \
+ VTD_PAGE_SHIFT)
+
+#define IOVA_PFN(addr) ((addr) >> PAGE_SHIFT)
+#define DMA_32BIT_PFN IOVA_PFN(DMA_BIT_MASK(32))
+#define DMA_64BIT_PFN IOVA_PFN(DMA_BIT_MASK(64))
+
+/* page table handling */
+#define LEVEL_STRIDE (9)
+#define LEVEL_MASK (((u64)1 << LEVEL_STRIDE) - 1)
+
+/*
+ * This bitmap is used to advertise the page sizes our hardware support
+ * to the IOMMU core, which will then use this information to split
+ * physically contiguous memory regions it is mapping into page sizes
+ * that we support.
+ *
+ * Traditionally the IOMMU core just handed us the mappings directly,
+ * after making sure the size is an order of a 4KiB page and that the
+ * mapping has natural alignment.
+ *
+ * To retain this behavior, we currently advertise that we support
+ * all page sizes that are an order of 4KiB.
+ *
+ * If at some point we'd like to utilize the IOMMU core's new behavior,
+ * we could change this to advertise the real page sizes we support.
+ */
+#define INTEL_IOMMU_PGSIZES (~0xFFFUL)
+
+static inline int agaw_to_level(int agaw)
+{
+ return agaw + 2;
+}
+
+static inline int agaw_to_width(int agaw)
+{
+ return min_t(int, 30 + agaw * LEVEL_STRIDE, MAX_AGAW_WIDTH);
+}
+
+static inline int width_to_agaw(int width)
+{
+ return DIV_ROUND_UP(width - 30, LEVEL_STRIDE);
+}
+
+static inline unsigned int level_to_offset_bits(int level)
+{
+ return (level - 1) * LEVEL_STRIDE;
+}
+
+static inline int pfn_level_offset(unsigned long pfn, int level)
+{
+ return (pfn >> level_to_offset_bits(level)) & LEVEL_MASK;
+}
+
+static inline unsigned long level_mask(int level)
+{
+ return -1UL << level_to_offset_bits(level);
+}
+
+static inline unsigned long level_size(int level)
+{
+ return 1UL << level_to_offset_bits(level);
+}
+
+static inline unsigned long align_to_level(unsigned long pfn, int level)
+{
+ return (pfn + level_size(level) - 1) & level_mask(level);
+}
+
+static inline unsigned long lvl_to_nr_pages(unsigned int lvl)
+{
+ return 1 << min_t(int, (lvl - 1) * LEVEL_STRIDE, MAX_AGAW_PFN_WIDTH);
+}
+
+/* VT-d pages must always be _smaller_ than MM pages. Otherwise things
+ are never going to work. */
+static inline unsigned long dma_to_mm_pfn(unsigned long dma_pfn)
+{
+ return dma_pfn >> (PAGE_SHIFT - VTD_PAGE_SHIFT);
+}
+
+static inline unsigned long mm_to_dma_pfn(unsigned long mm_pfn)
+{
+ return mm_pfn << (PAGE_SHIFT - VTD_PAGE_SHIFT);
+}
+static inline unsigned long page_to_dma_pfn(struct page *pg)
+{
+ return mm_to_dma_pfn(page_to_pfn(pg));
+}
+static inline unsigned long virt_to_dma_pfn(void *p)
+{
+ return page_to_dma_pfn(virt_to_page(p));
+}
+
+
+/*
+ * 0: Present
+ * 1-11: Reserved
+ * 12-63: Context Ptr (12 - (haw-1))
+ * 64-127: Reserved
+ */
+struct root_entry {
+ u64 val;
+ u64 rsvd1;
+};
+#define ROOT_ENTRY_NR (VTD_PAGE_SIZE/sizeof(struct root_entry))
+static inline bool root_present(struct root_entry *root)
+{
+ return root->val & 1;
+}
+static inline void set_root_present(struct root_entry *root)
+{
+ root->val |= 1;
+}
+static inline void set_root_value(struct root_entry *root, unsigned long value)
+{
+ root->val |= value & VTD_PAGE_MASK;
+}
+
+static inline struct context_entry *
+get_context_addr_from_root(struct root_entry *root)
+{
+ return (struct context_entry *)
+ (root_present(root)?phys_to_virt(
+ root->val & VTD_PAGE_MASK) :
+ NULL);
+}
+
+/*
+ * low 64 bits:
+ * 0: present
+ * 1: fault processing disable
+ * 2-3: translation type
+ * 12-63: address space root
+ * high 64 bits:
+ * 0-2: address width
+ * 3-6: aval
+ * 8-23: domain id
+ */
+struct context_entry {
+ u64 lo;
+ u64 hi;
+};
+
+static inline bool context_present(struct context_entry *context)
+{
+ return context->lo & 1;
+}
+static inline void context_set_present(struct context_entry *context)
+{
+ context->lo |= 1;
+}
+
+static inline void context_set_fault_enable(struct context_entry *context)
+{
+ context->lo &= (((u64)-1) << 2) | 1;
+}
+
+static inline void context_set_translation_type(struct context_entry *context,
+ unsigned long value)
+{
+ context->lo &= (((u64)-1) << 4) | 3;
+ context->lo |= (value & 3) << 2;
+}
+
+static inline void context_set_address_root(struct context_entry *context,
+ unsigned long value)
+{
+ context->lo |= value & VTD_PAGE_MASK;
+}
+
+static inline void context_set_address_width(struct context_entry *context,
+ unsigned long value)
+{
+ context->hi |= value & 7;
+}
+
+static inline void context_set_domain_id(struct context_entry *context,
+ unsigned long value)
+{
+ context->hi |= (value & ((1 << 16) - 1)) << 8;
+}
+
+static inline void context_clear_entry(struct context_entry *context)
+{
+ context->lo = 0;
+ context->hi = 0;
+}
+
+/*
+ * 0: readable
+ * 1: writable
+ * 2-6: reserved
+ * 7: super page
+ * 8-10: available
+ * 11: snoop behavior
+ * 12-63: Host physcial address
+ */
+struct dma_pte {
+ u64 val;
+};
+
+static inline void dma_clear_pte(struct dma_pte *pte)
+{
+ pte->val = 0;
+}
+
+static inline u64 dma_pte_addr(struct dma_pte *pte)
+{
+#ifdef CONFIG_64BIT
+ return pte->val & VTD_PAGE_MASK;
+#else
+ /* Must have a full atomic 64-bit read */
+ return __cmpxchg64(&pte->val, 0ULL, 0ULL) & VTD_PAGE_MASK;
+#endif
+}
+
+static inline bool dma_pte_present(struct dma_pte *pte)
+{
+ return (pte->val & 3) != 0;
+}
+
+static inline bool dma_pte_superpage(struct dma_pte *pte)
+{
+ return pte->val & (1 << 7);
+}
+
+static inline int first_pte_in_page(struct dma_pte *pte)
+{
+ return !((unsigned long)pte & ~VTD_PAGE_MASK);
+}
+
+/* devices under the same p2p bridge are owned in one domain */
+#define DOMAIN_FLAG_P2P_MULTIPLE_DEVICES (1 << 0)
+
+/* domain represents a virtual machine, more than one devices
+ * across iommus may be owned in one domain, e.g. kvm guest.
+ */
+#define DOMAIN_FLAG_VIRTUAL_MACHINE (1 << 1)
+
+/* si_domain contains mulitple devices */
+#define DOMAIN_FLAG_STATIC_IDENTITY (1 << 2)
+
+/* define the limit of IOMMUs supported in each domain */
+#ifdef CONFIG_X86
+# define IOMMU_UNITS_SUPPORTED MAX_IO_APICS
+#else
+# define IOMMU_UNITS_SUPPORTED 64
+#endif
+
+struct dmar_domain {
+ int id; /* domain id */
+ int nid; /* node id */
+ DECLARE_BITMAP(iommu_bmp, IOMMU_UNITS_SUPPORTED);
+ /* bitmap of iommus this domain uses*/
+
+ struct list_head devices; /* all devices' list */
+ struct iova_domain iovad; /* iova's that belong to this domain */
+
+ struct dma_pte *pgd; /* virtual address */
+ int gaw; /* max guest address width */
+
+ /* adjusted guest address width, 0 is level 2 30-bit */
+ int agaw;
+
+ int flags; /* flags to find out type of domain */
+
+ int iommu_coherency;/* indicate coherency of iommu access */
+ int iommu_snooping; /* indicate snooping control feature*/
+ int iommu_count; /* reference count of iommu */
+ int iommu_superpage;/* Level of superpages supported:
+ 0 == 4KiB (no superpages), 1 == 2MiB,
+ 2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
+ spinlock_t iommu_lock; /* protect iommu set in domain */
+ u64 max_addr; /* maximum mapped address */
+};
+
+/* PCI domain-device relationship */
+struct device_domain_info {
+ struct list_head link; /* link to domain siblings */
+ struct list_head global; /* link to global list */
+ u8 bus; /* PCI bus number */
+ u8 devfn; /* PCI devfn number */
+ struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
+ struct intel_iommu *iommu; /* IOMMU used by this device */
+ struct dmar_domain *domain; /* pointer to domain */
+};
+
+struct dmar_rmrr_unit {
+ struct list_head list; /* list of rmrr units */
+ struct acpi_dmar_header *hdr; /* ACPI header */
+ u64 base_address; /* reserved base address*/
+ u64 end_address; /* reserved end address */
+ struct dmar_dev_scope *devices; /* target devices */
+ int devices_cnt; /* target device count */
+};
+
+struct dmar_atsr_unit {
+ struct list_head list; /* list of ATSR units */
+ struct acpi_dmar_header *hdr; /* ACPI header */
+ struct dmar_dev_scope *devices; /* target devices */
+ int devices_cnt; /* target device count */
+ u8 include_all:1; /* include all ports */
+};
+
+static inline void *alloc_pgtable_page(int node)
+{
+ struct page *page;
+ void *vaddr = NULL;
+
+ page = alloc_pages_node(node, GFP_ATOMIC | __GFP_ZERO, 0);
+ if (page)
+ vaddr = page_address(page);
+ return vaddr;
+}
+
+static inline void free_pgtable_page(void *vaddr)
+{
+ free_page((unsigned long)vaddr);
+}
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 49fdac5..4116377 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -45,350 +45,7 @@

#include "irq_remapping.h"
#include "pci.h"
-
-#define ROOT_SIZE VTD_PAGE_SIZE
-#define CONTEXT_SIZE VTD_PAGE_SIZE
-
-#define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
-#define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
-#define IS_AZALIA(pdev) ((pdev)->vendor == 0x8086 && (pdev)->device == 0x3a3e)
-
-#define IOAPIC_RANGE_START (0xfee00000)
-#define IOAPIC_RANGE_END (0xfeefffff)
-#define IOVA_START_ADDR (0x1000)
-
-#define DEFAULT_DOMAIN_ADDRESS_WIDTH 48
-
-#define MAX_AGAW_WIDTH 64
-#define MAX_AGAW_PFN_WIDTH (MAX_AGAW_WIDTH - VTD_PAGE_SHIFT)
-
-#define __DOMAIN_MAX_PFN(gaw) ((((uint64_t)1) << (gaw-VTD_PAGE_SHIFT)) - 1)
-#define __DOMAIN_MAX_ADDR(gaw) ((((uint64_t)1) << gaw) - 1)
-
-/* We limit DOMAIN_MAX_PFN to fit in an unsigned long, and DOMAIN_MAX_ADDR
- to match. That way, we can use 'unsigned long' for PFNs with impunity. */
-#define DOMAIN_MAX_PFN(gaw) ((unsigned long) min_t(uint64_t, \
- __DOMAIN_MAX_PFN(gaw), (unsigned long)-1))
-#define DOMAIN_MAX_ADDR(gaw) (((uint64_t)__DOMAIN_MAX_PFN(gaw)) << \
- VTD_PAGE_SHIFT)
-
-#define IOVA_PFN(addr) ((addr) >> PAGE_SHIFT)
-#define DMA_32BIT_PFN IOVA_PFN(DMA_BIT_MASK(32))
-#define DMA_64BIT_PFN IOVA_PFN(DMA_BIT_MASK(64))
-
-/* page table handling */
-#define LEVEL_STRIDE (9)
-#define LEVEL_MASK (((u64)1 << LEVEL_STRIDE) - 1)
-
-/*
- * This bitmap is used to advertise the page sizes our hardware support
- * to the IOMMU core, which will then use this information to split
- * physically contiguous memory regions it is mapping into page sizes
- * that we support.
- *
- * Traditionally the IOMMU core just handed us the mappings directly,
- * after making sure the size is an order of a 4KiB page and that the
- * mapping has natural alignment.
- *
- * To retain this behavior, we currently advertise that we support
- * all page sizes that are an order of 4KiB.
- *
- * If at some point we'd like to utilize the IOMMU core's new behavior,
- * we could change this to advertise the real page sizes we support.
- */
-#define INTEL_IOMMU_PGSIZES (~0xFFFUL)
-
-static inline int agaw_to_level(int agaw)
-{
- return agaw + 2;
-}
-
-static inline int agaw_to_width(int agaw)
-{
- return min_t(int, 30 + agaw * LEVEL_STRIDE, MAX_AGAW_WIDTH);
-}
-
-static inline int width_to_agaw(int width)
-{
- return DIV_ROUND_UP(width - 30, LEVEL_STRIDE);
-}
-
-static inline unsigned int level_to_offset_bits(int level)
-{
- return (level - 1) * LEVEL_STRIDE;
-}
-
-static inline int pfn_level_offset(unsigned long pfn, int level)
-{
- return (pfn >> level_to_offset_bits(level)) & LEVEL_MASK;
-}
-
-static inline unsigned long level_mask(int level)
-{
- return -1UL << level_to_offset_bits(level);
-}
-
-static inline unsigned long level_size(int level)
-{
- return 1UL << level_to_offset_bits(level);
-}
-
-static inline unsigned long align_to_level(unsigned long pfn, int level)
-{
- return (pfn + level_size(level) - 1) & level_mask(level);
-}
-
-static inline unsigned long lvl_to_nr_pages(unsigned int lvl)
-{
- return 1 << min_t(int, (lvl - 1) * LEVEL_STRIDE, MAX_AGAW_PFN_WIDTH);
-}
-
-/* VT-d pages must always be _smaller_ than MM pages. Otherwise things
- are never going to work. */
-static inline unsigned long dma_to_mm_pfn(unsigned long dma_pfn)
-{
- return dma_pfn >> (PAGE_SHIFT - VTD_PAGE_SHIFT);
-}
-
-static inline unsigned long mm_to_dma_pfn(unsigned long mm_pfn)
-{
- return mm_pfn << (PAGE_SHIFT - VTD_PAGE_SHIFT);
-}
-static inline unsigned long page_to_dma_pfn(struct page *pg)
-{
- return mm_to_dma_pfn(page_to_pfn(pg));
-}
-static inline unsigned long virt_to_dma_pfn(void *p)
-{
- return page_to_dma_pfn(virt_to_page(p));
-}
-
-
-/*
- * 0: Present
- * 1-11: Reserved
- * 12-63: Context Ptr (12 - (haw-1))
- * 64-127: Reserved
- */
-struct root_entry {
- u64 val;
- u64 rsvd1;
-};
-#define ROOT_ENTRY_NR (VTD_PAGE_SIZE/sizeof(struct root_entry))
-static inline bool root_present(struct root_entry *root)
-{
- return root->val & 1;
-}
-static inline void set_root_present(struct root_entry *root)
-{
- root->val |= 1;
-}
-static inline void set_root_value(struct root_entry *root, unsigned long value)
-{
- root->val |= value & VTD_PAGE_MASK;
-}
-
-static inline struct context_entry *
-get_context_addr_from_root(struct root_entry *root)
-{
- return (struct context_entry *)
- (root_present(root)?phys_to_virt(
- root->val & VTD_PAGE_MASK) :
- NULL);
-}
-
-/*
- * low 64 bits:
- * 0: present
- * 1: fault processing disable
- * 2-3: translation type
- * 12-63: address space root
- * high 64 bits:
- * 0-2: address width
- * 3-6: aval
- * 8-23: domain id
- */
-struct context_entry {
- u64 lo;
- u64 hi;
-};
-
-static inline bool context_present(struct context_entry *context)
-{
- return context->lo & 1;
-}
-static inline void context_set_present(struct context_entry *context)
-{
- context->lo |= 1;
-}
-
-static inline void context_set_fault_enable(struct context_entry *context)
-{
- context->lo &= (((u64)-1) << 2) | 1;
-}
-
-static inline void context_set_translation_type(struct context_entry *context,
- unsigned long value)
-{
- context->lo &= (((u64)-1) << 4) | 3;
- context->lo |= (value & 3) << 2;
-}
-
-static inline void context_set_address_root(struct context_entry *context,
- unsigned long value)
-{
- context->lo |= value & VTD_PAGE_MASK;
-}
-
-static inline void context_set_address_width(struct context_entry *context,
- unsigned long value)
-{
- context->hi |= value & 7;
-}
-
-static inline void context_set_domain_id(struct context_entry *context,
- unsigned long value)
-{
- context->hi |= (value & ((1 << 16) - 1)) << 8;
-}
-
-static inline void context_clear_entry(struct context_entry *context)
-{
- context->lo = 0;
- context->hi = 0;
-}
-
-/*
- * 0: readable
- * 1: writable
- * 2-6: reserved
- * 7: super page
- * 8-10: available
- * 11: snoop behavior
- * 12-63: Host physcial address
- */
-struct dma_pte {
- u64 val;
-};
-
-static inline void dma_clear_pte(struct dma_pte *pte)
-{
- pte->val = 0;
-}
-
-static inline u64 dma_pte_addr(struct dma_pte *pte)
-{
-#ifdef CONFIG_64BIT
- return pte->val & VTD_PAGE_MASK;
-#else
- /* Must have a full atomic 64-bit read */
- return __cmpxchg64(&pte->val, 0ULL, 0ULL) & VTD_PAGE_MASK;
-#endif
-}
-
-static inline bool dma_pte_present(struct dma_pte *pte)
-{
- return (pte->val & 3) != 0;
-}
-
-static inline bool dma_pte_superpage(struct dma_pte *pte)
-{
- return pte->val & (1 << 7);
-}
-
-static inline int first_pte_in_page(struct dma_pte *pte)
-{
- return !((unsigned long)pte & ~VTD_PAGE_MASK);
-}
-
-/* devices under the same p2p bridge are owned in one domain */
-#define DOMAIN_FLAG_P2P_MULTIPLE_DEVICES (1 << 0)
-
-/* domain represents a virtual machine, more than one devices
- * across iommus may be owned in one domain, e.g. kvm guest.
- */
-#define DOMAIN_FLAG_VIRTUAL_MACHINE (1 << 1)
-
-/* si_domain contains mulitple devices */
-#define DOMAIN_FLAG_STATIC_IDENTITY (1 << 2)
-
-/* define the limit of IOMMUs supported in each domain */
-#ifdef CONFIG_X86
-# define IOMMU_UNITS_SUPPORTED MAX_IO_APICS
-#else
-# define IOMMU_UNITS_SUPPORTED 64
-#endif
-
-struct dmar_domain {
- int id; /* domain id */
- int nid; /* node id */
- DECLARE_BITMAP(iommu_bmp, IOMMU_UNITS_SUPPORTED);
- /* bitmap of iommus this domain uses*/
-
- struct list_head devices; /* all devices' list */
- struct iova_domain iovad; /* iova's that belong to this domain */
-
- struct dma_pte *pgd; /* virtual address */
- int gaw; /* max guest address width */
-
- /* adjusted guest address width, 0 is level 2 30-bit */
- int agaw;
-
- int flags; /* flags to find out type of domain */
-
- int iommu_coherency;/* indicate coherency of iommu access */
- int iommu_snooping; /* indicate snooping control feature*/
- int iommu_count; /* reference count of iommu */
- int iommu_superpage;/* Level of superpages supported:
- 0 == 4KiB (no superpages), 1 == 2MiB,
- 2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
- spinlock_t iommu_lock; /* protect iommu set in domain */
- u64 max_addr; /* maximum mapped address */
-};
-
-/* PCI domain-device relationship */
-struct device_domain_info {
- struct list_head link; /* link to domain siblings */
- struct list_head global; /* link to global list */
- u8 bus; /* PCI bus number */
- u8 devfn; /* PCI devfn number */
- struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
- struct intel_iommu *iommu; /* IOMMU used by this device */
- struct dmar_domain *domain; /* pointer to domain */
-};
-
-struct dmar_rmrr_unit {
- struct list_head list; /* list of rmrr units */
- struct acpi_dmar_header *hdr; /* ACPI header */
- u64 base_address; /* reserved base address*/
- u64 end_address; /* reserved end address */
- struct dmar_dev_scope *devices; /* target devices */
- int devices_cnt; /* target device count */
-};
-
-struct dmar_atsr_unit {
- struct list_head list; /* list of ATSR units */
- struct acpi_dmar_header *hdr; /* ACPI header */
- struct dmar_dev_scope *devices; /* target devices */
- int devices_cnt; /* target device count */
- u8 include_all:1; /* include all ports */
-};
-
-static inline void *alloc_pgtable_page(int node)
-{
- struct page *page;
- void *vaddr = NULL;
-
- page = alloc_pages_node(node, GFP_ATOMIC | __GFP_ZERO, 0);
- if (page)
- vaddr = page_address(page);
- return vaddr;
-}
-
-static inline void free_pgtable_page(void *vaddr)
-{
- free_page((unsigned long)vaddr);
-}
+#include "intel-iommu-private.h"

static LIST_HEAD(dmar_atsr_units);
static LIST_HEAD(dmar_rmrr_units);
--
Bill Sumner <[email protected]>

2014-04-25 00:38:07

by Sumner, William

Subject: [PATCH 5/8] iommu/vt-d: Items required for kdump

Add data declarations and prototypes needed for kdump.

Signed-off-by: Bill Sumner <[email protected]>
---
drivers/iommu/intel-iommu-private.h | 94 +++++++++++++++++++++++++++++++++++++
1 file changed, 94 insertions(+)

diff --git a/drivers/iommu/intel-iommu-private.h b/drivers/iommu/intel-iommu-private.h
index 480399c..3f20234 100644
--- a/drivers/iommu/intel-iommu-private.h
+++ b/drivers/iommu/intel-iommu-private.h
@@ -361,3 +361,97 @@ static inline void free_pgtable_page(void *vaddr)
{
free_page((unsigned long)vaddr);
}
+
+#ifdef CONFIG_CRASH_DUMP
+/*
+ * Fix Crashdump failure caused by leftover DMA through a hardware IOMMU
+ *
+ * Fixes the crashdump kernel to deal with an active iommu and legacy
+ * DMA from the (old) panicked kernel in a manner similar to how legacy
+ * DMA is handled when no hardware iommu was in use by the old kernel --
+ * allow the legacy DMA to continue into its current buffers.
+ *
+ * In the crashdump kernel, this code:
+ * 1. skips disabling the IOMMU's translating of IO Virtual Addresses (IOVA)
+ * 2. leaves the current translations in-place so that legacy DMA will
+ * continue to use its current buffers,
+ * 3. allocates to the device drivers in the crashdump kernel
+ * portions of the iova address ranges that are different
+ * from the iova address ranges that were being used by the old kernel
+ * at the time of the panic.
+ *
+ */
+
+struct domain_values_entry {
+ struct list_head link; /* link entries into a list */
+ struct iova_domain iovad; /* iova's that belong to this domain */
+ struct dma_pte *pgd; /* virtual address */
+ int did; /* domain id */
+ int gaw; /* max guest address width */
+ int iommu_superpage; /* Level of superpages supported:
+ 0 == 4KiB (no superpages), 1 == 2MiB,
+ 2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
+};
+
+
+
+/* ------------------------------------------------------------------------
+ * Prototypes of interface functions
+ * ------------------------------------------------------------------------
+ */
+
+extern int
+intel_iommu_copy_translation_tables(struct dmar_drhd_unit *drhd,
+ struct root_entry **root_old_p, struct root_entry **root_new_p,
+ int g_num_of_iommus);
+
+extern int
+intel_iommu_get_dids_from_old_kernel(struct intel_iommu *iommu);
+
+extern struct domain_values_entry *
+intel_iommu_did_to_domain_values_entry(int did, struct intel_iommu *iommu);
+
+
+/* ------------------------------------------------------------------------
+ * Utility functions for accessing the iommu Translation Tables
+ * ------------------------------------------------------------------------
+ */
+
+static inline struct context_entry *
+get_context_phys_from_root(struct root_entry *root)
+{
+ return (struct context_entry *)
+ (root_present(root) ? (void *) (root->val & VTD_PAGE_MASK)
+ : NULL);
+}
+
+static inline int
+context_get_p(struct context_entry *c) {return((c->lo >> 0) & 0x1); }
+
+static inline int
+context_get_fpdi(struct context_entry *c) {return((c->lo >> 1) & 0x1); }
+
+static inline int
+context_get_t(struct context_entry *c) {return((c->lo >> 2) & 0x3); }
+
+static inline u64
+context_get_asr(struct context_entry *c) {return((c->lo >> 12)); }
+
+static inline int
+context_get_aw(struct context_entry *c) {return((c->hi >> 0) & 0x7); }
+
+static inline int
+context_get_aval(struct context_entry *c) {return((c->hi >> 3) & 0xf); }
+
+static inline int
+context_get_did(struct context_entry *c) {return((c->hi >> 8) & 0xffff); }
+
+static inline void
+context_put_asr(struct context_entry *c, unsigned long asr)
+{
+ c->lo &= (~VTD_PAGE_MASK);
+ c->lo |= (asr << VTD_PAGE_SHIFT);
+}
+
+#endif /* CONFIG_CRASH_DUMP */
+
--
Bill Sumner <[email protected]>

2014-04-25 00:38:04

by Sumner, William

Subject: [PATCH 2/8] iommu/vt-d: Consolidate lines for a new private header

In intel-iommu.c, move downward the few lines near the
front that should not move to an intel-iommu-private.h
file (mostly data-item definitions). Also move upward
a couple of inline functions that allocate and free pages.
This leaves, at the front of the file, a consolidated block
of the lines that would move to an intel-iommu-private.h file.

Signed-off-by: Bill Sumner <[email protected]>
---
drivers/iommu/intel-iommu.c | 74 +++++++++++++++++++++++----------------------
1 file changed, 38 insertions(+), 36 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index f80b4bb..49fdac5 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -163,17 +163,6 @@ static inline unsigned long virt_to_dma_pfn(void *p)
return page_to_dma_pfn(virt_to_page(p));
}

-/* global iommu list, set NULL for ignored DMAR units */
-static struct intel_iommu **g_iommus;
-
-static void __init check_tylersburg_isoch(void);
-static int rwbf_quirk;
-
-/*
- * set to 1 to panic kernel if can't successfully enable VT-d
- * (used when kernel is launched w/ TXT)
- */
-static int force_on;

/*
* 0: Present
@@ -312,15 +301,6 @@ static inline int first_pte_in_page(struct dma_pte *pte)
return !((unsigned long)pte & ~VTD_PAGE_MASK);
}

-/*
- * This domain is a statically identity mapping domain.
- * 1. This domain creats a static 1:1 mapping to all usable memory.
- * 2. It maps to each iommu if successful.
- * 3. Each iommu mapps to this domain if successful.
- */
-static struct dmar_domain *si_domain;
-static int hw_pass_through = 1;
-
/* devices under the same p2p bridge are owned in one domain */
#define DOMAIN_FLAG_P2P_MULTIPLE_DEVICES (1 << 0)

@@ -394,12 +374,50 @@ struct dmar_atsr_unit {
u8 include_all:1; /* include all ports */
};

+static inline void *alloc_pgtable_page(int node)
+{
+ struct page *page;
+ void *vaddr = NULL;
+
+ page = alloc_pages_node(node, GFP_ATOMIC | __GFP_ZERO, 0);
+ if (page)
+ vaddr = page_address(page);
+ return vaddr;
+}
+
+static inline void free_pgtable_page(void *vaddr)
+{
+ free_page((unsigned long)vaddr);
+}
+
static LIST_HEAD(dmar_atsr_units);
static LIST_HEAD(dmar_rmrr_units);

#define for_each_rmrr_units(rmrr) \
list_for_each_entry(rmrr, &dmar_rmrr_units, list)

+
+static void __init check_tylersburg_isoch(void);
+
+/* global iommu list, set NULL for ignored DMAR units */
+static struct intel_iommu **g_iommus;
+static int rwbf_quirk;
+
+/*
+ * set to 1 to panic kernel if can't successfully enable VT-d
+ * (used when kernel is launched w/ TXT)
+ */
+static int force_on;
+
+/*
+ * This domain is a statically identity mapping domain.
+ * 1. This domain creats a static 1:1 mapping to all usable memory.
+ * 2. It maps to each iommu if successful.
+ * 3. Each iommu mapps to this domain if successful.
+ */
+static struct dmar_domain *si_domain;
+static int hw_pass_through = 1;
+
static void flush_unmaps_timeout(unsigned long data);

static DEFINE_TIMER(unmap_timer, flush_unmaps_timeout, 0, 0);
@@ -494,22 +512,6 @@ static struct kmem_cache *iommu_domain_cache;
static struct kmem_cache *iommu_devinfo_cache;
static struct kmem_cache *iommu_iova_cache;

-static inline void *alloc_pgtable_page(int node)
-{
- struct page *page;
- void *vaddr = NULL;
-
- page = alloc_pages_node(node, GFP_ATOMIC | __GFP_ZERO, 0);
- if (page)
- vaddr = page_address(page);
- return vaddr;
-}
-
-static inline void free_pgtable_page(void *vaddr)
-{
- free_page((unsigned long)vaddr);
-}
-
static inline void *alloc_domain_mem(void)
{
return kmem_cache_alloc(iommu_domain_cache, GFP_ATOMIC);
--
Bill Sumner <[email protected]>

2014-04-25 00:38:01

by Sumner, William

[permalink] [raw]
Subject: [PATCH 4/8] iommu/vt-d: Update iommu_attach_domain() and its callers

Allow specification of the domain-id for the new domain.
This patch only adds the 'did' parameter to iommu_attach_domain()
and modifies all of its callers to specify the default value of -1
which says "no did specified, allocate a new one".

There is no functional change from the current behaviour -- this just
enables a functional change to be made in a later patch.

Signed-off-by: Bill Sumner <[email protected]>
---
drivers/iommu/intel-iommu.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 4116377..226c1d0 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1181,7 +1181,8 @@ static struct dmar_domain *alloc_domain(bool vm)
}

static int iommu_attach_domain(struct dmar_domain *domain,
- struct intel_iommu *iommu)
+ struct intel_iommu *iommu,
+ int domain_number)
{
int num;
unsigned long ndomains;
@@ -1191,12 +1192,15 @@ static int iommu_attach_domain(struct dmar_domain *domain,

spin_lock_irqsave(&iommu->lock, flags);

- num = find_first_zero_bit(iommu->domain_ids, ndomains);
- if (num >= ndomains) {
- spin_unlock_irqrestore(&iommu->lock, flags);
- printk(KERN_ERR "IOMMU: no free domain ids\n");
- return -ENOMEM;
- }
+ if (domain_number < 0) {
+ num = find_first_zero_bit(iommu->domain_ids, ndomains);
+ if (num >= ndomains) {
+ spin_unlock_irqrestore(&iommu->lock, flags);
+ pr_err("IOMMU: no free domain ids\n");
+ return -ENOMEM;
+ }
+ } else
+ num = domain_number;

domain->id = num;
domain->iommu_count++;
@@ -1875,6 +1879,7 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
struct pci_dev *dev_tmp = NULL;
unsigned long flags;
u8 bus, devfn, bridge_bus, bridge_devfn;
+ int did = -1; /* Default to "no domain_id supplied" */

domain = find_domain(dev);
if (domain)
@@ -1915,7 +1920,7 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
domain = alloc_domain(false);
if (!domain)
goto error;
- if (iommu_attach_domain(domain, iommu)) {
+ if (iommu_attach_domain(domain, iommu, did)) {
free_domain_mem(domain);
domain = NULL;
goto error;
@@ -2083,7 +2088,7 @@ static int __init si_domain_init(int hw)
si_domain->flags = DOMAIN_FLAG_STATIC_IDENTITY;

for_each_active_iommu(iommu, drhd) {
- ret = iommu_attach_domain(si_domain, iommu);
+ ret = iommu_attach_domain(si_domain, iommu, (int) -1);
if (ret) {
domain_exit(si_domain);
return -EFAULT;
--
Bill Sumner <[email protected]>
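
As a rough illustration of the functional change this parameter enables, a
crashdump kernel could look up the domain-id that the old kernel's context
entry holds for the device and pass it in, so that translation tables copied
from the old kernel remain valid. The flag and helper names below are
hypothetical, not part of this patch set:

	int did = -1;	/* default: allocate a new domain-id */

	/*
	 * Sketch only: 'crashdump_accepting_active_iommu' and
	 * device_to_existing_context_entry() are hypothetical names.
	 */
	if (crashdump_accepting_active_iommu) {
		struct context_entry *ce;

		ce = device_to_existing_context_entry(iommu, bus, devfn);
		if (ce && context_get_p(ce))
			did = context_get_did(ce);	/* reuse the old kernel's did */
	}

	if (iommu_attach_domain(domain, iommu, did)) {
		free_domain_mem(domain);
		domain = NULL;
		goto error;
	}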

2014-04-25 00:37:58

by Sumner, William

[permalink] [raw]
Subject: [PATCH 1/8] iommu/vt-d: Fix a few existing lines for checkpatch.pl

Things like:
1. no space before tabs
2. no initialization of static variables to 0 or NULL
3. no extra parentheses around the value on the 'return' statement
4. no line over 80 characters

The first three patches in this set enable the intel-iommu driver to consist
of multiple *.c source files by moving many of the existing definitions,
prototypes, and structure definitions from the front of intel-iommu.c into
a new intel-iommu-private.h file -- and replacing them with a #include
of that file.


The last five patches in this set use the above enablement to implement,
within the new source file intel-iommu-kdump.c, a fix for:

A kdump problem about DMA that has been discussed for a long time.
That is, when a kernel panics and boots into the kdump kernel, DMA that was
started by the panicked kernel is not stopped before the kdump kernel is booted;
and the kdump kernel disables the IOMMU while this DMA continues.
This causes the IOMMU to stop translating the DMA addresses as IOVAs and
begin to treat them as physical memory addresses -- which causes the DMA to either:
1. generate DMAR errors or
2. generate PCI SERR errors or
3. transfer data to or from incorrect areas of memory.
Often this causes the dump to fail.

This patch set modifies the behavior of the Intel iommu in the crashdump kernel:
1. to accept the iommu hardware in an active state,
2. to leave the current translations in-place so that legacy DMA will continue
using its current buffers until the device drivers in the crashdump kernel
initialize and initialize their devices,
3. to use different portions of the iova address ranges for the device drivers
in the crashdump kernel than the iova ranges that were in-use at the time
of the panic.

Advantages of this approach:
1. All manipulation of the IO-device is done by the Linux device-driver
for that device.
2. This approach behaves in a manner very similar to operation without an
active iommu.
3. Any activity between the IO-device and its RMRR areas is handled by the
device-driver in the same manner as during a non-kdump boot.
4. If an IO-device has no driver in the kdump kernel, it is simply left alone.
This supports the practice of creating a special kdump kernel without
drivers for any devices that are not required for taking a crashdump.
5. Minimal code-changes among the existing mainline intel-iommu code.


Signed-off-by: Bill Sumner <[email protected]>
---
drivers/iommu/intel-iommu.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 69fa7da..f80b4bb 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -69,7 +69,8 @@
to match. That way, we can use 'unsigned long' for PFNs with impunity. */
#define DOMAIN_MAX_PFN(gaw) ((unsigned long) min_t(uint64_t, \
__DOMAIN_MAX_PFN(gaw), (unsigned long)-1))
-#define DOMAIN_MAX_ADDR(gaw) (((uint64_t)__DOMAIN_MAX_PFN(gaw)) << VTD_PAGE_SHIFT)
+#define DOMAIN_MAX_ADDR(gaw) (((uint64_t)__DOMAIN_MAX_PFN(gaw)) << \
+ VTD_PAGE_SHIFT)

#define IOVA_PFN(addr) ((addr) >> PAGE_SHIFT)
#define DMA_32BIT_PFN IOVA_PFN(DMA_BIT_MASK(32))
@@ -172,7 +173,7 @@ static int rwbf_quirk;
* set to 1 to panic kernel if can't successfully enable VT-d
* (used when kernel is launched w/ TXT)
*/
-static int force_on = 0;
+static int force_on;

/*
* 0: Present
@@ -187,7 +188,7 @@ struct root_entry {
#define ROOT_ENTRY_NR (VTD_PAGE_SIZE/sizeof(struct root_entry))
static inline bool root_present(struct root_entry *root)
{
- return (root->val & 1);
+ return root->val & 1;
}
static inline void set_root_present(struct root_entry *root)
{
@@ -225,7 +226,7 @@ struct context_entry {

static inline bool context_present(struct context_entry *context)
{
- return (context->lo & 1);
+ return context->lo & 1;
}
static inline void context_set_present(struct context_entry *context)
{
@@ -303,7 +304,7 @@ static inline bool dma_pte_present(struct dma_pte *pte)

static inline bool dma_pte_superpage(struct dma_pte *pte)
{
- return (pte->val & (1 << 7));
+ return pte->val & (1 << 7);
}

static inline int first_pte_in_page(struct dma_pte *pte)
@@ -314,7 +315,7 @@ static inline int first_pte_in_page(struct dma_pte *pte)
/*
* This domain is a statically identity mapping domain.
* 1. This domain creats a static 1:1 mapping to all usable memory.
- * 2. It maps to each iommu if successful.
+ * 2. It maps to each iommu if successful.
* 3. Each iommu mapps to this domain if successful.
*/
static struct dmar_domain *si_domain;
@@ -344,7 +345,7 @@ struct dmar_domain {
DECLARE_BITMAP(iommu_bmp, IOMMU_UNITS_SUPPORTED);
/* bitmap of iommus this domain uses*/

- struct list_head devices; /* all devices' list */
+ struct list_head devices; /* all devices' list */
struct iova_domain iovad; /* iova's that belong to this domain */

struct dma_pte *pgd; /* virtual address */
--
Bill Sumner <[email protected]>

2014-04-30 10:49:46

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

On Thu, 2014-04-24 at 18:36 -0600, Bill Sumner wrote:
>
> This patch set modifies the behavior of the Intel iommu in the crashdump kernel:
> 1. to accept the iommu hardware in an active state,
> 2. to leave the current translations in-place so that legacy DMA will continue
> using its current buffers until the device drivers in the crashdump kernel
> initialize and initialize their devices,
> 3. to use different portions of the iova address ranges for the device drivers
> in the crashdump kernel than the iova ranges that were in-use at the time
> of the panic.

There could be all kinds of existing mappings in the DMA page tables,
and I'm not sure it's safe to preserve them. What prevents the crashdump
kernel from trying to use any of the physical pages which are
accessible, and which could thus be corrupted by stray DMA?

In fact, the old kernel could even have set up 1:1 passthrough mappings
for some devices, which would then be able to DMA *anywhere*. Surely we
need to prevent that?

After the last round of this patchset, we discussed a potential
improvement where you point every virtual bus address at the *same*
physical scratch page.

That way, we allow the "rogue" DMA to continue to the same virtual bus
addresses, but it can only ever affect one piece of physical memory and
can't have detrimental effects elsewhere.

Was that option considered and discounted for some reason? It seems like
it would make sense...

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation



2014-07-02 13:33:28

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

Hi David,

On Wed, Apr 30, 2014 at 11:49:33AM +0100, David Woodhouse wrote:
> There could be all kinds of existing mappings in the DMA page tables,
> and I'm not sure it's safe to preserve them. What prevents the crashdump
> kernel from trying to use any of the physical pages which are
> accessible, and which could thus be corrupted by stray DMA?
>
> In fact, the old kernel could even have set up 1:1 passthrough mappings
> for some devices, which would then be able to DMA *anywhere*. Surely we
> need to prevent that?

Ideally we would prevent that, yes. But the problem is that a failed DMA
transaction might put the device into an unrecoverable state. Usually
any in-flight DMA transactions should only target buffers set up by the
previous kernel and not corrupt any data.

> After the last round of this patchset, we discussed a potential
> improvement where you point every virtual bus address at the *same*
> physical scratch page.

That is a solution to prevent the in-flight DMA failures. But what
happens when there is some in-flight DMA to a disk to write some inodes
or a new superblock. Then this scratch address-space may cause
filesystem corruption at worst.

So with this in mind I would prefer initially taking over the
page-tables from the old kernel before the device drivers re-initialize
the devices.


Joerg

2014-07-11 16:28:09

by Jerry Hoemann

[permalink] [raw]
Subject: Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

On Wed, Jul 02, 2014 at 03:32:59PM +0200, Joerg Roedel wrote:
> Hi David,
>
> On Wed, Apr 30, 2014 at 11:49:33AM +0100, David Woodhouse wrote:
> > There could be all kinds of existing mappings in the DMA page tables,
> > and I'm not sure it's safe to preserve them. What prevents the crashdump
> > kernel from trying to use any of the physical pages which are
> > accessible, and which could thus be corrupted by stray DMA?
> >
> > In fact, the old kernel could even have set up 1:1 passthrough mappings
> > for some devices, which would then be able to DMA *anywhere*. Surely we
> > need to prevent that?
>
> Ideally we would prevent that, yes. But the problem is that a failed DMA
> transaction might put the device into an unrecoverable state. Usually
> any in-flight DMA transactions should only target buffers set up by the
> previous kernel and not corrupt any data.
>
> > After the last round of this patchset, we discussed a potential
> > improvement where you point every virtual bus address at the *same*
> > physical scratch page.
>
> That is a solution to prevent the in-flight DMA failures. But what
> happens when there is some in-flight DMA to a disk to write some inodes
> or a new superblock. Then this scratch address-space may cause
> filesystem corruption at worst.
>
> So with this in mind I would prefer initially taking over the
> page-tables from the old kernel before the device drivers re-initialize
> the devices.
>
>
> Joerg

David, Joerg,

What do you think here? Do you want me to update the patch set for 3.17?

Jerry

--

----------------------------------------------------------------------------
Jerry Hoemann Software Engineer Hewlett-Packard

3404 E Harmony Rd. MS 57 phone: (970) 898-1022
Ft. Collins, CO 80528 FAX: (970) 898-XXXX
email: [email protected]
----------------------------------------------------------------------------

2014-10-15 08:10:25

by Li, ZhenHua

[permalink] [raw]
Subject: Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

David, Joerg,
I plan to merge this patch set with the 3.17 stable kernel, and split this
patch set into two:
1. The core part, including the changed functions, like [Patch 4/8],
[Patch 8/8].
2. The formatting part, like [Patch 1/8], [Patch 3/8], including
the code-formatting changes and the creation of new files
intel-iommu-kdump.c, intel-iommu-private.h.

I believe this will make the patch set clearer to read and understand.

What are your suggestions?

Thanks
Zhenhua


On 07/12/2014 12:27 AM, Jerry Hoemann wrote:
> On Wed, Jul 02, 2014 at 03:32:59PM +0200, Joerg Roedel wrote:
>> Hi David,
>>
>> On Wed, Apr 30, 2014 at 11:49:33AM +0100, David Woodhouse wrote:
>>> There could be all kinds of existing mappings in the DMA page tables,
>>> and I'm not sure it's safe to preserve them. What prevents the crashdump
>>> kernel from trying to use any of the physical pages which are
>>> accessible, and which could thus be corrupted by stray DMA?
>>>
>>> In fact, the old kernel could even have set up 1:1 passthrough mappings
>>> for some devices, which would then be able to DMA *anywhere*. Surely we
>>> need to prevent that?
>>
>> Ideally we would prevent that, yes. But the problem is that a failed DMA
>> transaction might put the device into an unrecoverable state. Usually
>> any in-flight DMA transactions should only target buffers set up by the
>> previous kernel and not corrupt any data.
>>
>>> After the last round of this patchset, we discussed a potential
>>> improvement where you point every virtual bus address at the *same*
>>> physical scratch page.
>>
>> That is a solution to prevent the in-flight DMA failures. But what
>> happens when there is some in-flight DMA to a disk to write some inodes
>> or a new superblock. Then this scratch address-space may cause
>> filesystem corruption at worst.
>>
>> So with this in mind I would prefer initially taking over the
>> page-tables from the old kernel before the device drivers re-initialize
>> the devices.
>>
>>
>> Joerg
>
> David, Joerg,
>
> What do you think here? Do you want me to update the patch set for 3.17?
>
> Jerry
>

2014-10-15 08:45:47

by Li, ZhenHua

[permalink] [raw]
Subject: Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

Add Tom to CC list.
On 10/15/2014 04:10 PM, Li, ZhenHua wrote:
> David, Joerg,
> I plan to merge this patch set with the 3.17 stable kernel, and split this
> patch set into two:
> 1. The core part, including the changed functions, like [Patch 4/8],
> [Patch 8/8].
> 2. The formatting part, like [Patch 1/8], [Patch 3/8], including
> the code-formatting changes and the creation of new files
> intel-iommu-kdump.c, intel-iommu-private.h.
>
> I believe this will make the patch set clearer to read and understand.
>
> What are your suggestions?
>
> Thanks
> Zhenhua
>
>
> On 07/12/2014 12:27 AM, Jerry Hoemann wrote:
>> On Wed, Jul 02, 2014 at 03:32:59PM +0200, Joerg Roedel wrote:
>>> Hi David,
>>>
>>> On Wed, Apr 30, 2014 at 11:49:33AM +0100, David Woodhouse wrote:
>>>> There could be all kinds of existing mappings in the DMA page tables,
>>>> and I'm not sure it's safe to preserve them. What prevents the
>>>> crashdump
>>>> kernel from trying to use any of the physical pages which are
>>>> accessible, and which could thus be corrupted by stray DMA?
>>>>
>>>> In fact, the old kernel could even have set up 1:1 passthrough mappings
>>>> for some devices, which would then be able to DMA *anywhere*. Surely we
>>>> need to prevent that?
>>>
>>> Ideally we would prevent that, yes. But the problem is that a failed DMA
>>> transaction might put the device into an unrecoverable state. Usually
>>> any in-flight DMA transactions should only target buffers set up by the
>>> previous kernel and not corrupt any data.
>>>
>>>> After the last round of this patchset, we discussed a potential
>>>> improvement where you point every virtual bus address at the *same*
>>>> physical scratch page.
>>>
>>> That is a solution to prevent the in-flight DMA failures. But what
>>> happens when there is some in-flight DMA to a disk to write some inodes
>>> or a new superblock. Then this scratch address-space may cause
>>> filesystem corruption at worst.
>>>
>>> So with this in mind I would prefer initially taking over the
>>> page-tables from the old kernel before the device drivers re-initialize
>>> the devices.
>>>
>>>
>>> Joerg
>>
>> David, Joerg,
>>
>> What do you think here? Do you want me to update the patch set for 3.17?
>>
>> Jerry
>>
>

2014-10-22 02:17:10

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

[-cc Bill, +cc Zhen-Hua, Eric, Tom, Jerry]

Hi Joerg,

I was looking at Zhen-Hua's recent patches, trying to figure out if I
need to do anything with them. Resetting devices in the old kernel
seems like a non-starter. Resetting devices in the new kernel, ...,
well, maybe. It seems ugly, and it seems like the sort of problem
that IOMMUs are designed to solve. Anyway, I found this old
discussion that I didn't quite understand:

On Wed, Jul 2, 2014 at 7:32 AM, Joerg Roedel <[email protected]> wrote:
> On Wed, Apr 30, 2014 at 11:49:33AM +0100, David Woodhouse wrote:

>> After the last round of this patchset, we discussed a potential
>> improvement where you point every virtual bus address at the *same*
>> physical scratch page.
>
> That is a solution to prevent the in-flight DMA failures. But what
> happens when there is some in-flight DMA to a disk to write some inodes
> or a new superblock. Then this scratch address-space may cause
> filesystem corruption at worst.

This in-flight DMA is from a device programmed by the old kernel, and
it would be reading data from the old kernel's buffers. I think
you're suggesting that we might want that DMA read to complete so the
device can update filesystem metadata?

I don't really understand that argument. Don't we usually want to
stop any data from escaping the machine after a crash, on the theory
that the old kernel is crashing because something is catastrophically
wrong and we may have already corrupted things in memory? If so,
allowing this old DMA to complete is just as likely to make things
worse as to make them better.

Without kdump, we likely would reboot through the BIOS and the device
would get reset and the DMA would never happen at all. So if we made
the dump kernel program the IOMMU to prevent the DMA, that seems like
a similar situation.

> So with this in mind I would prefer initially taking over the
> page-tables from the old kernel before the device drivers re-initialize
> the devices.

This makes the dump kernel more dependent on data from the old kernel,
which we obviously want to avoid when possible.

I didn't find the previous discussion where pointing every virtual bus
address at the same physical scratch page was proposed. Why was that
better than programming the IOMMU to reject every DMA?

Bjorn

2014-10-22 02:48:15

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

Bjorn Helgaas <[email protected]> writes:

> [-cc Bill, +cc Zhen-Hua, Eric, Tom, Jerry]
>
> Hi Joerg,
>
> I was looking at Zhen-Hua's recent patches, trying to figure out if I
> need to do anything with them. Resetting devices in the old kernel
> seems like a non-starter. Resetting devices in the new kernel, ...,
> well, maybe. It seems ugly, and it seems like the sort of problem
> that IOMMUs are designed to solve. Anyway, I found this old
> discussion that I didn't quite understand:

For context here is the kexec on panic design, and what I know from
previous rounds of similar conversations.

The way kexec on panic aka kdump is designed to work is that the
recovery kernel lives in a piece of memory reserved at boot time and
known not to be in use by any driver (because we never ever use it for
DMA). If DMA's continue from any source the old kernel may be a little
more corrupted but our currently running kernel should not.

Device drivers that we use in the recovery kernel are required to be
able to initialize their devices from an arbitrary state or fail to
initialize their devices.

We have discussed things on various occasions but IOMMUs all have their
own individual idiosyncrasies and came late to the party so that it
is hard to generalize.

The reserved region is generally low enough in memory that simply
not using IOMMUs works.

The major challenge with initializing an IOMMU would be that there are
potentially devices whose driver is not loaded in the recovery kernel
with on-going DMA sessions (perhaps a NIC in response to a network
packet).

Which essentially means that if you are going to use an IOMMU slot in a
recovery kernel, you either have to know that the IOMMU slot was reserved
for the recovery kernel (which has always felt like the easiest way to me),
or you have to know that everything that could target that IOMMU slot has
been reset or has its driver loaded.

I have always thought the simplest and easiest solution would be to
reserve a few IOMMU slots for the kexec on panic kernel. But if folks
can find other ways to guarantee that an on-going DMA isn't targeting
an IOMMU slot (such as resetting everything downstream from that
IOMMU slot) more power to you.

> On Wed, Jul 2, 2014 at 7:32 AM, Joerg Roedel <[email protected]> wrote:
>> On Wed, Apr 30, 2014 at 11:49:33AM +0100, David Woodhouse wrote:
>
>>> After the last round of this patchset, we discussed a potential
>>> improvement where you point every virtual bus address at the *same*
>>> physical scratch page.
>>
>> That is a solution to prevent the in-flight DMA failures. But what
>> happens when there is some in-flight DMA to a disk to write some inodes
>> or a new superblock. Then this scratch address-space may cause
>> filesystem corruption at worst.
>
> This in-flight DMA is from a device programmed by the old kernel, and
> it would be reading data from the old kernel's buffers. I think
> you're suggesting that we might want that DMA read to complete so the
> device can update filesystem metadata?
>
> I don't really understand that argument. Don't we usually want to
> stop any data from escaping the machine after a crash, on the theory
> that the old kernel is crashing because something is catastrophically
> wrong and we may have already corrupted things in memory? If so,
> allowing this old DMA to complete is just as likely to make things
> worse as to make them better.
>
> Without kdump, we likely would reboot through the BIOS and the device
> would get reset and the DMA would never happen at all. So if we made
> the dump kernel program the IOMMU to prevent the DMA, that seems like
> a similar situation.
>
>> So with this in mind I would prefer initially taking over the
>> page-tables from the old kernel before the device drivers re-initialize
>> the devices.
>
> This makes the dump kernel more dependent on data from the old kernel,
> which we obviously want to avoid when possible.
>
> I didn't find the previous discussion where pointing every virtual bus
> address at the same physical scratch page was proposed. Why was that
> better than programming the IOMMU to reject every DMA?
>
> Bjorn

Eric

2014-10-22 03:18:18

by Li, ZhenHua

[permalink] [raw]
Subject: Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

I need more time to read and think about these mails. I just want to
clarify one thing: Bill has left HP, and I have now inherited his work.
That's why I sent an update of his patch:
https://lkml.org/lkml/2014/10/21/134

On 10/22/2014 10:47 AM, Eric W. Biederman wrote:
> Bjorn Helgaas <[email protected]> writes:
>
>> [-cc Bill, +cc Zhen-Hua, Eric, Tom, Jerry]
>>
>> Hi Joerg,
>>
>> I was looking at Zhen-Hua's recent patches, trying to figure out if I
>> need to do anything with them. Resetting devices in the old kernel
>> seems like a non-starter. Resetting devices in the new kernel, ...,
>> well, maybe. It seems ugly, and it seems like the sort of problem
>> that IOMMUs are designed to solve. Anyway, I found this old
>> discussion that I didn't quite understand:
>
> For context here is the kexec on panic design, and what I know from
> previous rounds of similar conversations.
>
> The way kexec on panic aka kdump is designed to work is that the
> recovery kernel lives in a piece of memory reserved at boot time and
> known not to be in use by any driver (because we never ever use it for
> DMA). If DMA's continue from any source the old kernel may be a little
> more corrupted but our currently running kernel should not.
>
> Device drivers that we use in the recovery kernel are required to be
> able to initialize their devices from an arbitrary state or fail to
> initialize their devices.
>
> We have discussed things on various occasions but IOMMUs all have their
> own individual idiosyncrasies and came late to the party so that it
> is hard to generalize.
>
> The reserved region is generally low enough in memory that simply
> not using IOMMUs works.
>
> The major challenge with initializing an IOMMU would be that there are
> potentially devices whose driver is not loaded in the recovery kernel
> with on-going DMA sessions (perhaps a NIC in response to a network
> packet).
>
> Which essentially means that if you are going to use an IOMMU slot in a
> recovery kernel, you either have to know that the IOMMU slot was reserved
> for the recovery kernel (which has always felt like the easiest way to me),
> or you have to know that everything that could target that IOMMU slot has
> been reset or has its driver loaded.
>
> I have always thought the simplest and easiest solution would be to
> reserve a few IOMMU slots for the kexec on panic kernel. But if folks
> can find other ways to guarantee that an on-going DMA isn't targeting
> an IOMMU slot (such as resetting everything downstream from that
> IOMMU slot) more power to you.
>
>> On Wed, Jul 2, 2014 at 7:32 AM, Joerg Roedel <[email protected]> wrote:
>>> On Wed, Apr 30, 2014 at 11:49:33AM +0100, David Woodhouse wrote:
>>
>>>> After the last round of this patchset, we discussed a potential
>>>> improvement where you point every virtual bus address at the *same*
>>>> physical scratch page.
>>>
>>> That is a solution to prevent the in-flight DMA failures. But what
>>> happens when there is some in-flight DMA to a disk to write some inodes
>>> or a new superblock. Then this scratch address-space may cause
>>> filesystem corruption at worst.
>>
>> This in-flight DMA is from a device programmed by the old kernel, and
>> it would be reading data from the old kernel's buffers. I think
>> you're suggesting that we might want that DMA read to complete so the
>> device can update filesystem metadata?
>>
>> I don't really understand that argument. Don't we usually want to
>> stop any data from escaping the machine after a crash, on the theory
>> that the old kernel is crashing because something is catastrophically
>> wrong and we may have already corrupted things in memory? If so,
>> allowing this old DMA to complete is just as likely to make things
>> worse as to make them better.
>>
>> Without kdump, we likely would reboot through the BIOS and the device
>> would get reset and the DMA would never happen at all. So if we made
>> the dump kernel program the IOMMU to prevent the DMA, that seems like
>> a similar situation.
>>
>>> So with this in mind I would prefer initially taking over the
>>> page-tables from the old kernel before the device drivers re-initialize
>>> the devices.
>>
>> This makes the dump kernel more dependent on data from the old kernel,
>> which we obviously want to avoid when possible.
>>
>> I didn't find the previous discussion where pointing every virtual bus
>> address at the same physical scratch page was proposed. Why was that
>> better than programming the IOMMU to reject every DMA?
>>
>> Bjorn
>
> Eric
>

2014-10-22 13:22:04

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

Hi Bjorn,

On Tue, Oct 21, 2014 at 08:16:46PM -0600, Bjorn Helgaas wrote:
> I was looking at Zhen-Hua's recent patches, trying to figure out if I
> need to do anything with them. Resetting devices in the old kernel
> seems like a non-starter. Resetting devices in the new kernel, ...,
> well, maybe. It seems ugly, and it seems like the sort of problem
> that IOMMUs are designed to solve.

Actually resetting the devices in the kdump kernel would be one of the
better solutions for this problem. When we have a generic way to stop
all in-flight DMA from the PCI endpoints we could safely disable and
then re-enable the IOMMU.

> On Wed, Jul 2, 2014 at 7:32 AM, Joerg Roedel <[email protected]> wrote:
> > That is a solution to prevent the in-flight DMA failures. But what
> > happens when there is some in-flight DMA to a disk to write some inodes
> > or a new superblock. Then this scratch address-space may cause
> > filesystem corruption at worst.
>
> This in-flight DMA is from a device programmed by the old kernel, and
> it would be reading data from the old kernel's buffers. I think
> you're suggesting that we might want that DMA read to complete so the
> device can update filesystem metadata?

Well, it is not about updating filesystem metadata. In the kdump kernel
we have these options:

1) Disable the IOMMU. The problem here is that DMA is now
untranslated, so that any in-flight DMA might read from or write
to a random location in memory, corrupting the kdump or
even the new kexec kernel memory. So this is a non-starter.

2) Re-program the IOMMU to block all DMA. This is safer as it
does not corrupt any data in memory. But some devices react
very poorly to a master abort from the IOMMU, so badly that the
driver in the kdump kernel fails to initialize that device.
In this case taking the dump itself might fail (and often does,
according to reports).

3) To prevent master aborts like in option (2), David suggested
to map the whole DMA address space to a scratch page. This
also prevents system memory corruption and the master aborts.
But the problem is that in-flight DMA will now read all
zeros. This can corrupt disk and network data; at worst it
nulls out the superblocks of your filesystem and you lose all
data. So this is not an option either.

What we currently do in the VT-d driver is a mixture of (1) and (2). The
VT-d driver disables the IOMMU hardware (opening a race window for
memory data corruption), re-initializes it to reject any ongoing DMA
(which causes master aborts for in-flight DMA) and re-enables the IOMMU
hardware.

But this strategy fails in heavy IO environments quite often, so we are
looking into ways to solve the problem, or at least to improve the current
situation as well as we can.

I talked to David about this at LPC and we came up with basically this
procedure:

1. If the VT-d driver finds the IOMMU enabled, it reuses its
root-context table. This way the IOMMU does not need to be disabled
and re-enabled, eliminating the race we have now.
Other data structures like the context-entries are copied
over from the old kernel. We basically keep all mappings
from the old kernel, allowing any in-flight DMA to succeed.
No memory or disk data corruption.
The issue here is, that we modify data from the old kernel
which is about to be dumped. But there are ways to work
around that.

2. When a device driver issues the first dma_map command for a
device, we assign a new and empty page-table, thus removing
all mappings from the old kernel for the device.
The rationale is that at this point the device driver should
have reset the device to a point where all in-flight DMA is
canceled.

This approach goes in the same direction as Bill Sumner's patch-set,
which Li took over, but it does not go as far as keeping the old mappings
while the kdump kernel is still working with the devices (which might
introduce new issues and corner cases).
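
In rough pseudo-code, that two-step procedure might look like the following;
none of these function names exist in the driver, they only illustrate the
sequence described above:

/* Step 1: IOMMU init in the kdump kernel -- keep the hardware enabled */
if (iommu_already_enabled(iommu)) {
	old_root = read_old_root_table(iommu);	/* e.g. from the root-table address register */
	copy_context_entries(iommu, old_root);	/* old mappings stay live */
	/* no disable/re-enable cycle, so no window of untranslated DMA */
}

/* Step 2: first dma_map call for a device in the kdump kernel */
if (first_dma_map_for(dev)) {
	domain = device_domain(dev);
	domain->pgd = alloc_pgtable_page(iommu->node);	/* fresh, empty page table */
	flush_context_and_iotlb(iommu, domain);
	/*
	 * By this point the driver is expected to have reset the device,
	 * so no in-flight DMA should still target the old kernel's buffers.
	 */
}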

> > So with this in mind I would prefer initially taking over the
> > page-tables from the old kernel before the device drivers re-initialize
> > the devices.
>
> This makes the dump kernel more dependent on data from the old kernel,
> which we obviously want to avoid when possible.

Sure, but this is not really possible here (unless we have a generic and
reliable way to reset all PCI endpoint devices and cancel all in-flight
DMA before we disable the IOMMU in the kdump kernel).
Otherwise we always risk data corruption somewhere, in system memory or
on disk.

> I didn't find the previous discussion where pointing every virtual bus
> address at the same physical scratch page was proposed. Why was that
> better than programming the IOMMU to reject every DMA?

As I said, the problem is that this causes master aborts which some
devices can't recover from, so that the device driver in the kdump
kernel fails to initialize the device.


Joerg

2014-10-22 18:26:38

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

On Wed, Oct 22, 2014 at 7:21 AM, Joerg Roedel <[email protected]> wrote:
> Hi Bjorn,
>
> On Tue, Oct 21, 2014 at 08:16:46PM -0600, Bjorn Helgaas wrote:
>> I was looking at Zhen-Hua's recent patches, trying to figure out if I
>> need to do anything with them. Resetting devices in the old kernel
>> seems like a non-starter. Resetting devices in the new kernel, ...,
>> well, maybe. It seems ugly, and it seems like the sort of problem
>> that IOMMUs are designed to solve.
>
> Actually resetting the devices in the kdump kernel would be one of the
> better solutions for this problem. When we have a generic way to stop
> all in-flight DMA from the PCI endpoints we could safely disable and
> then re-enable the IOMMU.
>
>> On Wed, Jul 2, 2014 at 7:32 AM, Joerg Roedel <[email protected]> wrote:
>> > That is a solution to prevent the in-flight DMA failures. But what
>> > happens when there is some in-flight DMA to a disk to write some inodes
>> > or a new superblock. Then this scratch address-space may cause
>> > filesystem corruption at worst.
>>
>> This in-flight DMA is from a device programmed by the old kernel, and
>> it would be reading data from the old kernel's buffers. I think
>> you're suggesting that we might want that DMA read to complete so the
>> device can update filesystem metadata?
>
> Well, it is not about updating filesystem metadata. In the kdump kernel
> we have these options:
>
> 1) Disable the IOMMU. The problem here is that DMA is now
> untranslated, so that any in-flight DMA might read from or write
> to a random location in memory, corrupting the kdump or
> even the new kexec kernel memory. So this is a non-starter.

Agreed (at least if the IOMMU was enabled in the crashed kernel).

> 2) Re-program the IOMMU to block all DMA. This is safer as it
> does not corrupt any data in memory. But some devices react
> very poorly to a master abort from the IOMMU, so badly that the
> driver in the kdump kernel fails to initialize that device.
> In this case taking the dump itself might fail (and often does,
> according to reports).

Sounds like an option, even though broken devices work poorly.

> 3) To prevent master aborts like in option (2), David suggested
> to map the whole DMA address space to a scratch page. This
> also prevents system memory corruption and the master aborts.
> But the problem is that in-flight DMA will now read all
> zeros. This can corrupt disk and network data; at worst it
> nulls out the superblocks of your filesystem and you lose all
> data. So this is not an option either.

Ah, yes, I see your point now. This allows corrupted data, e.g., all
zeroes, to be written to disk or network after the kernel crash. I
agree; this doesn't sound like a good option.

And the proposal below is a 4th option (leave IOMMU enabled, reusing
crashed kernel's mappings until drivers make new mappings).

> What we currently do in the VT-d driver is a mixture of (1) and (2). The
> VT-d driver disables the IOMMU hardware (opening a race window for
> memory data corruption), re-initializes it to reject any ongoing DMA
> (which causes master aborts for in-flight DMA) and re-enables the IOMMU
> hardware.
>
> But this strategy fails in heavy IO environments quite often, so we are
> looking into ways to solve the problem, or at least to improve the current
> situation as well as we can.
>
> I talked to David about this at LPC and we came up with basically this
> procedure:
>
> 1. If the VT-d driver finds the IOMMU enabled, it reuses its
> root-context table. This way the IOMMU does not need to be disabled
> and re-enabled, eliminating the race we have now.
> Other data structures like the context-entries are copied
> over from the old kernel. We basically keep all mappings
> from the old kernel, allowing any in-flight DMA to succeed.
> No memory or disk data corruption.

If the crashed kernel had corrupted memory, couldn't an in-flight DMA
read that corrupted data from memory and write it to disk?

I guess you could argue that this is merely a race, and the in-flight
DMA could just as easily have happened before the kernel crash, so
there's always a window and the only question is whether it closes
when the IOMMU driver starts up or when the device driver starts up.

> The issue here is, that we modify data from the old kernel
> which is about to be dumped. But there are ways to work
> around that.
>
> 2. When a device driver issues the first dma_map command for a
> device, we assign a new and empty page-table, thus removing
> all mappings from the old kernel for the device.
> The rationale is that at this point the device driver should
> have reset the device to a point where all in-flight DMA is
> canceled.
>
> This approach goes in the same direction as Bill Sumner's patch-set,
> which Li took over, but it does not go as far as keeping the old mappings
> while the kdump kernel is still working with the devices (which might
> introduce new issues and corner cases).
>
>> > So with this in mind I would prefer initially taking over the
>> > page-tables from the old kernel before the device drivers re-initialize
>> > the devices.
>>
>> This makes the dump kernel more dependent on data from the old kernel,
>> which we obviously want to avoid when possible.
>
> Sure, but this is not really possible here (unless we have a generic and
> reliable way to reset all PCI endpoint devices and cancel all in-flight
> DMA before we disable the IOMMU in the kdump kernel).
> Otherwise we always risk data corruption somewhere, in system memory or
> on disk.
>
>> I didn't find the previous discussion where pointing every virtual bus
>> address at the same physical scratch page was proposed. Why was that
>> better than programming the IOMMU to reject every DMA?
>
> As I said, the problem is that this causes master aborts which some
> devices can't recover from, so that the device driver in the kdump
> kernel fails to initialize the device.

Yes, thanks for making that explicit again.

Bjorn