2015-04-10 08:43:14

by Li, ZhenHua

[permalink] [raw]
Subject: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

This patchset is an update of Bill Sumner's patchset, implements a fix for:
If a kernel boots with intel_iommu=on on a system that supports intel vt-d,
when a panic happens, the kdump kernel will boot with these faults:

dmar: DRHD: handling fault status reg 102
dmar: DMAR:[DMA Read] Request device [01:00.0] fault addr fff80000
DMAR:[fault reason 01] Present bit in root entry is clear

dmar: DRHD: handling fault status reg 2
dmar: INTR-REMAP: Request device [[61:00.0] fault index 42
INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear

On some system, the interrupt remapping fault will also happen even if the
intel_iommu is not set to on, because the interrupt remapping will be enabled
when x2apic is needed by the system.

The cause of the DMA fault is described in Bill's original version, and the
INTR-Remap fault is caused by a similar reason. In short, the initialization
of vt-d drivers causes the in-flight DMA and interrupt requests get wrong
response.

To fix this problem, we modifies the behaviors of the intel vt-d in the
crashdump kernel:

For DMA Remapping:
1. To accept the vt-d hardware in an active state,
2. Do not disable and re-enable the translation, keep it enabled.
3. Use the old root entry table, do not rewrite the RTA register.
4. Malloc and use new context entry table, copy data from the old ones that
used by the old kernel.
5. Keep using the old page tables before driver is loaded.
6. After device driver is loaded, when it issues the first dma_map command,
free the dmar_domain structure for this device, and generate a new one, so
that the device can be assigned a new and empty page table.
7. When a new context entry table is generated, we also save its address to
the old root entry table.

For Interrupt Remapping:
1. To accept the vt-d hardware in an active state,
2. Do not disable and re-enable the interrupt remapping, keep it enabled.
3. Use the old interrupt remapping table, do not rewrite the IRTA register.
4. When ioapic entry is setup, the interrupt remapping table is changed, and
the updated data will be stored to the old interrupt remapping table.

Advantages of this approach:
1. All manipulation of the IO-device is done by the Linux device-driver
for that device.
2. This approach behaves in a manner very similar to operation without an
active iommu.
3. Any activity between the IO-device and its RMRR areas is handled by the
device-driver in the same manner as during a non-kdump boot.
4. If an IO-device has no driver in the kdump kernel, it is simply left alone.
This supports the practice of creating a special kdump kernel without
drivers for any devices that are not required for taking a crashdump.
5. Minimal code-changes among the existing mainline intel vt-d code.

Summary of changes in this patch set:
1. Added some useful function for root entry table in code intel-iommu.c
2. Added new members to struct root_entry and struct irte;
3. Functions to load old root entry table to iommu->root_entry from the memory
of old kernel.
4. Functions to malloc new context entry table and copy the data from the old
ones to the malloced new ones.
5. Functions to enable support for DMA remapping in kdump kernel.
6. Functions to load old irte data from the old kernel to the kdump kernel.
7. Some code changes that support other behaviours that have been listed.
8. In the new functions, use physical address as "unsigned long" type, not
pointers.

Original version by Bill Sumner:
https://lkml.org/lkml/2014/1/10/518
https://lkml.org/lkml/2014/4/15/716
https://lkml.org/lkml/2014/4/24/836

Zhenhua's updates:
https://lkml.org/lkml/2014/10/21/134
https://lkml.org/lkml/2014/12/15/121
https://lkml.org/lkml/2014/12/22/53
https://lkml.org/lkml/2015/1/6/1166
https://lkml.org/lkml/2015/1/12/35
https://lkml.org/lkml/2015/3/19/33

Changelog[v10]:
1. Do not use CONFIG_CRASH_DUMP and is_kdump_kernel().
Use one flag which stores the te and ir status in last kernel:
iommu->pre_enabled_trans
iommu->pre_enabled_ir

Changelog[v9]:
1. Add new function iommu_attach_domain_with_id.
2. Do not copy old page tables, keep using the old ones.
3. Remove functions:
intel_iommu_did_to_domain_values_entry
intel_iommu_get_dids_from_old_kernel
device_to_domain_id
copy_page_addr
copy_page_table
copy_context_entry
copy_context_entry_table
4. Add new function device_to_existing_context_entry.

Changelog[v8]:
1. Add a missing __iommu_flush_cache in function copy_page_table.

Changelog[v7]:
1. Use __iommu_flush_cache to flush the data to hardware.

Changelog[v6]:
1. Use "unsigned long" as type of physical address.
2. Use new function unmap_device_dma to unmap the old dma.
3. Some small incorrect bits order for aw shift.

Changelog[v5]:
1. Do not disable and re-enable traslation and interrupt remapping.
2. Use old root entry table.
3. Use old interrupt remapping table.
4. New functions to copy data from old kernel, and save to old kernel mem.
5. New functions to save updated root entry table and irte table.
6. Use intel_unmap to unmap the old dma;
7. Allocate new pages while driver is being loaded.

Changelog[v4]:
1. Cut off the patches that move some defines and functions to new files.
2. Reduce the numbers of patches to five, make it more easier to read.
3. Changed the name of functions, make them consistent with current context
get/set functions.
4. Add change to function __iommu_attach_domain.

Changelog[v3]:
1. Commented-out "#define DEBUG 1" to eliminate debug messages.
2. Updated the comments about changes in each version.
3. Fixed: one-line added to Copy-Translations patch to initialize the iovad
struct as recommended by Baoquan He [[email protected]]
init_iova_domain(&domain->iovad, DMA_32BIT_PFN);

Changelog[v2]:
The following series implements a fix for:
A kdump problem about DMA that has been discussed for a long time. That is,
when a kernel panics and boots into the kdump kernel, DMA started by the
panicked kernel is not stopped before the kdump kernel is booted and the
kdump kernel disables the IOMMU while this DMA continues. This causes the
IOMMU to stop translating the DMA addresses as IOVAs and begin to treat
them as physical memory addresses -- which causes the DMA to either:
(1) generate DMAR errors or
(2) generate PCI SERR errors or
(3) transfer data to or from incorrect areas of memory. Often this
causes the dump to fail.

Changelog[v1]:
The original version.

Changed in this version:
1. Do not disable and re-enable traslation and interrupt remapping.
2. Use old root entry table.
3. Use old interrupt remapping table.
4. Use "unsigned long" as physical address.
5. Use intel_unmap to unmap the old dma;

Baoquan He <[email protected]> helps testing this patchset.
Takao Indoh <[email protected]> gives valuable suggestions.

Li, Zhen-Hua (10):
iommu/vt-d: New function to attach domain with id
iommu/vt-d: Items required for kdump
iommu/vt-d: Function to get old context entry
iommu/vt-d: functions to copy data from old mem
iommu/vt-d: Add functions to load and save old re
iommu/vt-d: datatypes and functions used for kdump
iommu/vt-d: enable kdump support in iommu module
iommu/vt-d: assign new page table for dma_map
iommu/vt-d: Copy functions for irte
iommu/vt-d: Use old irte in kdump kernel

drivers/iommu/intel-iommu.c | 518 ++++++++++++++++++++++++++++++++++--
drivers/iommu/intel_irq_remapping.c | 96 ++++++-
include/linux/intel-iommu.h | 16 ++
3 files changed, 605 insertions(+), 25 deletions(-)

--
2.0.0-rc0


2015-04-10 08:43:21

by Li, ZhenHua

[permalink] [raw]
Subject: [PATCH v10 01/10] iommu/vt-d: New function to attach domain with id

Allow specification of the domain-id for the new domain.
This patch only adds a new function iommu_attach_domain_with_id, it is like
the function iommu_attach_domain(), only adding a parameter "did".

Bill Sumner:
(In older versions) Add new 'did' parameter to iommu_attach_domain();
The caller of this function.

Li, Zhenhua:
New function iommu_attach_domain_with_id(), instead of updating function
iommu_attach_domain();

Signed-off-by: Bill Sumner <[email protected]>
Signed-off-by: Li, Zhen-Hua <[email protected]>
---
drivers/iommu/intel-iommu.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 2d1e05b..3a93ce0 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1558,6 +1558,16 @@ static int iommu_attach_domain(struct dmar_domain *domain,
return num;
}

+static int iommu_attach_domain_with_id(struct dmar_domain *domain,
+ struct intel_iommu *iommu,
+ int domain_number)
+{
+ if (domain_number >= 0)
+ return domain_number;
+
+ return iommu_attach_domain(domain, iommu);
+}
+
static int iommu_attach_vm_domain(struct dmar_domain *domain,
struct intel_iommu *iommu)
{
@@ -2224,6 +2234,7 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
u16 dma_alias;
unsigned long flags;
u8 bus, devfn;
+ int did = -1; /* Default to "no domain_id supplied" */

domain = find_domain(dev);
if (domain)
@@ -2257,7 +2268,7 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
domain = alloc_domain(0);
if (!domain)
return NULL;
- domain->id = iommu_attach_domain(domain, iommu);
+ domain->id = iommu_attach_domain_with_id(domain, iommu, did);
if (domain->id < 0) {
free_domain_mem(domain);
return NULL;
--
2.0.0-rc0

2015-04-10 08:43:29

by Li, ZhenHua

[permalink] [raw]
Subject: [PATCH v10 02/10] iommu/vt-d: Items required for kdump

Add context entry functions needed for kdump.

Bill Sumner:
Original version;

Li, Zhenhua:
Changed the name of new functions, make them consistent with current
context get/set functions.
Remove the structure dve which is not used in new version.

Signed-off-by: Bill Sumner <[email protected]>
Signed-off-by: Li, Zhen-Hua <[email protected]>
---
drivers/iommu/intel-iommu.c | 54 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 54 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 3a93ce0..735e28f 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -211,6 +211,12 @@ get_context_addr_from_root(struct root_entry *root)
NULL);
}

+static inline unsigned long
+get_context_phys_from_root(struct root_entry *root)
+{
+ return root_present(root) ? (root->val & VTD_PAGE_MASK) : 0;
+}
+
/*
* low 64 bits:
* 0: present
@@ -231,6 +237,32 @@ static inline bool context_present(struct context_entry *context)
{
return (context->lo & 1);
}
+
+static inline int context_fault_enable(struct context_entry *c)
+{
+ return((c->lo >> 1) & 0x1);
+}
+
+static inline int context_translation_type(struct context_entry *c)
+{
+ return((c->lo >> 2) & 0x3);
+}
+
+static inline u64 context_address_root(struct context_entry *c)
+{
+ return((c->lo >> VTD_PAGE_SHIFT));
+}
+
+static inline int context_address_width(struct context_entry *c)
+{
+ return((c->hi >> 0) & 0x7);
+}
+
+static inline int context_domain_id(struct context_entry *c)
+{
+ return((c->hi >> 8) & 0xffff);
+}
+
static inline void context_set_present(struct context_entry *context)
{
context->lo |= 1;
@@ -316,6 +348,28 @@ static inline int first_pte_in_page(struct dma_pte *pte)
return !((unsigned long)pte & ~VTD_PAGE_MASK);
}

+
+/*
+ * Fix Crashdump failure caused by leftover DMA through a hardware IOMMU
+ *
+ * Fixes the crashdump kernel to deal with an active iommu and legacy
+ * DMA from the (old) panicked kernel in a manner similar to how legacy
+ * DMA is handled when no hardware iommu was in use by the old kernel --
+ * allow the legacy DMA to continue into its current buffers.
+ *
+ * In the crashdump kernel, this code:
+ * 1. skips disabling the IOMMU's translating.
+ * 2. Do not re-enable IOMMU's translating.
+ * 3. In kdump kernel, use the old root entry table.
+ * 4. Allocate pages for new context entry, copy data from old context entries
+ * in the old kernel to the new ones.
+ *
+ * In other kinds of kernel, for example, a kernel started by kexec,
+ * do the same thing as crashdump kernel.
+ */
+
+
+
/*
* This domain is a statically identity mapping domain.
* 1. This domain creats a static 1:1 mapping to all usable memory.
--
2.0.0-rc0

2015-04-10 08:43:37

by Li, ZhenHua

[permalink] [raw]
Subject: [PATCH v10 03/10] iommu/vt-d: Function to get old context entry

Interface for when a new domain in the crashdump kernel needs some
values from the panicked kernel's context entries.

Signed-off-by: Li, Zhen-Hua <[email protected]>
---
drivers/iommu/intel-iommu.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 735e28f..ff5ac04 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -369,6 +369,10 @@ static inline int first_pte_in_page(struct dma_pte *pte)
*/


+static struct context_entry *device_to_existing_context_entry(
+ struct intel_iommu *iommu,
+ u8 bus, u8 devfn);
+

/*
* This domain is a statically identity mapping domain.
@@ -4793,3 +4797,23 @@ static void __init check_tylersburg_isoch(void)
printk(KERN_WARNING "DMAR: Recommended TLB entries for ISOCH unit is 16; your BIOS set %d\n",
vtisochctrl);
}
+
+static struct context_entry *device_to_existing_context_entry(
+ struct intel_iommu *iommu,
+ u8 bus, u8 devfn)
+{
+ struct root_entry *root;
+ struct context_entry *context;
+ struct context_entry *ret;
+ unsigned long flags;
+
+ ret = NULL;
+ spin_lock_irqsave(&iommu->lock, flags);
+ root = &iommu->root_entry[bus];
+ context = get_context_addr_from_root(root);
+ if (context && context_present(context+devfn))
+ ret = &context[devfn];
+ spin_unlock_irqrestore(&iommu->lock, flags);
+ return ret;
+}
+
--
2.0.0-rc0

2015-04-10 08:43:47

by Li, ZhenHua

[permalink] [raw]
Subject: [PATCH v10 04/10] iommu/vt-d: functions to copy data from old mem

Add some functions to copy the data from old kernel.
These functions are used to copy context tables and page tables.

To avoid calling iounmap between spin_lock_irqsave and spin_unlock_irqrestore,
use a link here, store the pointers , and then use iounmap to free them in
another place.

Li, Zhen-hua:
The functions and logics.

Takao Indoh:
Check if pfn is ram:
if (page_is_ram(pfn))

Signed-off-by: Li, Zhen-Hua <[email protected]>
Signed-off-by: Takao Indoh <[email protected]>
---
drivers/iommu/intel-iommu.c | 102 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/intel-iommu.h | 6 +++
2 files changed, 108 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index ff5ac04..5ba403a 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -373,6 +373,17 @@ static struct context_entry *device_to_existing_context_entry(
struct intel_iommu *iommu,
u8 bus, u8 devfn);

+/*
+ * A structure used to store the address allocated by ioremap();
+ * The we need to call iounmap() to free them out of spin_lock_irqsave/unlock;
+ */
+struct iommu_remapped_entry {
+ struct list_head list;
+ void __iomem *mem;
+};
+static LIST_HEAD(__iommu_remapped_mem);
+static DEFINE_MUTEX(__iommu_mem_list_lock);
+

/*
* This domain is a statically identity mapping domain.
@@ -4817,3 +4828,94 @@ static struct context_entry *device_to_existing_context_entry(
return ret;
}

+/*
+ * Copy memory from a physically-addressed area into a virtually-addressed area
+ */
+int __iommu_load_from_oldmem(void *to, unsigned long from, unsigned long size)
+{
+ unsigned long pfn; /* Page Frame Number */
+ size_t csize = (size_t)size; /* Num(bytes to copy) */
+ unsigned long offset; /* Lower 12 bits of to */
+ void __iomem *virt_mem;
+ struct iommu_remapped_entry *mapped;
+
+ pfn = from >> VTD_PAGE_SHIFT;
+ offset = from & (~VTD_PAGE_MASK);
+
+ if (page_is_ram(pfn)) {
+ memcpy(to, pfn_to_kaddr(pfn) + offset, csize);
+ } else{
+
+ mapped = kzalloc(sizeof(struct iommu_remapped_entry),
+ GFP_KERNEL);
+ if (!mapped)
+ return -ENOMEM;
+
+ virt_mem = ioremap_cache((unsigned long)from, size);
+ if (!virt_mem) {
+ kfree(mapped);
+ return -ENOMEM;
+ }
+ memcpy(to, virt_mem, size);
+
+ mutex_lock(&__iommu_mem_list_lock);
+ mapped->mem = virt_mem;
+ list_add_tail(&mapped->list, &__iommu_remapped_mem);
+ mutex_unlock(&__iommu_mem_list_lock);
+ }
+ return size;
+}
+
+/*
+ * Copy memory from a virtually-addressed area into a physically-addressed area
+ */
+int __iommu_save_to_oldmem(unsigned long to, void *from, unsigned long size)
+{
+ unsigned long pfn; /* Page Frame Number */
+ size_t csize = (size_t)size; /* Num(bytes to copy) */
+ unsigned long offset; /* Lower 12 bits of to */
+ void __iomem *virt_mem;
+ struct iommu_remapped_entry *mapped;
+
+ pfn = to >> VTD_PAGE_SHIFT;
+ offset = to & (~VTD_PAGE_MASK);
+
+ if (page_is_ram(pfn)) {
+ memcpy(pfn_to_kaddr(pfn) + offset, from, csize);
+ } else{
+ mapped = kzalloc(sizeof(struct iommu_remapped_entry),
+ GFP_KERNEL);
+ if (!mapped)
+ return -ENOMEM;
+
+ virt_mem = ioremap_cache((unsigned long)to, size);
+ if (!virt_mem) {
+ kfree(mapped);
+ return -ENOMEM;
+ }
+ memcpy(virt_mem, from, size);
+ mutex_lock(&__iommu_mem_list_lock);
+ mapped->mem = virt_mem;
+ list_add_tail(&mapped->list, &__iommu_remapped_mem);
+ mutex_unlock(&__iommu_mem_list_lock);
+ }
+ return size;
+}
+
+/*
+ * Free the mapped memory for ioremap;
+ */
+int __iommu_free_mapped_mem(void)
+{
+ struct iommu_remapped_entry *mem_entry, *tmp;
+
+ mutex_lock(&__iommu_mem_list_lock);
+ list_for_each_entry_safe(mem_entry, tmp, &__iommu_remapped_mem, list) {
+ iounmap(mem_entry->mem);
+ list_del(&mem_entry->list);
+ kfree(mem_entry);
+ }
+ mutex_unlock(&__iommu_mem_list_lock);
+ return 0;
+}
+
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index a65208a..4bca7b5 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -368,4 +368,10 @@ extern int dmar_ir_support(void);

extern const struct attribute_group *intel_iommu_groups[];

+extern int __iommu_load_from_oldmem(void *to, unsigned long from,
+ unsigned long size);
+extern int __iommu_save_to_oldmem(unsigned long to, void *from,
+ unsigned long size);
+extern int __iommu_free_mapped_mem(void);
+
#endif
--
2.0.0-rc0

2015-04-10 08:43:53

by Li, ZhenHua

[permalink] [raw]
Subject: [PATCH v10 05/10] iommu/vt-d: Add functions to load and save old re

Add functions to load root entry table from old kernel, and to save updated
root entry table.
Add two member in struct intel_iommu, to store the RTA in old kernel, and
the mapped virt address of it.

We use the old RTA in dump kernel, and when the iommu->root_entry is used as
a cache in kdump kernel, its phys address will not be save to RTA register,
but when its data is changed, we will save the new data to old root entry table.

Li, Zhen-hua:
The functions and logics.

Takao Indoh:
Add __iommu_flush_cache.

Signed-off-by: Li, Zhen-Hua <[email protected]>
Signed-off-by: Takao Indoh <[email protected]>
---
drivers/iommu/intel-iommu.c | 54 ++++++++++++++++++++++++++++++++++++++++++++-
include/linux/intel-iommu.h | 3 +++
2 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 5ba403a..ef8a99c 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -373,6 +373,10 @@ static struct context_entry *device_to_existing_context_entry(
struct intel_iommu *iommu,
u8 bus, u8 devfn);

+static void __iommu_load_old_root_entry(struct intel_iommu *iommu);
+
+static void __iommu_update_old_root_entry(struct intel_iommu *iommu, int index);
+
/*
* A structure used to store the address allocated by ioremap();
* The we need to call iounmap() to free them out of spin_lock_irqsave/unlock;
@@ -384,7 +388,6 @@ struct iommu_remapped_entry {
static LIST_HEAD(__iommu_remapped_mem);
static DEFINE_MUTEX(__iommu_mem_list_lock);

-
/*
* This domain is a statically identity mapping domain.
* 1. This domain creats a static 1:1 mapping to all usable memory.
@@ -4919,3 +4922,52 @@ int __iommu_free_mapped_mem(void)
return 0;
}

+/*
+ * Load the old root entry table to new root entry table.
+ */
+static void __iommu_load_old_root_entry(struct intel_iommu *iommu)
+{
+ if ((!iommu)
+ || (!iommu->root_entry)
+ || (!iommu->root_entry_old_virt)
+ || (!iommu->root_entry_old_phys))
+ return;
+ memcpy(iommu->root_entry, iommu->root_entry_old_virt, PAGE_SIZE);
+
+ __iommu_flush_cache(iommu, iommu->root_entry, PAGE_SIZE);
+}
+
+/*
+ * When the data in new root entry table is changed, this function
+ * must be called to save the updated data to old root entry table.
+ */
+static void __iommu_update_old_root_entry(struct intel_iommu *iommu, int index)
+{
+ u8 start;
+ unsigned long size;
+ void __iomem *to;
+ void *from;
+
+ if ((!iommu)
+ || (!iommu->root_entry)
+ || (!iommu->root_entry_old_virt)
+ || (!iommu->root_entry_old_phys))
+ return;
+
+ if (index < -1 || index >= ROOT_ENTRY_NR)
+ return;
+
+ if (index == -1) {
+ start = 0;
+ size = ROOT_ENTRY_NR * sizeof(struct root_entry);
+ } else {
+ start = index * sizeof(struct root_entry);
+ size = sizeof(struct root_entry);
+ }
+ to = iommu->root_entry_old_virt;
+ from = iommu->root_entry;
+ memcpy(to + start, from + start, size);
+
+ __iommu_flush_cache(iommu, to + start, size);
+}
+
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 4bca7b5..6fa0804 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -328,6 +328,9 @@ struct intel_iommu {
spinlock_t lock; /* protect context, domain ids */
struct root_entry *root_entry; /* virtual address */

+ void __iomem *root_entry_old_virt; /* mapped from old root entry */
+ unsigned long root_entry_old_phys; /* root entry in old kernel */
+
struct iommu_flush flush;
#endif
struct q_inval *qi; /* Queued invalidation info */
--
2.0.0-rc0

2015-04-10 08:44:01

by Li, ZhenHua

[permalink] [raw]
Subject: [PATCH v10 06/10] iommu/vt-d: datatypes and functions used for kdump

Populate it with support functions to copy iommu translation tables from
from the panicked kernel into the kdump kernel in the event of a crash.

Functions:
Use old root entry table, and load the old data to root_entry as cache.
Malloc new context table and copy old context table to the new one.

Bill Sumner:
Original version, the creation of the data types and functions.

Li, Zhenhua:
Create new function iommu_check_pre_te_status() to check status.
Update the caller of context_get_* and context_put*, use context_*
and context_set_* for replacement.
Update the name of the function that loads root entry table.
Use new function to copy old context entry tables and page tables.
Use "unsigned long" for physical address.
Remove the functions to copy page table in Bill's version.
Remove usage of dve and ppap in Bill's version.

Signed-off-by: Bill Sumner <[email protected]>
Signed-off-by: Li, Zhen-Hua <[email protected]>
---
drivers/iommu/intel-iommu.c | 121 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/intel-iommu.h | 3 ++
2 files changed, 124 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index ef8a99c..78c1d65 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -388,6 +388,18 @@ struct iommu_remapped_entry {
static LIST_HEAD(__iommu_remapped_mem);
static DEFINE_MUTEX(__iommu_mem_list_lock);

+/* ========================================================================
+ * Copy iommu translation tables from old kernel into new kernel.
+ * Entry to this set of functions is: intel_iommu_load_translation_tables()
+ * ------------------------------------------------------------------------
+ */
+
+static int copy_root_entry_table(struct intel_iommu *iommu);
+
+static int intel_iommu_load_translation_tables(struct intel_iommu *iommu);
+
+static void iommu_check_pre_te_status(struct intel_iommu *iommu);
+
/*
* This domain is a statically identity mapping domain.
* 1. This domain creats a static 1:1 mapping to all usable memory.
@@ -4971,3 +4983,112 @@ static void __iommu_update_old_root_entry(struct intel_iommu *iommu, int index)
__iommu_flush_cache(iommu, to + start, size);
}

+/*
+ * Load root entry tables from old kernel.
+ */
+static int copy_root_entry_table(struct intel_iommu *iommu)
+{
+ u32 bus; /* Index: root-entry-table */
+ struct root_entry *re; /* Virt(iterator: new table) */
+ unsigned long context_old_phys; /* Phys(context table entry) */
+ struct context_entry *context_new_virt; /* Virt(new context_entry) */
+
+ /*
+ * A new root entry table has been allocated ,
+ * we need copy re from old kernel to the new allocated one.
+ */
+
+ if (!iommu->root_entry_old_phys)
+ return -ENOMEM;
+
+ for (bus = 0, re = iommu->root_entry; bus < 256; bus += 1, re += 1) {
+ if (!root_present(re))
+ continue;
+
+ context_old_phys = get_context_phys_from_root(re);
+
+ if (!context_old_phys)
+ continue;
+
+ context_new_virt =
+ (struct context_entry *)alloc_pgtable_page(iommu->node);
+
+ if (!context_new_virt)
+ return -ENOMEM;
+
+ __iommu_load_from_oldmem(context_new_virt,
+ context_old_phys,
+ VTD_PAGE_SIZE);
+
+ __iommu_flush_cache(iommu, context_new_virt, VTD_PAGE_SIZE);
+
+ set_root_value(re, virt_to_phys(context_new_virt));
+ }
+
+ return 0;
+}
+
+/*
+ * Interface to the "load translation tables" set of functions
+ * from mainline code.
+ */
+static int intel_iommu_load_translation_tables(struct intel_iommu *iommu)
+{
+ unsigned long long q; /* quadword scratch */
+ int ret = 0; /* Integer return code */
+ unsigned long flags;
+
+ q = dmar_readq(iommu->reg + DMAR_RTADDR_REG);
+ if (!q)
+ return -1;
+
+ spin_lock_irqsave(&iommu->lock, flags);
+
+ /* Load the root-entry table from the old kernel
+ * foreach context_entry_table in root_entry
+ * Copy each entry table from old kernel
+ */
+ if (!iommu->root_entry) {
+ iommu->root_entry =
+ (struct root_entry *)alloc_pgtable_page(iommu->node);
+ if (!iommu->root_entry) {
+ spin_unlock_irqrestore(&iommu->lock, flags);
+ return -ENOMEM;
+ }
+ }
+
+ iommu->root_entry_old_phys = q & VTD_PAGE_MASK;
+ if (!iommu->root_entry_old_phys) {
+ pr_err("Could not read old root entry address.");
+ return -1;
+ }
+
+ iommu->root_entry_old_virt = ioremap_cache(iommu->root_entry_old_phys,
+ VTD_PAGE_SIZE);
+ if (!iommu->root_entry_old_virt) {
+ pr_err("Could not map the old root entry.");
+ return -ENOMEM;
+ }
+
+ __iommu_load_old_root_entry(iommu);
+ ret = copy_root_entry_table(iommu);
+ __iommu_flush_cache(iommu, iommu->root_entry, PAGE_SIZE);
+ __iommu_update_old_root_entry(iommu, -1);
+
+ spin_unlock_irqrestore(&iommu->lock, flags);
+
+ __iommu_free_mapped_mem();
+
+ return ret;
+}
+
+static void iommu_check_pre_te_status(struct intel_iommu *iommu)
+{
+ u32 sts;
+
+ sts = readl(iommu->reg + DMAR_GSTS_REG);
+ if (sts & DMA_GSTS_TES) {
+ pr_info("Translation is enabled prior to OS.\n");
+ iommu->pre_enabled_trans = 1;
+ }
+}
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 6fa0804..9ca025a 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -328,6 +328,9 @@ struct intel_iommu {
spinlock_t lock; /* protect context, domain ids */
struct root_entry *root_entry; /* virtual address */

+ /* whether translation is enabled prior to OS*/
+ u8 pre_enabled_trans;
+
void __iomem *root_entry_old_virt; /* mapped from old root entry */
unsigned long root_entry_old_phys; /* root entry in old kernel */

--
2.0.0-rc0

2015-04-10 08:45:12

by Li, ZhenHua

[permalink] [raw]
Subject: [PATCH v10 07/10] iommu/vt-d: enable kdump support in iommu module

Modify the operation of the following functions when called during crash dump:
device_to_context_entry
free_context_table
get_domain_for_dev
init_dmars
intel_iommu_init

Bill Sumner:
Original version.

Zhenhua:
The name of new calling functions.
Do not disable and re-enable TE in kdump kernel.
Use the did and gaw from old context entry;

Signed-off-by: Bill Sumner <[email protected]>
Signed-off-by: Li, Zhen-Hua <[email protected]>
---
drivers/iommu/intel-iommu.c | 94 +++++++++++++++++++++++++++++++++++++++------
1 file changed, 82 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 78c1d65..3d4ea43 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -399,6 +399,7 @@ static int copy_root_entry_table(struct intel_iommu *iommu);
static int intel_iommu_load_translation_tables(struct intel_iommu *iommu);

static void iommu_check_pre_te_status(struct intel_iommu *iommu);
+static u8 g_translation_pre_enabled;

/*
* This domain is a statically identity mapping domain.
@@ -839,6 +840,9 @@ static struct context_entry * device_to_context_entry(struct intel_iommu *iommu,
set_root_value(root, phy_addr);
set_root_present(root);
__iommu_flush_cache(iommu, root, sizeof(*root));
+
+ if (iommu->pre_enabled_trans)
+ __iommu_update_old_root_entry(iommu, bus);
}
spin_unlock_irqrestore(&iommu->lock, flags);
return &context[devfn];
@@ -890,7 +894,8 @@ static void free_context_table(struct intel_iommu *iommu)

spin_lock_irqsave(&iommu->lock, flags);
if (!iommu->root_entry) {
- goto out;
+ spin_unlock_irqrestore(&iommu->lock, flags);
+ return;
}
for (i = 0; i < ROOT_ENTRY_NR; i++) {
root = &iommu->root_entry[i];
@@ -898,10 +903,23 @@ static void free_context_table(struct intel_iommu *iommu)
if (context)
free_pgtable_page(context);
}
+
+ if (iommu->pre_enabled_trans) {
+ iommu->root_entry_old_phys = 0;
+ root = iommu->root_entry_old_virt;
+ iommu->root_entry_old_virt = NULL;
+ }
+
free_pgtable_page(iommu->root_entry);
iommu->root_entry = NULL;
-out:
+
spin_unlock_irqrestore(&iommu->lock, flags);
+
+ /* We put this out of spin_unlock is because iounmap() may
+ * cause error if surrounded by spin_lock and unlock;
+ */
+ if (iommu->pre_enabled_trans)
+ iounmap(root);
}

static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
@@ -2319,6 +2337,7 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
unsigned long flags;
u8 bus, devfn;
int did = -1; /* Default to "no domain_id supplied" */
+ struct context_entry *ce = NULL;

domain = find_domain(dev);
if (domain)
@@ -2352,6 +2371,20 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
domain = alloc_domain(0);
if (!domain)
return NULL;
+
+ if (iommu->pre_enabled_trans) {
+ /*
+ * if this device had a did in the old kernel
+ * use its values instead of generating new ones
+ */
+ ce = device_to_existing_context_entry(iommu, bus, devfn);
+
+ if (ce) {
+ did = context_domain_id(ce);
+ gaw = agaw_to_width(context_address_width(ce));
+ }
+ }
+
domain->id = iommu_attach_domain_with_id(domain, iommu, did);
if (domain->id < 0) {
free_domain_mem(domain);
@@ -2879,6 +2912,7 @@ static int __init init_dmars(void)
goto free_g_iommus;
}

+ g_translation_pre_enabled = 0; /* To know whether to skip RMRR */
for_each_active_iommu(iommu, drhd) {
g_iommus[iommu->seq_id] = iommu;

@@ -2886,14 +2920,30 @@ static int __init init_dmars(void)
if (ret)
goto free_iommu;

- /*
- * TBD:
- * we could share the same root & context tables
- * among all IOMMU's. Need to Split it later.
- */
- ret = iommu_alloc_root_entry(iommu);
- if (ret)
- goto free_iommu;
+ iommu_check_pre_te_status(iommu);
+ if (iommu->pre_enabled_trans) {
+ pr_info("IOMMU Copying translate tables from panicked kernel\n");
+ ret = intel_iommu_load_translation_tables(iommu);
+ if (ret) {
+ pr_err("IOMMU: Copy translate tables failed\n");
+
+ /* Best to stop trying */
+ goto free_iommu;
+ }
+ pr_info("IOMMU: root_cache:0x%12.12llx phys:0x%12.12llx\n",
+ (u64)iommu->root_entry,
+ (u64)iommu->root_entry_old_phys);
+ } else {
+ /*
+ * TBD:
+ * we could share the same root & context tables
+ * among all IOMMU's. Need to Split it later.
+ */
+ ret = iommu_alloc_root_entry(iommu);
+ if (ret)
+ goto free_iommu;
+ }
+
if (!ecap_pass_through(iommu->ecap))
hw_pass_through = 0;
}
@@ -2911,6 +2961,14 @@ static int __init init_dmars(void)
check_tylersburg_isoch();

/*
+ * In the second kernel: Skip setting-up new domains for
+ * si, rmrr, and the isa bus on the expectation that these
+ * translations were copied from the old kernel.
+ */
+ if (g_translation_pre_enabled)
+ goto skip_new_domains_for_si_rmrr_isa;
+
+ /*
* If pass through is not set or not enabled, setup context entries for
* identity mappings for rmrr, gfx, and isa and may fall back to static
* identity mapping if iommu_identity_mapping is set.
@@ -2950,6 +3008,8 @@ static int __init init_dmars(void)

iommu_prepare_isa();

+skip_new_domains_for_si_rmrr_isa:;
+
/*
* for each drhd
* enable fault log
@@ -2978,7 +3038,13 @@ static int __init init_dmars(void)

iommu->flush.flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL);
iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
- iommu_enable_translation(iommu);
+
+ if (iommu->pre_enabled_trans) {
+ if (!(iommu->gcmd & DMA_GCMD_TE))
+ iommu_enable_translation(iommu);
+ } else
+ iommu_enable_translation(iommu);
+
iommu_disable_protect_mem_regions(iommu);
}

@@ -4264,11 +4330,14 @@ int __init intel_iommu_init(void)
}

/*
- * Disable translation if already enabled prior to OS handover.
+ * We do not need to disable translation if already enabled prior
+ * to OS handover. Because we will try to copy old tables;
*/
+ /*
for_each_active_iommu(iommu, drhd)
if (iommu->gcmd & DMA_GCMD_TE)
iommu_disable_translation(iommu);
+ */

if (dmar_dev_scope_init() < 0) {
if (force_on)
@@ -5090,5 +5159,6 @@ static void iommu_check_pre_te_status(struct intel_iommu *iommu)
if (sts & DMA_GSTS_TES) {
pr_info("Translation is enabled prior to OS.\n");
iommu->pre_enabled_trans = 1;
+ g_translation_pre_enabled = 1;
}
}
--
2.0.0-rc0

2015-04-10 08:44:12

by Li, ZhenHua

[permalink] [raw]
Subject: [PATCH v10 08/10] iommu/vt-d: assign new page table for dma_map

When a device driver issues the first dma_map command for a device, we
assign a new and empty page-table, thus removing all mappings from the
old kernel for the device.

Signed-off-by: Li, Zhen-Hua <[email protected]>
---
drivers/iommu/intel-iommu.c | 58 ++++++++++++++++++++++++++++++++++++++-------
1 file changed, 50 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 3d4ea43..a874426 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -398,6 +398,9 @@ static int copy_root_entry_table(struct intel_iommu *iommu);

static int intel_iommu_load_translation_tables(struct intel_iommu *iommu);

+static void unmap_device_dma(struct dmar_domain *domain,
+ struct device *dev,
+ struct intel_iommu *iommu);
static void iommu_check_pre_te_status(struct intel_iommu *iommu);
static u8 g_translation_pre_enabled;

@@ -3096,6 +3099,7 @@ static struct iova *intel_alloc_iova(struct device *dev,
static struct dmar_domain *__get_valid_domain_for_dev(struct device *dev)
{
struct dmar_domain *domain;
+ struct intel_iommu *iommu;
int ret;

domain = get_domain_for_dev(dev, DEFAULT_DOMAIN_ADDRESS_WIDTH);
@@ -3105,14 +3109,30 @@ static struct dmar_domain *__get_valid_domain_for_dev(struct device *dev)
return NULL;
}

- /* make sure context mapping is ok */
- if (unlikely(!domain_context_mapped(dev))) {
- ret = domain_context_mapping(domain, dev, CONTEXT_TT_MULTI_LEVEL);
- if (ret) {
- printk(KERN_ERR "Domain context map for %s failed",
- dev_name(dev));
- return NULL;
- }
+ /* if in kdump kernel, we need to unmap the mapped dma pages,
+ * detach this device first.
+ */
+ if (likely(domain_context_mapped(dev))) {
+ iommu = domain_get_iommu(domain);
+ if (iommu->pre_enabled_trans) {
+ unmap_device_dma(domain, dev, iommu);
+
+ domain = get_domain_for_dev(dev,
+ DEFAULT_DOMAIN_ADDRESS_WIDTH);
+ if (!domain) {
+ pr_err("Allocating domain for %s failed",
+ dev_name(dev));
+ return NULL;
+ }
+ } else
+ return domain;
+ }
+
+ ret = domain_context_mapping(domain, dev, CONTEXT_TT_MULTI_LEVEL);
+ if (ret) {
+ pr_err("Domain context map for %s failed",
+ dev_name(dev));
+ return NULL;
}

return domain;
@@ -5151,6 +5171,28 @@ static int intel_iommu_load_translation_tables(struct intel_iommu *iommu)
return ret;
}

+static void unmap_device_dma(struct dmar_domain *domain,
+ struct device *dev,
+ struct intel_iommu *iommu)
+{
+ struct context_entry *ce;
+ struct iova *iova;
+ phys_addr_t phys_addr;
+ dma_addr_t dev_addr;
+ struct pci_dev *pdev;
+
+ pdev = to_pci_dev(dev);
+ ce = device_to_context_entry(iommu, pdev->bus->number, pdev->devfn);
+ phys_addr = context_address_root(ce) << VTD_PAGE_SHIFT;
+ dev_addr = phys_to_dma(dev, phys_addr);
+
+ iova = find_iova(&domain->iovad, IOVA_PFN(dev_addr));
+ if (iova)
+ intel_unmap(dev, dev_addr);
+
+ domain_remove_one_dev_info(domain, dev);
+}
+
static void iommu_check_pre_te_status(struct intel_iommu *iommu)
{
u32 sts;
--
2.0.0-rc0

2015-04-10 08:44:19

by Li, ZhenHua

[permalink] [raw]
Subject: [PATCH v10 09/10] iommu/vt-d: Copy functions for irte

Functions to copy the irte data from the old kernel into the kdump kernel.

Signed-off-by: Li, Zhen-Hua <[email protected]>
---
drivers/iommu/intel_irq_remapping.c | 68 +++++++++++++++++++++++++++++++++++++
include/linux/intel-iommu.h | 4 +++
2 files changed, 72 insertions(+)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 14de1ab..e6426a0 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -17,6 +17,10 @@

#include "irq_remapping.h"

+static int __iommu_load_old_irte(struct intel_iommu *iommu);
+static int __iommu_update_old_irte(struct intel_iommu *iommu, int index);
+static void iommu_check_pre_ir_status(struct intel_iommu *iommu);
+
struct ioapic_scope {
struct intel_iommu *iommu;
unsigned int id;
@@ -1302,3 +1306,67 @@ int dmar_ir_hotplug(struct dmar_drhd_unit *dmaru, bool insert)

return ret;
}
+
+static int __iommu_load_old_irte(struct intel_iommu *iommu)
+{
+ if ((!iommu)
+ || (!iommu->ir_table)
+ || (!iommu->ir_table->base)
+ || (!iommu->ir_table->base_old_phys)
+ || (!iommu->ir_table->base_old_virt))
+ return -1;
+
+ memcpy(iommu->ir_table->base,
+ iommu->ir_table->base_old_virt,
+ INTR_REMAP_TABLE_ENTRIES*sizeof(struct irte));
+
+ __iommu_flush_cache(iommu, iommu->ir_table->base,
+ INTR_REMAP_TABLE_ENTRIES*sizeof(struct irte));
+
+ return 0;
+}
+
+static int __iommu_update_old_irte(struct intel_iommu *iommu, int index)
+{
+ int start;
+ unsigned long size;
+ void __iomem *to;
+ void *from;
+
+ if ((!iommu)
+ || (!iommu->ir_table)
+ || (!iommu->ir_table->base)
+ || (!iommu->ir_table->base_old_phys)
+ || (!iommu->ir_table->base_old_virt))
+ return -1;
+
+ if (index < -1 || index >= INTR_REMAP_TABLE_ENTRIES)
+ return -1;
+
+ if (index == -1) {
+ start = 0;
+ size = INTR_REMAP_TABLE_ENTRIES * sizeof(struct irte);
+ } else {
+ start = index * sizeof(struct irte);
+ size = sizeof(struct irte);
+ }
+
+ to = iommu->ir_table->base_old_virt;
+ from = iommu->ir_table->base;
+ memcpy(to + start, from + start, size);
+
+ __iommu_flush_cache(iommu, to + start, size);
+
+ return 0;
+}
+
+static void iommu_check_pre_ir_status(struct intel_iommu *iommu)
+{
+ u32 sts;
+
+ sts = readl(iommu->reg + DMAR_GSTS_REG);
+ if (sts & DMA_GSTS_IRES) {
+ pr_info("IR is enabled prior to OS.\n");
+ iommu->pre_enabled_ir = 1;
+ }
+}
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 9ca025a..d7ee5ba 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -289,6 +289,8 @@ struct q_inval {
struct ir_table {
struct irte *base;
unsigned long *bitmap;
+ void __iomem *base_old_virt;
+ unsigned long base_old_phys;
};
#endif

@@ -330,6 +332,8 @@ struct intel_iommu {

/* whether translation is enabled prior to OS*/
u8 pre_enabled_trans;
+ /* whether interrupt remapping is enabled prior to OS*/
+ u8 pre_enabled_ir;

void __iomem *root_entry_old_virt; /* mapped from old root entry */
unsigned long root_entry_old_phys; /* root entry in old kernel */
--
2.0.0-rc0

2015-04-10 08:44:28

by Li, ZhenHua

[permalink] [raw]
Subject: [PATCH v10 10/10] iommu/vt-d: Use old irte in kdump kernel

Fix the intr-remapping fault.

[1.594890] dmar: DRHD: handling fault status reg 2
[1.594894] dmar: INTR-REMAP: Request device [[41:00.0] fault index 4d
[1.594894] INTR-REMAP:[fault reason 34] Present field in the IRTE entry
is clear

Use old irte in second kernel, do not disable and re-enable interrupt
remapping.

Signed-off-by: Li, Zhen-Hua <[email protected]>
---
drivers/iommu/intel_irq_remapping.c | 28 ++++++++++++++++++++++++----
1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index e6426a0..02615a6 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -197,6 +197,10 @@ static int modify_irte(int irq, struct irte *irte_modified)

set_64bit(&irte->low, irte_modified->low);
set_64bit(&irte->high, irte_modified->high);
+
+ if (iommu->pre_enabled_ir)
+ __iommu_update_old_irte(iommu, index);
+
__iommu_flush_cache(iommu, irte, sizeof(*irte));

rc = qi_flush_iec(iommu, index, 0);
@@ -258,6 +262,9 @@ static int clear_entries(struct irq_2_iommu *irq_iommu)
bitmap_release_region(iommu->ir_table->bitmap, index,
irq_iommu->irte_mask);

+ if (iommu->pre_enabled_ir)
+ __iommu_update_old_irte(iommu, -1);
+
return qi_flush_iec(iommu, index, irq_iommu->irte_mask);
}

@@ -660,11 +667,13 @@ static int __init intel_enable_irq_remapping(void)
*/
dmar_fault(-1, iommu);

+ iommu_check_pre_ir_status(iommu);
+
/*
- * Disable intr remapping and queued invalidation, if already
- * enabled prior to OS handover.
+ * Here we do not disable intr remapping,
+ * if already enabled prior to OS handover.
*/
- iommu_disable_irq_remapping(iommu);
+ /* iommu_disable_irq_remapping(iommu); */

dmar_disable_qi(iommu);
}
@@ -700,7 +709,18 @@ static int __init intel_enable_irq_remapping(void)
* Setup Interrupt-remapping for all the DRHD's now.
*/
for_each_iommu(iommu, drhd) {
- iommu_set_irq_remapping(iommu, eim);
+ if (iommu->pre_enabled_ir) {
+ unsigned long long q;
+
+ q = dmar_readq(iommu->reg + DMAR_IRTA_REG);
+ iommu->ir_table->base_old_phys = q & VTD_PAGE_MASK;
+ iommu->ir_table->base_old_virt = ioremap_cache(
+ iommu->ir_table->base_old_phys,
+ INTR_REMAP_TABLE_ENTRIES*sizeof(struct irte));
+ __iommu_load_old_irte(iommu);
+ } else
+ iommu_set_irq_remapping(iommu, eim);
+
setup = 1;
}

--
2.0.0-rc0

2015-04-15 00:58:34

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

On 04/10/15 at 04:42pm, Li, Zhen-Hua wrote:
> This patchset is an update of Bill Sumner's patchset, implements a fix for:
> If a kernel boots with intel_iommu=on on a system that supports intel vt-d,
> when a panic happens, the kdump kernel will boot with these faults:
>
> dmar: DRHD: handling fault status reg 102
> dmar: DMAR:[DMA Read] Request device [01:00.0] fault addr fff80000
> DMAR:[fault reason 01] Present bit in root entry is clear
>
> dmar: DRHD: handling fault status reg 2
> dmar: INTR-REMAP: Request device [[61:00.0] fault index 42
> INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
>
> On some system, the interrupt remapping fault will also happen even if the
> intel_iommu is not set to on, because the interrupt remapping will be enabled
> when x2apic is needed by the system.
>
> The cause of the DMA fault is described in Bill's original version, and the
> INTR-Remap fault is caused by a similar reason. In short, the initialization
> of vt-d drivers causes the in-flight DMA and interrupt requests get wrong
> response.
>
> To fix this problem, we modifies the behaviors of the intel vt-d in the
> crashdump kernel:
>
> For DMA Remapping:
> 1. To accept the vt-d hardware in an active state,
> 2. Do not disable and re-enable the translation, keep it enabled.
> 3. Use the old root entry table, do not rewrite the RTA register.
> 4. Malloc and use new context entry table, copy data from the old ones that
> used by the old kernel.
> 5. Keep using the old page tables before driver is loaded.
> 6. After device driver is loaded, when it issues the first dma_map command,
> free the dmar_domain structure for this device, and generate a new one, so
> that the device can be assigned a new and empty page table.
> 7. When a new context entry table is generated, we also save its address to
> the old root entry table.
>
> For Interrupt Remapping:
> 1. To accept the vt-d hardware in an active state,
> 2. Do not disable and re-enable the interrupt remapping, keep it enabled.
> 3. Use the old interrupt remapping table, do not rewrite the IRTA register.
> 4. When ioapic entry is setup, the interrupt remapping table is changed, and
> the updated data will be stored to the old interrupt remapping table.
>
> Advantages of this approach:
> 1. All manipulation of the IO-device is done by the Linux device-driver
> for that device.
> 2. This approach behaves in a manner very similar to operation without an
> active iommu.
> 3. Any activity between the IO-device and its RMRR areas is handled by the
> device-driver in the same manner as during a non-kdump boot.
> 4. If an IO-device has no driver in the kdump kernel, it is simply left alone.
> This supports the practice of creating a special kdump kernel without
> drivers for any devices that are not required for taking a crashdump.
> 5. Minimal code-changes among the existing mainline intel vt-d code.
>
> Summary of changes in this patch set:
> 1. Added some useful function for root entry table in code intel-iommu.c
> 2. Added new members to struct root_entry and struct irte;
> 3. Functions to load old root entry table to iommu->root_entry from the memory
> of old kernel.
> 4. Functions to malloc new context entry table and copy the data from the old
> ones to the malloced new ones.
> 5. Functions to enable support for DMA remapping in kdump kernel.
> 6. Functions to load old irte data from the old kernel to the kdump kernel.
> 7. Some code changes that support other behaviours that have been listed.
> 8. In the new functions, use physical address as "unsigned long" type, not
> pointers.
>
> Original version by Bill Sumner:
> https://lkml.org/lkml/2014/1/10/518
> https://lkml.org/lkml/2014/4/15/716
> https://lkml.org/lkml/2014/4/24/836
>
> Zhenhua's updates:
> https://lkml.org/lkml/2014/10/21/134
> https://lkml.org/lkml/2014/12/15/121
> https://lkml.org/lkml/2014/12/22/53
> https://lkml.org/lkml/2015/1/6/1166
> https://lkml.org/lkml/2015/1/12/35
> https://lkml.org/lkml/2015/3/19/33
>
> Changelog[v10]:
> 1. Do not use CONFIG_CRASH_DUMP and is_kdump_kernel().
> Use one flag which stores the te and ir status in last kernel:
> iommu->pre_enabled_trans
> iommu->pre_enabled_ir
>
> Changelog[v9]:
> 1. Add new function iommu_attach_domain_with_id.
> 2. Do not copy old page tables, keep using the old ones.
> 3. Remove functions:
> intel_iommu_did_to_domain_values_entry
> intel_iommu_get_dids_from_old_kernel
> device_to_domain_id
> copy_page_addr
> copy_page_table
> copy_context_entry
> copy_context_entry_table
> 4. Add new function device_to_existing_context_entry.
>
> Changelog[v8]:
> 1. Add a missing __iommu_flush_cache in function copy_page_table.
>
> Changelog[v7]:
> 1. Use __iommu_flush_cache to flush the data to hardware.
>
> Changelog[v6]:
> 1. Use "unsigned long" as type of physical address.
> 2. Use new function unmap_device_dma to unmap the old dma.
> 3. Some small incorrect bits order for aw shift.
>
> Changelog[v5]:
> 1. Do not disable and re-enable traslation and interrupt remapping.
> 2. Use old root entry table.
> 3. Use old interrupt remapping table.
> 4. New functions to copy data from old kernel, and save to old kernel mem.
> 5. New functions to save updated root entry table and irte table.
> 6. Use intel_unmap to unmap the old dma;
> 7. Allocate new pages while driver is being loaded.
>
> Changelog[v4]:
> 1. Cut off the patches that move some defines and functions to new files.
> 2. Reduce the numbers of patches to five, make it more easier to read.
> 3. Changed the name of functions, make them consistent with current context
> get/set functions.
> 4. Add change to function __iommu_attach_domain.
>
> Changelog[v3]:
> 1. Commented-out "#define DEBUG 1" to eliminate debug messages.
> 2. Updated the comments about changes in each version.
> 3. Fixed: one-line added to Copy-Translations patch to initialize the iovad
> struct as recommended by Baoquan He [[email protected]]
> init_iova_domain(&domain->iovad, DMA_32BIT_PFN);
>
> Changelog[v2]:
> The following series implements a fix for:
> A kdump problem about DMA that has been discussed for a long time. That is,
> when a kernel panics and boots into the kdump kernel, DMA started by the
> panicked kernel is not stopped before the kdump kernel is booted and the
> kdump kernel disables the IOMMU while this DMA continues. This causes the
> IOMMU to stop translating the DMA addresses as IOVAs and begin to treat
> them as physical memory addresses -- which causes the DMA to either:
> (1) generate DMAR errors or
> (2) generate PCI SERR errors or
> (3) transfer data to or from incorrect areas of memory. Often this
> causes the dump to fail.
>
> Changelog[v1]:
> The original version.
>
> Changed in this version:
> 1. Do not disable and re-enable traslation and interrupt remapping.
> 2. Use old root entry table.
> 3. Use old interrupt remapping table.
> 4. Use "unsigned long" as physical address.
> 5. Use intel_unmap to unmap the old dma;
>
> Baoquan He <[email protected]> helps testing this patchset.
> Takao Indoh <[email protected]> gives valuable suggestions.
>
> Li, Zhen-Hua (10):
> iommu/vt-d: New function to attach domain with id
> iommu/vt-d: Items required for kdump
> iommu/vt-d: Function to get old context entry
> iommu/vt-d: functions to copy data from old mem
> iommu/vt-d: Add functions to load and save old re
> iommu/vt-d: datatypes and functions used for kdump
> iommu/vt-d: enable kdump support in iommu module
> iommu/vt-d: assign new page table for dma_map
> iommu/vt-d: Copy functions for irte
> iommu/vt-d: Use old irte in kdump kernel
>
> drivers/iommu/intel-iommu.c | 518 ++++++++++++++++++++++++++++++++++--
> drivers/iommu/intel_irq_remapping.c | 96 ++++++-
> include/linux/intel-iommu.h | 16 ++
> 3 files changed, 605 insertions(+), 25 deletions(-)
>
> --
> 2.0.0-rc0


Again, I think it is bad to use old page table, below issues need consider:
1) make sure old page table are reliable across crash
2) do not allow writing oldmem after crash

Please correct me if I'm wrong, or if above is not doable I think I will vote for
resetting pci bus.

Thanks
Dave

2015-04-15 05:48:05

by Li, ZhenHua

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

On 04/15/2015 08:57 AM, Dave Young wrote:
> Again, I think it is bad to use old page table, below issues need consider:
> 1) make sure old page table are reliable across crash
> 2) do not allow writing oldmem after crash
>
> Please correct me if I'm wrong, or if above is not doable I think I will vote for
> resetting pci bus.
>
> Thanks
> Dave
>
Hi Dave,

When updating the context tables, we have to write their address to root
tables, this will cause writing to old mem.

Resetting the pci bus has been discussed, please check this:
http://lists.infradead.org/pipermail/kexec/2014-October/012752.html
https://lkml.org/lkml/2014/10/21/890

Thanks
Zhenhua


2015-04-15 06:48:26

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

On 04/15/15 at 01:47pm, Li, ZhenHua wrote:
> On 04/15/2015 08:57 AM, Dave Young wrote:
> >Again, I think it is bad to use old page table, below issues need consider:
> >1) make sure old page table are reliable across crash
> >2) do not allow writing oldmem after crash
> >
> >Please correct me if I'm wrong, or if above is not doable I think I will vote for
> >resetting pci bus.
> >
> >Thanks
> >Dave
> >
> Hi Dave,
>
> When updating the context tables, we have to write their address to root
> tables, this will cause writing to old mem.
>
> Resetting the pci bus has been discussed, please check this:
> http://lists.infradead.org/pipermail/kexec/2014-October/012752.html
> https://lkml.org/lkml/2014/10/21/890

I know one reason to use old pgtable is this looks better because it fixes the
real problem, but it is not a good way if it introduce more problems because of
it have to use oldmem. I will be glad if this is not a problem but I have not
been convinced.

OTOH, there's many types of iommu, intel, amd, a lot of other types. They need
their own fixes, so it looks not that elegant.

For pci reset, it is not perfect, but it has another advantage, the patch is
simpler. The problem I see from the old discusssion is, reset bus in 2nd kernel
is acceptable but it does not fix things on sparc platform. AFAIK current reported
problems are intel and amd iommu, at least pci reset stuff does not make it worse.

Thanks
Dave

2015-04-21 01:40:34

by Li, ZhenHua

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

Hi Dave,
I found the old mail:
http://lkml.iu.edu/hypermail/linux/kernel/1410.2/03584.html

Please check this and you will find the discussion.

Regards
Zhenhua

On 04/15/2015 02:48 PM, Dave Young wrote:
> On 04/15/15 at 01:47pm, Li, ZhenHua wrote:
>> On 04/15/2015 08:57 AM, Dave Young wrote:
>>> Again, I think it is bad to use old page table, below issues need consider:
>>> 1) make sure old page table are reliable across crash
>>> 2) do not allow writing oldmem after crash
>>>
>>> Please correct me if I'm wrong, or if above is not doable I think I will vote for
>>> resetting pci bus.
>>>
>>> Thanks
>>> Dave
>>>
>> Hi Dave,
>>
>> When updating the context tables, we have to write their address to root
>> tables, this will cause writing to old mem.
>>
>> Resetting the pci bus has been discussed, please check this:
>> http://lists.infradead.org/pipermail/kexec/2014-October/012752.html
>> https://lkml.org/lkml/2014/10/21/890
>
> I know one reason to use old pgtable is this looks better because it fixes the
> real problem, but it is not a good way if it introduce more problems because of
> it have to use oldmem. I will be glad if this is not a problem but I have not
> been convinced.
>
> OTOH, there's many types of iommu, intel, amd, a lot of other types. They need
> their own fixes, so it looks not that elegant.
>
> For pci reset, it is not perfect, but it has another advantage, the patch is
> simpler. The problem I see from the old discusssion is, reset bus in 2nd kernel
> is acceptable but it does not fix things on sparc platform. AFAIK current reported
> problems are intel and amd iommu, at least pci reset stuff does not make it worse.
>
> Thanks
> Dave
>

2015-04-21 02:54:18

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

Hi,

On 04/21/15 at 09:39am, Li, ZhenHua wrote:
> Hi Dave,
> I found the old mail:
> http://lkml.iu.edu/hypermail/linux/kernel/1410.2/03584.html

I know and I have read it before.

================== quote ===================
> > > So with this in mind I would prefer initially taking over the
> > > page-tables from the old kernel before the device drivers re-initialize
> > > the devices.
> >
> > This makes the dump kernel more dependent on data from the old kernel,
> > which we obviously want to avoid when possible.

> Sure, but this is not really possible here (unless we have a generic and
> reliable way to reset all PCI endpoint devices and cancel all in-flight
> DMA before we disable the IOMMU in the kdump kernel).
> Otherwise we always risk data corruption somewhere, in system memory or
> on disk.
================= quote ====================

What I understand above is it is not really possible to avoid the problem.

But IMHO we should avoid it or we will have problems in the future, if we
really cannot avoid it I would say switching to pci reset way is better.

>
> Please check this and you will find the discussion.
>
> Regards
> Zhenhua
>
> On 04/15/2015 02:48 PM, Dave Young wrote:
> >On 04/15/15 at 01:47pm, Li, ZhenHua wrote:
> >>On 04/15/2015 08:57 AM, Dave Young wrote:
> >>>Again, I think it is bad to use old page table, below issues need consider:
> >>>1) make sure old page table are reliable across crash
> >>>2) do not allow writing oldmem after crash
> >>>
> >>>Please correct me if I'm wrong, or if above is not doable I think I will vote for
> >>>resetting pci bus.
> >>>
> >>>Thanks
> >>>Dave
> >>>
> >>Hi Dave,
> >>
> >>When updating the context tables, we have to write their address to root
> >>tables, this will cause writing to old mem.
> >>
> >>Resetting the pci bus has been discussed, please check this:
> >>http://lists.infradead.org/pipermail/kexec/2014-October/012752.html
> >>https://lkml.org/lkml/2014/10/21/890
> >
> >I know one reason to use old pgtable is this looks better because it fixes the
> >real problem, but it is not a good way if it introduce more problems because of
> >it have to use oldmem. I will be glad if this is not a problem but I have not
> >been convinced.
> >
> >OTOH, there's many types of iommu, intel, amd, a lot of other types. They need
> >their own fixes, so it looks not that elegant.
> >
> >For pci reset, it is not perfect, but it has another advantage, the patch is
> >simpler. The problem I see from the old discusssion is, reset bus in 2nd kernel
> >is acceptable but it does not fix things on sparc platform. AFAIK current reported
> >problems are intel and amd iommu, at least pci reset stuff does not make it worse.
> >
> >Thanks
> >Dave
> >
>

2015-04-23 08:37:00

by Li, ZhenHua

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

Tested on HP Superdome X.
Result: Passed.

PCI list and log are attached.

superdomex_boot.log: Log for first kernel.
superdomex_dump.log: Log for kdump kernel.
lspci: log for command lspci

Thanks
Zhenhua
On 04/10/2015 04:42 PM, Li, Zhen-Hua wrote:
> This patchset is an update of Bill Sumner's patchset, implements a fix for:
> If a kernel boots with intel_iommu=on on a system that supports intel vt-d,
> when a panic happens, the kdump kernel will boot with these faults:
>
> dmar: DRHD: handling fault status reg 102
> dmar: DMAR:[DMA Read] Request device [01:00.0] fault addr fff80000
> DMAR:[fault reason 01] Present bit in root entry is clear
>
> dmar: DRHD: handling fault status reg 2
> dmar: INTR-REMAP: Request device [[61:00.0] fault index 42
> INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
>
> On some system, the interrupt remapping fault will also happen even if the
> intel_iommu is not set to on, because the interrupt remapping will be enabled
> when x2apic is needed by the system.
>
> The cause of the DMA fault is described in Bill's original version, and the
> INTR-Remap fault is caused by a similar reason. In short, the initialization
> of vt-d drivers causes the in-flight DMA and interrupt requests get wrong
> response.
>
> To fix this problem, we modifies the behaviors of the intel vt-d in the
> crashdump kernel:
>
> For DMA Remapping:
> 1. To accept the vt-d hardware in an active state,
> 2. Do not disable and re-enable the translation, keep it enabled.
> 3. Use the old root entry table, do not rewrite the RTA register.
> 4. Malloc and use new context entry table, copy data from the old ones that
> used by the old kernel.
> 5. Keep using the old page tables before driver is loaded.
> 6. After device driver is loaded, when it issues the first dma_map command,
> free the dmar_domain structure for this device, and generate a new one, so
> that the device can be assigned a new and empty page table.
> 7. When a new context entry table is generated, we also save its address to
> the old root entry table.
>
> For Interrupt Remapping:
> 1. To accept the vt-d hardware in an active state,
> 2. Do not disable and re-enable the interrupt remapping, keep it enabled.
> 3. Use the old interrupt remapping table, do not rewrite the IRTA register.
> 4. When ioapic entry is setup, the interrupt remapping table is changed, and
> the updated data will be stored to the old interrupt remapping table.
>
> Advantages of this approach:
> 1. All manipulation of the IO-device is done by the Linux device-driver
> for that device.
> 2. This approach behaves in a manner very similar to operation without an
> active iommu.
> 3. Any activity between the IO-device and its RMRR areas is handled by the
> device-driver in the same manner as during a non-kdump boot.
> 4. If an IO-device has no driver in the kdump kernel, it is simply left alone.
> This supports the practice of creating a special kdump kernel without
> drivers for any devices that are not required for taking a crashdump.
> 5. Minimal code-changes among the existing mainline intel vt-d code.
>
> Summary of changes in this patch set:
> 1. Added some useful function for root entry table in code intel-iommu.c
> 2. Added new members to struct root_entry and struct irte;
> 3. Functions to load old root entry table to iommu->root_entry from the memory
> of old kernel.
> 4. Functions to malloc new context entry table and copy the data from the old
> ones to the malloced new ones.
> 5. Functions to enable support for DMA remapping in kdump kernel.
> 6. Functions to load old irte data from the old kernel to the kdump kernel.
> 7. Some code changes that support other behaviours that have been listed.
> 8. In the new functions, use physical address as "unsigned long" type, not
> pointers.
>
> Original version by Bill Sumner:
> https://lkml.org/lkml/2014/1/10/518
> https://lkml.org/lkml/2014/4/15/716
> https://lkml.org/lkml/2014/4/24/836
>
> Zhenhua's updates:
> https://lkml.org/lkml/2014/10/21/134
> https://lkml.org/lkml/2014/12/15/121
> https://lkml.org/lkml/2014/12/22/53
> https://lkml.org/lkml/2015/1/6/1166
> https://lkml.org/lkml/2015/1/12/35
> https://lkml.org/lkml/2015/3/19/33
>
> Changelog[v10]:
> 1. Do not use CONFIG_CRASH_DUMP and is_kdump_kernel().
> Use one flag which stores the te and ir status in last kernel:
> iommu->pre_enabled_trans
> iommu->pre_enabled_ir
>
> Changelog[v9]:
> 1. Add new function iommu_attach_domain_with_id.
> 2. Do not copy old page tables, keep using the old ones.
> 3. Remove functions:
> intel_iommu_did_to_domain_values_entry
> intel_iommu_get_dids_from_old_kernel
> device_to_domain_id
> copy_page_addr
> copy_page_table
> copy_context_entry
> copy_context_entry_table
> 4. Add new function device_to_existing_context_entry.
>
> Changelog[v8]:
> 1. Add a missing __iommu_flush_cache in function copy_page_table.
>
> Changelog[v7]:
> 1. Use __iommu_flush_cache to flush the data to hardware.
>
> Changelog[v6]:
> 1. Use "unsigned long" as type of physical address.
> 2. Use new function unmap_device_dma to unmap the old dma.
> 3. Some small incorrect bits order for aw shift.
>
> Changelog[v5]:
> 1. Do not disable and re-enable traslation and interrupt remapping.
> 2. Use old root entry table.
> 3. Use old interrupt remapping table.
> 4. New functions to copy data from old kernel, and save to old kernel mem.
> 5. New functions to save updated root entry table and irte table.
> 6. Use intel_unmap to unmap the old dma;
> 7. Allocate new pages while driver is being loaded.
>
> Changelog[v4]:
> 1. Cut off the patches that move some defines and functions to new files.
> 2. Reduce the numbers of patches to five, make it more easier to read.
> 3. Changed the name of functions, make them consistent with current context
> get/set functions.
> 4. Add change to function __iommu_attach_domain.
>
> Changelog[v3]:
> 1. Commented-out "#define DEBUG 1" to eliminate debug messages.
> 2. Updated the comments about changes in each version.
> 3. Fixed: one-line added to Copy-Translations patch to initialize the iovad
> struct as recommended by Baoquan He [[email protected]]
> init_iova_domain(&domain->iovad, DMA_32BIT_PFN);
>
> Changelog[v2]:
> The following series implements a fix for:
> A kdump problem about DMA that has been discussed for a long time. That is,
> when a kernel panics and boots into the kdump kernel, DMA started by the
> panicked kernel is not stopped before the kdump kernel is booted and the
> kdump kernel disables the IOMMU while this DMA continues. This causes the
> IOMMU to stop translating the DMA addresses as IOVAs and begin to treat
> them as physical memory addresses -- which causes the DMA to either:
> (1) generate DMAR errors or
> (2) generate PCI SERR errors or
> (3) transfer data to or from incorrect areas of memory. Often this
> causes the dump to fail.
>
> Changelog[v1]:
> The original version.
>
> Changed in this version:
> 1. Do not disable and re-enable traslation and interrupt remapping.
> 2. Use old root entry table.
> 3. Use old interrupt remapping table.
> 4. Use "unsigned long" as physical address.
> 5. Use intel_unmap to unmap the old dma;
>
> Baoquan He <[email protected]> helps testing this patchset.
> Takao Indoh <[email protected]> gives valuable suggestions.
>
> Li, Zhen-Hua (10):
> iommu/vt-d: New function to attach domain with id
> iommu/vt-d: Items required for kdump
> iommu/vt-d: Function to get old context entry
> iommu/vt-d: functions to copy data from old mem
> iommu/vt-d: Add functions to load and save old re
> iommu/vt-d: datatypes and functions used for kdump
> iommu/vt-d: enable kdump support in iommu module
> iommu/vt-d: assign new page table for dma_map
> iommu/vt-d: Copy functions for irte
> iommu/vt-d: Use old irte in kdump kernel
>
> drivers/iommu/intel-iommu.c | 518 ++++++++++++++++++++++++++++++++++--
> drivers/iommu/intel_irq_remapping.c | 96 ++++++-
> include/linux/intel-iommu.h | 16 ++
> 3 files changed, 605 insertions(+), 25 deletions(-)
>


Attachments:
superdomex_boot.log (155.27 kB)
superdomex_dump.log (105.83 kB)
lspci (42.42 kB)
Download all attachments

2015-04-23 08:39:13

by Li, ZhenHua

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

Tested on HP DL980.
Result: Passed.

PCI list and log are attached.

dl980_boot.log: Log for first kernel.
dl980_dump.log: Log for kdump kernel.
lspci: log for command lspci

Thanks
Zhenhua

On 04/23/2015 04:35 PM, Li, ZhenHua wrote:
> Tested on HP Superdome X.
> Result: Passed.
>
> PCI list and log are attached.
>
> superdomex_boot.log: Log for first kernel.
> superdomex_dump.log: Log for kdump kernel.
> lspci: log for command lspci
>
> Thanks
> Zhenhua
> On 04/10/2015 04:42 PM, Li, Zhen-Hua wrote:
>> This patchset is an update of Bill Sumner's patchset, implements a fix
>> for:
>> If a kernel boots with intel_iommu=on on a system that supports intel
>> vt-d,
>> when a panic happens, the kdump kernel will boot with these faults:
>>
>> dmar: DRHD: handling fault status reg 102
>> dmar: DMAR:[DMA Read] Request device [01:00.0] fault addr fff80000
>> DMAR:[fault reason 01] Present bit in root entry is clear
>>
>> dmar: DRHD: handling fault status reg 2
>> dmar: INTR-REMAP: Request device [[61:00.0] fault index 42
>> INTR-REMAP:[fault reason 34] Present field in the IRTE entry is
>> clear
>>
>> On some system, the interrupt remapping fault will also happen even if
>> the
>> intel_iommu is not set to on, because the interrupt remapping will be
>> enabled
>> when x2apic is needed by the system.
>>
>> The cause of the DMA fault is described in Bill's original version,
>> and the
>> INTR-Remap fault is caused by a similar reason. In short, the
>> initialization
>> of vt-d drivers causes the in-flight DMA and interrupt requests get wrong
>> response.
>>
>> To fix this problem, we modifies the behaviors of the intel vt-d in the
>> crashdump kernel:
>>
>> For DMA Remapping:
>> 1. To accept the vt-d hardware in an active state,
>> 2. Do not disable and re-enable the translation, keep it enabled.
>> 3. Use the old root entry table, do not rewrite the RTA register.
>> 4. Malloc and use new context entry table, copy data from the old ones
>> that
>> used by the old kernel.
>> 5. Keep using the old page tables before driver is loaded.
>> 6. After device driver is loaded, when it issues the first dma_map
>> command,
>> free the dmar_domain structure for this device, and generate a new
>> one, so
>> that the device can be assigned a new and empty page table.
>> 7. When a new context entry table is generated, we also save its
>> address to
>> the old root entry table.
>>
>> For Interrupt Remapping:
>> 1. To accept the vt-d hardware in an active state,
>> 2. Do not disable and re-enable the interrupt remapping, keep it enabled.
>> 3. Use the old interrupt remapping table, do not rewrite the IRTA
>> register.
>> 4. When ioapic entry is setup, the interrupt remapping table is
>> changed, and
>> the updated data will be stored to the old interrupt remapping table.
>>
>> Advantages of this approach:
>> 1. All manipulation of the IO-device is done by the Linux device-driver
>> for that device.
>> 2. This approach behaves in a manner very similar to operation without an
>> active iommu.
>> 3. Any activity between the IO-device and its RMRR areas is handled by
>> the
>> device-driver in the same manner as during a non-kdump boot.
>> 4. If an IO-device has no driver in the kdump kernel, it is simply
>> left alone.
>> This supports the practice of creating a special kdump kernel without
>> drivers for any devices that are not required for taking a crashdump.
>> 5. Minimal code-changes among the existing mainline intel vt-d code.
>>
>> Summary of changes in this patch set:
>> 1. Added some useful function for root entry table in code intel-iommu.c
>> 2. Added new members to struct root_entry and struct irte;
>> 3. Functions to load old root entry table to iommu->root_entry from
>> the memory
>> of old kernel.
>> 4. Functions to malloc new context entry table and copy the data from
>> the old
>> ones to the malloced new ones.
>> 5. Functions to enable support for DMA remapping in kdump kernel.
>> 6. Functions to load old irte data from the old kernel to the kdump
>> kernel.
>> 7. Some code changes that support other behaviours that have been listed.
>> 8. In the new functions, use physical address as "unsigned long" type,
>> not
>> pointers.
>>
>> Original version by Bill Sumner:
>> https://lkml.org/lkml/2014/1/10/518
>> https://lkml.org/lkml/2014/4/15/716
>> https://lkml.org/lkml/2014/4/24/836
>>
>> Zhenhua's updates:
>> https://lkml.org/lkml/2014/10/21/134
>> https://lkml.org/lkml/2014/12/15/121
>> https://lkml.org/lkml/2014/12/22/53
>> https://lkml.org/lkml/2015/1/6/1166
>> https://lkml.org/lkml/2015/1/12/35
>> https://lkml.org/lkml/2015/3/19/33
>>
>> Changelog[v10]:
>> 1. Do not use CONFIG_CRASH_DUMP and is_kdump_kernel().
>> Use one flag which stores the te and ir status in last kernel:
>> iommu->pre_enabled_trans
>> iommu->pre_enabled_ir
>>
>> Changelog[v9]:
>> 1. Add new function iommu_attach_domain_with_id.
>> 2. Do not copy old page tables, keep using the old ones.
>> 3. Remove functions:
>> intel_iommu_did_to_domain_values_entry
>> intel_iommu_get_dids_from_old_kernel
>> device_to_domain_id
>> copy_page_addr
>> copy_page_table
>> copy_context_entry
>> copy_context_entry_table
>> 4. Add new function device_to_existing_context_entry.
>>
>> Changelog[v8]:
>> 1. Add a missing __iommu_flush_cache in function copy_page_table.
>>
>> Changelog[v7]:
>> 1. Use __iommu_flush_cache to flush the data to hardware.
>>
>> Changelog[v6]:
>> 1. Use "unsigned long" as type of physical address.
>> 2. Use new function unmap_device_dma to unmap the old dma.
>> 3. Some small incorrect bits order for aw shift.
>>
>> Changelog[v5]:
>> 1. Do not disable and re-enable traslation and interrupt remapping.
>> 2. Use old root entry table.
>> 3. Use old interrupt remapping table.
>> 4. New functions to copy data from old kernel, and save to old
>> kernel mem.
>> 5. New functions to save updated root entry table and irte table.
>> 6. Use intel_unmap to unmap the old dma;
>> 7. Allocate new pages while driver is being loaded.
>>
>> Changelog[v4]:
>> 1. Cut off the patches that move some defines and functions to
>> new files.
>> 2. Reduce the numbers of patches to five, make it more easier to
>> read.
>> 3. Changed the name of functions, make them consistent with
>> current context
>> get/set functions.
>> 4. Add change to function __iommu_attach_domain.
>>
>> Changelog[v3]:
>> 1. Commented-out "#define DEBUG 1" to eliminate debug messages.
>> 2. Updated the comments about changes in each version.
>> 3. Fixed: one-line added to Copy-Translations patch to initialize
>> the iovad
>> struct as recommended by Baoquan He [[email protected]]
>> init_iova_domain(&domain->iovad, DMA_32BIT_PFN);
>>
>> Changelog[v2]:
>> The following series implements a fix for:
>> A kdump problem about DMA that has been discussed for a long
>> time. That is,
>> when a kernel panics and boots into the kdump kernel, DMA started
>> by the
>> panicked kernel is not stopped before the kdump kernel is booted
>> and the
>> kdump kernel disables the IOMMU while this DMA continues. This
>> causes the
>> IOMMU to stop translating the DMA addresses as IOVAs and begin to
>> treat
>> them as physical memory addresses -- which causes the DMA to either:
>> (1) generate DMAR errors or
>> (2) generate PCI SERR errors or
>> (3) transfer data to or from incorrect areas of memory. Often
>> this
>> causes the dump to fail.
>>
>> Changelog[v1]:
>> The original version.
>>
>> Changed in this version:
>> 1. Do not disable and re-enable traslation and interrupt remapping.
>> 2. Use old root entry table.
>> 3. Use old interrupt remapping table.
>> 4. Use "unsigned long" as physical address.
>> 5. Use intel_unmap to unmap the old dma;
>>
>> Baoquan He <[email protected]> helps testing this patchset.
>> Takao Indoh <[email protected]> gives valuable suggestions.
>>
>> Li, Zhen-Hua (10):
>> iommu/vt-d: New function to attach domain with id
>> iommu/vt-d: Items required for kdump
>> iommu/vt-d: Function to get old context entry
>> iommu/vt-d: functions to copy data from old mem
>> iommu/vt-d: Add functions to load and save old re
>> iommu/vt-d: datatypes and functions used for kdump
>> iommu/vt-d: enable kdump support in iommu module
>> iommu/vt-d: assign new page table for dma_map
>> iommu/vt-d: Copy functions for irte
>> iommu/vt-d: Use old irte in kdump kernel
>>
>> drivers/iommu/intel-iommu.c | 518
>> ++++++++++++++++++++++++++++++++++--
>> drivers/iommu/intel_irq_remapping.c | 96 ++++++-
>> include/linux/intel-iommu.h | 16 ++
>> 3 files changed, 605 insertions(+), 25 deletions(-)
>>
>


Attachments:
dl980_boot.log (96.85 kB)
dl980_dump.log (92.66 kB)
lspci (3.18 kB)
Download all attachments

2015-04-24 08:02:37

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

On 04/15/15 at 02:48pm, Dave Young wrote:
> On 04/15/15 at 01:47pm, Li, ZhenHua wrote:
> > On 04/15/2015 08:57 AM, Dave Young wrote:
> > >Again, I think it is bad to use old page table, below issues need consider:
> > >1) make sure old page table are reliable across crash
> > >2) do not allow writing oldmem after crash
> > >
> > >Please correct me if I'm wrong, or if above is not doable I think I will vote for
> > >resetting pci bus.
> > >
> > >Thanks
> > >Dave
> > >
> > Hi Dave,
> >
> > When updating the context tables, we have to write their address to root
> > tables, this will cause writing to old mem.
> >
> > Resetting the pci bus has been discussed, please check this:
> > http://lists.infradead.org/pipermail/kexec/2014-October/012752.html
> > https://lkml.org/lkml/2014/10/21/890

I support this patchset.

We should not fear oldmem since reserved crashkernel region is similar.
No one can guarantee that any crazy code won't step into crashkernel
region just because 1st kernel says it's reversed for kdump kernel. Here
the root table and context tables are also not built to allow legal code
to danamge. Both of them has the risk to be corrupted, for trying our
best to get a dumped vmcore the risk is worth being taken.

And the resetting pci way has been NACKed by David Woodhouse, the
maintainer of intel iommu. Because the place calling the resetting pci
code is ugly before kdump kernel or in kdump kernel. And as he said a
certain device made mistakes why we blame on all devices. We should fix
that device who made mistakes.

As for me, periodically poked by customers to ask how iommu fix is
going, I really think this patchset is good enough. Aren't we going to
do thing just because there's a risk with tiny possibility or not perfect
enough. I think people won't agree. Otherwise kdump could have been
killed when author proposed it since crashkernel reserved region is
risky and could be corrupted by 1st kernel.

Anyway, let's comprimise a little. At worst it can be reverted if it's
not satisfactory.

Personal opinion.

By the way, I tested it and it works well on my HP z420 workstation.

Thanks
Baoquan


>
> I know one reason to use old pgtable is this looks better because it fixes the
> real problem, but it is not a good way if it introduce more problems because of
> it have to use oldmem. I will be glad if this is not a problem but I have not
> been convinced.
>
> OTOH, there's many types of iommu, intel, amd, a lot of other types. They need
> their own fixes, so it looks not that elegant.
>
> For pci reset, it is not perfect, but it has another advantage, the patch is
> simpler. The problem I see from the old discusssion is, reset bus in 2nd kernel
> is acceptable but it does not fix things on sparc platform. AFAIK current reported
> problems are intel and amd iommu, at least pci reset stuff does not make it worse.
>
> Thanks
> Dave

2015-04-24 08:26:27

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

Hi, Baoquan

> I support this patchset.
>
> We should not fear oldmem since reserved crashkernel region is similar.
> No one can guarantee that any crazy code won't step into crashkernel
> region just because 1st kernel says it's reversed for kdump kernel. Here
> the root table and context tables are also not built to allow legal code
> to danamge. Both of them has the risk to be corrupted, for trying our
> best to get a dumped vmcore the risk is worth being taken.

old mem is mapped in 1st kernel so compare with the reserved crashkernel
they are more likely to be corrupted. they are totally different.

>
> And the resetting pci way has been NACKed by David Woodhouse, the
> maintainer of intel iommu. Because the place calling the resetting pci
> code is ugly before kdump kernel or in kdump kernel. And as he said a
> certain device made mistakes why we blame on all devices. We should fix
> that device who made mistakes.

Resetting pci bus is not ugly than fixing a problem with risk and to fix
the problem it introduced in the future.

I know it is late to speak out, but sorry I still object and have to NACK this
oldmem approach from my point.

Thanks
Dave

2015-04-24 08:35:38

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

On 04/24/15 at 04:25pm, Dave Young wrote:
> Hi, Baoquan
>
> > I support this patchset.
> >
> > We should not fear oldmem since reserved crashkernel region is similar.
> > No one can guarantee that any crazy code won't step into crashkernel
> > region just because 1st kernel says it's reversed for kdump kernel. Here
> > the root table and context tables are also not built to allow legal code
> > to danamge. Both of them has the risk to be corrupted, for trying our
> > best to get a dumped vmcore the risk is worth being taken.
>
> old mem is mapped in 1st kernel so compare with the reserved crashkernel
> they are more likely to be corrupted. they are totally different.

Could you tell how and why they are different? Wrong code will choose
root tables and context tables to danamge when they totally lose
control?

>
> >
> > And the resetting pci way has been NACKed by David Woodhouse, the
> > maintainer of intel iommu. Because the place calling the resetting pci
> > code is ugly before kdump kernel or in kdump kernel. And as he said a
> > certain device made mistakes why we blame on all devices. We should fix
> > that device who made mistakes.
>
> Resetting pci bus is not ugly than fixing a problem with risk and to fix
> the problem it introduced in the future.

There's a problem, we fix the problem. If that's uglier, I need redefine
the 'ugly' in my personal dict. You mean the problem it could introduce
is wrong code will damage root table and context tables, why don't we
fix that wrong code, but blame innocent context tables? So you mean
these tables should deserve being damaged by wrong code?

>
> I know it is late to speak out, but sorry I still object and have to NACK this
> oldmem approach from my point.
>
> Thanks
> Dave

2015-04-24 08:50:19

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

On 04/24/15 at 04:35pm, Baoquan He wrote:
> On 04/24/15 at 04:25pm, Dave Young wrote:
> > Hi, Baoquan
> >
> > > I support this patchset.
> > >
> > > We should not fear oldmem since reserved crashkernel region is similar.
> > > No one can guarantee that any crazy code won't step into crashkernel
> > > region just because 1st kernel says it's reversed for kdump kernel. Here
> > > the root table and context tables are also not built to allow legal code
> > > to danamge. Both of them has the risk to be corrupted, for trying our
> > > best to get a dumped vmcore the risk is worth being taken.
> >
> > old mem is mapped in 1st kernel so compare with the reserved crashkernel
> > they are more likely to be corrupted. they are totally different.
>
> Could you tell how and why they are different? Wrong code will choose
> root tables and context tables to danamge when they totally lose
> control?

iommu will map io address to system ram, right? not to reserved ram, but
yes I'm assuming the page table is right, but I was worrying they are corrupted
while kernel panic is happening.

>
> >
> > >
> > > And the resetting pci way has been NACKed by David Woodhouse, the
> > > maintainer of intel iommu. Because the place calling the resetting pci
> > > code is ugly before kdump kernel or in kdump kernel. And as he said a
> > > certain device made mistakes why we blame on all devices. We should fix
> > > that device who made mistakes.
> >
> > Resetting pci bus is not ugly than fixing a problem with risk and to fix
> > the problem it introduced in the future.
>
> There's a problem, we fix the problem. If that's uglier, I need redefine
> the 'ugly' in my personal dict. You mean the problem it could introduce
> is wrong code will damage root table and context tables, why don't we
> fix that wrong code, but blame innocent context tables? So you mean
> these tables should deserve being damaged by wrong code?

I'm more than happy to see this issue can be fixed in the patchset, I do not
agree to add the code there with such problems. OTOH, for now seems there's
no way to fix it.

>
> >
> > I know it is late to speak out, but sorry I still object and have to NACK this
> > oldmem approach from my point.
> >
> > Thanks
> > Dave

2015-04-28 08:55:28

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

On 04/24/15 at 04:49pm, Dave Young wrote:
> On 04/24/15 at 04:35pm, Baoquan He wrote:
> > On 04/24/15 at 04:25pm, Dave Young wrote:
> > > Hi, Baoquan
> > >
> > > > I support this patchset.
> > > >
> > > > We should not fear oldmem since reserved crashkernel region is similar.
> > > > No one can guarantee that any crazy code won't step into crashkernel
> > > > region just because 1st kernel says it's reversed for kdump kernel. Here
> > > > the root table and context tables are also not built to allow legal code
> > > > to danamge. Both of them has the risk to be corrupted, for trying our
> > > > best to get a dumped vmcore the risk is worth being taken.
> > >
> > > old mem is mapped in 1st kernel so compare with the reserved crashkernel
> > > they are more likely to be corrupted. they are totally different.
> >
> > Could you tell how and why they are different? Wrong code will choose
> > root tables and context tables to danamge when they totally lose
> > control?
>
> iommu will map io address to system ram, right? not to reserved ram, but
> yes I'm assuming the page table is right, but I was worrying they are corrupted
> while kernel panic is happening.

OK, I think we may need to think more about the old context tables
reuse. Currently dmar faults will cause error or warning message,
occasionally will cause system with iommu hang in kdump kernel. I don't
know what will happen if old root tables or context tables are corrupted
by evil code. For kdump kernel which use the similar mechanism there's a
verification. When load kdump kernel into reserved crashkernel region a
sha256 sum is calculated, then verify it when jump into kdump kernel
after panic. If corrupted context tables will bring worse result, then
we need consider giving it up and change back to the old way and try
to dump though there's error message.

Hi Zhenhua,

I don't know what's your plan about verification whether old root tables
or old context tables are corrupted. Or have you experimented that what
will happen if old tables are corrupted on purpose.

I am fine if you just put this in a TODO list since that's truly in a
rare case. But it maybe necessary to tell it in patch log.

Thanks
Baoquan

2015-04-28 09:01:30

by Li, ZhenHua

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

Hi Baoquan,

If old tables are corrupted, we will see the DMAR faults or INTR faults
(which we have seen), or some other error messages. Most of these
messages are from hardware. This means, hardware will do some check when
running. But I don't think hardware will completely check the
tables.

Till now, I do not have a good idea to do the check in kdump kernel.


Thanks
Zhenhua


On 04/28/2015 04:54 PM, Baoquan He wrote:
> On 04/24/15 at 04:49pm, Dave Young wrote:
>> On 04/24/15 at 04:35pm, Baoquan He wrote:
>>> On 04/24/15 at 04:25pm, Dave Young wrote:
>>>> Hi, Baoquan
>>>>
>>>>> I support this patchset.
>>>>>
>>>>> We should not fear oldmem since reserved crashkernel region is similar.
>>>>> No one can guarantee that any crazy code won't step into crashkernel
>>>>> region just because 1st kernel says it's reversed for kdump kernel. Here
>>>>> the root table and context tables are also not built to allow legal code
>>>>> to danamge. Both of them has the risk to be corrupted, for trying our
>>>>> best to get a dumped vmcore the risk is worth being taken.
>>>>
>>>> old mem is mapped in 1st kernel so compare with the reserved crashkernel
>>>> they are more likely to be corrupted. they are totally different.
>>>
>>> Could you tell how and why they are different? Wrong code will choose
>>> root tables and context tables to danamge when they totally lose
>>> control?
>>
>> iommu will map io address to system ram, right? not to reserved ram, but
>> yes I'm assuming the page table is right, but I was worrying they are corrupted
>> while kernel panic is happening.
>
> OK, I think we may need to think more about the old context tables
> reuse. Currently dmar faults will cause error or warning message,
> occasionally will cause system with iommu hang in kdump kernel. I don't
> know what will happen if old root tables or context tables are corrupted
> by evil code. For kdump kernel which use the similar mechanism there's a
> verification. When load kdump kernel into reserved crashkernel region a
> sha256 sum is calculated, then verify it when jump into kdump kernel
> after panic. If corrupted context tables will bring worse result, then
> we need consider giving it up and change back to the old way and try
> to dump though there's error message.
>
> Hi Zhenhua,
>
> I don't know what's your plan about verification whether old root tables
> or old context tables are corrupted. Or have you experimented that what
> will happen if old tables are corrupted on purpose.
>
> I am fine if you just put this in a TODO list since that's truly in a
> rare case. But it maybe necessary to tell it in patch log.
>
> Thanks
> Baoquan
>

2015-04-29 11:21:31

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

Bad news, I rebuilt a kernel with your patchset on 4.0.0+ (this commit
f614c81). Now dmar fault is seen again.

The lspci log and kdump log are attached, please check:

[ ~]$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.0.0+ root=/dev/mapper/fedora_dhcp--128--28-root ro
rd.lvm.lv=fedora_dhcp-128-28/swap rd.lvm.lv=fedora_dhcp-128-28/root
crashkernel=256M console=ttyS0,115200 intel_iommu=on



Attachments:
(No filename) (391.00 B)
lspci.txt (0.00 B)
kdump_iommu.log (53.86 kB)
Download all attachments

2015-05-03 08:56:41

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

On 04/29/15 at 07:20pm, Baoquan He wrote:
> Bad news, I rebuilt a kernel with your patchset on 4.0.0+ (this commit
> f614c81). Now dmar fault is seen again.
>
> The lspci log and kdump log are attached, please check:

I found the lspci log previously attached is emtyp, resend it again.



Attachments:
(No filename) (291.00 B)
lspci.log (34.72 kB)
Download all attachments

2015-05-04 03:07:46

by Li, ZhenHua

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

Hi baoquan,
Could you paste the kernel log of the first kernel ?

Thanks
Zhenhua
On 05/03/2015 04:55 PM, Baoquan He wrote:
> On 04/29/15 at 07:20pm, Baoquan He wrote:
>> Bad news, I rebuilt a kernel with your patchset on 4.0.0+ (this commit
>> f614c81). Now dmar fault is seen again.
>>
>> The lspci log and kdump log are attached, please check:
>
> I found the lspci log previously attached is emtyp, resend it again.
>
>

2015-05-04 03:19:01

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

On 05/04/15 at 11:06am, Li, ZhenHua wrote:
> Hi baoquan,
> Could you paste the kernel log of the first kernel ?

Please check the attachment.


Attachments:
(No filename) (142.00 B)
kdump_1st_kernel.log (60.62 kB)
Download all attachments

2015-05-04 16:23:29

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

On Fri, Apr 24, 2015 at 04:49:57PM +0800, Dave Young wrote:
> I'm more than happy to see this issue can be fixed in the patchset, I
> do not agree to add the code there with such problems. OTOH, for now
> seems there's no way to fix it.

And that's the point. We discuss this issue and possible solutions for
years by now, and what ZhenHua implemented is what we agreed to be the
best-effort on what we can do in the kdump case with IOMMU enabled.

Of course there are still failure scenarios left, but that is not
different from systems without any IOMMU.


Joerg

2015-05-05 06:15:29

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

On 05/04/15 at 06:23pm, Joerg Roedel wrote:
> On Fri, Apr 24, 2015 at 04:49:57PM +0800, Dave Young wrote:
> > I'm more than happy to see this issue can be fixed in the patchset, I
> > do not agree to add the code there with such problems. OTOH, for now
> > seems there's no way to fix it.
>
> And that's the point. We discuss this issue and possible solutions for
> years by now, and what ZhenHua implemented is what we agreed to be the
> best-effort on what we can do in the kdump case with IOMMU enabled.
>
> Of course there are still failure scenarios left, but that is not
> different from systems without any IOMMU.

The failure is nothing different, but as I said in another reply the
difference is we could use corrupted data to possiblly cause more failure.

Thanks
Dave

2015-05-05 16:10:30

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

On Tue, May 05, 2015 at 02:14:23PM +0800, Dave Young wrote:
> The failure is nothing different, but as I said in another reply the
> difference is we could use corrupted data to possiblly cause more failure.

I still fail to see how things can get more worse than they already are
by reusing the old data (we just reuse it, we do not modify anything
there). Do you have any specific scenario in mind?


Joerg

2015-05-06 01:51:54

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

On 05/05/15 at 05:31pm, Joerg Roedel wrote:
> On Tue, May 05, 2015 at 02:14:23PM +0800, Dave Young wrote:
> > The failure is nothing different, but as I said in another reply the
> > difference is we could use corrupted data to possiblly cause more failure.
>
> I still fail to see how things can get more worse than they already are
> by reusing the old data (we just reuse it, we do not modify anything

DMA write will modify system ram, if the old data is corrupted it is possible
that DMA operation modify wrong ram regions because of wrong mapping.
Am I missing something and is it not possible?

Thanks
Dave

2015-05-06 02:38:06

by Li, ZhenHua

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

Dave,
This patchset will only write root tables in old kernel, if it is
corrupted, faults will also happen in old kernel, and hardware would
mark it. So things will not go worse..

Thanks
Zhenhua
On 05/06/2015 09:51 AM, Dave Young wrote:
> On 05/05/15 at 05:31pm, Joerg Roedel wrote:
>> On Tue, May 05, 2015 at 02:14:23PM +0800, Dave Young wrote:
>>> The failure is nothing different, but as I said in another reply the
>>> difference is we could use corrupted data to possiblly cause more failure.
>>
>> I still fail to see how things can get more worse than they already are
>> by reusing the old data (we just reuse it, we do not modify anything
>
> DMA write will modify system ram, if the old data is corrupted it is possible
> that DMA operation modify wrong ram regions because of wrong mapping.
> Am I missing something and is it not possible?
>
> Thanks
> Dave
>

2015-05-06 08:26:06

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

On Wed, May 06, 2015 at 09:51:35AM +0800, Dave Young wrote:
> DMA write will modify system ram, if the old data is corrupted it is possible
> that DMA operation modify wrong ram regions because of wrong mapping.
> Am I missing something and is it not possible?

This might have happened already before the kdump kernel even boots.
Also, if there is no IOMMU, this can happen as well. This (unlikely
but possible) situation doesn't make things worse.


Joerg

2015-05-07 07:50:07

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v10 04/10] iommu/vt-d: functions to copy data from old mem

On 04/10/15 at 04:42pm, Li, Zhen-Hua wrote:
> Add some functions to copy the data from old kernel.
> These functions are used to copy context tables and page tables.
>
> To avoid calling iounmap between spin_lock_irqsave and spin_unlock_irqrestore,
> use a link here, store the pointers , and then use iounmap to free them in
> another place.
>
> Li, Zhen-hua:
> The functions and logics.
>
> Takao Indoh:
> Check if pfn is ram:
> if (page_is_ram(pfn))
>
> Signed-off-by: Li, Zhen-Hua <[email protected]>
> Signed-off-by: Takao Indoh <[email protected]>
> ---
> drivers/iommu/intel-iommu.c | 102 ++++++++++++++++++++++++++++++++++++++++++++
> include/linux/intel-iommu.h | 6 +++
> 2 files changed, 108 insertions(+)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index ff5ac04..5ba403a 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -373,6 +373,17 @@ static struct context_entry *device_to_existing_context_entry(
> struct intel_iommu *iommu,
> u8 bus, u8 devfn);
>
> +/*
> + * A structure used to store the address allocated by ioremap();
> + * The we need to call iounmap() to free them out of spin_lock_irqsave/unlock;
> + */
> +struct iommu_remapped_entry {
> + struct list_head list;
> + void __iomem *mem;
> +};
> +static LIST_HEAD(__iommu_remapped_mem);
> +static DEFINE_MUTEX(__iommu_mem_list_lock);
> +
>
> /*
> * This domain is a statically identity mapping domain.
> @@ -4817,3 +4828,94 @@ static struct context_entry *device_to_existing_context_entry(
> return ret;
> }
>
> +/*
> + * Copy memory from a physically-addressed area into a virtually-addressed area
> + */

I don't find where __iommu_load_from_oldmem is called. Obsolete code of
this patch?

> +int __iommu_load_from_oldmem(void *to, unsigned long from, unsigned long size)
> +{
> + unsigned long pfn; /* Page Frame Number */
> + size_t csize = (size_t)size; /* Num(bytes to copy) */
> + unsigned long offset; /* Lower 12 bits of to */
> + void __iomem *virt_mem;
> + struct iommu_remapped_entry *mapped;
> +
> + pfn = from >> VTD_PAGE_SHIFT;
> + offset = from & (~VTD_PAGE_MASK);
> +
> + if (page_is_ram(pfn)) {
> + memcpy(to, pfn_to_kaddr(pfn) + offset, csize);
> + } else{
> +
> + mapped = kzalloc(sizeof(struct iommu_remapped_entry),
> + GFP_KERNEL);
> + if (!mapped)
> + return -ENOMEM;
> +
> + virt_mem = ioremap_cache((unsigned long)from, size);
> + if (!virt_mem) {
> + kfree(mapped);
> + return -ENOMEM;
> + }
> + memcpy(to, virt_mem, size);
> +
> + mutex_lock(&__iommu_mem_list_lock);
> + mapped->mem = virt_mem;
> + list_add_tail(&mapped->list, &__iommu_remapped_mem);
> + mutex_unlock(&__iommu_mem_list_lock);
> + }
> + return size;
> +}
> +
> +/*
> + * Copy memory from a virtually-addressed area into a physically-addressed area
> + */
> +int __iommu_save_to_oldmem(unsigned long to, void *from, unsigned long size)
> +{
> + unsigned long pfn; /* Page Frame Number */
> + size_t csize = (size_t)size; /* Num(bytes to copy) */
> + unsigned long offset; /* Lower 12 bits of to */
> + void __iomem *virt_mem;
> + struct iommu_remapped_entry *mapped;
> +
> + pfn = to >> VTD_PAGE_SHIFT;
> + offset = to & (~VTD_PAGE_MASK);
> +
> + if (page_is_ram(pfn)) {
> + memcpy(pfn_to_kaddr(pfn) + offset, from, csize);
> + } else{
> + mapped = kzalloc(sizeof(struct iommu_remapped_entry),
> + GFP_KERNEL);
> + if (!mapped)
> + return -ENOMEM;
> +
> + virt_mem = ioremap_cache((unsigned long)to, size);
> + if (!virt_mem) {
> + kfree(mapped);
> + return -ENOMEM;
> + }
> + memcpy(virt_mem, from, size);
> + mutex_lock(&__iommu_mem_list_lock);
> + mapped->mem = virt_mem;
> + list_add_tail(&mapped->list, &__iommu_remapped_mem);
> + mutex_unlock(&__iommu_mem_list_lock);
> + }
> + return size;
> +}
> +
> +/*
> + * Free the mapped memory for ioremap;
> + */
> +int __iommu_free_mapped_mem(void)
> +{
> + struct iommu_remapped_entry *mem_entry, *tmp;
> +
> + mutex_lock(&__iommu_mem_list_lock);
> + list_for_each_entry_safe(mem_entry, tmp, &__iommu_remapped_mem, list) {
> + iounmap(mem_entry->mem);
> + list_del(&mem_entry->list);
> + kfree(mem_entry);
> + }
> + mutex_unlock(&__iommu_mem_list_lock);
> + return 0;
> +}
> +
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index a65208a..4bca7b5 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -368,4 +368,10 @@ extern int dmar_ir_support(void);
>
> extern const struct attribute_group *intel_iommu_groups[];
>
> +extern int __iommu_load_from_oldmem(void *to, unsigned long from,
> + unsigned long size);
> +extern int __iommu_save_to_oldmem(unsigned long to, void *from,
> + unsigned long size);
> +extern int __iommu_free_mapped_mem(void);
> +
> #endif
> --
> 2.0.0-rc0
>

2015-05-07 08:34:30

by Li, ZhenHua

[permalink] [raw]
Subject: Re: [PATCH v10 04/10] iommu/vt-d: functions to copy data from old mem

It is called in
static int copy_root_entry_table(struct intel_iommu *iommu);


On 05/07/2015 03:49 PM, Baoquan He wrote:
> On 04/10/15 at 04:42pm, Li, Zhen-Hua wrote:
>> Add some functions to copy the data from old kernel.
>> These functions are used to copy context tables and page tables.
>>
>> To avoid calling iounmap between spin_lock_irqsave and spin_unlock_irqrestore,
>> use a link here, store the pointers , and then use iounmap to free them in
>> another place.
>>
>> Li, Zhen-hua:
>> The functions and logics.
>>
>> Takao Indoh:
>> Check if pfn is ram:
>> if (page_is_ram(pfn))
>>
>> Signed-off-by: Li, Zhen-Hua <[email protected]>
>> Signed-off-by: Takao Indoh <[email protected]>
>> ---
>> drivers/iommu/intel-iommu.c | 102 ++++++++++++++++++++++++++++++++++++++++++++
>> include/linux/intel-iommu.h | 6 +++
>> 2 files changed, 108 insertions(+)
>>
>> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
>> index ff5ac04..5ba403a 100644
>> --- a/drivers/iommu/intel-iommu.c
>> +++ b/drivers/iommu/intel-iommu.c
>> @@ -373,6 +373,17 @@ static struct context_entry *device_to_existing_context_entry(
>> struct intel_iommu *iommu,
>> u8 bus, u8 devfn);
>>
>> +/*
>> + * A structure used to store the address allocated by ioremap();
>> + * The we need to call iounmap() to free them out of spin_lock_irqsave/unlock;
>> + */
>> +struct iommu_remapped_entry {
>> + struct list_head list;
>> + void __iomem *mem;
>> +};
>> +static LIST_HEAD(__iommu_remapped_mem);
>> +static DEFINE_MUTEX(__iommu_mem_list_lock);
>> +
>>
>> /*
>> * This domain is a statically identity mapping domain.
>> @@ -4817,3 +4828,94 @@ static struct context_entry *device_to_existing_context_entry(
>> return ret;
>> }
>>
>> +/*
>> + * Copy memory from a physically-addressed area into a virtually-addressed area
>> + */
>
> I don't find where __iommu_load_from_oldmem is called. Obsolete code of
> this patch?
>
>> +int __iommu_load_from_oldmem(void *to, unsigned long from, unsigned long size)
>> +{
>> + unsigned long pfn; /* Page Frame Number */
>> + size_t csize = (size_t)size; /* Num(bytes to copy) */
>> + unsigned long offset; /* Lower 12 bits of to */
>> + void __iomem *virt_mem;
>> + struct iommu_remapped_entry *mapped;
>> +
>> + pfn = from >> VTD_PAGE_SHIFT;
>> + offset = from & (~VTD_PAGE_MASK);
>> +
>> + if (page_is_ram(pfn)) {
>> + memcpy(to, pfn_to_kaddr(pfn) + offset, csize);
>> + } else{
>> +
>> + mapped = kzalloc(sizeof(struct iommu_remapped_entry),
>> + GFP_KERNEL);
>> + if (!mapped)
>> + return -ENOMEM;
>> +
>> + virt_mem = ioremap_cache((unsigned long)from, size);
>> + if (!virt_mem) {
>> + kfree(mapped);
>> + return -ENOMEM;
>> + }
>> + memcpy(to, virt_mem, size);
>> +
>> + mutex_lock(&__iommu_mem_list_lock);
>> + mapped->mem = virt_mem;
>> + list_add_tail(&mapped->list, &__iommu_remapped_mem);
>> + mutex_unlock(&__iommu_mem_list_lock);
>> + }
>> + return size;
>> +}
>> +
>> +/*
>> + * Copy memory from a virtually-addressed area into a physically-addressed area
>> + */
>> +int __iommu_save_to_oldmem(unsigned long to, void *from, unsigned long size)
>> +{
>> + unsigned long pfn; /* Page Frame Number */
>> + size_t csize = (size_t)size; /* Num(bytes to copy) */
>> + unsigned long offset; /* Lower 12 bits of to */
>> + void __iomem *virt_mem;
>> + struct iommu_remapped_entry *mapped;
>> +
>> + pfn = to >> VTD_PAGE_SHIFT;
>> + offset = to & (~VTD_PAGE_MASK);
>> +
>> + if (page_is_ram(pfn)) {
>> + memcpy(pfn_to_kaddr(pfn) + offset, from, csize);
>> + } else{
>> + mapped = kzalloc(sizeof(struct iommu_remapped_entry),
>> + GFP_KERNEL);
>> + if (!mapped)
>> + return -ENOMEM;
>> +
>> + virt_mem = ioremap_cache((unsigned long)to, size);
>> + if (!virt_mem) {
>> + kfree(mapped);
>> + return -ENOMEM;
>> + }
>> + memcpy(virt_mem, from, size);
>> + mutex_lock(&__iommu_mem_list_lock);
>> + mapped->mem = virt_mem;
>> + list_add_tail(&mapped->list, &__iommu_remapped_mem);
>> + mutex_unlock(&__iommu_mem_list_lock);
>> + }
>> + return size;
>> +}
>> +
>> +/*
>> + * Free the mapped memory for ioremap;
>> + */
>> +int __iommu_free_mapped_mem(void)
>> +{
>> + struct iommu_remapped_entry *mem_entry, *tmp;
>> +
>> + mutex_lock(&__iommu_mem_list_lock);
>> + list_for_each_entry_safe(mem_entry, tmp, &__iommu_remapped_mem, list) {
>> + iounmap(mem_entry->mem);
>> + list_del(&mem_entry->list);
>> + kfree(mem_entry);
>> + }
>> + mutex_unlock(&__iommu_mem_list_lock);
>> + return 0;
>> +}
>> +
>> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
>> index a65208a..4bca7b5 100644
>> --- a/include/linux/intel-iommu.h
>> +++ b/include/linux/intel-iommu.h
>> @@ -368,4 +368,10 @@ extern int dmar_ir_support(void);
>>
>> extern const struct attribute_group *intel_iommu_groups[];
>>
>> +extern int __iommu_load_from_oldmem(void *to, unsigned long from,
>> + unsigned long size);
>> +extern int __iommu_save_to_oldmem(unsigned long to, void *from,
>> + unsigned long size);
>> +extern int __iommu_free_mapped_mem(void);
>> +
>> #endif
>> --
>> 2.0.0-rc0
>>

2015-05-07 17:32:06

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

Hi Baoquan, ZhenHua,

On Mon, May 04, 2015 at 11:17:49AM +0800, Baoquan He wrote:
> On 05/04/15 at 11:06am, Li, ZhenHua wrote:
> > Hi baoquan,
> > Could you paste the kernel log of the first kernel ?

Please let me know when you have worked this issue out.


Thanks,

Joerg

2015-05-08 01:02:00

by Li, ZhenHua

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

Hi Joerg,
This problem is caused by the latest updates in iommu module, and we are
trying to fix it.
When it is fixed, we will send out a new version of the patchset.

Thanks
Zhenhua

On 05/08/2015 01:32 AM, Joerg Roedel wrote:
> Hi Baoquan, ZhenHua,
>
> On Mon, May 04, 2015 at 11:17:49AM +0800, Baoquan He wrote:
>> On 05/04/15 at 11:06am, Li, ZhenHua wrote:
>>> Hi baoquan,
>>> Could you paste the kernel log of the first kernel ?
>
> Please let me know when you have worked this issue out.
>
>
> Thanks,
>
> Joerg
>

2015-06-11 15:40:19

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

On Fri, 2015-04-10 at 16:42 +0800, Li, Zhen-Hua wrote:
> This patchset is an update of Bill Sumner's patchset, implements a fix for:
> If a kernel boots with intel_iommu=on on a system that supports intel vt-d,
> when a panic happens, the kdump kernel will boot with these faults:

But, in the general case, it *does* boot.

There are two cases where it doesn't actually boot, and those are the
interesting ones.

Firstly, a device just keeps generating faults and we die in an
interrupt storm, reporting the same fault over and over again. That can
actually happen without kdump/kexec and the correct fix for that is to
have rate-limiting, disable fault reporting for the offending device
after too many are seen, and then eventually to tie it in to the PCIe
error handling as has been discussed elsewhere.

Secondly, there are devices which do not correctly respond to a
hardware reset. This is broken hardware, and if we really have to copy
the old contexts from the crashed kernel to work around it then I'd
like it to be on a blacklist basis — we do it only for hardware which
is *known* to be broken in this way.

(There's also some cases where the device driver doesn't even *try* to
reset the hardware and just assumes it'll find it in a sane state as
the BIOS or a cleanly shut down kexec would have left it. In those
cases of course we can just fix the driver).

I don't much like the idea of doing this context copy for *all*
hardware. That's masking hardware issues with reset that we really
*ought* to be finding.

I believe that most of the offending hardware is HP's; they like to do
the most, erm, "interesting" things with odd hardware and RMRRs and
stuff. So Zhen-Hua would you be able to provide the list of broken
devices that HP has shipped, for the purpose of such a blacklist?

I assume you've already contacted the hardware folks responsible and
insisted that their devices are fixed to be resettable already, right?

--
dwmw2


Attachments:
smime.p7s (5.56 kB)