LinuxLists.cc - [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump with iommu

2012-11-27 00:42:25

Subject: [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump with iommu

These patches reset PCIe devices at boot time to address a DMA problem
on kdump with iommu. When "reset_devices" is specified, a hot reset is
triggered on each PCIe root port and downstream port to reset its
downstream endpoints.

Background:
A DMA problem with kdump has been discussed for a long time: when the
system switches to the kdump kernel, DMA started by the first kernel
affects the second kernel. The problem surfaces especially when the
iommu is used for PCI passthrough on a KVM guest. On the machine I use,
when intel_iommu=on is specified, a DMAR error is detected in the kdump
kernel and a PCI SERR is also detected. Finally kdump fails because
some devices do not work correctly.

The root cause is that ongoing DMA from the first kernel causes a DMAR
fault, because the DMAR page table is initialized while the kdump kernel
is booting up. Therefore, to solve this problem, DMA needs to be stopped
before DMAR is initialized at kdump kernel boot time. With these
patches, when reset_devices is specified, PCIe devices are reset by a
hot reset, which stops their DMA. One problem with this solution is
that the monitor blacks out when the VGA controller is reset, so this
patch does not reset a port whose child endpoint is a VGA device.

What I tried:
- Clearing bus master bit and INTx disable bit at boot time
  This did not solve the problem. I still got DMAR errors on devices.
- Resetting devices in fixup_final (v1 patch)
  The DMAR error disappeared, but sometimes a PCI SERR was detected.
  This is well explained here:
  https://lkml.org/lkml/2012/9/9/245
  This PCI SERR seems to be related to interrupt remapping.
- Clearing bus master in setup_arch() and resetting devices in
  fixup_final
  Neither the DMAR error nor the PCI SERR occurred, but on a certain
  machine the kdump kernel hung when resetting devices. It seems to be
  a problem specific to that platform.
- Resetting devices in setup_arch() (v2 and later patches)
  This solves all the problems I have found so far.

Changelog:
v7:
Update Yinghai's dummy-pci patch with macros in linux/pci.h, and fix
some bugs

v6:
Rewrite using Yinghai's dummy-pci patch
https://lkml.org/lkml/2012/11/13/118

v5:
Do bus reset after all devices are scanned and their config registers
are saved. This fixes a bug where config registers were accessed
without a delay after reset.
https://lkml.org/lkml/2012/10/17/47

v4:
Reduce waiting time after resetting devices. The previous patch does
the reset like this:
  for (each device) {
    save config registers
    reset
    wait for 500 ms
    restore config registers
  }

If there are N devices to be reset, it takes N*500 ms. On the other
hand, the v4 patch does:
  for (each device) {
    save config registers
    reset
  }
  wait 500 ms
  for (each device) {
    restore config registers
  }
Though it needs more memory to save the config registers, the waiting
time is always 500 ms.
https://lkml.org/lkml/2012/10/15/49
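
The v4 trade-off can be put into numbers with a small stand-alone
sketch. This is not code from the patches; the helper names below are
hypothetical and only the 500 ms figure comes from the changelog entry.
It assumes that nothing is waited for when no device was reset.

```c
#include <stddef.h>

#define RESET_WAIT_MS 500	/* wait quoted in the v4 changelog entry */

/* Scheme before v4: each device is saved, reset, waited for, restored. */
static inline unsigned long wait_per_device_ms(size_t ndev)
{
	return (unsigned long)ndev * RESET_WAIT_MS;
}

/* v4 scheme: reset every device first, then wait once for all of them. */
static inline unsigned long wait_batched_ms(size_t ndev)
{
	return ndev ? RESET_WAIT_MS : 0;
}

/* Saved config-register sets that must be kept in memory at once. */
static inline size_t saved_sets_per_device(size_t ndev)
{
	return ndev ? 1 : 0;
}

static inline size_t saved_sets_batched(size_t ndev)
{
	return ndev;
}
```

With 4 devices, the per-device scheme waits 2000 ms and keeps one saved
set at a time, while the batched scheme waits 500 ms and keeps 4 sets.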

v3:
Move alloc_bootmem and free_bootmem to early_reset_pcie_devices so that
they are called only once.
https://lkml.org/lkml/2012/10/10/57

v2:
Reset devices in setup_arch() because the reset needs to be done before
interrupt remapping is initialized.
https://lkml.org/lkml/2012/10/2/54

v1:
Add fixup_final quirk to reset PCIe devices
https://lkml.org/lkml/2012/8/3/160

Takao Indoh (5):
x86, pci: add dummy pci device for early stage
PCI: Define the maximum number of PCI function
Make reset_devices available at early stage
x86, pci: Reset PCIe devices at boot time
x86, pci: Enable PCI INTx when MSI is disabled

arch/x86/include/asm/pci-direct.h | 3 +
arch/x86/kernel/setup.c | 3 +
arch/x86/pci/common.c | 4 +-
arch/x86/pci/early.c | 315 +++++++++++++++++++++++++++++++++++++
include/linux/pci.h | 2 +
init/main.c | 4 +-
6 files changed, 328 insertions(+), 3 deletions(-)

2012-11-27 00:42:41

by Takao Indoh

Subject: [PATCH v7 1/5] x86, pci: add dummy pci device for early stage

Add a dummy pci_dev for the early boot stage so that we can pass a
struct pci_dev *dev and reuse some generic PCI functions.

The original patch was written by Yinghai Lu.

Signed-off-by: Takao Indoh <[email protected]>
---
arch/x86/include/asm/pci-direct.h | 2 +
arch/x86/pci/early.c | 74 +++++++++++++++++++++++++++++++++++++
2 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/pci-direct.h b/arch/x86/include/asm/pci-direct.h
index b1e7a45..b6360d3 100644
--- a/arch/x86/include/asm/pci-direct.h
+++ b/arch/x86/include/asm/pci-direct.h
@@ -18,4 +18,6 @@ extern int early_pci_allowed(void);
extern unsigned int pci_early_dump_regs;
extern void early_dump_pci_device(u8 bus, u8 slot, u8 func);
extern void early_dump_pci_devices(void);
+
+struct pci_dev *get_early_pci_dev(int num, int slot, int func);
#endif /* _ASM_X86_PCI_DIRECT_H */
diff --git a/arch/x86/pci/early.c b/arch/x86/pci/early.c
index d1067d5..024def7 100644
--- a/arch/x86/pci/early.c
+++ b/arch/x86/pci/early.c
@@ -109,3 +109,77 @@ void early_dump_pci_devices(void)
}
}
}
+
+static __init int
+early_pci_read(struct pci_bus *bus, unsigned int devfn, int where,
+ int size, u32 *value)
+{
+ int num, slot, func;
+
+ num = bus->number;
+ slot = PCI_SLOT(devfn);
+ func = PCI_FUNC(devfn);
+ switch (size) {
+ case 1:
+ *value = read_pci_config_byte(num, slot, func, where);
+ break;
+ case 2:
+ *value = read_pci_config_16(num, slot, func, where);
+ break;
+ case 4:
+ *value = read_pci_config(num, slot, func, where);
+ break;
+ }
+
+ return 0;
+}
+
+static __init int
+early_pci_write(struct pci_bus *bus, unsigned int devfn, int where,
+ int size, u32 value)
+{
+ int num, slot, func;
+
+ num = bus->number;
+ slot = PCI_SLOT(devfn);
+ func = PCI_FUNC(devfn);
+ switch (size) {
+ case 1:
+ write_pci_config_byte(num, slot, func, where, (u8)value);
+ break;
+ case 2:
+ write_pci_config_16(num, slot, func, where, (u16)value);
+ break;
+ case 4:
+ write_pci_config(num, slot, func, where, (u32)value);
+ break;
+ }
+
+ return 0;
+}
+
+static __initdata struct pci_ops pci_early_ops = {
+ .read = early_pci_read,
+ .write = early_pci_write,
+};
+static __initdata struct pci_bus pci_early_bus = {
+ .ops = &pci_early_ops,
+};
+static __initdata char pci_early_init_name[8];
+static __initdata struct pci_dev pci_early_dev;
+
+__init struct pci_dev *get_early_pci_dev(int num, int slot, int func)
+{
+ struct pci_dev *pdev;
+
+ pdev = &pci_early_dev;
+ memset(pdev, 0, sizeof(*pdev));
+
+ pdev->bus = &pci_early_bus;
+ pdev->dev.init_name = pci_early_init_name;
+ pdev->bus->number = num;
+ pdev->devfn = PCI_DEVFN(slot, func);
+ sprintf((char *)pdev->dev.init_name, "%02x:%02x.%01x", num, slot, func);
+
+ return pdev;
+}
--
1.7.1
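
For readers following get_early_pci_dev(), the devfn packing and the
name format it relies on can be tried in user space. The sketch below
copies the three macros as they are defined in the kernel's linux/pci.h;
format_early_name() is a hypothetical helper written for illustration
and is not part of the patch.

```c
#include <stdio.h>
#include <stddef.h>

/* Same definitions as the kernel's linux/pci.h */
#define PCI_DEVFN(slot, func)	((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define PCI_SLOT(devfn)		(((devfn) >> 3) & 0x1f)
#define PCI_FUNC(devfn)		((devfn) & 0x07)

/*
 * Hypothetical helper: build the "bb:ss.f" name with the same format
 * string the patch passes to sprintf(). snprintf() is used here so a
 * too-small buffer is truncated instead of overrun.
 */
static inline int format_early_name(char *buf, size_t len,
				    int bus, int slot, int func)
{
	return snprintf(buf, len, "%02x:%02x.%01x", bus, slot, func);
}
```

A name such as "01:00.0" is 7 characters plus the terminating NUL,
which is why an 8-byte pci_early_init_name buffer is enough.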

2012-11-27 00:42:49

by Takao Indoh

Subject: [PATCH v7 2/5] PCI: Define the maximum number of PCI function

Define the maximum number of PCI functions per device so that PCI
functions can be enumerated without hard-coding "8".

Signed-off-by: Takao Indoh <[email protected]>
---
include/linux/pci.h | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index ee21795..eca3231 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -35,6 +35,8 @@
/* Include the ID list */
#include <linux/pci_ids.h>

+#define PCI_MAX_FUNCTIONS 8
+
/* pci_slot represents a physical slot */
struct pci_slot {
struct pci_bus *bus; /* The bus this slot is on */
--
1.7.1

2012-11-27 00:43:04

by Takao Indoh

Subject: [PATCH v7 3/5] Make reset_devices available at early stage

Change reset_devices from __setup to early_param so that this parameter
is available at an early stage of boot.

Signed-off-by: Takao Indoh <[email protected]>
---
init/main.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/init/main.c b/init/main.c
index e33e09d..f2b24cb 100644
--- a/init/main.c
+++ b/init/main.c
@@ -144,10 +144,10 @@ EXPORT_SYMBOL(reset_devices);
static int __init set_reset_devices(char *str)
{
reset_devices = 1;
- return 1;
+ return 0;
}

-__setup("reset_devices", set_reset_devices);
+early_param("reset_devices", set_reset_devices);

static const char * argv_init[MAX_INIT_ARGS+2] = { "init", NULL, };
const char * envp_init[MAX_INIT_ENVS+2] = { "HOME=/", "TERM=linux", NULL, };
--
1.7.1

2012-11-27 00:43:20

by Takao Indoh

Subject: [PATCH v7 4/5] x86, pci: Reset PCIe devices at boot time

This patch resets PCIe devices at boot time when "reset_devices" is
specified.

Kdump with intel_iommu=on fails because ongoing DMA from the first
kernel causes a DMAR fault when the DMAR page table is initialized while
the kdump kernel is booting up. To solve this problem, this patch resets
PCIe devices during boot to stop their DMA.

Signed-off-by: Takao Indoh <[email protected]>
---
arch/x86/include/asm/pci-direct.h | 1 +
arch/x86/kernel/setup.c | 3 +
arch/x86/pci/early.c | 241 +++++++++++++++++++++++++++++++++++++
3 files changed, 245 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/pci-direct.h b/arch/x86/include/asm/pci-direct.h
index b6360d3..5620070 100644
--- a/arch/x86/include/asm/pci-direct.h
+++ b/arch/x86/include/asm/pci-direct.h
@@ -18,6 +18,7 @@ extern int early_pci_allowed(void);
extern unsigned int pci_early_dump_regs;
extern void early_dump_pci_device(u8 bus, u8 slot, u8 func);
extern void early_dump_pci_devices(void);
+extern void early_reset_pcie_devices(void);

struct pci_dev *get_early_pci_dev(int num, int slot, int func);
#endif /* _ASM_X86_PCI_DIRECT_H */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ca45696..2e7928e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1001,6 +1001,9 @@ void __init setup_arch(char **cmdline_p)
generic_apic_probe();

early_quirks();
+#ifdef CONFIG_PCI
+ early_reset_pcie_devices();
+#endif

/*
* Read APIC and some other early information from ACPI tables.
diff --git a/arch/x86/pci/early.c b/arch/x86/pci/early.c
index 024def7..ff737f3 100644
--- a/arch/x86/pci/early.c
+++ b/arch/x86/pci/early.c
@@ -1,5 +1,6 @@
#include <linux/kernel.h>
#include <linux/pci.h>
+#include <linux/bootmem.h>
#include <asm/pci-direct.h>
#include <asm/io.h>
#include <asm/pci_x86.h>
@@ -183,3 +184,243 @@ __init struct pci_dev *get_early_pci_dev(int num, int slot, int func)

return pdev;
}
+
+struct pcie_dev {
+ int cap; /* position of PCI Express capability */
+ int flags; /* PCI_EXP_FLAGS */
+
+ /* saved configuration register */
+ u32 pci_cfg[16];
+ u16 pcie_cfg[7];
+};
+
+struct pcie_port {
+ struct list_head dev;
+ u8 bus;
+ u8 slot;
+ u8 func;
+ u8 secondary;
+ struct pcie_dev child[PCI_MAX_FUNCTIONS];
+};
+
+static __initdata LIST_HEAD(device_list);
+
+static void __init early_udelay(int loops)
+{
+ while (loops--) {
+ /* Approximately 1 us */
+ native_io_delay();
+ }
+}
+
+static void __init do_reset(u8 bus, u8 slot, u8 func)
+{
+ struct pci_dev *dev;
+ u16 ctrl;
+
+ dev = get_early_pci_dev(bus, slot, func);
+
+ printk(KERN_INFO "pci 0000:%02x:%02x.%d reset\n", bus, slot, func);
+
+ /* Assert Secondary Bus Reset */
+ pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
+ ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
+ pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
+
+ /*
+ * PCIe spec requires software to ensure a minimum reset duration
+ * (Trst == 1ms). We have here 5ms safety margin because early_udelay
+ * is not precise.
+ */
+ early_udelay(5000);
+
+ /* De-assert Secondary Bus Reset */
+ ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
+ pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
+}
+
+static void __init save_state(u8 bus, u8 slot, u8 func, struct pcie_dev *pdev)
+{
+ struct pci_dev *dev;
+ int i;
+
+ dev = get_early_pci_dev(bus, slot, func);
+ dev->is_pcie = 1;
+ dev->pcie_cap = pdev->cap;
+ dev->pcie_flags_reg = pdev->flags;
+
+ printk(KERN_INFO "pci 0000:%02x:%02x.%d save state\n", bus, slot, func);
+
+ for (i = 0; i < 16; i++)
+ pci_read_config_dword(dev, i * 4, pdev->pci_cfg + i);
+ i = 0;
+ pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &pdev->pcie_cfg[i++]);
+ pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &pdev->pcie_cfg[i++]);
+ pcie_capability_read_word(dev, PCI_EXP_SLTCTL, &pdev->pcie_cfg[i++]);
+ pcie_capability_read_word(dev, PCI_EXP_RTCTL, &pdev->pcie_cfg[i++]);
+ pcie_capability_read_word(dev, PCI_EXP_DEVCTL2, &pdev->pcie_cfg[i++]);
+ pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &pdev->pcie_cfg[i++]);
+ pcie_capability_read_word(dev, PCI_EXP_SLTCTL2, &pdev->pcie_cfg[i++]);
+}
+
+static void __init restore_state(u8 bus, u8 slot, u8 func,
+ struct pcie_dev *pdev)
+{
+ struct pci_dev *dev;
+ int i = 0;
+
+ dev = get_early_pci_dev(bus, slot, func);
+ dev->is_pcie = 1;
+ dev->pcie_cap = pdev->cap;
+ dev->pcie_flags_reg = pdev->flags;
+
+ printk(KERN_INFO "pci 0000:%02x:%02x.%d restore state\n",
+ bus, slot, func);
+
+ pcie_capability_write_word(dev, PCI_EXP_DEVCTL, pdev->pcie_cfg[i++]);
+ pcie_capability_write_word(dev, PCI_EXP_LNKCTL, pdev->pcie_cfg[i++]);
+ pcie_capability_write_word(dev, PCI_EXP_SLTCTL, pdev->pcie_cfg[i++]);
+ pcie_capability_write_word(dev, PCI_EXP_RTCTL, pdev->pcie_cfg[i++]);
+ pcie_capability_write_word(dev, PCI_EXP_DEVCTL2, pdev->pcie_cfg[i++]);
+ pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, pdev->pcie_cfg[i++]);
+ pcie_capability_write_word(dev, PCI_EXP_SLTCTL2, pdev->pcie_cfg[i++]);
+
+ for (i = 15; i >= 0; i--)
+ pci_write_config_dword(dev, i * 4, pdev->pci_cfg[i]);
+}
+
+static void __init find_pcie_device(unsigned bus, unsigned slot, unsigned func)
+{
+ struct pci_dev *dev;
+ int f, pcie_type, count;
+ u8 secondary, type;
+ u16 vendor;
+ u32 class;
+ struct pcie_port *port;
+ int pcie_cap[PCI_MAX_FUNCTIONS];
+ int pcie_flags[PCI_MAX_FUNCTIONS];
+
+ dev = get_early_pci_dev(bus, slot, func);
+ set_pcie_port_type(dev);
+ if (!pci_is_pcie(dev))
+ return;
+
+ pcie_type = pci_pcie_type(dev);
+ if ((pcie_type != PCI_EXP_TYPE_ROOT_PORT) &&
+ (pcie_type != PCI_EXP_TYPE_DOWNSTREAM))
+ return;
+
+ pci_read_config_byte(dev, PCI_HEADER_TYPE, &type);
+ if ((type & 0x7f) != PCI_HEADER_TYPE_BRIDGE)
+ return;
+ pci_read_config_byte(dev, PCI_SECONDARY_BUS, &secondary);
+
+ memset(pcie_cap, 0, sizeof(pcie_cap));
+ memset(pcie_flags, 0, sizeof(pcie_flags));
+ for (count = 0, f = 0; f < PCI_MAX_FUNCTIONS; f++) {
+ dev = get_early_pci_dev(secondary, 0, f);
+ pci_read_config_word(dev, PCI_VENDOR_ID, &vendor);
+ if (vendor == 0xffff)
+ continue;
+
+ set_pcie_port_type(dev);
+ if (!pci_is_pcie(dev))
+ continue;
+
+ pcie_type = pci_pcie_type(dev);
+ if ((pcie_type == PCI_EXP_TYPE_UPSTREAM) ||
+ (pcie_type == PCI_EXP_TYPE_PCI_BRIDGE))
+ /* Don't reset switch, bridge */
+ return;
+
+ pci_read_config_dword(dev, PCI_CLASS_REVISION, &class);
+ if ((class >> 24) == PCI_BASE_CLASS_DISPLAY)
+ /* Don't reset VGA device */
+ return;
+
+ count++;
+ pcie_cap[f] = dev->pcie_cap;
+ pcie_flags[f] = dev->pcie_flags_reg;
+ }
+
+ if (!count)
+ return;
+
+ port = (struct pcie_port *)alloc_bootmem(sizeof(struct pcie_port));
+ if (port == NULL) {
+ printk(KERN_ERR "pci 0000:%02x:%02x.%d alloc_bootmem failed\n",
+ bus, slot, func);
+ return;
+ }
+ memset(port, 0, sizeof(*port));
+ port->bus = bus;
+ port->slot = slot;
+ port->func = func;
+ port->secondary = secondary;
+ for (f = 0; f < PCI_MAX_FUNCTIONS; f++)
+ if (pcie_cap[f]) {
+ port->child[f].cap = pcie_cap[f];
+ port->child[f].flags = pcie_flags[f];
+ save_state(secondary, 0, f, &port->child[f]);
+ }
+ list_add_tail(&port->dev, &device_list);
+}
+
+void __init early_reset_pcie_devices(void)
+{
+ unsigned bus, slot, func;
+ struct pcie_port *port, *tmp;
+ struct pci_dev *dev;
+
+ if (!early_pci_allowed() || !reset_devices)
+ return;
+
+ /*
+ * Find PCIe port(root port and downstream port) and save config
+ * registers of its downstream devices
+ */
+ for (bus = 0; bus < 256; bus++) {
+ for (slot = 0; slot < 32; slot++) {
+ for (func = 0; func < PCI_MAX_FUNCTIONS; func++) {
+ u16 vendor;
+ u8 type;
+
+ dev = get_early_pci_dev(bus, slot, func);
+ pci_read_config_word(dev, PCI_VENDOR_ID,
+ &vendor);
+ if (vendor == 0xffff)
+ continue;
+
+ pci_read_config_byte(dev, PCI_HEADER_TYPE,
+ &type);
+ find_pcie_device(bus, slot, func);
+
+ if ((func == 0) && !(type & 0x80))
+ break;
+ }
+ }
+ }
+
+ if (list_empty(&device_list))
+ return;
+
+ /* Do bus reset */
+ list_for_each_entry(port, &device_list, dev)
+ do_reset(port->bus, port->slot, port->func);
+
+ /*
+ * According to PCIe spec, software must wait a minimum of 100 ms
+ * before sending a configuration request. We have 500ms safety margin
+ * here.
+ */
+ early_udelay(500000);
+
+ /* Restore config registers and free memory */
+ list_for_each_entry_safe(port, tmp, &device_list, dev) {
+ for (func = 0; func < PCI_MAX_FUNCTIONS; func++)
+ if (port->child[func].cap)
+ restore_state(port->secondary, 0, func,
+ &port->child[func]);
+ free_bootmem(__pa(port), sizeof(*port));
+ }
+}
--
1.7.1
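
The register tests that decide which ports get reset, and the bit that
do_reset() toggles, can be exercised on plain integers. This is only a
sketch: the constants use the values found in the kernel's pci_regs.h
and pci_ids.h, while the helper functions are hypothetical names made
up for illustration and do not appear in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in the kernel headers */
#define PCI_HEADER_TYPE_BRIDGE		1
#define PCI_BRIDGE_CTL_BUS_RESET	0x40
#define PCI_BASE_CLASS_DISPLAY		0x03

/* Bit 7 of the header type is the multi-function flag, so mask it off. */
static inline bool is_bridge(uint8_t header_type)
{
	return (header_type & 0x7f) == PCI_HEADER_TYPE_BRIDGE;
}

static inline bool is_multifunction(uint8_t header_type)
{
	return (header_type & 0x80) != 0;
}

/* The base class is the top byte of the dword at PCI_CLASS_REVISION. */
static inline bool is_display(uint32_t class_revision)
{
	return (class_revision >> 24) == PCI_BASE_CLASS_DISPLAY;
}

/* Bridge Control value to write when asserting Secondary Bus Reset. */
static inline uint16_t assert_bus_reset(uint16_t ctrl)
{
	return ctrl | PCI_BRIDGE_CTL_BUS_RESET;
}

/* Bridge Control value to write when de-asserting it again. */
static inline uint16_t deassert_bus_reset(uint16_t ctrl)
{
	return ctrl & (uint16_t)~PCI_BRIDGE_CTL_BUS_RESET;
}
```

For example, a header type of 0x81 is a multi-function bridge, and a
class/revision dword of 0x03000000 identifies a display device, which
the patch skips.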

2012-11-27 00:43:34

by Takao Indoh

Subject: [PATCH v7 5/5] x86, pci: Enable PCI INTx when MSI is disabled

This patch enables INTx if MSI is disabled in pcibios_enable_device().
Normally the interrupt disable bit in the command register is 0b at
boot time, but in the kdump case this bit may be 1b, which causes
problems for some drivers. At least I confirmed that the mptsas driver
does not work in such a case. This patch fixes the problem.

Signed-off-by: Takao Indoh <[email protected]>
---
arch/x86/pci/common.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index 720e973..2bb7ecc 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -615,8 +615,10 @@ int pcibios_enable_device(struct pci_dev *dev, int mask)
if ((err = pci_enable_resources(dev, mask)) < 0)
return err;

- if (!pci_dev_msi_enabled(dev))
+ if (!pci_dev_msi_enabled(dev)) {
+ pci_intx(dev, true);
return pcibios_enable_irq(dev);
+ }
return 0;
}

--
1.7.1
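
As a rough illustration of what enabling INTx means at the register
level, the sketch below applies the change to a command register value.
PCI_COMMAND_INTX_DISABLE has the value from the kernel's pci_regs.h;
intx_command() is a hypothetical user-space helper, not the kernel's
pci_intx(), and it only models the bit manipulation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 10 of the PCI command register, value as in pci_regs.h */
#define PCI_COMMAND_INTX_DISABLE	0x400

/*
 * Hypothetical helper: return the command register value that has
 * INTx enabled (disable bit cleared) or disabled (disable bit set).
 */
static inline uint16_t intx_command(uint16_t cmd, bool enable)
{
	if (enable)
		return cmd & (uint16_t)~PCI_COMMAND_INTX_DISABLE;
	return cmd | PCI_COMMAND_INTX_DISABLE;
}
```

A command value of 0x0407 left behind by the first kernel becomes
0x0007 once INTx is enabled; the other command bits are not touched.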

2012-11-30 15:50:53

by MUNEDA Takahiro

Subject: Re: [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump with iommu

On Tue, 27 Nov 2012 09:42:20 +0900 (JST),
Takao Indoh <[email protected]> wrote:

> These patches reset PCIe devices at boot time to address DMA problem on
> kdump with iommu. When "reset_devices" is specified, a hot reset is
> triggered on each PCIe root port and downstream port to reset its
> downstream endpoint.
>
> Background:
> A kdump problem about DMA has been discussed for a long time. That is,
> when a kernel is switched to the kdump kernel, DMA derived from first
> kernel affects second kernel. Especially this problem surfaces when
> iommu is used for PCI passthrough on KVM guest. In the case of the
> machine I use, when intel_iommu=on is specified, DMAR error is detected
> in kdump kernel and PCI SERR is also detected. Finally kdump fails
> because some devices does not work correctly.
>
> The root cause is that ongoing DMA from first kernel causes DMAR fault
> because page table of DMAR is initialized while kdump kernel is booting
> up. Therefore to solve this problem DMA needs to be stopped before DMAR
> is initialized at kdump kernel boot time. By these patches, PCIe devices
> are reset by hot reset and its DMA is stopped when reset_devices is
> specified. One problem of this solution is that the monitor blacks out
> when VGA controller is reset. So this patch does not reset the port
> whose child endpoint is VGA device.
>
> What I tried:
> - Clearing bus master bit and INTx disable bit at boot time
> This did not solve this problem. I still got DMAR error on devices.
> - Resetting devices in fixup_final(v1 patch)
> DMAR error disappeared, but sometimes PCI SERR was detected. This
> is well explained here.
> https://lkml.org/lkml/2012/9/9/245
> This PCI SERR seems to be related to interrupt remapping.
> - Clearing bus master in setup_arch() and resetting devices in
> fixup_final
> Neither DMAR error nor PCI SERR occurred. But on certain machine
> kdump kernel hung up when resetting devices. It seems to be a
> problem specific to the platform.
> - Resetting devices in setup_arch() (v2 and later patch)
> This solution solves all problems I found so far.

Thank you for updating the patchset.
I have a server that raises a PCI error while the system is rebooting
when I set intel_iommu=on. With v7 on top of 3.7-rc7, I don't see any
PCI errors or other hardware-related errors. So,

Tested-by: MUNEDA Takahiro <[email protected]>

Thanks,
Takahiro


2013-03-04 00:57:25

by Takao Indoh

Subject: Re: [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump with iommu

(2013/01/23 9:47), Thomas Renninger wrote:
> On Monday, January 21, 2013 10:11:04 AM Takao Indoh wrote:
>> (2013/01/08 4:09), Thomas Renninger wrote:
> ...
>>> I tried the provided patches first on 2.6.32, then I verfied with 3.8-rc2
>>> and in both cases the disk is not detected anymore in
>>> reset_devices (kexec'ed/kdump) case (but things work fine without these
>>> patches).
>>
>> So the problem that the disk is not detected was caused by exactmap
>> problem you guys are discussing? Or still not detected even if exactmap
>> problem is fixed?
> This problem is related to the 5 PCI resetting patches.
> Dumping worked with a 2.6.32 and a 3.8-rc2 kernel, adding the PCI resetting
> patches broke both. I first tried 2.6.32 and verified with 3.8-rc2 to make sure
> I didn't mess up the backport adjustings of the patches to 2.6.32.
>
> Unfortunately this Dell platform takes really long to boot.
> I can give it the one or other test, but please do not bomb me with patches.
>
> For info:
> About the interrupt remapping error interrupt storm in kdump case I tried to
> reproduce on this machine, but never could: The guys who saw that also cannot
> reproduce this anymore.
>
> Two ideas I had about this:
> - As said already, (also) try to catch the error case and try to reset the
> the device in AER/Specific iterrupt remapping error interrupt caught.

I tried this idea, but it did not work with megaraid_sas.

I made an experimental patch that resets a device when a DMAR error is
detected on it. What happened was:
1) The megaraid_sas module is loaded.
2) A DMAR error is detected during driver initialization.
3) The device is reset.
4) kdump fails because the disk is not found.

When I tested the patches that reset all devices early in boot, the
disk was recognized correctly, so it seems that resetting a device while
its driver is loading breaks something. I think we need to reset the
device at least before its driver is loaded.

Thanks,
Takao Indoh

> - Have a look at coreboot, these guys should know how to initialize the PCI
> subsystem from scratch and might have some well tested PCI resetting
> code in place already (no idea, just a thought).
>
> Thomas
>
>

2013-03-04 22:02:05

by Donald Dutile

Subject: Re: [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump with iommu

On 03/03/2013 07:56 PM, Takao Indoh wrote:
> (2013/01/23 9:47), Thomas Renninger wrote:
>> On Monday, January 21, 2013 10:11:04 AM Takao Indoh wrote:
>>> (2013/01/08 4:09), Thomas Renninger wrote:
>> ...
>>>> I tried the provided patches first on 2.6.32, then I verfied with 3.8-rc2
>>>> and in both cases the disk is not detected anymore in
>>>> reset_devices (kexec'ed/kdump) case (but things work fine without these
>>>> patches).
>>>
>>> So the problem that the disk is not detected was caused by exactmap
>>> problem you guys are discussing? Or still not detected even if exactmap
>>> problem is fixed?
>> This problem is related to the 5 PCI resetting patches.
>> Dumping worked with a 2.6.32 and a 3.8-rc2 kernel, adding the PCI resetting
>> patches broke both. I first tried 2.6.32 and verified with 3.8-rc2 to make sure
>> I didn't mess up the backport adjustings of the patches to 2.6.32.
>>
>> Unfortunately this Dell platform takes really long to boot.
>> I can give it the one or other test, but please do not bomb me with patches.
>>
>> For info:
>> About the interrupt remapping error interrupt storm in kdump case I tried to
>> reproduce on this machine, but never could: The guys who saw that also cannot
>> reproduce this anymore.
>>
>> Two ideas I had about this:
>> - As said already, (also) try to catch the error case and try to reset the
>> the device in AER/Specific iterrupt remapping error interrupt caught.
>
> I tried this idea but it did not work on megaraid_sas.
>
> I made a experimental patch so that devices are reset when DMAR error is
> detected on it. What happened is that:
> 1) megaraid_sas module is loaded.
> 2) DMAR error is detected during the driver initialization.
This driver does something bad that the IOMMU code isn't designed for,
or doesn't handle correctly -- it starts with one dma-mask, does an
IOMMU mapping, changes its dma-mask, and that moves it into another
domain that's not valid for the first mask... and it does occasional
accesses with the original mask.
I have it on my to-do list to dig into the driver more to see if that
sequence can be changed/fixed.

> 3) Reset device
> 4) kdump fails because the disk is not found.
>
> When I tested patches which reset all devices in early boot time, the
> disk was recognized correctly, so it seems that device reset during its
> driver loading does something wrong. I think we need reset device at
driver reset, or master-enable turned off?

> least before its driver is loaded.
>
> Thanks,
> Takao Indoh
>
>
>> - Have a look at coreboot, these guys should know how to initialize the PCI
>> subsystem from scratch and might have some well tested PCI resetting
>> code in place already (no idea, just a thought).
>>
>> Thomas
>>
>>
>

2013-03-05 00:57:31

by Takao Indoh

Subject: Re: [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump with iommu

(2013/03/05 7:00), Don Dutile wrote:
> On 03/03/2013 07:56 PM, Takao Indoh wrote:
>> (2013/01/23 9:47), Thomas Renninger wrote:
>>> On Monday, January 21, 2013 10:11:04 AM Takao Indoh wrote:
>>>> (2013/01/08 4:09), Thomas Renninger wrote:
>>> ...
>>>>> I tried the provided patches first on 2.6.32, then I verfied with 3.8-rc2
>>>>> and in both cases the disk is not detected anymore in
>>>>> reset_devices (kexec'ed/kdump) case (but things work fine without these
>>>>> patches).
>>>>
>>>> So the problem that the disk is not detected was caused by exactmap
>>>> problem you guys are discussing? Or still not detected even if exactmap
>>>> problem is fixed?
>>> This problem is related to the 5 PCI resetting patches.
>>> Dumping worked with a 2.6.32 and a 3.8-rc2 kernel, adding the PCI resetting
>>> patches broke both. I first tried 2.6.32 and verified with 3.8-rc2 to make sure
>>> I didn't mess up the backport adjustings of the patches to 2.6.32.
>>>
>>> Unfortunately this Dell platform takes really long to boot.
>>> I can give it the one or other test, but please do not bomb me with patches.
>>>
>>> For info:
>>> About the interrupt remapping error interrupt storm in kdump case I tried to
>>> reproduce on this machine, but never could: The guys who saw that also cannot
>>> reproduce this anymore.
>>>
>>> Two ideas I had about this:
>>> - As said already, (also) try to catch the error case and try to reset the
>>> the device in AER/Specific iterrupt remapping error interrupt caught.
>>
>> I tried this idea but it did not work on megaraid_sas.
>>
>> I made a experimental patch so that devices are reset when DMAR error is
>> detected on it. What happened is that:
>> 1) megaraid_sas module is loaded.
>> 2) DMAR error is detected during the driver initialization.
> This driver does something bad that IOMMU code isn't designed for,
> or handle correctly -- it starts with one dma-mask, does an IOMMU mapping,
> changes its dma-mask, and that moves it into another domain that's not
> valid for the first mask.... and does occassional access with original mask.
> I have it on my to-do list to dig into the driver more to see if that
> sequence can be changed/fixed.
>
>> 3) Reset device
>> 4) kdump fails because the disk is not found.
>>
>> When I tested patches which reset all devices in early boot time, the
>> disk was recognized correctly, so it seems that device reset during its
>> driver loading does something wrong. I think we need reset device at
> driver rest, or master-enable turned off ?

I have another patch that turns off the bus master bit in an early
quirk, but DMAR errors are still detected after the driver is loaded,
as shown below. This may be a driver problem, as you said above.

Loading mptscsih.koigb: Intel(R) Gigabit Ethernet Network Driver - version 4.1.2-k
module
Loadingigb: Copyright (c) 2007-2012 Intel Corporation.
scsi_transport_dmar: DRHD: handling fault status reg 102
dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr ffe16000
DMAR:[fault reason 01] Present bit in root entry is clear
Uhhuh. NMI received for unknown reason 2c on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
fc.ko module
Loigb 0000:01:00.0: irq 76 for MSI/MSI-X
ading dm-log.ko igb 0000:01:00.0: irq 77 for MSI/MSI-X
module
Loading igb 0000:01:00.0: irq 78 for MSI/MSI-X
nf_conntrack_ipv6.ko module
Loading vhost_net.ko module
Loading igb.ko module
igb 0000:01:00.0: DCA enabled
igb 0000:01:00.0: Intel(R) Gigabit Ethernet Network Connection
igb 0000:01:00.0: eth0: (PCIe:2.5Gb/s:Width x2) c8:0a:a9:9d:fa:52
igb 0000:01:00.0: eth0: PBA No: 323131-030
igb 0000:01:00.0: Using MSI-X interrupts. 1 rx queue(s), 1 tx queue(s)
dmar: DRHD: handling fault status reg 202
dmar: DMAR:[DMA Write] Request device [01:00.1] fault addr ffee9000
DMAR:[fault reason 01] Present bit in root entry is clear
igb 0000:01:00.1: irq 79 for MSI/MSI-X
igb 0000:01:00.1: irq 80 for MSI/MSI-X
igb 0000:01:00.1: irq 81 for MSI/MSI-X
(snip)

Thanks,
Takao Indoh

>> least before its driver is loaded.
>>
>> Thanks,
>> Takao Indoh
>>
>>
>>> - Have a look at coreboot, these guys should know how to initialize the PCI
>>> subsystem from scratch and might have some well tested PCI resetting
>>> code in place already (no idea, just a thought).
>>>
>>> Thomas
>>>
>>>
>>
>
>