2013-05-14 05:30:15

by Takao Indoh

Subject: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

This patch resets PCIe devices on boot to stop ongoing DMA. When
"pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
on each PCIe root port and downstream port to reset its downstream
endpoint.

Problem:
This patch solves a problem where kdump can fail when intel_iommu=on is
specified. In that case, many DMA-remapping errors occur in the second
kernel, causing problems such as driver errors or PCI SERR, and kdump
ultimately fails. The problem arises as follows.
1) Devices are working in the first kernel.
2) The system switches to the second (kdump) kernel. The devices are
still active and their DMA continues during this switch.
3) The IOMMU is initialized during second-kernel boot, and the ongoing
DMA causes DMA-remapping errors.

Solution:
All DMA transactions have to be stopped before the IOMMU is
initialized. With this patch, devices are reset and in-flight DMA is
stopped before pci_iommu_init.

To invoke a hot reset on an endpoint, its upstream link needs to be
reset. reset_pcie_endpoints() is called from fs_initcall_sync; it finds
each root port or downstream port whose child is a PCIe endpoint and
then resets the link between them. If the endpoint is a VGA device, it
is skipped, because the monitor blacks out if the VGA controller is
reset.
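As a purely illustrative usage sketch (nothing on this example command
line other than pci=pcie_reset_endpoint_devices comes from this patch;
the other options are common kdump placeholders):

```shell
# Example kdump (second) kernel command-line fragment. Only the
# "pci=pcie_reset_endpoint_devices" option is added by this patch;
# "irqpoll" and "maxcpus=1" are placeholder options often used for
# kdump kernels and are not part of this change.
irqpoll maxcpus=1 intel_iommu=on pci=pcie_reset_endpoint_devices
```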

Changelog:
v2:
- Read pci config before de-assert secondary bus reset to flush previous
write
- Change function/variable name
- Make a list of devices to be reset

v1:
https://patchwork.kernel.org/patch/2482291/

Signed-off-by: Takao Indoh <[email protected]>
---
Documentation/kernel-parameters.txt | 2 +
drivers/pci/pci.c | 113 +++++++++++++++++++++++++++++++++++
2 files changed, 115 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index c3bfacb..8c9e8e4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
any pair of devices, possibly at the cost of
reduced performance. This also guarantees
that hot-added devices will work.
+ pcie_reset_endpoint_devices Reset PCIe endpoint on boot by
+ hot reset
cbiosize=nn[KMG] The fixed amount of bus space which is
reserved for the CardBus bridge's IO window.
The default value is 256 bytes.
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index a899d8b..70c1205 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
}
EXPORT_SYMBOL(pci_fixup_cardbus);

+/*
+ * Return true if dev is PCIe root port or downstream port whose child is PCIe
+ * endpoint except VGA device.
+ */
+static int __init need_reset(struct pci_dev *dev)
+{
+ struct pci_bus *subordinate;
+ struct pci_dev *child;
+
+ if (!pci_is_pcie(dev) || !dev->subordinate ||
+ list_empty(&dev->subordinate->devices) ||
+ ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
+ (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
+ return 0;
+
+ subordinate = dev->subordinate;
+ list_for_each_entry(child, &subordinate->devices, bus_list) {
+ if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
+ (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
+ ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
+ /* Don't reset switch, bridge, VGA device */
+ return 0;
+ }
+
+ return 1;
+}
+
+static void __init save_downstream_configs(struct pci_dev *dev)
+{
+ struct pci_bus *subordinate;
+ struct pci_dev *child;
+
+ subordinate = dev->subordinate;
+ list_for_each_entry(child, &subordinate->devices, bus_list) {
+ dev_info(&child->dev, "save state\n");
+ pci_save_state(child);
+ }
+}
+
+static void __init restore_downstream_configs(struct pci_dev *dev)
+{
+ struct pci_bus *subordinate;
+ struct pci_dev *child;
+
+ subordinate = dev->subordinate;
+ list_for_each_entry(child, &subordinate->devices, bus_list) {
+ dev_info(&child->dev, "restore state\n");
+ pci_restore_state(child);
+ }
+}
+
+static void __init do_downstream_device_reset(struct pci_dev *dev)
+{
+ u16 ctrl;
+
+ dev_info(&dev->dev, "Reset Secondary bus\n");
+
+ /* Assert Secondary Bus Reset */
+ pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
+ ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
+ pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
+
+ /* Read config again to flush previous write */
+ pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
+
+ msleep(2);
+
+ /* De-assert Secondary Bus Reset */
+ ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
+ pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
+}
+
+struct pci_dev_entry {
+ struct list_head list;
+ struct pci_dev *dev;
+};
+static int __initdata pcie_reset_endpoint_devices;
+static int __init reset_pcie_endpoints(void)
+{
+ struct pci_dev *dev = NULL;
+ struct pci_dev_entry *pdev_entry, *tmp;
+ LIST_HEAD(pdev_list);
+
+ if (!pcie_reset_endpoint_devices)
+ return 0;
+
+ for_each_pci_dev(dev)
+ if (need_reset(dev)) {
+ pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
+ pdev_entry->dev = dev;
+ list_add(&pdev_entry->list, &pdev_list);
+ }
+
+ list_for_each_entry(pdev_entry, &pdev_list, list)
+ save_downstream_configs(pdev_entry->dev);
+
+ list_for_each_entry(pdev_entry, &pdev_list, list)
+ do_downstream_device_reset(pdev_entry->dev);
+
+ msleep(1000);
+
+ list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
+ restore_downstream_configs(pdev_entry->dev);
+ kfree(pdev_entry);
+ }
+
+ return 0;
+}
+fs_initcall_sync(reset_pcie_endpoints);
+
static int __init pci_setup(char *str)
{
while (str) {
@@ -3915,6 +4025,9 @@ static int __init pci_setup(char *str)
pcie_bus_config = PCIE_BUS_PEER2PEER;
} else if (!strncmp(str, "pcie_scan_all", 13)) {
pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
+ } else if (!strncmp(str, "pcie_reset_endpoint_devices",
+ 27)) {
+ pcie_reset_endpoint_devices = 1;
} else {
printk(KERN_ERR "PCI: Unknown option `%s'\n",
str);
--
1.7.1


2013-05-14 22:04:43

by Eric W. Biederman

Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

Takao Indoh <[email protected]> writes:

> This patch resets PCIe devices on boot to stop ongoing DMA. When
> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
> on each PCIe root port and downstream port to reset its downstream
> endpoint.
>
> Problem:
> This patch solves a problem where kdump can fail when intel_iommu=on is
> specified. In that case, many DMA-remapping errors occur in the second
> kernel, causing problems such as driver errors or PCI SERR, and kdump
> ultimately fails. The problem arises as follows.
> 1) Devices are working in the first kernel.
> 2) The system switches to the second (kdump) kernel. The devices are
> still active and their DMA continues during this switch.
> 3) The IOMMU is initialized during second-kernel boot, and the ongoing
> DMA causes DMA-remapping errors.
>
> Solution:
> All DMA transactions have to be stopped before the IOMMU is
> initialized. With this patch, devices are reset and in-flight DMA is
> stopped before pci_iommu_init.
>
> To invoke a hot reset on an endpoint, its upstream link needs to be
> reset. reset_pcie_endpoints() is called from fs_initcall_sync; it finds
> each root port or downstream port whose child is a PCIe endpoint and
> then resets the link between them. If the endpoint is a VGA device, it
> is skipped, because the monitor blacks out if the VGA controller is
> reset.

At a quick skim this patch looks reasonable.

Acked-by: "Eric W. Biederman" <[email protected]>

> Changelog:
> v2:
> - Read pci config before de-assert secondary bus reset to flush previous
> write
> - Change function/variable name
> - Make a list of devices to be reset
>
> v1:
> https://patchwork.kernel.org/patch/2482291/
>
> Signed-off-by: Takao Indoh <[email protected]>
> ---
> Documentation/kernel-parameters.txt | 2 +
> drivers/pci/pci.c | 113 +++++++++++++++++++++++++++++++++++
> 2 files changed, 115 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index c3bfacb..8c9e8e4 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> any pair of devices, possibly at the cost of
> reduced performance. This also guarantees
> that hot-added devices will work.
> + pcie_reset_endpoint_devices Reset PCIe endpoint on boot by
> + hot reset
> cbiosize=nn[KMG] The fixed amount of bus space which is
> reserved for the CardBus bridge's IO window.
> The default value is 256 bytes.
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index a899d8b..70c1205 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
> }
> EXPORT_SYMBOL(pci_fixup_cardbus);
>
> +/*
> + * Return true if dev is PCIe root port or downstream port whose child is PCIe
> + * endpoint except VGA device.
> + */
> +static int __init need_reset(struct pci_dev *dev)
> +{
> + struct pci_bus *subordinate;
> + struct pci_dev *child;
> +
> + if (!pci_is_pcie(dev) || !dev->subordinate ||
> + list_empty(&dev->subordinate->devices) ||
> + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
> + return 0;
> +
> + subordinate = dev->subordinate;
> + list_for_each_entry(child, &subordinate->devices, bus_list) {
> + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
> + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
> + ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
> + /* Don't reset switch, bridge, VGA device */
> + return 0;
> + }
> +
> + return 1;
> +}
> +
> +static void __init save_downstream_configs(struct pci_dev *dev)
> +{
> + struct pci_bus *subordinate;
> + struct pci_dev *child;
> +
> + subordinate = dev->subordinate;
> + list_for_each_entry(child, &subordinate->devices, bus_list) {
> + dev_info(&child->dev, "save state\n");
> + pci_save_state(child);
> + }
> +}
> +
> +static void __init restore_downstream_configs(struct pci_dev *dev)
> +{
> + struct pci_bus *subordinate;
> + struct pci_dev *child;
> +
> + subordinate = dev->subordinate;
> + list_for_each_entry(child, &subordinate->devices, bus_list) {
> + dev_info(&child->dev, "restore state\n");
> + pci_restore_state(child);
> + }
> +}
> +
> +static void __init do_downstream_device_reset(struct pci_dev *dev)
> +{
> + u16 ctrl;
> +
> + dev_info(&dev->dev, "Reset Secondary bus\n");
> +
> + /* Assert Secondary Bus Reset */
> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +
> + /* Read config again to flush previous write */
> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
> +
> + msleep(2);
> +
> + /* De-assert Secondary Bus Reset */
> + ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +}
> +
> +struct pci_dev_entry {
> + struct list_head list;
> + struct pci_dev *dev;
> +};
> +static int __initdata pcie_reset_endpoint_devices;
> +static int __init reset_pcie_endpoints(void)
> +{
> + struct pci_dev *dev = NULL;
> + struct pci_dev_entry *pdev_entry, *tmp;
> + LIST_HEAD(pdev_list);
> +
> + if (!pcie_reset_endpoint_devices)
> + return 0;
> +
> + for_each_pci_dev(dev)
> + if (need_reset(dev)) {
> + pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
> + pdev_entry->dev = dev;
> + list_add(&pdev_entry->list, &pdev_list);
> + }
> +
> + list_for_each_entry(pdev_entry, &pdev_list, list)
> + save_downstream_configs(pdev_entry->dev);
> +
> + list_for_each_entry(pdev_entry, &pdev_list, list)
> + do_downstream_device_reset(pdev_entry->dev);
> +
> + msleep(1000);
> +
> + list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
> + restore_downstream_configs(pdev_entry->dev);
> + kfree(pdev_entry);
> + }
> +
> + return 0;
> +}
> +fs_initcall_sync(reset_pcie_endpoints);
> +
> static int __init pci_setup(char *str)
> {
> while (str) {
> @@ -3915,6 +4025,9 @@ static int __init pci_setup(char *str)
> pcie_bus_config = PCIE_BUS_PEER2PEER;
> } else if (!strncmp(str, "pcie_scan_all", 13)) {
> pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
> + } else if (!strncmp(str, "pcie_reset_endpoint_devices",
> + 27)) {
> + pcie_reset_endpoint_devices = 1;
> } else {
> printk(KERN_ERR "PCI: Unknown option `%s'\n",
> str);

2013-05-21 23:47:31

by Takao Indoh

Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

Hi Bjorn,

Any comments, ack/nack?

Thanks,
Takao Indoh

(2013/05/15 7:04), Eric W. Biederman wrote:
> Takao Indoh <[email protected]> writes:
>
>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
>> on each PCIe root port and downstream port to reset its downstream
>> endpoint.
>>
>> Problem:
>> This patch solves a problem where kdump can fail when intel_iommu=on is
>> specified. In that case, many DMA-remapping errors occur in the second
>> kernel, causing problems such as driver errors or PCI SERR, and kdump
>> ultimately fails. The problem arises as follows.
>> 1) Devices are working in the first kernel.
>> 2) The system switches to the second (kdump) kernel. The devices are
>> still active and their DMA continues during this switch.
>> 3) The IOMMU is initialized during second-kernel boot, and the ongoing
>> DMA causes DMA-remapping errors.
>>
>> Solution:
>> All DMA transactions have to be stopped before the IOMMU is
>> initialized. With this patch, devices are reset and in-flight DMA is
>> stopped before pci_iommu_init.
>>
>> To invoke a hot reset on an endpoint, its upstream link needs to be
>> reset. reset_pcie_endpoints() is called from fs_initcall_sync; it finds
>> each root port or downstream port whose child is a PCIe endpoint and
>> then resets the link between them. If the endpoint is a VGA device, it
>> is skipped, because the monitor blacks out if the VGA controller is
>> reset.
>
> At a quick skim this patch looks reasonable.
>
> Acked-by: "Eric W. Biederman" <[email protected]>
>
>> Changelog:
>> v2:
>> - Read pci config before de-assert secondary bus reset to flush previous
>> write
>> - Change function/variable name
>> - Make a list of devices to be reset
>>
>> v1:
>> https://patchwork.kernel.org/patch/2482291/
>>
>> Signed-off-by: Takao Indoh <[email protected]>
>> ---
>> Documentation/kernel-parameters.txt | 2 +
>> drivers/pci/pci.c | 113 +++++++++++++++++++++++++++++++++++
>> 2 files changed, 115 insertions(+), 0 deletions(-)
>>
>> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
>> index c3bfacb..8c9e8e4 100644
>> --- a/Documentation/kernel-parameters.txt
>> +++ b/Documentation/kernel-parameters.txt
>> @@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>> any pair of devices, possibly at the cost of
>> reduced performance. This also guarantees
>> that hot-added devices will work.
>> + pcie_reset_endpoint_devices Reset PCIe endpoint on boot by
>> + hot reset
>> cbiosize=nn[KMG] The fixed amount of bus space which is
>> reserved for the CardBus bridge's IO window.
>> The default value is 256 bytes.
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index a899d8b..70c1205 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>> }
>> EXPORT_SYMBOL(pci_fixup_cardbus);
>>
>> +/*
>> + * Return true if dev is PCIe root port or downstream port whose child is PCIe
>> + * endpoint except VGA device.
>> + */
>> +static int __init need_reset(struct pci_dev *dev)
>> +{
>> + struct pci_bus *subordinate;
>> + struct pci_dev *child;
>> +
>> + if (!pci_is_pcie(dev) || !dev->subordinate ||
>> + list_empty(&dev->subordinate->devices) ||
>> + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
>> + return 0;
>> +
>> + subordinate = dev->subordinate;
>> + list_for_each_entry(child, &subordinate->devices, bus_list) {
>> + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
>> + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
>> + ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
>> + /* Don't reset switch, bridge, VGA device */
>> + return 0;
>> + }
>> +
>> + return 1;
>> +}
>> +
>> +static void __init save_downstream_configs(struct pci_dev *dev)
>> +{
>> + struct pci_bus *subordinate;
>> + struct pci_dev *child;
>> +
>> + subordinate = dev->subordinate;
>> + list_for_each_entry(child, &subordinate->devices, bus_list) {
>> + dev_info(&child->dev, "save state\n");
>> + pci_save_state(child);
>> + }
>> +}
>> +
>> +static void __init restore_downstream_configs(struct pci_dev *dev)
>> +{
>> + struct pci_bus *subordinate;
>> + struct pci_dev *child;
>> +
>> + subordinate = dev->subordinate;
>> + list_for_each_entry(child, &subordinate->devices, bus_list) {
>> + dev_info(&child->dev, "restore state\n");
>> + pci_restore_state(child);
>> + }
>> +}
>> +
>> +static void __init do_downstream_device_reset(struct pci_dev *dev)
>> +{
>> + u16 ctrl;
>> +
>> + dev_info(&dev->dev, "Reset Secondary bus\n");
>> +
>> + /* Assert Secondary Bus Reset */
>> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
>> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>> +
>> + /* Read config again to flush previous write */
>> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
>> +
>> + msleep(2);
>> +
>> + /* De-assert Secondary Bus Reset */
>> + ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>> +}
>> +
>> +struct pci_dev_entry {
>> + struct list_head list;
>> + struct pci_dev *dev;
>> +};
>> +static int __initdata pcie_reset_endpoint_devices;
>> +static int __init reset_pcie_endpoints(void)
>> +{
>> + struct pci_dev *dev = NULL;
>> + struct pci_dev_entry *pdev_entry, *tmp;
>> + LIST_HEAD(pdev_list);
>> +
>> + if (!pcie_reset_endpoint_devices)
>> + return 0;
>> +
>> + for_each_pci_dev(dev)
>> + if (need_reset(dev)) {
>> + pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
>> + pdev_entry->dev = dev;
>> + list_add(&pdev_entry->list, &pdev_list);
>> + }
>> +
>> + list_for_each_entry(pdev_entry, &pdev_list, list)
>> + save_downstream_configs(pdev_entry->dev);
>> +
>> + list_for_each_entry(pdev_entry, &pdev_list, list)
>> + do_downstream_device_reset(pdev_entry->dev);
>> +
>> + msleep(1000);
>> +
>> + list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
>> + restore_downstream_configs(pdev_entry->dev);
>> + kfree(pdev_entry);
>> + }
>> +
>> + return 0;
>> +}
>> +fs_initcall_sync(reset_pcie_endpoints);
>> +
>> static int __init pci_setup(char *str)
>> {
>> while (str) {
>> @@ -3915,6 +4025,9 @@ static int __init pci_setup(char *str)
>> pcie_bus_config = PCIE_BUS_PEER2PEER;
>> } else if (!strncmp(str, "pcie_scan_all", 13)) {
>> pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
>> + } else if (!strncmp(str, "pcie_reset_endpoint_devices",
>> + 27)) {
>> + pcie_reset_endpoint_devices = 1;
>> } else {
>> printk(KERN_ERR "PCI: Unknown option `%s'\n",
>> str);
>
>

2013-06-06 07:26:54

by Takao Indoh

Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

Ping

Bjorn, could you please share your impression of this idea?

This fix is very important for kdump. For example, kdump does not work
if PCI passthrough is used on a KVM guest. Please see the patch
description for details.

If you don't agree with this patch, please tell me what changes I
should make.

Thanks,
Takao Indoh

(2013/05/14 14:29), Takao Indoh wrote:
> This patch resets PCIe devices on boot to stop ongoing DMA. When
> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
> on each PCIe root port and downstream port to reset its downstream
> endpoint.
>
> Problem:
> This patch solves a problem where kdump can fail when intel_iommu=on is
> specified. In that case, many DMA-remapping errors occur in the second
> kernel, causing problems such as driver errors or PCI SERR, and kdump
> ultimately fails. The problem arises as follows.
> 1) Devices are working in the first kernel.
> 2) The system switches to the second (kdump) kernel. The devices are
> still active and their DMA continues during this switch.
> 3) The IOMMU is initialized during second-kernel boot, and the ongoing
> DMA causes DMA-remapping errors.
>
> Solution:
> All DMA transactions have to be stopped before the IOMMU is
> initialized. With this patch, devices are reset and in-flight DMA is
> stopped before pci_iommu_init.
>
> To invoke a hot reset on an endpoint, its upstream link needs to be
> reset. reset_pcie_endpoints() is called from fs_initcall_sync; it finds
> each root port or downstream port whose child is a PCIe endpoint and
> then resets the link between them. If the endpoint is a VGA device, it
> is skipped, because the monitor blacks out if the VGA controller is
> reset.
>
> Changelog:
> v2:
> - Read pci config before de-assert secondary bus reset to flush previous
> write
> - Change function/variable name
> - Make a list of devices to be reset
>
> v1:
> https://patchwork.kernel.org/patch/2482291/
>
> Signed-off-by: Takao Indoh <[email protected]>
> ---
> Documentation/kernel-parameters.txt | 2 +
> drivers/pci/pci.c | 113 +++++++++++++++++++++++++++++++++++
> 2 files changed, 115 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index c3bfacb..8c9e8e4 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> any pair of devices, possibly at the cost of
> reduced performance. This also guarantees
> that hot-added devices will work.
> + pcie_reset_endpoint_devices Reset PCIe endpoint on boot by
> + hot reset
> cbiosize=nn[KMG] The fixed amount of bus space which is
> reserved for the CardBus bridge's IO window.
> The default value is 256 bytes.
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index a899d8b..70c1205 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
> }
> EXPORT_SYMBOL(pci_fixup_cardbus);
>
> +/*
> + * Return true if dev is PCIe root port or downstream port whose child is PCIe
> + * endpoint except VGA device.
> + */
> +static int __init need_reset(struct pci_dev *dev)
> +{
> + struct pci_bus *subordinate;
> + struct pci_dev *child;
> +
> + if (!pci_is_pcie(dev) || !dev->subordinate ||
> + list_empty(&dev->subordinate->devices) ||
> + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
> + return 0;
> +
> + subordinate = dev->subordinate;
> + list_for_each_entry(child, &subordinate->devices, bus_list) {
> + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
> + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
> + ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
> + /* Don't reset switch, bridge, VGA device */
> + return 0;
> + }
> +
> + return 1;
> +}
> +
> +static void __init save_downstream_configs(struct pci_dev *dev)
> +{
> + struct pci_bus *subordinate;
> + struct pci_dev *child;
> +
> + subordinate = dev->subordinate;
> + list_for_each_entry(child, &subordinate->devices, bus_list) {
> + dev_info(&child->dev, "save state\n");
> + pci_save_state(child);
> + }
> +}
> +
> +static void __init restore_downstream_configs(struct pci_dev *dev)
> +{
> + struct pci_bus *subordinate;
> + struct pci_dev *child;
> +
> + subordinate = dev->subordinate;
> + list_for_each_entry(child, &subordinate->devices, bus_list) {
> + dev_info(&child->dev, "restore state\n");
> + pci_restore_state(child);
> + }
> +}
> +
> +static void __init do_downstream_device_reset(struct pci_dev *dev)
> +{
> + u16 ctrl;
> +
> + dev_info(&dev->dev, "Reset Secondary bus\n");
> +
> + /* Assert Secondary Bus Reset */
> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +
> + /* Read config again to flush previous write */
> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
> +
> + msleep(2);
> +
> + /* De-assert Secondary Bus Reset */
> + ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +}
> +
> +struct pci_dev_entry {
> + struct list_head list;
> + struct pci_dev *dev;
> +};
> +static int __initdata pcie_reset_endpoint_devices;
> +static int __init reset_pcie_endpoints(void)
> +{
> + struct pci_dev *dev = NULL;
> + struct pci_dev_entry *pdev_entry, *tmp;
> + LIST_HEAD(pdev_list);
> +
> + if (!pcie_reset_endpoint_devices)
> + return 0;
> +
> + for_each_pci_dev(dev)
> + if (need_reset(dev)) {
> + pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
> + pdev_entry->dev = dev;
> + list_add(&pdev_entry->list, &pdev_list);
> + }
> +
> + list_for_each_entry(pdev_entry, &pdev_list, list)
> + save_downstream_configs(pdev_entry->dev);
> +
> + list_for_each_entry(pdev_entry, &pdev_list, list)
> + do_downstream_device_reset(pdev_entry->dev);
> +
> + msleep(1000);
> +
> + list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
> + restore_downstream_configs(pdev_entry->dev);
> + kfree(pdev_entry);
> + }
> +
> + return 0;
> +}
> +fs_initcall_sync(reset_pcie_endpoints);
> +
> static int __init pci_setup(char *str)
> {
> while (str) {
> @@ -3915,6 +4025,9 @@ static int __init pci_setup(char *str)
> pcie_bus_config = PCIE_BUS_PEER2PEER;
> } else if (!strncmp(str, "pcie_scan_all", 13)) {
> pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
> + } else if (!strncmp(str, "pcie_reset_endpoint_devices",
> + 27)) {
> + pcie_reset_endpoint_devices = 1;
> } else {
> printk(KERN_ERR "PCI: Unknown option `%s'\n",
> str);
>

2013-06-07 04:14:54

by Bjorn Helgaas

Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

Sorry it's taken me so long to look at this. I've been putting this
off because the patch doesn't seem "obviously correct," and I don't
feel like I really understand the problem. I'm naive about both
IOMMUs and kexec/kdump, so please pardon (and help me with) any silly
questions or assumptions below.

On Mon, May 13, 2013 at 11:29 PM, Takao Indoh
<[email protected]> wrote:
> This patch resets PCIe devices on boot to stop ongoing DMA. When
> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
> on each PCIe root port and downstream port to reset its downstream
> endpoint.
>
> Problem:
> This patch solves a problem where kdump can fail when intel_iommu=on is
> specified. In that case, many DMA-remapping errors occur in the second
> kernel, causing problems such as driver errors or PCI SERR, and kdump
> ultimately fails. The problem arises as follows.
> 1) Devices are working in the first kernel.
> 2) The system switches to the second (kdump) kernel. The devices are
> still active and their DMA continues during this switch.
> 3) The IOMMU is initialized during second-kernel boot, and the ongoing
> DMA causes DMA-remapping errors.

If I understand correctly, the problem only happens on systems with an
IOMMU that's enabled in either the system or kdump kernel (or both).
For systems without an IOMMU (or if it is disabled in both the system
and kdump kernels), any ongoing DMA should use addresses that target
system-kernel memory and should not affect the kdump kernel.

One thing I'm not sure about is that you are only resetting PCIe
devices, but I don't think the problem is actually specific to PCIe,
is it? I think the same issue could occur on any system with an
IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but
there are systems with IOMMUs for plain old PCI devices, e.g.,
PA-RISC.

I tried to make a list of the interesting scenarios and the events
that are relevant to this problem:

Case 1: IOMMU off in system, off in kdump kernel
  system kernel leaves IOMMU off
  DMA targets system-kernel memory
  kexec to kdump kernel (IOMMU off, devices untouched)
  DMA targets system-kernel memory (harmless)
  kdump kernel re-inits device
  DMA targets kdump-kernel memory

Case 2: IOMMU off in system kernel, on in kdump kernel
  system kernel leaves IOMMU off
  DMA targets system-kernel memory
  kexec to kdump kernel (IOMMU off, devices untouched)
  DMA targets system-kernel memory (harmless)
  kdump kernel enables IOMMU with no valid mappings
  DMA causes IOMMU errors (annoying but harmless)
  kdump kernel re-inits device
  DMA targets IOMMU-mapped kdump-kernel memory

Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
  system kernel enables IOMMU
  DMA targets IOMMU-mapped system-kernel memory
  kexec to kdump kernel (IOMMU on, devices untouched)
  DMA targets IOMMU-mapped system-kernel memory
  kdump kernel doesn't know about IOMMU or doesn't touch it
  DMA targets IOMMU-mapped system-kernel memory
  kdump kernel re-inits device
    kernel assumes no IOMMU, so all new DMA mappings are invalid
    because DMAs actually do go through the IOMMU
  (** corruption or other non-recoverable error likely **)

Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
  system kernel enables IOMMU
  DMA targets IOMMU-mapped system-kernel memory
  kexec to kdump kernel (IOMMU on, devices untouched)
  DMA targets IOMMU-mapped system-kernel memory
  kdump kernel disables IOMMU
  DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
  (** corruption or other non-recoverable error likely **)
  kdump kernel re-inits device
  DMA targets kdump-kernel memory

Case 4: IOMMU on in system kernel, on in kdump kernel
  system kernel enables IOMMU
  DMA targets IOMMU-mapped system-kernel memory
  kexec to kdump kernel (IOMMU on, devices untouched)
  DMA targets IOMMU-mapped system-kernel memory
  kdump kernel enables IOMMU with no valid mappings
  DMA causes IOMMU errors (annoying but harmless)
  kdump kernel re-inits device
  DMA targets IOMMU-mapped kdump-kernel memory
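The scenario list above can be condensed into a toy predicate. This is
purely illustrative: it only encodes the reasoning spelled out in these
cases, collapsing 3a and 3b (which differ only in how the IOMMU state
goes stale) into one branch; the function name and outcome strings are
invented and do not correspond to any kernel interface.

```c
/*
 * Toy model of the kexec/kdump DMA scenarios above.
 * iommu_on_in_kdump == 0 covers both case 3a (kdump kernel ignores
 * the IOMMU) and case 3b (kdump kernel disables it).
 */
static const char *kdump_dma_outcome(int iommu_on_in_system,
                                     int iommu_on_in_kdump)
{
	if (!iommu_on_in_system)
		/* Cases 1 and 2: stray DMA targets system-kernel memory,
		 * which the kdump kernel does not use. */
		return iommu_on_in_kdump ? "harmless, IOMMU errors" : "harmless";
	if (!iommu_on_in_kdump)
		/* Cases 3a and 3b: stale mappings stay live, or are torn
		 * down while DMA is still in flight. */
		return "corruption likely";
	/* Case 4: new, empty mappings fault the stale DMA. */
	return "harmless, IOMMU errors";
}
```

Under this toy model, only the row with the IOMMU on in the system
kernel but off or ignored in the kdump kernel is unrecoverable by the
kdump kernel alone.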

The problem cases I see are 3a and 3b, but that's not the problem
you're describing. Obviously I'm missing something.

It sounds like you're seeing problems in case 2 or case 4, where the
IOMMU is enabled in the kdump kernel. Maybe my assumption about the
IOMMU being enabled with no valid mappings is wrong? Or maybe those
IOMMU errors are not actually harmless?

Resetting PCI/PCIe devices will help with cases 2, 4, and 3b, but not
with case 3a. Therefore, it seems like the kdump kernel *must*
contain IOMMU support unless it knows for certain that the system
kernel wasn't using the IOMMU.

Do you have any bugzilla references or problem report URLs you could
include here?

Obviously I'm very confused here, so please help educate me :)

Also, you mentioned PCI passthrough with a KVM guest as being a
problem. Can you explain more about what makes that a problem? I
don't know enough about passthrough to know why a device being used by
a KVM guest would cause more problems than a device being used
directly by the host.

Bjorn

> Solution:
> All DMA transactions have to be stopped before the IOMMU is
> initialized. With this patch, devices are reset and in-flight DMA is
> stopped before pci_iommu_init.
>
> To invoke a hot reset on an endpoint, its upstream link needs to be
> reset. reset_pcie_endpoints() is called from fs_initcall_sync; it finds
> each root port or downstream port whose child is a PCIe endpoint and
> then resets the link between them. If the endpoint is a VGA device, it
> is skipped, because the monitor blacks out if the VGA controller is
> reset.
>
> Signed-off-by: Takao Indoh <[email protected]>
> ---
> Documentation/kernel-parameters.txt | 2 +
> drivers/pci/pci.c | 113 +++++++++++++++++++++++++++++++++++
> 2 files changed, 115 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index c3bfacb..8c9e8e4 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> any pair of devices, possibly at the cost of
> reduced performance. This also guarantees
> that hot-added devices will work.
> + pcie_reset_endpoint_devices Reset PCIe endpoint on boot by
> + hot reset
> cbiosize=nn[KMG] The fixed amount of bus space which is
> reserved for the CardBus bridge's IO window.
> The default value is 256 bytes.
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index a899d8b..70c1205 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
> }
> EXPORT_SYMBOL(pci_fixup_cardbus);
>
> +/*
> + * Return true if dev is PCIe root port or downstream port whose child is PCIe
> + * endpoint except VGA device.
> + */
> +static int __init need_reset(struct pci_dev *dev)
> +{
> + struct pci_bus *subordinate;
> + struct pci_dev *child;
> +
> + if (!pci_is_pcie(dev) || !dev->subordinate ||
> + list_empty(&dev->subordinate->devices) ||
> + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
> + return 0;
> +
> + subordinate = dev->subordinate;
> + list_for_each_entry(child, &subordinate->devices, bus_list) {
> + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
> + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
> + ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
> + /* Don't reset switch, bridge, VGA device */
> + return 0;
> + }
> +
> + return 1;
> +}
> +
> +static void __init save_downstream_configs(struct pci_dev *dev)
> +{
> + struct pci_bus *subordinate;
> + struct pci_dev *child;
> +
> + subordinate = dev->subordinate;
> + list_for_each_entry(child, &subordinate->devices, bus_list) {
> + dev_info(&child->dev, "save state\n");
> + pci_save_state(child);
> + }
> +}
> +
> +static void __init restore_downstream_configs(struct pci_dev *dev)
> +{
> + struct pci_bus *subordinate;
> + struct pci_dev *child;
> +
> + subordinate = dev->subordinate;
> + list_for_each_entry(child, &subordinate->devices, bus_list) {
> + dev_info(&child->dev, "restore state\n");
> + pci_restore_state(child);
> + }
> +}
> +
> +static void __init do_downstream_device_reset(struct pci_dev *dev)
> +{
> + u16 ctrl;
> +
> + dev_info(&dev->dev, "Reset Secondary bus\n");
> +
> + /* Assert Secondary Bus Reset */
> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +
> + /* Read config again to flush previous write */
> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
> +
> + msleep(2);
> +
> + /* De-assert Secondary Bus Reset */
> + ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +}
> +
> +struct pci_dev_entry {
> + struct list_head list;
> + struct pci_dev *dev;
> +};
> +static int __initdata pcie_reset_endpoint_devices;
> +static int __init reset_pcie_endpoints(void)
> +{
> + struct pci_dev *dev = NULL;
> + struct pci_dev_entry *pdev_entry, *tmp;
> + LIST_HEAD(pdev_list);
> +
> + if (!pcie_reset_endpoint_devices)
> + return 0;
> +
> + for_each_pci_dev(dev)
> + if (need_reset(dev)) {
> + pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
> + pdev_entry->dev = dev;
> + list_add(&pdev_entry->list, &pdev_list);
> + }
> +
> + list_for_each_entry(pdev_entry, &pdev_list, list)
> + save_downstream_configs(pdev_entry->dev);
> +
> + list_for_each_entry(pdev_entry, &pdev_list, list)
> + do_downstream_device_reset(pdev_entry->dev);
> +
> + msleep(1000);
> +
> + list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
> + restore_downstream_configs(pdev_entry->dev);
> + kfree(pdev_entry);
> + }
> +
> + return 0;
> +}
> +fs_initcall_sync(reset_pcie_endpoints);
> +
> static int __init pci_setup(char *str)
> {
> while (str) {
> @@ -3915,6 +4025,9 @@ static int __init pci_setup(char *str)
> pcie_bus_config = PCIE_BUS_PEER2PEER;
> } else if (!strncmp(str, "pcie_scan_all", 13)) {
> pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
> + } else if (!strncmp(str, "pcie_reset_endpoint_devices",
> + 27)) {
> + pcie_reset_endpoint_devices = 1;
> } else {
> printk(KERN_ERR "PCI: Unknown option `%s'\n",
> str);
> --
> 1.7.1
>
>

2013-06-07 08:47:27

by Takao Indoh

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

Thank you for your comments!

(2013/06/07 13:14), Bjorn Helgaas wrote:
> Sorry it's taken me so long to look at this. I've been putting this
> off because the patch doesn't seem "obviously correct," and I don't
> feel like I really understand the problem. I'm naive about both
> IOMMUs and kexec/kdump, so please pardon (and help me with) any silly
> questions or assumptions below.
>
> On Mon, May 13, 2013 at 11:29 PM, Takao Indoh
> <[email protected]> wrote:
>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
>> on each PCIe root port and downstream port to reset its downstream
>> endpoint.
>>
>> Problem:
>> This patch solves the problem that kdump can fail when intel_iommu=on is
>> specified. When intel_iommu=on is specified, many dma-remapping errors
>> occur in second kernel and it causes problems like driver error or PCI
>> SERR, at last kdump fails. This problem is caused as follows.
>> 1) Devices are working on first kernel.
>> 2) Switch to second kernel(kdump kernel). The devices are still working
>> and its DMA continues during this switch.
>> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>> dma-remapping errors.
>
> If I understand correctly, the problem only happens on systems with an
> IOMMU that's enabled in either the system or kdump kernel (or both).
> For systems without an IOMMU (or if it is disabled in both the system
> and kdump kernels), any ongoing DMA should use addresses that target
> system-kernel memory and should not affect the kdump kernel.

Yes, that's correct.

> One thing I'm not sure about is that you are only resetting PCIe
> devices, but I don't think the problem is actually specific to PCIe,
> is it? I think the same issue could occur on any system with an
> IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but
> there are systems with IOMMUs for plain old PCI devices, e.g.,
> PA-RISC.

Right, this is not specific to PCIe. The reason the target is only
PCIe is just to keep the reset algorithm simple. It is possible to reset
legacy PCI devices in my patch, but the code becomes somewhat complicated. I
thought most recent systems used PCIe and there was little demand for
resetting legacy PCI. Therefore I decided not to reset legacy PCI
devices, but I'll do so if there are requests :-)

>
> I tried to make a list of the interesting scenarios and the events
> that are relevant to this problem:
>
> Case 1: IOMMU off in system, off in kdump kernel
> system kernel leaves IOMMU off
> DMA targets system-kernel memory
> kexec to kdump kernel (IOMMU off, devices untouched)
> DMA targets system-kernel memory (harmless)
> kdump kernel re-inits device
> DMA targets kdump-kernel memory
>
> Case 2: IOMMU off in system kernel, on in kdump kernel
> system kernel leaves IOMMU off
> DMA targets system-kernel memory
> kexec to kdump kernel (IOMMU off, devices untouched)
> DMA targets system-kernel memory (harmless)
> kdump kernel enables IOMMU with no valid mappings
> DMA causes IOMMU errors (annoying but harmless)
> kdump kernel re-inits device
> DMA targets IOMMU-mapped kdump-kernel memory
>
> Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
> system kernel enables IOMMU
> DMA targets IOMMU-mapped system-kernel memory
> kexec to kdump kernel (IOMMU on, devices untouched)
> DMA targets IOMMU-mapped system-kernel memory
> kdump kernel doesn't know about IOMMU or doesn't touch it
> DMA targets IOMMU-mapped system-kernel memory
> kdump kernel re-inits device
> kernel assumes no IOMMU, so all new DMA mappings are invalid
> because DMAs actually do go through the IOMMU
> (** corruption or other non-recoverable error likely **)
>
> Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
> system kernel enables IOMMU
> DMA targets IOMMU-mapped system-kernel memory
> kexec to kdump kernel (IOMMU on, devices untouched)
> DMA targets IOMMU-mapped system-kernel memory
> kdump kernel disables IOMMU
> DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
> (** corruption or other non-recoverable error likely **)
> kdump kernel re-inits device
> DMA targets kdump-kernel memory
>
> Case 4: IOMMU on in system kernel, on in kdump kernel
> system kernel enables IOMMU
> DMA targets IOMMU-mapped system-kernel memory
> kexec to kdump kernel (IOMMU on, devices untouched)
> DMA targets IOMMU-mapped system-kernel memory
> kdump kernel enables IOMMU with no valid mappings
> DMA causes IOMMU errors (annoying but harmless)

This is not harmless. Errors like PCI SERR are detected here, and they
make the driver or system unstable, and kdump fails. I also got a report
that a system hangs up due to this.


> kdump kernel re-inits device
> DMA targets IOMMU-mapped kdump-kernel memory
>
> The problem cases I see are 3a and 3b, but that's not the problem
> you're describing. Obviously I'm missing something.
>
> It sounds like you're seeing problems in case 2 or case 4, where the
> IOMMU is enabled in the kdump kernel. Maybe my assumption about the
> IOMMU being enabled with no valid mappings is wrong? Or maybe those
> IOMMU errors are not actually harmless?

As I wrote above, IOMMU errors are not harmless. What I know is:

Case 1: Harmless
Case 2: Not tested
Case 3a: Not tested
Case 3b: Causes a problem, fixed by my patch
Case 4: Causes a problem, fixed by my patch

I have never tested cases 2 and 3a, but I think they also cause problems.


> Resetting PCI/PCIe devices will help with cases 2, 4, and 3b, but not
> with case 3a. Therefore, it seems like the kdump kernel *must*
> contain IOMMU support unless it knows for certain that the system
> kernel wasn't using the IOMMU.

Yes, the kdump kernel has to be compiled with IOMMU support.


>
> Do you have any bugzilla references or problem report URLs you could
> include here?

I know of three Red Hat bugzillas, but I think they are private articles
that you cannot see. I'll add you to the Cc list in bz so that you can see them.

BZ#743790: Kdump fails with intel_iommu=on
https://bugzilla.redhat.com/show_bug.cgi?id=743790

BZ#743495: Hardware error is detected with intel_iommu=on
https://bugzilla.redhat.com/show_bug.cgi?id=743495

BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt
https://bugzilla.redhat.com/show_bug.cgi?id=833299

>
> Obviously I'm very confused here, so please help educate me :)
>
> Also, you mentioned PCI passthrough with a KVM guest as being a
> problem. Can you explain more about what makes that a problem? I
> don't know enough about passthrough to know why a device being used by
> a KVM guest would cause more problems than a device being used
> directly by the host.

Sorry, my explanation was misleading. I mentioned PCI passthrough as a
usage example of the IOMMU. To assign devices directly to a KVM guest using
PCI passthrough, we have to enable the IOMMU for the host kernel. But if you
enable the IOMMU and a panic happens in the host, kdump starts in the host
but fails due to the problem I mentioned in the patch description.
Did I answer your question?

Thanks,
Takao Indoh



2013-06-11 02:21:14

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh <[email protected]> wrote:
> (2013/06/07 13:14), Bjorn Helgaas wrote:

>> One thing I'm not sure about is that you are only resetting PCIe
>> devices, but I don't think the problem is actually specific to PCIe,
>> is it? I think the same issue could occur on any system with an
>> IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but
>> there are systems with IOMMUs for plain old PCI devices, e.g.,
>> PA-RISC.
>
> Right, this is not specific to PCIe. The reasons why the target is only
> PCIe is just to make algorithm to reset simple. It is possible to reset
> legacy PCI devices in my patch, but code becomes somewhat complicated. I
> thought recently most systems used PCIe and there was little demand for
> resetting legacy PCI. Therefore I decided not to reset legacy PCI
> devices, but I'll do if there are requests :-)

I'm not sure you need to reset legacy devices (or non-PCI devices)
yet, but the current hook isn't anchored anywhere -- it's just an
fs_initcall() that doesn't give the reader any clue about the
connection between the reset and the problem it's solving.

If we do something like this patch, I think it needs to be done at the
point where we enable or disable the IOMMU. That way, it's connected
to the important event, and there's a clue about how to make
corresponding fixes for other IOMMUs.

We already have a "reset_devices" boot option. This is for the same
purpose, as far as I can tell, and I'm not sure there's value in
having both "reset_devices" and "pci=pcie_reset_endpoint_devices". In
fact, there's nothing specific even to PCI here. The Intel VT-d docs
seem carefully written so they could apply to either PCIe or non-PCI
devices.

>> I tried to make a list of the interesting scenarios and the events
>> that are relevant to this problem:
>>
>> Case 1: IOMMU off in system, off in kdump kernel
>> system kernel leaves IOMMU off
>> DMA targets system-kernel memory
>> kexec to kdump kernel (IOMMU off, devices untouched)
>> DMA targets system-kernel memory (harmless)
>> kdump kernel re-inits device
>> DMA targets kdump-kernel memory
>>
>> Case 2: IOMMU off in system kernel, on in kdump kernel
>> system kernel leaves IOMMU off
>> DMA targets system-kernel memory
>> kexec to kdump kernel (IOMMU off, devices untouched)
>> DMA targets system-kernel memory (harmless)
>> kdump kernel enables IOMMU with no valid mappings
>> DMA causes IOMMU errors (annoying but harmless)
>> kdump kernel re-inits device
>> DMA targets IOMMU-mapped kdump-kernel memory
>>
>> Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
>> system kernel enables IOMMU
>> DMA targets IOMMU-mapped system-kernel memory
>> kexec to kdump kernel (IOMMU on, devices untouched)
>> DMA targets IOMMU-mapped system-kernel memory
>> kdump kernel doesn't know about IOMMU or doesn't touch it
>> DMA targets IOMMU-mapped system-kernel memory
>> kdump kernel re-inits device
>> kernel assumes no IOMMU, so all new DMA mappings are invalid
>> because DMAs actually do go through the IOMMU
>> (** corruption or other non-recoverable error likely **)
>>
>> Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
>> system kernel enables IOMMU
>> DMA targets IOMMU-mapped system-kernel memory
>> kexec to kdump kernel (IOMMU on, devices untouched)
>> DMA targets IOMMU-mapped system-kernel memory
>> kdump kernel disables IOMMU
>> DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
>> (** corruption or other non-recoverable error likely **)
>> kdump kernel re-inits device
>> DMA targets kdump-kernel memory
>>
>> Case 4: IOMMU on in system kernel, on in kdump kernel
>> system kernel enables IOMMU
>> DMA targets IOMMU-mapped system-kernel memory
>> kexec to kdump kernel (IOMMU on, devices untouched)
>> DMA targets IOMMU-mapped system-kernel memory
>> kdump kernel enables IOMMU with no valid mappings
>> DMA causes IOMMU errors (annoying but harmless)
>> kdump kernel re-inits device
>> DMA targets IOMMU-mapped kdump-kernel memory
>
> This is not harmless. Errors like PCI SERR are detected here, and it
> makes driver or system unstable, and kdump fails. I also got report that
> system hangs up due to this.

OK, let's take this slowly. Does an IOMMU error in the system kernel
also cause SERR or make the system unstable? Is that the expected
behavior on IOMMU errors, or is there something special about the
kdump scenario that causes SERRs? I see lots of DMAR errors, e.g.,
those in https://bugzilla.redhat.com/show_bug.cgi?id=743790, that are
reported with printk and don't seem to cause an SERR. Maybe the SERR
is system-specific behavior?

https://bugzilla.redhat.com/show_bug.cgi?id=568153 is another (public)
report of IOMMU errors related to a driver bug where we just get
printks, not SERR.

https://bugzilla.redhat.com/show_bug.cgi?id=743495 looks like a hang
when the kdump kernel reboots (after successfully saving a crashdump).
But it is using "iommu=pt", which I don't believe makes sense. The
scenario is basically case 4 above, but instead of the kdump kernel
starting with no valid IOMMU mappings, it identity-maps bus addresses
to physical memory addresses. That's completely bogus because it's
certainly not what the system kernel did, so it's entirely likely to
make the system unstable or hang. This is not an argument for doing a
reset; it's an argument for doing something smarter than "iommu=pt" in
the kdump kernel.

We might still want to reset PCIe devices, but I want to make sure
that we're not papering over other issues when we do. Therefore, I'd
like to understand why IOMMU errors seem harmless in some cases but
not in others.

> Case 1: Harmless
> Case 2: Not tested
> Case 3a: Not tested
> Case 3b: Cause problem, fixed by my patch
> Case 4: Cause problem, fixed by my patch
>
> I have never tested case 2 and 3a, but I think it also causes problem.

I do not believe we need to support case 3b (IOMMU on in system kernel
but disabled in kdump kernel). There is no way to make that reliable
unless every single device that may perform DMA is reset, and since
you don't reset legacy PCI or VGA devices, you're not even proposing
to do that.

I think we need to support case 1 (for systems with no IOMMU at all)
and case 4 (IOMMU enabled in both system kernel and kdump kernel). If
the kdump kernel can detect whether the IOMMU is enabled, that should
be enough -- it could figure out automatically whether we're in case 1
or 4.

>> Do you have any bugzilla references or problem report URLs you could
>> include here?
>
> I know three Red Hat bugzilla, but I think these are private article and
> you cannot see. I'll add you Cc list in bz so that you can see.
>
> BZ#743790: Kdump fails with intel_iommu=on
> https://bugzilla.redhat.com/show_bug.cgi?id=743790
>
> BZ#743495: Hardware error is detected with intel_iommu=on
> https://bugzilla.redhat.com/show_bug.cgi?id=743495
>
> BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt
> https://bugzilla.redhat.com/show_bug.cgi?id=833299

Thanks for adding me to the CC lists. I looked all three and I'm not
sure there's anything sensitive in them. It'd be nice if they could
be made public if there's not.

Bjorn

2013-06-11 06:09:08

by Takao Indoh

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

(2013/06/11 11:20), Bjorn Helgaas wrote:
> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh <[email protected]> wrote:
>> (2013/06/07 13:14), Bjorn Helgaas wrote:
>
>>> One thing I'm not sure about is that you are only resetting PCIe
>>> devices, but I don't think the problem is actually specific to PCIe,
>>> is it? I think the same issue could occur on any system with an
>>> IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but
>>> there are systems with IOMMUs for plain old PCI devices, e.g.,
>>> PA-RISC.
>>
>> Right, this is not specific to PCIe. The reasons why the target is only
>> PCIe is just to make algorithm to reset simple. It is possible to reset
>> legacy PCI devices in my patch, but code becomes somewhat complicated. I
>> thought recently most systems used PCIe and there was little demand for
>> resetting legacy PCI. Therefore I decided not to reset legacy PCI
>> devices, but I'll do if there are requests :-)
>
> I'm not sure you need to reset legacy devices (or non-PCI devices)
> yet, but the current hook isn't anchored anywhere -- it's just an
> fs_initcall() that doesn't give the reader any clue about the
> connection between the reset and the problem it's solving.
>
> If we do something like this patch, I think it needs to be done at the
> point where we enable or disable the IOMMU. That way, it's connected
> to the important event, and there's a clue about how to make
> corresponding fixes for other IOMMUs.

OK. Is pci_iommu_init() an appropriate place to add this hook?

> We already have a "reset_devices" boot option. This is for the same
> purpose, as far as I can tell, and I'm not sure there's value in
> having both "reset_devices" and "pci=pcie_reset_endpoint_devices". In
> fact, there's nothing specific even to PCI here. The Intel VT-d docs
> seem carefully written so they could apply to either PCIe or non-PCI
> devices.

Yeah, I can integrate this option into reset_devices. The reason
I separated them is to avoid regressions.

I have tested my patch on many machines and basically it worked, but I
found two machines where this reset patch caused problems. The first one
was caused by a bug in the raid card firmware. After updating the firmware,
this patch worked. The second one was due to a bug in a PCIe switch chip. I
reported this bug to the vendor but it is not fixed yet.

Anyway, this patch may cause problems on such buggy machines, so I
introduced a new boot parameter so that users can enable or disable this
function aside from reset_devices.

Actually Vivek Goyal, the kdump maintainer, said the same thing. He proposed
using reset_devices instead of adding a new one, and using a quirk or
something to avoid such buggy devices.

So, basically I agree with using reset_devices, but I want to prepare a
workaround in case this reset causes something to go wrong.


>
>>> I tried to make a list of the interesting scenarios and the events
>>> that are relevant to this problem:
>>>
>>> Case 1: IOMMU off in system, off in kdump kernel
>>> system kernel leaves IOMMU off
>>> DMA targets system-kernel memory
>>> kexec to kdump kernel (IOMMU off, devices untouched)
>>> DMA targets system-kernel memory (harmless)
>>> kdump kernel re-inits device
>>> DMA targets kdump-kernel memory
>>>
>>> Case 2: IOMMU off in system kernel, on in kdump kernel
>>> system kernel leaves IOMMU off
>>> DMA targets system-kernel memory
>>> kexec to kdump kernel (IOMMU off, devices untouched)
>>> DMA targets system-kernel memory (harmless)
>>> kdump kernel enables IOMMU with no valid mappings
>>> DMA causes IOMMU errors (annoying but harmless)
>>> kdump kernel re-inits device
>>> DMA targets IOMMU-mapped kdump-kernel memory
>>>
>>> Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
>>> system kernel enables IOMMU
>>> DMA targets IOMMU-mapped system-kernel memory
>>> kexec to kdump kernel (IOMMU on, devices untouched)
>>> DMA targets IOMMU-mapped system-kernel memory
>>> kdump kernel doesn't know about IOMMU or doesn't touch it
>>> DMA targets IOMMU-mapped system-kernel memory
>>> kdump kernel re-inits device
>>> kernel assumes no IOMMU, so all new DMA mappings are invalid
>>> because DMAs actually do go through the IOMMU
>>> (** corruption or other non-recoverable error likely **)
>>>
>>> Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
>>> system kernel enables IOMMU
>>> DMA targets IOMMU-mapped system-kernel memory
>>> kexec to kdump kernel (IOMMU on, devices untouched)
>>> DMA targets IOMMU-mapped system-kernel memory
>>> kdump kernel disables IOMMU
>>> DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
>>> (** corruption or other non-recoverable error likely **)
>>> kdump kernel re-inits device
>>> DMA targets kdump-kernel memory
>>>
>>> Case 4: IOMMU on in system kernel, on in kdump kernel
>>> system kernel enables IOMMU
>>> DMA targets IOMMU-mapped system-kernel memory
>>> kexec to kdump kernel (IOMMU on, devices untouched)
>>> DMA targets IOMMU-mapped system-kernel memory
>>> kdump kernel enables IOMMU with no valid mappings
>>> DMA causes IOMMU errors (annoying but harmless)
>>> kdump kernel re-inits device
>>> DMA targets IOMMU-mapped kdump-kernel memory
>>
>> This is not harmless. Errors like PCI SERR are detected here, and it
>> makes driver or system unstable, and kdump fails. I also got report that
>> system hangs up due to this.
>
> OK, let's take this slowly. Does an IOMMU error in the system kernel
> also cause SERR or make the system unstable? Is that the expected
> behavior on IOMMU errors, or is there something special about the
> kdump scenario that causes SERRs? I see lots of DMAR errors, e.g.,
> those in https://bugzilla.redhat.com/show_bug.cgi?id=743790, that are
> reported with printk and don't seem to cause an SERR. Maybe the SERR
> is system-specific behavior?
>
> https://bugzilla.redhat.com/show_bug.cgi?id=568153 is another (public)
> report of IOMMU errors related to a driver bug where we just get
> printks, not SERR.

Yes, it depends on the platform or devices. At least, PCI SERR is detected
on Fujitsu PRIMERGY BX920S2 and TX300S6.

The Intel VT-d document[1] says:

3.5.1 Hardware Handling of Faulting DMA Requests
DMA requests that result in remapping faults must be blocked by
hardware. The exact method of DMA blocking is
implementation-specific. For example:

- Faulting DMA write requests may be handled in much the same way as
hardware handles write requests to non-existent memory. For
example, the DMA request is discarded in a manner convenient for
implementations (such as by dropping the cycle, completing the
write request to memory with all byte enables masked off,
re-directed to a dummy memory location, etc.).

- Faulting DMA read requests may be handled in much the same way as
hardware handles read requests to non-existent memory. For
example, the request may be redirected to a dummy memory location,
returned as all 0's or 1's in a manner convenient to the
implementation, or the request may be completed with an explicit
error indication (preferred). For faulting DMA read requests from
PCI Express devices, hardware must indicate "Unsupported Request"
(UR) in the completion status field of the PCI Express read
completion.

So, after IOMMU error, its behavior is implementation-specific.

[1]
Intel Virtualization Technology for Directed I/O Architecture
Specification Rev1.3

>
> https://bugzilla.redhat.com/show_bug.cgi?id=743495 looks like a hang
> when the kdump kernel reboots (after successfully saving a crashdump).
> But it is using "iommu=pt", which I don't believe makes sense. The
> scenario is basically case 4 above, but instead of the kdump kernel
> starting with no valid IOMMU mappings, it identity-maps bus addresses
> to physical memory addresses. That's completely bogus because it's
> certainly not what the system kernel did, so it's entirely likely to
> make the system unstable or hang. This is not an argument for doing a
> reset; it's an argument for doing something smarter than "iommu=pt" in
> the kdump kernel.
>
> We might still want to reset PCIe devices, but I want to make sure
> that we're not papering over other issues when we do. Therefore, I'd
> like to understand why IOMMU errors seem harmless in some cases but
> not in others.

As I wrote above, IOMMU behavior on error is up to the platform/devices. I
think it also depends on the value of Uncorrectable Error Mask Register
in AER registers of each device.

>> Case 1: Harmless
>> Case 2: Not tested
>> Case 3a: Not tested
>> Case 3b: Cause problem, fixed by my patch
>> Case 4: Cause problem, fixed by my patch
>>
>> I have never tested cases 2 and 3a, but I think they also cause problems.
>
> I do not believe we need to support case 3b (IOMMU on in system kernel
> but disabled in kdump kernel). There is no way to make that reliable
> unless every single device that may perform DMA is reset, and since
> you don't reset legacy PCI or VGA devices, you're not even proposing
> to do that.
>
> I think we need to support case 1 (for systems with no IOMMU at all)
> and case 4 (IOMMU enabled in both system kernel and kdump kernel). If
> the kdump kernel can detect whether the IOMMU is enabled, that should
> be enough -- it could figure out automatically whether we're in case 1
> or 4.

Ok, I completely agree.


>>> Do you have any bugzilla references or problem report URLs you could
>>> include here?
>>
>> I know of three Red Hat bugzillas, but I think they are private articles
>> you cannot see. I'll add you to the Cc list in bz so that you can see them.
>>
>> BZ#743790: Kdump fails with intel_iommu=on
>> https://bugzilla.redhat.com/show_bug.cgi?id=743790
>>
>> BZ#743495: Hardware error is detected with intel_iommu=on
>> https://bugzilla.redhat.com/show_bug.cgi?id=743495
>>
>> BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt
>> https://bugzilla.redhat.com/show_bug.cgi?id=833299
>
> Thanks for adding me to the CC lists. I looked all three and I'm not
> sure there's anything sensitive in them. It'd be nice if they could
> be made public if there's not.

I really appreciate your comments! I'll confirm whether I can make them public.

Thanks,
Takao Indoh

2013-06-11 23:20:48

by Sumner, William

[permalink] [raw]
Subject: RE: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA


>(2013/06/11 11:20), Bjorn Helgaas wrote:
>> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh <[email protected]> wrote:
>>> (2013/06/07 13:14), Bjorn Helgaas wrote:
>>
>>>> One thing I'm not sure about is that you are only resetting PCIe
>>>> devices, but I don't think the problem is actually specific to PCIe,
>>>> is it? I think the same issue could occur on any system with an
>>>> IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but
>>>> there are systems with IOMMUs for plain old PCI devices, e.g.,
>>>> PA-RISC.
>>>
>>> Right, this is not specific to PCIe. The reason the target is only
>>> PCIe is just to keep the reset algorithm simple. It is possible to reset
>>> legacy PCI devices in my patch, but the code becomes somewhat complicated.
>>> I thought most recent systems used PCIe and there was little demand for
>>> resetting legacy PCI. Therefore I decided not to reset legacy PCI
>>> devices, but I'll do it if there are requests :-)
>>
>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>> yet, but the current hook isn't anchored anywhere -- it's just an
>> fs_initcall() that doesn't give the reader any clue about the
>> connection between the reset and the problem it's solving.
>>
>> If we do something like this patch, I think it needs to be done at the
>> point where we enable or disable the IOMMU. That way, it's connected
>> to the important event, and there's a clue about how to make
>> corresponding fixes for other IOMMUs.
>
>Ok. Is pci_iommu_init() an appropriate place to add this hook?
>
>> We already have a "reset_devices" boot option. This is for the same
>> purpose, as far as I can tell, and I'm not sure there's value in
>> having both "reset_devices" and "pci=pcie_reset_endpoint_devices". In
>> fact, there's nothing specific even to PCI here. The Intel VT-d docs
>> seem carefully written so they could apply to either PCIe or non-PCI
>> devices.
>
>Yeah, I can integrate this option into reset_devices. The reason why
>I separated them is to avoid regressions.
>
>I have tested my patch on many machines and it basically worked, but I
>found two machines where this reset patch caused problems. The first one
>was caused by a bug in the RAID card firmware. After updating the firmware,
>this patch worked. The second one was due to a bug in a PCIe switch chip. I
>reported this bug to the vendor but it is not fixed yet.
>
>Anyway, this patch may cause problems on such buggy machines, so I
>introduced a new boot parameter so that users can enable or disable this
>function aside from reset_devices.
>
>Actually Vivek Goyal, the kdump maintainer, said the same thing. He proposed
>using reset_devices instead of adding a new one, and using a quirk or
>something to avoid such buggy devices.

With respect to "and using quirk or something to avoid such buggy devices",
I believe that it will be necessary to provide a mechanism for devices that
need special handling to do the reset -- perhaps something like a list
of tuples: (device_type, function_to_call) with a default function_to_call
when the device_type is not found in the list. These functions would
need to be physically separate from the device driver because if the device
is present it needs to be reset even if the crash kernel chooses not to load
the driver for that device.

>
>So, basically I agree with using reset_devices, but I want to prepare
>workaround in case this reset causes something wrong.
>
I like the ability to specify the original "reset_devices" separately from
invoking this new mechanism. With so many different uses for Linux in
so many different environments and with so many different device drivers
it seems reasonable to keep the ability to tell the device drivers to
reset their devices -- instead of pulling the reset line on all devices.

I also like the ability to invoke the new reset feature separately from
telling the device drivers to do it.

>
>>
>>>> I tried to make a list of the interesting scenarios and the events
>>>> that are relevant to this problem:
>>>>
>>>> Case 1: IOMMU off in system, off in kdump kernel
>>>> system kernel leaves IOMMU off
>>>> DMA targets system-kernel memory
>>>> kexec to kdump kernel (IOMMU off, devices untouched)
>>>> DMA targets system-kernel memory (harmless)
>>>> kdump kernel re-inits device
>>>> DMA targets kdump-kernel memory
>>>>
>>>> Case 2: IOMMU off in system kernel, on in kdump kernel
>>>> system kernel leaves IOMMU off
>>>> DMA targets system-kernel memory
>>>> kexec to kdump kernel (IOMMU off, devices untouched)
>>>> DMA targets system-kernel memory (harmless)
>>>> kdump kernel enables IOMMU with no valid mappings
>>>> DMA causes IOMMU errors (annoying but harmless)
>>>> kdump kernel re-inits device
>>>> DMA targets IOMMU-mapped kdump-kernel memory
>>>>
>>>> Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
>>>> system kernel enables IOMMU
>>>> DMA targets IOMMU-mapped system-kernel memory
>>>> kexec to kdump kernel (IOMMU on, devices untouched)
>>>> DMA targets IOMMU-mapped system-kernel memory
>>>> kdump kernel doesn't know about IOMMU or doesn't touch it
>>>> DMA targets IOMMU-mapped system-kernel memory
>>>> kdump kernel re-inits device
>>>> kernel assumes no IOMMU, so all new DMA mappings are invalid
>>>> because DMAs actually do go through the IOMMU
>>>> (** corruption or other non-recoverable error likely **)
>>>>
>>>>
>>>> Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
>>>> system kernel enables IOMMU
>>>> DMA targets IOMMU-mapped system-kernel memory
>>>> kexec to kdump kernel (IOMMU on, devices untouched)
>>>> DMA targets IOMMU-mapped system-kernel memory
>>>> kdump kernel disables IOMMU
>>>> DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
>>>> (** corruption or other non-recoverable error likely **)
>>>> kdump kernel re-inits device
>>>> DMA targets kdump-kernel memory
>>>>
>>>> Case 4: IOMMU on in system kernel, on in kdump kernel
>>>> system kernel enables IOMMU
>>>> DMA targets IOMMU-mapped system-kernel memory
>>>> kexec to kdump kernel (IOMMU on, devices untouched)
>>>> DMA targets IOMMU-mapped system-kernel memory
>>>> kdump kernel enables IOMMU with no valid mappings
>>>> DMA causes IOMMU errors (annoying but harmless)
>>>> kdump kernel re-inits device
>>>> DMA targets IOMMU-mapped kdump-kernel memory
>>>
>>> This is not harmless. Errors like PCI SERR are detected here, and they
>>> make the driver or system unstable, and kdump fails. I also got a report
>>> that a system hangs up due to this.
>>
>> OK, let's take this slowly. Does an IOMMU error in the system kernel
>> also cause SERR or make the system unstable? Is that the expected
>> behavior on IOMMU errors, or is there something special about the
>> kdump scenario that causes SERRs? I see lots of DMAR errors, e.g.,
>> those in https://bugzilla.redhat.com/show_bug.cgi?id=743790, that are
>> reported with printk and don't seem to cause an SERR. Maybe the SERR
>> is system-specific behavior?
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=568153 is another (public)
>> report of IOMMU errors related to a driver bug where we just get
>> printks, not SERR.
>
>Yes, it depends on platform or devices. At least PCI SERR is detected on
>Fujitsu PRIMERGY BX920S2 and TX300S6.
>
>The Intel VT-d specification[1] says:
>
> 3.5.1 Hardware Handling of Faulting DMA Requests
> DMA requests that result in remapping faults must be blocked by
> hardware. The exact method of DMA blocking is
> implementation-specific. For example:
>
> - Faulting DMA write requests may be handled in much the same way as
> hardware handles write requests to non-existent memory. For
> example, the DMA request is discarded in a manner convenient for
> implementations (such as by dropping the cycle, completing the
> write request to memory with all byte enables masked off,
> re-directed to a dummy memory location, etc.).
>
> - Faulting DMA read requests may be handled in much the same way as
> hardware handles read requests to non-existent memory. For
> example, the request may be redirected to a dummy memory location,
> returned as all 0's or 1's in a manner convenient to the
> implementation, or the request may be completed with an explicit
> error indication (preferred). For faulting DMA read requests from
> PCI Express devices, hardware must indicate "Unsupported Request"
> (UR) in the completion status field of the PCI Express read
> completion.
>
>So, after IOMMU error, its behavior is implementation-specific.
>
>[1]
>Intel Virtualization Technology for Directed I/O Architecture
>Specification Rev1.3
>
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=743495 looks like a hang
>> when the kdump kernel reboots (after successfully saving a crashdump).
>> But it is using "iommu=pt", which I don't believe makes sense. The
>> scenario is basically case 4 above, but instead of the kdump kernel
>> starting with no valid IOMMU mappings, it identity-maps bus addresses
>> to physical memory addresses. That's completely bogus because it's
>> certainly not what the system kernel did, so it's entirely likely to
>> make the system unstable or hang. This is not an argument for doing a
>> reset; it's an argument for doing something smarter than "iommu=pt" in
>> the kdump kernel.

>> We might still want to reset PCIe devices, but I want to make sure
>> that we're not papering over other issues when we do. Therefore, I'd
>> like to understand why IOMMU errors seem harmless in some cases but
>> not in others.
>
>As I wrote above, IOMMU behavior on error is up to the platform/devices. I
>think it also depends on the value of Uncorrectable Error Mask Register
>in AER registers of each device.
>
>>> Case 1: Harmless
>>> Case 2: Not tested
>>> Case 3a: Not tested
>>> Case 3b: Cause problem, fixed by my patch
>>> Case 4: Cause problem, fixed by my patch
>>>
>>> I have never tested cases 2 and 3a, but I think they also cause problems.
>>
>> I do not believe we need to support case 3b (IOMMU on in system kernel
>> but disabled in kdump kernel). There is no way to make that reliable
>> unless every single device that may perform DMA is reset, and since
>> you don't reset legacy PCI or VGA devices, you're not even proposing
>> to do that.
>>
>> I think we need to support case 1 (for systems with no IOMMU at all)
>> and case 4 (IOMMU enabled in both system kernel and kdump kernel). If
>> the kdump kernel can detect whether the IOMMU is enabled, that should
>> be enough -- it could figure out automatically whether we're in case 1
>> or 4.
>
>Ok, I completely agree.
>
>
>>>> Do you have any bugzilla references or problem report URLs you could
>>>> include here?
>>>
>>> I know of three Red Hat bugzillas, but I think they are private articles
>>> you cannot see. I'll add you to the Cc list in bz so that you can see them.
>>>
>>> BZ#743790: Kdump fails with intel_iommu=on
>>> https://bugzilla.redhat.com/show_bug.cgi?id=743790
>>>
>>> BZ#743495: Hardware error is detected with intel_iommu=on
>>> https://bugzilla.redhat.com/show_bug.cgi?id=743495
>>>
>>> BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt
>>> https://bugzilla.redhat.com/show_bug.cgi?id=833299
>>
>> Thanks for adding me to the CC lists. I looked all three and I'm not
>> sure there's anything sensitive in them. It'd be nice if they could
>> be made public if there's not.
>
>I really appreciate your comments! I'll confirm whether I can make them public.

I would greatly appreciate being able to see the bugzillas relating to
this patch.
>
>Thanks,
>Takao Indoh

Thinking out of the box:
Much of the discussion about dealing with the ongoing DMA leftover
from the system kernel has assumed that the crash kernel will reset
the IOMMU -- which causes various problems if done while there is any DMA
still active -- which leads to the idea of stopping all of the DMA.

Suppose the crash kernel does not reset the hardware IOMMU, but simply
detects that it is active, resets only the devices that are necessary
for the crash kernel to operate, and re-programs only the translations
for those devices. All other translations remain the same (and remain valid)
so all leftover DMA continues into its buffer in the system kernel area
where it is harmless. New translations needed by the kdump kernel are
added to the existing tables.

I have not yet tried this, so I am not ready to propose it as anything more
than a discussion topic at this time.

It might work this way: (A small modification to case 3a above)

IOMMU on in system kernel, kdump kernel accepts active IOMMU
system kernel enables IOMMU
DMA targets IOMMU-mapped system-kernel memory
kexec to kdump kernel (IOMMU on, devices untouched)
DMA targets IOMMU-mapped system-kernel memory
kdump kernel detects active IOMMU and doesn't touch it
DMA targets IOMMU-mapped system-kernel memory
kdump kernel does not re-initialize IOMMU hardware
kdump kernel initializes IOMMU in-memory management structures
kdump kernel calls device drivers' standard initialization functions
Drivers initialize their own devices -- DMA from that device stops
When drivers request new DMA mappings, the kdump IOMMU driver:
1. Updates its in-memory mgt structures for that device & range
2. Updates IOMMU translate tables for that device & range
. Translations for all other devices & ranges are unchanged
3. Flushes IOMMU TLB to force IOMMU hardware update

Notes:
A. This seems very similar to case 1

B. Only touch devices with drivers loaded into kdump kernel.
. No need to touch devices that kdump is not going to use.

C. This assumes that all DMA device drivers used by kdump will
initialize the device before requesting DMA mappings.
. Seems reasonable, but need to confirm how many do (or don't)
. Only device drivers that might be included in the kdump
kernel need to observe this initialization ordering.

D. Could copy system kernel's translate tables into kdump kernel
and adjust pointers if this feels more trustworthy than using
the original structures where they are in the system kernel.

E. Handle IRQ remappings in a similar manner.

Thanks,
Bill Sumner

2013-06-12 00:53:55

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

[Your quoting is messed up below]
>> This is Takao's text
> This is Bill's text

On Tue, Jun 11, 2013 at 5:19 PM, Sumner, William <[email protected]> wrote:
>>(2013/06/11 11:20), Bjorn Helgaas wrote:
>>> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh <[email protected]> wrote:

>>Actually Vivek Goyal, kdump maintainer said same thing. He proposed using
>>reset_devices instead of adding new one, and using quirk or something to
>>avoid such a buggy devices.
>
> With respect to "and using quirk or something to avoid such buggy devices",
> I believe that it will be necessary to provide a mechanism for devices that
> need special handling to do the reset ...

You mean something like pci_dev_specific_reset()?

>>So, basically I agree with using reset_devices, but I want to prepare
>>workaround in case this reset causes something wrong.
>>
> I like the ability to specify the original "reset_devices" separately from
> invoking this new mechanism. With so many different uses for Linux in
> so many different environments and with so many different device drivers
> it seems reasonable to keep the ability to tell the device drivers to
> reset their devices -- instead of pulling the reset line on all devices.

The problem with adding new options for this and that is it confuses
users and fragments testing. Users already randomly try options and
publish combinations that happen to work as "solutions." Then we
don't get problem reports and don't get a chance to fix things
properly. The ideal OS would have zero options and be able to figure
everything out by itself. So that's my bias about giving users
flexibility by adding new options -- I don't like it, and I think we
have to accept some lack of flexibility to keep the system complexity
tractable :)

> ...
> Thinking out of the box:
> Much of the discussion about dealing with the ongoing DMA leftover
> from the system kernel has assumed that the crash kernel will reset
> the IOMMU -- which causes various problems if done while there is any DMA
> still active -- which leads to the idea of stopping all of the DMA.
>
> Suppose the crash kernel does not reset the hardware IOMMU, but simply
> detects that it is active, resets only the devices that are necessary
> for the crash kernel to operate, and re-programs only the translations
> for those devices. All other translations remain the same (and remain valid)
> so all leftover DMA continues into its buffer in the system kernel area
> where it is harmless. New translations needed by the kdump kernel are
> added to the existing tables.
>
> I have not yet tried this, so I am not ready to propose it as anything more
> than a discussion topic at this time.
>
> It might work this way: (A small modification to case 3a above)
>
> IOMMU on in system kernel, kdump kernel accepts active IOMMU
> system kernel enables IOMMU
> DMA targets IOMMU-mapped system-kernel memory
> kexec to kdump kernel (IOMMU on, devices untouched)
> DMA targets IOMMU-mapped system-kernel memory
> kdump kernel detects active IOMMU and doesn't touch it
> DMA targets IOMMU-mapped system-kernel memory
> kdump kernel does not re-initialize IOMMU hardware
> kdump kernel initializes IOMMU in-memory management structures
> kdump kernel calls device drivers' standard initialization functions
> Drivers initialize their own devices -- DMA from that device stops
> When drivers request new DMA mappings, the kdump IOMMU driver:
> 1. Updates its in-memory mgt structures for that device & range
> 2. Updates IOMMU translate tables for that device & range
> . Translations for all other devices & ranges are unchanged
> 3. Flushes IOMMU TLB to force IOMMU hardware update

This is certainly an interesting idea.

It would require some additional smarts in the IOMMU driver to take
over existing page tables. Those data structures from the system
kernel are outside the memory map known to the kdump kernel, so
there'd likely be VM issues there. The IOMMU driver would also have
to be able to reconstruct the state for the bus address space
allocator, e.g., use the I/O page tables to rebuild a bitmap of
allocated space.

We'll end up with a bunch of dma_map() calls from the system kernel
that don't have corresponding dma_unmap()s. I guess when a driver
initializes its device, there might have to be a new interface to
"unmap any existing mappings for this device, even though I don't know
what they are."

I think it's possible, but it sounds like quite a lot of work and I'm
not sure it's worth it for this relatively limited use case. And it's
more IOMMU-specific than a "reset devices" strategy, so most of it
would have to be done in each IOMMU driver.

Bjorn

2013-06-12 04:45:50

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

[+cc Vivek, Haren; sorry I didn't think to add you earlier]

On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
<[email protected]> wrote:
> (2013/06/11 11:20), Bjorn Helgaas wrote:

>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>> yet, but the current hook isn't anchored anywhere -- it's just an
>> fs_initcall() that doesn't give the reader any clue about the
>> connection between the reset and the problem it's solving.
>>
>> If we do something like this patch, I think it needs to be done at the
>> point where we enable or disable the IOMMU. That way, it's connected
>> to the important event, and there's a clue about how to make
>> corresponding fixes for other IOMMUs.
>
> Ok. Is pci_iommu_init() an appropriate place to add this hook?

I looked at various IOMMU init places today, and it's far more
complicated and varied than I had hoped.

This reset scheme depends on enumerating PCI devices before we
initialize the IOMMU used by those devices. x86 works that way today,
but not all architectures do (see the sparc pci_fire_pbm_init(), for
example). And I think conceptually, the IOMMU should be enumerated
and initialized *before* the devices that use it.

So I'm uncomfortable with that aspect of this scheme.

It would be at least conceivable to reset the devices in the system
kernel, before the kexec. I know we want to do as little as possible
in the crashing kernel, but it's at least a possibility, and it might
be cleaner.

Bjorn

2013-06-12 13:19:43

by Donald Dutile

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

On 06/11/2013 07:19 PM, Sumner, William wrote:
>
>> (2013/06/11 11:20), Bjorn Helgaas wrote:
>>> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh<[email protected]> wrote:
>>>> (2013/06/07 13:14), Bjorn Helgaas wrote:
>>>
>>>>> One thing I'm not sure about is that you are only resetting PCIe
>>>>> devices, but I don't think the problem is actually specific to PCIe,
>>>>> is it? I think the same issue could occur on any system with an
>>>>> IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but
>>>>> there are systems with IOMMUs for plain old PCI devices, e.g.,
>>>>> PA-RISC.
>>>>
>>>> Right, this is not specific to PCIe. The reason the target is only
>>>> PCIe is just to keep the reset algorithm simple. It is possible to reset
>>>> legacy PCI devices in my patch, but the code becomes somewhat complicated.
>>>> I thought most recent systems used PCIe and there was little demand for
>>>> resetting legacy PCI. Therefore I decided not to reset legacy PCI
>>>> devices, but I'll do it if there are requests :-)
>>>
>>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>>> yet, but the current hook isn't anchored anywhere -- it's just an
>>> fs_initcall() that doesn't give the reader any clue about the
>>> connection between the reset and the problem it's solving.
>>>
>>> If we do something like this patch, I think it needs to be done at the
>>> point where we enable or disable the IOMMU. That way, it's connected
>>> to the important event, and there's a clue about how to make
>>> corresponding fixes for other IOMMUs.
>>
>> Ok. Is pci_iommu_init() an appropriate place to add this hook?
>>
>>> We already have a "reset_devices" boot option. This is for the same
>>> purpose, as far as I can tell, and I'm not sure there's value in
>>> having both "reset_devices" and "pci=pcie_reset_endpoint_devices". In
>>> fact, there's nothing specific even to PCI here. The Intel VT-d docs
>>> seem carefully written so they could apply to either PCIe or non-PCI
>>> devices.
>>
>> Yeah, I can integrate this option into reset_devices. The reason why
>> I separate them is to avoid regression.
>>
>> I have tested my patch on many machines and it basically worked, but I
>> found two machines where this reset patch caused problems. The first one
>> was caused by a bug in the RAID card firmware. After updating the firmware,
>> this patch worked. The second one was due to a bug in a PCIe switch chip. I
>> reported this bug to the vendor but it is not fixed yet.
>>
>> Anyway, this patch may cause problems on such buggy machines, so I
>> introduced a new boot parameter so that users can enable or disable this
>> function aside from reset_devices.
>>
>> Actually Vivek Goyal, the kdump maintainer, said the same thing. He proposed
>> using reset_devices instead of adding a new one, and using a quirk or
>> something to avoid such buggy devices.
>
> With respect to "and using quirk or something to avoid such buggy devices",
> I believe that it will be necessary to provide a mechanism for devices that
> need special handling to do the reset -- perhaps something like a list
> of tuples: (device_type, function_to_call) with a default function_to_call
> when the device_type is not found in the list. These functions would
> need to be physically separate from the device driver because if the device
> is present it needs to be reset even if the crash kernel chooses not to load
> the driver for that device.
>
>>
>> So, basically I agree with using reset_devices, but I want to prepare
>> workaround in case this reset causes something wrong.
>>
> I like the ability to specify the original "reset_devices" separately from
> invoking this new mechanism. With so many different uses for Linux in
> so many different environments and with so many different device drivers
> it seems reasonable to keep the ability to tell the device drivers to
> reset their devices -- instead of pulling the reset line on all devices.
>
> I also like the ability to invoke the new reset feature separately from
> telling the device drivers to do it.
>
>>
>>>
>>>>> I tried to make a list of the interesting scenarios and the events
>>>>> that are relevant to this problem:
>>>>>
>>>>> Case 1: IOMMU off in system, off in kdump kernel
>>>>> system kernel leaves IOMMU off
>>>>> DMA targets system-kernel memory
>>>>> kexec to kdump kernel (IOMMU off, devices untouched)
>>>>> DMA targets system-kernel memory (harmless)
>>>>> kdump kernel re-inits device
>>>>> DMA targets kdump-kernel memory
>>>>>
>>>>> Case 2: IOMMU off in system kernel, on in kdump kernel
>>>>> system kernel leaves IOMMU off
>>>>> DMA targets system-kernel memory
>>>>> kexec to kdump kernel (IOMMU off, devices untouched)
>>>>> DMA targets system-kernel memory (harmless)
>>>>> kdump kernel enables IOMMU with no valid mappings
>>>>> DMA causes IOMMU errors (annoying but harmless)
>>>>> kdump kernel re-inits device
>>>>> DMA targets IOMMU-mapped kdump-kernel memory
>>>>>
>>>>> Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
>>>>> system kernel enables IOMMU
>>>>> DMA targets IOMMU-mapped system-kernel memory
>>>>> kexec to kdump kernel (IOMMU on, devices untouched)
>>>>> DMA targets IOMMU-mapped system-kernel memory
>>>>> kdump kernel doesn't know about IOMMU or doesn't touch it
>>>>> DMA targets IOMMU-mapped system-kernel memory
>>>>> kdump kernel re-inits device
>>>>> kernel assumes no IOMMU, so all new DMA mappings are invalid
>>>>> because DMAs actually do go through the IOMMU
>>>>> (** corruption or other non-recoverable error likely **)
>>>>>
>>>>>
>>>>> Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
>>>>> system kernel enables IOMMU
>>>>> DMA targets IOMMU-mapped system-kernel memory
>>>>> kexec to kdump kernel (IOMMU on, devices untouched)
>>>>> DMA targets IOMMU-mapped system-kernel memory
>>>>> kdump kernel disables IOMMU
>>>>> DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
>>>>> (** corruption or other non-recoverable error likely **)
>>>>> kdump kernel re-inits device
>>>>> DMA targets kdump-kernel memory
>>>>>
>>>>> Case 4: IOMMU on in system kernel, on in kdump kernel
>>>>> system kernel enables IOMMU
>>>>> DMA targets IOMMU-mapped system-kernel memory
>>>>> kexec to kdump kernel (IOMMU on, devices untouched)
>>>>> DMA targets IOMMU-mapped system-kernel memory
>>>>> kdump kernel enables IOMMU with no valid mappings
>>>>> DMA causes IOMMU errors (annoying but harmless)
>>>>> kdump kernel re-inits device
>>>>> DMA targets IOMMU-mapped kdump-kernel memory
>>>>
>>>> This is not harmless. Errors like PCI SERR are detected here, and they
>>>> make the driver or system unstable, and kdump fails. I also got a report
>>>> that a system hangs up due to this.
>>>
>>> OK, let's take this slowly. Does an IOMMU error in the system kernel
>>> also cause SERR or make the system unstable? Is that the expected
>>> behavior on IOMMU errors, or is there something special about the
>>> kdump scenario that causes SERRs? I see lots of DMAR errors, e.g.,
>>> those in https://bugzilla.redhat.com/show_bug.cgi?id=743790, that are
>>> reported with printk and don't seem to cause an SERR. Maybe the SERR
>>> is system-specific behavior?
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=568153 is another (public)
>>> report of IOMMU errors related to a driver bug where we just get
>>> printks, not SERR.
>>
>> Yes, it depends on platform or devices. At least PCI SERR is detected on
>> Fujitsu PRIMERGY BX920S2 and TX300S6.
>>
>> The Intel VT-d specification[1] says:
>>
>> 3.5.1 Hardware Handling of Faulting DMA Requests
>> DMA requests that result in remapping faults must be blocked by
>> hardware. The exact method of DMA blocking is
>> implementation-specific. For example:
>>
>> - Faulting DMA write requests may be handled in much the same way as
>> hardware handles write requests to non-existent memory. For
>> example, the DMA request is discarded in a manner convenient for
>> implementations (such as by dropping the cycle, completing the
>> write request to memory with all byte enables masked off,
>> re-directed to a dummy memory location, etc.).
>>
>> - Faulting DMA read requests may be handled in much the same way as
>> hardware handles read requests to non-existent memory. For
>> example, the request may be redirected to a dummy memory location,
>> returned as all 0's or 1's in a manner convenient to the
>> implementation, or the request may be completed with an explicit
>> error indication (preferred). For faulting DMA read requests from
>> PCI Express devices, hardware must indicate "Unsupported Request"
>> (UR) in the completion status field of the PCI Express read
>> completion.
>>
>> So, after IOMMU error, its behavior is implementation-specific.
>>
>> [1]
>> Intel Virtualization Technology for Directed I/O Architecture
>> Specification Rev1.3
>>
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=743495 looks like a hang
>>> when the kdump kernel reboots (after successfully saving a crashdump).
>>> But it is using "iommu=pt", which I don't believe makes sense. The
>>> scenario is basically case 4 above, but instead of the kdump kernel
>>> starting with no valid IOMMU mappings, it identity-maps bus addresses
>>> to physical memory addresses. That's completely bogus because it's
>>> certainly not what the system kernel did, so it's entirely likely to
>>> make the system unstable or hang. This is not an argument for doing a
>>> reset; it's an argument for doing something smarter than "iommu=pt" in
>>> the kdump kernel.
>
>>> We might still want to reset PCIe devices, but I want to make sure
>>> that we're not papering over other issues when we do. Therefore, I'd
>>> like to understand why IOMMU errors seem harmless in some cases but
>>> not in others.
>>
>> As I wrote above, IOMMU behavior on error is up to the platform/devices. I
>> think it also depends on the value of the Uncorrectable Error Mask Register
>> in the AER registers of each device.
>>
>>>> Case 1: Harmless
>>>> Case 2: Not tested
>>>> Case 3a: Not tested
>>>> Case 3b: Cause problem, fixed by my patch
>>>> Case 4: Cause problem, fixed by my patch
>>>>
>>>> I have never tested cases 2 and 3a, but I think they also cause problems.
>>>
>>> I do not believe we need to support case 3b (IOMMU on in system kernel
>>> but disabled in kdump kernel). There is no way to make that reliable
>>> unless every single device that may perform DMA is reset, and since
>>> you don't reset legacy PCI or VGA devices, you're not even proposing
>>> to do that.
>>>
>>> I think we need to support case 1 (for systems with no IOMMU at all)
>>> and case 4 (IOMMU enabled in both system kernel and kdump kernel). If
>>> the kdump kernel can detect whether the IOMMU is enabled, that should
>>> be enough -- it could figure out automatically whether we're in case 1
>>> or 4.
>>
>> Ok, I completely agree.
>>
>>
>>>>> Do you have any bugzilla references or problem report URLs you could
>>>>> include here?
>>>>
>>>> I know of three Red Hat bugzillas, but I think these are private articles and
>>>> you cannot see them. I'll add you to the Cc list in bz so that you can see them.
>>>>
>>>> BZ#743790: Kdump fails with intel_iommu=on
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=743790
>>>>
>>>> BZ#743495: Hardware error is detected with intel_iommu=on
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=743495
>>>>
>>>> BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=833299
>>>
>>> Thanks for adding me to the CC lists. I looked at all three and I'm not
>>> sure there's anything sensitive in them. It'd be nice if they could
>>> be made public if there's not.
>>
>> I really appreciate your comments! I'll confirm I can make it public.
>
> I would greatly appreciate being able to see the bugzillas relating to
> this patch.
>>
>> Thanks,
>> Takao Indoh
>
> Thinking out of the box:
> Much of the discussion about dealing with the ongoing DMA leftover
> from the system kernel has assumed that the crash kernel will reset
> the IOMMU -- which causes various problems if done while there is any DMA
> still active -- which leads to the idea of stopping all of the DMA.
>
> Suppose the crash kernel does not reset the hardware IOMMU, but simply
> detects that it is active, resets only the devices that are necessary
> for the crash kernel to operate, and re-programs only the translations

This suggestion brings up this one:
Only reset the devices that are generating IOMMU faults, i.e., add a kdump kernel
hook to the IOMMU fault handler. While the kdump kernel is re-initing, IOMMU faults are grabbed
by the kexec'd kernel and the faulting device is reset. IOW, a directed, per-device kdump reset.
If there are no faults, then the device is idle, or re-init'd and using new maps, and
all is well. Once the kdump kernel is fully init'd, just return from the
kdump kernel callout, and let the system do its fault handling.
It can be made mostly common (reset code in a kexec module under drivers/iommu),
with simple calls out from each IOMMU fault handler.

Of course, this assumes that systems don't turn IOMMU faults into system SERRs
and crash the system. IMO, as I've stated to a number of system developers,
IOMMU (mapping) faults should not crash the system -- they already isolate and
prevent corruption, so worst case, some part of the system will stop doing I/O,
and that will either get retried, or aborted, and a cleaner panic (and kdump
kernel switch) will ensue.

> for those devices. All other translations remain the same (and remain valid)
> so all leftover DMA continues into its buffer in the system kernel area
> where it is harmless. New translations needed by the kdump kernel are
> added to the existing tables.
>
A crashed system shouldn't assume that ongoing DMA is sane.... it should be stopped.
I would expect the kdump kernel to reset devices & acquire new DMA mappings
upon reboot. Thus, the only issue is how to 'throttle' IOMMU faults and
not allow them to crash the system while it is recovering/resetting itself,
even though there is no one big (power) reset to quiesce everything.

> I have not yet tried this, so I am not ready to propose it as anything more
> than a discussion topic at this time.
>
> It might work this way: (A small modification to case 3a above)
>
> IOMMU on in system kernel, kdump kernel accepts active IOMMU
> system kernel enables IOMMU
> DMA targets IOMMU-mapped system-kernel memory
> kexec to kdump kernel (IOMMU on, devices untouched)
> DMA targets IOMMU-mapped system-kernel memory
it may not, it may be bad/bogus device I/O causing the crash...

> kdump kernel detects active IOMMU and doesn't touch it
> DMA targets IOMMU-mapped system-kernel memory
> kdump kernel does not re-initialize IOMMU hardware
> kdump kernel initializes IOMMU in-memory management structures
> kdump kernel calls device drivers' standard initialization functions
> Drivers initialize their own devices -- DMA from that device stops
> When drivers request new DMA mappings, the kdump IOMMU driver:
> 1. Updates its in-memory mgt structures for that device & range
> 2. Updates IOMMU translate tables for that device & range
> . Translations for all other devices & ranges are unchanged
> 3. Flushes IOMMU TLB to force IOMMU hardware update
>
> Notes:
> A. This seems very similar to case 1
>
> B. Only touch devices with drivers loaded into kdump kernel.
> . No need to touch devices that kdump is not going to use.
>
> C. This assumes that all DMA device drivers used by kdump will
> initialize the device before requesting DMA mappings.
> . Seems reasonable, but need to confirm how many do (or don't)
> . Only device drivers that might be included in the kdump
> kernel need to observe this initialization ordering.
>
> D. Could copy system kernel's translate tables into kdump kernel
> and adjust pointers if this feels more trustworthy than using
> the original structures where they are in the system kernel.
>
> E. Handle IRQ remappings in a similar manner.
An even more serious attack vector: flood the system with interrupts (by an assigned device in a guest),
effectively creating a DoS, because interrupts cause a raised (hw) IPL, while bogus DMA does
not cause system IPL elevation (blockage of other system progress), except when it
generates IOMMU faults, which become interrupts.
hmmm, wondering if another solution is to flip IOMMU fault handling into a poll mode
at kdump init time, and then flip it back to interrupt-driven once I/O has been completely init'd?
Per-device fault throttling would be a nice hw feature! (wishful thinking)


>
> Thanks,
> Bill Sumner
>
>

2013-06-13 02:45:20

by Takao Indoh

Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

(2013/06/12 13:45), Bjorn Helgaas wrote:
> [+cc Vivek, Haren; sorry I didn't think to add you earlier]
>
> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
> <[email protected]> wrote:
>> (2013/06/11 11:20), Bjorn Helgaas wrote:
>
>>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>>> yet, but the current hook isn't anchored anywhere -- it's just an
>>> fs_initcall() that doesn't give the reader any clue about the
>>> connection between the reset and the problem it's solving.
>>>
>>> If we do something like this patch, I think it needs to be done at the
>>> point where we enable or disable the IOMMU. That way, it's connected
>>> to the important event, and there's a clue about how to make
>>> corresponding fixes for other IOMMUs.
>>
>> Ok. Is pci_iommu_init() an appropriate place to add this hook?
>
> I looked at various IOMMU init places today, and it's far more
> complicated and varied than I had hoped.
>
> This reset scheme depends on enumerating PCI devices before we
> initialize the IOMMU used by those devices. x86 works that way today,
> but not all architectures do (see the sparc pci_fire_pbm_init(), for

Sorry, could you tell me which part depends on architecture?

> example). And I think conceptually, the IOMMU should be enumerated
> and initialized *before* the devices that use it.
>
> So I'm uncomfortable with that aspect of this scheme.
>
> It would be at least conceivable to reset the devices in the system
> kernel, before the kexec. I know we want to do as little as possible
> in the crashing kernel, but it's at least a possibility, and it might
> be cleaner.

I bet this will not be accepted by the kdump maintainer. Everything in the panic
kernel is unreliable.

Thanks,
Takao Indoh

2013-06-13 03:25:52

by Takao Indoh

Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

(2013/06/12 22:19), Don Dutile wrote:
> On 06/11/2013 07:19 PM, Sumner, William wrote:
>>
>>> (2013/06/11 11:20), Bjorn Helgaas wrote:
>>>> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh<[email protected]> wrote:
>>>>> (2013/06/07 13:14), Bjorn Helgaas wrote:
>>>>
>>>>>> One thing I'm not sure about is that you are only resetting PCIe
>>>>>> devices, but I don't think the problem is actually specific to PCIe,
>>>>>> is it? I think the same issue could occur on any system with an
>>>>>> IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but
>>>>>> there are systems with IOMMUs for plain old PCI devices, e.g.,
>>>>>> PA-RISC.
>>>>>
>>>>> Right, this is not specific to PCIe. The reason the target is only
>>>>> PCIe is just to keep the reset algorithm simple. It is possible to reset
>>>>> legacy PCI devices in my patch, but the code becomes somewhat complicated. I
>>>>> thought recently most systems used PCIe and there was little demand for
>>>>> resetting legacy PCI. Therefore I decided not to reset legacy PCI
>>>>> devices, but I'll do so if there are requests :-)
>>>>
>>>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>>>> yet, but the current hook isn't anchored anywhere -- it's just an
>>>> fs_initcall() that doesn't give the reader any clue about the
>>>> connection between the reset and the problem it's solving.
>>>>
>>>> If we do something like this patch, I think it needs to be done at the
>>>> point where we enable or disable the IOMMU. That way, it's connected
>>>> to the important event, and there's a clue about how to make
>>>> corresponding fixes for other IOMMUs.
>>>
>>> Ok. Is pci_iommu_init() an appropriate place to add this hook?
>>>
>>>> We already have a "reset_devices" boot option. This is for the same
>>>> purpose, as far as I can tell, and I'm not sure there's value in
>>>> having both "reset_devices" and "pci=pcie_reset_endpoint_devices". In
>>>> fact, there's nothing specific even to PCI here. The Intel VT-d docs
>>>> seem carefully written so they could apply to either PCIe or non-PCI
>>>> devices.
>>>
>>> Yeah, I can integrate this option into reset_devices. The reason why
>>> I separated them is to avoid regressions.
>>>
>>> I have tested my patch on many machines and basically it worked, but I
>>> found two machines where this reset patch caused problems. The first one
>>> was caused by a bug in the RAID card firmware. After updating the firmware,
>>> this patch worked. The second one was due to a bug in a PCIe switch chip. I
>>> reported this bug to the vendor but it is not fixed yet.
>>>
>>> Anyway, this patch may cause problems on such buggy machines, so I
>>> introduced a new boot parameter so that users can enable or disable this
>>> function aside from reset_devices.
>>>
>>> Actually Vivek Goyal, the kdump maintainer, said the same thing. He proposed
>>> using reset_devices instead of adding a new one, and using a quirk or
>>> something to avoid such buggy devices.
>>
>> With respect to "and using quirk or something to avoid such buggy devices",
>> I believe that it will be necessary to provide a mechanism for devices that
>> need special handling to do the reset -- perhaps something like a list
>> of tuples: (device_type, function_to_call) with a default function_to_call
>> when the device_type is not found in the list. These functions would
>> need to be physically separate from the device driver because if the device
>> is present it needs to be reset even if the crash kernel chooses not to load
>> the driver for that device.
>>
>>>
>>> So, basically I agree with using reset_devices, but I want to prepare a
>>> workaround in case this reset causes something to go wrong.
>>>
>> I like the ability to specify the original "reset_devices" separately from
>> invoking this new mechanism. With so many different uses for Linux in
>> so many different environments and with so many different device drivers
>> it seems reasonable to keep the ability to tell the device drivers to
>> reset their devices -- instead of pulling the reset line on all devices.
>>
>> I also like the ability to invoke the new reset feature separately from
>> telling the device drivers to do it.
>>
>>>
>>>>
>>>>>> I tried to make a list of the interesting scenarios and the events
>>>>>> that are relevant to this problem:
>>>>>>
>>>>>> Case 1: IOMMU off in system, off in kdump kernel
>>>>>> system kernel leaves IOMMU off
>>>>>> DMA targets system-kernel memory
>>>>>> kexec to kdump kernel (IOMMU off, devices untouched)
>>>>>> DMA targets system-kernel memory (harmless)
>>>>>> kdump kernel re-inits device
>>>>>> DMA targets kdump-kernel memory
>>>>>>
>>>>>> Case 2: IOMMU off in system kernel, on in kdump kernel
>>>>>> system kernel leaves IOMMU off
>>>>>> DMA targets system-kernel memory
>>>>>> kexec to kdump kernel (IOMMU off, devices untouched)
>>>>>> DMA targets system-kernel memory (harmless)
>>>>>> kdump kernel enables IOMMU with no valid mappings
>>>>>> DMA causes IOMMU errors (annoying but harmless)
>>>>>> kdump kernel re-inits device
>>>>>> DMA targets IOMMU-mapped kdump-kernel memory
>>>>>>
>>>>>> Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
>>>>>> system kernel enables IOMMU
>>>>>> DMA targets IOMMU-mapped system-kernel memory
>>>>>> kexec to kdump kernel (IOMMU on, devices untouched)
>>>>>> DMA targets IOMMU-mapped system-kernel memory
>>>>>> kdump kernel doesn't know about IOMMU or doesn't touch it
>>>>>> DMA targets IOMMU-mapped system-kernel memory
>>>>>> kdump kernel re-inits device
>>>>>> kernel assumes no IOMMU, so all new DMA mappings are invalid
>>>>>> because DMAs actually do go through the IOMMU
>>>>>> (** corruption or other non-recoverable error likely **)
>>>>>>
>>>>>>
>>>>>> Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
>>>>>> system kernel enables IOMMU
>>>>>> DMA targets IOMMU-mapped system-kernel memory
>>>>>> kexec to kdump kernel (IOMMU on, devices untouched)
>>>>>> DMA targets IOMMU-mapped system-kernel memory
>>>>>> kdump kernel disables IOMMU
>>>>>> DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
>>>>>> (** corruption or other non-recoverable error likely **)
>>>>>> kdump kernel re-inits device
>>>>>> DMA targets kdump-kernel memory
>>>>>>
>>>>>> Case 4: IOMMU on in system kernel, on in kdump kernel
>>>>>> system kernel enables IOMMU
>>>>>> DMA targets IOMMU-mapped system-kernel memory
>>>>>> kexec to kdump kernel (IOMMU on, devices untouched)
>>>>>> DMA targets IOMMU-mapped system-kernel memory
>>>>>> kdump kernel enables IOMMU with no valid mappings
>>>>>> DMA causes IOMMU errors (annoying but harmless)
>>>>>> kdump kernel re-inits device
>>>>>> DMA targets IOMMU-mapped kdump-kernel memory
>>>>>
>>>>> This is not harmless. Errors like PCI SERR are detected here, and it
>>>>> makes the driver or system unstable, and kdump fails. I also got a report
>>>>> that the system hangs up due to this.
>>>>
>>>> OK, let's take this slowly. Does an IOMMU error in the system kernel
>>>> also cause SERR or make the system unstable? Is that the expected
>>>> behavior on IOMMU errors, or is there something special about the
>>>> kdump scenario that causes SERRs? I see lots of DMAR errors, e.g.,
>>>> those in https://bugzilla.redhat.com/show_bug.cgi?id=743790, that are
>>>> reported with printk and don't seem to cause an SERR. Maybe the SERR
>>>> is system-specific behavior?
>>>>
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=568153 is another (public)
>>>> report of IOMMU errors related to a driver bug where we just get
>>>> printks, not SERR.
>>>
>>> Yes, it depends on platform or devices. At least PCI SERR is detected on
>>> Fujitsu PRIMERGY BX920S2 and TX300S6.
>>>
>>> The Intel VT-d document[1] says:
>>>
>>> 3.5.1 Hardware Handling of Faulting DMA Requests
>>> DMA requests that result in remapping faults must be blocked by
>>> hardware. The exact method of DMA blocking is
>>> implementation-specific. For example:
>>>
>>> - Faulting DMA write requests may be handled in much the same way as
>>> hardware handles write requests to non-existent memory. For
>>> example, the DMA request is discarded in a manner convenient for
>>> implementations (such as by dropping the cycle, completing the
>>> write request to memory with all byte enables masked off,
>>> re-directed to a dummy memory location, etc.).
>>>
>>> - Faulting DMA read requests may be handled in much the same way as
>>> hardware handles read requests to non-existent memory. For
>>> example, the request may be redirected to a dummy memory location,
>>> returned as all 0's or 1's in a manner convenient to the
>>> implementation, or the request may be completed with an explicit
>>> error indication (preferred). For faulting DMA read requests from
>>> PCI Express devices, hardware must indicate "Unsupported Request"
>>> (UR) in the completion status field of the PCI Express read
>>> completion.
>>>
>>> So, after IOMMU error, its behavior is implementation-specific.
>>>
>>> [1]
>>> Intel Virtualization Technology for Directed I/O Architecture
>>> Specification Rev1.3
>>>
>>>>
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=743495 looks like a hang
>>>> when the kdump kernel reboots (after successfully saving a crashdump).
>>>> But it is using "iommu=pt", which I don't believe makes sense. The
>>>> scenario is basically case 4 above, but instead of the kdump kernel
>>>> starting with no valid IOMMU mappings, it identity-maps bus addresses
>>>> to physical memory addresses. That's completely bogus because it's
>>>> certainly not what the system kernel did, so it's entirely likely to
>>>> make the system unstable or hang. This is not an argument for doing a
>>>> reset; it's an argument for doing something smarter than "iommu=pt" in
>>>> the kdump kernel.
>>
>>>> We might still want to reset PCIe devices, but I want to make sure
>>>> that we're not papering over other issues when we do. Therefore, I'd
>>>> like to understand why IOMMU errors seem harmless in some cases but
>>>> not in others.
>>>
>>> As I wrote above, IOMMU behavior on error is up to the platform/devices. I
>>> think it also depends on the value of the Uncorrectable Error Mask Register
>>> in the AER registers of each device.
>>>
>>>>> Case 1: Harmless
>>>>> Case 2: Not tested
>>>>> Case 3a: Not tested
>>>>> Case 3b: Cause problem, fixed by my patch
>>>>> Case 4: Cause problem, fixed by my patch
>>>>>
>>>>> I have never tested cases 2 and 3a, but I think they also cause problems.
>>>>
>>>> I do not believe we need to support case 3b (IOMMU on in system kernel
>>>> but disabled in kdump kernel). There is no way to make that reliable
>>>> unless every single device that may perform DMA is reset, and since
>>>> you don't reset legacy PCI or VGA devices, you're not even proposing
>>>> to do that.
>>>>
>>>> I think we need to support case 1 (for systems with no IOMMU at all)
>>>> and case 4 (IOMMU enabled in both system kernel and kdump kernel). If
>>>> the kdump kernel can detect whether the IOMMU is enabled, that should
>>>> be enough -- it could figure out automatically whether we're in case 1
>>>> or 4.
>>>
>>> Ok, I completely agree.
>>>
>>>
>>>>>> Do you have any bugzilla references or problem report URLs you could
>>>>>> include here?
>>>>>
>>>>> I know of three Red Hat bugzillas, but I think these are private articles and
>>>>> you cannot see them. I'll add you to the Cc list in bz so that you can see them.
>>>>>
>>>>> BZ#743790: Kdump fails with intel_iommu=on
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=743790
>>>>>
>>>>> BZ#743495: Hardware error is detected with intel_iommu=on
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=743495
>>>>>
>>>>> BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=833299
>>>>
>>>> Thanks for adding me to the CC lists. I looked at all three and I'm not
>>>> sure there's anything sensitive in them. It'd be nice if they could
>>>> be made public if there's not.
>>>
>>> I really appreciate your comments! I'll confirm I can make it public.
>>
>> I would greatly appreciate being able to see the bugzillas relating to
>> this patch.
>>>
>>> Thanks,
>>> Takao Indoh
>>
>> Thinking out of the box:
>> Much of the discussion about dealing with the ongoing DMA leftover
>> from the system kernel has assumed that the crash kernel will reset
>> the IOMMU -- which causes various problems if done while there is any DMA
>> still active -- which leads to the idea of stopping all of the DMA.
>>
>> Suppose the crash kernel does not reset the hardware IOMMU, but simply
>> detects that it is active, resets only the devices that are necessary
>> for the crash kernel to operate, and re-programs only the translations
>
> This suggestion brings up this one:
> Only reset the devices that are generating IOMMU faults, i.e., add a kdump kernel
> hook to the IOMMU fault handler. While the kdump kernel is re-initing, IOMMU faults are grabbed
> by the kexec'd kernel and the faulting device is reset. IOW, a directed, per-device kdump reset.
> If there are no faults, then the device is idle, or re-init'd and using new maps, and
> all is well. Once the kdump kernel is fully init'd, just return from the
> kdump kernel callout, and let the system do its fault handling.
> It can be made mostly common (reset code in a kexec module under drivers/iommu),
> with simple calls out from each IOMMU fault handler.

What I tried before is:
1) Prepare a work queue handler
2) In the IOMMU fault handler, wake up the work queue handler
3) In the work queue handler, reset the device that is the error
source.

As a result, this method did not work. The device is reset during its
initialization. Please take a look at the message below. This is a
message logged when the megaraid_sas driver is loaded in the second kernel.

megasas: 00.00.05.40-rh2 Thu. Aug. 4 17:00:00 PDT 2011
megaraid_sas 0000:01:00.0: resetting MSI-X
megasas: 0x1000:0x0073:0x1734:0x1177: bus 1:slot 0:func 0
megaraid_sas 0000:01:00.0: PCI INT A -> GSI 28 (level, low) -> IRQ 28
megaraid_sas 0000:01:00.0: setting latency timer to 64
megasas: Waiting for FW to come to ready state
DRHD: handling fault status reg 702
DMAR:[DMA Write] Request device [01:00.0] fault addr ffff9000
DMAR:[fault reason 05] PTE Write access is not set
megasas: FW now in Ready state
alloc irq_desc for 51 on node -1
alloc kstat_irqs on node -1
megaraid_sas 0000:01:00.0: irq 51 for MSI/MSI-X
megasas_init_mfi: fw_support_ieee=67108864


Note that the DMAR error is detected during driver initialization. So when I
added the reset method I mentioned above, the device was reset at this point,
and after that, megasas could not find its disks and kdump failed.

Thanks,
Takao Indoh


>
> Of course, this assumes that systems don't turn IOMMU faults into system SERRs
> and crash the system. IMO, as I've stated to a number of system developers,
> IOMMU (mapping) faults should not crash the system -- they already isolate and
> prevent corruption, so worst case, some part of the system will stop doing I/O,
> and that will either get retried, or aborted, and a cleaner panic (and kdump
> kernel switch) will ensue.
>
>> for those devices. All other translations remain the same (and remain valid)
>> so all leftover DMA continues into its buffer in the system kernel area
>> where it is harmless. New translations needed by the kdump kernel are
>> added to the existing tables.
>>
> A crashed system shouldn't assume that ongoing DMA is sane.... it should be stopped.
> I would expect the kdump kernel to reset devices & acquire new DMA mappings
> upon reboot. Thus, the only issue is how to 'throttle' IOMMU faults and
> not allow them to crash the system while it is recovering/resetting itself,
> even though there is no one big (power) reset to quiesce everything.
>
>> I have not yet tried this, so I am not ready to propose it as anything more
>> than a discussion topic at this time.
>>
>> It might work this way: (A small modification to case 3a above)
>>
>> IOMMU on in system kernel, kdump kernel accepts active IOMMU
>> system kernel enables IOMMU
>> DMA targets IOMMU-mapped system-kernel memory
>> kexec to kdump kernel (IOMMU on, devices untouched)
>> DMA targets IOMMU-mapped system-kernel memory
> it may not, it may be bad/bogus device I/O causing the crash...
>
>> kdump kernel detects active IOMMU and doesn't touch it
>> DMA targets IOMMU-mapped system-kernel memory
>> kdump kernel does not re-initialize IOMMU hardware
>> kdump kernel initializes IOMMU in-memory management structures
>> kdump kernel calls device drivers' standard initialization functions
>> Drivers initialize their own devices -- DMA from that device stops
>> When drivers request new DMA mappings, the kdump IOMMU driver:
>> 1. Updates its in-memory mgt structures for that device & range
>> 2. Updates IOMMU translate tables for that device & range
>> . Translations for all other devices & ranges are unchanged
>> 3. Flushes IOMMU TLB to force IOMMU hardware update
>>
>> Notes:
>> A. This seems very similar to case 1
>>
>> B. Only touch devices with drivers loaded into kdump kernel.
>> . No need to touch devices that kdump is not going to use.
>>
>> C. This assumes that all DMA device drivers used by kdump will
>> initialize the device before requesting DMA mappings.
>> . Seems reasonable, but need to confirm how many do (or don't)
>> . Only device drivers that might be included in the kdump
>> kernel need to observe this initialization ordering.
>>
>> D. Could copy system kernel's translate tables into kdump kernel
>> and adjust pointers if this feels more trustworthy than using
>> the original structures where they are in the system kernel.
>>
>> E. Handle IRQ remappings in a similar manner.
> An even more serious attack vector: flood the system with interrupts (by an assigned device in a guest),
> effectively creating a DoS, because interrupts cause a raised (hw) IPL, while bogus DMA does
> not cause system IPL elevation (blockage of other system progress), except when it
> generates IOMMU faults, which become interrupts.
> hmmm, wondering if another solution is to flip IOMMU fault handling into a poll mode
> at kdump init time, and then flip it back to interrupt-driven once I/O has been completely init'd?
> Per-device fault throttling would be a nice hw feature! (wishful thinking)
>
>
>>
>> Thanks,
>> Bill Sumner
>>
>>
>
>
>

2013-06-13 03:41:53

by Bjorn Helgaas

Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh <[email protected]> wrote:
> (2013/06/12 13:45), Bjorn Helgaas wrote:
>> [+cc Vivek, Haren; sorry I didn't think to add you earlier]
>>
>> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
>> <[email protected]> wrote:
>>> (2013/06/11 11:20), Bjorn Helgaas wrote:
>>
>>>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>>>> yet, but the current hook isn't anchored anywhere -- it's just an
>>>> fs_initcall() that doesn't give the reader any clue about the
>>>> connection between the reset and the problem it's solving.
>>>>
>>>> If we do something like this patch, I think it needs to be done at the
>>>> point where we enable or disable the IOMMU. That way, it's connected
>>>> to the important event, and there's a clue about how to make
>>>> corresponding fixes for other IOMMUs.
>>>
>>> Ok. Is pci_iommu_init() an appropriate place to add this hook?
>>
>> I looked at various IOMMU init places today, and it's far more
>> complicated and varied than I had hoped.
>>
>> This reset scheme depends on enumerating PCI devices before we
>> initialize the IOMMU used by those devices. x86 works that way today,
>> but not all architectures do (see the sparc pci_fire_pbm_init(), for
>
> Sorry, could you tell me which part depends on architecture?

Your patch works if PCIe devices are reset before the kdump kernel
enables the IOMMU. On x86, this is possible because PCI enumeration
happens before the IOMMU initialization. On sparc, the IOMMU is
initialized before PCI devices are enumerated, so there would still be
a window where ongoing DMA could cause an IOMMU error.

Of course, it might be possible to reorganize the sparc code to do the
IOMMU init *after* it enumerates PCI devices. But I think that change
would be hard to justify.

And I think even on x86, it would be better if we did the IOMMU init
before PCI enumeration -- the PCI devices depend on the IOMMU, so
logically the IOMMU should be initialized first so the PCI devices can
be associated with it as they are enumerated.

>> example). And I think conceptually, the IOMMU should be enumerated
>> and initialized *before* the devices that use it.
>>
>> So I'm uncomfortable with that aspect of this scheme.
>>
>> It would be at least conceivable to reset the devices in the system
>> kernel, before the kexec. I know we want to do as little as possible
>> in the crashing kernel, but it's at least a possibility, and it might
>> be cleaner.
>
> I bet this will not be accepted by the kdump maintainer. Everything in the panic
> kernel is unreliable.

kdump is inherently unreliable. The kdump kernel doesn't start from
an arbitrary machine state. We don't expect it to tolerate all CPUs
running, for example. Maybe it shouldn't be expected to tolerate PCI
devices running, either.

Bjorn

2013-06-14 02:11:38

by Takao Indoh

Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

(2013/06/13 12:41), Bjorn Helgaas wrote:
> On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh <[email protected]> wrote:
>> (2013/06/12 13:45), Bjorn Helgaas wrote:
>>> [+cc Vivek, Haren; sorry I didn't think to add you earlier]
>>>
>>> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
>>> <[email protected]> wrote:
>>>> (2013/06/11 11:20), Bjorn Helgaas wrote:
>>>
>>>>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>>>>> yet, but the current hook isn't anchored anywhere -- it's just an
>>>>> fs_initcall() that doesn't give the reader any clue about the
>>>>> connection between the reset and the problem it's solving.
>>>>>
>>>>> If we do something like this patch, I think it needs to be done at the
>>>>> point where we enable or disable the IOMMU. That way, it's connected
>>>>> to the important event, and there's a clue about how to make
>>>>> corresponding fixes for other IOMMUs.
>>>>
>>>> Ok. Is pci_iommu_init() an appropriate place to add this hook?
>>>
>>> I looked at various IOMMU init places today, and it's far more
>>> complicated and varied than I had hoped.
>>>
>>> This reset scheme depends on enumerating PCI devices before we
>>> initialize the IOMMU used by those devices. x86 works that way today,
>>> but not all architectures do (see the sparc pci_fire_pbm_init(), for
>>
>> Sorry, could you tell me which part depends on architecture?
>
> Your patch works if PCIe devices are reset before the kdump kernel
> enables the IOMMU. On x86, this is possible because PCI enumeration
> happens before the IOMMU initialization. On sparc, the IOMMU is
> initialized before PCI devices are enumerated, so there would still be
> a window where ongoing DMA could cause an IOMMU error.

Ok, understood, thanks.

Hmmm, it seems difficult to find a method that is common to all
architectures. So, what I can do for now is introduce a reset scheme
that works only on x86.

1) Change this patch so that it works only on the x86 platform. For example,
call this reset code from x86_init.iommu.iommu_init() instead of
fs_initcall.

Or another idea is:

2) Enumerate PCI devices in the IOMMU layer. That is:
PCI layer
Just provide an interface to reset a given struct pci_dev. Maybe
pci_reset_function() looks good for this purpose.
IOMMU layer
Determine which devices should be reset. On kernel boot, check if
the IOMMU is already active, and if active, check the IOMMU page
table and reset devices whose entry exists there.

> Of course, it might be possible to reorganize the sparc code to to the
> IOMMU init *after* it enumerates PCI devices. But I think that change
> would be hard to justify.
>
> And I think even on x86, it would be better if we did the IOMMU init
> before PCI enumeration -- the PCI devices depend on the IOMMU, so
> logically the IOMMU should be initialized first so the PCI devices can
> be associated with it as they are enumerated.

So third idea is:

3) Do the reset before PCI enumeration (arch_initcall_sync or somewhere).
We need to implement new code to enumerate PCI devices and reset them
for this purpose.

Idea 2 is not difficult to implement, but one problem is that this
method may be dangerous. We need to scan the IOMMU page table that was
used by the previous kernel, but it may be broken. Idea 3 seems to be
difficult to implement...


>
>>> example). And I think conceptually, the IOMMU should be enumerated
>>> and initialized *before* the devices that use it.
>>>
>>> So I'm uncomfortable with that aspect of this scheme.
>>>
>>> It would be at least conceivable to reset the devices in the system
>>> kernel, before the kexec. I know we want to do as little as possible
>>> in the crashing kernel, but it's at least a possibility, and it might
>>> be cleaner.
>>
>> I bet this will be not accepted by kdump maintainer. Everything in panic
>> kernel is unreliable.
>
> kdump is inherently unreliable. The kdump kernel doesn't start from
> an arbitrary machine state. We don't expect it to tolerate all CPUs
> running, for example. Maybe it should be expected to tolerate PCI
> devices running, either.

What I wanted to say is that any resources of the first kernel are
unreliable. Under a panic situation, the struct pci_dev tree may be
broken, or pci_lock may already be held by someone, etc. So, if we do
this in the first kernel, maybe kdump needs its own code to enumerate
PCI devices and reset them.

Vivek?

Thanks,
Takao Indoh

2013-07-24 06:30:59

by Takao Indoh

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

Sorry for letting this discussion slide, I was busy with other work :-(
Anyway, the summary of the previous discussion is:
- My patch adds a new initcall (fs_initcall) to reset all PCIe endpoints
on boot. This expects that PCI enumeration is done before IOMMU
initialization, as follows.
(1) PCI enumeration
(2) fs_initcall ---> device reset
(3) IOMMU initialization
- This works on x86, but does not work on some other architectures
because the IOMMU is initialized before PCI enumeration there. So, the
device reset should be done where the IOMMU is initialized, instead of
in an initcall.
- Or, as another idea, we can reset devices in the first kernel (panic
kernel)

Resetting devices in the panic kernel is against kdump policy and seems
not to be a good idea. So I think adding reset code into the IOMMU
initialization is better. I'll post patches for that.

Another discussion point is how to handle buggy devices. Resetting buggy
devices makes the system more unstable. One idea is a boot parameter so
that the user can choose whether to reset devices. The existing
parameter "reset_devices" is not helpful for this purpose because it is
always necessary for kdump and the user cannot get rid of it.
Introducing a new boot parameter seems to be unpopular with maintainers.
Any ideas?

Thanks,
Takao Indoh

(2013/06/14 11:11), Takao Indoh wrote:
> (2013/06/13 12:41), Bjorn Helgaas wrote:
>> On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh <[email protected]> wrote:
>>> (2013/06/12 13:45), Bjorn Helgaas wrote:
>>>> [+cc Vivek, Haren; sorry I didn't think to add you earlier]
>>>>
>>>> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
>>>> <[email protected]> wrote:
>>>>> (2013/06/11 11:20), Bjorn Helgaas wrote:
>>>>
>>>>>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>>>>>> yet, but the current hook isn't anchored anywhere -- it's just an
>>>>>> fs_initcall() that doesn't give the reader any clue about the
>>>>>> connection between the reset and the problem it's solving.
>>>>>>
>>>>>> If we do something like this patch, I think it needs to be done at the
>>>>>> point where we enable or disable the IOMMU. That way, it's connected
>>>>>> to the important event, and there's a clue about how to make
>>>>>> corresponding fixes for other IOMMUs.
>>>>>
>>>>> Ok. pci_iommu_init() is appropriate place to add this hook?
>>>>
>>>> I looked at various IOMMU init places today, and it's far more
>>>> complicated and varied than I had hoped.
>>>>
>>>> This reset scheme depends on enumerating PCI devices before we
>>>> initialize the IOMMU used by those devices. x86 works that way today,
>>>> but not all architectures do (see the sparc pci_fire_pbm_init(), for
>>>
>>> Sorry, could you tell me which part depends on architecture?
>>
>> Your patch works if PCIe devices are reset before the kdump kernel
>> enables the IOMMU. On x86, this is possible because PCI enumeration
>> happens before the IOMMU initialization. On sparc, the IOMMU is
>> initialized before PCI devices are enumerated, so there would still be
>> a window where ongoing DMA could cause an IOMMU error.
>
> Ok, understood, thanks.
>
> Hmmm, it seems to be difficult to find out method which is common to
> all architectures. So, what I can do for now is introducing reset scheme
> which is only for x86.
>
> 1) Change this patch so that it work only on x86 platform. For example
> call this reset code from x86_init.iommu.iommu_init() instead of
> fs_initcall.
>
> Or another idea is:
>
> 2) Enumerate PCI devices in IOMMU layer. That is:
> PCI layer
> Just provide interface to reset given strcut pci_dev. Maybe
> pci_reset_function() looks good for this purpose.
> IOMMU layer
> Determine which devices should be reset. On kernel boot, check if
> IOMMU is already active or not, and if active, check IOMMU page
> table and reset devices whose entry exists there.
>
>> Of course, it might be possible to reorganize the sparc code to to the
>> IOMMU init *after* it enumerates PCI devices. But I think that change
>> would be hard to justify.
>>
>> And I think even on x86, it would be better if we did the IOMMU init
>> before PCI enumeration -- the PCI devices depend on the IOMMU, so
>> logically the IOMMU should be initialized first so the PCI devices can
>> be associated with it as they are enumerated.
>
> So third idea is:
>
> 3) Do reset before PCI enumeration(arch_initcall_sync or somewhere). We
> need to implement new code to enumerate PCI devices and reset them
> for this purpose.
>
> Idea 2 is not difficult to implement, but one problem is that this
> method may be dangerous. We need to scan IOMMU page table which is used
> in previous kernel, but it may be broken. Idea 3 seems to be difficult
> to implement...
>
>
>>
>>>> example). And I think conceptually, the IOMMU should be enumerated
>>>> and initialized *before* the devices that use it.
>>>>
>>>> So I'm uncomfortable with that aspect of this scheme.
>>>>
>>>> It would be at least conceivable to reset the devices in the system
>>>> kernel, before the kexec. I know we want to do as little as possible
>>>> in the crashing kernel, but it's at least a possibility, and it might
>>>> be cleaner.
>>>
>>> I bet this will be not accepted by kdump maintainer. Everything in panic
>>> kernel is unreliable.
>>
>> kdump is inherently unreliable. The kdump kernel doesn't start from
>> an arbitrary machine state. We don't expect it to tolerate all CPUs
>> running, for example. Maybe it should be expected to tolerate PCI
>> devices running, either.
>
> What I wanted to say is that any resources of first kernel are
> unreliable. Under panic situation, struct pci_dev tree may be broken, or
> pci_lock may be already hold by someone, etc. So, if we do this in first
> kernel, maybe kdump needs its own code to enumerate PCI devices and
> reset them.
>
> Vivek?
>
> Thanks,
> Takao Indoh
>
>
>

2013-07-25 14:26:09

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

On Wed, Jul 24, 2013 at 03:29:58PM +0900, Takao Indoh wrote:
> Sorry for letting this discussion slide, I was busy on other works:-(
> Anyway, the summary of previous discussion is:
> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
> boot. This expects PCI enumeration is done before IOMMU
> initialization as follows.
> (1) PCI enumeration
> (2) fs_initcall ---> device reset
> (3) IOMMU initialization
> - This works on x86, but does not work on other architecture because
> IOMMU is initialized before PCI enumeration on some architectures. So,
> device reset should be done where IOMMU is initialized instead of
> initcall.
> - Or, as another idea, we can reset devices in first kernel(panic kernel)
>
> Resetting devices in panic kernel is against kdump policy and seems not to
> be good idea. So I think adding reset code into iommu initialization is
> better. I'll post patches for that.

I don't understand all the details, but I agree that the idea of trying
to reset the IOMMU in the crashed kernel might not fly.

>
> Another discussion point is how to handle buggy devices. Resetting buggy
> devices makes system more unstable. One of ideas is using boot parameter
> so that user can choose to reset devices or not.

So who would decide which device is buggy and should not be reset?
Please give some details here.

Can't we simply blacklist the associated module, so that it never loads
and then it never tries to reset the devices?

Thanks
Vivek

2013-07-25 17:01:10

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh
<[email protected]> wrote:
> Sorry for letting this discussion slide, I was busy on other works:-(
> Anyway, the summary of previous discussion is:
> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
> boot. This expects PCI enumeration is done before IOMMU
> initialization as follows.
> (1) PCI enumeration
> (2) fs_initcall ---> device reset
> (3) IOMMU initialization
> - This works on x86, but does not work on other architecture because
> IOMMU is initialized before PCI enumeration on some architectures. So,
> device reset should be done where IOMMU is initialized instead of
> initcall.
> - Or, as another idea, we can reset devices in first kernel(panic kernel)
>
> Resetting devices in panic kernel is against kdump policy and seems not to
> be good idea. So I think adding reset code into iommu initialization is
> better. I'll post patches for that.

Of course nobody *wants* to do anything in the panic kernel. But
simply saying "it's against kdump policy and seems not to be a good
idea" is not a technical argument. There are things that are
impractical to do in the kdump kernel, so they have to be done in the
panic kernel even though we know the kernel is unreliable and the
attempt may fail.

My point about IOMMU and PCI initialization order doesn't go away just
because it doesn't fit "kdump policy." Having system initialization
occur in a logical order is far more important than making kdump work.

Bjorn

2013-07-29 00:20:50

by Takao Indoh

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

(2013/07/25 23:24), Vivek Goyal wrote:
> On Wed, Jul 24, 2013 at 03:29:58PM +0900, Takao Indoh wrote:
>> Sorry for letting this discussion slide, I was busy on other works:-(
>> Anyway, the summary of previous discussion is:
>> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
>> boot. This expects PCI enumeration is done before IOMMU
>> initialization as follows.
>> (1) PCI enumeration
>> (2) fs_initcall ---> device reset
>> (3) IOMMU initialization
>> - This works on x86, but does not work on other architecture because
>> IOMMU is initialized before PCI enumeration on some architectures. So,
>> device reset should be done where IOMMU is initialized instead of
>> initcall.
>> - Or, as another idea, we can reset devices in first kernel(panic kernel)
>>
>> Resetting devices in panic kernel is against kdump policy and seems not to
>> be good idea. So I think adding reset code into iommu initialization is
>> better. I'll post patches for that.
>
> I don't understand all the details but I agree that idea of trying to
> reset IOMMU in crashed kernel might not fly.
>
>>
>> Another discussion point is how to handle buggy devices. Resetting buggy
>> devices makes system more unstable. One of ideas is using boot parameter
>> so that user can choose to reset devices or not.
>
> So who would decide which device is buggy and don't reset it. Give
> some details here.

I found a case where kdump does not work after resetting devices but
works when the reset patch is removed. The cause of the problem is a
bug in a PCIe switch chip. If there were a boot parameter not to reset
devices, the user could use it as a workaround.

I think in this case we should add a PCI quirk to avoid this buggy
hardware, but we need to wait for errata from the vendor, and that
basically takes a long time.

>
> Can't we simply blacklist associated module, so that it never loads
> and then it never tries to reset the devices?
>

So you mean that the device reset should be done when its driver loads?

Thanks,
Takao Indoh

2013-07-29 00:38:36

by Takao Indoh

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

(2013/07/26 2:00), Bjorn Helgaas wrote:
> On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh
> <[email protected]> wrote:
>> Sorry for letting this discussion slide, I was busy on other works:-(
>> Anyway, the summary of previous discussion is:
>> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
>> boot. This expects PCI enumeration is done before IOMMU
>> initialization as follows.
>> (1) PCI enumeration
>> (2) fs_initcall ---> device reset
>> (3) IOMMU initialization
>> - This works on x86, but does not work on other architecture because
>> IOMMU is initialized before PCI enumeration on some architectures. So,
>> device reset should be done where IOMMU is initialized instead of
>> initcall.
>> - Or, as another idea, we can reset devices in first kernel(panic kernel)
>>
>> Resetting devices in panic kernel is against kdump policy and seems not to
>> be good idea. So I think adding reset code into iommu initialization is
>> better. I'll post patches for that.
>
> Of course nobody *wants* to do anything in the panic kernel. But
> simply saying "it's against kdump policy and seems not to be a good
> idea" is not a technical argument. There are things that are
> impractical to do in the kdump kernel, so they have to be done in the
> panic kernel even though we know the kernel is unreliable and the
> attempt may fail.

Accessing kernel data in the panic kernel can cause another panic, so:
- Don't touch kernel data in a panic situation
- Jump to the kdump kernel as quickly as possible, and do things in the
safe kernel
These are the basic "kdump policy". Of course, if there is any work that
we cannot do in the kdump kernel and can do only in the panic kernel,
for example saving registers or stopping CPUs, we should do it in the
panic kernel.

Resetting devices in the panic kernel is worth considering if we can
safely find the pci_dev and reset it, but I have no idea how to do that
because, for example, struct pci_dev may be broken.

>
> My point about IOMMU and PCI initialization order doesn't go away just
> because it doesn't fit "kdump policy." Having system initialization
> occur in a logical order is far more important than making kdump work.

My next plan is as follows. I think this matches the logical order on
boot.

drivers/pci/pci.c
- Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus)

drivers/iommu/intel-iommu.c
- On initialization, if IOMMU is already enabled, call this bus reset
function before disabling and re-enabling IOMMU.

Thanks,
Takao Indoh

2013-07-29 14:17:56

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh <[email protected]> wrote:
> (2013/07/26 2:00), Bjorn Helgaas wrote:
>> On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh
>> <[email protected]> wrote:
>>> Sorry for letting this discussion slide, I was busy on other works:-(
>>> Anyway, the summary of previous discussion is:
>>> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
>>> boot. This expects PCI enumeration is done before IOMMU
>>> initialization as follows.
>>> (1) PCI enumeration
>>> (2) fs_initcall ---> device reset
>>> (3) IOMMU initialization
>>> - This works on x86, but does not work on other architecture because
>>> IOMMU is initialized before PCI enumeration on some architectures. So,
>>> device reset should be done where IOMMU is initialized instead of
>>> initcall.
>>> - Or, as another idea, we can reset devices in first kernel(panic kernel)
>>>
>>> Resetting devices in panic kernel is against kdump policy and seems not to
>>> be good idea. So I think adding reset code into iommu initialization is
>>> better. I'll post patches for that.
>>
>> Of course nobody *wants* to do anything in the panic kernel. But
>> simply saying "it's against kdump policy and seems not to be a good
>> idea" is not a technical argument. There are things that are
>> impractical to do in the kdump kernel, so they have to be done in the
>> panic kernel even though we know the kernel is unreliable and the
>> attempt may fail.
>
> Accessing kernel data in panic kernel causes panic again, so
> - Don't touch kernel data in panic situation
> - Jump to kdump kernel as quickly as possible, and do things in safe
> kernel
> These are basic "kdump policy". Of course if there are any works which
> we cannot do in kdump kernel and can do only in panic kernel, for
> example saving registers or stopping cpus, we should do them in panic
> kernel.
>
> Resetting devices in panic kernel is worth considering if we can safely
> find pci_dev and reset it, but I have no idea how to do that because
> for example struct pci_dev may be borken.

Nobody can guarantee that the panic kernel can do *anything* safely
because any arbitrary kernel data or text may be corrupted. But if
you consider any specific data structure, e.g., CPU or PCI device
lists, it's not very likely that it will be corrupted.

>> My point about IOMMU and PCI initialization order doesn't go away just
>> because it doesn't fit "kdump policy." Having system initialization
>> occur in a logical order is far more important than making kdump work.
>
> My next plan is as follows. I think this is matched to logical order
> on boot.
>
> drivers/pci/pci.c
> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus)
>
> drivers/iommu/intel-iommu.c
> - On initialization, if IOMMU is already enabled, call this bus reset
> function before disabling and re-enabling IOMMU.

I raised this issue because of arches like sparc that enumerate the
IOMMU before the PCI devices that use it. In that situation, I think
you're proposing this:

panic kernel
enable IOMMU
panic
kdump kernel
initialize IOMMU (already enabled)
pci_reset_bus
disable IOMMU
enable IOMMU
enumerate PCI devices

But the problem is that when you call pci_reset_bus(), you haven't
enumerated the PCI devices, so you don't know what to reset.

Bjorn

2013-07-30 06:10:35

by Takao Indoh

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

(2013/07/29 23:17), Bjorn Helgaas wrote:
> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh <[email protected]> wrote:
>> (2013/07/26 2:00), Bjorn Helgaas wrote:
>>> On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh
>>> <[email protected]> wrote:
>>>> Sorry for letting this discussion slide, I was busy on other works:-(
>>>> Anyway, the summary of previous discussion is:
>>>> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
>>>> boot. This expects PCI enumeration is done before IOMMU
>>>> initialization as follows.
>>>> (1) PCI enumeration
>>>> (2) fs_initcall ---> device reset
>>>> (3) IOMMU initialization
>>>> - This works on x86, but does not work on other architecture because
>>>> IOMMU is initialized before PCI enumeration on some architectures. So,
>>>> device reset should be done where IOMMU is initialized instead of
>>>> initcall.
>>>> - Or, as another idea, we can reset devices in first kernel(panic kernel)
>>>>
>>>> Resetting devices in panic kernel is against kdump policy and seems not to
>>>> be good idea. So I think adding reset code into iommu initialization is
>>>> better. I'll post patches for that.
>>>
>>> Of course nobody *wants* to do anything in the panic kernel. But
>>> simply saying "it's against kdump policy and seems not to be a good
>>> idea" is not a technical argument. There are things that are
>>> impractical to do in the kdump kernel, so they have to be done in the
>>> panic kernel even though we know the kernel is unreliable and the
>>> attempt may fail.
>>
>> Accessing kernel data in panic kernel causes panic again, so
>> - Don't touch kernel data in panic situation
>> - Jump to kdump kernel as quickly as possible, and do things in safe
>> kernel
>> These are basic "kdump policy". Of course if there are any works which
>> we cannot do in kdump kernel and can do only in panic kernel, for
>> example saving registers or stopping cpus, we should do them in panic
>> kernel.
>>
>> Resetting devices in panic kernel is worth considering if we can safely
>> find pci_dev and reset it, but I have no idea how to do that because
>> for example struct pci_dev may be borken.
>
> Nobody can guarantee that the panic kernel can do *anything* safely
> because any arbitrary kernel data or text may be corrupted. But if
> you consider any specific data structure, e.g., CPU or PCI device
> lists, it's not very likely that it will be corrupted.

To reset a device we need to scan the PCI device tree using
for_each_pci_dev. Something like bust_spinlocks() to forcibly clear
pci_lock is needed. Vivek, is adding these into kdump acceptable to you?
Or any other ideas? I think iterating over a list like for_each_pci_dev
is dangerous.

>
>>> My point about IOMMU and PCI initialization order doesn't go away just
>>> because it doesn't fit "kdump policy." Having system initialization
>>> occur in a logical order is far more important than making kdump work.
>>
>> My next plan is as follows. I think this is matched to logical order
>> on boot.
>>
>> drivers/pci/pci.c
>> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus)
>>
>> drivers/iommu/intel-iommu.c
>> - On initialization, if IOMMU is already enabled, call this bus reset
>> function before disabling and re-enabling IOMMU.
>
> I raised this issue because of arches like sparc that enumerate the
> IOMMU before the PCI devices that use it. In that situation, I think
> you're proposing this:
>
> panic kernel
> enable IOMMU
> panic
> kdump kernel
> initialize IOMMU (already enabled)
> pci_reset_bus
> disable IOMMU
> enable IOMMU
> enumerate PCI devices
>
> But the problem is that when you call pci_reset_bus(), you haven't
> enumerated the PCI devices, so you don't know what to reset.

Right, so my idea is adding the reset code into intel-iommu.c. The
intel-iommu initialization is based on the assumption that enumeration
of PCI devices is already done. We can find the target devices from the
IOMMU page table instead of scanning all devices in the PCI tree.

Therefore, this idea is only for intel-iommu. Other architectures would
need to implement their own reset code.

Thanks,
Takao Indoh

2013-07-30 15:59:39

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh
<[email protected]> wrote:
> (2013/07/29 23:17), Bjorn Helgaas wrote:
>> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh <[email protected]> wrote:
>>> (2013/07/26 2:00), Bjorn Helgaas wrote:

>>>> My point about IOMMU and PCI initialization order doesn't go away just
>>>> because it doesn't fit "kdump policy." Having system initialization
>>>> occur in a logical order is far more important than making kdump work.
>>>
>>> My next plan is as follows. I think this is matched to logical order
>>> on boot.
>>>
>>> drivers/pci/pci.c
>>> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus)
>>>
>>> drivers/iommu/intel-iommu.c
>>> - On initialization, if IOMMU is already enabled, call this bus reset
>>> function before disabling and re-enabling IOMMU.
>>
>> I raised this issue because of arches like sparc that enumerate the
>> IOMMU before the PCI devices that use it. In that situation, I think
>> you're proposing this:
>>
>> panic kernel
>> enable IOMMU
>> panic
>> kdump kernel
>> initialize IOMMU (already enabled)
>> pci_reset_bus
>> disable IOMMU
>> enable IOMMU
>> enumerate PCI devices
>>
>> But the problem is that when you call pci_reset_bus(), you haven't
>> enumerated the PCI devices, so you don't know what to reset.
>
> Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu
> initialization is based on the assumption that enumeration of PCI devices
> is already done. We can find target device from IOMMU page table instead
> of scanning all devices in pci tree.
>
> Therefore, this idea is only for intel-iommu. Other architectures need
> to implement their own reset code.

That's my point. I'm opposed to adding code to PCI when it only
benefits x86 and we know other arches will need a fundamentally
different design. I would rather have a design that can work for all
arches.

If your implementation is totally implemented under arch/x86 (or in
intel-iommu.c, I guess), I can't object as much. However, I think
that eventually even x86 should enumerate the IOMMUs via ACPI before
we enumerate PCI devices.

It's pretty clear that's how BIOS designers expect the OS to work.
For example, sec 8.7.3 of the Intel Virtualization Technology for
Directed I/O spec, rev 1.3, shows the expectation that remapping
hardware (IOMMU) is initialized before discovering the I/O hierarchy
below a hot-added host bridge. Obviously you're not talking about a
hot-add scenario, but I think the same sequence should apply at
boot-time as well.

Bjorn

2013-07-31 01:08:12

by Takao Indoh

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

(2013/07/31 0:59), Bjorn Helgaas wrote:
> On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh
> <[email protected]> wrote:
>> (2013/07/29 23:17), Bjorn Helgaas wrote:
>>> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh <[email protected]> wrote:
>>>> (2013/07/26 2:00), Bjorn Helgaas wrote:
>
>>>>> My point about IOMMU and PCI initialization order doesn't go away just
>>>>> because it doesn't fit "kdump policy." Having system initialization
>>>>> occur in a logical order is far more important than making kdump work.
>>>>
>>>> My next plan is as follows. I think this is matched to logical order
>>>> on boot.
>>>>
>>>> drivers/pci/pci.c
>>>> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus)
>>>>
>>>> drivers/iommu/intel-iommu.c
>>>> - On initialization, if IOMMU is already enabled, call this bus reset
>>>> function before disabling and re-enabling IOMMU.
>>>
>>> I raised this issue because of arches like sparc that enumerate the
>>> IOMMU before the PCI devices that use it. In that situation, I think
>>> you're proposing this:
>>>
>>> panic kernel
>>> enable IOMMU
>>> panic
>>> kdump kernel
>>> initialize IOMMU (already enabled)
>>> pci_reset_bus
>>> disable IOMMU
>>> enable IOMMU
>>> enumerate PCI devices
>>>
>>> But the problem is that when you call pci_reset_bus(), you haven't
>>> enumerated the PCI devices, so you don't know what to reset.
>>
>> Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu
>> initialization is based on the assumption that enumeration of PCI devices
>> is already done. We can find target device from IOMMU page table instead
>> of scanning all devices in pci tree.
>>
>> Therefore, this idea is only for intel-iommu. Other architectures need
>> to implement their own reset code.
>
> That's my point. I'm opposed to adding code to PCI when it only
> benefits x86 and we know other arches will need a fundamentally
> different design. I would rather have a design that can work for all
> arches.
>
> If your implementation is totally implemented under arch/x86 (or in
> intel-iommu.c, I guess), I can't object as much. However, I think
> that eventually even x86 should enumerate the IOMMUs via ACPI before
> we enumerate PCI devices.
>
> It's pretty clear that's how BIOS designers expect the OS to work.
> For example, sec 8.7.3 of the Intel Virtualization Technology for
> Directed I/O spec, rev 1.3, shows the expectation that remapping
> hardware (IOMMU) is initialized before discovering the I/O hierarchy
> below a hot-added host bridge. Obviously you're not talking about a
> hot-add scenario, but I think the same sequence should apply at
> boot-time as well.

Of course I won't add something just for x86 into the common PCI layer.
I attach my new patch, though it is not well tested yet.

On x86, IOMMU initialization currently runs *after* PCI enumeration, but
what you are talking about is that this should be changed so that x86
IOMMU initialization is done *before* PCI enumeration, like sparc,
right?

Hmm, OK, I think I need to post the attached patch to the iommu list and
discuss it there, including the current order of x86 IOMMU
initialization.

Thanks,
Takao Indoh
---
drivers/iommu/intel-iommu.c | 55 +++++++++++++++++++++++++++++++++-
drivers/pci/pci.c | 53 ++++++++++++++++++++++++++++++++
include/linux/pci.h | 1
3 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index eec0d3e..fb8a546 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3663,6 +3663,56 @@ static struct notifier_block device_nb = {
.notifier_call = device_notifier,
};

+/* Reset PCI device if its entry exists in DMAR table */
+static void __init iommu_reset_devices(struct intel_iommu *iommu, u16 segment)
+{
+ u64 addr;
+ struct root_entry *root;
+ struct context_entry *context;
+ int bus, devfn;
+ struct pci_dev *dev;
+
+ addr = dmar_readq(iommu->reg + DMAR_RTADDR_REG);
+ if (!addr)
+ return;
+
+ /*
+ * In the case of kdump, ioremap is needed because root-entry table
+ * exists in first kernel's memory area which is not mapped in second
+ * kernel
+ */
+ root = (struct root_entry*)ioremap(addr, PAGE_SIZE);
+ if (!root)
+ return;
+
+ for (bus=0; bus<ROOT_ENTRY_NR; bus++) {
+ if (!root_present(&root[bus]))
+ continue;
+
+ context = (struct context_entry *)ioremap(
+ root[bus].val & VTD_PAGE_MASK, PAGE_SIZE);
+ if (!context)
+ continue;
+
+ for (devfn=0; devfn<ROOT_ENTRY_NR; devfn++) {
+ if (!context_present(&context[devfn]))
+ continue;
+
+ dev = pci_get_domain_bus_and_slot(segment, bus, devfn);
+ if (!dev)
+ continue;
+
+ if (!pci_reset_bus(dev->bus)) /* go to next bus */
+ break;
+ else /* Try per-function reset */
+ pci_reset_function(dev);
+
+ }
+ iounmap(context);
+ }
+ iounmap(root);
+}
+
int __init intel_iommu_init(void)
{
int ret = 0;
@@ -3687,8 +3737,11 @@ int __init intel_iommu_init(void)
continue;

iommu = drhd->iommu;
- if (iommu->gcmd & DMA_GCMD_TE)
+ if (iommu->gcmd & DMA_GCMD_TE) {
+ if (reset_devices)
+ iommu_reset_devices(iommu, drhd->segment);
iommu_disable_translation(iommu);
+ }
}

if (dmar_dev_scope_init() < 0) {
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e37fea6..c595997 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3392,6 +3392,59 @@ int pci_reset_function(struct pci_dev *dev)
EXPORT_SYMBOL_GPL(pci_reset_function);

/**
+ * pci_reset_bus - reset a PCI bus
+ * @bus: PCI bus to reset
+ *
+ * Returns 0 if the bus was successfully reset or negative if failed.
+ */
+int pci_reset_bus(struct pci_bus *bus)
+{
+ struct pci_dev *pdev;
+ u16 ctrl;
+
+ if (!bus->self)
+ return -ENOTTY;
+
+ list_for_each_entry(pdev, &bus->devices, bus_list)
+ if (pdev->subordinate)
+ return -ENOTTY;
+
+ /* Save config registers of children */
+ list_for_each_entry(pdev, &bus->devices, bus_list) {
+ dev_info(&pdev->dev, "Save state\n");
+ pci_save_state(pdev);
+ }
+
+ dev_info(&bus->self->dev, "Reset Secondary bus\n");
+
+ /* Assert Secondary Bus Reset */
+ pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl);
+ ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
+ pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl);
+
+ /* Read config again to flush previous write */
+ pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl);
+
+ msleep(2);
+
+ /* De-assert Secondary Bus Reset */
+ ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
+ pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl);
+
+ /* Wait for completion */
+ msleep(1000);
+
+ /* Restore config registers of children */
+ list_for_each_entry(pdev, &bus->devices, bus_list) {
+ dev_info(&pdev->dev, "Restore state\n");
+ pci_restore_state(pdev);
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(pci_reset_bus);
+
+/**
* pcix_get_max_mmrbc - get PCI-X maximum designed memory read byte count
* @dev: PCI device to query
*
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 0fd1f15..125fbc6 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -924,6 +924,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps);
int __pci_reset_function(struct pci_dev *dev);
int __pci_reset_function_locked(struct pci_dev *dev);
int pci_reset_function(struct pci_dev *dev);
+int pci_reset_bus(struct pci_bus *bus);
void pci_update_resource(struct pci_dev *dev, int resno);
int __must_check pci_assign_resource(struct pci_dev *dev, int i);
int __must_check pci_reassign_resource(struct pci_dev *dev, int i, resource_size_t add_size, resource_size_t align);

2013-07-31 03:12:47

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

On Wed, 2013-07-31 at 09:35 +0900, Takao Indoh wrote:
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e37fea6..c595997 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3392,6 +3392,59 @@ int pci_reset_function(struct pci_dev *dev)
> EXPORT_SYMBOL_GPL(pci_reset_function);
>
> /**
> + * pci_reset_bus - reset a PCI bus
> + * @bus: PCI bus to reset
> + *
> + * Returns 0 if the bus was successfully reset or negative if failed.
> + */
> +int pci_reset_bus(struct pci_bus *bus)
> +{
> + struct pci_dev *pdev;
> + u16 ctrl;
> +
> + if (!bus->self)
> + return -ENOTTY;
> +
> + list_for_each_entry(pdev, &bus->devices, bus_list)
> + if (pdev->subordinate)
> + return -ENOTTY;
> +
> + /* Save config registers of children */
> + list_for_each_entry(pdev, &bus->devices, bus_list) {
> + dev_info(&pdev->dev, "Save state\n");
> + pci_save_state(pdev);
> + }
> +
> + dev_info(&bus->self->dev, "Reset Secondary bus\n");
> +
> + /* Assert Secondary Bus Reset */
> + pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl);
> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
> + pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl);
> +
> + /* Read config again to flush previous write */
> + pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl);
> +
> + msleep(2);
> +
> + /* De-assert Secondary Bus Reset */
> + ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
> + pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl);
> +
> + /* Wait for completion */
> + msleep(1000);


We already have secondary bus reset code in this file, why are we
duplicating it here? Also, why are these delays different from the
existing code? I'm also in need of a bus reset interface for when we
assign all of the devices on a bus to userspace and do not have working
function level resets per device. I'll post my patch series and perhaps
we can collaborate on a pci bus reset interface. Thanks,

Alex


2013-07-31 05:51:31

by Takao Indoh

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

(2013/07/31 12:11), Alex Williamson wrote:
> On Wed, 2013-07-31 at 09:35 +0900, Takao Indoh wrote:
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index e37fea6..c595997 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -3392,6 +3392,59 @@ int pci_reset_function(struct pci_dev *dev)
>> EXPORT_SYMBOL_GPL(pci_reset_function);
>>
>> /**
>> + * pci_reset_bus - reset a PCI bus
>> + * @bus: PCI bus to reset
>> + *
>> + * Returns 0 if the bus was successfully reset or negative if failed.
>> + */
>> +int pci_reset_bus(struct pci_bus *bus)
>> +{
>> + struct pci_dev *pdev;
>> + u16 ctrl;
>> +
>> + if (!bus->self)
>> + return -ENOTTY;
>> +
>> + list_for_each_entry(pdev, &bus->devices, bus_list)
>> + if (pdev->subordinate)
>> + return -ENOTTY;
>> +
>> + /* Save config registers of children */
>> + list_for_each_entry(pdev, &bus->devices, bus_list) {
>> + dev_info(&pdev->dev, "Save state\n");
>> + pci_save_state(pdev);
>> + }
>> +
>> + dev_info(&bus->self->dev, "Reset Secondary bus\n");
>> +
>> + /* Assert Secondary Bus Reset */
>> + pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl);
>> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
>> + pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl);
>> +
>> + /* Read config again to flush previous write */
>> + pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl);
>> +
>> + msleep(2);
>> +
>> + /* De-assert Secondary Bus Reset */
>> + ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
>> + pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl);
>> +
>> + /* Wait for completion */
>> + msleep(1000);
>
>
> We already have secondary bus reset code in this file, why are we
> duplicating it here? Also, why are these delays different from the
> existing code? I'm also in need of a bus reset interface for when we
> assign all of the devices on a bus to userspace and do not have working
> function level resets per device. I'll post my patch series and perhaps
> we can collaborate on a pci bus reset interface. Thanks,

Good point. Yes, we already have similar functions.

pci_parent_bus_reset()
1. Assert secondary bus reset
2. msleep(100)
3. De-assert secondary bus reset
4. msleep(100)

aer_do_secondary_bus_reset()
1. Assert secondary bus reset
2. msleep(2)
3. De-assert secondary bus reset,
4. msleep(200)

To be honest, I wrote my reset code almost one year ago, so I forgot
the reason why I separated them.

Basically my reset code is based on aer_do_secondary_bus_reset(). The
difference is the waiting time after reset. My patch waits 1000 msec.

At first my reset code was almost the same as
aer_do_secondary_bus_reset(). But when I tested it, I found that on a
certain machine restoring config registers failed after reset, because
the 200 msec wait was too short. I then found the following description
in the PCIe spec. According to this, I think we should wait at least
1000 msec.

6.6.1. Conventional Reset

* The Root Complex and/or system software must allow at least 1.0s
after a Conventional Reset of a device, before it may determine that a
device which fails to return a Successful Completion status for a
valid Configuration Request is a broken device. This period is
independent of how quickly Link training completes.

Note: This delay is analogous to the Trhfa parameter specified for
PCI/PCI-X, and is intended to allow an adequate amount of time for
devices which require self initialization.

* When attempting a Configuration access to devices on a PCI or PCI-X
bus segment behind a PCI Express/PCI(-X) Bridge, the timing parameter
Trhfa must be respected.

And I saw the patches you posted today; yes, they look helpful for
my purpose :-)

Thanks,
Takao Indoh

2013-07-31 16:10:23

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

On Thu, Jul 25, 2013 at 11:00:46AM -0600, Bjorn Helgaas wrote:
> On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh
> <[email protected]> wrote:
> > Sorry for letting this discussion slide, I was busy on other works:-(
> > Anyway, the summary of previous discussion is:
> > - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
> > boot. This expects PCI enumeration is done before IOMMU
> > initialization as follows.
> > (1) PCI enumeration
> > (2) fs_initcall ---> device reset
> > (3) IOMMU initialization
> > - This works on x86, but does not work on other architecture because
> > IOMMU is initialized before PCI enumeration on some architectures. So,
> > device reset should be done where IOMMU is initialized instead of
> > initcall.
> > - Or, as another idea, we can reset devices in first kernel(panic kernel)
> >
> > Resetting devices in panic kernel is against kdump policy and seems not to
> > be good idea. So I think adding reset code into iommu initialization is
> > better. I'll post patches for that.
>
> Of course nobody *wants* to do anything in the panic kernel. But
> simply saying "it's against kdump policy and seems not to be a good
> idea" is not a technical argument. There are things that are
> impractical to do in the kdump kernel, so they have to be done in the
> panic kernel even though we know the kernel is unreliable and the
> attempt may fail.

I think resetting all devices in the crashed kernel requires a lot of
code. If it were a small piece of code, it could still be considered.

I don't know much about IOMMU or PCI or PCIe. But I am taking a step
back to discuss again the idea of not resetting the IOMMU in the
second kernel.

I think resetting the bus is a good idea, but resetting only PCIe will
solve just part of the problem, and we will have the same issues with
devices on other buses.

So it sounds more appealing if we could fix this particular problem at
the IOMMU level first (and continue to develop patches for resetting
the various buses).

Ideas along these lines have been proposed in the past: continue to use
the translation table from the first kernel, retain those mappings, and
don't reset the IOMMU; reserve some space for kdump mappings in the
first kernel and use that reserved mapping space in the second kernel.
It was never implemented, though.

Bjorn, so what's the fundamental problem with this idea?

Also, what's wrong with a DMAR error? If some device tries to do DMA,
and the DMA is blocked because the IOMMU was reset and the mappings are
no longer there, why does that lead to failure? Shouldn't we just
rate-limit the error messages in that case? If the device is needed,
its driver will reset it anyway.

The other problem mentioned in this thread is PCI SERR. What is it? Is
it some kind of error a device reports if it can't do DMA successfully?
Can these errors simply be ignored in the kdump kernel? This problem
sounds similar to a device keeping its interrupt asserted in the second
kernel, where the kernel simply disables the interrupt line if nobody
claims the interrupt.

IOW, it feels to me that we should handle the issue (DMAR errors) at
the IOMMU level first, instead of trying to make sure that by the time
we get to IOMMU initialization, all devices in the system have been
quiesced and nobody is doing DMA.

Thanks
Vivek

2013-07-31 19:57:00

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

On Tue, Jul 30, 2013 at 09:59:16AM -0600, Bjorn Helgaas wrote:
> On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh
> <[email protected]> wrote:
> > (2013/07/29 23:17), Bjorn Helgaas wrote:
> >> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh <[email protected]> wrote:
> >>> (2013/07/26 2:00), Bjorn Helgaas wrote:
>
> >>>> My point about IOMMU and PCI initialization order doesn't go away just
> >>>> because it doesn't fit "kdump policy." Having system initialization
> >>>> occur in a logical order is far more important than making kdump work.
> >>>
> >>> My next plan is as follows. I think this is matched to logical order
> >>> on boot.
> >>>
> >>> drivers/pci/pci.c
> >>> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus)
> >>>
> >>> drivers/iommu/intel-iommu.c
> >>> - On initialization, if IOMMU is already enabled, call this bus reset
> >>> function before disabling and re-enabling IOMMU.
> >>
> >> I raised this issue because of arches like sparc that enumerate the
> >> IOMMU before the PCI devices that use it. In that situation, I think
> >> you're proposing this:
> >>
> >> panic kernel
> >> enable IOMMU
> >> panic
> >> kdump kernel
> >> initialize IOMMU (already enabled)
> >> pci_reset_bus
> >> disable IOMMU
> >> enable IOMMU
> >> enumerate PCI devices
> >>
> >> But the problem is that when you call pci_reset_bus(), you haven't
> >> enumerated the PCI devices, so you don't know what to reset.
> >
> > Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu
> > initialization is based on the assumption that enumeration of PCI devices
> > is already done. We can find target device from IOMMU page table instead
> > of scanning all devices in pci tree.
> >
> > Therefore, this idea is only for intel-iommu. Other architectures need
> > to implement their own reset code.
>
> That's my point. I'm opposed to adding code to PCI when it only
> benefits x86 and we know other arches will need a fundamentally
> different design. I would rather have a design that can work for all
> arches.

I agree. It makes sense to work on a design which works for all
arches.

And that's why I feel that handling this problem at the IOMMU level
makes more sense, if we can.

Thanks
Vivek

2013-07-31 21:08:29

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

[+cc Rafael, linux-acpi]

On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh <[email protected]> wrote:

> On x86, IOMMU initialization currently runs *after* PCI enumeration, but
> you are saying it should be changed so that x86 IOMMU initialization is
> done *before* PCI enumeration, as on sparc, right?

Yes. I don't know whether or when that initialization order will ever
be changed, but I do think we should avoid building more
infrastructure that depends on the current order.

Changing the order is a pretty big deal because it's a lot more than
just the IOMMU. Basically I think we should be enumerating ACPI
devices, including the IOMMU, before PCI devices, but there's a lot of
legacy involved in that area. Added Rafael in case he has any
thoughts.

> Hmm, ok, I think I need to post the attached patch to the iommu list and
> discuss it, including the current order of x86 IOMMU initialization.

2013-07-31 21:13:21

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

On Wednesday, July 31, 2013 03:08:03 PM Bjorn Helgaas wrote:
> [+cc Rafael, linux-acpi]
>
> On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh <[email protected]> wrote:
>
> > On x86, IOMMU initialization currently runs *after* PCI enumeration, but
> > you are saying it should be changed so that x86 IOMMU initialization is
> > done *before* PCI enumeration, as on sparc, right?
>
> Yes. I don't know whether or when that initialization order will ever
> be changed, but I do think we should avoid building more
> infrastructure that depends on the current order.
>
> Changing the order is a pretty big deal because it's a lot more than
> just the IOMMU. Basically I think we should be enumerating ACPI
> devices, including the IOMMU, before PCI devices, but there's a lot of
> legacy involved in that area. Added Rafael in case he has any
> thoughts.

Well, actually, I'm not really familiar with IOMMUs, sorry.

I do think that initializing IOMMU before PCI enumeration would be better,
however. At least if the ordering should be the same on all architectures,
which I suppose is the case, that's the one I'd choose.

Thanks,
Rafael

2013-08-01 06:34:49

by Takao Indoh

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

(2013/08/01 6:23), Rafael J. Wysocki wrote:
> On Wednesday, July 31, 2013 03:08:03 PM Bjorn Helgaas wrote:
>> [+cc Rafael, linux-acpi]
>>
>> On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh <[email protected]> wrote:
>>
>>> On x86, IOMMU initialization currently runs *after* PCI enumeration, but
>>> you are saying it should be changed so that x86 IOMMU initialization is
>>> done *before* PCI enumeration, as on sparc, right?
>>
>> Yes. I don't know whether or when that initialization order will ever
>> be changed, but I do think we should avoid building more
>> infrastructure that depends on the current order.
>>
>> Changing the order is a pretty big deal because it's a lot more than
>> just the IOMMU. Basically I think we should be enumerating ACPI
>> devices, including the IOMMU, before PCI devices, but there's a lot of
>> legacy involved in that area. Added Rafael in case he has any
>> thoughts.
>
> Well, actually, I'm not really familiar with IOMMUs, sorry.
>
> I do think that initializing IOMMU before PCI enumeration would be better,
> however. At least if the ordering should be the same on all architectures,
> which I suppose is the case, that's the one I'd choose.

Ok guys. If the x86 IOMMU maintainer also thinks changing the order is
necessary, maybe I need to give up on device reset in the kdump kernel
and consider doing it in the panic kernel.

Either way, I need a bus reset interface to reset devices. Bjorn, could
you review the bus reset patches Alex posted yesterday?

Thanks,
Takao Indoh

2013-08-01 12:42:52

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

On Thu, 2013-08-01 at 15:34 +0900, Takao Indoh wrote:
> (2013/08/01 6:23), Rafael J. Wysocki wrote:
> > On Wednesday, July 31, 2013 03:08:03 PM Bjorn Helgaas wrote:
> >> [+cc Rafael, linux-acpi]
> >>
> >> On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh <[email protected]> wrote:
> >>
> >>> On x86, IOMMU initialization currently runs *after* PCI enumeration, but
> >>> you are saying it should be changed so that x86 IOMMU initialization is
> >>> done *before* PCI enumeration, as on sparc, right?
> >>
> >> Yes. I don't know whether or when that initialization order will ever
> >> be changed, but I do think we should avoid building more
> >> infrastructure that depends on the current order.
> >>
> >> Changing the order is a pretty big deal because it's a lot more than
> >> just the IOMMU. Basically I think we should be enumerating ACPI
> >> devices, including the IOMMU, before PCI devices, but there's a lot of
> >> legacy involved in that area. Added Rafael in case he has any
> >> thoughts.
> >
> > Well, actually, I'm not really familiar with IOMMUs, sorry.
> >
> > I do think that initializing IOMMU before PCI enumeration would be better,
> > however. At least if the ordering should be the same on all architectures,
> > which I suppose is the case, that's the one I'd choose.
>
> Ok guys. If x86 IOMMU maintainer also thinks changing order is
> necessary, maybe I need to give up device reset in kdump kernel and
> consider doing it in panic kernel.
>
> Either way, I need bus reset interface to reset devices. Bjorn, could
> you review the bus reset patches Alex posted yesterday?

I'll post a non-RFC version today; I've made a couple of cleanups,
tuned the delays, and rolled in the AER version of secondary bus reset.
Thanks,

Alex

2013-08-01 13:20:54

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

On Thu, Aug 01, 2013 at 03:34:06PM +0900, Takao Indoh wrote:
> (2013/08/01 6:23), Rafael J. Wysocki wrote:
> > On Wednesday, July 31, 2013 03:08:03 PM Bjorn Helgaas wrote:
> >> [+cc Rafael, linux-acpi]
> >>
> >> On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh <[email protected]> wrote:
> >>
> >>> On x86, IOMMU initialization currently runs *after* PCI enumeration, but
> >>> you are saying it should be changed so that x86 IOMMU initialization is
> >>> done *before* PCI enumeration, as on sparc, right?
> >>
> >> Yes. I don't know whether or when that initialization order will ever
> >> be changed, but I do think we should avoid building more
> >> infrastructure that depends on the current order.
> >>
> >> Changing the order is a pretty big deal because it's a lot more than
> >> just the IOMMU. Basically I think we should be enumerating ACPI
> >> devices, including the IOMMU, before PCI devices, but there's a lot of
> >> legacy involved in that area. Added Rafael in case he has any
> >> thoughts.
> >
> > Well, actually, I'm not really familiar with IOMMUs, sorry.
> >
> > I do think that initializing IOMMU before PCI enumeration would be better,
> > however. At least if the ordering should be the same on all architectures,
> > which I suppose is the case, that's the one I'd choose.
>
> Ok guys. If x86 IOMMU maintainer also thinks changing order is
> necessary, maybe I need to give up device reset in kdump kernel and
> consider doing it in panic kernel.

I don't think trying to reset all the devices in the panic kernel is
a good idea.

We need to handle the problem at the IOMMU level first, which is
independent of whether devices have been reset or not.

IOW, we should have the capability to initialize the IOMMU first
and be able to deal with devices which are still doing DMA.

I am not against doing device reset, and it is most likely a good
thing, but it should happen in the second kernel, not in the crashed
kernel.

Thanks
Vivek