It was recently noticed on an HP Gen9 system that kdump couldn't succeed
due to an irq storm coming from an Intel NIC, narrowed down to a failure
to clear the MSI/MSI-X enable bits during the kdump kernel boot.
For that, we need an early quirk to manually turn off MSI/MSI-X for
PCI devices - this is implemented as an optional boot parameter in a
subsequent patch.
The problem is that in our test system, the Intel NICs were not present on
any secondary bus under the first PCIe root complex, so they couldn't
be reached by the recursion in check_dev_quirk(). Modern systems,
especially multi-processor ones with multiple NUMA nodes, expose multiple
root complexes, describing more than one PCI hierarchy domain. Currently
the simple recursion in the x86 early-quirks code starts a
descending recursion from bus 0000:00 and reaches many other buses by
walking through the bridges of this hierarchy. This is not
enough on systems with more than one root complex/host bridge, since
the recursion won't "traverse" to other root complexes when started
statically at 0000:00 (for more details, see [0]).
This patch hence implements a full bus/device/function scan in
early_quirks(), checking all possible buses instead of using a
recursion based on the first root bus or limiting the search scope to
the first 32 buses (as was done in the beginning [1]).
[0] https://bugs.launchpad.net/bugs/1797990
[1] From a historical perspective, the early PCI scan dates back
to the BitKeeper era, added by Andi Kleen's "[PATCH] APIC fixes for x86-64"
in October 2003. It initially restricted the search to the first
32 buses and slots.
Due to a potential bug found in Nvidia chipsets, the scan
was changed to run only on the first root bus: see
commit 8659c406ade3 ("x86: only scan the root bus in early PCI quirks")
Finally, secondary buses reachable from the 1st bus were re-added by:
commit 850c321027c2 ("x86/quirks: Reintroduce scanning of secondary buses")
Reported-by: Dan Streetman <[email protected]>
Signed-off-by: Guilherme G. Piccoli <[email protected]>
---
arch/x86/kernel/early-quirks.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 50d5848bf22e..fd50f9e21623 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -731,7 +731,6 @@ static int __init check_dev_quirk(int num, int slot, int func)
u16 vendor;
u16 device;
u8 type;
- u8 sec;
int i;
class = read_pci_config_16(num, slot, func, PCI_CLASS_DEVICE);
@@ -760,11 +759,8 @@ static int __init check_dev_quirk(int num, int slot, int func)
type = read_pci_config_byte(num, slot, func,
PCI_HEADER_TYPE);
- if ((type & 0x7f) == PCI_HEADER_TYPE_BRIDGE) {
- sec = read_pci_config_byte(num, slot, func, PCI_SECONDARY_BUS);
- if (sec > num)
- early_pci_scan_bus(sec);
- }
+ if ((type & 0x7f) == PCI_HEADER_TYPE_BRIDGE)
+ return -1;
if (!(type & 0x80))
return -1;
@@ -787,8 +783,11 @@ static void __init early_pci_scan_bus(int bus)
void __init early_quirks(void)
{
+ int bus;
+
if (!early_pci_allowed())
return;
- early_pci_scan_bus(0);
+ for (bus = 0; bus < 256; bus++)
+ early_pci_scan_bus(bus);
}
--
2.19.0
This patch exports (and renames) the function find_cap() so it can be
used in the early PCI quirk code by the next patch.
It is moved out of the AGP code into generic early-PCI code,
since it's not AGP-specific and can be used for any PCI device.
No functional changes intended.
Signed-off-by: Guilherme G. Piccoli <[email protected]>
---
arch/x86/include/asm/pci-direct.h | 1 +
arch/x86/kernel/aperture_64.c | 30 ++----------------------------
arch/x86/pci/early.c | 25 +++++++++++++++++++++++++
3 files changed, 28 insertions(+), 28 deletions(-)
diff --git a/arch/x86/include/asm/pci-direct.h b/arch/x86/include/asm/pci-direct.h
index 94597a3cf3d0..813996305bf5 100644
--- a/arch/x86/include/asm/pci-direct.h
+++ b/arch/x86/include/asm/pci-direct.h
@@ -10,6 +10,7 @@
extern u32 read_pci_config(u8 bus, u8 slot, u8 func, u8 offset);
extern u8 read_pci_config_byte(u8 bus, u8 slot, u8 func, u8 offset);
extern u16 read_pci_config_16(u8 bus, u8 slot, u8 func, u8 offset);
+extern u32 pci_early_find_cap(int bus, int slot, int func, int cap);
extern void write_pci_config(u8 bus, u8 slot, u8 func, u8 offset, u32 val);
extern void write_pci_config_byte(u8 bus, u8 slot, u8 func, u8 offset, u8 val);
extern void write_pci_config_16(u8 bus, u8 slot, u8 func, u8 offset, u16 val);
diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index 2c4d5ece7456..365fcc37b2a2 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -120,32 +120,6 @@ static u32 __init allocate_aperture(void)
}
-/* Find a PCI capability */
-static u32 __init find_cap(int bus, int slot, int func, int cap)
-{
- int bytes;
- u8 pos;
-
- if (!(read_pci_config_16(bus, slot, func, PCI_STATUS) &
- PCI_STATUS_CAP_LIST))
- return 0;
-
- pos = read_pci_config_byte(bus, slot, func, PCI_CAPABILITY_LIST);
- for (bytes = 0; bytes < 48 && pos >= 0x40; bytes++) {
- u8 id;
-
- pos &= ~3;
- id = read_pci_config_byte(bus, slot, func, pos+PCI_CAP_LIST_ID);
- if (id == 0xff)
- break;
- if (id == cap)
- return pos;
- pos = read_pci_config_byte(bus, slot, func,
- pos+PCI_CAP_LIST_NEXT);
- }
- return 0;
-}
-
/* Read a standard AGPv3 bridge header */
static u32 __init read_agp(int bus, int slot, int func, int cap, u32 *order)
{
@@ -234,8 +208,8 @@ static u32 __init search_agp_bridge(u32 *order, int *valid_agp)
case PCI_CLASS_BRIDGE_HOST:
case PCI_CLASS_BRIDGE_OTHER: /* needed? */
/* AGP bridge? */
- cap = find_cap(bus, slot, func,
- PCI_CAP_ID_AGP);
+ cap = pci_early_find_cap(bus, slot,
+ func, PCI_CAP_ID_AGP);
if (!cap)
break;
*valid_agp = 1;
diff --git a/arch/x86/pci/early.c b/arch/x86/pci/early.c
index f5fc953e5848..f1ba9d781b52 100644
--- a/arch/x86/pci/early.c
+++ b/arch/x86/pci/early.c
@@ -51,6 +51,31 @@ void write_pci_config_16(u8 bus, u8 slot, u8 func, u8 offset, u16 val)
outw(val, 0xcfc + (offset&2));
}
+u32 pci_early_find_cap(int bus, int slot, int func, int cap)
+{
+ int bytes;
+ u8 pos;
+
+ if (!(read_pci_config_16(bus, slot, func, PCI_STATUS) &
+ PCI_STATUS_CAP_LIST))
+ return 0;
+
+ pos = read_pci_config_byte(bus, slot, func, PCI_CAPABILITY_LIST);
+ for (bytes = 0; bytes < 48 && pos >= 0x40; bytes++) {
+ u8 id;
+
+ pos &= ~3;
+ id = read_pci_config_byte(bus, slot, func, pos+PCI_CAP_LIST_ID);
+ if (id == 0xff)
+ break;
+ if (id == cap)
+ return pos;
+ pos = read_pci_config_byte(bus, slot, func,
+ pos+PCI_CAP_LIST_NEXT);
+ }
+ return 0;
+}
+
int early_pci_allowed(void)
{
return (pci_probe & (PCI_PROBE_CONF1|PCI_PROBE_NOEARLY)) ==
--
2.19.0
We observed a kdump failure on x86 that was narrowed down to an MSI irq
storm coming from a PCI network device. The bug manifests as a lack of
progress in the boot process of the kdump kernel, and a flood of kernel
messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kexec process of the kdump kernel
doesn't ensure PCI devices are reset or their MSI capabilities disabled,
so a PCI adapter could produce a huge number of irqs which would steal
all the processing time of the CPU (especially since we usually restrict
the kdump kernel to a single CPU).
This patch implements the kernel parameter "pci=clearmsi" to clear the
MSI/MSI-X enable bits in the Message Control register for all PCI devices
early in boot, thus preventing potential issues in the kexec'ed
kernel. The PCI spec also supports/enforces this need (see PCI Local Bus
spec sections 6.8.1.3 and 6.8.2.3).
Suggested-by: Dan Streetman <[email protected]>
Suggested-by: Gavin Shan <[email protected]>
Signed-off-by: Guilherme G. Piccoli <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 6 ++++
arch/x86/include/asm/pci-direct.h | 1 +
arch/x86/kernel/early-quirks.c | 32 +++++++++++++++++++
arch/x86/pci/common.c | 4 +++
4 files changed, 43 insertions(+)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 92eb1f42240d..aeb510e484d4 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3161,6 +3161,12 @@
nomsi [MSI] If the PCI_MSI kernel config parameter is
enabled, this kernel boot option can be used to
disable the use of MSI interrupts system-wide.
+ clearmsi [X86] Clears MSI/MSI-X enable bits early in boot
+ time in order to avoid issues like adapters
+ screaming irqs and preventing boot progress.
+ Also, it enforces the PCI Local Bus spec
+ rule that those bits should be 0 in system reset
+ events (useful for kexec/kdump cases).
noioapicquirk [APIC] Disable all boot interrupt quirks.
Safety option to keep boot IRQs enabled. This
should never be necessary.
diff --git a/arch/x86/include/asm/pci-direct.h b/arch/x86/include/asm/pci-direct.h
index 813996305bf5..ebb3db2eee41 100644
--- a/arch/x86/include/asm/pci-direct.h
+++ b/arch/x86/include/asm/pci-direct.h
@@ -15,5 +15,6 @@ extern void write_pci_config(u8 bus, u8 slot, u8 func, u8 offset, u32 val);
extern void write_pci_config_byte(u8 bus, u8 slot, u8 func, u8 offset, u8 val);
extern void write_pci_config_16(u8 bus, u8 slot, u8 func, u8 offset, u16 val);
+extern unsigned int pci_early_clear_msi;
extern int early_pci_allowed(void);
#endif /* _ASM_X86_PCI_DIRECT_H */
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index fd50f9e21623..21060d80441e 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -28,6 +28,37 @@
#include <asm/irq_remapping.h>
#include <asm/early_ioremap.h>
+static void __init early_pci_clear_msi(int bus, int slot, int func)
+{
+ int pos;
+ u16 ctrl;
+
+ if (likely(!pci_early_clear_msi))
+ return;
+
+ pr_info_once("Clearing MSI/MSI-X enable bits early in boot (quirk)\n");
+
+ pos = pci_early_find_cap(bus, slot, func, PCI_CAP_ID_MSI);
+ if (pos) {
+ ctrl = read_pci_config_16(bus, slot, func, pos + PCI_MSI_FLAGS);
+ ctrl &= ~PCI_MSI_FLAGS_ENABLE;
+ write_pci_config_16(bus, slot, func, pos + PCI_MSI_FLAGS, ctrl);
+
+ /* Read again to flush previous write */
+ ctrl = read_pci_config_16(bus, slot, func, pos + PCI_MSI_FLAGS);
+ }
+
+ pos = pci_early_find_cap(bus, slot, func, PCI_CAP_ID_MSIX);
+ if (pos) {
+ ctrl = read_pci_config_16(bus, slot, func, pos + PCI_MSIX_FLAGS);
+ ctrl &= ~PCI_MSIX_FLAGS_ENABLE;
+ write_pci_config_16(bus, slot, func, pos + PCI_MSIX_FLAGS, ctrl);
+
+ /* Read again to flush previous write */
+ ctrl = read_pci_config_16(bus, slot, func, pos + PCI_MSIX_FLAGS);
+ }
+}
+
static void __init fix_hypertransport_config(int num, int slot, int func)
{
u32 htcfg;
@@ -709,6 +740,7 @@ static struct chipset early_qrk[] __initdata = {
PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, force_disable_hpet},
{ PCI_VENDOR_ID_BROADCOM, 0x4331,
PCI_CLASS_NETWORK_OTHER, PCI_ANY_ID, 0, apple_airport_reset},
+ { PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0, early_pci_clear_msi},
{}
};
diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index d4ec117c1142..7f6f85bd47a3 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -32,6 +32,7 @@ int noioapicreroute = 1;
#endif
int pcibios_last_bus = -1;
unsigned long pirq_table_addr;
+unsigned int pci_early_clear_msi;
const struct pci_raw_ops *__read_mostly raw_pci_ops;
const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
@@ -604,6 +605,9 @@ char *__init pcibios_setup(char *str)
} else if (!strcmp(str, "skip_isa_align")) {
pci_probe |= PCI_CAN_SKIP_ISA_ALIGN;
return NULL;
+ } else if (!strcmp(str, "clearmsi")) {
+ pci_early_clear_msi = 1;
+ return NULL;
} else if (!strcmp(str, "noioapicquirk")) {
noioapicquirk = 1;
return NULL;
--
2.19.0
On 18/10/2018 17:08, Sinan Kaya wrote:
> On 10/18/2018 2:37 PM, Guilherme G. Piccoli wrote:
>> We observed a kdump failure in x86 that was narrowed down to MSI irq
>> storm coming from a PCI network device. The bug manifests as a lack of
>> progress in the boot process of kdump kernel, and a flood of kernel
>> messages like:
>>
>> [...]
>> [ 342.265294] do_IRQ: 0.155 No irq handler for vector
>> [ 342.266916] do_IRQ: 0.155 No irq handler for vector
>> [ 347.258422] do_IRQ: 14053260 callbacks suppressed
>> [...]
>
> These kind of issues are usually fixed by fixing the network driver's
> shutdown routine to ensure that MSI interrupts are cleared there.
Sinan, I'm not sure shutdown handlers for drivers are called in a panic
kexec (I remember an old experiment I did: loading a kernel
with "kexec -p" didn't trigger the handlers).
But this case is even worse, because the NICs were in PCI passthrough
mode, using vfio - so they were completely unaware of what happened
in the host kernel.
Also, this is spec compliant: system reset events should guarantee the
bits are cleared (although kexec is not exactly a system reset, it's
similar).
Cheers,
Guilherme
On 10/18/2018 2:37 PM, Guilherme G. Piccoli wrote:
> We observed a kdump failure in x86 that was narrowed down to MSI irq
> storm coming from a PCI network device. The bug manifests as a lack of
> progress in the boot process of kdump kernel, and a flood of kernel
> messages like:
>
> [...]
> [ 342.265294] do_IRQ: 0.155 No irq handler for vector
> [ 342.266916] do_IRQ: 0.155 No irq handler for vector
> [ 347.258422] do_IRQ: 14053260 callbacks suppressed
> [...]
These kinds of issues are usually fixed by fixing the network driver's
shutdown routine to ensure that MSI interrupts are cleared there.
On 10/18/2018 4:13 PM, Guilherme G. Piccoli wrote:
>> These kind of issues are usually fixed by fixing the network driver's
>> shutdown routine to ensure that MSI interrupts are cleared there.
>
> Sinan, I'm not sure shutdown handlers for drivers are called in panic
> kexec (I remember of an old experiment I did, loading a kernel
> with "kexec -p" didn't trigger the handlers).
AFAIK, all shutdown (not remove) routines are called before launching the next
kernel, even in the crash scenario. It is not safe to start the new kernel while
hardware is doing DMA to system memory and triggering interrupts.
The shutdown routine in the PCI core used to disable MSI/MSI-X on behalf of all
endpoints, but it was later decided that this is the responsibility of the
endpoint driver.
commit fda78d7a0ead144f4b2cdb582dcba47911f4952c
Author: Prarit Bhargava <[email protected]>
Date: Thu Jan 26 14:07:47 2017 -0500
PCI/MSI: Stop disabling MSI/MSI-X in pci_device_shutdown()
The pci_bus_type .shutdown method, pci_device_shutdown(), is called from
device_shutdown() in the kernel restart and shutdown paths.
Previously, pci_device_shutdown() called pci_msi_shutdown() and
pci_msix_shutdown(). This disables MSI and MSI-X, which causes the device
to fall back to raising interrupts via INTx. But the driver is still bound
to the device, it doesn't know about this change, and it likely doesn't
have an INTx handler, so these INTx interrupts cause "nobody cared"
warnings like this:
irq 16: nobody cared (try booting with the "irqpoll" option)
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.2-1.el7_UNSUPPORTED.x86_64 #1
Hardware name: Hewlett-Packard HP Z820 Workstation/158B, BIOS J63 v03.90 06/
...
The MSI disabling code was added by d52877c7b1af ("pci/irq: let
pci_device_shutdown to call pci_msi_shutdown v2") because a driver left MSI
enabled and kdump failed because the kexeced kernel wasn't prepared to
receive the MSI interrupts.
Subsequent commits 1851617cd2da ("PCI/MSI: Disable MSI at enumeration even
if kernel doesn't support MSI") and e80e7edc55ba ("PCI/MSI: Initialize MSI
capability for all architectures") changed the kexeced kernel to disable
all MSIs itself so it no longer depends on the crashed kernel to clean up
after itself.
Stop disabling MSI/MSI-X in pci_device_shutdown(). This resolves the
"nobody cared" unhandled IRQ issue above. It also allows PCI serial
devices, which may rely on the MSI interrupts, to continue outputting
messages during reboot/shutdown.
[bhelgaas: changelog, drop pci_msi_shutdown() and pci_msix_shutdown() calls
altogether]
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=187351
Signed-off-by: Prarit Bhargava <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
CC: Alex Williamson <[email protected]>
CC: David Arcari <[email protected]>
CC: Myron Stowe <[email protected]>
CC: Lukas Wunner <[email protected]>
CC: Keith Busch <[email protected]>
CC: Mika Westerberg <[email protected]>
>
> But this case is even worse, because the NICs were in PCI passthrough
> mode, using vfio. So, they were completely unaware of what happened
> in the host kernel.
>
> Also, this is spec compliant - system reset events should guarantee the
> bits are cleared (although kexec is not exactly a system reset, it's
> similar)
On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote:
> Recently was noticed in an HP GEN9 system that kdump couldn't succeed
> due to an irq storm coming from an Intel NIC, narrowed down to be lack
> of clearing the MSI/MSI-X enable bits during the kdump kernel boot.
> For that, we need an early quirk to manually turn off MSI/MSI-X for
> PCI devices - this was worked as an optional boot parameter in a
> subsequent patch.
>
> Problem is that in our test system, the Intel NICs were not present in
> any secondary bus under the first PCIe root complex, so they couldn't
> be reached by the recursion in check_dev_quirk(). Modern systems,
> specially with multi-processors and multiple NUMA nodes expose multiple
> root complexes, describing more than one PCI hierarchy domain. Currently
> the simple recursion present in the early-quirks code from x86 starts a
> descending recursion from bus 0000:00, and reach many other busses by
> navigating this hierarchy walking through the bridges. This is not
> enough in systems with more than one root complex/host bridge, since
> the recursion won't "traverse" to other root complexes by starting
> statically in 0000:00 (for more details, see [0]).
>
> This patch hence implements the full bus/device/function scan in
> early_quirks(), by checking all possible busses instead of using a
> recursion based on the first root bus or limiting the search scope to
> the first 32 busses (like it was done in the beginning [1]).
I don't want to expand the early quirk infrastructure unless there is
absolutely no other way to solve this. The early quirk stuff is
x86-specific, and it's not obvious that this problem is x86-only.
This patch scans buses 0-255, but still only in domain 0, so it won't
help with even more complicated systems that use other domains.
I'm not an IRQ expert, but it seems wrong to me that we are enabling
this interrupt before we're ready for it. The MSI should target an
IOAPIC. Can't that IOAPIC entry be masked until later? I guess the
kdump kernel doesn't know what MSI address the device might be using.
Could the IRQ core be more tolerant of this somehow, e.g., if it
notices incoming interrupts with no handler, could it disable the
IOAPIC entry and fall back to polling periodically until a handler is
added?
> [0] https://bugs.launchpad.net/bugs/1797990
>
> [1] From historical perspective, early PCI scan dates back
> to BitKeeper, added by Andi Kleen's "[PATCH] APIC fixes for x86-64",
> on October/2003. It initially restricted the search to the first
> 32 busses and slots.
>
> Due to a potential bug found in Nvidia chipsets, the scan
> was changed to run only in the first root bus: see
> commit 8659c406ade3 ("x86: only scan the root bus in early PCI quirks")
>
> Finally, secondary busses reachable from the 1st bus were re-added back by:
> commit 850c321027c2 ("x86/quirks: Reintroduce scanning of secondary buses")
>
> Reported-by: Dan Streetman <[email protected]>
> Signed-off-by: Guilherme G. Piccoli <[email protected]>
> ---
> arch/x86/kernel/early-quirks.c | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
> index 50d5848bf22e..fd50f9e21623 100644
> --- a/arch/x86/kernel/early-quirks.c
> +++ b/arch/x86/kernel/early-quirks.c
> @@ -731,7 +731,6 @@ static int __init check_dev_quirk(int num, int slot, int func)
> u16 vendor;
> u16 device;
> u8 type;
> - u8 sec;
> int i;
>
> class = read_pci_config_16(num, slot, func, PCI_CLASS_DEVICE);
> @@ -760,11 +759,8 @@ static int __init check_dev_quirk(int num, int slot, int func)
> type = read_pci_config_byte(num, slot, func,
> PCI_HEADER_TYPE);
>
> - if ((type & 0x7f) == PCI_HEADER_TYPE_BRIDGE) {
> - sec = read_pci_config_byte(num, slot, func, PCI_SECONDARY_BUS);
> - if (sec > num)
> - early_pci_scan_bus(sec);
> - }
> + if ((type & 0x7f) == PCI_HEADER_TYPE_BRIDGE)
> + return -1;
>
> if (!(type & 0x80))
> return -1;
> @@ -787,8 +783,11 @@ static void __init early_pci_scan_bus(int bus)
>
> void __init early_quirks(void)
> {
> + int bus;
> +
> if (!early_pci_allowed())
> return;
>
> - early_pci_scan_bus(0);
> + for (bus = 0; bus < 256; bus++)
> + early_pci_scan_bus(bus);
> }
> --
> 2.19.0
>
On 18/10/2018 19:15, Bjorn Helgaas wrote:
> On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote:
> [...]
> I don't want to expand the early quirk infrastructure unless there is
> absolutely no other way to solve this. The early quirk stuff is
> x86-specific, and it's not obvious that this problem is x86-only.
>
> This patch scans buses 0-255, but still only in domain 0, so it won't
> help with even more complicated systems that use other domains.
>
> I'm not an IRQ expert, but it seems wrong to me that we are enabling
> this interrupt before we're ready for it. The MSI should target an
> IOAPIC. Can't that IOAPIC entry be masked until later? I guess the
> kdump kernel doesn't know what MSI address the device might be using.
>
> Could the IRQ core be more tolerant of this somehow, e.g., if it
> notices incoming interrupts with no handler, could it disable the
> IOAPIC entry and fall back to polling periodically until a handler is
> added?
Hi Bjorn, thanks for your quick reply.
I understand your point, but I think this is inherently an architecture
problem. No matter what solution we decide on, it'll need to be applied
at early boot time, i.e., before the PCI layer gets initialized.
So, I think a first step would be to split the solution "timing" into 2
possibilities:
a) We could try to disable MSIs, or whatever approach we take, in the
quiesce path of crash_kexec(), before the bootstrap of the kdump kernel.
The pro is that we could use PCI handlers to do it generically. The con is
that it'd touch that delicate shutdown path from a broken kernel, which
is unreliable. Also, I've noticed that changes in those crash paths
usually draw a huge amount of criticism from the community; it seems
nobody wants to change a bit of this code unless utterly necessary.
b) Continue using an early boot approach. IMO, this would be per-arch by
nature.
Currently, powerpc for example does not suffer this issue, due to its
arch code performing a FW-aided PCI fundamental reset of the devices[0].
On the other hand, x86 has no generic fundamental reset infrastructure
to my knowledge (we tried some alternatives, like a bridge reset[1] that
didn't work, or zeroing the command register, which worked), but if
we go with the IOAPIC way of handling this (which we tried a bit and
failed at too), it'll be even more arch-dependent, since the IOAPIC is an
x86 concept.
After discussing here internally, an alternative way to make this MSI
approach work without requiring changes to the early PCI
infrastructure would be to check whether we're in a kdump kernel and
manually perform the full scan in that case, instead of changing the
generic case as proposed here. This would still be x86-only, but again,
it's difficult if not impossible to fix all arches with the same code here.
Finally, about multi-domain PCI topologies: I've never seen one on x86 -
I wasn't aware such things existed on x86 - but if required we can
quickly extend the logic to contemplate them too.
Thanks again, looking forward to your suggestions.
Cheers,
Guilherme
[0]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/platforms/powernv/pci-ioda.c#n3992
[1] Based in https://patchwork.kernel.org/patch/2562841, adapted to work
in early boot time.
On 18/10/2018 17:30, Sinan Kaya wrote:
>
> AFAIK, all shutdown (not remove) routines are called before launching
> the next
> kernel even in crash scenario. It is not safe to start the new kernel while
> hardware is doing a DMA to the system memory and triggering interrupts.
Hi Sinan,
I agree with you, it's definitely not safe to start a new kernel with
in-flight DMA transactions, but in the crash scenario I think the
rationale was that the running kernel is broken, so it's even more
unreliable to try to gracefully shut down the devices than to hope for
the best and start the kdump kernel right away heheh
Fact is that the shutdown handlers are not called in the crash scenario.
They come from device_shutdown(); the code paths are as follows:
Regular kexec flow:
syscall_reboot()
kernel_kexec()
kernel_restart_prepare()
device_shutdown()
machine_kexec()
Note that if CONFIG_KEXEC_JUMP is set, the regular flow doesn't call
device_shutdown() either.
Crash kexec flow:
__crash_kexec()
machine_kexec()
There are some entry points to __crash_kexec(), like panic() or die() in
x86, for example.
To validate this, one can load a kernel with the "initcall_debug" parameter
and perform a kexec - if the shutdown handlers are called, there's a
dev_info() call that shows a message per device.
> Shutdown routine in PCI core used to disable MSI/MSI-x on behalf of all
> endpoints but it was later decided that this is the responsibility of the
> endpoint driver.
>
This may be a good idea: using the PCI layer to disable MSIs in the
quiesce path of the broken kernel. I'll follow up on this discussion in
Bjorn's reply.
Thanks,
Guilherme
On Mon, Oct 22, 2018 at 05:35:06PM -0300, Guilherme G. Piccoli wrote:
> On 18/10/2018 19:15, Bjorn Helgaas wrote:
> > On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote:
> > [...]
> I understand your point, but I think this is inherently an architecture
> problem. No matter what solution we decide for, it'll need to be applied
> in early boot time, like before the PCI layer gets initialized.
This is the part I want to know more about. Apparently there's some
event X between early_quirks() and the PCI device enumeration, and we
must disable MSIs before X:
setup_arch()
early_quirks() # arch/x86/kernel/early-quirks.c
early_pci_clear_msi()
...
X
...
pci_scan_root_bus_bridge()
...
DECLARE_PCI_FIXUP_EARLY # drivers/pci/quirks.c
I want to know specifically what X is. If we don't know what X is and
all we know is "we have to disable MSIs earlier than PCI init", then
we're likely to break things again in the future by changing the order
of disabling MSIs and whatever X is.
Bjorn
On 23/10/2018 14:03, Bjorn Helgaas wrote:
> On Mon, Oct 22, 2018 at 05:35:06PM -0300, Guilherme G. Piccoli wrote:
>> On 18/10/2018 19:15, Bjorn Helgaas wrote:
>>> On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote:
>>> [...]
>> I understand your point, but I think this is inherently an architecture
>> problem. No matter what solution we decide for, it'll need to be applied
>> in early boot time, like before the PCI layer gets initialized.
>
> This is the part I want to know more about. Apparently there's some
> event X between early_quirks() and the PCI device enumeration, and we
> must disable MSIs before X:
>
> setup_arch()
> early_quirks() # arch/x86/kernel/early-quirks.c
> early_pci_clear_msi()
> ...
> X
> ...
> pci_scan_root_bus_bridge()
> ...
> DECLARE_PCI_FIXUP_EARLY # drivers/pci/quirks.c
>
> I want to know specifically what X is. If we don't know what X is and
> all we know is "we have to disable MSIs earlier than PCI init", then
> we're likely to break things again in the future by changing the order
> of disabling MSIs and whatever X is.
>
> Bjorn
>
Hi Bjorn (and all CCed), I'm sorry to necro-bump a thread >2 years
later, but recent discussions have led to a better understanding of this
'X' point, thanks to Thomas!
For those who deleted this thread from their email clients, it's
available in [0] - the summary is that we faced an IRQ storm really
early in boot, due to bogus PCIe device MSI behavior, when booting a
kdump kernel. This got the machine stuck during boot, so we couldn't
kdump. The solution proposed here is to clear MSI interrupts early in
boot on x86, if a parameter is provided. I don't have the reproducer
anymore, and it was pretty hard to reproduce in virtual environments.
So, about the 'X', Bjorn: in another recent thread about IRQ storms [1],
Thomas clarified that, and after a brief discussion it seems there's no
better way to prevent the MSI storm than clearing the MSI capability
early in boot. As discussed both here and in thread [1], this is indeed
a per-architecture issue (powerpc is not subject to it, due to a better
FW reset mechanism), so I think we could still benefit from having this
idea implemented upstream, at least on x86 (we could expand it to other
architectures in the future, if desired).
As a "test" data point, this was implemented in Ubuntu (the same 3
patches present in this series) for ~2 years and we haven't received bug
reports - I mention that because I understand your concerns about
expanding the scope of the early PCI quirks.
Let me know your thoughts. I'd suggest everyone read thread [1], which
addresses a similar issue at a different "moment" of the system boot
and provides some more insight into why the early MSI clearing seems to
make sense.
Thanks,
Guilherme
[0]
https://lore.kernel.org/linux-pci/[email protected]
[1] https://lore.kernel.org/lkml/[email protected]
On Fri, Nov 06, 2020 at 10:14:14AM -0300, Guilherme G. Piccoli wrote:
> On 23/10/2018 14:03, Bjorn Helgaas wrote:
> > On Mon, Oct 22, 2018 at 05:35:06PM -0300, Guilherme G. Piccoli wrote:
> >> On 18/10/2018 19:15, Bjorn Helgaas wrote:
> >>> On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote:
> >>> [...]
> >> I understand your point, but I think this is inherently an architecture
> >> problem. No matter what solution we decide for, it'll need to be applied
> >> in early boot time, like before the PCI layer gets initialized.
> >
> > This is the part I want to know more about. Apparently there's some
> > event X between early_quirks() and the PCI device enumeration, and we
> > must disable MSIs before X:
> >
> > setup_arch()
> > early_quirks() # arch/x86/kernel/early-quirks.c
> > early_pci_clear_msi()
> > ...
> > X
> > ...
> > pci_scan_root_bus_bridge()
> > ...
> > DECLARE_PCI_FIXUP_EARLY # drivers/pci/quirks.c
> >
> > I want to know specifically what X is. If we don't know what X is and
> > all we know is "we have to disable MSIs earlier than PCI init", then
> > we're likely to break things again in the future by changing the order
> > of disabling MSIs and whatever X is.
>
> Hi Bjorn (and all CCed), I'm sorry to necro-bump a thread >2 years
> later, but recent discussions led to a better understanding of this 'X'
> point, thanks to Thomas!
>
> For those that deleted this thread from their email clients, it's
> available in [0] - the summary is that we faced an IRQ storm really
> early in boot, due to a bogus PCIe device MSI behavior, when booting a
> kdump kernel. This led the machine to get stuck in the boot and we
> couldn't kdump. The solution hereby proposed is to clear MSI interrupts
> early in x86, if a parameter is provided. I don't have the reproducer
> anymore and it was pretty hard to reproduce in virtual environments.
>
> So, about the 'X' Bjorn: in another recent thread about IRQ storms [1],
> Thomas clarified that point, and after a brief discussion, it seems
> there's no better way to prevent the MSI storm than clearing the MSI
> capability early in boot. As discussed both here and in thread [1], this
> is indeed a per-architecture issue (powerpc is not subject to it, due
> to a better FW reset mechanism), so I think we could still benefit from
> having this idea implemented upstream, at least in x86 (we could expand
> it to other architectures in the future, if desired).
>
> As a "test" data point, this was implemented in Ubuntu (same 3 patches
> present in this series) for ~2 years and we haven't received bug reports
> - I'm saying that because I understand your concerns about expanding the
> early PCI quirks scope.
>
> Let me know your thoughts. I'd suggest all to read thread [1], which
> addresses a similar issue but in a different "moment" of the system boot
> and provides some more insight on why the early MSI clearing seems to
> make sense.
I guess Thomas' patch [2] (from thread [1]) doesn't solve this
problem?
I think [0] proposes using early_quirks() to disable MSIs at
boot-time. That doesn't seem like a robust solution because (a) the
problem affects all arches but early_quirks() is x86-specific and (b)
even on x86 early_quirks() only works for PCI segment 0 because it
relies on the 0xCF8/0xCFC I/O ports.
If I understand Thomas' email correctly, the IRQ storm occurs here:
start_kernel
setup_arch
early_quirks # x86-only
...
read_pci_config_16(num, slot, func, PCI_VENDOR_ID)
outl(..., 0xcf8) # PCI segment 0 only
inw(0xcfc)
local_irq_enable
...
native_irq_enable
asm("sti") # <-- enable IRQ, storm occurs
native_irq_enable() happens long before we discover PCI host bridges
and run the normal PCI quirks, so those would be too late to disable
MSIs.
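Bjorn's point (b) follows from the layout of the legacy config-access
mechanism itself. As a rough sketch (a hypothetical helper, not the
kernel's actual read_pci_config() code): the 32-bit CONFIG_ADDRESS
written to 0xCF8 has no field for a PCI segment/domain, so only
segment 0 is reachable this way.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of how a PCI Configuration Mechanism #1 address is formed
 * (the dword written to I/O port 0xCF8 before reading 0xCFC).
 * Bus/device/function/register fill the 32 bits completely; there is
 * no room for a segment number, hence the segment-0 limitation.
 */
static uint32_t cf8_address(uint8_t bus, uint8_t dev, uint8_t func,
                            uint8_t offset)
{
    return 0x80000000u                       /* enable bit          */
         | ((uint32_t)bus << 16)             /* bus number          */
         | ((uint32_t)(dev & 0x1f) << 11)    /* device (slot)       */
         | ((uint32_t)(func & 0x07) << 8)    /* function            */
         | (offset & 0xfc);                  /* dword-aligned reg   */
}
```

Systems with multiple segments need ECAM (MMIO) config space, which is
not set up this early in boot.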
It doesn't seem practical to disable MSIs in the kdump kernel at the
PCI level. I was hoping we could disable them somewhere in the IRQ
code, e.g., at IOAPICs, but I think Thomas is saying that's not
feasible.
It seems like the only option left is to disable MSIs before the
kexec. We used to clear the MSI/MSI-X Enable bits in
pci_device_shutdown(), but that broke console devices that relied on
MSI and caused "nobody cared" warnings when the devices fell back to
using INTx, so fda78d7a0ead ("PCI/MSI: Stop disabling MSI/MSI-X in
pci_device_shutdown()") left them unchanged.
pci_device_shutdown() still clears the Bus Master Enable bit if we're
doing a kexec and the device is in D0-D3hot, which should also disable
MSI/MSI-X. Why doesn't this solve the problem? Is this because the
device causing the storm was in PCI_UNKNOWN state?
> [0] https://lore.kernel.org/linux-pci/[email protected]
>
> [1] https://lore.kernel.org/lkml/[email protected]
[2] https://lore.kernel.org/lkml/[email protected]/
Notes to my future self about related changes:
2008-04-23 d52877c7b1af ("pci/irq: let pci_device_shutdown to call pci_msi_shutdown v2")
Disable MSI before kexec because new kernel isn't prepared for MSI
2011-10-17 d5dea7d95c48 ("PCI: msi: Disable msi interrupts when we initialize a pci device")
Disable MSI/MSI-X at boot; only works for new kernels with
CONFIG_PCI_MSI=y
2012-04-27 b566a22c2332 ("PCI: disable Bus Master on PCI device shutdown")
Disable bus mastering on shutdown (if enable/disable nested correctly)
2013-02-04 7897e6022761 ("PCI: Disable Bus Master unconditionally in pci_device_shutdown()")
Disable bus mastering unconditionally (ignore nested enable/disable)
2013-03-14 6e0eda3c3898 ("PCI: Don't try to disable Bus Master on disconnected PCI devices")
Don't touch bus mastering for D3cold or unknown state
2015-05-07 1851617cd2da ("PCI/MSI: Disable MSI at enumeration even if kernel doesn't support MSI")
Disable MSI/MSI-X at boot even without CONFIG_PCI_MSI=y; broke
Open Firmware arches
2015-10-21 e80e7edc55ba ("PCI/MSI: Initialize MSI capability for all architectures")
Disable MSI/MSI-X at boot for all arches, including Open Firmware
2017-01-26 fda78d7a0ead ("PCI/MSI: Stop disabling MSI/MSI-X in pci_device_shutdown()")
Leave MSI enabled before kexec; disabling causes device to use INTx,
which drivers aren't prepared for, causing "nobody cared" warnings
Bjorn,
On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
> On Fri, Nov 06, 2020 at 10:14:14AM -0300, Guilherme G. Piccoli wrote:
>> On 23/10/2018 14:03, Bjorn Helgaas wrote:
> I guess Thomas' patch [2] (from thread [1]) doesn't solve this
> problem?
No. As I explained in [1], the patch from [2] cannot solve it, because
the approach in [2] (which addresses the problem Liu was trying to
solve) requires that there is a registered interrupt handler which
knows how to shut up the interrupt.
> I think [0] proposes using early_quirks() to disable MSIs at
> boot-time. That doesn't seem like a robust solution because (a) the
> problem affects all arches but early_quirks() is x86-specific and (b)
> even on x86 early_quirks() only works for PCI segment 0 because it
> relies on the 0xCF8/0xCFC I/O ports.
>
> If I understand Thomas' email correctly, the IRQ storm occurs here:
>
> start_kernel
> setup_arch
> early_quirks # x86-only
> ...
> read_pci_config_16(num, slot, func, PCI_VENDOR_ID)
> outl(..., 0xcf8) # PCI segment 0 only
> inw(0xcfc)
> local_irq_enable
> ...
> native_irq_enable
> asm("sti") # <-- enable IRQ, storm occurs
>
> native_irq_enable() happens long before we discover PCI host bridges
> and run the normal PCI quirks, so those would be too late to disable
> MSIs.
Correct.
> It doesn't seem practical to disable MSIs in the kdump kernel at the
> PCI level. I was hoping we could disable them somewhere in the IRQ
> code, e.g., at IOAPICs, but I think Thomas is saying that's not
> feasible.
MSIs are not even going near the IOAPIC, and as long as the interrupt
core does not have an interrupt set up for the device, it has no idea
where to look to shut it down. Actually it does not even reach the
interrupt core. The raised vector arrives at the CPU and the low level
code sees: No handler associated, ignore it. We cannot do anything from
the low level code because all we know is that the vector was raised,
but we have absolutely zero clue where that came from. At that point the
IO-APIC interrupts are definitely not the problem because they are all
disabled.
> It seems like the only option left is to disable MSIs before the
> kexec. We used to clear the MSI/MSI-X Enable bits in
> pci_device_shutdown(), but that broke console devices that relied on
> MSI and caused "nobody cared" warnings when the devices fell back to
> using INTx, so fda78d7a0ead ("PCI/MSI: Stop disabling MSI/MSI-X in
> pci_device_shutdown()") left them unchanged.
That might be solvable because INTx arrives at the IO-APIC and we could
mask all the INTx related IO-APIC lines, but that's icky because of
this:
> pci_device_shutdown() still clears the Bus Master Enable bit if we're
> doing a kexec and the device is in D0-D3hot, which should also disable
> MSI/MSI-X. Why doesn't this solve the problem? Is this because the
> device causing the storm was in PCI_UNKNOWN state?
That's indeed a really good question.
Thanks,
tglx
Bjorn,
On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
> On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
>> pci_device_shutdown() still clears the Bus Master Enable bit if we're
>> doing a kexec and the device is in D0-D3hot, which should also disable
>> MSI/MSI-X. Why doesn't this solve the problem? Is this because the
>> device causing the storm was in PCI_UNKNOWN state?
>
> That's indeed a really good question.
So we do that on kexec, but is that true when starting a kdump kernel
from a kernel crash? I doubt it.
Thanks,
tglx
On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
> Bjorn,
>
> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're
> >> doing a kexec and the device is in D0-D3hot, which should also disable
> >> MSI/MSI-X. Why doesn't this solve the problem? Is this because the
> >> device causing the storm was in PCI_UNKNOWN state?
> >
> > That's indeed a really good question.
>
> So we do that on kexec, but is that true when starting a kdump kernel
> from a kernel crash? I doubt it.
Ah, right, I bet that's it, thanks. The kdump path is basically this:
crash_kexec
machine_kexec
while the usual kexec path is:
kernel_kexec
kernel_restart_prepare
device_shutdown
while (!list_empty(&devices_kset->list))
dev->bus->shutdown
pci_device_shutdown # pci_bus_type.shutdown
machine_kexec
So maybe we need to explore doing some or all of device_shutdown() in
the crash_kexec() path as well as in the kernel_kexec() path.
Bjorn,
On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote:
> On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
>> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
>> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
>> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're
>> >> doing a kexec and the device is in D0-D3hot, which should also disable
>> >> MSI/MSI-X. Why doesn't this solve the problem? Is this because the
>> >> device causing the storm was in PCI_UNKNOWN state?
>> >
>> > That's indeed a really good question.
>>
>> So we do that on kexec, but is that true when starting a kdump kernel
>> from a kernel crash? I doubt it.
>
> Ah, right, I bet that's it, thanks. The kdump path is basically this:
>
> crash_kexec
> machine_kexec
>
> while the usual kexec path is:
>
> kernel_kexec
> kernel_restart_prepare
> device_shutdown
> while (!list_empty(&devices_kset->list))
> dev->bus->shutdown
> pci_device_shutdown # pci_bus_type.shutdown
> machine_kexec
>
> So maybe we need to explore doing some or all of device_shutdown() in
> the crash_kexec() path as well as in the kernel_kexec() path.
The problem is that, if the machine crashed, anything you attempt
before starting the crash kernel reduces the chance that the crash
kernel actually starts.
Is there something at the root bridge level which allows to tell the
underlying busses to shut up, reset or go into a defined state? That
might avoid chasing lists which might be already unreliable.
Thanks,
tglx
[+cc Rafael for question about ACPI method for PCI host bridge reset]
On Sat, Nov 14, 2020 at 09:58:08PM +0100, Thomas Gleixner wrote:
> On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote:
> > On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
> >> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
> >> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
> >> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're
> >> >> doing a kexec and the device is in D0-D3hot, which should also disable
> >> >> MSI/MSI-X. Why doesn't this solve the problem? Is this because the
> >> >> device causing the storm was in PCI_UNKNOWN state?
> >> >
> >> > That's indeed a really good question.
> >>
> >> So we do that on kexec, but is that true when starting a kdump kernel
> >> from a kernel crash? I doubt it.
> >
> > Ah, right, I bet that's it, thanks. The kdump path is basically this:
> >
> > crash_kexec
> > machine_kexec
> >
> > while the usual kexec path is:
> >
> > kernel_kexec
> > kernel_restart_prepare
> > device_shutdown
> > while (!list_empty(&devices_kset->list))
> > dev->bus->shutdown
> > pci_device_shutdown # pci_bus_type.shutdown
> > machine_kexec
> >
> > So maybe we need to explore doing some or all of device_shutdown() in
> > the crash_kexec() path as well as in the kernel_kexec() path.
>
> The problem is that if the machine crashed anything you try to attempt
> before starting the crash kernel is reducing the chance that the crash
> kernel actually starts.
Right.
> Is there something at the root bridge level which allows to tell the
> underlying busses to shut up, reset or go into a defined state? That
> might avoid chasing lists which might be already unreliable.
Maybe we need some kind of crash_device_shutdown() that does the
minimal thing to protect the kdump kernel from devices.
The programming model for conventional PCI host bridges and PCIe Root
Complexes is device-specific since they're outside the PCI domain.
There probably *are* ways to do those things, but you would need a
native host bridge driver or something like an ACPI method. I'm not
aware of an ACPI way to do this, but I added Rafael in case he is.
A crash_device_shutdown() could do something at the host bridge level
if that's possible, or reset/disable bus mastering/disable MSI/etc on
individual PCI devices if necessary.
Bjorn
Bjorn Helgaas <[email protected]> writes:
> [+cc Rafael for question about ACPI method for PCI host bridge reset]
>
> On Sat, Nov 14, 2020 at 09:58:08PM +0100, Thomas Gleixner wrote:
>> On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote:
>> > On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
>> >> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
>> >> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
>> >> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're
>> >> >> doing a kexec and the device is in D0-D3hot, which should also disable
>> >> >> MSI/MSI-X. Why doesn't this solve the problem? Is this because the
>> >> >> device causing the storm was in PCI_UNKNOWN state?
>> >> >
>> >> > That's indeed a really good question.
>> >>
>> >> So we do that on kexec, but is that true when starting a kdump kernel
>> >> from a kernel crash? I doubt it.
>> >
>> > Ah, right, I bet that's it, thanks. The kdump path is basically this:
>> >
>> > crash_kexec
>> > machine_kexec
>> >
>> > while the usual kexec path is:
>> >
>> > kernel_kexec
>> > kernel_restart_prepare
>> > device_shutdown
>> > while (!list_empty(&devices_kset->list))
>> > dev->bus->shutdown
>> > pci_device_shutdown # pci_bus_type.shutdown
>> > machine_kexec
>> >
>> > So maybe we need to explore doing some or all of device_shutdown() in
>> > the crash_kexec() path as well as in the kernel_kexec() path.
>>
>> The problem is that if the machine crashed anything you try to attempt
>> before starting the crash kernel is reducing the chance that the crash
>> kernel actually starts.
>
> Right.
>
>> Is there something at the root bridge level which allows to tell the
>> underlying busses to shut up, reset or go into a defined state? That
>> might avoid chasing lists which might be already unreliable.
>
> Maybe we need some kind of crash_device_shutdown() that does the
> minimal thing to protect the kdump kernel from devices.
The kdump kernel does not use any memory the original kernel uses,
which should be a minimal and fairly robust level of protection
until the device drivers can be loaded and get ahold of things.
> The programming model for conventional PCI host bridges and PCIe Root
> Complexes is device-specific since they're outside the PCI domain.
> There probably *are* ways to do those things, but you would need a
> native host bridge driver or something like an ACPI method. I'm not
> aware of an ACPI way to do this, but I added Rafael in case he is.
>
> A crash_device_shutdown() could do something at the host bridge level
> if that's possible, or reset/disable bus mastering/disable MSI/etc on
> individual PCI devices if necessary.
Unless I am confused, DMA'ing to memory that is not already in use
is completely broken whether or not you are using the kdump kernel.
Eric
[email protected] (Eric W. Biederman) writes:
> Bjorn Helgaas <[email protected]> writes:
>
>> [+cc Rafael for question about ACPI method for PCI host bridge reset]
>>
>> On Sat, Nov 14, 2020 at 09:58:08PM +0100, Thomas Gleixner wrote:
>>> On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote:
>>> > On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
>>> >> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
>>> >> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
>>> >> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're
>>> >> >> doing a kexec and the device is in D0-D3hot, which should also disable
>>> >> >> MSI/MSI-X. Why doesn't this solve the problem? Is this because the
>>> >> >> device causing the storm was in PCI_UNKNOWN state?
>>> >> >
>>> >> > That's indeed a really good question.
>>> >>
>>> >> So we do that on kexec, but is that true when starting a kdump kernel
>>> >> from a kernel crash? I doubt it.
>>> >
>>> > Ah, right, I bet that's it, thanks. The kdump path is basically this:
>>> >
>>> > crash_kexec
>>> > machine_kexec
>>> >
>>> > while the usual kexec path is:
>>> >
>>> > kernel_kexec
>>> > kernel_restart_prepare
>>> > device_shutdown
>>> > while (!list_empty(&devices_kset->list))
>>> > dev->bus->shutdown
>>> > pci_device_shutdown # pci_bus_type.shutdown
>>> > machine_kexec
>>> >
>>> > So maybe we need to explore doing some or all of device_shutdown() in
>>> > the crash_kexec() path as well as in the kernel_kexec() path.
>>>
>>> The problem is that if the machine crashed anything you try to attempt
>>> before starting the crash kernel is reducing the chance that the crash
>>> kernel actually starts.
>>
>> Right.
>>
>>> Is there something at the root bridge level which allows to tell the
>>> underlying busses to shut up, reset or go into a defined state? That
>>> might avoid chasing lists which might be already unreliable.
>>
>> Maybe we need some kind of crash_device_shutdown() that does the
>> minimal thing to protect the kdump kernel from devices.
>
> The kdump kernel does not use any memory the original kernel uses.
> Which should be a minimal and fairly robust level of protection
> until the device drivers can be loaded and get ahold of things.
>
>> The programming model for conventional PCI host bridges and PCIe Root
>> Complexes is device-specific since they're outside the PCI domain.
>> There probably *are* ways to do those things, but you would need a
>> native host bridge driver or something like an ACPI method. I'm not
>> aware of an ACPI way to do this, but I added Rafael in case he is.
>>
>> A crash_device_shutdown() could do something at the host bridge level
>> if that's possible, or reset/disable bus mastering/disable MSI/etc on
>> individual PCI devices if necessary.
>
> Unless I am confused, DMA'ing to memory that is not already in use
> is completely broken whether or not you are using the kdump kernel.
Bah. I was confused because I had not read up-thread.
MSI mixes DMA and irqs so confusion is easy.
So the problem is screaming irqs when the kernel is booting up.
This is a fundamentally tricky problem.
For ordinary irqs you can have this with level-triggered irqs,
and the kernel has code that will shut down the irq at the ioapic
level. Then the kernel continues by polling the irq source.
I am still missing details, but my first question is: can our general
solution to screaming level-triggered irqs apply?
How can edge triggered MSI irqs be screaming?
Is there something we can do when enabling the APICs or IOAPICs that
would allow this to be handled better? My memory is that when we enable
the APICs and IOAPICs we completely clear the APIC entries, and so
should be disabling sources.
Is the problem perhaps that we wind up using an APIC entry that was
previously used for the MSI interrupt as something else when we
reprogram them? Even with this why doesn't the generic code
to stop screaming irqs apply here?
Eric
On Sun, Nov 15, 2020 at 04:11:43PM +0100, Thomas Gleixner wrote:
> Unfortunately there is no way to tell the APIC "Mask vector X" and the
> dump kernel does neither know which device it comes from nor does it
> have enumerated PCI completely which would reset the device and shutup
> the spew. Due to the interrupt storm it does not get that far.
Can't we just set DisINTx, clear MSI Enable and clear MSI-X Enable
on all active PCI devices in the crashing kernel before starting the
crash kernel? So that the crash kernel starts with a clean slate?
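The "clean slate" Lukas describes (set DisINTx, clear MSI Enable,
clear MSI-X Enable) can be sketched roughly as follows. This is a
mock working on an in-memory copy of config space; the function name
is hypothetical and real code would use config-space accessors and
would have to be crash-safe:

```c
#include <assert.h>
#include <stdint.h>

#define PCI_COMMAND               0x04
#define PCI_COMMAND_INTX_DISABLE  0x0400  /* bit 10 of Command      */
#define PCI_STATUS                0x06
#define PCI_STATUS_CAP_LIST       0x10
#define PCI_CAPABILITY_LIST       0x34
#define PCI_CAP_ID_MSI            0x05
#define PCI_CAP_ID_MSIX           0x11
#define PCI_MSI_FLAGS_ENABLE      0x0001  /* bit 0 of MSI flags     */
#define PCI_MSIX_FLAGS_ENABLE     0x8000  /* bit 15 of MSI-X flags  */

/* Quiesce one function: DisINTx plus MSI/MSI-X Enable cleared. */
static void quiesce_function(uint8_t cfg[256])
{
    uint8_t pos;
    int guard;

    /* Set DisINTx (high byte of the Command register). */
    cfg[PCI_COMMAND + 1] |= PCI_COMMAND_INTX_DISABLE >> 8;

    if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return;

    pos = cfg[PCI_CAPABILITY_LIST] & ~3;
    for (guard = 0; pos && guard < 48; guard++) {
        if (cfg[pos] == PCI_CAP_ID_MSI)
            cfg[pos + 2] &= (uint8_t)~PCI_MSI_FLAGS_ENABLE;
        else if (cfg[pos] == PCI_CAP_ID_MSIX)
            cfg[pos + 3] &= (uint8_t)~(PCI_MSIX_FLAGS_ENABLE >> 8);
        pos = cfg[pos + 1] & ~3;
    }
}
```

Whether this is safe to do from a crashing kernel is exactly the
concern Thomas raises below.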
Guilherme's original patches from 2018 iterate over all 256 PCI buses.
That might impact boot time negatively. The reason he has to do that
is because the crash kernel doesn't know which devices exist and
which have interrupts enabled. However, the crashing kernel does have
that information. It should either disable interrupts itself or pass
the necessary information to the crash kernel as setup_data or whatever.
Guilherme's patches add a "clearmsi" command line parameter. I guess
the idea is that the parameter is always passed to the crash kernel
but the patches don't do that, which seems odd.
Thanks,
Lukas
On Sun, Nov 15 2020 at 08:29, Eric W. Biederman wrote:
> [email protected] (Eric W. Biederman) writes:
> For ordinary irqs you can have this with level triggered irqs
> and the kernel has code that will shutdown the irq at the ioapic
> level. Then the kernel continues by polling the irq source.
>
> I am still missing details but my first question is can our general
> solution to screaming level triggered irqs apply?
No.
> How can edge triggered MSI irqs be screaming?
>
> Is there something we can do in enabling the APICs or IOAPICs that
> would allow this to be handled better. My memory when we enable
> the APICs and IOAPICs we completely clear the APIC entries and so
> should be disabling sources.
Yes, but MSI has nothing to do with APIC/IOAPIC
> Is the problem perhaps that we wind up using an APIC entry that was
> previously used for the MSI interrupt as something else when we
> reprogram them? Even with this why doesn't the generic code
> to stop screaming irqs apply here?
Again. No. The problem cannot be solved at the APIC level. The APIC is
the receiving end of MSI and has absolutely no control over it.
An MSI interrupt is a (DMA) write to the local APIC base address
0xfeexxxxx which has the target CPU and control bits encoded in the
lower bits. The written data is the vector and more control bits.
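The address/data encoding Thomas describes can be sketched as a simple
decode (simplified x86 without interrupt remapping; the struct and
helper are illustrative only):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified decode of an x86 MSI address/data pair. The device
 * performs a DMA write of 'data' to 'addr' in the 0xFEExxxxx window;
 * the CPU whose APIC ID matches the destination field receives the
 * vector encoded in the low byte of the data. Nothing on the APIC
 * side can mask this; only the device can be told to stop writing.
 */
struct msi_decoded {
    uint8_t dest_apic_id;
    uint8_t vector;
};

static struct msi_decoded msi_decode(uint32_t addr, uint32_t data)
{
    struct msi_decoded d;

    /* MSI address window is 0xFEExxxxx */
    assert((addr & 0xfff00000u) == 0xfee00000u);
    d.dest_apic_id = (addr >> 12) & 0xff;   /* destination ID, bits 19:12 */
    d.vector       = data & 0xff;           /* vector in data bits 7:0    */
    return d;
}
```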
The only way to stop that is to shut it up at the PCI device level.
Assume the following situation:
- PCI device has MSI enabled and a valid target vector assigned
- Kernel crashes
- Kdump kernel starts
- PCI device raises interrupts which result in the MSI write
- Kdump kernel enables interrupts and the pending vector is raised in
the CPU.
- The CPU has no interrupt descriptor assigned to the vector
and does not even know where the interrupt originated from. So it
treats it like any other spurious interrupt to an unassigned vector,
emits a ratelimited message and ACKs the interrupt at the APIC.
- PCI device behaves stupid and reraises the interrupt for whatever
reason.
- Lather, rinse and repeat.
Unfortunately there is no way to tell the APIC "Mask vector X", and the
dump kernel neither knows which device the interrupt comes from nor has
it fully enumerated PCI (which would reset the device and shut up the
spew). Due to the interrupt storm it does not get that far.
Thanks,
tglx
On Sun, Nov 15 2020 at 18:01, Lukas Wunner wrote:
> On Sun, Nov 15, 2020 at 04:11:43PM +0100, Thomas Gleixner wrote:
>> Unfortunately there is no way to tell the APIC "Mask vector X" and the
>> dump kernel does neither know which device it comes from nor does it
>> have enumerated PCI completely which would reset the device and shutup
>> the spew. Due to the interrupt storm it does not get that far.
>
> Can't we just set DisINTx, clear MSI Enable and clear MSI-X Enable
> on all active PCI devices in the crashing kernel before starting the
> crash kernel? So that the crash kernel starts with a clean slate?
>
> Guilherme's original patches from 2018 iterate over all 256 PCI buses.
> That might impact boot time negatively. The reason he has to do that
> is because the crashing kernel doesn't know which devices exist and
> which have interrupts enabled. However the crashing kernel has that
> information. It should either disable interrupts itself or pass the
> necessary information to the crashing kernel as setup_data or whatever.
As I explained before: the problem with doing anything between crashing
and starting the crash kernel is that it reduces the chance of actually
starting it. The kernel crashed for whatever reason, so it's in a
completely undefined state.
Ergo there is no 'just do something'. You really have to think hard
about what can be done safely with the least probability of running into
another problem.
Thanks,
tglx
Thomas Gleixner <[email protected]> writes:
> On Sun, Nov 15 2020 at 08:29, Eric W. Biederman wrote:
>> [email protected] (Eric W. Biederman) writes:
>> For ordinary irqs you can have this with level triggered irqs
>> and the kernel has code that will shutdown the irq at the ioapic
>> level. Then the kernel continues by polling the irq source.
>>
>> I am still missing details but my first question is can our general
>> solution to screaming level triggered irqs apply?
>
> No.
>
>> How can edge triggered MSI irqs be screaming?
>>
>> Is there something we can do in enabling the APICs or IOAPICs that
>> would allow this to be handled better. My memory when we enable
>> the APICs and IOAPICs we completely clear the APIC entries and so
>> should be disabling sources.
>
> Yes, but MSI has nothing to do with APIC/IOAPIC
Yes, sorry. It has been long enough that the details were paged out
of my memory.
>> Is the problem perhaps that we wind up using an APIC entry that was
>> previously used for the MSI interrupt as something else when we
>> reprogram them? Even with this why doesn't the generic code
>> to stop screaming irqs apply here?
>
> Again. No. The problem cannot be solved at the APIC level. The APIC is
> the receiving end of MSI and has absolutely no control over it.
>
> An MSI interrupt is a (DMA) write to the local APIC base address
> 0xfeexxxxx which has the target CPU and control bits encoded in the
> lower bits. The written data is the vector and more control bits.
>
> The only way to stop that is to shut it up at the PCI device level.
Or to write to magic chipset registers that will stop transforming DMA
writes to 0xfeexxxxx into x86 interrupts. With an IOMMU I know x86 has
such registers (because the point of the IOMMU is to limit the problems
rogue devices can cause). Without an IOMMU I don't know if x86 has any
such registers. I remember that other platforms have an interrupt
controller that does provide some level of control. The fact that x86
does not is what makes this an x86-specific problem.
The generic solution is to have the PCI code disable bus mastering
when it is enumerating and initializing devices. Last time I was
paying attention that was actually the policy of the PCI layer, and
drivers that did not enable bus mastering were considered buggy.
Looking at patch 3/3, what this patchset does is an early disable of
the MSI registers, which is mostly reasonable. Especially since, as has
been pointed out, the only location the x86 vector and x86 CPU can
be found is in the MSI configuration registers.
That also seems reasonable. But Bjorn's concern about not finding all
devices in all domains does seem real.
There are a handful of devices where the Bus master disable bit doesn't
disable bus mastering. I wonder if there are devices where MSI and MSIX
disables don't fully work. It seems completely possible to have MSI or
MSIX equivalent registers at a non-standard location as drivers must be
loaded to handle them.
So if we can safely and reliably disable DMA and MSI at the generic PCI
device level during boot up I am all for it.
How difficult would it be to tell the IOMMUs to stop passing traffic
through in an early pci quirk? The problem setup was apparently someone
using the device directly from a VM. So I presume there is an IOMMU
in that configuration.
> Unfortunately there is no way to tell the APIC "Mask vector X" and the
> dump kernel does neither know which device it comes from nor does it
> have enumerated PCI completely which would reset the device and shutup
> the spew. Due to the interrupt storm it does not get that far.
So the question is how do we make this robust?
Can we perhaps disable all interrupts in this case and limp along
in polling mode until the pci bus has been enumerated?
It is nice and desirable to be able to use the hardware in
high-performance mode in a kexec-on-panic situation, but if we can
detect a problem and figure out how to limp along, sometimes that is
acceptable.
The failure mode in the kexec-on-panic kernel is definitely the correct
one. We could not figure out how to wrestle the hardware into usability,
so we fail to take a crash dump or do anything else that might corrupt
the system.
Eric
"Guilherme G. Piccoli" <[email protected]> writes:
> First of all, thanks everybody for the great insights/discussion! This
> thread ended-up being a great learning for (at least) me.
>
> Given the big amount of replies and intermixed comments, I wouldn't be
> able to respond inline to all, so I'll try another approach below.
>
>
> From Bjorn:
> "I think [0] proposes using early_quirks() to disable MSIs at boot-time.
> That doesn't seem like a robust solution because (a) the problem affects
> all arches but early_quirks() is x86-specific and (b) even on x86
> early_quirks() only works for PCI segment 0 because it relies on the
> 0xCF8/0xCFC I/O ports."
>
> Ah. I wasn't aware of that limitation, I thought enhancing the
> early_quirks() search to go through all buses would fix that, thanks for
> the clarification! And again, it's worth clarifying that this is not a
> problem affecting all arches _in practice_ - PowerPC for example has the
> FW primitives allowing a powerful PCI controller (out-of-band) reset,
> preventing this kind of issue usually.
>
> [0]
> https://lore.kernel.org/linux-pci/[email protected]
>
>
> From Bjorn:
> "A crash_device_shutdown() could do something at the host bridge level
> if that's possible, or reset/disable bus mastering/disable MSI/etc on
> individual PCI devices if necessary."
>
> From Lukas:
> "Guilherme's original patches from 2018 iterate over all 256 PCI buses.
> That might impact boot time negatively. The reason he has to do that is
> because the crashing kernel doesn't know which devices exist and which
> have interrupts enabled. However the crashing kernel has that
> information. It should either disable interrupts itself or pass the
> necessary information to the crashing kernel as setup_data or whatever.
>
> Guilherme's patches add a "clearmsi" command line parameter. I guess
> the idea is that the parameter is always passed to the crash kernel but
> the patches don't do that, which seems odd."
>
> Very good points Lukas, thanks! The reason for not adding the clearmsi
> thing as a default kdump procedure is kinda related to your first point:
> it impacts boot time a bit, and it's additional logic in the kdump
> kernel, which we know is (sometimes) the last resort for gathering
> additional data to debug a potentially complex issue. That said, I'd not
> like to enforce this kind of "policy" to everybody, so my implementation
> relies on having it as a parameter, and the kdump userspace counter-part
> could then have a choice in adding or not such mechanism to the kdump
> kernel parameter list.
>
> About passing the data to next kernel, this is very interesting, we
> could do something like that either through setup_data (as you said) or
> using a new proposal, the PKRAM thing [a].
> This way we wouldn't need a crash_device_shutdown(); instead, when the
> kernel is loading a crash kernel (through kexec -p) we can collect all
> the necessary information that'll be passed to the crash kernel
> (obviously, if we are collecting PCI topology information, we need
> callbacks in the PCI hotplug add/del path to update this information).
>
> [a]
> https://lore.kernel.org/lkml/[email protected]/
>
> Below, inline reply to Eric's last message.
>
>
> On 15/11/2020 17:46, Eric W. Biederman wrote:
>>> [...]
>>> An MSI interrupt is a (DMA) write to the local APIC base address
>>> 0xfeexxxxx which has the target CPU and control bits encoded in the
>>> lower bits. The written data is the vector and more control bits.
>>>
>>> The only way to stop that is to shut it up at the PCI device level.
>>
>> Or to write to magic chipset registers that will stop transforming DMA
>> writes to 0xfeexxxxx into x86 interrupts. With an IOMMU I know x86 has
>> such registers (because the point of the IOMMU is to limit the problems
>> rogue devices can cause). Without an IOMMU I don't know if x86 has any
>> such registers. I remember that other platforms have an interrupt
>> controller that does provide some level of control. That x86 does not
>> is what makes this an x86 specific problem.
>> [...]
>> Looking at patch 3/3, what this patchset does is an early disable
>> of the msi registers. Which is mostly reasonable. Especially as has
>> been pointed out the only location the x86 vector and x86 cpu can
>> be found is in the msi configuration registers.
>>
>> That also seems reasonable. But Bjorn's concern about not finding all
>> devices in all domains does seem real.
>> [...]
>> So if we can safely and reliably disable DMA and MSI at the generic PCI
>> device level during boot up I am all for it.
>>
>>
>> How difficult would it be to tell the IOMMUs to stop passing traffic
>> through in an early pci quirk? The problem setup was apparently someone
>> using the device directly from a VM. So I presume there is an IOMMU
>> in that configuration.
>
> This is a good idea, I think - we considered something like that in
> theory while working the problem back in 2018, but given I'm even less
> of an expert in IOMMU than I am in PCI, the option was to go with the
> PCI approach. And you are right, the original problem was a device in
> PCI passthrough to a VM, and a tool messing with the PF device in the
> SR-IOV (to collect some stats) from the host side, through VFIO IIRC.
>
> Anyway, we could split the problem in two after all the considerations
> in the thread, I believe:
>
> (1) If it's possible to set up the IOMMU to prevent MSIs, by "blocking"
> the DMA writes for PCI devices *until* PCI core code properly initializes
> the devices, that'd handle the majority of the cases I guess (IOMMU usage
> is quite common nowadays).
>
> (2) Collecting PCI topology information in a running kernel and passing
> that to the kdump kernel would allow us to disable the PCI devices' MSI
> capabilities; the only problem here is that I couldn't see how to do
> that early enough, before the 'sti' executes, without relying on
> early_quirks(). ACPI maybe? As per Bjorn's comment explaining the
> limitations of early_quirks(), this problem seems not to be trivial.
>
> I'm a bit against doing that in crash_kexec() for the reasons mentioned
> in the thread, especially by Thomas and Eric - but if there's no other
> way... maybe this is the way to go.
The way to do that would be to collect the set of pci devices when the
kexec on panic kernel is loaded, not during crash_kexec. If someone
performs a device hotplug they would need to reload the kexec on panic
kernel.
I am not necessarily endorsing that just pointing out how it can be
done.
Eric
On Mon, Nov 16, 2020 at 6:45 PM Eric W. Biederman <[email protected]> wrote:
> The way to do that would be to collect the set of pci devices when the
> kexec on panic kernel is loaded, not during crash_kexec. If someone
> performs a device hotplug they would need to reload the kexec on panic
> kernel.
>
> I am not necessarily endorsing that just pointing out how it can be
> done.
>
> Eric
Thanks Eric, I agree! I think if we use something like PKRAM (a more
dynamic approach) we could have the PCI hotplug path updating the data
to-be-passed to the crash kernel, so the crash kernel doesn't even
need to be loaded again.
On Mon, Nov 16, 2020 at 05:31:36PM -0300, Guilherme G. Piccoli wrote:
> First of all, thanks everybody for the great insights/discussion! This
> thread ended up being a great learning experience for (at least) me.
>
> Given the large number of replies and intermixed comments, I wouldn't be
> able to respond inline to all, so I'll try another approach below.
>
>
> From Bjorn:
> "I think [0] proposes using early_quirks() to disable MSIs at boot-time.
> That doesn't seem like a robust solution because (a) the problem affects
> all arches but early_quirks() is x86-specific and (b) even on x86
> early_quirks() only works for PCI segment 0 because it relies on the
> 0xCF8/0xCFC I/O ports."
>
> Ah. I wasn't aware of that limitation; I thought enhancing the
> early_quirks() search to go through all buses would fix that, thanks for
> the clarification! And again, it's worth clarifying that this is not a
> problem affecting all arches _in practice_ - PowerPC, for example, has
> FW primitives allowing a powerful (out-of-band) PCI controller reset,
> which usually prevents this kind of issue.
>
> [0]
> https://lore.kernel.org/linux-pci/[email protected]
>
>
> From Bjorn:
> "A crash_device_shutdown() could do something at the host bridge level
> if that's possible, or reset/disable bus mastering/disable MSI/etc on
> individual PCI devices if necessary."
>
> From Lukas:
> "Guilherme's original patches from 2018 iterate over all 256 PCI buses.
> That might impact boot time negatively. The reason he has to do that is
> because the crash kernel doesn't know which devices exist and which
> have interrupts enabled. However, the crashing kernel has that
> information. It should either disable interrupts itself or pass the
> necessary information to the crash kernel as setup_data or whatever.
I don't think passing the device information to the kdump kernel is
really practical. The kdump kernel would use it to do PCI config
writes to disable MSIs before enabling IRQs, and it doesn't know how
to access config space that early.
We could invent special "early config access" things, but that gets
really complicated really fast. Config access depends on ACPI MCFG
tables, firmware interfaces, and in many cases, on the native host
bridge drivers in drivers/pci/controllers/.
I think we need to disable MSIs in the crashing kernel before the
kexec. It adds a little more code in the crash_kexec() path, but it
seems like a worthwhile tradeoff.
Bjorn
Bjorn Helgaas <[email protected]> writes:
> I don't think passing the device information to the kdump kernel is
> really practical. The kdump kernel would use it to do PCI config
> writes to disable MSIs before enabling IRQs, and it doesn't know how
> to access config space that early.
I don't think it is particularly practical either. But in practice
on x86 it is either mmio writes or 0xcf8 style writes and we could
pass a magic table that would have all of that information.
> We could invent special "early config access" things, but that gets
> really complicated really fast. Config access depends on ACPI MCFG
> tables, firmware interfaces, and in many cases, on the native host
> bridge drivers in drivers/pci/controllers/.
I do agree that the practical problem with passing information early
is that it gets us into the weeds and creates code that we only care
about in the case of kexec-on-panic. It is much better to make the
existing code more robust, so that we reduce our dependency on firmware
doing the right thing.
> I think we need to disable MSIs in the crashing kernel before the
> kexec. It adds a little more code in the crash_kexec() path, but it
> seems like a worthwhile tradeoff.
Disabling MSIs in the b0rken kernel is not possible.
Walking the device tree or even a significant subset of it hugely
increases the chances that we will run into something that is incorrect
in the known broken kernel. I expect the code to do that would double
or triple the amount of code that must be executed in the known broken
kernel. The last time something like that happened (switching from xchg
to ordinary locks) we had cases that stopped working. Walking all of
the pci devices in the system is much more invasive.
That is not to downplay the problems of figuring out how to disable
things in early boot.
My two top candidates are poking the IOMMUs early to shut things off,
and figuring out if we can delay enabling interrupts until we have
initialized pci.
Poking at IOMMUs early should work for most systems with ``enterprise''
hardware. Systems where people care about kdump the most.
Eric
First of all, thanks everybody for the great insights/discussion! This
thread ended up being a great learning experience for (at least) me.
Given the large number of replies and intermixed comments, I wouldn't be
able to respond inline to all, so I'll try another approach below.
From Bjorn:
"I think [0] proposes using early_quirks() to disable MSIs at boot-time.
That doesn't seem like a robust solution because (a) the problem affects
all arches but early_quirks() is x86-specific and (b) even on x86
early_quirks() only works for PCI segment 0 because it relies on the
0xCF8/0xCFC I/O ports."
Ah. I wasn't aware of that limitation; I thought enhancing the
early_quirks() search to go through all buses would fix that, thanks for
the clarification! And again, it's worth clarifying that this is not a
problem affecting all arches _in practice_ - PowerPC, for example, has
FW primitives allowing a powerful (out-of-band) PCI controller reset,
which usually prevents this kind of issue.
[0]
https://lore.kernel.org/linux-pci/[email protected]
From Bjorn:
"A crash_device_shutdown() could do something at the host bridge level
if that's possible, or reset/disable bus mastering/disable MSI/etc on
individual PCI devices if necessary."
From Lukas:
"Guilherme's original patches from 2018 iterate over all 256 PCI buses.
That might impact boot time negatively. The reason he has to do that is
because the crash kernel doesn't know which devices exist and which
have interrupts enabled. However, the crashing kernel has that
information. It should either disable interrupts itself or pass the
necessary information to the crash kernel as setup_data or whatever.
Guilherme's patches add a "clearmsi" command line parameter. I guess
the idea is that the parameter is always passed to the crash kernel but
the patches don't do that, which seems odd."
Very good points Lukas, thanks! The reason for not adding the clearmsi
step as a default kdump procedure is related to your first point:
it impacts boot time a bit, and it's additional logic in the kdump
kernel, which we know is (sometimes) the last resort for gathering
additional data to debug a potentially complex issue. That said, I'd
rather not enforce this kind of "policy" on everybody, so my
implementation relies on having it as a parameter, and the kdump
userspace counterpart can then choose whether or not to add such a
mechanism to the kdump kernel parameter list.
About passing the data to the next kernel, this is very interesting; we
could do something like that either through setup_data (as you said) or
using a new proposal, the PKRAM thing [a].
That way we wouldn't need a crash_device_shutdown(); instead, when the
kernel is loading a crash kernel (through kexec -p) we can collect all
the necessary information to be passed to the crash kernel (obviously,
if we are collecting PCI topology information, we need callbacks in the
PCI hotplug add/del path to keep it updated).
[a]
https://lore.kernel.org/lkml/[email protected]/
Below, inline reply to Eric's last message.
On 15/11/2020 17:46, Eric W. Biederman wrote:
>> [...]
>> An MSI interrupt is a (DMA) write to the local APIC base address
>> 0xfeexxxxx which has the target CPU and control bits encoded in the
>> lower bits. The written data is the vector and more control bits.
>>
>> The only way to stop that is to shut it up at the PCI device level.
>
> Or to write to magic chipset registers that will stop transforming DMA
> writes to 0xfeexxxxx into x86 interrupts. With an IOMMU I know x86 has
> such registers (because the point of the IOMMU is to limit the problems
> rogue devices can cause). Without an IOMMU I don't know if x86 has any
> such registers. I remember that other platforms have an interrupt
> controller that does provide some level of control. That x86 does not
> is what makes this an x86 specific problem.
> [...]
> Looking at patch 3/3, what this patchset does is an early disable
> of the msi registers. Which is mostly reasonable. Especially as has
> been pointed out the only location the x86 vector and x86 cpu can
> be found is in the msi configuration registers.
>
> That also seems reasonable. But Bjorn's concern about not finding all
> devices in all domains does seem real.
> [...]
> So if we can safely and reliably disable DMA and MSI at the generic PCI
> device level during boot up I am all for it.
>
>
> How difficult would it be to tell the IOMMUs to stop passing traffic
> through in an early pci quirk? The problem setup was apparently someone
> using the device directly from a VM. So I presume there is an IOMMU
> in that configuration.
This is a good idea, I think - we considered something like that in
theory while working the problem back in 2018, but given I'm even less
of an expert in IOMMU than I am in PCI, the option was to go with the
PCI approach. And you are right, the original problem was a device in
PCI passthrough to a VM, and a tool messing with the PF device in the
SR-IOV (to collect some stats) from the host side, through VFIO IIRC.
Anyway, we could split the problem in two after all the considerations
in the thread, I believe:
(1) If it's possible to set up the IOMMU to prevent MSIs, by "blocking"
the DMA writes for PCI devices *until* PCI core code properly initializes
the devices, that'd handle the majority of the cases I guess (IOMMU usage
is quite common nowadays).
(2) Collecting PCI topology information in a running kernel and passing
that to the kdump kernel would allow us to disable the PCI devices' MSI
capabilities; the only problem here is that I couldn't see how to do
that early enough, before the 'sti' executes, without relying on
early_quirks(). ACPI maybe? As per Bjorn's comment explaining the
limitations of early_quirks(), this problem seems not to be trivial.
I'm a bit against doing that in crash_kexec() for the reasons mentioned
in the thread, especially by Thomas and Eric - but if there's no other
way... maybe this is the way to go.
Cheers,
Guilherme
P.S. Fixed Gavin's bouncing address, sorry for that.
On Mon, Nov 16 2020 at 19:06, Eric W. Biederman wrote:
> Bjorn Helgaas <[email protected]> writes:
> My two top candidates are poking the IOMMUs early to shut things off,
> and figuring out if we can delay enabling interrupts until we have
> initialized pci.
Keeping interrupts disabled post PCI initialization would be nice, but
that requires feeding the whole init machinery through a shredder and
collecting the bits and pieces.
> Poking at IOMMUs early should work for most systems with ``enterprise''
> hardware. Systems where people care about kdump the most.
The IOMMU IRQ remapping part _is_ initialized early and _before_
interrupts are enabled.
But that does not solve the problem either simply because then the IOMMU
will catch the rogue MSIs and you get an interrupt storm on the IOMMU
error interrupt.
This one is not going to be turned off because the IOMMU error interrupt
handler will handle each of them and tell the core code that everything
is under control.
As I explained several times now, the only way to shut this up reliably
is at the source. Curing the symptom is almost never a good solution.
Thanks,
tglx
On Mon, Nov 16, 2020 at 10:07 PM Eric W. Biederman
<[email protected]> wrote:
> [...]
> > I think we need to disable MSIs in the crashing kernel before the
> > kexec. It adds a little more code in the crash_kexec() path, but it
> > seems like a worthwhile tradeoff.
>
> Disabling MSIs in the b0rken kernel is not possible.
>
> Walking the device tree or even a significant subset of it hugely
> increases the chances that we will run into something that is incorrect
> in the known broken kernel. I expect the code to do that would double
> or triple the amount of code that must be executed in the known broken
> kernel. The last time something like that happened (switching from xchg
> to ordinary locks) we had cases that stopped working. Walking all of
> the pci devices in the system is much more invasive.
>
I think we could try to decouple this problem into two, if that makes
sense. Bjorn/others can jump in and correct me if I'm wrong. So, the
problem is to walk a PCI topology tree, identify every device and
disable MSI (/INTx maybe) in them, and the big deal with doing that in
the broken kernel before the kexec is that this task is complex, mainly
due to the tree walk. But what if we keep a table (a simple 2-D
array) with the address and data to be written to disable the MSIs,
and before the kexec we could have a parameter enabling a function
that just goes through this array and performs the writes?
The table itself would be constructed by the PCI core (and updated in
the hotplug path), when device discovery is happening. This table
would live in a protected area in memory, with no write/execute
access, this way if the kernel itself tries to corrupt that, we get a
fault (yeah, I know DMAs can corrupt anywhere, but IOMMU could protect
against that). If the parameter "kdump_clear_msi" is passed in the
cmdline of the regular kernel, for example, then the function walks
this simple table and performs the devices' writes before the kexec...
If that's not possible or a bad idea for any reason, I still think the
early_quirks() idea hereby proposed is not something we should
discard, because it'll help a lot of users even with its limitations
(in our case it worked very well).
Also, taking here the opportunity to clarify my understanding about
the limitations of that approach: Bjorn, in our reproducer machine we
had 3 parents in the PCI tree (as per lspci -t), 0000:00, 0000:ff and
0000:80 - are those all under "segment 0" as per your verbiage? In our
case the troublemaker was under 0000:80, and the early_quirks() shut
the device up successfully.
Thanks,
Guilherme
On Tue, 2020-11-17 at 10:53 +0100, Thomas Gleixner wrote:
> But that does not solve the problem either simply because then the IOMMU
> will catch the rogue MSIs and you get an interrupt storm on the IOMMU
> error interrupt.
Not if you can tell the IOMMU to stop reporting those errors.
We can easily do it per-device for DMA errors; not quite sure what
granularity we have for interrupts. Perhaps the Intel IOMMU only lets
you set the Fault Processing Disable bit per IRTE entry, and you still
get faults for Compatibility Format interrupts? Not sure about AMD...
On Tue, Nov 17 2020 at 12:19, David Woodhouse wrote:
> On Tue, 2020-11-17 at 10:53 +0100, Thomas Gleixner wrote:
>> But that does not solve the problem either simply because then the IOMMU
>> will catch the rogue MSIs and you get an interrupt storm on the IOMMU
>> error interrupt.
>
> Not if you can tell the IOMMU to stop reporting those errors.
>
> We can easily do it per-device for DMA errors; not quite sure what
> granularity we have for interrupts. Perhaps the Intel IOMMU only lets
> you set the Fault Processing Disable bit per IRTE entry, and you still
> get faults for Compatibility Format interrupts? Not sure about AMD...
It looks like the fault (DMAR) and event (AMD) interrupts can be
disabled in the IOMMU. That might help to bridge the gap until the PCI
bus is scanned in full glory and the devices can be shut up for real.
If we make this conditional for a crash dump kernel that might do the
trick.
Lots of _might_ there :)
Thanks
tglx
Thomas Gleixner <[email protected]> writes:
> On Tue, Nov 17 2020 at 12:19, David Woodhouse wrote:
>> On Tue, 2020-11-17 at 10:53 +0100, Thomas Gleixner wrote:
>>> But that does not solve the problem either simply because then the IOMMU
>>> will catch the rogue MSIs and you get an interrupt storm on the IOMMU
>>> error interrupt.
>>
>> Not if you can tell the IOMMU to stop reporting those errors.
>>
>> We can easily do it per-device for DMA errors; not quite sure what
>> granularity we have for interrupts. Perhaps the Intel IOMMU only lets
>> you set the Fault Processing Disable bit per IRTE entry, and you still
>> get faults for Compatibility Format interrupts? Not sure about AMD...
>
> It looks like the fault (DMAR) and event (AMD) interrupts can be
> disabled in the IOMMU. That might help to bridge the gap until the PCI
> bus is scanned in full glory and the devices can be shut up for real.
>
> If we make this conditional for a crash dump kernel that might do the
> trick.
>
> Lots of _might_ there :)
Worth testing.
Folks tracking this down: is this enough of a hint for you to write a
patch and test?
Also worth checking how close irqpoll is to handling a case like this.
At least historically it did a pretty good job at shutting down problem
interrupts.
I really find it weird that an edge triggered irq was firing fast enough
to stop a system from booting. Level triggered irqs do that if they are
acknowledged without actually being handled. I think edge triggered
irqs only fire when another event comes in, and it seems odd to get so
many actual events causing interrupts that the system soft locks
up. Is my memory of that situation confused?
I recommend making these facilities general debug facilities so that
they can be used for cases other than crash dump. The crash dump kernel
would always enable them because it can assume that the hardware is
very likely in a wonky state.
Eric
On Tue, Nov 17, 2020 at 09:04:07AM -0300, Guilherme Piccoli wrote:
> Also, taking here the opportunity to clarify my understanding about
> the limitations of that approach: Bjorn, in our reproducer machine we
> had 3 parents in the PCI tree (as per lspci -t), 0000:00, 0000:ff and
> 0000:80 - are those all under "segment 0" as per your verbiage?
Yes. The "0000" is the PCI segment (or "domain" in the Linux PCI core).
It's common on x86 to have multiple host bridges in segment 0000.
Thanks a lot Bjorn! I confess that, except for PPC64 server machines, I
have never seen other "domains" or segments. Is it common on x86 to have
that? And are the early_quirks() restricted to the first segment, no
matter how many host bridges we have in segment 0000?
Thanks again!
On Wed, Nov 18, 2020 at 07:36:08PM -0300, Guilherme Piccoli wrote:
> Thanks a lot Bjorn! I confess that, except for PPC64 server machines, I
> have never seen other "domains" or segments. Is it common on x86 to have
> that? And are the early_quirks() restricted to the first segment, no
> matter how many host bridges we have in segment 0000?
I don't know whether it's *common* to have multiple domains on x86,
but they're definitely used on large systems. Here's some lspci info
from an HPE Superdome Flex system:
https://lore.kernel.org/lkml/[email protected]/
The early quirks in arch/x86/kernel/early-quirks.c (not the
DECLARE_PCI_FIXUP_EARLY quirks in drivers/pci/quirks.c) are restricted
to segment 0, no matter how many host bridges there are. This is
because they use read_pci_config_16(), which uses a config access
mechanism that has no provision for a segment number.
Bjorn