2020-03-18 19:34:04

by Mathias Nyman

Subject: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Hi

I can reproduce the lost MSI interrupt issue on 5.6-rc6 which includes
the "Plug non-maskable MSI affinity race" patch.

I can see this on a couple of platforms. I'm running a script that first generates
a lot of usb traffic, and then in a busy loop sets irq affinity and turns cpus off
and on:

for i in 1 3 5 7; do
echo "1" > /sys/devices/system/cpu/cpu$i/online
done
echo "A" > "/proc/irq/*/smp_affinity"
echo "A" > "/proc/irq/*/smp_affinity"
echo "F" > "/proc/irq/*/smp_affinity"
for i in 1 3 5 7; do
echo "0" > /sys/devices/system/cpu/cpu$i/online
done

I added some very simple debugging, but I don't really know what to look for.
xhci interrupts (122) just stop after an msi affinity setting; the interrupt survived
many similar msi_set_affinity() calls before this.

I'm not that familiar with the inner workings of this, but I'll be happy to
help out with adding debugging and testing patches.

Details:

cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
0: 26 0 0 0 0 0 0 0 IO-APIC 2-edge timer
1: 0 0 0 0 0 7 0 0 IO-APIC 1-edge i8042
4: 0 4 59941 0 0 0 0 0 IO-APIC 4-edge ttyS0
8: 0 0 0 0 0 0 1 0 IO-APIC 8-edge rtc0
9: 0 40 8 0 0 0 0 0 IO-APIC 9-fasteoi acpi
16: 0 0 0 0 0 0 0 0 IO-APIC 16-fasteoi i801_smbus
120: 0 0 293 0 0 0 0 0 PCI-MSI 32768-edge i915
121: 728 0 0 58 0 0 0 0 PCI-MSI 520192-edge enp0s31f6
122: 63575 2271 0 1957 7262 0 0 0 PCI-MSI 327680-edge xhci_hcd
123: 0 0 0 0 0 0 0 0 PCI-MSI 514048-edge snd_hda_intel:card0
NMI: 0 0 0 0 0 0 0 0 Non-maskable interrupts

trace snippet:
<idle>-0 [001] d.h. 129.676900: xhci_irq: xhci irq
<idle>-0 [001] d.h. 129.677507: xhci_irq: xhci irq
<idle>-0 [001] d.h. 129.677556: xhci_irq: xhci irq
<idle>-0 [001] d.h. 129.677647: xhci_irq: xhci irq
<...>-14 [001] d..1 129.679802: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 2 -> 6
<idle>-0 [003] d.h. 129.682639: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.682769: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.682908: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.683552: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.683677: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.683819: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.689017: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.689140: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.689307: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.689984: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.690107: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.690278: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.695541: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.695674: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.695839: xhci_irq: xhci irq
<idle>-0 [003] d.H. 129.696667: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.696797: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.696973: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.702288: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.702380: xhci_irq: xhci irq
<idle>-0 [003] d.h. 129.702493: xhci_irq: xhci irq
migration/3-24 [003] d..1 129.703150: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 6 -> 0
kworker/0:0-5 [000] d.h. 131.328790: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
kworker/0:0-5 [000] d.h. 133.312704: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
kworker/0:0-5 [000] d.h. 135.360786: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
<idle>-0 [000] d.h. 137.344694: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
kworker/0:0-5 [000] d.h. 139.128679: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
kworker/0:0-5 [000] d.h. 141.312686: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
kworker/0:0-5 [000] d.h. 143.360703: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
kworker/0:0-5 [000] d.h. 145.344791: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0


--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -92,6 +92,10 @@ msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force)
 	    cfg->vector == old_cfg.vector ||
 	    old_cfg.vector == MANAGED_IRQ_SHUTDOWN_VECTOR ||
 	    cfg->dest_apicid == old_cfg.dest_apicid) {
+		trace_printk("direct update msi %u, vector %u -> %u, apicid: %u -> %u\n",
+			     irqd->irq,
+			     old_cfg.vector, cfg->vector,
+			     old_cfg.dest_apicid, cfg->dest_apicid);
 		irq_msi_update_msg(irqd, cfg);
 		return ret;
 	}
@@ -134,7 +138,10 @@ msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force)
 	 */
 	if (IS_ERR_OR_NULL(this_cpu_read(vector_irq[cfg->vector])))
 		this_cpu_write(vector_irq[cfg->vector], VECTOR_RETRIGGERED);
-
+	trace_printk("twostep update msi, irq %u, vector %u -> %u, apicid: %u -> %u\n",
+		     irqd->irq,
+		     old_cfg.vector, cfg->vector,
+		     old_cfg.dest_apicid, cfg->dest_apicid);
 	/* Redirect it to the new vector on the local CPU temporarily */
 	old_cfg.vector = cfg->vector;
 	irq_msi_update_msg(irqd, &old_cfg);

Thanks
-Mathias


2020-03-19 20:25:31

by Evan Green

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

On Wed, Mar 18, 2020 at 12:23 PM Mathias Nyman
<[email protected]> wrote:
>
> Hi
>
> I can reproduce the lost MSI interrupt issue on 5.6-rc6 which includes
> the "Plug non-maskable MSI affinity race" patch.
>
> I can see this on a couple platforms, I'm running a script that first generates
> a lot of usb traffic, and then in a busyloop sets irq affinity and turns off
> and on cpus:
>
> for i in 1 3 5 7; do
> echo "1" > /sys/devices/system/cpu/cpu$i/online
> done
> echo "A" > "/proc/irq/*/smp_affinity"
> echo "A" > "/proc/irq/*/smp_affinity"
> echo "F" > "/proc/irq/*/smp_affinity"
> for i in 1 3 5 7; do
> echo "0" > /sys/devices/system/cpu/cpu$i/online
> done
>
> I added some very simple debugging but I don't really know what to look for.
> xhci interrupts (122) just stop after a setting msi affinity, it survived many
> similar msi_set_affinity() calls before this.
>
> I'm not that familiar with the inner workings of this, but I'll be happy to
> help out with adding debugging and testing patches.

How quickly are you able to reproduce this when you run your script?
Does reverting Thomas' patch make it repro faster? Can you send the
output of lspci -vvv for the xhci device?

-Evan

2020-03-20 08:05:41

by Mathias Nyman

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

On 19.3.2020 22.24, Evan Green wrote:
> On Wed, Mar 18, 2020 at 12:23 PM Mathias Nyman
> <[email protected]> wrote:
>>
>> Hi
>>
>> I can reproduce the lost MSI interrupt issue on 5.6-rc6 which includes
>> the "Plug non-maskable MSI affinity race" patch.
>>
>> I can see this on a couple platforms, I'm running a script that first generates
>> a lot of usb traffic, and then in a busyloop sets irq affinity and turns off
>> and on cpus:
>>
>> for i in 1 3 5 7; do
>> echo "1" > /sys/devices/system/cpu/cpu$i/online
>> done
>> echo "A" > "/proc/irq/*/smp_affinity"
>> echo "A" > "/proc/irq/*/smp_affinity"
>> echo "F" > "/proc/irq/*/smp_affinity"
>> for i in 1 3 5 7; do
>> echo "0" > /sys/devices/system/cpu/cpu$i/online
>> done
>>
>> I added some very simple debugging but I don't really know what to look for.
>> xhci interrupts (122) just stop after a setting msi affinity, it survived many
>> similar msi_set_affinity() calls before this.
>>
>> I'm not that familiar with the inner workings of this, but I'll be happy to
>> help out with adding debugging and testing patches.
>
> How quickly are you able to reproduce this when you run your script?

Less than a minute

> Does reverting Thomas' patch make it repro faster? Can you send the
> output of lspci -vvv for the xhci device?

It's roughly the same after reverting
"6f1a4891a592 x86/apic/msi: Plug non-maskable MSI affinity race"

# lspci -vvv -d :a3af
00:14.0 USB controller: Intel Corporation Device a3af (prog-if 30 [XHCI])
Subsystem: Intel Corporation Device 7270
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin ? routed to IRQ 122
Region 0: Memory at b1020000 (64-bit, non-prefetchable) [size=64K]
Capabilities: [70] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [80] MSI: Enable+ Count=1/8 Maskable- 64bit+
Address: 00000000fee08000 Data: 4021
Kernel driver in use: xhci_hcd

-Mathias

2020-03-20 09:54:49

by Thomas Gleixner

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Mathias,

Mathias Nyman <[email protected]> writes:
> I can reproduce the lost MSI interrupt issue on 5.6-rc6 which includes
> the "Plug non-maskable MSI affinity race" patch.
>
> I can see this on a couple platforms, I'm running a script that first generates
> a lot of usb traffic, and then in a busyloop sets irq affinity and turns off
> and on cpus:
>
> for i in 1 3 5 7; do
> echo "1" > /sys/devices/system/cpu/cpu$i/online
> done
> echo "A" > "/proc/irq/*/smp_affinity"
> echo "A" > "/proc/irq/*/smp_affinity"
> echo "F" > "/proc/irq/*/smp_affinity"
> for i in 1 3 5 7; do
> echo "0" > /sys/devices/system/cpu/cpu$i/online
> done
> trace snippet:
> <idle>-0 [001] d.h. 129.676900: xhci_irq: xhci irq
> <idle>-0 [001] d.h. 129.677507: xhci_irq: xhci irq
> <idle>-0 [001] d.h. 129.677556: xhci_irq: xhci irq
> <idle>-0 [001] d.h. 129.677647: xhci_irq: xhci irq
> <...>-14 [001] d..1 129.679802: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 2 -> 6

Looks like a regular affinity setting in interrupt context, but I can't
make sense of the time stamps

> <idle>-0 [003] d.h. 129.682639: xhci_irq: xhci irq
> <idle>-0 [003] d.h. 129.702380: xhci_irq: xhci irq
> <idle>-0 [003] d.h. 129.702493: xhci_irq: xhci irq
> migration/3-24 [003] d..1 129.703150: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 6 -> 0

So this is a CPU offline operation and after that irq 122 is silent, right?

> kworker/0:0-5 [000] d.h. 131.328790: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
> kworker/0:0-5 [000] d.h. 133.312704: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
> kworker/0:0-5 [000] d.h. 135.360786: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
> <idle>-0 [000] d.h. 137.344694: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
> kworker/0:0-5 [000] d.h. 139.128679: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
> kworker/0:0-5 [000] d.h. 141.312686: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
> kworker/0:0-5 [000] d.h. 143.360703: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
> kworker/0:0-5 [000] d.h. 145.344791: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0

That kworker context looks fishy. Can you please enable stacktraces in
the tracer so I can see the call chains leading to this? OTOH that's irq
121 not 122. Anyway moar information is always useful.

And please add the patch below.

Thanks,

tglx

8<---------------
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -243,6 +243,7 @@ u64 arch_irq_stat(void)
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
 
 	desc = __this_cpu_read(vector_irq[vector]);
+	trace_printk("vector: %u desc %lx\n", vector, (unsigned long) desc);
 	if (likely(!IS_ERR_OR_NULL(desc))) {
 		if (IS_ENABLED(CONFIG_X86_32))
 			handle_irq(desc, regs);

2020-03-23 09:42:00

by Mathias Nyman

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

On 20.3.2020 11.52, Thomas Gleixner wrote:
> Mathias,
>
> Mathias Nyman <[email protected]> writes:
>> I can reproduce the lost MSI interrupt issue on 5.6-rc6 which includes
>> the "Plug non-maskable MSI affinity race" patch.
>>
>> I can see this on a couple platforms, I'm running a script that first generates
>> a lot of usb traffic, and then in a busyloop sets irq affinity and turns off
>> and on cpus:
>>
>> for i in 1 3 5 7; do
>> echo "1" > /sys/devices/system/cpu/cpu$i/online
>> done
>> echo "A" > "/proc/irq/*/smp_affinity"
>> echo "A" > "/proc/irq/*/smp_affinity"
>> echo "F" > "/proc/irq/*/smp_affinity"
>> for i in 1 3 5 7; do
>> echo "0" > /sys/devices/system/cpu/cpu$i/online
>> done
>> trace snippet:
>> <idle>-0 [001] d.h. 129.676900: xhci_irq: xhci irq
>> <idle>-0 [001] d.h. 129.677507: xhci_irq: xhci irq
>> <idle>-0 [001] d.h. 129.677556: xhci_irq: xhci irq
>> <idle>-0 [001] d.h. 129.677647: xhci_irq: xhci irq
>> <...>-14 [001] d..1 129.679802: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 2 -> 6
>
> Looks like a regular affinity setting in interrupt context, but I can't
> make sense of the time stamps

I think so; everything still worked normally after this one.

>
>> <idle>-0 [003] d.h. 129.682639: xhci_irq: xhci irq
>> <idle>-0 [003] d.h. 129.702380: xhci_irq: xhci irq
>> <idle>-0 [003] d.h. 129.702493: xhci_irq: xhci irq
>> migration/3-24 [003] d..1 129.703150: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 6 -> 0
>
> So this is a CPU offline operation and after that irq 122 is silent, right?

Yes, after this irq 122 was silent.

>
>> kworker/0:0-5 [000] d.h. 131.328790: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>> kworker/0:0-5 [000] d.h. 133.312704: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>> kworker/0:0-5 [000] d.h. 135.360786: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>> <idle>-0 [000] d.h. 137.344694: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>> kworker/0:0-5 [000] d.h. 139.128679: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>> kworker/0:0-5 [000] d.h. 141.312686: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>> kworker/0:0-5 [000] d.h. 143.360703: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>> kworker/0:0-5 [000] d.h. 145.344791: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>
> That kworker context looks fishy. Can you please enable stacktraces in
> the tracer so I can see the call chains leading to this? OTOH that's irq
> 121 not 122. Anyway moar information is always useful.
>
> And please add the patch below.
>

Full function trace with patch is huge, can be found compressed at

https://drive.google.com/drive/folders/19AFZe32DYk4Kzxi8VYv-OWmNOCyIY6M5?usp=sharing

xhci_traces.tgz contains:
trace_full: full function trace.
trace: timestamp ~48.29 to ~48.93 of trace above, section with last xhci irq
trace_prink_only: only trace_printk() of "trace" above

This time xhci interrupts stopped after
migration/3-24 [003] d..1 48.530271: msi_set_affinity: twostep update msi, irq 122, vector 33 -> 34, apicid: 6 -> 4

Thanks
-Mathias

2020-03-23 14:13:01

by Thomas Gleixner

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Mathias,

Mathias Nyman <[email protected]> writes:
> Full function trace with patch is huge, can be found compressed at
>
> This time xhci interrupts stopped after
> migration/3-24 [003] d..1 48.530271: msi_set_affinity: twostep update msi, irq 122, vector 33 -> 34, apicid: 6 -> 4

thanks for providing the data. I think I decoded the issue. Can you
please test the patch below?

Thanks,

tglx
----
diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
index 159bd0cb8548..49aec92fa35b 100644
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -63,12 +63,13 @@ msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force)
 {
 	struct irq_cfg old_cfg, *cfg = irqd_cfg(irqd);
 	struct irq_data *parent = irqd->parent_data;
-	unsigned int cpu;
+	unsigned int cpu, old_vector;
 	int ret;
 
 	/* Save the current configuration */
 	cpu = cpumask_first(irq_data_get_effective_affinity_mask(irqd));
 	old_cfg = *cfg;
+	old_vector = old_cfg.vector;
 
 	/* Allocate a new target vector */
 	ret = parent->chip->irq_set_affinity(parent, mask, force);
@@ -90,10 +91,10 @@ msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force)
 	 */
 	if (!irqd_msi_nomask_quirk(irqd) ||
 	    cfg->vector == old_cfg.vector ||
-	    old_cfg.vector == MANAGED_IRQ_SHUTDOWN_VECTOR ||
+	    old_vector == MANAGED_IRQ_SHUTDOWN_VECTOR ||
 	    cfg->dest_apicid == old_cfg.dest_apicid) {
 		irq_msi_update_msg(irqd, cfg);
-		return ret;
+		goto check_old;
 	}
 
 	/*
@@ -102,7 +103,7 @@ msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force)
 	 */
 	if (WARN_ON_ONCE(cpu != smp_processor_id())) {
 		irq_msi_update_msg(irqd, cfg);
-		return ret;
+		goto check_old;
 	}
 
 	/*
@@ -163,6 +164,11 @@ msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force)
 	if (lapic_vector_set_in_irr(cfg->vector))
 		irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
 
+check_old:
+	if (!in_irq() && old_vector != MANAGED_IRQ_SHUTDOWN_VECTOR &&
+	    irqd_msi_nomask_quirk(irqd) && lapic_vector_set_in_irr(old_vector))
+		irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
+
 	return ret;
 }

2020-03-23 20:31:08

by Mathias Nyman

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

On 23.3.2020 16.10, Thomas Gleixner wrote:
>
> thanks for providing the data. I think I decoded the issue. Can you
> please test the patch below?
>

Unfortunately it didn't help.
Traces added to xhci_trace2.tgz in the same place

https://drive.google.com/drive/folders/19AFZe32DYk4Kzxi8VYv-OWmNOCyIY6M5?usp=sharing

Thanks
-Mathias

2020-03-24 00:25:00

by Thomas Gleixner

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Mathias Nyman <[email protected]> writes:
> On 23.3.2020 16.10, Thomas Gleixner wrote:
>>
>> thanks for providing the data. I think I decoded the issue. Can you
>> please test the patch below?
>
> Unfortunately it didn't help.

I did not expect that to help, simply because the same issue is caught
by the loop in fixup_irqs(). What I wanted to make sure of is that there is
nothing in between which causes the latter to fail.

So I stared at the trace data earlier today and looked at the xhci irq
events. They follow a more or less periodic schedule, and the forced
migration on CPU hotplug definitely hits in the time frame in which the
next interrupt should be raised by the device.

1) First of all, I do not understand why new systems released in 2020
still use non-maskable MSI, which is the root cause of all of this
trouble, especially on Intel systems which are known to have these
disastrous interrupt migration troubles.

Please tell your hardware people to stop this.

2) I have no idea why the two step mechanism fails exactly on this
system. I tried the same test case on a skylake client and I can
clearly see from the traces that the interrupt raised in the device
falls exactly into the two step update and causes the IRR to be set
which resolves the situation by IPI'ing the new target CPU.

I have not found a single instance of IPI recovery in your
traces. Instead of that your system stops working in exactly this
situation.

The two step mechanism tries to work around the fact that PCI does
not support a 64bit atomic config space update. So we carefully avoid
changing more than one 32bit value at a time, i.e. we change first
the vector and then the destination ID (part of address_lo). This
ensures that the message is consistent all the time.

But obviously on your system this does not work as expected. Why? I
really can't tell.

Please talk to your hardware folks.
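
To make the two-step update above concrete, here is a condensed sketch of the
sequence. This is not the verbatim kernel code: the real msi_set_affinity() in
arch/x86/kernel/apic/msi.c also reserves the new vector on the local CPU,
handles the maskable and startup fast paths, and runs with the vector lock
held; the function name below is illustrative only.

/*
 * Condensed sketch of the non-maskable MSI two-step affinity update
 * (simplified; error handling, locking and fast paths omitted).
 */
static int msi_set_affinity_sketch(struct irq_data *irqd,
				   const struct cpumask *mask, bool force)
{
	struct irq_cfg old_cfg, *cfg = irqd_cfg(irqd);
	struct irq_data *parent = irqd->parent_data;
	int ret;

	old_cfg = *cfg;

	/* Allocate the new vector on the new target CPU */
	ret = parent->chip->irq_set_affinity(parent, mask, force);
	if (ret < 0)
		return ret;

	/* Step 1: update only the data word: new vector, old destination */
	old_cfg.vector = cfg->vector;
	irq_msi_update_msg(irqd, &old_cfg);

	/* Step 2: update only address_lo: same vector, new destination CPU */
	irq_msi_update_msg(irqd, cfg);

	/*
	 * An interrupt raised between step 1 and step 2 went to the new
	 * vector on the old (current) CPU and is only visible in the local
	 * APIC's IRR, so kick it over to the new target.
	 */
	if (lapic_vector_set_in_irr(cfg->vector))
		irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);

	return ret;
}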

And of course all of this is so well documented that all of us can
clearly figure out what's going on...

Thanks,

tglx






2020-03-24 16:20:23

by Evan Green

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

On Mon, Mar 23, 2020 at 5:24 PM Thomas Gleixner <[email protected]> wrote:
>
> Mathias Nyman <[email protected]> writes:
> > On 23.3.2020 16.10, Thomas Gleixner wrote:
> >>
> >> thanks for providing the data. I think I decoded the issue. Can you
> >> please test the patch below?
> >
> > Unfortunately it didn't help.
>
> I did not expect that to help, simply because the same issue is caught
> by the loop in fixup_irqs(). What I wanted to make sure is that there is
> not something in between which causes the latter to fail.
>
> So I stared at the trace data earlier today and looked at the xhci irq
> events. They are following a more or less periodic schedule and the
> forced migration on CPU hotplug hits definitely in the time frame where
> the next interrupt should be raised by the device.
>
> 1) First off all I do not have to understand why new systems released
> in 2020 still use non-maskable MSI which is the root cause of all of
> this trouble especially in Intel systems which are known to have
> this disastrouos interrupt migration troubles.
>
> Please tell your hardware people to stop this.
>
> 2) I have no idea why the two step mechanism fails exactly on this
> system. I tried the same test case on a skylake client and I can
> clearly see from the traces that the interrupt raised in the device
> falls exactly into the two step update and causes the IRR to be set
> which resolves the situation by IPI'ing the new target CPU.
>
> I have not found a single instance of IPI recovery in your
> traces. Instead of that your system stops working in exactly this
> situation.
>
> The two step mechanism tries to work around the fact that PCI does
> not support a 64bit atomic config space update. So we carefully avoid
> changing more than one 32bit value at a time, i.e. we change first
> the vector and then the destination ID (part of address_lo). This
> ensures that the message is consistent all the time.
>
> But obviously on your system this does not work as expected. Why? I
> really can't tell.
>
> Please talk to your hardware folks.
>
> And of course all of this is so well documented that all of us can
> clearly figure out what's going on...

I won't pretend to know what's going on, so I'll preface this by
labeling it all as "flailing", but:

I wonder if there's some way the interrupt can get delayed between
XHCI snapping the torn value and it finding its way into the IRR. For
instance, if xhci read this value at the start of their interrupt
moderation timer period, that would be awful (I hope they don't do
this). One test patch would be to carve out 8 vectors reserved for
xhci on all cpus. Whenever you change the affinity, the assigned
vector is always reserved_base + cpu_number. That lets you exercise
the affinity switching code, but in a controlled manner where torn
interrupts could be easily seen (ie hey I got an interrupt on cpu 4's
vector but I'm cpu 2). I might struggle to write such a change, but in
theory it's doable.

Actually the slightly easier experiment might be to reserve a single
vector for xhci on all cores. We'd strongly expect the problem to go
away, since now there are no more torn writes because the vector is always
the same. If it somehow didn't go away, you'd know there's more to the
story.

I was alternately trying to build a theory in my head about the write
somehow being posted and getting out of order, but I don't think that
can happen.

Another experiment would be to try my old patch in [1]. I'm not
advocating for this patch as a solution, Thomas and Bjorn have
convinced me that it will break the rest of the world. But your PCI
device 0xa3af seems to be Comet Lake. I was also on Comet Lake. So I'd
expect to at least see it mask your problem. Again, if it didn't, that
might be an interesting datapoint.

[1] https://lore.kernel.org/lkml/20200117162444.v2.1.I9c7e72144ef639cc135ea33ef332852a6b33730f@changeid/

-Evan

2020-03-24 19:06:23

by Thomas Gleixner

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Evan Green <[email protected]> writes:
> On Mon, Mar 23, 2020 at 5:24 PM Thomas Gleixner <[email protected]> wrote:
>> And of course all of this is so well documented that all of us can
>> clearly figure out what's going on...
>
> I won't pretend to know what's going on, so I'll preface this by
> labeling it all as "flailing", but:
>
> I wonder if there's some way the interrupt can get delayed between
> XHCI snapping the torn value and it finding its way into the IRR. For
> instance, if xhci read this value at the start of their interrupt
> moderation timer period, that would be awful (I hope they don't do
> this). One test patch would be to carve out 8 vectors reserved for
> xhci on all cpus. Whenever you change the affinity, the assigned
> vector is always reserved_base + cpu_number. That lets you exercise
> the affinity switching code, but in a controlled manner where torn
> interrupts could be easily seen (ie hey I got an interrupt on cpu 4's
> vector but I'm cpu 2). I might struggle to write such a change, but in
> theory it's doable.

Well, the point is that we don't see a spurious interrupt on any
CPU. We added a traceprintk into do_IRQ() and that would immediately
tell us where the thing goes off into lala land. Which it didn't.

> I was alternately trying to build a theory in my head about the write
> somehow being posted and getting out of order, but I don't think that
> can happen.

If that happens then the lost XHCI interrupt is the least of your
worries.

Thanks,

tglx


2020-03-25 17:10:32

by Mathias Nyman

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

On 24.3.2020 18.17, Evan Green wrote:
>
> Another experiment would be to try my old patch in [1]. I'm not
> advocating for this patch as a solution, Thomas and Bjorn have
> convinced me that it will break the rest of the world. But your PCI
> device 0xa3af seems to be Comet Lake. I was also on Comet Lake. So I'd
> expect to at least see it mask your problem. Again, if it didn't, that
> might be an interesting datapoint.
>
> [1] https://lore.kernel.org/lkml/20200117162444.v2.1.I9c7e72144ef639cc135ea33ef332852a6b33730f@changeid/
>

Yes, I'm testing on a Comet Lake; oddly, your patch doesn't mask the problem.
I did a few rounds, and it seems that xhci interrupts don't always stop when both
the vector and the apicid are updated. Sometimes just an apicid update triggers it.

(tail of) round 1:
<idle>-0 [001] d.h. 249.040051: xhci_irq: xhci irq
<idle>-0 [001] d.h. 249.040122: do_IRQ: vector: 33 desc ffff9e84075b1800
<idle>-0 [001] d.h. 249.040125: xhci_irq: xhci irq
migration/1-14 [001] d..1 249.045231: msi_set_affinity: direct update msi 123, vector 33 -> 33, apicid: 2 -> 6
migration/3-24 [003] d..1 249.069563: msi_set_affinity: twostep update msi, irq 123, vector 33 -> 35, apicid: 6 -> 0
<idle>-0 [000] d.h. 250.294934: do_IRQ: vector: 33 desc ffff9e84066df200
<idle>-0 [000] d.h. 250.294940: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 0 -> 0
<idle>-0 [000] d.h. 252.276099: do_IRQ: vector: 33 desc ffff9e84066df200
round 2:
<idle>-0 [001] d.h. 75.580577: do_IRQ: vector: 33 desc ffffa0449d084a00
<idle>-0 [001] d.h. 75.580578: xhci_irq: xhci irq
<idle>-0 [001] d.h. 75.580639: do_IRQ: vector: 33 desc ffffa0449d084a00
<idle>-0 [001] d.h. 75.580640: xhci_irq: xhci irq
<...>-14 [001] d..1 75.585814: msi_set_affinity: direct update msi 123, vector 33 -> 33, apicid: 2 -> 6
migration/3-24 [003] d..1 75.606381: msi_set_affinity: direct update msi 123, vector 33 -> 33, apicid: 6 -> 0
<idle>-0 [000] d.h. 76.275792: do_IRQ: vector: 35 desc ffffa04486009a00

round 3:
<idle>-0 [001] d.h. 117.383266: xhci_irq: xhci irq
<idle>-0 [001] d.h. 117.383407: do_IRQ: vector: 33 desc ffff92515d0d5800
<idle>-0 [001] d.h. 117.383410: xhci_irq: xhci irq
migration/1-14 [001] d..1 117.388527: msi_set_affinity: direct update msi 123, vector 33 -> 33, apicid: 2 -> 6
migration/3-24 [003] d..1 117.413877: msi_set_affinity: direct update msi 123, vector 33 -> 33, apicid: 6 -> 0
<idle>-0 [000] d.h. 119.290474: do_IRQ: vector: 35 desc ffff92514757b600

round 4:
<idle>-0 [003] d.h. 147.950569: do_IRQ: vector: 33 desc ffff976fdd04ca00
<idle>-0 [003] d.h. 147.950569: xhci_irq: xhci irq
<idle>-0 [003] d.h. 147.950689: do_IRQ: vector: 33 desc ffff976fdd04ca00
<idle>-0 [003] d.h. 147.950690: xhci_irq: xhci irq
migration/3-24 [003] d..1 147.951368: msi_set_affinity: direct update msi 123, vector 33 -> 33, apicid: 6 -> 0
<idle>-0 [000] d.h. 148.491193: do_IRQ: vector: 35 desc ffff976fdcc9da00
<idle>-0 [000] d.h. 148.491198: msi_set_affinity: direct update msi 122, vector 35 -> 35, apicid: 0 -> 0

-Mathias

2020-05-01 18:47:16

by Raj, Ashok

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Hi Thomas

I just started looking into this to get some idea of what could be
going on. I have some questions that would be helpful to clarify.

On Tue, Mar 24, 2020 at 08:03:44PM +0100, Thomas Gleixner wrote:
> Evan Green <[email protected]> writes:
> > On Mon, Mar 23, 2020 at 5:24 PM Thomas Gleixner <[email protected]> wrote:
> >> And of course all of this is so well documented that all of us can
> >> clearly figure out what's going on...
> >
> > I won't pretend to know what's going on, so I'll preface this by
> > labeling it all as "flailing", but:
> >
> > I wonder if there's some way the interrupt can get delayed between
> > XHCI snapping the torn value and it finding its way into the IRR. For
> > instance, if xhci read this value at the start of their interrupt
> > moderation timer period, that would be awful (I hope they don't do
> > this). One test patch would be to carve out 8 vectors reserved for
> > xhci on all cpus. Whenever you change the affinity, the assigned
> > vector is always reserved_base + cpu_number. That lets you exercise
> > the affinity switching code, but in a controlled manner where torn
> > interrupts could be easily seen (ie hey I got an interrupt on cpu 4's
> > vector but I'm cpu 2). I might struggle to write such a change, but in
> > theory it's doable.
>
> Well, the point is that we don't see a spurious interrupt on any
> CPU. We added a traceprintk into do_IRQ() and that would immediately
> tell us where the thing goes off into lala land. Which it didn't.

Now that we don't have the torn write issue, we did an experiment
with legacy MSI and no interrupt remap support. One thought was:
since we don't have a way to ensure that previous MSI writes
are globally observed, a read from the device should flush any
outstanding writes, correct? (According to PCI; I'm not sure if we can
depend on this, or whether chipsets would take their own sweet time to push it to the CPU.)

I'm not certain such a read happens today. So, to keep it simple, I tried
to force a retrigger. In the following case of a direct update,
even though the vector isn't changing, an MSI write to the previous
destination could have been sent to the previous CPU, right?

arch/x86/kernel/apic/msi.c: msi_set_affinity()

	/*
	 * Direct update is possible when:
	 * - The MSI is maskable (remapped MSI does not use this code path)).
	 *   The quirk bit is not set in this case.
	 * - The new vector is the same as the old vector
	 * - The old vector is MANAGED_IRQ_SHUTDOWN_VECTOR (interrupt starts up)
	 * - The new destination CPU is the same as the old destination CPU
	 */
	if (!irqd_msi_nomask_quirk(irqd) ||
	    cfg->vector == old_cfg.vector ||
	    old_cfg.vector == MANAGED_IRQ_SHUTDOWN_VECTOR ||
	    cfg->dest_apicid == old_cfg.dest_apicid) {
		irq_msi_update_msg(irqd, cfg);
		-->>> force a retrigger

It appears that without a guarantee of flushing MSI writes from the device,
the check for lapic_vector_set_in_irr(vector) is still racy.

With the forced retrigger added in both places, the test didn't reveal any
lost interrupt cases.
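
The exact test patch is not included in the thread; presumably the forced
retrigger looked roughly like the sketch below, a hypothetical reconstruction
applied to the direct-update branch quoted above, with the same retrigger call
also added after the two-step update:

	if (!irqd_msi_nomask_quirk(irqd) ||
	    cfg->vector == old_cfg.vector ||
	    old_cfg.vector == MANAGED_IRQ_SHUTDOWN_VECTOR ||
	    cfg->dest_apicid == old_cfg.dest_apicid) {
		irq_msi_update_msg(irqd, cfg);
		/* Experiment (hypothetical): retrigger unconditionally instead
		 * of relying on the IRR check, to cover an MSI message that
		 * may still be in flight to the old destination CPU.
		 */
		irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
		return ret;
	}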

Now the second question with Interrupt Remapping Support:


intel_ir_set_affinity->intel_ir_reconfigure_irte()-> modify_irte()

The flush of the Interrupt Entry Cache (IEC) should ensure that, if any interrupts
were in flight, they made it to the previous CPU, and that any new interrupts
are delivered to the new CPU.

The question is: do we need a check similar to the legacy MSI handling?

if (lapic_vector_set_in_irr())
handle interrupt?

Is there a reason we don't check whether the interrupt was delivered to the previous
CPU in intel_ir_set_affinity()? Or does send_cleanup_vector() send
an IPI to perform the cleanup?

It appears that maybe send_cleanup_vector() sends an IPI to the old cpu
and that somehow ensures the device interrupt handler actually gets
called? I lost track somewhere down there :)


Cheers,
Ashok

2020-05-05 19:40:29

by Thomas Gleixner

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Ashok,

"Raj, Ashok" <[email protected]> writes:
> On Tue, Mar 24, 2020 at 08:03:44PM +0100, Thomas Gleixner wrote:
>> Evan Green <[email protected]> writes:
>> Well, the point is that we don't see a spurious interrupt on any
>> CPU. We added a traceprintk into do_IRQ() and that would immediately
>> tell us where the thing goes off into lala land. Which it didn't.
>
> Now that we don't have the torn write issue. We did an experiment
> with legacy MSI, and no interrupt remap support. One of the thought
> process was, since we don't have a way to ensure that previous MSI writes
> are globally observed, a read from the device should flush any
> outstanidng writes correct? (according to PCI, not sure if we can
> depend on this.. or chipsets would take their own sweet time to push
> to CPU)

Aargh. Indeed.

> I'm not certain if such a read happens today? So to make it simple tried
> to force a retrigger. In the following case of direct update,
> even though the vector isn't changing a MSI write to the previous
> destination could have been sent to the previous CPU right?

Just checked the pci msi write code and there is no read at the end
which would flush the darn posted write. Duh, I never noticed.

> With adding the forced retrigger in both places, the test didn't reveal any
> lost interrupt cases.

Not a surprise, but not what we really want.

> Now the second question with Interrupt Remapping Support:
>
> intel_ir_set_affinity->intel_ir_reconfigure_irte()-> modify_irte()
>
> The flush of Interrupt Entry Cache (IEC) should ensure, if any interrupts
> were in flight, they made it to the previous CPU, and any new interrupts
> must be delivered to the new CPU.
>
> Question is do we need a check similar to the legacy MSI handling
>
> if (lapic_vector_set_in_irr())
> handle interrupt?
>
> Is there a reason we don't check if the interrupt delivered to previous
> CPU in intel_ir_set_affinity()? Or is the send_cleanup_vector() sends
> an IPI to perform the cleanup?
>
> It appears that maybe send_cleanup_vector() sends IPI to the old cpu
> and that somehow ensures the device interrupt handler actually getting
> called? I lost my track somewhere down there :)

The way it works is:

1) New vector on different CPU is allocated

2) New vector is installed

3) When the first interrupt on the new CPU arrives then the cleanup
IPI is sent to the previous CPU which tears down the old vector

So if the interrupt is sent to the original CPU just before #2 then this
will be correctly handled on that CPU after the set affinity code
reenables interrupts.
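
For reference, the decision to send that cleanup IPI is made in the vector
domain's interrupt entry path; it has roughly the following shape (simplified
from arch/x86/kernel/apic/vector.c, so treat the details as approximate):

/*
 * Runs on the CPU that just received the interrupt. Only when the
 * interrupt arrived on the new vector on the new target CPU is the
 * cleanup IPI sent to tear down the old vector on the old CPU.
 */
static void __irq_complete_move(struct irq_cfg *cfg, unsigned int vector)
{
	struct apic_chip_data *apicd;

	apicd = container_of(cfg, struct apic_chip_data, hw_irq_cfg);
	if (likely(!apicd->move_in_progress))
		return;

	if (vector == apicd->vector && apicd->cpu == smp_processor_id())
		__send_cleanup_vector(apicd);
}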

Can you try the patch below?

Thanks,

tglx

8<--------------
drivers/pci/msi.c | 2 ++
1 file changed, 2 insertions(+)

--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -323,6 +323,7 @@ void __pci_write_msi_msg(struct msi_desc
 		writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
 		writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
 		writel(msg->data, base + PCI_MSIX_ENTRY_DATA);
+		readl(base + PCI_MSIX_ENTRY_DATA);
 	} else {
 		int pos = dev->msi_cap;
 		u16 msgctl;
@@ -343,6 +344,7 @@ void __pci_write_msi_msg(struct msi_desc
 			pci_write_config_word(dev, pos + PCI_MSI_DATA_32,
 					      msg->data);
 		}
+		pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
 	}
 
 skip:

2020-05-05 20:17:58

by Ashok Raj

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

On Tue, May 05, 2020 at 09:36:04PM +0200, Thomas Gleixner wrote:
> Ashok,
>
>
> > Now the second question with Interrupt Remapping Support:
> >
> > intel_ir_set_affinity->intel_ir_reconfigure_irte()-> modify_irte()
> >
> > The flush of Interrupt Entry Cache (IEC) should ensure, if any interrupts
> > were in flight, they made it to the previous CPU, and any new interrupts
> > must be delivered to the new CPU.
> >
> > Question is do we need a check similar to the legacy MSI handling
> >
> > if (lapic_vector_set_in_irr())
> > handle interrupt?
> >
> > Is there a reason we don't check if the interrupt delivered to previous
> > CPU in intel_ir_set_affinity()? Or is the send_cleanup_vector() sends
> > an IPI to perform the cleanup?
> >
> > It appears that maybe send_cleanup_vector() sends IPI to the old cpu
> > and that somehow ensures the device interrupt handler actually getting
> > called? I lost my track somewhere down there :)
>
> The way it works is:
>
> 1) New vector on different CPU is allocated
>
> 2) New vector is installed
>
> 3) When the first interrupt on the new CPU arrives then the cleanup
> IPI is sent to the previous CPU which tears down the old vector
>
> So if the interrupt is sent to the original CPU just before #2 then this
> will be correctly handled on that CPU after the set affinity code
> reenables interrupts.

I'll have this test tried out, but in msi_set_affinity() the check below
for lapic_vector_set_in_irr(cfg->vector) is on the new vector, correct?
Don't we want to check for old_cfg.vector instead?

msi_set_affinity()
{
	....

	unlock_vector_lock();

	/*
	 * Check whether the transition raced with a device interrupt and
	 * is pending in the local APICs IRR. It is safe to do this outside
	 * of vector lock as the irq_desc::lock of this interrupt is still
	 * held and interrupts are disabled: The check is not accessing the
	 * underlying vector store. It's just checking the local APIC's
	 * IRR.
	 */
	if (lapic_vector_set_in_irr(cfg->vector))
		irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
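
For context, the helper being discussed reads the IRR of the local APIC of the
CPU executing the check; its shape is roughly the following (paraphrased from
arch/x86/include/asm/apic.h as added by the nomask-quirk patch, so treat it as
a sketch):

static inline bool lapic_vector_set_in_irr(unsigned int vector)
{
	/* The IRR is a 256bit register split into eight 32bit chunks
	 * spaced 0x10 apart in APIC register space.
	 */
	u32 irr = apic_read(APIC_IRR + (vector / 32 * 0x10));

	return !!(irr & (1U << (vector % 32)));
}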


>
> Can you try the patch below?
>
> Thanks,
>
> tglx
>
> 8<--------------
> drivers/pci/msi.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -323,6 +323,7 @@ void __pci_write_msi_msg(struct msi_desc
> writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
> writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
> writel(msg->data, base + PCI_MSIX_ENTRY_DATA);
> + readl(base + PCI_MSIX_ENTRY_DATA);
> } else {
> int pos = dev->msi_cap;
> u16 msgctl;
> @@ -343,6 +344,7 @@ void __pci_write_msi_msg(struct msi_desc
> pci_write_config_word(dev, pos + PCI_MSI_DATA_32,
> msg->data);
> }
> + pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
> }
>
> skip:

2020-05-05 21:55:15

by Thomas Gleixner

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Ashok,

"Raj, Ashok" <[email protected]> writes:
> On Tue, May 05, 2020 at 09:36:04PM +0200, Thomas Gleixner wrote:
>> The way it works is:
>>
>> 1) New vector on different CPU is allocated
>>
>> 2) New vector is installed
>>
>> 3) When the first interrupt on the new CPU arrives then the cleanup
>> IPI is sent to the previous CPU which tears down the old vector
>>
>> So if the interrupt is sent to the original CPU just before #2 then this
>> will be correctly handled on that CPU after the set affinity code
>> reenables interrupts.
>
> I'll have this test tried out.. but in msi_set_affinity() the check below
> for lapic_vector_set_in_irr(cfg->vector), this is the new vector correct?
> don't we want to check for old_cfg.vector instead?
>
> msi_set_affinit ()
> {
> ....
> unlock_vector_lock();
>
> /*
> * Check whether the transition raced with a device interrupt and
> * is pending in the local APICs IRR. It is safe to do this outside
> * of vector lock as the irq_desc::lock of this interrupt is still
> * held and interrupts are disabled: The check is not accessing the
> * underlying vector store. It's just checking the local APIC's
> * IRR.
> */
> if (lapic_vector_set_in_irr(cfg->vector))
> irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);

No. This catches the transitional interrupt to the new vector on the
original CPU, i.e. the one which is running that code.

Again the steps are:

1) Allocate new vector on new CPU

2) Set new vector on original CPU

3) Set new vector on new CPU

So we have 3 points where an interrupt can fire:

A) Before #2

B) After #2 and before #3

C) After #3

#A is hitting the old vector which is still valid on the old CPU and
will be handled once interrupts are enabled with the correct irq
descriptor - Normal operation (same as with maskable MSI)

#B This must be checked in the IRR because there is no valid vector
on the old CPU.

#C is handled on the new vector on the new CPU

Thanks,

tglx



2020-05-07 12:21:09

by Ashok Raj

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Hi Thomas

We did a bit more tracing and it looks like the IRR check is actually
not happening on the right cpu. See below.

On Tue, May 05, 2020 at 11:47:26PM +0200, Thomas Gleixner wrote:
> >
> > msi_set_affinit ()
> > {
> > ....
> > unlock_vector_lock();
> >
> > /*
> > * Check whether the transition raced with a device interrupt and
> > * is pending in the local APICs IRR. It is safe to do this outside
> > * of vector lock as the irq_desc::lock of this interrupt is still
> > * held and interrupts are disabled: The check is not accessing the
> > * underlying vector store. It's just checking the local APIC's
> > * IRR.
> > */
> > if (lapic_vector_set_in_irr(cfg->vector))
> > irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
>
> No. This catches the transitional interrupt to the new vector on the
> original CPU, i.e. the one which is running that code.

Mathias added some tracing to his xhci driver for when the ISR is called.

Below is the tail of my trace with the last two times the xhci_irq ISR is called:

<idle>-0 [003] d.h. 200.277971: xhci_irq: xhci irq
<idle>-0 [003] d.h. 200.278052: xhci_irq: xhci irq

Just trying to follow your steps below with traces. The traces follow
the same comments in the source.

>
> Again the steps are:
>
> 1) Allocate new vector on new CPU

/* Allocate a new target vector */
ret = parent->chip->irq_set_affinity(parent, mask, force);

migration/3-24 [003] d..1 200.283012: msi_set_affinity: msi_set_affinity: quirk: 1: new vector allocated, new cpu = 0

>
> 2) Set new vector on original CPU

/* Redirect it to the new vector on the local CPU temporarily */
old_cfg.vector = cfg->vector;
irq_msi_update_msg(irqd, &old_cfg);

migration/3-24 [003] d..1 200.283033: msi_set_affinity: msi_set_affinity: Redirect to new vector 33 on old cpu 6
>
> 3) Set new vector on new CPU

/* Now transition it to the target CPU */
irq_msi_update_msg(irqd, cfg);

migration/3-24 [003] d..1 200.283044: msi_set_affinity: msi_set_affinity: Transition to new target cpu 0 vector 33



if (lapic_vector_set_in_irr(cfg->vector))
irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);


migration/3-24 [003] d..1 200.283046: msi_set_affinity: msi_set_affinity: Update Done [IRR 0]: irq 123 localsw: Nvec 33 Napic 0

>
> So we have 3 points where an interrupt can fire:
>
> A) Before #2
>
> B) After #2 and before #3
>
> C) After #3
>
> #A is hitting the old vector which is still valid on the old CPU and
> will be handled once interrupts are enabled with the correct irq
> descriptor - Normal operation (same as with maskable MSI)
>
> #B This must be checked in the IRR because the there is no valid vector
> on the old CPU.

The IRR check seems to happen on a random cpu (3), vs. checking for the new vector 33
on the old cpu (6)?

This is the place where, when we force the retrigger without the IRR check, things seem to fix themselves.

>
> #C is handled on the new vector on the new CPU
>


Did we miss something?

Cheers,
Ashok

2020-05-07 12:57:07

by Thomas Gleixner

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Ashok,

"Raj, Ashok" <[email protected]> writes:

> We did a bit more tracing and it looks like the IRR check is actually
> not happening on the right cpu. See below.

What?

> On Tue, May 05, 2020 at 11:47:26PM +0200, Thomas Gleixner wrote:
>> >
>> > msi_set_affinit ()
>> > {
>> > ....
>> > unlock_vector_lock();
>> >
>> > /*
>> > * Check whether the transition raced with a device interrupt and
>> > * is pending in the local APICs IRR. It is safe to do this outside
>> > * of vector lock as the irq_desc::lock of this interrupt is still
>> > * held and interrupts are disabled: The check is not accessing the
>> > * underlying vector store. It's just checking the local APIC's
>> > * IRR.
>> > */
>> > if (lapic_vector_set_in_irr(cfg->vector))
>> > irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
>>
>> No. This catches the transitional interrupt to the new vector on the
>> original CPU, i.e. the one which is running that code.
>
> Mathias added some trace to his xhci driver when the isr is called.
>
> Below is the tail of my trace with last two times xhci_irq isr is called:
>
> <idle>-0 [003] d.h. 200.277971: xhci_irq: xhci irq
> <idle>-0 [003] d.h. 200.278052: xhci_irq: xhci irq
>
> Just trying to follow your steps below with traces. The traces follow
> the same comments in the source.
>
>>
>> Again the steps are:
>>
>> 1) Allocate new vector on new CPU
>
> /* Allocate a new target vector */
> ret = parent->chip->irq_set_affinity(parent, mask, force);
>
> migration/3-24 [003] d..1 200.283012: msi_set_affinity: msi_set_affinity: quirk: 1: new vector allocated, new cpu = 0
>
>>
>> 2) Set new vector on original CPU
>
> /* Redirect it to the new vector on the local CPU temporarily */
> old_cfg.vector = cfg->vector;
> irq_msi_update_msg(irqd, &old_cfg);
>
> migration/3-24 [003] d..1 200.283033: msi_set_affinity: msi_set_affinity: Redirect to new vector 33 on old cpu 6

On old CPU 6? This runs on CPU 3 which is wrong to begin with.

>> 3) Set new vector on new CPU
>
> /* Now transition it to the target CPU */
> irq_msi_update_msg(irqd, cfg);
>
> migration/3-24 [003] d..1 200.283044: msi_set_affinity: msi_set_affinity: Transition to new target cpu 0 vector 33
>
>
>
> if (lapic_vector_set_in_irr(cfg->vector))
> irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
>
>
> migration/3-24 [003] d..1 200.283046: msi_set_affinity: msi_set_affinity: Update Done [IRR 0]: irq 123 localsw: Nvec 33 Napic 0
>
>>
>> So we have 3 points where an interrupt can fire:
>>
>> A) Before #2
>>
>> B) After #2 and before #3
>>
>> C) After #3
>>
>> #A is hitting the old vector which is still valid on the old CPU and
>> will be handled once interrupts are enabled with the correct irq
>> descriptor - Normal operation (same as with maskable MSI)
>>
>> #B This must be checked in the IRR because the there is no valid vector
>> on the old CPU.
>
> The check for IRR seems like on a random cpu3 vs checking for the new vector 33
> on old cpu 6?

The whole sequence runs on CPU 3. If old CPU was 6 then this should
never run on CPU 3.

> This is the place when we force the retrigger without the IRR check things seem to fix itself.

It's not fixing it. It's papering over the root cause.

> Did we miss something?

Yes, you missed analyzing why this runs on CPU3 when the old CPU is 6. But
the last interrupt actually was on CPU3.

> <idle>-0 [003] d.h. 200.278052: xhci_irq: xhci irq

Can you please provide the full trace and the patch you used to generate
it?

Thanks,

tglx

2020-05-07 18:02:12

by Ashok Raj

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Hi Thomas,

I think I got mixed up between the logical apic id and the logical cpu :-(

Patch and trace attached.




On Thu, May 07, 2020 at 02:53:41PM +0200, Thomas Gleixner wrote:
> Ashok,
>
> "Raj, Ashok" <[email protected]> writes:
>
> > We did a bit more tracing and it looks like the IRR check is actually
> > not happening on the right cpu. See below.
>
> What?
>
> >> So we have 3 points where an interrupt can fire:
> >>
> >> A) Before #2
> >>
> >> B) After #2 and before #3
> >>
> >> C) After #3
> >>
> >> #A is hitting the old vector which is still valid on the old CPU and
> >> will be handled once interrupts are enabled with the correct irq
> >> descriptor - Normal operation (same as with maskable MSI)
> >>
> >> #B This must be checked in the IRR because the there is no valid vector
> >> on the old CPU.
> >
> > The check for IRR seems like on a random cpu3 vs checking for the new vector 33
> > on old cpu 6?
>
> The whole sequence runs on CPU 3. If old CPU was 6 then this should
> never run on CPU 3.
>
> > This is the place when we force the retrigger without the IRR check things seem to fix itself.
>
> It's not fixing it. It's papering over the root cause.
>
> > Did we miss something?
>
> Yes, you missed to analyze why this runs on CPU3 when old CPU is 6. But
> the last interrupt actually was on CPU3.
>
> > <idle>-0 [003] d.h. 200.278052: xhci_irq: xhci irq
>
> Can you please provide the full trace and the patch you used to generate
> it?
>

Cheers,
Ashok


Attachments:
msi-xhci-hang-trace.patch (4.11 kB)
trace_tglx (278.76 kB)

2020-05-07 19:46:03

by Thomas Gleixner

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Ashok,

"Raj, Ashok" <[email protected]> writes:
>
> I think i got mixed up with logical apic id and logical cpu :-(

Stuff happens.

> <idle>-0 [000] d.h. 44.376659: msi_set_affinity: quirk[1] new vector allocated, new apic = 2 vector = 33 this apic = 0
> <idle>-0 [000] d.h. 44.376684: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 0 Nvec 33 Napic 2

> <idle>-0 [000] d.h. 44.376685: xhci_irq: xhci irq

> <idle>-0 [001] d.h. 44.376750: msi_set_affinity: quirk[1] new vector allocated, new apic = 2 vector = 33 this apic = 2
> <idle>-0 [001] d.h. 44.376774: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 2 Nvec 33 Napic 2

> <idle>-0 [001] d.h. 44.376776: xhci_irq: xhci irq
> <idle>-0 [001] d.h. 44.395824: xhci_irq: xhci irq

> <...>-14 [001] d..1 44.400666: msi_set_affinity: quirk[1] new vector allocated, new apic = 6 vector = 33 this apic = 2
> <...>-14 [001] d..1 44.400691: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 2 Nvec 33 Napic 6

> <idle>-0 [003] d.h. 44.421021: xhci_irq: xhci irq
> <idle>-0 [003] d.h. 44.421135: xhci_irq: xhci irq

> migration/3-24 [003] d..1 44.421784: msi_set_affinity: quirk[1] new vector allocated, new apic = 0 vector = 33 this apic = 6
> migration/3-24 [003] d..1 44.421803: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 6 Nvec 33 Napic 0

So this last one is a direct update: straightforwardly moving it from one
CPU to the other on the same vector number.

And that's the case where we either expect the interrupt to come in on
CPU3 or on CPU0.

There is actually an example in the trace:

<idle>-0 [000] d.h. 40.616467: msi_set_affinity: quirk[1] new vector allocated, new apic = 2 vector = 33 this apic = 0
<idle>-0 [000] d.h. 40.616488: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 0 Nvec 33 Napic 2
<idle>-0 [000] d.h. 40.616488: xhci_irq: xhci irq
<idle>-0 [001] d.h. 40.616504: xhci_irq: xhci irq

> migration/3-24 [003] d..1 44.421784: msi_set_affinity: quirk[1] new vector allocated, new apic = 0 vector = 33 this apic = 6
> migration/3-24 [003] d..1 44.421803: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 6 Nvec 33 Napic 0

But as this last one is the migration thread, aka stomp machine, I
assume this is a hotplug operation, which means the CPU cannot handle
interrupts anymore. In that case we check the old vector on the
unplugged CPU in fixup_irqs() and do the retrigger from there.
Can you please add tracing to that one as well?
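
For reference, a simplified sketch of that fixup path (from
arch/x86/kernel/irq.c fixup_irqs(); locking details and the vector
bookkeeping are trimmed, so treat it as approximate): any vector of the
outgoing CPU that is still pending in its local APIC IRR gets retriggered
so the interrupt is delivered to its new target.

	unsigned int irr, vector;
	struct irq_desc *desc;
	struct irq_data *data;
	struct irq_chip *chip;

	for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
		if (IS_ERR_OR_NULL(__this_cpu_read(vector_irq[vector])))
			continue;

		irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
		if (irr & (1 << (vector % 32))) {
			desc = __this_cpu_read(vector_irq[vector]);

			raw_spin_lock(&desc->lock);
			data = irq_desc_get_irq_data(desc);
			chip = irq_data_get_irq_chip(data);
			if (chip->irq_retrigger) {
				chip->irq_retrigger(data);
				__this_cpu_write(vector_irq[vector], VECTOR_RETRIGGERED);
			}
			raw_spin_unlock(&desc->lock);
		}
	}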

Thanks,

tglx




2020-05-08 11:10:14

by Thomas Gleixner

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Ashok,

"Raj, Ashok" <[email protected]> writes:
>> But as this last one is the migration thread, aka stomp machine, I
>> assume this is a hotplug operation. Which means the CPU cannot handle
>> interrupts anymore. In that case we check the old vector on the
>> unplugged CPU in fixup_irqs() and do the retrigger from there.
>> Can you please add tracing to that one as well?
>
> New tracelog attached. It just happened once.

Correct and it worked as expected.

migration/3-24 [003] d..1 275.665751: msi_set_affinity: quirk[1] new vector allocated, new apic = 4 vector = 36 this apic = 6
migration/3-24 [003] d..1 275.665776: msi_set_affinity: Redirect to new vector 36 on old apic 6
migration/3-24 [003] d..1 275.665789: msi_set_affinity: Transition to new target apic 4 vector 36
migration/3-24 [003] d..1 275.665790: msi_set_affinity: Update Done [IRR 0]: irq 123 Nvec 36 Napic 4
migration/3-24 [003] d..1 275.666792: fixup_irqs: retrigger vector 33 irq 123

So looking at your trace further down, the problem is not the last
one. It dies already before that:

<...>-14 [001] d..1 284.901587: msi_set_affinity: quirk[1] new vector allocated, new apic = 6 vector = 33 this apic = 2
<...>-14 [001] d..1 284.901604: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 2 Nvec 33 Napic 6

Here, the interrupts stop coming in and that's just a regular direct
update, i.e. same vector, different CPU. The update below is updating a
dead device already.

migration/3-24 [003] d..1 284.924960: msi_set_affinity: quirk[1] new vector allocated, new apic = 4 vector = 36 this apic = 6
migration/3-24 [003] d..1 284.924987: msi_set_affinity: Redirect to new vector 36 on old apic 6
migration/3-24 [003] d..1 284.924999: msi_set_affinity: Transition to new target apic 4 vector 36
migration/3-24 [003] d..1 284.925000: msi_set_affinity: Update Done [IRR 0]: irq 123 Nvec 36 Napic 4

TBH, I can't see anything that's wrong here from the kernel side, and as
this is new silicon and you're the only ones reporting this, it seems
that this is something specific to that particular hardware. Have you
talked to the hardware people about this?

Thanks,

tglx

2020-05-08 16:14:10

by Ashok Raj

Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Hi Thomas,

On Fri, May 08, 2020 at 01:04:29PM +0200, Thomas Gleixner wrote:
> Ashok,
>
> "Raj, Ashok" <[email protected]> writes:
> >> But as this last one is the migration thread, aka stomp machine, I
> >> assume this is a hotplug operation. Which means the CPU cannot handle
> >> interrupts anymore. In that case we check the old vector on the
> >> unplugged CPU in fixup_irqs() and do the retrigger from there.
> >> Can you please add tracing to that one as well?
> >
> > New tracelog attached. It just happened once.
>
> Correct and it worked as expected.
>
> migration/3-24 [003] d..1 275.665751: msi_set_affinity: quirk[1] new vector allocated, new apic = 4 vector = 36 this apic = 6
> migration/3-24 [003] d..1 275.665776: msi_set_affinity: Redirect to new vector 36 on old apic 6
> migration/3-24 [003] d..1 275.665789: msi_set_affinity: Transition to new target apic 4 vector 36
> migration/3-24 [003] d..1 275.665790: msi_set_affinity: Update Done [IRR 0]: irq 123 Nvec 36 Napic 4
> migration/3-24 [003] d..1 275.666792: fixup_irqs: retrigger vector 33 irq 123
>
> So looking at your trace further down, the problem is not the last
> one. It dies already before that:
>
> <...>-14 [001] d..1 284.901587: msi_set_affinity: quirk[1] new vector allocated, new apic = 6 vector = 33 this apic = 2
> <...>-14 [001] d..1 284.901604: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 2 Nvec 33 Napic 6
>
> Here, the interrupts stop coming in and that's just a regular direct
> update, i.e. same vector, different CPU. The update below is updating a
> dead device already.

So we lost the interrupt either before or after we migrated to apic 2?

In the direct update, when we migrate to the same vector but a different CPU,
if there was a pending interrupt in the IRR on the current CPU, where does that get cleaned up?

>
> migration/3-24 [003] d..1 284.924960: msi_set_affinity: quirk[1] new vector allocated, new apic = 4 vector = 36 this apic = 6
> migration/3-24 [003] d..1 284.924987: msi_set_affinity: Redirect to new vector 36 on old apic 6
> migration/3-24 [003] d..1 284.924999: msi_set_affinity: Transition to new target apic 4 vector 36
> migration/3-24 [003] d..1 284.925000: msi_set_affinity: Update Done [IRR 0]: irq 123 Nvec 36 Napic 4
>
> TBH, I can't see anything wrong here from the kernel side, and as this
> is new silicon and you're the only ones reporting this, it seems to be
> something specific to that particular hardware. Have you talked to the
> hardware people about this?
>

After chasing it, I'm also starting to think that way. We had a question
about the interrupt moderation timer and how it affects this behavior.
Mathias tried turning the moderation timer off, and we still see the
hang. We are reaching out to the HW folks to get a handle on this.

With legacy MSI we can have these races and the kernel does the song
and dance to cope, but we see this happening even when IR is turned on,
which is perplexing. I think with IR, once we change the vector and
flush the interrupt entry cache, any outstanding interrupt still in
flight should be sitting in the IRR, and should then be cleaned up by
send_cleanup_vector(), I suppose.

Cheers,
Ashok

2020-05-08 16:51:22

by Thomas Gleixner

[permalink] [raw]
Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Ashok,

"Raj, Ashok" <[email protected]> writes:
> On Fri, May 08, 2020 at 01:04:29PM +0200, Thomas Gleixner wrote:
>> TBH, I can't see anything wrong here from the kernel side, and as this
>> is new silicon and you're the only ones reporting this, it seems to be
>> something specific to that particular hardware. Have you talked to the
>> hardware people about this?
>>
>
> After chasing it, I'm also starting to think that way. We had a question
> about the interrupt moderation timer and how it affects this behavior.
> Mathias tried turning the moderation timer off, and we still see the
> hang. We are reaching out to the HW folks to get a handle on this.
>
> With legacy MSI we can have these races and the kernel does the song
> and dance to cope, but we see this happening even when IR is turned on,
> which is perplexing. I think with IR, once we change the vector and
> flush the interrupt entry cache, any outstanding interrupt still in
> flight should be sitting in the IRR, and should then be cleaned up by
> send_cleanup_vector(), I suppose.

Ouch. With IR this really should never happen and yes the old vector
will catch one which was raised just before the migration disabled the
IR entry. During the change nothing can go wrong because the entry is
disabled and only reenabled after it's flushed which will send a pending
one to the new vector.

But if you see the issue with IR as well, then there is something wrong
with the local APICs in that CPU.

Thanks,

tglx

2020-05-11 19:06:01

by Ashok Raj

[permalink] [raw]
Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Hi Thomas,

On Fri, May 08, 2020 at 06:49:15PM +0200, Thomas Gleixner wrote:
> Ashok,
>
> "Raj, Ashok" <[email protected]> writes:
> > With legacy MSI we can have these races and the kernel does the song
> > and dance to cope, but we see this happening even when IR is turned on,
> > which is perplexing. I think with IR, once we change the vector and
> > flush the interrupt entry cache, any outstanding interrupt still in
> > flight should be sitting in the IRR, and should then be cleaned up by
> > send_cleanup_vector(), I suppose.
>
> Ouch. With IR this really should never happen and yes the old vector
> will catch one which was raised just before the migration disabled the
> IR entry. During the change nothing can go wrong because the entry is
> disabled and only reenabled after it's flushed which will send a pending
> one to the new vector.

With IR, I'm not sure we actually mask the interrupt except when
it's a Posted Interrupt.

We do an atomic update of the IRTE with cmpxchg_double:

        ret = cmpxchg_double(&irte->low, &irte->high,
                             irte->low, irte->high,
                             irte_modified->low, irte_modified->high);

followed by flushing the interrupt entry cache, after which any
old ones that were in flight before the flush should be sitting
in the IRR on the outgoing CPU.

send_cleanup_vector() sends an IPI to the old CPU, which would be
the CPU we are running on, correct? And this is a self-IPI to
IRQ_MOVE_CLEANUP_VECTOR.

smp_irq_move_cleanup_interrupt() seems to check the IRR for
apicd->prev_vector:

        irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
        if (irr & (1U << (vector % 32))) {
                apic->send_IPI_self(IRQ_MOVE_CLEANUP_VECTOR);
                continue;
        }

And this would allow any pending IRR bits on the outgoing CPU to
invoke the relevant ISRs before all of its vectors are drained.
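
To check my reading, the handler as I understand it has roughly the
shape below. This is only a schematic, not the literal 5.6 code, and
for_each_vector_pending_cleanup() is a made-up placeholder for however
the handler walks the vectors queued for cleanup on this CPU:

        /* Schematic of smp_irq_move_cleanup_interrupt(), as I read it */
        for_each_vector_pending_cleanup(vector) {       /* placeholder */
                irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
                if (irr & (1U << (vector % 32))) {
                        /* old vector still pending: punt, retry later */
                        apic->send_IPI_self(IRQ_MOVE_CLEANUP_VECTOR);
                        continue;
                }
                /* nothing pending on the old vector: release it */
        }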

Does it sound right?

I couldn't quite pin down how the device ISRs are hooked up through
this send_cleanup_vector() and what follows.

2020-05-11 20:18:32

by Thomas Gleixner

[permalink] [raw]
Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Ashok,

"Raj, Ashok" <[email protected]> writes:
> On Fri, May 08, 2020 at 06:49:15PM +0200, Thomas Gleixner wrote:
>> "Raj, Ashok" <[email protected]> writes:
>> > With legacy MSI we can have these races and the kernel does the song
>> > and dance to cope, but we see this happening even when IR is turned on,
>> > which is perplexing. I think with IR, once we change the vector and
>> > flush the interrupt entry cache, any outstanding interrupt still in
>> > flight should be sitting in the IRR, and should then be cleaned up by
>> > send_cleanup_vector(), I suppose.
>>
>> Ouch. With IR this really should never happen and yes the old vector
>> will catch one which was raised just before the migration disabled the
>> IR entry. During the change nothing can go wrong because the entry is
>> disabled and only reenabled after it's flushed which will send a pending
>> one to the new vector.
>
> With IR, I'm not sure we actually mask the interrupt except when
> it's a Posted Interrupt.

Indeed. Memory tricked me.

> We do an atomic update of the IRTE with cmpxchg_double:
>
>         ret = cmpxchg_double(&irte->low, &irte->high,
>                              irte->low, irte->high,
>                              irte_modified->low, irte_modified->high);

We only do this if:

        if ((irte->pst == 1) || (irte_modified->pst == 1))

i.e. the irte entry was or is going to be used for posted interrupts.

In the non-posted remapping case we do:

        set_64bit(&irte->low, irte_modified->low);
        set_64bit(&irte->high, irte_modified->high);
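
For context, the surrounding flow in modify_irte()
(drivers/iommu/intel_irq_remapping.c) is roughly the below. Paraphrased
from memory with the index lookup and error handling trimmed, so treat
it as a sketch rather than the literal code:

        raw_spin_lock_irqsave(&irq_2_ir_lock, flags);

        irte = &iommu->ir_table->base[index];

        if ((irte->pst == 1) || (irte_modified->pst == 1)) {
                /* posted interrupts: 128bit atomic update */
                ret = cmpxchg_double(&irte->low, &irte->high,
                                     irte->low, irte->high,
                                     irte_modified->low,
                                     irte_modified->high);
                if (!ret)
                        rc = -EBUSY;
        } else {
                /* plain remapping: two 64bit stores */
                set_64bit(&irte->low, irte_modified->low);
                set_64bit(&irte->high, irte_modified->high);
        }

        __iommu_flush_cache(iommu, irte, sizeof(*irte));

        /* Invalidate the interrupt entry cache for this index */
        rc = qi_flush_iec(iommu, index, 0);

        raw_spin_unlock_irqrestore(&irq_2_ir_lock, flags);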

> followed by flushing the interrupt entry cache, after which any
> old ones that were in flight before the flush should be sitting
> in the IRR on the outgoing CPU.

Correct.

> send_cleanup_vector() sends an IPI to the old CPU, which would be
> the CPU we are running on, correct? And this is a self-IPI to
> IRQ_MOVE_CLEANUP_VECTOR.

Yes, for the IR case the cleanup vector IPI is sent right away because
the IR logic allegedly guarantees that after flushing the IRTE cache no
interrupt can be sent to the old vector. IOW, after the flush a vector
sent to the previous CPU must be already in the IRR of that CPU.

> smp_irq_move_cleanup_interrupt() seems to check the IRR for
> apicd->prev_vector:
>
>         irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
>         if (irr & (1U << (vector % 32))) {
>                 apic->send_IPI_self(IRQ_MOVE_CLEANUP_VECTOR);
>                 continue;
>         }
>
> And this would allow any pending IRR bits on the outgoing CPU to
> invoke the relevant ISRs before all of its vectors are drained.

No. When the CPU goes offline it does not handle anything after moving
the interrupts away. The pending bits are handled in fixup_irqs() after
the forced migration. If the old vector is set in the IRR, it retriggers
the interrupt, i.e. sends an IPI to the new target CPU.
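
Roughly, that retrigger path in fixup_irqs() (arch/x86/kernel/irq.c)
looks like the below; again from memory and trimmed, not the literal
code:

        /* Runs on the outgoing CPU after irq_migrate_all_off_this_cpu() */
        for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
                if (IS_ERR_OR_NULL(__this_cpu_read(vector_irq[vector])))
                        continue;

                irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
                if (irr & (1U << (vector % 32))) {
                        desc = __this_cpu_read(vector_irq[vector]);

                        raw_spin_lock(&desc->lock);
                        data = irq_desc_get_irq_data(desc);
                        chip = irq_data_get_irq_chip(data);
                        /* Retrigger on the new target CPU and vector */
                        if (chip->irq_retrigger) {
                                chip->irq_retrigger(data);
                                __this_cpu_write(vector_irq[vector],
                                                 VECTOR_RETRIGGERED);
                        }
                        raw_spin_unlock(&desc->lock);
                }
        }

That retrigger is what produced the "fixup_irqs: retrigger vector 33
irq 123" line in the trace earlier in this thread.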

That IRR check is paranoia for the following case where interrupt
migration happens between live CPUs:

Old CPU

set affinity <- device sends to old vector/cpu
...
irte_...()
send_cleanup_vector();

handle_cleanup_vector()  <- Would remove the vector without that IRR
                            check and then result in a spurious
                            interrupt

This should actually never happen because the cleanup vector is the
lowest priority vector.

> I couldn't quite pin down how the device ISRs are hooked up through
> this send_cleanup_vector() and what follows.

Device ISRs don't know anything about the underlying irq machinery.

The non-IR migration does:

    Old CPU                     New CPU

    set affinity
       ...
                                <- device sends to new vector/cpu
                                handle_irq
                                  ack()
                                    apic_ack_irq()
                                      irq_complete_move()
                                        send_cleanup_vector()

    handle_cleanup_vector()

In case that an interrupt still hits the old CPU:

    set affinity                <- device sends to old vector/cpu
      ...
    handle_irq
      ack()
        apic_ack_irq()
          irq_complete_move()   (Does not send cleanup vector because wrong CPU)

                                <- device sends to new vector/cpu
                                handle_irq
                                  ack()
                                    apic_ack_irq()
                                      irq_complete_move()
                                        send_cleanup_vector()

    handle_cleanup_vector()
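
That "wrong CPU" check lives in irq_complete_move()
(arch/x86/kernel/apic/vector.c). From memory it is essentially the
below, modulo details:

        static void __irq_complete_move(struct irq_cfg *cfg, unsigned vector)
        {
                struct apic_chip_data *apicd;

                apicd = container_of(cfg, struct apic_chip_data, hw_irq_cfg);
                if (likely(!apicd->move_in_progress))
                        return;

                /* Only the new target CPU, on the new vector, cleans up */
                if (vector == apicd->vector && apicd->cpu == smp_processor_id())
                        __send_cleanup_vector(apicd);
        }

        void irq_complete_move(struct irq_cfg *cfg)
        {
                /* orig_ax holds the inverted vector number of this irq */
                __irq_complete_move(cfg, ~get_irq_regs()->orig_ax);
        }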

Thanks,

tglx