2015-08-09 08:12:55

by Jiang Liu

Subject: [Bugfix] x86, irq: Fix a regression caused by commit b5dc8e6c21e7

Alex Deucher, Mark Rustad and Alexander Holler reported a regression
with the latest v4.2-rc4 kernel, which breaks some SATA controllers.
With multi-MSI capable SATA controllers, only the first port works;
all other ports time out when executing SATA commands. This regression
bisects to 52f518a3a7c2 ("x86/MSI: Use hierarchical irqdomains to manage
MSI interrupts"), but that is not the root cause; it just triggers a bug
caused by b5dc8e6c21e7 ("x86/irq: Use hierarchical irqdomain to manage
CPU interrupt vectors").

With this patch applied, the affected SATA controllers work as expected.

Signed-off-by: Jiang Liu <[email protected]>
Reported-by: Alex Deucher <[email protected]>
Reported-by: Mark Rustad <[email protected]>
Reported-by: Alexander Holler <[email protected]>
---
Hi Alex, Mark and Alexander,
Sorry for the long delay in root-causing this regression; it's
really annoying. Could you please help test this patch against the
latest v4.2-rcx?
Thanks!
Gerry
---
arch/x86/kernel/apic/vector.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index f813261d9740..2683f36e4e0a 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -322,7 +322,7 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
 		irq_data->chip = &lapic_controller;
 		irq_data->chip_data = data;
 		irq_data->hwirq = virq + i;
-		err = assign_irq_vector_policy(virq, irq_data->node, data,
+		err = assign_irq_vector_policy(virq + i, irq_data->node, data,
 					       info);
 		if (err)
 			goto error;
--
1.7.10.4
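
For readers following the thread, the one-line change can be illustrated with a rough user-space model. This is only a sketch, not kernel code: it assumes that vector assignment records whatever IRQ number it is handed, and every name in it (vector_to_irq, assign_vector_model, alloc_irqs_model, irqs_with_vector, reset_model, FIRST_VECTOR) is hypothetical. It shows why passing virq on every loop iteration would leave only the first IRQ of a multi-MSI block owning a vector, while virq + i gives each IRQ its own.

```c
#define NR_VECTORS   256
#define FIRST_VECTOR 0x30	/* hypothetical first free vector */

/* Simplified stand-in for a vector -> IRQ lookup table. */
static int vector_to_irq[NR_VECTORS];
static int next_vector = FIRST_VECTOR;

/* Clear the model so each experiment starts from an empty table. */
static void reset_model(void)
{
	next_vector = FIRST_VECTOR;
	for (int v = 0; v < NR_VECTORS; v++)
		vector_to_irq[v] = -1;
}

/* Hypothetical model of vector assignment: bind the next free
 * vector to the IRQ number passed in. */
static int assign_vector_model(int irq)
{
	if (next_vector >= NR_VECTORS)
		return -1;
	vector_to_irq[next_vector++] = irq;
	return 0;
}

/* Model of the allocation loop. fixed == 0 mimics the old call
 * (always virq); fixed != 0 mimics the patched call (virq + i). */
static int alloc_irqs_model(int virq, int nr_irqs, int fixed)
{
	for (int i = 0; i < nr_irqs; i++) {
		int err = assign_vector_model(fixed ? virq + i : virq);

		if (err)
			return err;
	}
	return 0;
}

/* Count how many IRQs in [virq, virq + nr_irqs) own at least one vector. */
static int irqs_with_vector(int virq, int nr_irqs)
{
	int count = 0;

	for (int irq = virq; irq < virq + nr_irqs; irq++) {
		for (int v = FIRST_VECTOR; v < next_vector; v++) {
			if (vector_to_irq[v] == irq) {
				count++;
				break;
			}
		}
	}
	return count;
}
```

In this model, calling reset_model() and then alloc_irqs_model(30, 2, 0) leaves irqs_with_vector(30, 2) at 1, whereas the same call with fixed set to 1 leaves it at 2. That matches the reported symptom of only the first port working, though the real kernel path involves more state than this sketch captures.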


2015-08-09 10:15:06

by Alexander Holler

Subject: Re: [Bugfix] x86, irq: Fix a regression caused by commit b5dc8e6c21e7

Am 09.08.2015 um 10:15 schrieb Jiang Liu:
> Alex Deucher, Mark Rustad and Alexander Holler reported a regression
> with the latest v4.2-rc4 kernel, which breaks some SATA controllers.
> With multi-MSI capable SATA controllers, only the first port works;
> all other ports time out when executing SATA commands. This regression
> bisects to 52f518a3a7c2 ("x86/MSI: Use hierarchical irqdomains to manage
> MSI interrupts"), but that is not the root cause; it just triggers a bug
> caused by b5dc8e6c21e7 ("x86/irq: Use hierarchical irqdomain to manage
> CPU interrupt vectors").
>
> With this patch applied, the affected SATA controllers work as expected.
>
> Signed-off-by: Jiang Liu <[email protected]>
> Reported-by: Alex Deucher <[email protected]>
> Reported-by: Mark Rustad <[email protected]>
> Reported-by: Alexander Holler <[email protected]>
> ---
> Hi Alex, Mark and Alexander,
> Sorry for the long delay in root-causing this regression; it's
> really annoying. Could you please help test this patch against the
> latest v4.2-rcx?

Works. Thanks.

Tested-by: Alexander Holler <[email protected]>

2015-08-09 14:12:06

by Jiang Liu

Subject: Re: [Bugfix] x86, irq: Fix a regression caused by commit b5dc8e6c21e7

On 2015/8/9 18:14, Alexander Holler wrote:
> Am 09.08.2015 um 10:15 schrieb Jiang Liu:
>> Alex Deucher, Mark Rustad and Alexander Holler reported a regression
>> with the latest v4.2-rc4 kernel, which breaks some SATA controllers.
>> With multi-MSI capable SATA controllers, only the first port works;
>> all other ports time out when executing SATA commands. This regression
>> bisects to 52f518a3a7c2 ("x86/MSI: Use hierarchical irqdomains to manage
>> MSI interrupts"), but that is not the root cause; it just triggers a bug
>> caused by b5dc8e6c21e7 ("x86/irq: Use hierarchical irqdomain to manage
>> CPU interrupt vectors").
>>
>> With this patch applied, the affected SATA controllers work as expected.
>>
>> Signed-off-by: Jiang Liu <[email protected]>
>> Reported-by: Alex Deucher <[email protected]>
>> Reported-by: Mark Rustad <[email protected]>
>> Reported-by: Alexander Holler <[email protected]>
>> ---
>> Hi Alex, Mark and Alexander,
>> Sorry for the long delay in root-causing this regression; it's
>> really annoying. Could you please help test this patch against the
>> latest v4.2-rcx?
>
> Works. Thanks.
>
> Tested-by: Alexander Holler <[email protected]>
Thanks, Alexander!

2015-08-10 15:00:43

by Alex Deucher

Subject: Re: [Bugfix] x86, irq: Fix a regression caused by commit b5dc8e6c21e7

On Sun, Aug 9, 2015 at 4:15 AM, Jiang Liu <[email protected]> wrote:
> Alex Deucher, Mark Rustad and Alexander Holler reported a regression
> with the latest v4.2-rc4 kernel, which breaks some SATA controllers.
> With multi-MSI capable SATA controllers, only the first port works;
> all other ports time out when executing SATA commands. This regression
> bisects to 52f518a3a7c2 ("x86/MSI: Use hierarchical irqdomains to manage
> MSI interrupts"), but that is not the root cause; it just triggers a bug
> caused by b5dc8e6c21e7 ("x86/irq: Use hierarchical irqdomain to manage
> CPU interrupt vectors").
>
> With this patch applied, the affected SATA controllers work as expected.

Yes, this fixes the SATA regression:
Tested-by: Alex Deucher <[email protected]>

I'm not sure if it's related to this patch or not (I haven't bisected
it independently yet), but MSIs don't seem to work on GPUs. See the
line for amdgpu. This is just after loading the driver.

$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:        138          0          0          0  IR-IO-APIC  2-edge        timer
  1:          2          2          1          4  IR-IO-APIC  1-edge        i8042
  7:          1          0          0          0  IR-IO-APIC  7-edge
  8:          0          0          1          0  IR-IO-APIC  8-edge        rtc0
  9:          0          0          0          0  IR-IO-APIC  9-fasteoi     acpi
 14:          0          0          0          0  IR-IO-APIC  14-edge       pata_atiixp
 15:          0          0          0          0  IR-IO-APIC  15-edge       pata_atiixp
 16:        302        303        301        314  IR-IO-APIC  16-fasteoi    snd_hda_intel
 17:          0          0          0          0  IR-IO-APIC  17-fasteoi    ehci_hcd:usb7, ehci_hcd:usb8
 18:          0          0          0          0  IR-IO-APIC  18-fasteoi    ohci_hcd:usb9, ohci_hcd:usb10, ohci_hcd:usb11
 24:          0          0          0          1  PCI-MSI     4096-edge     AMD-Vi
 26:          0          0          0          0  IR-PCI-MSI  34816-edge    PCIe PME
 27:          0          0          0          0  IR-PCI-MSI  344064-edge   PCIe PME
 28:          0          0          0          0  IR-PCI-MSI  348160-edge   PCIe PME
 29:          0          0          0          0  IR-PCI-MSI  350208-edge   PCIe PME
 30:        247        255       1381       4617  IR-PCI-MSI  278528-edge   ahci0
 31:        162        163        164        181  IR-PCI-MSI  278529-edge   ahci1
 34:          2          1          2         17  IR-PCI-MSI  262144-edge   xhci_hcd
 35:          0          0          0          0  IR-PCI-MSI  262145-edge   xhci_hcd
 36:          0          0          0          0  IR-PCI-MSI  262146-edge   xhci_hcd
 37:          0          0          0          0  IR-PCI-MSI  262147-edge   xhci_hcd
 38:          0          0          0          0  IR-PCI-MSI  262148-edge   xhci_hcd
 39:          0          0          0          0  IR-PCI-MSI  264192-edge   xhci_hcd
 40:          0          0          0          0  IR-PCI-MSI  264193-edge   xhci_hcd
 41:          0          0          0          0  IR-PCI-MSI  264194-edge   xhci_hcd
 42:          0          0          0          0  IR-PCI-MSI  264195-edge   xhci_hcd
 43:          0          0          0          0  IR-PCI-MSI  264196-edge   xhci_hcd
 44:          0          0          0          0  IR-PCI-MSI  2097152-edge  xhci_hcd
 45:          0          0          0          0  IR-PCI-MSI  2097153-edge  xhci_hcd
 46:          0          0          0          0  IR-PCI-MSI  2097154-edge  xhci_hcd
 47:          0          0          0          0  IR-PCI-MSI  2097155-edge  xhci_hcd
 48:          0          0          0          0  IR-PCI-MSI  2097156-edge  xhci_hcd
 50:         40         41         41         40  IR-PCI-MSI  526336-edge   snd_hda_intel
 51:         14         15         21       1105  IR-PCI-MSI  2621440-edge  em1
 52:   16579895   16579562   16580988   16583443  IR-PCI-MSI  524288-edge   amdgpu
NMI:          4          3          4          3  Non-maskable interrupts
LOC:      15020      10425       8933       8584  Local timer interrupts
SPU:          0          0          0          0  Spurious interrupts
PMI:          4          3          4          3  Performance monitoring interrupts
IWI:          1          1          1          1  IRQ work interrupts
RTR:          0          0          0          0  APIC ICR read retries
RES:       7203       5501      10621       5077  Rescheduling interrupts
CAL:        498        559        614        591  Function call interrupts
TLB:         58        149        104         95  TLB shootdowns
TRM:          0          0          0          0  Thermal event interrupts
THR:          0          0          0          0  Threshold APIC interrupts
DFR:          0          0          0          0  Deferred Error APIC interrupts
MCE:          0          0          0          0  Machine check exceptions
MCP:          1          1          1          1  Machine check polls
HYP:          0          0          0          0  Hypervisor callback interrupts
ERR:          1
MIS:          0
PIN:          0          0          0          0  Posted-interrupt notification event
PIW:          0          0          0          0  Posted-interrupt wakeup event

This worked fine on 4.1. Any ideas?

Thanks,

Alex



2015-08-10 16:48:14

by Rustad, Mark D

Subject: Re: [Bugfix] x86, irq: Fix a regression caused by commit b5dc8e6c21e7

Gerry,

> On Aug 9, 2015, at 1:15 AM, Jiang Liu <[email protected]> wrote:
>
> Alex Deucher, Mark Rustad and Alexander Holler reported a regression
> with the latest v4.2-rc4 kernel, which breaks some SATA controllers.
> With multi-MSI capable SATA controllers, only the first port works;
> all other ports time out when executing SATA commands. This regression
> bisects to 52f518a3a7c2 ("x86/MSI: Use hierarchical irqdomains to manage
> MSI interrupts"), but that is not the root cause; it just triggers a bug
> caused by b5dc8e6c21e7 ("x86/irq: Use hierarchical irqdomain to manage
> CPU interrupt vectors").
>
> With this patch applied, the affected SATA controllers work as expected.

I see the same thing here as well.

> Signed-off-by: Jiang Liu <[email protected]>
> Reported-by: Alex Deucher <[email protected]>
> Reported-by: Mark Rustad <[email protected]>
> Reported-by: Alexander Holler <[email protected]>
> ---
> Hi Alex, Mark and Alexander,
> Sorry for the long delay in root-causing this regression; it's
> really annoying. Could you please help test this patch against the
> latest v4.2-rcx?

It works for me. Thanks.

--
Mark Rustad, Networking Division, Intel Corporation



2015-08-11 01:06:50

by Jiang Liu

Subject: Re: [Bugfix] x86, irq: Fix a regression caused by commit b5dc8e6c21e7

On 2015/8/10 23:00, Alex Deucher wrote:
> On Sun, Aug 9, 2015 at 4:15 AM, Jiang Liu <[email protected]> wrote:
>> Alex Deucher, Mark Rustad and Alexander Holler reported a regression
>> with the latest v4.2-rc4 kernel, which breaks some SATA controllers.
>> With multi-MSI capable SATA controllers, only the first port works;
>> all other ports time out when executing SATA commands. This regression
>> bisects to 52f518a3a7c2 ("x86/MSI: Use hierarchical irqdomains to manage
>> MSI interrupts"), but that is not the root cause; it just triggers a bug
>> caused by b5dc8e6c21e7 ("x86/irq: Use hierarchical irqdomain to manage
>> CPU interrupt vectors").
>>
>> With this patch applied, the affected SATA controllers work as expected.
>
> Yes, this fixes the SATA regression:
> Tested-by: Alex Deucher <[email protected]>
>
> I'm not sure if it's related to this patch or not (I haven't bisected
> it independently yet), but MSIs don't seem to work on GPUs. See the
> line for amdgpu. This is just after loading the driver.
Hi Alex,
This patch only affects multiple-MSI, and it seems that your
gpu only uses one MSI interrupt, so it may not be related to this patch.
And this seems like a sort of interrupt storm.
>  52:   16579895   16579562   16580988   16583443  IR-PCI-MSI  524288-edge   amdgpu

Does disabling interrupt remapping make any difference?
Does disabling MSI make any difference?
Thanks!
Gerry
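
As a rough aid for judging whether a line in /proc/interrupts points to a storm, the per-CPU counters can be summed and compared between two samples. The following is a minimal sketch under stated assumptions: the helper names (sum_irq_counts, looks_like_storm), the sampling interval and the rate threshold are all hypothetical and would need tuning for a real system. For reference, the four amdgpu counters quoted above add up to 66323888.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counters on one /proc/interrupts line, e.g.
 * " 52: 16579895 16579562 16580988 16583443 IR-PCI-MSI 524288-edge amdgpu".
 * Parsing stops at the first field that is not a plain number. */
static unsigned long long sum_irq_counts(const char *line)
{
	const char *p = strchr(line, ':');
	unsigned long long total = 0;

	if (!p)
		return 0;
	p++;
	for (;;) {
		char *end;
		unsigned long long n;

		while (isspace((unsigned char)*p))
			p++;
		if (!isdigit((unsigned char)*p))
			break;
		n = strtoull(p, &end, 10);
		/* "524288-edge" starts with digits but is not a counter. */
		if (*end != '\0' && !isspace((unsigned char)*end))
			break;
		total += n;
		p = end;
	}
	return total;
}

/* Hypothetical heuristic: report a storm when the interrupt rate
 * between two samples exceeds max_per_sec. */
static int looks_like_storm(unsigned long long before,
			    unsigned long long after,
			    double seconds, double max_per_sec)
{
	if (seconds <= 0.0 || after < before)
		return 0;
	return (double)(after - before) / seconds > max_per_sec;
}
```

Taking two samples a known interval apart and passing both sums to looks_like_storm() gives a quick yes/no answer; a device that keeps its interrupt asserted would be expected to show a rate far above that of the other lines.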


2015-08-13 19:46:15

by Alex Deucher

Subject: Re: [Bugfix] x86, irq: Fix a regression caused by commit b5dc8e6c21e7

On Mon, Aug 10, 2015 at 9:06 PM, Jiang Liu <[email protected]> wrote:
> On 2015/8/10 23:00, Alex Deucher wrote:
>> On Sun, Aug 9, 2015 at 4:15 AM, Jiang Liu <[email protected]> wrote:
>>> Alex Deucher, Mark Rustad and Alexander Holler reported a regression
>>> with the latest v4.2-rc4 kernel, which breaks some SATA controllers.
>>> With multi-MSI capable SATA controllers, only the first port works;
>>> all other ports time out when executing SATA commands. This regression
>>> bisects to 52f518a3a7c2 ("x86/MSI: Use hierarchical irqdomains to manage
>>> MSI interrupts"), but that is not the root cause; it just triggers a bug
>>> caused by b5dc8e6c21e7 ("x86/irq: Use hierarchical irqdomain to manage
>>> CPU interrupt vectors").
>>>
>>> With this patch applied, the affected SATA controllers work as expected.
>>
>> Yes, this fixes the SATA regression:
>> Tested-by: Alex Deucher <[email protected]>
>>
>> I'm not sure if it's related to this patch or not (I haven't bisected
>> it independently yet), but MSIs don't seem to work on GPUs. See the
>> line for amdgpu. This is just after loading the driver.
> Hi Alex,
> This patch only affects multiple-MSI, and it seems that your
> gpu only uses one MSI interrupt, so it may not be related to this patch.
> And this seems like a sort of interrupt storm.
>>  52:   16579895   16579562   16580988   16583443  IR-PCI-MSI  524288-edge   amdgpu
>
> Does disabling interrupt remapping make any difference?

Nope. Still going crazy:
 46:    4769660    4769130    4775899    4784657  PCI-MSI     524288-edge   amdgpu


> Does disabling MSI make any difference?

If I set pci=nomsi, the sata controllers time out. If I disable MSIs
just for the gpu, I don't get any interrupts:
 25:          0          0          0          0  IR-IO-APIC  0-fasteoi     amdgpu

Alex


2015-08-13 20:16:00

by Alex Deucher

Subject: Re: [Bugfix] x86, irq: Fix a regression caused by commit b5dc8e6c21e7

On Thu, Aug 13, 2015 at 3:46 PM, Alex Deucher <[email protected]> wrote:
> On Mon, Aug 10, 2015 at 9:06 PM, Jiang Liu <[email protected]> wrote:
>> On 2015/8/10 23:00, Alex Deucher wrote:
>>> On Sun, Aug 9, 2015 at 4:15 AM, Jiang Liu <[email protected]> wrote:
>>>> Alex Deucher, Mark Rustad and Alexander Holler reported a regression
>>>> with the latest v4.2-rc4 kernel, which breaks some SATA controllers.
>>>> With multi-MSI capable SATA controllers, only the first port works;
>>>> all other ports time out when executing SATA commands. This regression
>>>> bisects to 52f518a3a7c2 ("x86/MSI: Use hierarchical irqdomains to manage
>>>> MSI interrupts"), but that is not the root cause; it just triggers a bug
>>>> caused by b5dc8e6c21e7 ("x86/irq: Use hierarchical irqdomain to manage
>>>> CPU interrupt vectors").
>>>>
>>>> With this patch applied, the affected SATA controllers work as expected.
>>>
>>> Yes, this fixes the SATA regression:
>>> Tested-by: Alex Deucher <[email protected]>
>>>
>>> I'm not sure if it's related to this patch or not (I haven't bisected
>>> it independently yet), but MSIs don't seem to work on GPUs. See the
>>> line for amdgpu. This is just after loading the driver.
>> Hi Alex,
>> This patch only affects multiple-MSI, and it seems that your
>> gpu only uses one MSI interrupt, so it may not be related to this patch.
>> And this seems like a sort of interrupt storm.
>>>  52:   16579895   16579562   16580988   16583443  IR-PCI-MSI  524288-edge   amdgpu
>>
>> Does disabling interrupt remapping make any difference?
>
> Nope. Still going crazy:
>  46:    4769660    4769130    4775899    4784657  PCI-MSI     524288-edge   amdgpu
>
>
>> Does disabling MSI make any difference?
>
> If I set pci=nomsi, the sata controllers time out. If I disable MSIs
> just for the gpu, I don't get any interrupts:
>  25:          0          0          0          0  IR-IO-APIC  0-fasteoi     amdgpu
>

Strangely, it only seems to affect certain boards. E.g., this card works fine:
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Bonaire XT [Radeon HD 7790/8770 / R9 260 OEM] (prog-if 00 [VGA controller])
	Subsystem: Diamond Multimedia Systems Device 2329
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort+ <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 52
	Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Region 2: Memory at d0000000 (64-bit, prefetchable) [size=8M]
	Region 4: I/O ports at e000 [size=256]
	Region 5: Memory at ff600000 (32-bit, non-prefetchable) [size=256K]
	Expansion ROM at ff640000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
		DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee00000 Data: 0000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150 v2] Advanced Error Reporting
		UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [270 v1] #19
	Capabilities: [2b0 v1] Address Translation Service (ATS)
		ATSCap: Invalidate Queue Depth: 00
		ATSCtl: Enable+, Smallest Translation Unit: 00
	Capabilities: [2c0 v1] #13
	Capabilities: [2d0 v1] #1b
	Kernel driver in use: amdgpu

This one does not:
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 6939 (prog-if 00 [VGA controller])
	Subsystem: Gigabyte Technology Co., Ltd Device 229d
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort+ <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 52
	Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Region 2: Memory at d0000000 (64-bit, prefetchable) [size=2M]
	Region 4: I/O ports at e000 [size=256]
	Region 5: Memory at ff600000 (32-bit, non-prefetchable) [size=256K]
	Expansion ROM at ff640000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
		DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee00000 Data: 0000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150 v2] Advanced Error Reporting
		UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [200 v1] #15
	Capabilities: [270 v1] #19
	Capabilities: [2b0 v1] Address Translation Service (ATS)
		ATSCap: Invalidate Queue Depth: 00
		ATSCtl: Enable+, Smallest Translation Unit: 00
	Capabilities: [2c0 v1] #13
	Capabilities: [2d0 v1] #1b
	Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap: MFVC- ACS-, Next Function: 1
		ARICtl: MFVC- ACS-, Function Group: 0
	Kernel driver in use: amdgpu

Any ideas? I'll see if I can find the time to bisect this.

Alex



> Alex
>
>> Thanks!
>> Gerry
>>
>>>
>>> $ cat /proc/interrupts
>>> CPU0 CPU1 CPU2 CPU3
>>> 0: 138 0 0 0 IR-IO-APIC 2-edge timer
>>> 1: 2 2 1 4 IR-IO-APIC 1-edge i8042
>>> 7: 1 0 0 0 IR-IO-APIC 7-edge
>>> 8: 0 0 1 0 IR-IO-APIC 8-edge rtc0
>>> 9: 0 0 0 0 IR-IO-APIC 9-fasteoi acpi
>>> 14: 0 0 0 0 IR-IO-APIC 14-edge pata_atiixp
>>> 15: 0 0 0 0 IR-IO-APIC 15-edge pata_atiixp
>>> 16: 302 303 301 314 IR-IO-APIC 16-fasteoi snd_hda_intel
>>> 17: 0 0 0 0 IR-IO-APIC 17-fasteoi ehci_hcd:usb7, ehci_hcd:usb8
>>> 18: 0 0 0 0 IR-IO-APIC 18-fasteoi ohci_hcd:usb9, ohci_hcd:usb10, ohci_hcd:usb11
>>> 24: 0 0 0 1 PCI-MSI 4096-edge AMD-Vi
>>> 26: 0 0 0 0 IR-PCI-MSI 34816-edge PCIe PME
>>> 27: 0 0 0 0 IR-PCI-MSI 344064-edge PCIe PME
>>> 28: 0 0 0 0 IR-PCI-MSI 348160-edge PCIe PME
>>> 29: 0 0 0 0 IR-PCI-MSI 350208-edge PCIe PME
>>> 30: 247 255 1381 4617 IR-PCI-MSI 278528-edge ahci0
>>> 31: 162 163 164 181 IR-PCI-MSI 278529-edge ahci1
>>> 34: 2 1 2 17 IR-PCI-MSI 262144-edge xhci_hcd
>>> 35: 0 0 0 0 IR-PCI-MSI 262145-edge xhci_hcd
>>> 36: 0 0 0 0 IR-PCI-MSI 262146-edge xhci_hcd
>>> 37: 0 0 0 0 IR-PCI-MSI 262147-edge xhci_hcd
>>> 38: 0 0 0 0 IR-PCI-MSI 262148-edge xhci_hcd
>>> 39: 0 0 0 0 IR-PCI-MSI 264192-edge xhci_hcd
>>> 40: 0 0 0 0 IR-PCI-MSI 264193-edge xhci_hcd
>>> 41: 0 0 0 0 IR-PCI-MSI 264194-edge xhci_hcd
>>> 42: 0 0 0 0 IR-PCI-MSI 264195-edge xhci_hcd
>>> 43: 0 0 0 0 IR-PCI-MSI 264196-edge xhci_hcd
>>> 44: 0 0 0 0 IR-PCI-MSI 2097152-edge xhci_hcd
>>> 45: 0 0 0 0 IR-PCI-MSI 2097153-edge xhci_hcd
>>> 46: 0 0 0 0 IR-PCI-MSI 2097154-edge xhci_hcd
>>> 47: 0 0 0 0 IR-PCI-MSI 2097155-edge xhci_hcd
>>> 48: 0 0 0 0 IR-PCI-MSI 2097156-edge xhci_hcd
>>> 50: 40 41 41 40 IR-PCI-MSI 526336-edge snd_hda_intel
>>> 51: 14 15 21 1105 IR-PCI-MSI 2621440-edge em1
>>> 52: 16579895 16579562 16580988 16583443 IR-PCI-MSI 524288-edge amdgpu
>>> NMI: 4 3 4 3 Non-maskable interrupts
>>> LOC: 15020 10425 8933 8584 Local timer interrupts
>>> SPU: 0 0 0 0 Spurious interrupts
>>> PMI: 4 3 4 3 Performance monitoring interrupts
>>> IWI: 1 1 1 1 IRQ work interrupts
>>> RTR: 0 0 0 0 APIC ICR read retries
>>> RES: 7203 5501 10621 5077 Rescheduling interrupts
>>> CAL: 498 559 614 591 Function call interrupts
>>> TLB: 58 149 104 95 TLB shootdowns
>>> TRM: 0 0 0 0 Thermal event interrupts
>>> THR: 0 0 0 0 Threshold APIC interrupts
>>> DFR: 0 0 0 0 Deferred Error APIC interrupts
>>> MCE: 0 0 0 0 Machine check exceptions
>>> MCP: 1 1 1 1 Machine check polls
>>> HYP: 0 0 0 0 Hypervisor callback interrupts
>>> ERR: 1
>>> MIS: 0
>>> PIN: 0 0 0 0 Posted-interrupt notification event
>>> PIW: 0 0 0 0 Posted-interrupt wakeup event
>>>
>>> This worked fine on 4.1. Any ideas?
>>>
>>> Thanks,
>>>
>>> Alex
>>>
>>>
>>>>
>>>> Signed-off-by: Jiang Liu <[email protected]>
>>>> Reported-by: Alex Deucher <[email protected]>
>>>> Reported-by: Mark Rustad <[email protected]>
>>>> Reported-by: Alexander Holler <[email protected]>
>>>> ---
>>>> Hi Alex, Mark and Alexandler,
>>>> Sorry for the long delay to root cause this regression, it's
>>>> really annoying. Could you please help test this patch against the
>>>> latest v4.2-rcx?
>>>> Thanks!
>>>> Gerry
>>>> ---
>>>> arch/x86/kernel/apic/vector.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
>>>> index f813261d9740..2683f36e4e0a 100644
>>>> --- a/arch/x86/kernel/apic/vector.c
>>>> +++ b/arch/x86/kernel/apic/vector.c
>>>> @@ -322,7 +322,7 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
>>>> irq_data->chip = &lapic_controller;
>>>> irq_data->chip_data = data;
>>>> irq_data->hwirq = virq + i;
>>>> - err = assign_irq_vector_policy(virq, irq_data->node, data,
>>>> + err = assign_irq_vector_policy(virq + i, irq_data->node, data,
>>>> info);
>>>> if (err)
>>>> goto error;
>>>> --
>>>> 1.7.10.4
>>>>

2015-08-13 22:13:14

by Alex Deucher

[permalink] [raw]
Subject: Re: [Bugfix] x86, irq: Fix a regression caused by commit b5dc8e6c21e7

On Thu, Aug 13, 2015 at 4:15 PM, Alex Deucher <[email protected]> wrote:
> On Thu, Aug 13, 2015 at 3:46 PM, Alex Deucher <[email protected]> wrote:
>> On Mon, Aug 10, 2015 at 9:06 PM, Jiang Liu <[email protected]> wrote:
>>> On 2015/8/10 23:00, Alex Deucher wrote:
>>>> On Sun, Aug 9, 2015 at 4:15 AM, Jiang Liu <[email protected]> wrote:
>>>>> Alex Deucher, Mark Rustad and Alexander Holler reported a regression
>>>>> with the latest v4.2-rc4 kernel, which breaks some SATA controllers.
>>>>> With multi-MSI capable SATA controllers, only the first port works,
>>>>> all other ports times out when executing SATA commands. This regression
>>>>> bisects to 52f518a3a7c2 ("x86/MSI: Use hierarchical irqdomains to manage
>>>>> MSI interrupts"), but it's not the root cause, it just triggers a bug
>>>>> caused by b5dc8e6c21e7 ("x86/irq: Use hierarchical irqdomain to manage
>>>>> CPU interrupt vectors").
>>>>>
>>>>> With this patch applied, the affected SATA controllers work as expected.
>>>>
>>>> Yes, this fixes the SATA regression:
>>>> Tested-by: Alex Deucher <[email protected]>
>>>>
>>>> I'm not sure if it's related to this patch or not (I haven't bisected
>>>> it independently yet), but MSIs don't seem to work on GPUs. See the
>>>> line for amdgpu. This is just after loading the driver.
>>> Hi Alex,
>>> This patch only affects multiple-MSI, and it seems that your
>>> gpu only uses one MSI interrupt, so it may not be related to this patch.
>>> And this seems like a sort of interrupt storm.
>>>> 52: 16579895 16579562 16580988 16583443 IR-PCI-MSI 524288-edge amdgpu
>>>
>>> Does it make any change by disable interrupt remapping?
>>
>> Nope. Still going crazy:
>> 46: 4769660 4769130 4775899 4784657 PCI-MSI 524288-edge amdgpu
>>
>>
>>> Does it make any change by disable MSI?
>>
>> If I set pci=nomsi, the sata controllers time out. If I disable MSIs
>> just for the gpu, I don't get any interrupts:
>> 25: 0 0 0 0 IR-IO-APIC 0-fasteoi amdgpu
>>
>
> Strangely, it only seems to affect certain boards. E.g., this card works fine:
> 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> [AMD/ATI] Bonaire XT [Radeon HD 7790/8770 / R9 260 OEM] (prog-if 00
> [VGA controller])
> Subsystem: Diamond Multimedia Systems Device 2329
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr- Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort+ <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 52
> Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
> Region 2: Memory at d0000000 (64-bit, prefetchable) [size=8M]
> Region 4: I/O ports at e000 [size=256]
> Region 5: Memory at ff600000 (32-bit, non-prefetchable) [size=256K]
> Expansion ROM at ff640000 [disabled] [size=128K]
> Capabilities: [48] Vendor Specific Information: Len=08 <?>
> Capabilities: [50] Power Management version 3
> Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
> PME(D0-,D1+,D2+,D3hot+,D3cold-)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
> <4us, L1 unlimited
> ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
> MaxPayload 256 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit
> Latency L0s <64ns, L1 <1us
> ClockPM- Surprise- LLActRep- BwNot-
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+
> DLActive- BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-,
> OBFF Not Supported
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-,
> OBFF Disabled
> LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
> Transmit Margin: Normal Operating Range,
> EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -6dB,
> EqualizationComplete+, EqualizationPhase1+
> EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
> Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
> Address: 00000000fee00000 Data: 0000
> Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1
> Len=010 <?>
> Capabilities: [150 v2] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
> Capabilities: [270 v1] #19
> Capabilities: [2b0 v1] Address Translation Service (ATS)
> ATSCap: Invalidate Queue Depth: 00
> ATSCtl: Enable+, Smallest Translation Unit: 00
> Capabilities: [2c0 v1] #13
> Capabilities: [2d0 v1] #1b
> Kernel driver in use: amdgpu
>
> This one does not:
> 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> [AMD/ATI] Device 6939 (prog-if 00 [VGA controller])
> Subsystem: Gigabyte Technology Co., Ltd Device 229d
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr- Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort+ <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 52
> Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
> Region 2: Memory at d0000000 (64-bit, prefetchable) [size=2M]
> Region 4: I/O ports at e000 [size=256]
> Region 5: Memory at ff600000 (32-bit, non-prefetchable) [size=256K]
> Expansion ROM at ff640000 [disabled] [size=128K]
> Capabilities: [48] Vendor Specific Information: Len=08 <?>
> Capabilities: [50] Power Management version 3
> Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
> PME(D0-,D1+,D2+,D3hot+,D3cold+)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
> <4us, L1 unlimited
> ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
> MaxPayload 256 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit
> Latency L0s <64ns, L1 <1us
> ClockPM- Surprise- LLActRep- BwNot-
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+
> DLActive- BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-,
> OBFF Not Supported
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-,
> OBFF Disabled
> LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
> Transmit Margin: Normal Operating Range,
> EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -6dB,
> EqualizationComplete+, EqualizationPhase1+
> EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
> Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
> Address: 00000000fee00000 Data: 0000
> Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1
> Len=010 <?>
> Capabilities: [150 v2] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
> Capabilities: [200 v1] #15
> Capabilities: [270 v1] #19
> Capabilities: [2b0 v1] Address Translation Service (ATS)
> ATSCap: Invalidate Queue Depth: 00
> ATSCtl: Enable+, Smallest Translation Unit: 00
> Capabilities: [2c0 v1] #13
> Capabilities: [2d0 v1] #1b
> Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
> ARICap: MFVC- ACS-, Next Function: 1
> ARICtl: MFVC- ACS-, Function Group: 0
> Kernel driver in use: amdgpu
>
> Any ideas? I'll see if I can find the time to bisect this.

I attempted to bisect this, however the regression happened prior to
my driver being merged upstream:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=099bfbfc7fbbe22356c02f0caf709ac32e1126ea
So I can't easily bisect it further without backporting the driver to
each commit before that. This may take a while...

Alex

2015-08-17 21:02:59

by Alex Deucher

[permalink] [raw]
Subject: Re: [Bugfix] x86, irq: Fix a regression caused by commit b5dc8e6c21e7

On Sun, Aug 9, 2015 at 4:15 AM, Jiang Liu <[email protected]> wrote:
> Alex Deucher, Mark Rustad and Alexander Holler reported a regression
> with the latest v4.2-rc4 kernel, which breaks some SATA controllers.
> With multi-MSI capable SATA controllers, only the first port works,
> all other ports times out when executing SATA commands. This regression
> bisects to 52f518a3a7c2 ("x86/MSI: Use hierarchical irqdomains to manage
> MSI interrupts"), but it's not the root cause, it just triggers a bug
> caused by b5dc8e6c21e7 ("x86/irq: Use hierarchical irqdomain to manage
> CPU interrupt vectors").
>
> With this patch applied, the affected SATA controllers work as expected.

I don't see this upstream yet. Any chance this will make 4.2?

Alex

>
> Signed-off-by: Jiang Liu <[email protected]>
> Reported-by: Alex Deucher <[email protected]>
> Reported-by: Mark Rustad <[email protected]>
> Reported-by: Alexander Holler <[email protected]>
> ---
> Hi Alex, Mark and Alexandler,
> Sorry for the long delay to root cause this regression, it's
> really annoying. Could you please help test this patch against the
> latest v4.2-rcx?
> Thanks!
> Gerry
> ---
> arch/x86/kernel/apic/vector.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index f813261d9740..2683f36e4e0a 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -322,7 +322,7 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
> irq_data->chip = &lapic_controller;
> irq_data->chip_data = data;
> irq_data->hwirq = virq + i;
> - err = assign_irq_vector_policy(virq, irq_data->node, data,
> + err = assign_irq_vector_policy(virq + i, irq_data->node, data,
> info);
> if (err)
> goto error;
> --
> 1.7.10.4
>

2015-08-17 21:12:20

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [Bugfix] x86, irq: Fix a regression caused by commit b5dc8e6c21e7

On Sun, 9 Aug 2015, Jiang Liu wrote:

> Alex Deucher, Mark Rustad and Alexander Holler reported a regression
> with the latest v4.2-rc4 kernel, which breaks some SATA controllers.
> With multi-MSI capable SATA controllers, only the first port works,
> all other ports times out when executing SATA commands. This regression
> bisects to 52f518a3a7c2 ("x86/MSI: Use hierarchical irqdomains to manage
> MSI interrupts"), but it's not the root cause, it just triggers a bug
> caused by b5dc8e6c21e7 ("x86/irq: Use hierarchical irqdomain to manage
> CPU interrupt vectors").
>
> With this patch applied, the affected SATA controllers work as expected.

This changelog including the subject line is horrible.

1) The subject line should describe the change in a short and precise form

x86/irq: Fix a regression caused by commit b5dc8e6c21e7

fits the short category, but completely fails to be precise. It's
not interesting for the subject line which commit caused the
problem and whether it's a regression or not. We want to see a
proper description of the change itself.

2) The changelog should describe the bug itself.

... but it's not the root cause, it just triggers a bug caused by
b5dc8e6c21e7 ("x86/irq: Use hierarchical irqdomain to manage CPU
interrupt vectors").

does not tell what the actual bug in the code is.

3) The changelog should describe the solution.

With this patch applied, the affected SATA controllers work as
expected.

is describing the desired effect of the change, but not the change
itself.

Thanks,

tglx

2015-08-18 12:54:08

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [Bugfix] x86, irq: Fix a regression caused by commit b5dc8e6c21e7

On Mon, 17 Aug 2015, Alex Deucher wrote:
> On Sun, Aug 9, 2015 at 4:15 AM, Jiang Liu <[email protected]> wrote:
> > Alex Deucher, Mark Rustad and Alexander Holler reported a regression
> > with the latest v4.2-rc4 kernel, which breaks some SATA controllers.
> > With multi-MSI capable SATA controllers, only the first port works,
> > all other ports times out when executing SATA commands. This regression
> > bisects to 52f518a3a7c2 ("x86/MSI: Use hierarchical irqdomains to manage
> > MSI interrupts"), but it's not the root cause, it just triggers a bug
> > caused by b5dc8e6c21e7 ("x86/irq: Use hierarchical irqdomain to manage
> > CPU interrupt vectors").
> >
> > With this patch applied, the affected SATA controllers work as expected.
>
> I don't see this upstream yet. Any chance this will make 4.2?

Yes, it's going to be there.

2015-08-18 15:17:13

by Jiang Liu

[permalink] [raw]
Subject: [Bugfix] x86, irq: Fix an error in building CPU vector to IRQ number mapping for MSI

Alex Deucher, Mark Rustad and Alexander Holler reported a regression
with the latest v4.2-rc4 kernel, which breaks some SATA controllers.
With multi-MSI capable SATA controllers, only the first port works,
all other ports time out when executing SATA commands.

The regression is caused by b5dc8e6c21e7 ("x86/irq: Use hierarchical
irqdomain to manage CPU interrupt vectors"), which builds wrong CPU
vector to IRQ number mappings for the second and subsequent IRQs of a
multiple-MSI block, so that all MSI IRQs are handled as the first MSI IRQ.
Fix the regression by building correct CPU vector to IRQ number mappings
for multiple MSIs.

Signed-off-by: Jiang Liu <[email protected]>
Reported-and-tested-by: Alex Deucher <[email protected]>
Reported-and-tested-by: Mark Rustad <[email protected]>
Reported-and-tested-by: Alexander Holler <[email protected]>
---
arch/x86/kernel/apic/vector.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index f813261d9740..2683f36e4e0a 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -322,7 +322,7 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
irq_data->chip = &lapic_controller;
irq_data->chip_data = data;
irq_data->hwirq = virq + i;
- err = assign_irq_vector_policy(virq, irq_data->node, data,
+ err = assign_irq_vector_policy(virq + i, irq_data->node, data,
info);
if (err)
goto error;
--
1.7.10.4

Subject: [tip:x86/urgent] x86/irq: Build correct vector mapping for multiple MSI interrupts

Commit-ID: 527f0a91e91cd55ec79fce80451b0ad5d5e6a21a
Gitweb: http://git.kernel.org/tip/527f0a91e91cd55ec79fce80451b0ad5d5e6a21a
Author: Jiang Liu <[email protected]>
AuthorDate: Tue, 18 Aug 2015 23:20:20 +0800
Committer: Thomas Gleixner <[email protected]>
CommitDate: Tue, 18 Aug 2015 18:18:55 +0200

x86/irq: Build correct vector mapping for multiple MSI interrupts

Alex Deucher, Mark Rustad and Alexander Holler reported a regression
with the latest v4.2-rc4 kernel, which breaks some SATA controllers.
With multi-MSI capable SATA controllers, only the first port works,
all other ports time out when executing SATA commands.

This happens because the first argument to assign_irq_vector_policy()
is always the base linux irq number of the multi MSI interrupt block,
so all subsequent vector assignments operate on the base linux irq
number, so all MSI irqs are handled as the first irq number. Therefore
the other MSI irqs of a device are never set up correctly and never
fire.

Add the loop iterator to the base irq number so all vectors are
assigned correctly.

Fixes: b5dc8e6c21e7 "x86/irq: Use hierarchical irqdomain to manage CPU interrupt vectors"
Reported-and-tested-by: Alex Deucher <[email protected]>
Reported-and-tested-by: Mark Rustad <[email protected]>
Reported-and-tested-by: Alexander Holler <[email protected]>
Signed-off-by: Jiang Liu <[email protected]>
Cc: Tony Luck <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/kernel/apic/vector.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index f813261..2683f36 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -322,7 +322,7 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
irq_data->chip = &lapic_controller;
irq_data->chip_data = data;
irq_data->hwirq = virq + i;
- err = assign_irq_vector_policy(virq, irq_data->node, data,
+ err = assign_irq_vector_policy(virq + i, irq_data->node, data,
info);
if (err)
goto error;

2015-08-25 04:03:18

by Alex Deucher

[permalink] [raw]
Subject: Re: [Bugfix] x86, irq: Fix a regression caused by commit b5dc8e6c21e7

On Thu, Aug 13, 2015 at 6:13 PM, Alex Deucher <[email protected]> wrote:
> On Thu, Aug 13, 2015 at 4:15 PM, Alex Deucher <[email protected]> wrote:
>> On Thu, Aug 13, 2015 at 3:46 PM, Alex Deucher <[email protected]> wrote:
>>> On Mon, Aug 10, 2015 at 9:06 PM, Jiang Liu <[email protected]> wrote:
>>>> On 2015/8/10 23:00, Alex Deucher wrote:
>>>>> On Sun, Aug 9, 2015 at 4:15 AM, Jiang Liu <[email protected]> wrote:
>>>>>> Alex Deucher, Mark Rustad and Alexander Holler reported a regression
>>>>>> with the latest v4.2-rc4 kernel, which breaks some SATA controllers.
>>>>>> With multi-MSI capable SATA controllers, only the first port works,
>>>>>> all other ports times out when executing SATA commands. This regression
>>>>>> bisects to 52f518a3a7c2 ("x86/MSI: Use hierarchical irqdomains to manage
>>>>>> MSI interrupts"), but it's not the root cause, it just triggers a bug
>>>>>> caused by b5dc8e6c21e7 ("x86/irq: Use hierarchical irqdomain to manage
>>>>>> CPU interrupt vectors").
>>>>>>
>>>>>> With this patch applied, the affected SATA controllers work as expected.
>>>>>
>>>>> Yes, this fixes the SATA regression:
>>>>> Tested-by: Alex Deucher <[email protected]>
>>>>>
>>>>> I'm not sure if it's related to this patch or not (I haven't bisected
>>>>> it independently yet), but MSIs don't seem to work on GPUs. See the
>>>>> line for amdgpu. This is just after loading the driver.
>>>> Hi Alex,
>>>> This patch only affects multiple-MSI, and it seems that your
>>>> gpu only uses one MSI interrupt, so it may not be related to this patch.
>>>> And this seems like a sort of interrupt storm.
>>>>> 52: 16579895 16579562 16580988 16583443 IR-PCI-MSI 524288-edge amdgpu
>>>>
>>>> Does it make any change by disable interrupt remapping?
>>>
>>> Nope. Still going crazy:
>>> 46: 4769660 4769130 4775899 4784657 PCI-MSI 524288-edge amdgpu
>>>
>>>
>>>> Does it make any change by disable MSI?
>>>
>>> If I set pci=nomsi, the sata controllers time out. If I disable MSIs
>>> just for the gpu, I don't get any interrupts:
>>> 25: 0 0 0 0 IR-IO-APIC 0-fasteoi amdgpu
>>>
>>
>> Strangely, it only seems to affect certain boards. E.g., this card works fine:
>> 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
>> [AMD/ATI] Bonaire XT [Radeon HD 7790/8770 / R9 260 OEM] (prog-if 00
>> [VGA controller])
>> Subsystem: Diamond Multimedia Systems Device 2329
>> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
>> ParErr- Stepping- SERR- FastB2B- DisINTx+
>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>> <TAbort+ <MAbort- >SERR- <PERR- INTx-
>> Latency: 0, Cache Line Size: 64 bytes
>> Interrupt: pin A routed to IRQ 52
>> Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
>> Region 2: Memory at d0000000 (64-bit, prefetchable) [size=8M]
>> Region 4: I/O ports at e000 [size=256]
>> Region 5: Memory at ff600000 (32-bit, non-prefetchable) [size=256K]
>> Expansion ROM at ff640000 [disabled] [size=128K]
>> Capabilities: [48] Vendor Specific Information: Len=08 <?>
>> Capabilities: [50] Power Management version 3
>> Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
>> PME(D0-,D1+,D2+,D3hot+,D3cold-)
>> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>> Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
>> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
>> <4us, L1 unlimited
>> ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
>> MaxPayload 256 bytes, MaxReadReq 512 bytes
>> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
>> LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit
>> Latency L0s <64ns, L1 <1us
>> ClockPM- Surprise- LLActRep- BwNot-
>> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
>> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>> LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+
>> DLActive- BWMgmt- ABWMgmt-
>> DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-,
>> OBFF Not Supported
>> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-,
>> OBFF Disabled
>> LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
>> Transmit Margin: Normal Operating Range,
>> EnterModifiedCompliance- ComplianceSOS-
>> Compliance De-emphasis: -6dB
>> LnkSta2: Current De-emphasis Level: -6dB,
>> EqualizationComplete+, EqualizationPhase1+
>> EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
>> Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>> Address: 00000000fee00000 Data: 0000
>> Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1
>> Len=010 <?>
>> Capabilities: [150 v2] Advanced Error Reporting
>> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
>> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>> AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
>> Capabilities: [270 v1] #19
>> Capabilities: [2b0 v1] Address Translation Service (ATS)
>> ATSCap: Invalidate Queue Depth: 00
>> ATSCtl: Enable+, Smallest Translation Unit: 00
>> Capabilities: [2c0 v1] #13
>> Capabilities: [2d0 v1] #1b
>> Kernel driver in use: amdgpu
>>
>> This one does not:
>> 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
>> [AMD/ATI] Device 6939 (prog-if 00 [VGA controller])
>> Subsystem: Gigabyte Technology Co., Ltd Device 229d
>> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
>> ParErr- Stepping- SERR- FastB2B- DisINTx+
>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>> <TAbort+ <MAbort- >SERR- <PERR- INTx-
>> Latency: 0, Cache Line Size: 64 bytes
>> Interrupt: pin A routed to IRQ 52
>> Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
>> Region 2: Memory at d0000000 (64-bit, prefetchable) [size=2M]
>> Region 4: I/O ports at e000 [size=256]
>> Region 5: Memory at ff600000 (32-bit, non-prefetchable) [size=256K]
>> Expansion ROM at ff640000 [disabled] [size=128K]
>> Capabilities: [48] Vendor Specific Information: Len=08 <?>
>> Capabilities: [50] Power Management version 3
>> Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
>> PME(D0-,D1+,D2+,D3hot+,D3cold+)
>> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>> Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
>> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
>> <4us, L1 unlimited
>> ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
>> MaxPayload 256 bytes, MaxReadReq 512 bytes
>> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
>> LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit
>> Latency L0s <64ns, L1 <1us
>> ClockPM- Surprise- LLActRep- BwNot-
>> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
>> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>> LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+
>> DLActive- BWMgmt- ABWMgmt-
>> DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-,
>> OBFF Not Supported
>> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-,
>> OBFF Disabled
>> LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
>> Transmit Margin: Normal Operating Range,
>> EnterModifiedCompliance- ComplianceSOS-
>> Compliance De-emphasis: -6dB
>> LnkSta2: Current De-emphasis Level: -6dB,
>> EqualizationComplete+, EqualizationPhase1+
>> EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
>> Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>> Address: 00000000fee00000 Data: 0000
>> Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1
>> Len=010 <?>
>> Capabilities: [150 v2] Advanced Error Reporting
>> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
>> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>> AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
>> Capabilities: [200 v1] #15
>> Capabilities: [270 v1] #19
>> Capabilities: [2b0 v1] Address Translation Service (ATS)
>> ATSCap: Invalidate Queue Depth: 00
>> ATSCtl: Enable+, Smallest Translation Unit: 00
>> Capabilities: [2c0 v1] #13
>> Capabilities: [2d0 v1] #1b
>> Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
>> ARICap: MFVC- ACS-, Next Function: 1
>> ARICtl: MFVC- ACS-, Function Group: 0
>> Kernel driver in use: amdgpu
>>
>> Any ideas? I'll see if I can find the time to bisect this.
>
> I attempted to bisect this, however the regression happened prior to
> my driver being merged upstream:
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=099bfbfc7fbbe22356c02f0caf709ac32e1126ea
> So I can't easily bisect it further without backporting the driver to
> each commit before that. This may take a while...

Just a heads up, this ended up being an alignment issue in the driver
and was not a regression.

Alex

>
> Alex

2015-08-25 04:46:34

by Jiang Liu

[permalink] [raw]
Subject: Re: [Bugfix] x86, irq: Fix a regression caused by commit b5dc8e6c21e7

On 2015/8/25 12:03, Alex Deucher wrote:
> On Thu, Aug 13, 2015 at 6:13 PM, Alex Deucher <[email protected]> wrote:
>>> Any ideas? I'll see if I can find the time to bisect this.
>>
>> I attempted to bisect this, however the regression happened prior to
>> my driver being merged upstream:
>> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=099bfbfc7fbbe22356c02f0caf709ac32e1126ea
>> So I can't easily bisect it further without backporting the driver to
>> each commit before that. This may take a while...
>
> Just a heads up, this ended up being an alignment issue in the driver
> and was not a regression.

Thanks for the confirmation :)