Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1754077AbbHMUQA (ORCPT ); Thu, 13 Aug 2015 16:16:00 -0400
Received: from mail-yk0-f182.google.com ([209.85.160.182]:33859 "EHLO
        mail-yk0-f182.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1753094AbbHMUP6 (ORCPT );
        Thu, 13 Aug 2015 16:15:58 -0400
MIME-Version: 1.0
In-Reply-To:
References: <1439108128-18441-1-git-send-email-jiang.liu@linux.intel.com>
        <55C94AA5.8090904@linux.intel.com>
Date: Thu, 13 Aug 2015 16:15:57 -0400
Message-ID:
Subject: Re: [Bugfix] x86, irq: Fix a regression caused by commit b5dc8e6c21e7
From: Alex Deucher
To: Jiang Liu
Cc: Thomas Gleixner, Alexander Holler, Mark Rustad, Ingo Molnar,
        "H. Peter Anvin", x86@kernel.org, Tony Luck, LKML
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 16237
Lines: 342

On Thu, Aug 13, 2015 at 3:46 PM, Alex Deucher wrote:
> On Mon, Aug 10, 2015 at 9:06 PM, Jiang Liu wrote:
>> On 2015/8/10 23:00, Alex Deucher wrote:
>>> On Sun, Aug 9, 2015 at 4:15 AM, Jiang Liu wrote:
>>>> Alex Deucher, Mark Rustad and Alexander Holler reported a regression
>>>> with the latest v4.2-rc4 kernel, which breaks some SATA controllers.
>>>> With multi-MSI capable SATA controllers, only the first port works,
>>>> all other ports time out when executing SATA commands. This regression
>>>> bisects to 52f518a3a7c2 ("x86/MSI: Use hierarchical irqdomains to manage
>>>> MSI interrupts"), but it's not the root cause, it just triggers a bug
>>>> caused by b5dc8e6c21e7 ("x86/irq: Use hierarchical irqdomain to manage
>>>> CPU interrupt vectors").
>>>>
>>>> With this patch applied, the affected SATA controllers work as expected.
>>>
>>> Yes, this fixes the SATA regression:
>>> Tested-by: Alex Deucher
>>>
>>> I'm not sure if it's related to this patch or not (I haven't bisected
>>> it independently yet), but MSIs don't seem to work on GPUs. See the
>>> line for amdgpu. This is just after loading the driver.
>> Hi Alex,
>> This patch only affects multiple-MSI, and it seems that your
>> gpu only uses one MSI interrupt, so it may not be related to this patch.
>> And this seems like a sort of interrupt storm.
>>> 52: 16579895 16579562 16580988 16583443 IR-PCI-MSI 524288-edge amdgpu
>>
>> Does disabling interrupt remapping make any difference?
>
> Nope. Still going crazy:
> 46: 4769660 4769130 4775899 4784657 PCI-MSI 524288-edge amdgpu
>
>
>> Does disabling MSI make any difference?
>
> If I set pci=nomsi, the sata controllers time out. If I disable MSIs
> just for the gpu, I don't get any interrupts:
> 25: 0 0 0 0 IR-IO-APIC 0-fasteoi amdgpu
>

Strangely, it only seems to affect certain boards. E.g., this card works fine:

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Bonaire XT [Radeon HD 7790/8770 / R9 260 OEM] (prog-if 00 [VGA controller])
        Subsystem: Diamond Multimedia Systems Device 2329
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00000  Data: 0000
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010
        Capabilities: [150 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [270 v1] #19
        Capabilities: [2b0 v1] Address Translation Service (ATS)
                ATSCap: Invalidate Queue Depth: 00
                ATSCtl: Enable+, Smallest Translation Unit: 00
        Capabilities: [2c0 v1] #13
        Capabilities: [2d0 v1] #1b
        Kernel driver in use: amdgpu

This one does not:

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 6939 (prog-if 00 [VGA controller])
        Subsystem: Gigabyte Technology Co., Ltd Device 229d
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00000  Data: 0000
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010
        Capabilities: [150 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [200 v1] #15
        Capabilities: [270 v1] #19
        Capabilities: [2b0 v1] Address Translation Service (ATS)
                ATSCap: Invalidate Queue Depth: 00
                ATSCtl: Enable+, Smallest Translation Unit: 00
        Capabilities: [2c0 v1] #13
        Capabilities: [2d0 v1] #1b
        Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Kernel driver in use: amdgpu

Any ideas? I'll see if I can find the time to bisect this.

Alex

> Alex
>
>> Thanks!
>> Gerry
>>
>>>
>>> $ cat /proc/interrupts
>>>            CPU0       CPU1       CPU2       CPU3
>>>   0:        138          0          0          0  IR-IO-APIC  2-edge  timer
>>>   1:          2          2          1          4  IR-IO-APIC  1-edge  i8042
>>>   7:          1          0          0          0  IR-IO-APIC  7-edge
>>>   8:          0          0          1          0  IR-IO-APIC  8-edge  rtc0
>>>   9:          0          0          0          0  IR-IO-APIC  9-fasteoi  acpi
>>>  14:          0          0          0          0  IR-IO-APIC  14-edge  pata_atiixp
>>>  15:          0          0          0          0  IR-IO-APIC  15-edge  pata_atiixp
>>>  16:        302        303        301        314  IR-IO-APIC  16-fasteoi  snd_hda_intel
>>>  17:          0          0          0          0  IR-IO-APIC  17-fasteoi  ehci_hcd:usb7, ehci_hcd:usb8
>>>  18:          0          0          0          0  IR-IO-APIC  18-fasteoi  ohci_hcd:usb9, ohci_hcd:usb10, ohci_hcd:usb11
>>>  24:          0          0          0          1  PCI-MSI  4096-edge  AMD-Vi
>>>  26:          0          0          0          0  IR-PCI-MSI  34816-edge  PCIe PME
>>>  27:          0          0          0          0  IR-PCI-MSI  344064-edge  PCIe PME
>>>  28:          0          0          0          0  IR-PCI-MSI  348160-edge  PCIe PME
>>>  29:          0          0          0          0  IR-PCI-MSI  350208-edge  PCIe PME
>>>  30:        247        255       1381       4617  IR-PCI-MSI  278528-edge  ahci0
>>>  31:        162        163        164        181  IR-PCI-MSI  278529-edge  ahci1
>>>  34:          2          1          2         17  IR-PCI-MSI  262144-edge  xhci_hcd
>>>  35:          0          0          0          0  IR-PCI-MSI  262145-edge  xhci_hcd
>>>  36:          0          0          0          0  IR-PCI-MSI  262146-edge  xhci_hcd
>>>  37:          0          0          0          0  IR-PCI-MSI  262147-edge  xhci_hcd
>>>  38:          0          0          0          0  IR-PCI-MSI  262148-edge  xhci_hcd
>>>  39:          0          0          0          0  IR-PCI-MSI  264192-edge  xhci_hcd
>>>  40:          0          0          0          0  IR-PCI-MSI  264193-edge  xhci_hcd
>>>  41:          0          0          0          0  IR-PCI-MSI  264194-edge  xhci_hcd
>>>  42:          0          0          0          0  IR-PCI-MSI  264195-edge  xhci_hcd
>>>  43:          0          0          0          0  IR-PCI-MSI  264196-edge  xhci_hcd
>>>  44:          0          0          0          0  IR-PCI-MSI  2097152-edge  xhci_hcd
>>>  45:          0          0          0          0  IR-PCI-MSI  2097153-edge  xhci_hcd
>>>  46:          0          0          0          0  IR-PCI-MSI  2097154-edge  xhci_hcd
>>>  47:          0          0          0          0  IR-PCI-MSI  2097155-edge  xhci_hcd
>>>  48:          0          0          0          0  IR-PCI-MSI  2097156-edge  xhci_hcd
>>>  50:         40         41         41         40  IR-PCI-MSI  526336-edge  snd_hda_intel
>>>  51:         14         15         21       1105  IR-PCI-MSI  2621440-edge  em1
>>>  52:   16579895   16579562   16580988   16583443  IR-PCI-MSI  524288-edge  amdgpu
>>> NMI:          4          3          4          3  Non-maskable interrupts
>>> LOC:      15020      10425       8933       8584  Local timer interrupts
>>> SPU:          0          0          0          0  Spurious interrupts
>>> PMI:          4          3          4          3  Performance monitoring interrupts
>>> IWI:          1          1          1          1  IRQ work interrupts
>>> RTR:          0          0          0          0  APIC ICR read retries
>>> RES:       7203       5501      10621       5077  Rescheduling interrupts
>>> CAL:        498        559        614        591  Function call interrupts
>>> TLB:         58        149        104         95  TLB shootdowns
>>> TRM:          0          0          0          0  Thermal event interrupts
>>> THR:          0          0          0          0  Threshold APIC interrupts
>>> DFR:          0          0          0          0  Deferred Error APIC interrupts
>>> MCE:          0          0          0          0  Machine check exceptions
>>> MCP:          1          1          1          1  Machine check polls
>>> HYP:          0          0          0          0  Hypervisor callback interrupts
>>> ERR:          1
>>> MIS:          0
>>> PIN:          0          0          0          0  Posted-interrupt notification event
>>> PIW:          0          0          0          0  Posted-interrupt wakeup event
>>>
>>> This worked fine on 4.1. Any ideas?
>>>
>>> Thanks,
>>>
>>> Alex
>>>
>>>
>>>>
>>>> Signed-off-by: Jiang Liu
>>>> Reported-by: Alex Deucher
>>>> Reported-by: Mark Rustad
>>>> Reported-by: Alexander Holler
>>>> ---
>>>> Hi Alex, Mark and Alexander,
>>>> Sorry for the long delay to root cause this regression, it's
>>>> really annoying. Could you please help test this patch against the
>>>> latest v4.2-rcx?
>>>> Thanks!
>>>> Gerry
>>>> ---
>>>>  arch/x86/kernel/apic/vector.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
>>>> index f813261d9740..2683f36e4e0a 100644
>>>> --- a/arch/x86/kernel/apic/vector.c
>>>> +++ b/arch/x86/kernel/apic/vector.c
>>>> @@ -322,7 +322,7 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
>>>>                 irq_data->chip = &lapic_controller;
>>>>                 irq_data->chip_data = data;
>>>>                 irq_data->hwirq = virq + i;
>>>> -               err = assign_irq_vector_policy(virq, irq_data->node, data,
>>>> +               err = assign_irq_vector_policy(virq + i, irq_data->node, data,
>>>>                                                info);
>>>>                 if (err)
>>>>                         goto error;
>>>> --
>>>> 1.7.10.4
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at http://www.tux.org/lkml/
>>>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
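
As a side illustration of the one-line fix in the quoted patch, here is a
small self-contained C model of the allocation loop. It is only a sketch
under stated assumptions, not the kernel code: vector_of[], assign_vector()
and alloc_irqs() are hypothetical names invented for this example, and the
real assign_irq_vector_policy() does far more than this stand-in. The model
shows why passing virq on every iteration leaves only the first IRQ of a
multi-MSI block with a vector, while passing virq + i covers the whole block.

```c
#include <assert.h>
#include <string.h>

#define MAX_IRQS     64
#define FIRST_VECTOR 0x20  /* arbitrary starting vector for this toy model */

/* Toy state: vector_of[irq] == 0 means "no vector assigned to this IRQ". */
static unsigned int vector_of[MAX_IRQS];
static unsigned int next_vector;

/* Hypothetical stand-in for assign_irq_vector_policy(): give irq a vector. */
static int assign_vector(unsigned int irq)
{
    assert(irq < MAX_IRQS);
    vector_of[irq] = next_vector++;
    return 0;
}

/*
 * Model of the allocation loop for a block of nr_irqs consecutive IRQs.
 * use_fix == 0: pass virq on every iteration (the pre-patch call).
 * use_fix != 0: pass virq + i (the one-line change in the patch).
 * Returns how many IRQs in the block ended up with a vector.
 */
static unsigned int alloc_irqs(unsigned int virq, unsigned int nr_irqs,
                               int use_fix)
{
    unsigned int i, assigned = 0;

    /* Start from a clean table so repeated calls are independent. */
    memset(vector_of, 0, sizeof(vector_of));
    next_vector = FIRST_VECTOR;

    for (i = 0; i < nr_irqs; i++)
        assign_vector(use_fix ? virq + i : virq);

    for (i = 0; i < nr_irqs; i++)
        if (vector_of[virq + i] != 0)
            assigned++;

    return assigned;
}
```

In this model, alloc_irqs(30, 4, 0) returns 1 because only IRQ 30 ever gets
a vector, and alloc_irqs(30, 4, 1) returns 4. That mirrors the reported
symptom that only the first port of a multi-MSI SATA controller worked.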