2021-09-13 06:35:38

by Matthew Ruffell

Subject: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio

Dear PCI, KVM and VFIO Subsystem Maintainers,



I have a user who can reliably reproduce a host lockup when passing 2x GPUs to a KVM guest via vfio-pci, where the two GPUs share the same PCI switch. If the user passes through multiple GPUs and selects them such that no GPU shares the same PCI switch as any other GPU, the system is stable.



System Information:
- SuperMicro X9DRG-O(T)F
- 8x Nvidia GeForce RTX 2080 Ti GPUs
- Ubuntu 20.04 LTS
- 5.14.0 mainline kernel
- libvirt 6.0.0-0ubuntu8.10
- qemu 4.2-3ubuntu6.16



Kernel command line:

Command line: BOOT_IMAGE=/vmlinuz-5.14-051400-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro intel_iommu=on hugepagesz=1G hugepages=240 kvm.report_ignored_msrs=0 kvm.ignore_msrs=1 vfio-pci.ids=10de:1e04,10de:10f7,10de:1ad6,10de:1ad7 console=ttyS1,115200n8 ignore_loglevel crashkernel=512M



lspci -vvv run as root under kernel 5.14.0 is available in the pastebin below, and is also attached to this message:
https://paste.ubuntu.com/p/TVNvvXC7Z9/



lspci -tv run as root is available in the pastebin below:
https://paste.ubuntu.com/p/52Y69PbjZg/



The symptoms are:



When multiple GPUs are passed through to a KVM guest via vfio-pci, and a pair of the passed-through GPUs share the same PCI switch, then if you start the VM, panic it or force-restart it, and keep looping, the host will eventually hit the following kernel oops:



irq 31: nobody cared (try booting with the "irqpoll" option)

CPU: 23 PID: 0 Comm: swapper/23 Kdump: loaded Not tainted 5.14-051400-generic #202108310811-Ubuntu

Hardware name: Supermicro X9DRG-O(T)F/X9DRG-O(T)F, BIOS 3.3 11/27/2018

Call Trace:

<IRQ>

dump_stack_lvl+0x4a/0x5f

dump_stack+0x10/0x12

__report_bad_irq+0x3a/0xaf

note_interrupt.cold+0xb/0x60

handle_irq_event_percpu+0x72/0x80

handle_irq_event+0x3b/0x60

handle_fasteoi_irq+0x9c/0x150

__common_interrupt+0x4b/0xb0

common_interrupt+0x4a/0xa0

asm_common_interrupt+0x1e/0x40

RIP: 0010:__do_softirq+0x73/0x2ae

Code: 7b 61 4c 00 01 00 00 89 75 a8 c7 45 ac 0a 00 00 00 48 89 45 c0 48 89 45 b0 65 66 c7 05 54 c7 62 4c 00 00 fb 66 0f 1f 44 00 00 <bb> ff ff ff ff 49 c7 c7 c0 60 80 b4 41 0f bc de 83 c3 01 89 5d d4

RSP: 0018:ffffba440cc04f80 EFLAGS: 00000286

RAX: ffff93c5a0929880 RBX: 0000000000000000 RCX: 00000000000006e0

RDX: 0000000000000001 RSI: 0000000004200042 RDI: ffff93c5a1104980

RBP: ffffba440cc04fd8 R08: 0000000000000000 R09: 000000f47ad6e537

R10: 000000f47a99de21 R11: 000000f47a99dc37 R12: ffffba440c68be08

R13: 0000000000000001 R14: 0000000000000200 R15: 0000000000000000

irq_exit_rcu+0x8d/0xa0

sysvec_apic_timer_interrupt+0x7c/0x90

</IRQ>

asm_sysvec_apic_timer_interrupt+0x12/0x20

RIP: 0010:tick_nohz_idle_enter+0x47/0x50

Code: 30 4b 4d 48 83 bb b0 00 00 00 00 75 20 80 4b 4c 01 e8 5d 0c ff ff 80 4b 4c 04 48 89 43 78 e8 50 e8 f8 ff fb 66 0f 1f 44 00 00 <5b> 5d c3 0f 0b eb dc 66 90 0f 1f 44 00 00 55 48 89 e5 53 48 c7 c3

RSP: 0018:ffffba440c68beb0 EFLAGS: 00000213

RAX: 000000f5424040a4 RBX: ffff93e51fadf680 RCX: 000000000000001f

RDX: 0000000000000000 RSI: 000000002f684d00 RDI: ffe8b4bb6b90380b

RBP: ffffba440c68beb8 R08: 000000f5424040a4 R09: 0000000000000001

R10: ffffffffb4875460 R11: 0000000000000017 R12: 0000000000000093

R13: ffff93c5a0929880 R14: 0000000000000000 R15: 0000000000000000

do_idle+0x47/0x260

? do_idle+0x197/0x260

cpu_startup_entry+0x20/0x30

start_secondary+0x127/0x160

secondary_startup_64_no_verify+0xc2/0xcb

handlers:

[<00000000b16da31d>] vfio_intx_handler

Disabling IRQ #31



The IRQs on which this occurs are 25, 27, 106, 31, and 29. These represent the PEX 8747 PCIe switches present in the system:



*-pci
     description: PCI bridge
     product: PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch
     vendor: PLX Technology, Inc.
     bus info: pci@0000:02:00.0
     capabilities: pci msi pciexpress
     configuration: driver=pcieport
     resources: irq:25

*-pci
     description: PCI bridge
     product: PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch
     vendor: PLX Technology, Inc.
     bus info: pci@0000:06:00.0
     capabilities: pci msi pciexpress
     configuration: driver=pcieport
     resources: irq:27

*-pci
     description: PCI bridge
     product: PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch
     vendor: PLX Technology, Inc.
     bus info: pci@0000:82:00.0
     capabilities: pci msi pciexpress
     configuration: driver=pcieport
     resources: irq:29

*-pci
     description: PCI bridge
     product: PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch
     vendor: PLX Technology, Inc.
     bus info: pci@0000:86:00.0
     capabilities: pci msi pciexpress
     configuration: driver=pcieport
     resources: irq:31
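
For reference, which PCI devices are routed to a given IRQ line (e.g. IRQ 31) can be cross-checked by reading each device's sysfs irq attribute, roughly along these lines:

for d in /sys/bus/pci/devices/*; do
    [ "$(cat "$d/irq")" = "31" ] && echo "${d##*/}"
done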



When the system hits the kernel oops, the host crashes and the crashkernel boots, but it gets stuck initialising the IOMMU:



DMAR: Host address width 46
DMAR: DRHD base: 0x000000fbffe000 flags: 0x0
DMAR: dmar0: reg_base_addr fbffe000 ver 1:0 cap d2078c106f0466 ecap f020de
DMAR: DRHD base: 0x000000cbffc000 flags: 0x1
DMAR: dmar1: reg_base_addr cbffc000 ver 1:0 cap d2078c106f0466 ecap f020de
DMAR: RMRR base: 0x0000005f21a000 end: 0x0000005f228fff
DMAR: ATSR flags: 0x0
DMAR: RHSA base: 0x000000fbffe000 proximity domain: 0x1
DMAR: RHSA base: 0x000000cbffc000 proximity domain: 0x0
DMAR-IR: IOAPIC id 3 under DRHD base 0xfbffe000 IOMMU 0
DMAR-IR: IOAPIC id 0 under DRHD base 0xcbffc000 IOMMU 1
DMAR-IR: IOAPIC id 2 under DRHD base 0xcbffc000 IOMMU 1
DMAR-IR: HPET id 0 under DRHD base 0xcbffc000
[ 3.271530] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 3.282572] DMAR-IR: Copied IR table for dmar0 from previous kernel
[ 13.291319] DMAR-IR: Copied IR table for dmar1 from previous kernel



The crashkernel then hard locks, and the system must be manually rebooted. Note that it took ten seconds to copy the IR table for dmar1, which is most unusual. If we do a sysrq-trigger, there is no ten second delay, and the very next message is:



DMAR-IR: Enabled IRQ remapping in x2apic mode



This leads us to believe that the crashkernel is getting stuck somewhere between copying the IR table, re-enabling the IRQ that was disabled by "nobody cared", and globally enabling IRQ remapping.



Things we have tried:



We have tried adding vfio-pci.nointxmask=1 to the kernel command line, but then we cannot start a VM in which GPUs share the same PCI switch; instead we get a libvirt error:

Fails to start: vfio 0000:05:00.0: Failed to set up TRIGGER eventfd signaling for interrupt INTX-0: VFIO_DEVICE_SET_IRQS failure: Device or resource busy



Starting a VM with GPUs all from different PCI switches works just fine.



We tried adding "options snd-hda-intel enable_msi=1" to /etc/modprobe.d/snd-hda-intel.conf, and while it did enable MSI for all PCI devices under each GPU, MSI is still disabled on each of the PLX PCI switches, and the issue still reproduces when GPUs share PCI switches.
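
For reference, the MSI and INTx state of the PLX ports themselves can be checked directly on the host (10b5 is the PLX/Broadcom vendor ID), e.g.:

lspci -d 10b5: -vv | grep -E '^[0-9a-f]|MSI:|DisINTx'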



We have ruled out ACS issues, as each PLX PCI switch and Nvidia GPU is allocated its own isolated IOMMU group:

https://paste.ubuntu.com/p/9VRt2zrqRR/



Looking at the initial kernel oops, we seem to hit __report_bad_irq() (from kernel/irq/spurious.c), which means that 99,900 of the previous 100,000 interrupts coming from the PCI switch were not handled, and that vfio_intx_handler() doesn't process them, likely because the PCI switch itself is not passed through to the VM; only the VGA PCI devices are.



/*
 * If 99,900 of the previous 100,000 interrupts have not been handled
 * then assume that the IRQ is stuck in some manner. Drop a diagnostic
 * and try to turn the IRQ off.
 *
 * (The other 100-of-100,000 interrupts may have been a correctly
 *  functioning device sharing an IRQ with the failing one)
 */
static void __report_bad_irq(struct irq_desc *desc, irqreturn_t action_ret)
{
        unsigned int irq = irq_desc_get_irq(desc);
        struct irqaction *action;
        unsigned long flags;

        if (bad_action_ret(action_ret)) {
                printk(KERN_ERR "irq event %d: bogus return value %x\n",
                                irq, action_ret);
        } else {
                printk(KERN_ERR "irq %d: nobody cared (try booting with "
                                "the \"irqpoll\" option)\n", irq);
        }
        dump_stack();
        printk(KERN_ERR "handlers:\n");

        /*
         * We need to take desc->lock here. note_interrupt() is called
         * w/o desc->lock held, but IRQ_PROGRESS set. We might race
         * with something else removing an action. It's ok to take
         * desc->lock here. See synchronize_irq().
         */
        raw_spin_lock_irqsave(&desc->lock, flags);
        for_each_action_of_desc(desc, action) {
                printk(KERN_ERR "[<%p>] %ps", action->handler, action->handler);
                if (action->thread_fn)
                        printk(KERN_CONT " threaded [<%p>] %ps",
                                        action->thread_fn, action->thread_fn);
                printk(KERN_CONT "\n");
        }
        raw_spin_unlock_irqrestore(&desc->lock, flags);
}



Any help with debugging this issue would be greatly appreciated. We are able to gather any information requested, and can test patches or debug patches.



Thanks,

Matthew Ruffell


Attachments:
lspci_vvv_2021-09-10.txt (328.14 kB)

2021-09-14 16:45:02

by Alex Williamson

Subject: Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio

On Mon, 13 Sep 2021 18:31:02 +1200
Matthew Ruffell <[email protected]> wrote:

> Dear PCI, KVM and VFIO Subsystem Maintainers,
>
>
>
> I have a user which can reliably reproduce a host lockup when passing 2x GPUs to
>
> a KVM guest via vfio-pci, where the two GPUs each share the same PCI switch. If
>
> the user passes through multiple GPUs, and selects them such that no GPU shares
>
> the same PCI switch as any other GPU, the system is stable.

For discussion, an abridged lspci tree where all the endpoints are
functions per GPU card:

+-[0000:80]-+-02.0-[82-85]----00.0-[83-85]--+-08.0-[84]--+-00.0
| | | +-00.1
| | | +-00.2
| | | \-00.3
| | \-10.0-[85]--+-00.0
| | +-00.1
| | +-00.2
| | \-00.3
| +-03.0-[86-89]----00.0-[87-89]--+-08.0-[88]--+-00.0
| | | +-00.1
| | | +-00.2
| | | \-00.3
| | \-10.0-[89]--+-00.0
| | +-00.1
| | +-00.2
| | \-00.3
\-[0000:00]-+-02.0-[02-05]----00.0-[03-05]--+-08.0-[04]--+-00.0
| | +-00.1
| | +-00.2
| | \-00.3
| \-10.0-[05]--+-00.0
| +-00.1
| +-00.2
| \-00.3
+-03.0-[06-09]----00.0-[07-09]--+-08.0-[08]--+-00.0
| +-00.1
| +-00.2
| \-00.3
\-10.0-[09]--+-00.0
+-00.1
+-00.2
\-00.3

When you say the system is stable when no GPU shares the same PCI
switch as any other GPU, is that per VM or one GPU per switch remains
entirely unused?

FWIW, I have access to a system with an NVIDIA K1 and M60, both use
this same switch on-card and I've not experienced any issues assigning
all the GPUs to a single VM. Topo:

+-[0000:40]-+-02.0-[42-47]----00.0-[43-47]--+-08.0-[44]----00.0
| +-09.0-[45]----00.0
| +-10.0-[46]----00.0
| \-11.0-[47]----00.0
\-[0000:00]-+-03.0-[04-07]----00.0-[05-07]--+-08.0-[06]----00.0
\-10.0-[07]----00.0

>
> System Information:
>
> - SuperMicro X9DRG-O(T)F
>
> - 8x Nvidia GeForce RTX 2080 Ti GPUs
>
> - Ubuntu 20.04 LTS
>
> - 5.14.0 mainline kernel
>
> - libvirt 6.0.0-0ubuntu8.10
>
> - qemu 4.2-3ubuntu6.16
>
>
>
> Kernel command line:
>
> Command line: BOOT_IMAGE=/vmlinuz-5.14-051400-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro intel_iommu=on hugepagesz=1G hugepages=240 kvm.report_ignored_msrs=0 kvm.ignore_msrs=1 vfio-pci.ids=10de:1e04,10de:10f7,10de:1ad6,10de:1ad7 console=ttyS1,115200n8 ignore_loglevel crashkernel=512M
>
>
>
> lspci -vvv ran as root under kernel 5.14.0 is available in the pastebin below,
>
> and also attached to this message.
>
> https://paste.ubuntu.com/p/TVNvvXC7Z9/


On my system, the upstream ports of the switch have MSI interrupts
enabled, in your listing only the downstream ports enable MSI. Is
there anything in dmesg that might indicate an issue configuring
interrupts on the upstream port?


> lspci -tv ran as root available in the pastebin below:
>
> https://paste.ubuntu.com/p/52Y69PbjZg/
>
>
>
> The symptoms are:
>
>
>
> When multiple GPUs are passed through to a KVM guest via pci-vfio, and if a
>
> pair of GPUs are passed through which share the same PCI switch, if you start
>
> the VM, panic the VM / force restart the VM, and keep looping, eventually the
>
> host will have the following kernel oops:


You say "eventually" here, does that suggest the frequency is not 100%?

It seems like all the actions you list have a bus reset in common.
Does a clean reboot of the VM also trigger it, or killing the VM?


> irq 31: nobody cared (try booting with the "irqpoll" option)
>
> CPU: 23 PID: 0 Comm: swapper/23 Kdump: loaded Not tainted 5.14-051400-generic #202108310811-Ubuntu
>
> Hardware name: Supermicro X9DRG-O(T)F/X9DRG-O(T)F, BIOS 3.3 11/27/2018
>
> Call Trace:
>
> <IRQ>
>
> dump_stack_lvl+0x4a/0x5f
>
> dump_stack+0x10/0x12
>
> __report_bad_irq+0x3a/0xaf
>
> note_interrupt.cold+0xb/0x60
>
> handle_irq_event_percpu+0x72/0x80
>
> handle_irq_event+0x3b/0x60
>
> handle_fasteoi_irq+0x9c/0x150
>
> __common_interrupt+0x4b/0xb0
>
> common_interrupt+0x4a/0xa0
>
> asm_common_interrupt+0x1e/0x40
>
> RIP: 0010:__do_softirq+0x73/0x2ae
>
> Code: 7b 61 4c 00 01 00 00 89 75 a8 c7 45 ac 0a 00 00 00 48 89 45 c0 48 89 45 b0 65 66 c7 05 54 c7 62 4c 00 00 fb 66 0f 1f 44 00 00 <bb> ff ff ff ff 49 c7 c7 c0 60 80 b4 41 0f bc de 83 c3 01 89 5d d4
>
> RSP: 0018:ffffba440cc04f80 EFLAGS: 00000286
>
> RAX: ffff93c5a0929880 RBX: 0000000000000000 RCX: 00000000000006e0
>
> RDX: 0000000000000001 RSI: 0000000004200042 RDI: ffff93c5a1104980
>
> RBP: ffffba440cc04fd8 R08: 0000000000000000 R09: 000000f47ad6e537
>
> R10: 000000f47a99de21 R11: 000000f47a99dc37 R12: ffffba440c68be08
>
> R13: 0000000000000001 R14: 0000000000000200 R15: 0000000000000000
>
> irq_exit_rcu+0x8d/0xa0
>
> sysvec_apic_timer_interrupt+0x7c/0x90
>
> </IRQ>
>
> asm_sysvec_apic_timer_interrupt+0x12/0x20
>
> RIP: 0010:tick_nohz_idle_enter+0x47/0x50
>
> Code: 30 4b 4d 48 83 bb b0 00 00 00 00 75 20 80 4b 4c 01 e8 5d 0c ff ff 80 4b 4c 04 48 89 43 78 e8 50 e8 f8 ff fb 66 0f 1f 44 00 00 <5b> 5d c3 0f 0b eb dc 66 90 0f 1f 44 00 00 55 48 89 e5 53 48 c7 c3
>
> RSP: 0018:ffffba440c68beb0 EFLAGS: 00000213
>
> RAX: 000000f5424040a4 RBX: ffff93e51fadf680 RCX: 000000000000001f
>
> RDX: 0000000000000000 RSI: 000000002f684d00 RDI: ffe8b4bb6b90380b
>
> RBP: ffffba440c68beb8 R08: 000000f5424040a4 R09: 0000000000000001
>
> R10: ffffffffb4875460 R11: 0000000000000017 R12: 0000000000000093
>
> R13: ffff93c5a0929880 R14: 0000000000000000 R15: 0000000000000000
>
> do_idle+0x47/0x260
>
> ? do_idle+0x197/0x260
>
> cpu_startup_entry+0x20/0x30
>
> start_secondary+0x127/0x160
>
> secondary_startup_64_no_verify+0xc2/0xcb
>
> handlers:
>
> [<00000000b16da31d>] vfio_intx_handler
>
> Disabling IRQ #31

Hmm, we have the vfio-pci INTx handler installed, but possibly another
device is pulling the line or we're somehow out of sync with our own
device, ie. we either think it's already masked or it's not indicating
INTx status.

> The IRQs which this occurs on are: 25, 27, 106, 31, 29. These represent the
>
> PEX 8747 PCIe switches present in the system:

Some of those are the upstream port IRQ shared with a downstream
endpoint, but the lspci doesn't show that conclusively for all of
those. Perhaps a test we can run is to set DisINTx on all the upstream
root ports to eliminate this known IRQ line sharing between devices.
Something like this should set the correct bit for all the 8747
upstream ports:

# setpci -d 10b5: -s 00.0 4.w=400:400

lspci for each should then show:

Control: ... DisINTx+

vs DisINTx-

I'm still curious why it's not using MSI, but maybe this will hint
which device has the issue.

...
>
> [ 3.271530] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
>
> [ 3.282572] DMAR-IR: Copied IR table for dmar0 from previous kernel
>
> [ 13.291319] DMAR-IR: Copied IR table for dmar1 from previous kernel
>
>
>
> The crashkernel then hard locks, and the system must be manually rebooted. Note
>
> that it took ten seconds to copy the IR table for dmar1, which is most unusual.
>
> If we do a sysrq-trigger, there is no ten second delay, and the very next
>
> message is:
>
>
>
> DMAR-IR: Enabled IRQ remapping in x2apic mode
>
>
>
> Which leads us to believe that we are getting stuck in the crashkernel copying
>
> the IR table and re-enabling the IRQ that was disabled from "nobody cared"
>
> and globally enabling IRQ remapping.


Curious, yes. I don't have any insights there. Maybe something about
the IRQ being constantly asserted interferes with installing the new
remapper table.


> Things we have tried:
>
>
>
> We have tried adding vfio-pci.nointxmask=1 to the kernel command line, but we
>
> cannot start a VM where the GPUs shares the same PCI switch, instead we get a
>
> libvirt error:
>
>
>
> Fails to start: vfio 0000:05:00.0: Failed to set up TRIGGER eventfd signaling for interrupt INTX-0: VFIO_DEVICE_SET_IRQS failure: Device or resource busy


Yup, chances of being able to enable all the assignable endpoints with
shared interrupts is pretty slim. It's also curious that your oops
above doesn't list any IRQ handler corresponding to the upstream port.
Theoretically that means you'd at least avoid conflicts between
endpoints and the switch. It might be possible to get a limited config
working by unbinding drivers from conflicting devices.


> Starting a VM with GPUs all from different PCI switches works just fine.


This must be a clue, but I'm not sure how it fits yet.


> We tried adding "options snd-hda-intel enable_msi=1" to /etc/modprobe.d/snd-hda-intel.conf,
>
> and while it did enable MSI for all PCI devices under each GPU, MSI is still
>
> disabled on each of the PLX PCI switches, and the issue still reproduces when
>
> GPUs share PCI switches.
>
>
>
> We have ruled out ACS issues, as each PLX PCI switch and Nvidia GPU are
>
> allocated their own isolated IOMMU group:
>
>
>
> https://paste.ubuntu.com/p/9VRt2zrqRR/
>
>
>
> Looking at the initial kernel oops, we seem to hit __report_bad_irq(), which
>
> means that we have ignored 99,900 of these interrupts coming from the PCI switch,
>
> and that the vfio_intx_handler() doesn't process them, likely because the PCI
>
> switch itself is not passed through to the VM, only the VGA PCI devices are.


Assigning switch ports doesn't really make any sense, all the bridge
features would be emulated in the guest and the host needs to handle
various error conditions first. Seems we need to figure out which
device is actually signaling the interrupt and how it relates to
multiple devices assigned from the same switch, and maybe if it's
related to a bus reset operation. Thanks,

Alex

2021-09-15 04:49:54

by Matthew Ruffell

Subject: Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio

Hi Alex,

Adding Nathan Langford to CC; Nathan is the affected user and has direct access to the system.

On 15/09/21 4:43 am, Alex Williamson wrote:
> On Mon, 13 Sep 2021 18:31:02 +1200
> Matthew Ruffell <[email protected]> wrote:
>
>> Dear PCI, KVM and VFIO Subsystem Maintainers,
>>
>>
>>
>> I have a user which can reliably reproduce a host lockup when passing 2x GPUs to
>>
>> a KVM guest via vfio-pci, where the two GPUs each share the same PCI switch. If
>>
>> the user passes through multiple GPUs, and selects them such that no GPU shares
>>
>> the same PCI switch as any other GPU, the system is stable.
>
> For discussion, an abridged lspci tree where all the endpoints are
> functions per GPU card:
>
> +-[0000:80]-+-02.0-[82-85]----00.0-[83-85]--+-08.0-[84]--+-00.0
> | | | +-00.1
> | | | +-00.2
> | | | \-00.3
> | | \-10.0-[85]--+-00.0
> | | +-00.1
> | | +-00.2
> | | \-00.3
> | +-03.0-[86-89]----00.0-[87-89]--+-08.0-[88]--+-00.0
> | | | +-00.1
> | | | +-00.2
> | | | \-00.3
> | | \-10.0-[89]--+-00.0
> | | +-00.1
> | | +-00.2
> | | \-00.3
> \-[0000:00]-+-02.0-[02-05]----00.0-[03-05]--+-08.0-[04]--+-00.0
> | | +-00.1
> | | +-00.2
> | | \-00.3
> | \-10.0-[05]--+-00.0
> | +-00.1
> | +-00.2
> | \-00.3
> +-03.0-[06-09]----00.0-[07-09]--+-08.0-[08]--+-00.0
> | +-00.1
> | +-00.2
> | \-00.3
> \-10.0-[09]--+-00.0
> +-00.1
> +-00.2
> \-00.3
>
> When you say the system is stable when no GPU shares the same PCI
> switch as any other GPU, is that per VM or one GPU per switch remains
> entirely unused?

We have only been testing with one running VM in different configurations.

Configuration with no issue:
GPUs (by PCI device ID): 04, 08, 84, 88 (none sharing the same PCIe switch)

Configuration where the issue occurs:
GPUs 04, 08, 84, 88, 89 (88 and 89 sharing a PCIe switch)

>
> FWIW, I have access to a system with an NVIDIA K1 and M60, both use
> this same switch on-card and I've not experienced any issues assigning
> all the GPUs to a single VM. Topo:
>
> +-[0000:40]-+-02.0-[42-47]----00.0-[43-47]--+-08.0-[44]----00.0
> | +-09.0-[45]----00.0
> | +-10.0-[46]----00.0
> | \-11.0-[47]----00.0
> \-[0000:00]-+-03.0-[04-07]----00.0-[05-07]--+-08.0-[06]----00.0
> \-10.0-[07]----00.0
>
>>
>> System Information:
>>
>> - SuperMicro X9DRG-O(T)F
>>
>> - 8x Nvidia GeForce RTX 2080 Ti GPUs
>>
>> - Ubuntu 20.04 LTS
>>
>> - 5.14.0 mainline kernel
>>
>> - libvirt 6.0.0-0ubuntu8.10
>>
>> - qemu 4.2-3ubuntu6.16
>>
>>
>>
>> Kernel command line:
>>
>> Command line: BOOT_IMAGE=/vmlinuz-5.14-051400-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro intel_iommu=on hugepagesz=1G hugepages=240 kvm.report_ignored_msrs=0 kvm.ignore_msrs=1 vfio-pci.ids=10de:1e04,10de:10f7,10de:1ad6,10de:1ad7 console=ttyS1,115200n8 ignore_loglevel crashkernel=512M
>>
>>
>>
>> lspci -vvv ran as root under kernel 5.14.0 is available in the pastebin below,
>>
>> and also attached to this message.
>>
>> https://paste.ubuntu.com/p/TVNvvXC7Z9/
>
>
> On my system, the upstream ports of the switch have MSI interrupts
> enabled, in your listing only the downstream ports enable MSI. Is
> there anything in dmesg that might indicate an issue configuring
> interrupts on the upstream port?
>

Full dmesg output is available in the pastebin below:

https://paste.ubuntu.com/p/XTdRXFvvSV/

The only issues I found in dmesg are:

[ 11.711225] pci 0000:00:05.0: disabled boot interrupts on device [8086:0e28]
[ 11.863785] pci 0000:80:05.0: disabled boot interrupts on device [8086:0e28]

But these seem to point to the VT-d/Memory Map/Misc peripheral devices and are unrelated.
>
>> lspci -tv ran as root available in the pastebin below:
>>
>> https://paste.ubuntu.com/p/52Y69PbjZg/
>>
>>
>>
>> The symptoms are:
>>
>>
>>
>> When multiple GPUs are passed through to a KVM guest via pci-vfio, and if a
>>
>> pair of GPUs are passed through which share the same PCI switch, if you start
>>
>> the VM, panic the VM / force restart the VM, and keep looping, eventually the
>>
>> host will have the following kernel oops:
>
>
> You say "eventually" here, does that suggest the frequency is not 100%?
>
> It seems like all the actions you list have a bus reset in common.
> Does a clean reboot of the VM also trigger it, or killing the VM?
>

The lockup is extremely intermittent in normal use (sometimes months between occurrences), but we have been able to reproduce it much more frequently with this loop of starting the VM, crashing the kernel in the VM, resetting the VM from the host, and repeating. The lockup still doesn't happen every time, though. Here is some data on the number of VM crash/reset cycles between host lockups:

Config: GPUs 04, 05, 84, 88, 89
VM crash/reset cycles between host lockups: 15, 50, 60, 64, 7, 42, 8, 8, 78, 60, 32
Note: all lockups occurred after a message of "irq 31: nobody cared" corresponding to the PCIe switch shared by GPUs 88, 89.

Config: GPUs 04, 08, 84, 88
Results: no lockups after 722 VM crash/reset cycles

We have been able to cause the host lockups by just executing a reboot from within the VM, rather than crashing the VM, but most testing hasn't been done this way. As an anecdote from seeing this occur in production, we believe rebooting the VM can cause the issue.


>> irq 31: nobody cared (try booting with the "irqpoll" option)
>>
>> CPU: 23 PID: 0 Comm: swapper/23 Kdump: loaded Not tainted 5.14-051400-generic #202108310811-Ubuntu
>>
>> Hardware name: Supermicro X9DRG-O(T)F/X9DRG-O(T)F, BIOS 3.3 11/27/2018
>>
>> Call Trace:
>>
>> <IRQ>
>>
>> dump_stack_lvl+0x4a/0x5f
>>
>> dump_stack+0x10/0x12
>>
>> __report_bad_irq+0x3a/0xaf
>>
>> note_interrupt.cold+0xb/0x60
>>
>> handle_irq_event_percpu+0x72/0x80
>>
>> handle_irq_event+0x3b/0x60
>>
>> handle_fasteoi_irq+0x9c/0x150
>>
>> __common_interrupt+0x4b/0xb0
>>
>> common_interrupt+0x4a/0xa0
>>
>> asm_common_interrupt+0x1e/0x40
>>
>> RIP: 0010:__do_softirq+0x73/0x2ae
>>
>> Code: 7b 61 4c 00 01 00 00 89 75 a8 c7 45 ac 0a 00 00 00 48 89 45 c0 48 89 45 b0 65 66 c7 05 54 c7 62 4c 00 00 fb 66 0f 1f 44 00 00 <bb> ff ff ff ff 49 c7 c7 c0 60 80 b4 41 0f bc de 83 c3 01 89 5d d4
>>
>> RSP: 0018:ffffba440cc04f80 EFLAGS: 00000286
>>
>> RAX: ffff93c5a0929880 RBX: 0000000000000000 RCX: 00000000000006e0
>>
>> RDX: 0000000000000001 RSI: 0000000004200042 RDI: ffff93c5a1104980
>>
>> RBP: ffffba440cc04fd8 R08: 0000000000000000 R09: 000000f47ad6e537
>>
>> R10: 000000f47a99de21 R11: 000000f47a99dc37 R12: ffffba440c68be08
>>
>> R13: 0000000000000001 R14: 0000000000000200 R15: 0000000000000000
>>
>> irq_exit_rcu+0x8d/0xa0
>>
>> sysvec_apic_timer_interrupt+0x7c/0x90
>>
>> </IRQ>
>>
>> asm_sysvec_apic_timer_interrupt+0x12/0x20
>>
>> RIP: 0010:tick_nohz_idle_enter+0x47/0x50
>>
>> Code: 30 4b 4d 48 83 bb b0 00 00 00 00 75 20 80 4b 4c 01 e8 5d 0c ff ff 80 4b 4c 04 48 89 43 78 e8 50 e8 f8 ff fb 66 0f 1f 44 00 00 <5b> 5d c3 0f 0b eb dc 66 90 0f 1f 44 00 00 55 48 89 e5 53 48 c7 c3
>>
>> RSP: 0018:ffffba440c68beb0 EFLAGS: 00000213
>>
>> RAX: 000000f5424040a4 RBX: ffff93e51fadf680 RCX: 000000000000001f
>>
>> RDX: 0000000000000000 RSI: 000000002f684d00 RDI: ffe8b4bb6b90380b
>>
>> RBP: ffffba440c68beb8 R08: 000000f5424040a4 R09: 0000000000000001
>>
>> R10: ffffffffb4875460 R11: 0000000000000017 R12: 0000000000000093
>>
>> R13: ffff93c5a0929880 R14: 0000000000000000 R15: 0000000000000000
>>
>> do_idle+0x47/0x260
>>
>> ? do_idle+0x197/0x260
>>
>> cpu_startup_entry+0x20/0x30
>>
>> start_secondary+0x127/0x160
>>
>> secondary_startup_64_no_verify+0xc2/0xcb
>>
>> handlers:
>>
>> [<00000000b16da31d>] vfio_intx_handler
>>
>> Disabling IRQ #31
>
> Hmm, we have the vfio-pci INTx handler installed, but possibly another
> device is pulling the line or we're somehow out of sync with our own
> device, ie. we either think it's already masked or it's not indicating
> INTx status.
>
>> The IRQs which this occurs on are: 25, 27, 106, 31, 29. These represent the
>>
>> PEX 8747 PCIe switches present in the system:
>
> Some of those are the upstream port IRQ shared with a downstream
> endpoint, but the lspci doesn't show that conclusively for all of
> those.

The lspci -vvv output previously attached was executed while a VM was running with GPUs 04,08,84,88,89. We can see what you mean about IRQ 25 shared with PCIe switch 02:00 and GPU 05:00.0, IRQ 27 with switch 06:00 and GPU 09:00.0, IRQ 29 with switch 82:00 and GPU 85:00.0, and IRQ 31 only on switch 86:00. We just started a VM with all 8 GPUs, and now none of the upstream switch ports and GPUs share IRQs.

As an aside, we have seen the host lockups occur on IRQs 25, 27, 29, 31, which are all for the upstream switch ports, but also on IRQs 103, 106, 109, 112, which correspond to the GPU audio device functions. There are only four such IRQs for eight GPUs because GPUs sharing a PCIe switch have their audio device functions share an IRQ. This is different from what was shown in the previously attached lspci -vvv output because, by default, the audio devices on the GPUs don't use MSI. Recently, we set "options snd-hda-intel enable_msi=1" in /etc/modprobe.d/ within the VM to force the audio devices to use MSI, which had the side effect of the audio devices no longer sharing IRQs. We don't know whether the host lockups related to the audio devices are connected to the PCIe switch lockup problem.

> Perhaps a test we can run is to set DisINTx on all the upstream
> root ports to eliminate this known IRQ line sharing between devices.
> Something like this should set the correct bit for all the 8747
> upstream ports:
>
> # setpci -d 10b5: -s 00.0 4.w=400:400
>
> lspci for each should then show:
>
> Control: ... DisINTx+
>
> vs DisINTx-
>
> I'm still curious why it's not using MSI, but maybe this will hint
> which device has the issue.

Thanks. We have run the setpci command and can see that DisINTx+ has been set. MSI is still disabled, interestingly enough. We will run the reproducer script to loop the VM and see if the issue reproduces.
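
For completeness, the reproducer is essentially a loop of this shape (the domain name, timings, and crash method shown here are placeholders, not the exact script):

while true; do
    virsh start gpu-guest          # boot the guest with the GPUs assigned
    sleep 180                      # give the guest time to load its drivers
    virsh inject-nmi gpu-guest     # crash the guest kernel (guest set to panic on NMI)
    sleep 30
    virsh destroy gpu-guest        # force it off from the host; the vfio bus resets happen here
done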

> ...
>>
>> [ 3.271530] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
>>
>> [ 3.282572] DMAR-IR: Copied IR table for dmar0 from previous kernel
>>
>> [ 13.291319] DMAR-IR: Copied IR table for dmar1 from previous kernel
>>
>>
>>
>> The crashkernel then hard locks, and the system must be manually rebooted. Note
>>
>> that it took ten seconds to copy the IR table for dmar1, which is most unusual.
>>
>> If we do a sysrq-trigger, there is no ten second delay, and the very next
>>
>> message is:
>>
>>
>>
>> DMAR-IR: Enabled IRQ remapping in x2apic mode
>>
>>
>>
>> Which leads us to believe that we are getting stuck in the crashkernel copying
>>
>> the IR table and re-enabling the IRQ that was disabled from "nobody cared"
>>
>> and globally enabling IRQ remapping.
>
>
> Curious, yes. I don't have any insights there. Maybe something about
> the IRQ being constantly asserted interferes with installing the new
> remapper table.
>
>
>> Things we have tried:
>>
>>
>>
>> We have tried adding vfio-pci.nointxmask=1 to the kernel command line, but we
>>
>> cannot start a VM where the GPUs shares the same PCI switch, instead we get a
>>
>> libvirt error:
>>
>>
>>
>> Fails to start: vfio 0000:05:00.0: Failed to set up TRIGGER eventfd signaling for interrupt INTX-0: VFIO_DEVICE_SET_IRQS failure: Device or resource busy
>
>
> Yup, chances of being able to enable all the assignable endpoints with
> shared interrupts is pretty slim. It's also curious that your oops
> above doesn't list any IRQ handler corresponding to the upstream port.
> Theoretically that means you'd at least avoid conflicts between
> endpoints and the switch. It might be possible to get a limited config
> working by unbinding drivers from conflicting devices.
>

I also thought it was curious that the oops didn't list a handler for the upstream port. Perhaps this is why the IRQs are being ignored: because no handler for the upstream port is registered?

>> Starting a VM with GPUs all from different PCI switches works just fine.
>
>
> This must be a clue, but I'm not sure how it fits yet.
>
>
>> We tried adding "options snd-hda-intel enable_msi=1" to /etc/modprobe.d/snd-hda-intel.conf,
>>
>> and while it did enable MSI for all PCI devices under each GPU, MSI is still
>>
>> disabled on each of the PLX PCI switches, and the issue still reproduces when
>>
>> GPUs share PCI switches.
>>
>>
>>
>> We have ruled out ACS issues, as each PLX PCI switch and Nvidia GPU are
>>
>> allocated their own isolated IOMMU group:
>>
>>
>>
>> https://paste.ubuntu.com/p/9VRt2zrqRR/
>>
>>
>>
>> Looking at the initial kernel oops, we seem to hit __report_bad_irq(), which
>>
>> means that we have ignored 99,900 of these interrupts coming from the PCI switch,
>>
>> and that the vfio_intx_handler() doesn't process them, likely because the PCI
>>
>> switch itself is not passed through to the VM, only the VGA PCI devices are.
>
>
> Assigning switch ports doesn't really make any sense, all the bridge
> features would be emulated in the guest and the host needs to handle
> various error conditions first. Seems we need to figure out which
> device is actually signaling the interrupt and how it relates to
> multiple devices assigned from the same switch, and maybe if it's
> related to a bus reset operation. Thanks,

Would tracing the PCI subsystem be helpful? We will enable dynamic debug with

dyndbg='file drivers/pci/* +p'

and see if we can catch bus resets.
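
This can be set as a boot-time parameter, or toggled at runtime via debugfs (assuming CONFIG_DYNAMIC_DEBUG is enabled):

echo 'file drivers/pci/* +p' > /sys/kernel/debug/dynamic_debug/control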

>
> Alex
>

Thanks,
Matthew

2021-09-15 16:34:37

by Alex Williamson

Subject: Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio

On Wed, 15 Sep 2021 16:44:38 +1200
Matthew Ruffell <[email protected]> wrote:
> On 15/09/21 4:43 am, Alex Williamson wrote:
> >
> > FWIW, I have access to a system with an NVIDIA K1 and M60, both use
> > this same switch on-card and I've not experienced any issues assigning
> > all the GPUs to a single VM. Topo:
> >
> > +-[0000:40]-+-02.0-[42-47]----00.0-[43-47]--+-08.0-[44]----00.0
> > | +-09.0-[45]----00.0
> > | +-10.0-[46]----00.0
> > | \-11.0-[47]----00.0
> > \-[0000:00]-+-03.0-[04-07]----00.0-[05-07]--+-08.0-[06]----00.0
> > \-10.0-[07]----00.0


I've actually found that the above configuration, assigning all 6 GPUs
to a VM reproduces this pretty readily by simply rebooting the VM. In
my case, I don't have the panic-on-warn/oops that must be set on your
kernel, so the result is far more benign, the IRQ gets masked until
it's re-registered.

The fact that my upstream ports are using MSI seems irrelevant.

Adding debugging to the vfio-pci interrupt handler, it's correctly
deferring the interrupt as the GPU device is not identifying itself as
the source of the interrupt via the status register. In fact, setting
the disable INTx bit in the GPU command register while the interrupt
storm occurs does not stop the interrupts.
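
(Roughly, the per-device check being described works like this; an illustrative sketch only, since the real handler additionally signals the user's eventfd and takes the relevant locks:)

static irqreturn_t intx_check_and_mask_sketch(int irq, void *dev_id)
{
        struct pci_dev *pdev = dev_id;
        u16 status, cmd;

        /* Only claim the (possibly shared) line if this device asserts INTx */
        pci_read_config_word(pdev, PCI_STATUS, &status);
        if (!(status & PCI_STATUS_INTERRUPT))
                return IRQ_NONE;        /* not us: defer to other handlers */

        /* Mask further INTx from this device until it is explicitly unmasked */
        pci_read_config_word(pdev, PCI_COMMAND, &cmd);
        pci_write_config_word(pdev, PCI_COMMAND, cmd | PCI_COMMAND_INTX_DISABLE);
        return IRQ_HANDLED;
}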

The interrupt storm does seem to be related to the bus resets, but I
can't figure out yet how multiple devices per switch factors into the
issue. Serializing all bus resets via a mutex doesn't seem to change
the behavior.

I'm still investigating, but if anyone knows how to get access to the
Broadcom datasheet or errata for this switch, please let me know.
Thanks,

Alex

2021-09-16 05:15:14

by Matthew Ruffell

Subject: Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio

On 16/09/21 4:32 am, Alex Williamson wrote:
> On Wed, 15 Sep 2021 16:44:38 +1200
> Matthew Ruffell <[email protected]> wrote:
>> On 15/09/21 4:43 am, Alex Williamson wrote:
>>>
>>> FWIW, I have access to a system with an NVIDIA K1 and M60, both use
>>> this same switch on-card and I've not experienced any issues assigning
>>> all the GPUs to a single VM. Topo:
>>>
>>> +-[0000:40]-+-02.0-[42-47]----00.0-[43-47]--+-08.0-[44]----00.0
>>> | +-09.0-[45]----00.0
>>> | +-10.0-[46]----00.0
>>> | \-11.0-[47]----00.0
>>> \-[0000:00]-+-03.0-[04-07]----00.0-[05-07]--+-08.0-[06]----00.0
>>> \-10.0-[07]----00.0
>
>
> I've actually found that the above configuration, assigning all 6 GPUs
> to a VM reproduces this pretty readily by simply rebooting the VM. In
> my case, I don't have the panic-on-warn/oops that must be set on your
> kernel, so the result is far more benign, the IRQ gets masked until
> it's re-registered.
>
> The fact that my upstream ports are using MSI seems irrelevant.

Hi Alex,



It is good news that you can reproduce an interrupt storm locally. Did a single reboot trigger the storm, or did you have to loop the VM a few times?

On our system, if we don't have panic-on-warn/oops set, the system will eventually grind to a halt and lock up, so we try to reset earlier on the first oops, but we still get stuck in the crashkernel copying the IR tables from DMAR.

>
> Adding debugging to the vfio-pci interrupt handler, it's correctly
> deferring the interrupt as the GPU device is not identifying itself as
> the source of the interrupt via the status register. In fact, setting
> the disable INTx bit in the GPU command register while the interrupt
> storm occurs does not stop the interrupts.
>

Interesting. So the source of the interrupts could be the PEX switch itself?

We did a run with DisINTx+ set on the PEX switches, but it didn't make any difference. Serial log showing DisINTx+ and full dmesg below:

https://paste.ubuntu.com/p/n3XshCxPT8/

> The interrupt storm does seem to be related to the bus resets, but I
> can't figure out yet how multiple devices per switch factors into the
> issue. Serializing all bus resets via a mutex doesn't seem to change
> the behavior.

Very interesting indeed.

> I'm still investigating, but if anyone knows how to get access to the
> Broadcom datasheet or errata for this switch, please let me know.

I have tried reaching out to Broadcom asking for the datasheet and errata, but I am unsure if they will get back to me.

They list the errata as publicly available on their website, in the Documentation > Errata tab:

https://www.broadcom.com/products/pcie-switches-bridges/pcie-switches/pex8749#documentation

The file "PEX 8749/48/47/33/32/25/24/23/17/16/13/12 Errata" seems to be missing though:

https://docs.broadcom.com/docs/PEX8749-48-47-33-32-25-24-23-17-16-13-12%20Errata-and-Cautions

An Intel document talks about the errata for the PEX 8749:

https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/rn/rn-ias-n3000-n.pdf

It links to the following URL, which is also missing:

https://docs.broadcom.com/docs/pub-005018

I did, however, find an older errata document:

PEX 87xx Errata Version 1.14, September 25, 2015
https://docs.broadcom.com/doc/pub-005017

I will keep trying, and I will let you know if we manage to come across any documents.

Thank you for your efforts.

Matthew

> Thanks,
> Alex
>

2021-10-05 05:05:30

by Matthew Ruffell

Subject: Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio

Hi Alex,

Have you had an opportunity to have a look at this a bit deeper?

On 16/09/21 4:32 am, Alex Williamson wrote:
>
> Adding debugging to the vfio-pci interrupt handler, it's correctly
> deferring the interrupt as the GPU device is not identifying itself as
> the source of the interrupt via the status register. In fact, setting
> the disable INTx bit in the GPU command register while the interrupt
> storm occurs does not stop the interrupts.
>
> The interrupt storm does seem to be related to the bus resets, but I
> can't figure out yet how multiple devices per switch factors into the
> issue. Serializing all bus resets via a mutex doesn't seem to change
> the behavior.
>
> I'm still investigating, but if anyone knows how to get access to the
> Broadcom datasheet or errata for this switch, please let me know.

We have managed to obtain a recent errata for this switch, and it doesn't mention any interrupt storms with nested switches. What would I be looking for in the errata? I cannot share our copy, sorry.



Is there anything that we can do to help?



Thanks,

Matthew

2021-10-05 23:15:27

by Alex Williamson

Subject: Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio

On Tue, 5 Oct 2021 18:02:24 +1300
Matthew Ruffell <[email protected]> wrote:

> Hi Alex,
>
> Have you had an opportunity to have a look at this a bit deeper?
>
> On 16/09/21 4:32 am, Alex Williamson wrote:
> >
> > Adding debugging to the vfio-pci interrupt handler, it's correctly
> > deferring the interrupt as the GPU device is not identifying itself as
> > the source of the interrupt via the status register. In fact, setting
> > the disable INTx bit in the GPU command register while the interrupt
> > storm occurs does not stop the interrupts.
> >
> > The interrupt storm does seem to be related to the bus resets, but I
> > can't figure out yet how multiple devices per switch factors into the
> > issue. Serializing all bus resets via a mutex doesn't seem to change
> > the behavior.
> >
> > I'm still investigating, but if anyone knows how to get access to the
> > Broadcom datasheet or errata for this switch, please let me know.
>
> We have managed to obtain a recent errata for this switch, and it
> doesn't
> mention any interrupt storms with nested switches. What would
> I be looking for
> in the errata? I cannot share our copy, sorry.

I dug back into this today and I'm thinking that it doesn't have
anything to do with the PCIe switch hardware. In my case, I believe
the switch is mostly just imposing interrupt sharing between pairs of
GPUs under the switches. For example, in the case of the GRID K1, the
1st & 3rd share an interrupt, as do the 2nd & 4th, so I believe I could
get away with assigning one from each shared set together.

The interrupt sharing is a problem because occasionally one of the GPUs
will continuously stomp on the interrupt line while there's no handler
configured, the other GPU replies "not me", and the kernel eventually
squashes the line.

In one case I see this happening when vfio-pci calls
pci_free_irq_vectors() when we're tearing down the MSI interrupt. This
is the nastiest case because this function wants to clear DisINTx in
pci_intx_for_msi(), where the free-irq-vectors function doesn't even
return to vfio-pci code so that we could mask INTx before the interrupt
storm does its thing. I've got a workaround for this in the patch I'm
playing with below, but it's exceptionally hacky.

Another case I see is that DisINTx will be cleared while the device is
still screaming on the interrupt line, but userspace doesn't yet have a
handler setup. I've had a notion that we need some sort of guard
handler to protect the host from such situations, ie. a handler that
only serves to squelch the device in cases where we could have a shared
interrupt. The patch below also includes swapping in this handler
between userspace interrupt configurations.

With both of these together, I'm so far able to prevent an interrupt
storm for these cards. I'd say the patch below is still extremely
experimental, and I'm not sure how to get around the really hacky bit,
but it would be interesting to see if it resolves the original issue.
I've not yet tested this on a variety of devices, so YMMV. Thanks,

Alex

(patch vs v5.14)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 318864d52837..c8500fcda5b8 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -347,6 +347,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
vdev->pci_2_3 = pci_intx_mask_supported(pdev);
}

+ vfio_intx_stub_init(vdev);
pci_read_config_word(pdev, PCI_COMMAND, &cmd);
if (vdev->pci_2_3 && (cmd & PCI_COMMAND_INTX_DISABLE)) {
cmd &= ~PCI_COMMAND_INTX_DISABLE;
@@ -447,6 +448,14 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
kfree(dummy_res);
}

+ /*
+ * Set known command register state, disabling MSI/X (via busmaster)
+ * and INTx directly. At this point we can teardown the INTx stub
+ * handler initialized from the SET_IRQS teardown above.
+ */
+ pci_write_config_word(pdev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
+ vfio_intx_stub_exit(vdev);
+
vdev->needs_reset = true;

/*
@@ -464,12 +473,6 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
pci_save_state(pdev);
}

- /*
- * Disable INTx and MSI, presumably to avoid spurious interrupts
- * during reset. Stolen from pci_reset_function()
- */
- pci_write_config_word(pdev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
-
/*
* Try to get the locks ourselves to prevent a deadlock. The
* success of this is dependent on being able to lock the device,
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 869dce5f134d..31978c1b0103 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -139,6 +139,44 @@ static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
return ret;
}

+static irqreturn_t vfio_intx_stub(int irq, void *dev_id)
+{
+ struct vfio_pci_device *vdev = dev_id;
+
+ if (pci_check_and_mask_intx(vdev->pdev))
+ return IRQ_HANDLED;
+
+ return IRQ_NONE;
+}
+
+void vfio_intx_stub_init(struct vfio_pci_device *vdev)
+{
+ char *name;
+
+ if (vdev->nointx || !vdev->pci_2_3 || !vdev->pdev->irq)
+ return;
+
+ name = kasprintf(GFP_KERNEL, "vfio-intx-stub(%s)",
+ pci_name(vdev->pdev));
+ if (!name)
+ return;
+
+ if (request_irq(vdev->pdev->irq, vfio_intx_stub,
+ IRQF_SHARED, name, vdev))
+ kfree(name);
+
+ vdev->intx_stub = true;
+}
+
+void vfio_intx_stub_exit(struct vfio_pci_device *vdev)
+{
+ if (!vdev->intx_stub)
+ return;
+
+ kfree(free_irq(vdev->pdev->irq, vdev));
+ vdev->intx_stub = false;
+}
+
static int vfio_intx_enable(struct vfio_pci_device *vdev)
{
if (!is_irq_none(vdev))
@@ -153,6 +191,8 @@ static int vfio_intx_enable(struct vfio_pci_device *vdev)

vdev->num_ctx = 1;

+ vfio_intx_stub_exit(vdev);
+
/*
* If the virtual interrupt is masked, restore it. Devices
* supporting DisINTx can be masked at the hardware level
@@ -231,6 +271,7 @@ static void vfio_intx_disable(struct vfio_pci_device *vdev)
vdev->irq_type = VFIO_PCI_NUM_IRQS;
vdev->num_ctx = 0;
kfree(vdev->ctx);
+ vfio_intx_stub_init(vdev);
}

/*
@@ -258,6 +299,8 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
if (!vdev->ctx)
return -ENOMEM;

+ vfio_intx_stub_exit(vdev);
+
/* return the number of supported vectors if we can't get all: */
cmd = vfio_pci_memory_lock_and_enable(vdev);
ret = pci_alloc_irq_vectors(pdev, 1, nvec, flag);
@@ -266,6 +309,7 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
pci_free_irq_vectors(pdev);
vfio_pci_memory_unlock_and_restore(vdev, cmd);
kfree(vdev->ctx);
+ vfio_intx_stub_init(vdev);
return ret;
}
vfio_pci_memory_unlock_and_restore(vdev, cmd);
@@ -388,6 +432,7 @@ static int vfio_msi_set_block(struct vfio_pci_device *vdev, unsigned start,
static void vfio_msi_disable(struct vfio_pci_device *vdev, bool msix)
{
struct pci_dev *pdev = vdev->pdev;
+ pci_dev_flags_t dev_flags = pdev->dev_flags;
int i;
u16 cmd;

@@ -399,19 +444,22 @@ static void vfio_msi_disable(struct vfio_pci_device *vdev, bool msix)
vfio_msi_set_block(vdev, 0, vdev->num_ctx, NULL, msix);

cmd = vfio_pci_memory_lock_and_enable(vdev);
- pci_free_irq_vectors(pdev);
- vfio_pci_memory_unlock_and_restore(vdev, cmd);
-
/*
- * Both disable paths above use pci_intx_for_msi() to clear DisINTx
- * via their shutdown paths. Restore for NoINTx devices.
+ * XXX pci_intx_for_msi() will clear DisINTx, which can trigger an
+ * INTx storm even before we return from pci_free_irq_vectors(), even
+ * as we'll restore the previous command register immediately after.
+ * Hack around it by masking in a dev_flag to prevent such behavior.
*/
- if (vdev->nointx)
- pci_intx(pdev, 0);
+ pdev->dev_flags |= PCI_DEV_FLAGS_MSI_INTX_DISABLE_BUG;
+ pci_free_irq_vectors(pdev);
+ pdev->dev_flags = dev_flags;
+
+ vfio_pci_memory_unlock_and_restore(vdev, cmd);

vdev->irq_type = VFIO_PCI_NUM_IRQS;
vdev->num_ctx = 0;
kfree(vdev->ctx);
+ vfio_intx_stub_init(vdev);
}

/*
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 5a36272cecbf..709d497b528c 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -128,6 +128,7 @@ struct vfio_pci_device {
bool needs_reset;
bool nointx;
bool needs_pm_restore;
+ bool intx_stub;
struct pci_saved_state *pci_saved_state;
struct pci_saved_state *pm_save;
struct vfio_pci_reflck *reflck;
@@ -151,6 +152,9 @@ struct vfio_pci_device {
#define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
#define irq_is(vdev, type) (vdev->irq_type == type)

+extern void vfio_intx_stub_init(struct vfio_pci_device *vdev);
+extern void vfio_intx_stub_exit(struct vfio_pci_device *vdev);
+
extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);


2021-10-12 05:01:15

by Matthew Ruffell

Subject: Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio

Hi Alex,

On Wed, Oct 6, 2021 at 12:13 PM Alex Williamson
<[email protected]> wrote:
> With both of these together, I'm so far able to prevent an interrupt
> storm for these cards. I'd say the patch below is still extremely
> experimental, and I'm not sure how to get around the really hacky bit,
> but it would be interesting to see if it resolves the original issue.
> I've not yet tested this on a variety of devices, so YMMV. Thanks,

Thank you very much for your analysis and for the experimental patch, and we
have excellent news to report.

I sent Nathan a test kernel built on 5.14.0, and he has been running the
reproducer for a few days now.

Nathan writes:

> I've been testing heavily with the reproducer for a few days using all 8 GPUs
> and with the MSI fix for the audio devices in the guest disabled, i.e. a pretty
> much worst case scenario. As a control with kernel 5.14 (unpatched), the system
> locked up in 2,2,6,1, and 4 VM reset iterations, all in less than 10 minutes
> each time. With the patched kernel I'm currently at 1226 iterations running for
> 2 days 10 hours with no failures. This is excellent. FYI, I have disabled the
> dyndbg setting.

The system is stable, and your patch sounds very promising.

Nathan does have a small side effect to report:

> The only thing close to an issue that I have is that I still get frequent
> "irq 112: nobody cared" and "Disabling IRQ #112" errors. They just no longer
> lockup the system. If I watch the reproducer time between VM resets, I've
> noticed that it takes longer for the VM to startup after one of these
> "nobody cared" errors, and thus it takes longer until I can reset the VM again.
> I believe slow guest behavior in this disabled IRQ scenario is expected though?

Full dmesg:
https://paste.ubuntu.com/p/hz8WdPZmNZ/

I had a look at all the lspci Nathan has provided me in the past, but 112 isn't
listed. I will ask Nathan for a fresh lspci so we can see what device it is.
The interesting thing is that we still hit __report_bad_irq() for 112 when we
have previously disabled it, typically after 1000+ seconds has gone by.

We think your patch fixes the interrupt storm issues. We are happy to continue
testing for as much as you need, and we are happy to test any followup patch
revisions.

Is there anything you can do to feel more comfortable about the
PCI_DEV_FLAGS_MSI_INTX_DISABLE_BUG dev flag hack? While it works, I can see why
you might not want to land it in mainline.

Thanks,
Matthew

2021-10-12 20:09:21

by Alex Williamson

Subject: Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio

On Tue, 12 Oct 2021 17:58:07 +1300
Matthew Ruffell <[email protected]> wrote:

> Hi Alex,
>
> On Wed, Oct 6, 2021 at 12:13 PM Alex Williamson
> <[email protected]> wrote:
> > With both of these together, I'm so far able to prevent an interrupt
> > storm for these cards. I'd say the patch below is still extremely
> > experimental, and I'm not sure how to get around the really hacky bit,
> > but it would be interesting to see if it resolves the original issue.
> > I've not yet tested this on a variety of devices, so YMMV. Thanks,
>
> Thank you very much for your analysis and for the experimental patch, and we
> have excellent news to report.
>
> I sent Nathan a test kernel built on 5.14.0, and he has been running the
> reproducer for a few days now.
>
> Nathan writes:
>
> > I've been testing heavily with the reproducer for a few days using all 8 GPUs
> > and with the MSI fix for the audio devices in the guest disabled, i.e. a pretty
> > much worst case scenario. As a control with kernel 5.14 (unpatched), the system
> > locked up in 2,2,6,1, and 4 VM reset iterations, all in less than 10 minutes
> > each time. With the patched kernel I'm currently at 1226 iterations running for
> > 2 days 10 hours with no failures. This is excellent. FYI, I have disabled the
> > dyndbg setting.
>
> The system is stable, and your patch sounds very promising.

Great, I also ran a VM reboot loop for several days with all 6 GPUs
assigned, no interrupt issues.

> Nathan does have a small side effect to report:
>
> > The only thing close to an issue that I have is that I still get frequent
> > "irq 112: nobody cared" and "Disabling IRQ #112" errors. They just no longer
> > lockup the system. If I watch the reproducer time between VM resets, I've
> > noticed that it takes longer for the VM to startup after one of these
> > "nobody cared" errors, and thus it takes longer until I can reset the VM again.
> > I believe slow guest behavior in this disabled IRQ scenario is expected though?
>
> Full dmesg:
> https://paste.ubuntu.com/p/hz8WdPZmNZ/
>
> I had a look at all the lspci Nathan has provided me in the past, but 112 isn't
> listed. I will ask Nathan for a fresh lspci so we can see what device it is.
> The interesting thing is that we still hit __report_bad_irq() for 112 when we
> have previously disabled it, typically after 1000+ seconds has gone by.

The device might need to be operating in INTx mode, or at least had
been at some point, to get the register filled. It's essentially just
a scratch register on the card that gets filled when the interrupt is
configured.

Each time we register a new handler for the irq the masking due to
spurious interrupt will be removed, but if it's actually causing the VM
boot to take longer that suggests to me that the guest driver is
stalled, perhaps because it's expecting an interrupt that's now masked
in the host. This could also be caused by a device that gets
incorrectly probed for PCI-2.3 compliant interrupt masking. For
probing we can really only test that we have the ability to set the
DisINTx bit, we can only hope that the hardware folks also properly
implemented the INTx status bit to indicate the device is signaling
INTx. We should really figure out which device this is so that we can
focus on whether it's another shared interrupt issue or something
specific to the device.
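
(For illustration, the probe in question boils down to checking that DisINTx is writable; nothing verifies the status side. Sketch only, names are made up:)

static bool intx_masking_probe_sketch(struct pci_dev *pdev)
{
        u16 orig, toggled;
        bool writable;

        /* Flip DisINTx, read it back, then restore the original value */
        pci_read_config_word(pdev, PCI_COMMAND, &orig);
        pci_write_config_word(pdev, PCI_COMMAND, orig ^ PCI_COMMAND_INTX_DISABLE);
        pci_read_config_word(pdev, PCI_COMMAND, &toggled);
        writable = (orig ^ toggled) & PCI_COMMAND_INTX_DISABLE;
        pci_write_config_word(pdev, PCI_COMMAND, orig);

        /* Whether the INTx Status bit works cannot be probed this way */
        return writable;
}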

I'm also confused why this doesn't trigger the same panic/kexec as we
were seeing with the other interrupt lines. Are there some downstream
patches or configs missing here that would promote these to more fatal
errors?

> We think your patch fixes the interrupt storm issues. We are happy to continue
> testing for as much as you need, and we are happy to test any followup patch
> revisions.
>
> Is there anything you can do to feel more comfortable about the
> PCI_DEV_FLAGS_MSI_INTX_DISABLE_BUG dev flag hack? While it works, I can see why
> you might not want to land it in mainline.

Yeah, it's a huge hack. I wonder if we could look at the interrupt
status and conditional'ize clearing DisINTx based on lack of a pending
interrupt. It seems somewhat reasonable not to clear the bit masking
the interrupt if we know it's pending and know there's no handler for
it. I'll try to check if that's possible. Thanks,

Alex
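
(A minimal sketch of the conditional approach described above, using a hypothetical helper rather than an actual patch:)

static void intx_enable_if_idle_sketch(struct pci_dev *pdev)
{
        u16 status;

        /*
         * Leave DisINTx set if the device is still asserting INTx and no
         * handler is registered yet; only clear it when the line is idle.
         */
        pci_read_config_word(pdev, PCI_STATUS, &status);
        if (!(status & PCI_STATUS_INTERRUPT))
                pci_intx(pdev, 1);      /* clears DisINTx */
}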

2021-10-12 22:40:05

by Matthew Ruffell

Subject: Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio

Hi Alex,

On Wed, Oct 13, 2021 at 9:05 AM Alex Williamson
<[email protected]> wrote:
> On Tue, 12 Oct 2021 17:58:07 +1300
> Matthew Ruffell <[email protected]> wrote:
> > Nathan does have a small side effect to report:
> >
> > > The only thing close to an issue that I have is that I still get frequent
> > > "irq 112: nobody cared" and "Disabling IRQ #112" errors. They just no longer
> > > lockup the system. If I watch the reproducer time between VM resets, I've
> > > noticed that it takes longer for the VM to startup after one of these
> > > "nobody cared" errors, and thus it takes longer until I can reset the VM again.
> > > I believe slow guest behavior in this disabled IRQ scenario is expected though?
>
> The device might need to be operating in INTx mode, or at least had
> been at some point, to get the register filled. It's essentially just
> a scratch register on the card that gets filled when the interrupt is
> configured.
>
> Each time we register a new handler for the irq the masking due to
> spurious interrupt will be removed, but if it's actually causing the VM
> boot to take longer that suggests to me that the guest driver is
> stalled, perhaps because it's expecting an interrupt that's now masked
> in the host. This could also be caused by a device that gets
> incorrectly probed for PCI-2.3 compliant interrupt masking. For
> probing we can really only test that we have the ability to set the
> DisINTx bit, we can only hope that the hardware folks also properly
> implemented the INTx status bit to indicate the device is signaling
> INTx. We should really figure out which device this is so that we can
> focus on whether it's another shared interrupt issue or something
> specific to the device.

Nathan got back to me and the devices are the same GPU audio controller pair
from before, so it might be another shared interrupt issue, since they both
share IRQ 112.

$ sudo lspci -vvv | grep "IRQ 112" -B 5 -A 10
88:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio
Controller (rev a1)
Subsystem: eVga.com. Corp. TU102 High Definition Audio Controller
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin B routed to IRQ 112
NUMA node: 1
Region 0: Memory at f7080000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+
FLReset- SlotPowerLimit 25.000W
--
89:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio
Controller (rev a1)
Subsystem: eVga.com. Corp. TU102 High Definition Audio Controller
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin B routed to IRQ 112
NUMA node: 1
Region 0: Memory at f5080000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+
FLReset- SlotPowerLimit 25.000W

> I'm also confused why this doesn't trigger the same panic/kexec as we
> were seeing with the other interrupt lines. Are there some downstream
> patches or configs missing here that would promote these to more fatal
> errors?
>
There aren't any downstream patches, since the machine lockup happens with
regular mainline kernels too. Even without panic on oops set, the system will
grind to a halt and hang. The panic on oops sysctl was an attempt to get the
machine to reboot to the crashkernel and restart again, but it didn't work very
well since we get stuck copying the IR tables from DMAR.

But your patches seem to fix the hang, which is very promising.

Thanks,
Matthew

2021-11-01 04:36:56

by Matthew Ruffell

[permalink] [raw]
Subject: Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio

Hi Alex,

Nathan has been running a workload on the 5.14 kernel + the test patch, and has
run into some interesting softlockups and hardlockups.

The first, happened on a secondary server running a Windows VM, with 7 (of 10)
1080TI GPUs passed through.

Full dmesg:
https://paste.ubuntu.com/p/Wx5hCBBXKb/

There aren't any "irq x: nobody cared" messages, and the crashkernel gets stuck
at the usual "copying IR tables from dmar" step, which suggests an ongoing
interrupt storm.

Nathan disabled "kernel.hardlockup_panic = 1" sysctl, and managed to reproduce
the issue again, suggesting that we get stuck in kernel space for too long
without the ability for interrupts to be serviced.
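
Roughly, the host setting in play was:

  # keep the host up on a hard lockup so the hang can be observed,
  # rather than panicking straight into the crashkernel
  kernel.hardlockup_panic = 0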

It starts with the NIC hitting a tx queue timeout, and the kernel then sends an
NMI to each CPU to unwind its stack, although the stacks don't appear to
indicate where things are stuck. The server then remains softlocked, and keeps
unwinding stacks every 26 seconds or so, until it eventually hard locks up.

Full dmesg:
https://people.canonical.com/~mruffell/sf314568/1080TI_hardlockup.txt

The next interesting thing to report is when Nathan started the same Windows VM
on the primary host we have been debugging on, with the 8x 2080TI GPUs. Nathan
experienced a stuck VM, with the host responding just fine. When Nathan reset
the VM, he got 4x "irq xx: nobody cared" messages on IRQs 25, 27, 29 and 31,
which at the time corresponded to the PEX 8747 upstream PCI switches.

Interestingly, Nathan also observed 2x GPU Audio devices sharing the same IRQ
line as the upstream PCI switch, although Nathan mentioned this only occurred
very briefly, and the GPU audio devices were re-assigned different IRQs shortly
afterward.

Full dmesg:
https://paste.ubuntu.com/p/C2V4CY3yjZ/

Output showing upstream ports belonging to those IRQs:
https://paste.ubuntu.com/p/6fkSbyFNWT/

Full lspci:
https://paste.ubuntu.com/p/CTX5kbjpRP/

Let us know if you would like any additional debug information. As always, we
are happy to test patches out.

Thanks,
Matthew

2021-11-04 23:00:16

by Alex Williamson

[permalink] [raw]
Subject: Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio

On Mon, 1 Nov 2021 17:35:04 +1300
Matthew Ruffell <[email protected]> wrote:

> Hi Alex,
>
> Nathan has been running a workload on the 5.14 kernel + the test patch, and has
> run into some interesting softlockups and hardlockups.
>
> The first, happened on a secondary server running a Windows VM, with 7 (of 10)
> 1080TI GPUs passed through.
>
> Full dmesg:
> https://paste.ubuntu.com/p/Wx5hCBBXKb/
>
> There aren't any "irq x: nobody cared" messages, and the crashkernel gets stuck
> at the usual "copying IR tables from dmar" step, which suggests an ongoing
> interrupt storm.
>
> Nathan disabled the "kernel.hardlockup_panic = 1" sysctl, and managed to reproduce
> the issue again, suggesting that we get stuck in kernel space for too long
> without the ability for interrupts to be serviced.
>
> It starts with the NIC hitting a tx queue timeout, and the kernel then sends an
> NMI to each CPU to unwind its stack, although the stacks don't appear to
> indicate where things are stuck. The server then remains softlocked, and keeps
> unwinding stacks every 26 seconds or so, until it eventually hard locks up.

Google finds numerous complaints about transmit queue timeouts on igb
devices, bad NICs, bad cabling, bad drivers(?). I also see some
hearsay related specifically to Supermicro compatibility. I'd also
suspect that a dual 1GbE NIC is sub-par for anything involving 7+ GPUs.
Time for an upgrade?

It's not clear to me how this would be related to the GPU assignment,
other than perhaps the elevated workload on the host.

> The next interesting thing to report is when Nathan started the same Windows VM
> on the primary host we have been debugging on, with the 8x 2080TI GPUs. Nathan
> experienced a stuck VM, with the host responding just fine. When Nathan reset
> the VM, he got 4x "irq xx: nobody cared" messages on IRQs 25, 27, 29 and 31,
> which at the time corresponded to the PEX 8747 upstream PCI switches.
>
> Interestingly, Nathan also observed 2x GPU Audio devices sharing the same IRQ
> line as the upstream PCI switch, although Nathan mentioned this only occurred
> very briefly, and the GPU audio devices were re-assigned different IRQs shortly
> afterward.

IME, the legacy interrupt support on NVIDIA GPU audio devices is
marginal for assignment. We don't claim to support assignment of the
audio function, even for Quadro cards on RHEL due to this. I can't
remember the details off the top of my head, but even with the hacky
safeguards added in the test patch, we still rely on hardware to both
honor the INTx disable bit in the command register and accurately report
whether the device is asserting INTx in the status register. It seems like
one of these was a bit dicey in this controller.

Now that I think about it more, I recall that the issue was
predominantly with Linux guests, where the snd_hda_intel driver
includes:

/* quirks for Nvidia */
#define AZX_DCAPS_PRESET_NVIDIA \
	(AZX_DCAPS_NO_MSI | AZX_DCAPS_CORBRP_SELF_CLEAR |\
	 AZX_DCAPS_SNOOP_TYPE(NVIDIA))

And the device table includes:

	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID),
	  .class = PCI_CLASS_MULTIMEDIA_HD_AUDIO << 8,
	  .class_mask = 0xffffff,
	  .driver_data = AZX_DRIVER_NVIDIA | AZX_DCAPS_PRESET_NVIDIA },

That NO_MSI quirk forces the sound driver to use legacy interrupts for
all NVIDIA HD audio devices. I think this made audio function
assignment to Linux guests essentially unusable without using the
snd_hda_intel.enable_msi=1 driver option to re-enable MSI. Windows
uses MSI for these devices, so it works better by default, but when
we're resetting the VM we're still transitioning through this mode
where I'm not confident that the hardware behaves in a
manageable way.
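
For a Linux guest the practical workaround is to turn MSI back on in
the guest's audio driver, e.g. something like this on the guest side
(file name illustrative):

  # /etc/modprobe.d/snd-hda-intel.conf
  options snd-hda-intel enable_msi=1

or snd_hda_intel.enable_msi=1 on the guest kernel command line.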

My PCIe switch configuration with NVIDIA GPUs only has Tesla cards, so
I don't have a way to reproduce this specific shared INTx issue, but it
may be time to revisit examining the register behavior while running in
INTx mode. Thanks,

Alex

2021-11-24 05:52:35

by Matthew Ruffell

[permalink] [raw]
Subject: Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio

Hi Alex,

I have forward ported your patch to 5.16-rc2 to account for the vfio module
refactor that happened recently. Attached below.

Have you had an opportunity to research if it is possible to conditionalise
clearing DisINTx by looking at the interrupt status and seeing if there is a
pending interrupt but no handler set?

We are testing a 5.16-rc2 kernel with the patch applied on Nathan's server
currently, and we are also trying out the pci=clearmsi command line parameter
that was discussed on linux-pci a few years ago in [1][2][3][4] along with
setting snd-hda-intel.enable_msi=1 to see if it helps the crashkernel not get
stuck copying IR tables.

[1] https://marc.info/?l=linux-pci&m=153988799707413
[2] https://lore.kernel.org/linux-pci/[email protected]/
[3] https://lore.kernel.org/linux-pci/[email protected]/
[4] https://lore.kernel.org/linux-pci/[email protected]/

I will let you know how we get on.

Thanks,
Matthew

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index f948e6cd2993..cbca207ddc45 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -276,6 +276,7 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
vdev->pci_2_3 = pci_intx_mask_supported(pdev);
}

+ vfio_intx_stub_init(vdev);
pci_read_config_word(pdev, PCI_COMMAND, &cmd);
if (vdev->pci_2_3 && (cmd & PCI_COMMAND_INTX_DISABLE)) {
cmd &= ~PCI_COMMAND_INTX_DISABLE;
@@ -365,6 +366,14 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
kfree(dummy_res);
}

+ /*
+ * Set known command register state, disabling MSI/X (via busmaster)
+ * and INTx directly. At this point we can teardown the INTx stub
+ * handler initialized from the SET_IRQS teardown above.
+ */
+ pci_write_config_word(pdev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
+ vfio_intx_stub_exit(vdev);
+
vdev->needs_reset = true;

/*
@@ -382,12 +391,6 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
pci_save_state(pdev);
}

- /*
- * Disable INTx and MSI, presumably to avoid spurious interrupts
- * during reset. Stolen from pci_reset_function()
- */
- pci_write_config_word(pdev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
-
/*
* Try to get the locks ourselves to prevent a deadlock. The
* success of this is dependent on being able to lock the device,
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 6069a11fb51a..98cf528aa175 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -139,6 +139,44 @@ static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
return ret;
}

+static irqreturn_t vfio_intx_stub(int irq, void *dev_id)
+{
+ struct vfio_pci_core_device *vdev = dev_id;
+
+ if (pci_check_and_mask_intx(vdev->pdev))
+ return IRQ_HANDLED;
+
+ return IRQ_NONE;
+}
+
+void vfio_intx_stub_init(struct vfio_pci_core_device *vdev)
+{
+ char *name;
+
+ if (vdev->nointx || !vdev->pci_2_3 || !vdev->pdev->irq)
+ return;
+
+ name = kasprintf(GFP_KERNEL, "vfio-intx-stub(%s)",
+ pci_name(vdev->pdev));
+ if (!name)
+ return;
+
+ if (request_irq(vdev->pdev->irq, vfio_intx_stub,
+ IRQF_SHARED, name, vdev))
+ kfree(name);
+
+ vdev->intx_stub = true;
+}
+
+void vfio_intx_stub_exit(struct vfio_pci_core_device *vdev)
+{
+ if (!vdev->intx_stub)
+ return;
+
+ kfree(free_irq(vdev->pdev->irq, vdev));
+ vdev->intx_stub = false;
+}
+
static int vfio_intx_enable(struct vfio_pci_core_device *vdev)
{
if (!is_irq_none(vdev))
@@ -153,6 +191,8 @@ static int vfio_intx_enable(struct vfio_pci_core_device *vdev)

vdev->num_ctx = 1;

+ vfio_intx_stub_exit(vdev);
+
/*
* If the virtual interrupt is masked, restore it. Devices
* supporting DisINTx can be masked at the hardware level
@@ -231,6 +271,7 @@ static void vfio_intx_disable(struct vfio_pci_core_device *vdev)
vdev->irq_type = VFIO_PCI_NUM_IRQS;
vdev->num_ctx = 0;
kfree(vdev->ctx);
+ vfio_intx_stub_init(vdev);
}

/*
@@ -258,6 +299,8 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
if (!vdev->ctx)
return -ENOMEM;

+ vfio_intx_stub_exit(vdev);
+
/* return the number of supported vectors if we can't get all: */
cmd = vfio_pci_memory_lock_and_enable(vdev);
ret = pci_alloc_irq_vectors(pdev, 1, nvec, flag);
@@ -266,6 +309,7 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
pci_free_irq_vectors(pdev);
vfio_pci_memory_unlock_and_restore(vdev, cmd);
kfree(vdev->ctx);
+ vfio_intx_stub_init(vdev);
return ret;
}
vfio_pci_memory_unlock_and_restore(vdev, cmd);
@@ -388,6 +432,7 @@ static int vfio_msi_set_block(struct vfio_pci_core_device *vdev, unsigned start,
static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)
{
struct pci_dev *pdev = vdev->pdev;
+ pci_dev_flags_t dev_flags = pdev->dev_flags;
int i;
u16 cmd;

@@ -399,19 +444,22 @@ static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)
vfio_msi_set_block(vdev, 0, vdev->num_ctx, NULL, msix);

cmd = vfio_pci_memory_lock_and_enable(vdev);
- pci_free_irq_vectors(pdev);
- vfio_pci_memory_unlock_and_restore(vdev, cmd);

/*
- * Both disable paths above use pci_intx_for_msi() to clear DisINTx
- * via their shutdown paths. Restore for NoINTx devices.
+ * XXX pci_intx_for_msi() will clear DisINTx, which can trigger an
+ * INTx storm even before we return from pci_free_irq_vectors(), even
+ * as we'll restore the previous command register immediately after.
+ * Hack around it by masking in a dev_flag to prevent such behavior.
*/
- if (vdev->nointx)
- pci_intx(pdev, 0);
+ pdev->dev_flags |= PCI_DEV_FLAGS_MSI_INTX_DISABLE_BUG;
+ pci_free_irq_vectors(pdev);
+ pdev->dev_flags = dev_flags;

+ vfio_pci_memory_unlock_and_restore(vdev, cmd);
vdev->irq_type = VFIO_PCI_NUM_IRQS;
vdev->num_ctx = 0;
kfree(vdev->ctx);
+ vfio_intx_stub_init(vdev);
}

/*
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index ef9a44b6cf5d..58e1029eb083 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -124,6 +124,7 @@ struct vfio_pci_core_device {
bool needs_reset;
bool nointx;
bool needs_pm_restore;
+ bool intx_stub;
struct pci_saved_state *pci_saved_state;
struct pci_saved_state *pm_save;
int ioeventfds_nr;
@@ -145,6 +146,9 @@ struct vfio_pci_core_device {
#define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
#define irq_is(vdev, type) (vdev->irq_type == type)

+extern void vfio_intx_stub_init(struct vfio_pci_core_device *vdev);
+extern void vfio_intx_stub_exit(struct vfio_pci_core_device *vdev);
+
extern void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
extern void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev);

2021-11-29 17:58:39

by Alex Williamson

[permalink] [raw]
Subject: Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio

On Wed, 24 Nov 2021 18:52:16 +1300
Matthew Ruffell <[email protected]> wrote:

> Hi Alex,
>
> I have forward ported your patch to 5.16-rc2 to account for the vfio module
> refactor that happened recently. Attached below.
>
> Have you had an opportunity to research if it is possible to conditionalise
> clearing DisINTx by looking at the interrupt status and seeing if there is a
> pending interrupt but no handler set?

Sorry, I've not had any time to continue looking at this. When I last
left it I had found that the interrupt bit in the status register was
not set prior to clearing INTxDisable in the command register, but the
status bit was set immediately upon clearing INTxDisable. That
suggests we could generalize re-masking INTx since we know there's not
a handler for it at this point, but it's not clear how this state gets
reported and cleared. More generally, should the interrupt code leave
INTx unmasked in any case where there's no handler? I'm not sure.
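
The shape of the idea, at the spot in vfio_pci_core_enable() where we
currently clear DisINTx unconditionally, would be roughly the following
(untested sketch, reusing the pdev/cmd/vdev names from that function):

	u16 cmd, status;

	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
	pci_read_config_word(pdev, PCI_STATUS, &status);

	/*
	 * Only clear DisINTx if the device isn't already reporting a
	 * pending INTx.  At this point nothing has registered a handler
	 * for the line, so unmasking a pending interrupt just invites
	 * the "nobody cared" path.  This of course relies on
	 * PCI_STATUS_INTERRUPT being implemented correctly, which is
	 * the open question.
	 */
	if (vdev->pci_2_3 && (cmd & PCI_COMMAND_INTX_DISABLE) &&
	    !(status & PCI_STATUS_INTERRUPT)) {
		cmd &= ~PCI_COMMAND_INTX_DISABLE;
		pci_write_config_word(pdev, PCI_COMMAND, cmd);
	}

How that latched status then gets reported and cleared without a
handler is the part I haven't worked through.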

> We are testing a 5.16-rc2 kernel with the patch applied on Nathan's server
> currently, and we are also trying out the pci=clearmsi command line parameter
> that was discussed on linux-pci a few years ago in [1][2][3][4] along with
> setting snd-hda-intel.enable_msi=1 to see if it helps the crashkernel not get
> stuck copying IR tables.
>
> [1] https://marc.info/?l=linux-pci&m=153988799707413
> [2] https://lore.kernel.org/linux-pci/[email protected]/
> [3] https://lore.kernel.org/linux-pci/[email protected]/
> [4] https://lore.kernel.org/linux-pci/[email protected]/
>
> I will let you know how we get on.

Ok. I've not had any luck reproducing audio INTx issues, and trying to
test it has led me on several tangent bug hunts :-\ Thanks,

Alex