2018-01-15 02:59:45

by Keith Busch

Subject: [BUG 4.15-rc7] IRQ matrix management errors

I hoped to have a better report before the weekend, but I've run out of
time and will be without my machine until next week, so I'm sending what
I have and praying someone more in the know will have a better clue.

I've a few NVMe drives, and occasionally the IRQ teardown and bring-up
fails. Resetting the controllers to re-run interrupt allocation results
in device interrupts not occurring at all.

It appears I need at least two NVMe drives to hit the problem. The NVMe
driver initializes controllers in parallel, and multiple threads calling
pci_alloc_irq_vectors and/or pci_free_irq_vectors at about the same time
seem to be required to trigger the issue.
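
For illustration, here's a minimal sketch of that racing pattern (my own
illustrative code with made-up names my_ctrl/my_reset_work, not the actual
nvme driver; per the backtraces below, the real driver does the equivalent
from nvme_reset_work):

	#include <linux/pci.h>
	#include <linux/workqueue.h>

	struct my_ctrl {
		struct pci_dev *pdev;
		struct work_struct reset_work;
	};

	/* One work item runs per controller on an unbound workqueue, so two
	 * controllers can be inside the vector alloc/free paths concurrently. */
	static void my_reset_work(struct work_struct *work)
	{
		struct my_ctrl *ctrl = container_of(work, struct my_ctrl, reset_work);
		int nr;

		pci_free_irq_vectors(ctrl->pdev);	/* tear down the old vectors */
		nr = pci_alloc_irq_vectors(ctrl->pdev, 1, 112, PCI_IRQ_MSIX);
		if (nr < 0)
			dev_err(&ctrl->pdev->dev, "vector allocation failed: %d\n", nr);
	}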

Here are the relevant warnings. I also have trace events from irq_matrix
and irq_vectors, but from a different run, so they won't match up to the
timestamps below; I can send them if that's helpful.

[ 288.519216] WARNING: CPU: 28 PID: 1420 at kernel/irq/matrix.c:222 irq_matrix_remove_managed+0x10f/0x120
[ 288.519218] Modules linked in: nvme ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_uncore intel_rapl_perf iTCO_wdt joydev iTCO_vendor_support ipmi_ssif ipmi_si mei_me mei shpchp i2c_i801 wmi lpc_ich ipmi_devintf ioatdma ipmi_msghandler dca acpi_pad acpi_power_meter xfs libcrc32c ast i2c_algo_bit drm_kms_helper ttm drm i40e crc32c_intel
[ 288.519286] ptp nvme_core pps_core [last unloaded: nvme]
[ 288.519294] CPU: 28 PID: 1420 Comm: kworker/u674:2 Not tainted 4.15.0-rc7+ #4
[ 288.519296] Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.00.01.2001.062220170731 06/22/2017
[ 288.519305] Workqueue: nvme-wq nvme_reset_work [nvme]
[ 288.519310] RIP: 0010:irq_matrix_remove_managed+0x10f/0x120
[ 288.519312] RSP: 0018:ffffb01f0aec7a88 EFLAGS: 00010046
[ 288.519315] RAX: 00000000000000ee RBX: ffff9e387d824900 RCX: 0000000000000000
[ 288.519317] RDX: 0000000000000100 RSI: 00000000000000ee RDI: ffff9e387d410c50
[ 288.519319] RBP: 0000000000000000 R08: 0000000000000100 R09: 0000000000000000
[ 288.519320] R10: 0000000000000018 R11: 0000000000000003 R12: ffff9e387d410c00
[ 288.519322] R13: ffff9e387d410c30 R14: 00000000000000ee R15: 00000000000000ee
[ 288.519324] FS: 0000000000000000(0000) GS:ffff9e487cc00000(0000) knlGS:0000000000000000
[ 288.519326] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 288.519328] CR2: 00007fbc1c2321d4 CR3: 00000010f5209003 CR4: 00000000007606e0
[ 288.519330] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 288.519332] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 288.519333] PKRU: 55555554
[ 288.519334] Call Trace:
[ 288.519345] x86_vector_free_irqs+0xa1/0x180
[ 288.519349] x86_vector_alloc_irqs+0x1e4/0x3a0
[ 288.519354] msi_domain_alloc+0x62/0x130
[ 288.519363] ? kmem_cache_alloc_node_trace+0x1ac/0x1d0
[ 288.519366] __irq_domain_alloc_irqs+0x121/0x300
[ 288.519370] msi_domain_alloc_irqs+0x99/0x2e0
[ 288.519376] native_setup_msi_irqs+0x54/0x90
[ 288.519383] __pci_enable_msix+0xfb/0x4e0
[ 288.519389] pci_alloc_irq_vectors_affinity+0x8e/0x130
[ 288.519394] nvme_reset_work+0x919/0x153b [nvme]
[ 288.519404] ? sched_clock+0x5/0x10
[ 288.519407] ? sched_clock+0x5/0x10
[ 288.519414] ? sched_clock_cpu+0xc/0xb0
[ 288.519420] ? pick_next_task_fair+0x4d5/0x5f0
[ 288.519427] ? __switch_to+0xa2/0x430
[ 288.519431] ? put_prev_entity+0x1e/0xe0
[ 288.519438] process_one_work+0x182/0x370
[ 288.519441] worker_thread+0x2e/0x380
[ 288.519444] ? process_one_work+0x370/0x370
[ 288.519449] kthread+0x111/0x130
[ 288.519453] ? kthread_create_worker_on_cpu+0x70/0x70
[ 288.519460] ret_from_fork+0x1f/0x30
[ 288.519464] Code: 89 ea 44 89 f6 41 ff d1 4d 8b 0f 4d 85 c9 75 e2 e9 2a ff ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f ff e9 14 ff ff ff <0f> ff e9 0d ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
[ 288.519517] ---[ end trace 8fcc70570a780634 ]---
[ 310.959112] nvme nvme0: pci function 0000:5e:00.0
[ 310.959333] nvme nvme1: pci function 0000:60:00.0
[ 310.959675] nvme nvme2: pci function 0000:86:00.0
[ 311.167369] WARNING: CPU: 1 PID: 898 at kernel/irq/matrix.c:215 irq_matrix_remove_managed+0x108/0x120
[ 311.167371] Modules linked in: nvme ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_uncore intel_rapl_perf iTCO_wdt joydev iTCO_vendor_support ipmi_ssif ipmi_si mei_me mei shpchp i2c_i801 wmi lpc_ich ipmi_devintf ioatdma ipmi_msghandler dca acpi_pad acpi_power_meter xfs libcrc32c ast i2c_algo_bit drm_kms_helper ttm drm i40e crc32c_intel
[ 311.167440] ptp nvme_core pps_core [last unloaded: nvme]
[ 311.167448] CPU: 1 PID: 898 Comm: kworker/u673:9 Tainted: G W 4.15.0-rc7+ #4
[ 311.167450] Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.00.01.2001.062220170731 06/22/2017
[ 311.167459] Workqueue: nvme-wq nvme_reset_work [nvme]
[ 311.167464] RIP: 0010:irq_matrix_remove_managed+0x108/0x120
[ 311.167466] RSP: 0018:ffffb01f0f387a88 EFLAGS: 00010046
[ 311.167469] RAX: 0000000000000000 RBX: ffff9e387df24900 RCX: 0000000000000038
[ 311.167471] RDX: 0000000000000038 RSI: 0000000000000000 RDI: ffff9e3871bcd3c0
[ 311.167472] RBP: 0000000000000038 R08: fffffffffffffffc R09: 0300000000000003
[ 311.167474] R10: ffff9e387d402480 R11: ffff9e3838149d1c R12: ffff9e387d410c00
[ 311.167476] R13: ffff9e387d410c30 R14: 0000000000000021 R15: 00000000000000ee
[ 311.167478] FS: 0000000000000000(0000) GS:ffff9e387d840000(0000) knlGS:0000000000000000
[ 311.167481] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 311.167483] CR2: 00007fbe60000b20 CR3: 00000010f5209001 CR4: 00000000007606e0
[ 311.167486] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 311.167489] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 311.167490] PKRU: 55555554
[ 311.167491] Call Trace:
[ 311.167505] x86_vector_free_irqs+0xa1/0x180
[ 311.167515] ? cpumask_next+0x17/0x20
[ 311.167520] x86_vector_alloc_irqs+0x1e4/0x3a0
[ 311.167525] msi_domain_alloc+0x62/0x130
[ 311.167532] ? kmem_cache_alloc_node_trace+0x1ac/0x1d0
[ 311.167536] __irq_domain_alloc_irqs+0x121/0x300
[ 311.167540] msi_domain_alloc_irqs+0x99/0x2e0
[ 311.167547] native_setup_msi_irqs+0x54/0x90
[ 311.167556] __pci_enable_msix+0xfb/0x4e0
[ 311.167562] pci_alloc_irq_vectors_affinity+0x8e/0x130
[ 311.167568] nvme_reset_work+0x919/0x153b [nvme]
[ 311.167576] ? update_curr+0xe4/0x1d0
[ 311.167580] ? account_entity_dequeue+0xa4/0xd0
[ 311.167584] ? dequeue_entity+0xd5/0x430
[ 311.167589] ? pick_next_task_fair+0x14f/0x5f0
[ 311.167597] ? __switch_to+0xa2/0x430
[ 311.167603] ? sched_clock+0x5/0x10
[ 311.167607] ? put_prev_entity+0x1e/0xe0
[ 311.167612] process_one_work+0x182/0x370
[ 311.167616] worker_thread+0x2e/0x380
[ 311.167619] ? process_one_work+0x370/0x370
[ 311.167623] kthread+0x111/0x130
[ 311.167628] ? kthread_create_worker_on_cpu+0x70/0x70
[ 311.167633] ret_from_fork+0x1f/0x30
[ 311.167637] Code: 18 49 89 d8 4c 89 e1 89 ea 44 89 f6 41 ff d1 4d 8b 0f 4d 85 c9 75 e2 e9 2a ff ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> ff e9 14 ff ff ff 0f ff e9 0d ff ff ff 66 2e 0f 1f 84 00 00
[ 311.167691] ---[ end trace 8fcc70570a780635 ]---
[ 311.172737] WARNING: CPU: 0 PID: 894 at arch/x86/kernel/apic/vector.c:404 x86_vector_activate+0x160/0x280
[ 311.172739] Modules linked in: nvme ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_uncore intel_rapl_perf iTCO_wdt joydev iTCO_vendor_support ipmi_ssif ipmi_si mei_me mei shpchp i2c_i801 wmi lpc_ich ipmi_devintf ioatdma ipmi_msghandler dca acpi_pad acpi_power_meter xfs libcrc32c ast i2c_algo_bit drm_kms_helper ttm drm i40e crc32c_intel
[ 311.172801] ptp nvme_core pps_core [last unloaded: nvme]
[ 311.172808] CPU: 0 PID: 894 Comm: kworker/u673:5 Tainted: G W 4.15.0-rc7+ #4
[ 311.172809] Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.00.01.2001.062220170731 06/22/2017
[ 311.172817] Workqueue: nvme-wq nvme_reset_work [nvme]
[ 311.172822] RIP: 0010:x86_vector_activate+0x160/0x280
[ 311.172824] RSP: 0018:ffffb01f0f367b88 EFLAGS: 00010086
[ 311.172827] RAX: 00000000ffffffe4 RBX: ffff9e386a90e380 RCX: 0000000000000000
[ 311.172829] RDX: 0000000000000100 RSI: 00000000000000ee RDI: ffff9e387d410c50
[ 311.172830] RBP: 0000000000000150 R08: 0000000000000100 R09: 0000000000000000
[ 311.172832] R10: 0000000000000018 R11: 0000000000000003 R12: 00000000ffffffe4
[ 311.172833] R13: 0000000000000000 R14: 0000000000000002 R15: ffff9e386a0b70a4
[ 311.172835] FS: 0000000000000000(0000) GS:ffff9e387d800000(0000) knlGS:0000000000000000
[ 311.172837] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 311.172839] CR2: 00007f268d3d5000 CR3: 00000010f5209001 CR4: 00000000007606f0
[ 311.172841] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 311.172842] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 311.172843] PKRU: 55555554
[ 311.172844] Call Trace:
[ 311.172855] ? kernfs_activate+0x63/0x80
[ 311.172864] __irq_domain_activate_irq+0x43/0x90
[ 311.172869] ? __kernfs_create_file+0x7d/0xb0
[ 311.172874] __irq_domain_activate_irq+0x7a/0x90
[ 311.172877] irq_domain_activate_irq+0x25/0x40
[ 311.172882] irq_startup+0xb2/0xe0
[ 311.172887] __setup_irq+0x66d/0x6e0
[ 311.172891] ? request_threaded_irq+0xad/0x160
[ 311.172894] request_threaded_irq+0xf5/0x160
[ 311.172900] ? nvme_del_cq_end+0x70/0x70 [nvme]
[ 311.172906] pci_request_irq+0x8f/0xd0
[ 311.172912] queue_request_irq+0x4d/0x70 [nvme]
[ 311.172917] nvme_reset_work+0x934/0x153b [nvme]
[ 311.172924] ? update_curr+0xe4/0x1d0
[ 311.172928] ? account_entity_dequeue+0xa4/0xd0
[ 311.172932] ? dequeue_entity+0xd5/0x430
[ 311.172937] ? pick_next_task_fair+0x14f/0x5f0
[ 311.172943] ? __switch_to+0xa2/0x430
[ 311.172947] ? sched_clock+0x5/0x10
[ 311.172951] ? put_prev_entity+0x1e/0xe0
[ 311.172956] process_one_work+0x182/0x370
[ 311.172959] worker_thread+0x2e/0x380
[ 311.172962] ? process_one_work+0x370/0x370
[ 311.172966] kthread+0x111/0x130
[ 311.172971] ? kthread_create_worker_on_cpu+0x70/0x70
[ 311.172975] ret_from_fork+0x1f/0x30
[ 311.172978] Code: 4b c2 3d 01 48 8b 3d a0 8a 9c 01 48 89 ee e8 88 33 3d 00 48 39 c5 0f 84 05 01 00 00 48 89 df e8 07 fd ff ff 85 c0 41 89 c4 79 a1 <0f> ff 8b 73 04 48 c7 c7 38 27 08 92 e8 f9 15 0b 00 eb 8e 45 31
[ 311.173032] ---[ end trace 8fcc70570a780636 ]---
[ 311.173034] Managed startup irq 41, no vector available
[ 311.173090] WARNING: CPU: 0 PID: 894 at kernel/irq/chip.c:226 irq_startup+0xd6/0xe0
[ 311.173090] Modules linked in: nvme ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_uncore intel_rapl_perf iTCO_wdt joydev iTCO_vendor_support ipmi_ssif ipmi_si mei_me mei shpchp i2c_i801 wmi lpc_ich ipmi_devintf ioatdma ipmi_msghandler dca acpi_pad acpi_power_meter xfs libcrc32c ast i2c_algo_bit drm_kms_helper ttm drm i40e crc32c_intel
[ 311.173140] ptp nvme_core pps_core [last unloaded: nvme]
[ 311.173145] CPU: 0 PID: 894 Comm: kworker/u673:5 Tainted: G W 4.15.0-rc7+ #4
[ 311.173147] Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.00.01.2001.062220170731 06/22/2017
[ 311.173151] Workqueue: nvme-wq nvme_reset_work [nvme]
[ 311.173156] RIP: 0010:irq_startup+0xd6/0xe0
[ 311.173158] RSP: 0018:ffffb01f0f367c28 EFLAGS: 00010086
[ 311.173160] RAX: 00000000ffffffe4 RBX: ffff9e386a0b7000 RCX: 0000000000000006
[ 311.173162] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
[ 311.173163] RBP: ffff9e386a90e580 R08: 0000000000000720 R09: 0000000000000000
[ 311.173164] R10: 0000000000000018 R11: ffff9e386c83c790 R12: 0000000000000001
[ 311.173166] R13: ffff9e386a0b7028 R14: ffff9e386a0b7160 R15: ffff9e386a0b70a4
[ 311.173168] FS: 0000000000000000(0000) GS:ffff9e387d800000(0000) knlGS:0000000000000000
[ 311.173169] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 311.173171] CR2: 00007f268d3d5000 CR3: 00000010f5209001 CR4: 00000000007606f0
[ 311.173172] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 311.173174] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 311.173175] PKRU: 55555554
[ 311.173176] Call Trace:
[ 311.173180] __setup_irq+0x66d/0x6e0
[ 311.173183] ? request_threaded_irq+0xad/0x160
[ 311.173187] request_threaded_irq+0xf5/0x160
[ 311.173191] ? nvme_del_cq_end+0x70/0x70 [nvme]
[ 311.173194] pci_request_irq+0x8f/0xd0
[ 311.173199] queue_request_irq+0x4d/0x70 [nvme]
[ 311.173203] nvme_reset_work+0x934/0x153b [nvme]
[ 311.173207] ? update_curr+0xe4/0x1d0
[ 311.173211] ? account_entity_dequeue+0xa4/0xd0
[ 311.173215] ? dequeue_entity+0xd5/0x430
[ 311.173220] ? pick_next_task_fair+0x14f/0x5f0
[ 311.173223] ? __switch_to+0xa2/0x430
[ 311.173227] ? sched_clock+0x5/0x10
[ 311.173231] ? put_prev_entity+0x1e/0xe0
[ 311.173234] process_one_work+0x182/0x370
[ 311.173237] worker_thread+0x2e/0x380
[ 311.173240] ? process_one_work+0x370/0x370
[ 311.173243] kthread+0x111/0x130
[ 311.173248] ? kthread_create_worker_on_cpu+0x70/0x70
[ 311.173250] ret_from_fork+0x1f/0x30
[ 311.173253] Code: 89 ef e8 0e 2f 00 00 85 c0 75 20 48 89 ee 31 d2 4c 89 ef e8 ad da ff ff 48 89 df e8 d5 fe ff ff 89 c5 e9 65 ff ff ff 0f ff eb b9 <0f> ff eb b5 66 0f 1f 44 00 00 0f 1f 44 00 00 55 53 48 89 fd 48
[ 311.173306] ---[ end trace 8fcc70570a780637 ]---
[ 371.677929] nvme nvme0: I/O 21 QID 0 timeout, completion polled
[ 431.709906] nvme nvme0: I/O 21 QID 0 timeout, completion polled
[ 491.997931] nvme nvme0: I/O 5 QID 0 timeout, completion polled
[ 552.733905] nvme nvme0: I/O 25 QID 0 timeout, completion polled
[ 552.733990] Managed startup irq 43, no vector available


2018-01-15 09:13:47

by Thomas Gleixner

Subject: Re: [BUG 4.15-rc7] IRQ matrix management errors

On Sun, 14 Jan 2018, Keith Busch wrote:

> I hoped to have a better report before the weekend, but I've run out of
> time and will be without my machine until next week, so I'm sending what
> I have and praying someone more in the know will have a better clue.
>
> I've a few NVMe drives, and occasionally the IRQ teardown and bring-up
> fails. Resetting the controllers to re-run interrupt allocation results
> in device interrupts not occurring at all.
>
> It appears I need at least two NVMe drives to hit the problem. The NVMe
> driver initializes controllers in parallel, and multiple threads calling
> pci_alloc_irq_vectors and/or pci_free_irq_vectors at about the same time
> seem to be required to trigger the issue.
>
> Here are the relevant warnings. I also have trace events from irq_matrix
> and irq_vectors, but from a different run, so they won't match up to the
> timestamps below; I can send them if that's helpful.

The dmesg is not that interesting. The traces definitely are if you can
identify the point where it goes into lala land.

Thanks,

tglx

2018-01-16 06:13:31

by Keith Busch

Subject: Re: [BUG 4.15-rc7] IRQ matrix management errors

On Mon, Jan 15, 2018 at 01:13:29AM -0800, Thomas Gleixner wrote:
>
> The dmesg is not that interesting. The traces definitely are if you can
> identify the point where it goes into lala land.

The attachment was captured with irq_matrix and irq_vector trace events
enabled, but I've stripped out the irq_work_*, local_timer_*, reschedule_*,
and call_function_* messages, as these are traced continuously and are
unrelated to the allocations. I stopped at the first warning rather than
let it continue.

Here's the warning that was captured along with the trace events in the attachment:

[ 334.567321] WARNING: CPU: 28 PID: 1421 at kernel/irq/matrix.c:222 irq_matrix_remove_managed+0x10f/0x120
[ 334.567323] Modules linked in: nvme ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_uncore intel_rapl_perf iTCO_wdt joydev iTCO_vendor_support ipmi_ssif ipmi_si mei_me ipmi_devintf mei shpchp i2c_i801 ipmi_msghandler lpc_ich ioatdma wmi dca acpi_power_meter acpi_pad xfs libcrc32c ast i2c_algo_bit drm_kms_helper ttm drm i40e crc32c_intel
[ 334.567391] ptp nvme_core pps_core [last unloaded: nvme]
[ 334.567398] CPU: 28 PID: 1421 Comm: kworker/u674:3 Not tainted 4.15.0-rc8+ #6
[ 334.567401] Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.00.01.2001.062220170731 06/22/2017
[ 334.567407] Workqueue: nvme-wq nvme_reset_work [nvme]
[ 334.567412] RIP: 0010:irq_matrix_remove_managed+0x10f/0x120
[ 334.567414] RSP: 0018:ffffbada0af13a88 EFLAGS: 00010046
[ 334.567417] RAX: 00000000000000ee RBX: ffff9c45bd824900 RCX: 0000000000000000
[ 334.567419] RDX: 0000000000000100 RSI: 00000000000000ee RDI: ffff9c45bd410c50
[ 334.567420] RBP: 0000000000000000 R08: 0000000000000100 R09: 0000000000000000
[ 334.567422] R10: 0000000000000018 R11: 0000000000000003 R12: ffff9c45bd410c00
[ 334.567423] R13: ffff9c45bd410c30 R14: 00000000000000ee R15: 00000000000000ee
[ 334.567426] FS: 0000000000000000(0000) GS:ffff9c55bcc00000(0000) knlGS:0000000000000000
[ 334.567427] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 334.567429] CR2: 0000557bd66ab730 CR3: 00000019f9209002 CR4: 00000000007606e0
[ 334.567431] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 334.567432] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 334.567433] PKRU: 55555554
[ 334.567434] Call Trace:
[ 334.567445] x86_vector_free_irqs+0xa1/0x180
[ 334.567451] x86_vector_alloc_irqs+0x1e4/0x3a0
[ 334.567455] msi_domain_alloc+0x62/0x130
[ 334.567463] ? kmem_cache_alloc_node_trace+0x1ac/0x1d0
[ 334.567467] __irq_domain_alloc_irqs+0x121/0x300
[ 334.567471] msi_domain_alloc_irqs+0x99/0x2e0
[ 334.567477] native_setup_msi_irqs+0x54/0x90
[ 334.567484] __pci_enable_msix+0xfb/0x4e0
[ 334.567489] pci_alloc_irq_vectors_affinity+0x8e/0x130
[ 334.567495] nvme_reset_work+0x919/0x153b [nvme]
[ 334.567503] ? update_curr+0xe4/0x1d0
[ 334.567508] ? account_entity_dequeue+0xa4/0xd0
[ 334.567512] ? dequeue_entity+0xd5/0x430
[ 334.567515] ? pick_next_task_fair+0x14f/0x5f0
[ 334.567525] ? __switch_to+0xa2/0x430
[ 334.567532] ? sched_clock+0x5/0x10
[ 334.567536] ? put_prev_entity+0x1e/0xe0
[ 334.567542] process_one_work+0x182/0x370
[ 334.567546] worker_thread+0x2e/0x380
[ 334.567549] ? process_one_work+0x370/0x370
[ 334.567554] kthread+0x111/0x130
[ 334.567560] ? kthread_create_worker_on_cpu+0x70/0x70
[ 334.567566] ? do_syscall_64+0x61/0x170
[ 334.567573] ? SyS_exit_group+0x10/0x10
[ 334.567580] ret_from_fork+0x1f/0x30
[ 334.567584] Code: 89 ea 44 89 f6 41 ff d1 4d 8b 0f 4d 85 c9 75 e2 e9 2a ff ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f ff e9 14 ff ff ff <0f> ff e9 0d ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
[ 334.567637] ---[ end trace b97ec0c6a11aa61f ]---


Attachments:
(No filename) (4.13 kB)
irq-trace-events (196.59 kB)

2018-01-16 07:08:34

by Keith Busch

Subject: Re: [BUG 4.15-rc7] IRQ matrix management errors

This is all way over my head, but here's the part that obviously shows
something's gone wrong:

kworker/u674:3-1421 [028] d... 335.307051: irq_matrix_reserve_managed: bit=56 cpu=0 online=1 avl=86 alloc=116 managed=3 online_maps=112 global_avl=22084, global_rsvd=157, total_alloc=570
kworker/u674:3-1421 [028] d... 335.307053: irq_matrix_remove_managed: bit=56 cpu=0 online=1 avl=87 alloc=116 managed=2 online_maps=112 global_avl=22085, global_rsvd=157, total_alloc=570
kworker/u674:3-1421 [028] .... 335.307054: vector_reserve_managed: irq=45 ret=-28
kworker/u674:3-1421 [028] .... 335.307054: vector_setup: irq=45 is_legacy=0 ret=-28
kworker/u674:3-1421 [028] d... 335.307055: vector_teardown: irq=45 is_managed=1 has_reserved=0

Which leads me to the error path in x86_vector_alloc_irqs:

error:
	x86_vector_free_irqs(domain, virq, i + 1);

The last parameter looks weird. It's the nr_irqs, and since we failed and
bailed, I would think we'd need to subtract 1 rather than add 1. Adding
1 would doubly remove the failed one, and remove the next one that was
never set up, right?

Or maybe irq_matrix_reserve_managed wasn't expected to fail in the
first place?
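
To check my reading of that unwind, here's a toy userspace model (purely
illustrative, not the kernel code): when we fail at index i, entries
0..i-1 were set up, so freeing i + 1 entries touches slot i, which was
never set up.

	#include <stdbool.h>
	#include <stdio.h>

	#define NR_IRQS 4

	static bool allocated[NR_IRQS];

	/* Mirrors the x86_vector_free_irqs(domain, virq, nr) loop: frees
	 * entries 0 .. nr - 1. */
	static void free_irqs(unsigned int nr)
	{
		for (unsigned int i = 0; i < nr; i++) {
			if (!allocated[i])
				printf("WARNING: freeing irq %u that was never set up\n", i);
			allocated[i] = false;
		}
	}

	int main(void)
	{
		unsigned int i;

		for (i = 0; i < NR_IRQS; i++) {
			if (i == 2)	/* pretend assign_irq_vector_policy() failed here */
				break;
			allocated[i] = true;
		}
		free_irqs(i + 1);	/* the buggy unwind: warns about slot 2 */
		/* free_irqs(i) would be the balanced unwind */
		return 0;
	}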

2018-01-16 10:33:13

by Thomas Gleixner

Subject: Re: [BUG 4.15-rc7] IRQ matrix management errors

On Tue, 16 Jan 2018, Keith Busch wrote:

> This is all way over my head, but the part that obviously shows
> something's gone wrong:
>
> kworker/u674:3-1421 [028] d... 335.307051: irq_matrix_reserve_managed: bit=56 cpu=0 online=1 avl=86 alloc=116 managed=3 online_maps=112 global_avl=22084, global_rsvd=157, total_alloc=570
> kworker/u674:3-1421 [028] d... 335.307053: irq_matrix_remove_managed: bit=56 cpu=0 online=1 avl=87 alloc=116 managed=2 online_maps=112 global_avl=22085, global_rsvd=157, total_alloc=570
> kworker/u674:3-1421 [028] .... 335.307054: vector_reserve_managed: irq=45 ret=-28
> kworker/u674:3-1421 [028] .... 335.307054: vector_setup: irq=45 is_legacy=0 ret=-28
> kworker/u674:3-1421 [028] d... 335.307055: vector_teardown: irq=45 is_managed=1 has_reserved=0
>
> Which leads me to the error path in x86_vector_alloc_irqs:
>
> error:
> 	x86_vector_free_irqs(domain, virq, i + 1);
>
> The last parameter looks weird. It's the nr_irqs, and since we failed and
> bailed, I would think we'd need to subtract 1 rather than add 1. Adding
> 1 would doubly remove the failed one, and remove the next one that was
> never set up, right?

Right. That's fishy. Let me stare at it.

> Or maybe irq_matrix_reserve_managed wasn't expected to fail in the
> first place?

Well, it can fail. I don't know why it fails in this case, but let me look
a bit more.

Thanks,

tglx


2018-01-16 11:20:22

by Thomas Gleixner

Subject: Re: [BUG 4.15-rc7] IRQ matrix management errors

On Tue, 16 Jan 2018, Thomas Gleixner wrote:
> On Tue, 16 Jan 2018, Keith Busch wrote:
>
> > This is all way over my head, but the part that obviously shows
> > something's gone wrong:
> >
> > kworker/u674:3-1421 [028] d... 335.307051: irq_matrix_reserve_managed: bit=56 cpu=0 online=1 avl=86 alloc=116 managed=3 online_maps=112 global_avl=22084, global_rsvd=157, total_alloc=570
> > kworker/u674:3-1421 [028] d... 335.307053: irq_matrix_remove_managed: bit=56 cpu=0 online=1 avl=87 alloc=116 managed=2 online_maps=112 global_avl=22085, global_rsvd=157, total_alloc=570
> > kworker/u674:3-1421 [028] .... 335.307054: vector_reserve_managed: irq=45 ret=-28
> > kworker/u674:3-1421 [028] .... 335.307054: vector_setup: irq=45 is_legacy=0 ret=-28
> > kworker/u674:3-1421 [028] d... 335.307055: vector_teardown: irq=45 is_managed=1 has_reserved=0
> >
> > Which leads me to the error path in x86_vector_alloc_irqs:
> >
> > error:
> > 	x86_vector_free_irqs(domain, virq, i + 1);
> >
> > The last parameter looks weird. It's the nr_irqs, and since we failed and
> > bailed, I would think we'd need to subtract 1 rather than add 1. Adding
> > 1 would doubly remove the failed one, and remove the next one that was
> > never set up, right?
>
> Right. That's fishy. Let me stare at it.

What we want is s/i + 1/i/

That's correct because x86_vector_free_irqs() does:

	for (i = 0; i < nr; i++)
		....

So if we fail at the first irq, then the loop will do nothing. Failing on
the second will free the first ....

Fix below.

Thanks,

tglx

8<----------------------
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index f8b03bb8e725..3cc471beb50b 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -542,14 +542,17 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
 
 		err = assign_irq_vector_policy(irqd, info);
 		trace_vector_setup(virq + i, false, err);
-		if (err)
+		if (err) {
+			irqd->chip_data = NULL;
+			free_apic_chip_data(apicd);
 			goto error;
+		}
 	}
 
 	return 0;
 
 error:
-	x86_vector_free_irqs(domain, virq, i + 1);
+	x86_vector_free_irqs(domain, virq, i);
 	return err;
 }


2018-01-16 14:23:17

by Keith Busch

Subject: Re: [BUG 4.15-rc7] IRQ matrix management errors

On Tue, Jan 16, 2018 at 12:20:18PM +0100, Thomas Gleixner wrote:
> What we want is s/i + 1/i/
>
> That's correct because x86_vector_free_irqs() does:
>
> 	for (i = 0; i < nr; i++)
> 		....
>
> So if we fail at the first irq, then the loop will do nothing. Failing on
> the second will free the first ....
>
> Fix below.
>
> Thanks,
>
> tglx

Thanks! This looks much better. I'll try to verify by tomorrow, though
the hardware I was using to reproduce this is not available to me at the
moment. I may be able to synthesize the conditions on something else.


> 8<----------------------
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index f8b03bb8e725..3cc471beb50b 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -542,14 +542,17 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
>  
>  		err = assign_irq_vector_policy(irqd, info);
>  		trace_vector_setup(virq + i, false, err);
> -		if (err)
> +		if (err) {
> +			irqd->chip_data = NULL;
> +			free_apic_chip_data(apicd);
>  			goto error;
> +		}
>  	}
>  
>  	return 0;
>  
>  error:
> -	x86_vector_free_irqs(domain, virq, i + 1);
> +	x86_vector_free_irqs(domain, virq, i);
>  	return err;
>  }
>

2018-01-17 02:21:58

by Keith Busch

Subject: Re: [BUG 4.15-rc7] IRQ matrix management errors

On Tue, Jan 16, 2018 at 12:20:18PM +0100, Thomas Gleixner wrote:
> 8<----------------------
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index f8b03bb8e725..3cc471beb50b 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -542,14 +542,17 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
>  
>  		err = assign_irq_vector_policy(irqd, info);
>  		trace_vector_setup(virq + i, false, err);
> -		if (err)
> +		if (err) {
> +			irqd->chip_data = NULL;
> +			free_apic_chip_data(apicd);
>  			goto error;
> +		}
>  	}
>  
>  	return 0;
>  
>  error:
> -	x86_vector_free_irqs(domain, virq, i + 1);
> +	x86_vector_free_irqs(domain, virq, i);
>  	return err;
>  }
>

The patch does indeed fix all the warnings and allows device binding to
succeed, albeit in a degraded performance mode. Despite that, this is
a good fix, and looks applicable to 4.4-stable, so:

Tested-by: Keith Busch <[email protected]>

I'm still concerned that assign_irq_vector_policy is failing. That failure
makes interrupt allocation abandon MSI-X and fall back to legacy IRQs.

Your patch does address my main concern, though. Are you comfortable
enough to queue this up for 4.15?

Thanks,
Keith

2018-01-17 07:34:29

by Thomas Gleixner

Subject: Re: [BUG 4.15-rc7] IRQ matrix management errors

On Tue, 16 Jan 2018, Keith Busch wrote:
> On Tue, Jan 16, 2018 at 12:20:18PM +0100, Thomas Gleixner wrote:
> > 8<----------------------
> > diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> > index f8b03bb8e725..3cc471beb50b 100644
> > --- a/arch/x86/kernel/apic/vector.c
> > +++ b/arch/x86/kernel/apic/vector.c
> > @@ -542,14 +542,17 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
> >  
> >  		err = assign_irq_vector_policy(irqd, info);
> >  		trace_vector_setup(virq + i, false, err);
> > -		if (err)
> > +		if (err) {
> > +			irqd->chip_data = NULL;
> > +			free_apic_chip_data(apicd);
> >  			goto error;
> > +		}
> >  	}
> >  
> >  	return 0;
> >  
> >  error:
> > -	x86_vector_free_irqs(domain, virq, i + 1);
> > +	x86_vector_free_irqs(domain, virq, i);
> >  	return err;
> >  }
> >
>
> The patch does indeed fix all the warnings and allows device binding to
> succeed, albeit in a degraded performance mode. Despite that, this is
> a good fix, and looks applicable to 4.4-stable, so:
>
> Tested-by: Keith Busch <[email protected]>
>
> I'm still concerned that assign_irq_vector_policy is failing. That failure
> makes interrupt allocation abandon MSI-X and fall back to legacy IRQs.

Can you trace the matrix allocations from the very beginning, or tell me
how to reproduce this? I'd like to figure out why it's happening.

> Your patch does address my main concern, though. Are you comfortable
> enough to queue this up for 4.15?

Yes, it's a pretty obvious bug and the fix is straightforward.

Thanks,

tglx

2018-01-17 07:51:46

by Keith Busch

Subject: Re: [BUG 4.15-rc7] IRQ matrix management errors

On Wed, Jan 17, 2018 at 08:34:22AM +0100, Thomas Gleixner wrote:
> Can you trace the matrix allocations from the very beginning, or tell me
> how to reproduce this? I'd like to figure out why it's happening.

Sure, I'll get the irq_matrix events.

I reproduce this on a machine with 112 CPUs and 3 NVMe controllers. The
first two NVMe controllers want 112 MSI-X vectors each, and the last only
31. The test runs 'modprobe nvme' and 'modprobe -r nvme' in a loop with a
10 second delay between each step. It reproduces within a few iterations,
and is sometimes already broken after the initial boot.

2018-01-17 09:24:12

by Thomas Gleixner

Subject: Re: [BUG 4.15-rc7] IRQ matrix management errors

On Wed, 17 Jan 2018, Keith Busch wrote:
> On Wed, Jan 17, 2018 at 08:34:22AM +0100, Thomas Gleixner wrote:
> > Can you trace the matrix allocations from the very beginning, or tell me
> > how to reproduce this? I'd like to figure out why it's happening.
>
> Sure, I'll get the irq_matrix events.
>
> I reproduce this on a machine with 112 CPUs and 3 NVMe controllers. The
> first two NVMe controllers want 112 MSI-X vectors each, and the last only
> 31. The test runs 'modprobe nvme' and 'modprobe -r nvme' in a loop with a
> 10 second delay between each step. It reproduces within a few iterations,
> and is sometimes already broken after the initial boot.

That doesn't sound right. The vectors should be spread evenly across the
CPUs. So ENOSPC should never happen.

Can you please take snapshots of /sys/kernel/debug/irq/ between the
modprobe and modprobe -r steps?

Thanks,

tglx





2018-01-17 09:32:16

by Thomas Gleixner

Subject: Re: [BUG 4.15-rc7] IRQ matrix management errors

On Wed, 17 Jan 2018, Thomas Gleixner wrote:

> On Wed, 17 Jan 2018, Keith Busch wrote:
> > On Wed, Jan 17, 2018 at 08:34:22AM +0100, Thomas Gleixner wrote:
> > > Can you trace the matrix allocations from the very beginning, or tell me
> > > how to reproduce this? I'd like to figure out why it's happening.
> >
> > Sure, I'll get the irq_matrix events.
> >
> > I reproduce this on a machine with 112 CPUs and 3 NVMe controllers. The
> > first two NVMe controllers want 112 MSI-X vectors each, and the last only
> > 31. The test runs 'modprobe nvme' and 'modprobe -r nvme' in a loop with a
> > 10 second delay between each step. It reproduces within a few iterations,
> > and is sometimes already broken after the initial boot.
>
> That doesn't sound right. The vectors should be spread evenly across the
> CPUs. So ENOSPC should never happen.
>
> Can you please take snapshots of /sys/kernel/debug/irq/ between the
> modprobe and modprobe -r steps?

The allocation fails because CPU1 has exhausted its vector space here:

[002] d... 333.028216: irq_matrix_alloc_managed: bit=34 cpu=1 online=1 avl=0 alloc=202 managed=2 online_maps=112 global_avl=22085, global_rsvd=158, total_alloc=460

Now the interesting question is how that happens.

Thanks,

tglx

Subject: [tip:x86/urgent] x86/apic/vector: Fix off by one in error path

Commit-ID: 45d55e7bac4028af93f5fa324e69958a0b868e96
Gitweb: https://git.kernel.org/tip/45d55e7bac4028af93f5fa324e69958a0b868e96
Author: Thomas Gleixner <[email protected]>
AuthorDate: Tue, 16 Jan 2018 12:20:18 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Wed, 17 Jan 2018 12:11:36 +0100

x86/apic/vector: Fix off by one in error path

Keith reported the following warning:

WARNING: CPU: 28 PID: 1420 at kernel/irq/matrix.c:222 irq_matrix_remove_managed+0x10f/0x120
x86_vector_free_irqs+0xa1/0x180
x86_vector_alloc_irqs+0x1e4/0x3a0
msi_domain_alloc+0x62/0x130

The reason for this is that if the vector allocation fails the error
handling code tries to free the failed vector as well, which causes the
above imbalance warning to trigger.

Adjust the error path to handle this correctly.

Fixes: b5dc8e6c21e7 ("x86/irq: Use hierarchical irqdomain to manage CPU interrupt vectors")
Reported-by: Keith Busch <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Keith Busch <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801161217300.1823@nanos

---
arch/x86/kernel/apic/vector.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index f8b03bb..3cc471b 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -542,14 +542,17 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
 
 		err = assign_irq_vector_policy(irqd, info);
 		trace_vector_setup(virq + i, false, err);
-		if (err)
+		if (err) {
+			irqd->chip_data = NULL;
+			free_apic_chip_data(apicd);
 			goto error;
+		}
 	}
 
 	return 0;
 
 error:
-	x86_vector_free_irqs(domain, virq, i + 1);
+	x86_vector_free_irqs(domain, virq, i);
 	return err;
 }


2018-01-17 14:21:27

by Keith Busch

Subject: Re: [BUG 4.15-rc7] IRQ matrix management errors

On Wed, Jan 17, 2018 at 10:32:12AM +0100, Thomas Gleixner wrote:
> On Wed, 17 Jan 2018, Thomas Gleixner wrote:
> > That doesn't sound right. The vectors should be spread evenly across the
> > CPUs. So ENOSPC should never happen.
> >
> > Can you please take snapshots of /sys/kernel/debug/irq/ between the
> > modprobe and modprobe -r steps?
>
> The allocation fails because CPU1 has exhausted its vector space here:
>
> [002] d... 333.028216: irq_matrix_alloc_managed: bit=34 cpu=1 online=1 avl=0 alloc=202 managed=2 online_maps=112 global_avl=22085, global_rsvd=158, total_alloc=460
>
> Now the interesting question is how that happens.

The trace captured with the "trace_events=irq_matrix" kernel parameter is
attached; it ends shortly after an allocation failure.

I'll have to get back to you on the debug/irq snapshots.


Attachments:
(No filename) (819.00 B)
irq-matrix-events (480.07 kB)

2018-01-17 15:01:52

by Thomas Gleixner

Subject: Re: [BUG 4.15-rc7] IRQ matrix management errors

On Wed, 17 Jan 2018, Keith Busch wrote:

> On Wed, Jan 17, 2018 at 10:32:12AM +0100, Thomas Gleixner wrote:
> > On Wed, 17 Jan 2018, Thomas Gleixner wrote:
> > > That doesn't sound right. The vectors should be spread evenly across the
> > > CPUs. So ENOSPC should never happen.
> > >
> > > Can you please take snapshots of /sys/kernel/debug/irq/ between the
> > > modprobe and modprobe -r steps?
> >
> > The allocation fails because CPU1 has exhausted its vector space here:
> >
> > [002] d... 333.028216: irq_matrix_alloc_managed: bit=34 cpu=1 online=1 avl=0 alloc=202 managed=2 online_maps=112 global_avl=22085, global_rsvd=158, total_alloc=460
> >
> > Now the interesting question is how that happens.
>
> The trace captured with the "trace_events=irq_matrix" kernel parameter is
> attached; it ends shortly after an allocation failure.

Which device is allocating gazillions of non-managed interrupts?

NetworkManager-2208 [044] d... 8.648608: irq_matrix_alloc: bit=68 cpu=0 online=1 avl=168 alloc=35 managed=3 online_maps=112 global_avl=22359, global_rsvd=532, total_alloc=215

....

NetworkManager-2208 [044] d... 8.665114: irq_matrix_alloc: bit=237 cpu=0 online=1 avl=0 alloc=203 managed=3 online_maps=112 global_avl=22191, global_rsvd=364, total_alloc=383

That's 168 interrupts total. Enterprise grade insanity.

The patch below should cure that by spreading them out on allocation.

Thanks,

tglx

8<------------------

diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
index 0ba0dd8863a7..5831cc7db27d 100644
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -321,29 +321,38 @@ void irq_matrix_remove_reserved(struct irq_matrix *m)
 int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk,
 		     bool reserved, unsigned int *mapped_cpu)
 {
-	unsigned int cpu;
+	unsigned int cpu, best_cpu, maxavl = 0;
+	struct cpumap *cm;
+	unsigned int bit;
 
+	best_cpu = UINT_MAX;
 	for_each_cpu(cpu, msk) {
-		struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
-		unsigned int bit;
+		cm = per_cpu_ptr(m->maps, cpu);
 
-		if (!cm->online)
+		if (!cm->online || cm->available <= maxavl)
 			continue;
 
-		bit = matrix_alloc_area(m, cm, 1, false);
-		if (bit < m->alloc_end) {
-			cm->allocated++;
-			cm->available--;
-			m->total_allocated++;
-			m->global_available--;
-			if (reserved)
-				m->global_reserved--;
-			*mapped_cpu = cpu;
-			trace_irq_matrix_alloc(bit, cpu, m, cm);
-			return bit;
-		}
+		best_cpu = cpu;
+		maxavl = cm->available;
 	}
-	return -ENOSPC;
+
+	if (!maxavl)
+		return -ENOSPC;
+
+	cm = per_cpu_ptr(m->maps, best_cpu);
+	bit = matrix_alloc_area(m, cm, 1, false);
+	if (bit >= m->alloc_end)
+		return -ENOSPC;
+
+	cm->allocated++;
+	cm->available--;
+	m->total_allocated++;
+	m->global_available--;
+	if (reserved)
+		m->global_reserved--;
+	*mapped_cpu = best_cpu;
+	trace_irq_matrix_alloc(bit, best_cpu, m, cm);
+	return bit;
 }
 
 /**
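
For intuition, here's a toy userspace model of the policy change (my own
sketch with made-up vector counts, not the kernel code): first-fit drains
CPU0 before touching CPU1, while picking the CPU with the most available
vectors keeps the load even.

	#include <stdio.h>

	#define NCPU	4
	#define NALLOC	8	/* total capacity is 12, so no ENOSPC in this toy */

	/* Old policy: first CPU that still has room. */
	static unsigned int pick_first(const unsigned int avl[])
	{
		for (unsigned int c = 0; c < NCPU; c++)
			if (avl[c] > 0)
				return c;
		return 0;
	}

	/* New policy: CPU with the most available vectors. */
	static unsigned int pick_most_available(const unsigned int avl[])
	{
		unsigned int best = 0, maxavl = 0;

		for (unsigned int c = 0; c < NCPU; c++) {
			if (avl[c] > maxavl) {
				best = c;
				maxavl = avl[c];
			}
		}
		return best;
	}

	int main(void)
	{
		unsigned int first[NCPU]  = { 3, 3, 3, 3 };
		unsigned int spread[NCPU] = { 3, 3, 3, 3 };

		for (unsigned int i = 0; i < NALLOC; i++) {
			first[pick_first(first)]--;
			spread[pick_most_available(spread)]--;
		}
		/* Prints 0 0 1 3 vs 1 1 1 1: the spread policy evens out. */
		printf("first-fit leftovers: %u %u %u %u\n",
		       first[0], first[1], first[2], first[3]);
		printf("spread leftovers:    %u %u %u %u\n",
		       spread[0], spread[1], spread[2], spread[3]);
		return 0;
	}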

2018-01-18 02:34:44

by Keith Busch

Subject: Re: [BUG 4.15-rc7] IRQ matrix management errors

On Wed, Jan 17, 2018 at 04:01:47PM +0100, Thomas Gleixner wrote:
> Which device is allocating gazillions of non-managed interrupts?

I believe that would be the i40e. :)

> The patch below should cure that by spreading them out on allocation.

Yep, this has already passed over 200 iterations of a test that used to
fail within only a few. I'd say the problem is cured. Thanks!

Tested-by: Keith Busch <[email protected]>

2018-01-18 08:11:42

by Thomas Gleixner

Subject: Re: [BUG 4.15-rc7] IRQ matrix management errors

On Wed, 17 Jan 2018, Keith Busch wrote:
> On Wed, Jan 17, 2018 at 04:01:47PM +0100, Thomas Gleixner wrote:
> > Which device is allocating gazillions of non-managed interrupts?
>
> I believe that would be the i40e. :)

So enterprise grade insanity was spot on.

Can you please provide the output of

# cat /sys/kernel/debug/irq/irqs/$ONE_I40_IRQ

Thanks,

tglx

2018-01-18 08:45:55

by Keith Busch

Subject: Re: [BUG 4.15-rc7] IRQ matrix management errors

On Thu, Jan 18, 2018 at 09:10:43AM +0100, Thomas Gleixner wrote:
> Can you please provide the output of
>
> # cat /sys/kernel/debug/irq/irqs/$ONE_I40_IRQ

# cat /sys/kernel/debug/irq/irqs/48
handler:  handle_edge_irq
device:   0000:1a:00.0
status:   0x00000000
istate:   0x00000000
ddepth:   0
wdepth:   0
dstate:   0x05401200
            IRQD_ACTIVATED
            IRQD_IRQ_STARTED
            IRQD_SINGLE_TARGET
            IRQD_AFFINITY_SET
            IRQD_CAN_RESERVE
node:     0
affinity: 0-27,56-83
effectiv: 0
pending:
domain:  PCI-MSI-2
 hwirq:   0xd00002
 chip:    PCI-MSI
  flags:   0x10
             IRQCHIP_SKIP_SET_WAKE
 parent:
    domain:  VECTOR
     hwirq:   0x30
     chip:    APIC
      flags:   0x0
     Vector:    69
     Target:     0

2018-01-18 09:07:25

by Thomas Gleixner

Subject: Re: [BUG 4.15-rc7] IRQ matrix management errors

On Thu, 18 Jan 2018, Keith Busch wrote:

> On Thu, Jan 18, 2018 at 09:10:43AM +0100, Thomas Gleixner wrote:
> > Can you please provide the output of
> >
> > # cat /sys/kernel/debug/irq/irqs/$ONE_I40_IRQ
>
> # cat /sys/kernel/debug/irq/irqs/48
> handler:  handle_edge_irq
> device:   0000:1a:00.0
> status:   0x00000000
> istate:   0x00000000
> ddepth:   0
> wdepth:   0
> dstate:   0x05401200
>             IRQD_ACTIVATED
>             IRQD_IRQ_STARTED
>             IRQD_SINGLE_TARGET
>             IRQD_AFFINITY_SET
>             IRQD_CAN_RESERVE
> node:     0
> affinity: 0-27,56-83
> effectiv: 0
> pending:
> domain:  PCI-MSI-2
>  hwirq:   0xd00002
>  chip:    PCI-MSI
>   flags:   0x10
>              IRQCHIP_SKIP_SET_WAKE

Bah. That's MSI, where we can't use interrupt reservation because we can't
mask MSI vectors. I really admire these designed-by-committee hardware
bogosities.
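
For reference, the per-vector masking property referred to above is
advertised in the MSI capability's control word; here's a sketch of how
one might test for it (an illustrative helper, not a snippet from the
kernel):

	#include <linux/pci.h>

	/* Without PCI_MSI_FLAGS_MASKBIT an MSI vector cannot be masked, so
	 * reservation mode is unusable and the vector must be allocated
	 * immediately at activation time. */
	static bool msi_has_per_vector_masking(struct pci_dev *dev)
	{
		u16 flags;

		if (!dev->msi_cap)
			return false;
		pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &flags);
		return !!(flags & PCI_MSI_FLAGS_MASKBIT);
	}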

Thanks,

tglx

Subject: [tip:irq/urgent] irq/matrix: Spread interrupts on allocation

Commit-ID: a0c9259dc4e1923a98356967ce8b732da1979df8
Gitweb: https://git.kernel.org/tip/a0c9259dc4e1923a98356967ce8b732da1979df8
Author: Thomas Gleixner <[email protected]>
AuthorDate: Wed, 17 Jan 2018 16:01:47 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Thu, 18 Jan 2018 11:38:41 +0100

irq/matrix: Spread interrupts on allocation

Keith reported an issue with vector space exhaustion on a server machine
which is caused by the i40e driver allocating 168 MSI interrupts when the
driver is initialized, even when most of these interrupts are not used at
all.

The x86 vector allocation code tries to avoid the immediate allocation with
the reservation mode, but the card uses MSI and does not support MSI entry
masking, which prevents reservation mode and requires immediate vector
allocation.

The matrix allocator is a bit naive and prefers the first CPU in the
cpumask which describes the possible target CPUs for an allocation. That
results in allocating all 168 vectors on CPU0, which later causes vector
space exhaustion when the NVMe driver tries to allocate managed interrupts
on each CPU for the per-CPU queues.

Avoid this by finding the CPU which has the lowest vector allocation count
to spread out the non-managed interrupts across the possible target CPUs.

Fixes: 2f75d9e1c905 ("genirq: Implement bitmap matrix allocator")
Reported-by: Keith Busch <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Keith Busch <[email protected]>
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801171557330.1777@nanos

---
kernel/irq/matrix.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
index 0ba0dd8..5187dfe 100644
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -321,15 +321,23 @@ void irq_matrix_remove_reserved(struct irq_matrix *m)
 int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk,
 		     bool reserved, unsigned int *mapped_cpu)
 {
-	unsigned int cpu;
+	unsigned int cpu, best_cpu, maxavl = 0;
+	struct cpumap *cm;
+	unsigned int bit;
 
+	best_cpu = UINT_MAX;
 	for_each_cpu(cpu, msk) {
-		struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
-		unsigned int bit;
+		cm = per_cpu_ptr(m->maps, cpu);
 
-		if (!cm->online)
+		if (!cm->online || cm->available <= maxavl)
 			continue;
 
+		best_cpu = cpu;
+		maxavl = cm->available;
+	}
+
+	if (maxavl) {
+		cm = per_cpu_ptr(m->maps, best_cpu);
 		bit = matrix_alloc_area(m, cm, 1, false);
 		if (bit < m->alloc_end) {
 			cm->allocated++;
@@ -338,8 +346,8 @@ int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk,
 			m->global_available--;
 			if (reserved)
 				m->global_reserved--;
-			*mapped_cpu = cpu;
-			trace_irq_matrix_alloc(bit, cpu, m, cm);
+			*mapped_cpu = best_cpu;
+			trace_irq_matrix_alloc(bit, best_cpu, m, cm);
 			return bit;
 		}
 	}