2017-08-08 19:25:42

by YASUAKI ISHIMATSU

Subject: system hung up when offlining CPUs

Hi Thomas,

When offlining all CPUs except cpu0, the system hung up with the following message.

[...] INFO: task kworker/u384:1:1234 blocked for more than 120 seconds.
[...] Not tainted 4.12.0-rc6+ #19
[...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[...] kworker/u384:1 D 0 1234 2 0x00000000
[...] Workqueue: writeback wb_workfn (flush-253:0)
[...] Call Trace:
[...] __schedule+0x28a/0x880
[...] schedule+0x36/0x80
[...] schedule_timeout+0x249/0x300
[...] ? __schedule+0x292/0x880
[...] __down_common+0xfc/0x132
[...] ? _xfs_buf_find+0x2bb/0x510 [xfs]
[...] __down+0x1d/0x1f
[...] down+0x41/0x50
[...] xfs_buf_lock+0x3c/0xf0 [xfs]
[...] _xfs_buf_find+0x2bb/0x510 [xfs]
[...] xfs_buf_get_map+0x2a/0x280 [xfs]
[...] xfs_buf_read_map+0x2d/0x180 [xfs]
[...] xfs_trans_read_buf_map+0xf5/0x310 [xfs]
[...] xfs_btree_read_buf_block.constprop.35+0x78/0xc0 [xfs]
[...] xfs_btree_lookup_get_block+0x88/0x160 [xfs]
[...] xfs_btree_lookup+0xd0/0x3b0 [xfs]
[...] ? xfs_allocbt_init_cursor+0x41/0xe0 [xfs]
[...] xfs_alloc_ag_vextent_near+0xaf/0xaa0 [xfs]
[...] xfs_alloc_ag_vextent+0x13c/0x150 [xfs]
[...] xfs_alloc_vextent+0x425/0x590 [xfs]
[...] xfs_bmap_btalloc+0x448/0x770 [xfs]
[...] xfs_bmap_alloc+0xe/0x10 [xfs]
[...] xfs_bmapi_write+0x61d/0xc10 [xfs]
[...] ? kmem_zone_alloc+0x96/0x100 [xfs]
[...] xfs_iomap_write_allocate+0x199/0x3a0 [xfs]
[...] xfs_map_blocks+0x1e8/0x260 [xfs]
[...] xfs_do_writepage+0x1ca/0x680 [xfs]
[...] write_cache_pages+0x26f/0x510
[...] ? xfs_vm_set_page_dirty+0x1d0/0x1d0 [xfs]
[...] ? blk_mq_dispatch_rq_list+0x305/0x410
[...] ? deadline_remove_request+0x7d/0xc0
[...] xfs_vm_writepages+0xb6/0xd0 [xfs]
[...] do_writepages+0x1c/0x70
[...] __writeback_single_inode+0x45/0x320
[...] writeback_sb_inodes+0x280/0x570
[...] __writeback_inodes_wb+0x8c/0xc0
[...] wb_writeback+0x276/0x310
[...] ? get_nr_dirty_inodes+0x4d/0x80
[...] wb_workfn+0x2d4/0x3b0
[...] process_one_work+0x149/0x360
[...] worker_thread+0x4d/0x3c0
[...] kthread+0x109/0x140
[...] ? rescuer_thread+0x380/0x380
[...] ? kthread_park+0x60/0x60
[...] ret_from_fork+0x25/0x30


I bisected the upstream kernel and found that the following commit introduced
the issue.

commit c5cb83bb337c25caae995d992d1cdf9b317f83de
Author: Thomas Gleixner <[email protected]>
Date: Tue Jun 20 01:37:51 2017 +0200

genirq/cpuhotplug: Handle managed IRQs on CPU hotplug


Thanks,
Yasuaki Ishimatsu


2017-08-09 11:42:20

by Marc Zyngier

Subject: Re: system hung up when offlining CPUs

On Tue, 8 Aug 2017 15:25:35 -0400
YASUAKI ISHIMATSU <[email protected]> wrote:

> Hi Thomas,
>
> When offlining all CPUs except cpu0, system hung up with the following message.
>
> [...] INFO: task kworker/u384:1:1234 blocked for more than 120 seconds.
> [...] Not tainted 4.12.0-rc6+ #19
> [...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [...] kworker/u384:1 D 0 1234 2 0x00000000
> [...] Workqueue: writeback wb_workfn (flush-253:0)
> [...] Call Trace:
> [...] __schedule+0x28a/0x880
> [...] schedule+0x36/0x80
> [...] schedule_timeout+0x249/0x300
> [...] ? __schedule+0x292/0x880
> [...] __down_common+0xfc/0x132
> [...] ? _xfs_buf_find+0x2bb/0x510 [xfs]
> [...] __down+0x1d/0x1f
> [...] down+0x41/0x50
> [...] xfs_buf_lock+0x3c/0xf0 [xfs]
> [...] _xfs_buf_find+0x2bb/0x510 [xfs]
> [...] xfs_buf_get_map+0x2a/0x280 [xfs]
> [...] xfs_buf_read_map+0x2d/0x180 [xfs]
> [...] xfs_trans_read_buf_map+0xf5/0x310 [xfs]
> [...] xfs_btree_read_buf_block.constprop.35+0x78/0xc0 [xfs]
> [...] xfs_btree_lookup_get_block+0x88/0x160 [xfs]
> [...] xfs_btree_lookup+0xd0/0x3b0 [xfs]
> [...] ? xfs_allocbt_init_cursor+0x41/0xe0 [xfs]
> [...] xfs_alloc_ag_vextent_near+0xaf/0xaa0 [xfs]
> [...] xfs_alloc_ag_vextent+0x13c/0x150 [xfs]
> [...] xfs_alloc_vextent+0x425/0x590 [xfs]
> [...] xfs_bmap_btalloc+0x448/0x770 [xfs]
> [...] xfs_bmap_alloc+0xe/0x10 [xfs]
> [...] xfs_bmapi_write+0x61d/0xc10 [xfs]
> [...] ? kmem_zone_alloc+0x96/0x100 [xfs]
> [...] xfs_iomap_write_allocate+0x199/0x3a0 [xfs]
> [...] xfs_map_blocks+0x1e8/0x260 [xfs]
> [...] xfs_do_writepage+0x1ca/0x680 [xfs]
> [...] write_cache_pages+0x26f/0x510
> [...] ? xfs_vm_set_page_dirty+0x1d0/0x1d0 [xfs]
> [...] ? blk_mq_dispatch_rq_list+0x305/0x410
> [...] ? deadline_remove_request+0x7d/0xc0
> [...] xfs_vm_writepages+0xb6/0xd0 [xfs]
> [...] do_writepages+0x1c/0x70
> [...] __writeback_single_inode+0x45/0x320
> [...] writeback_sb_inodes+0x280/0x570
> [...] __writeback_inodes_wb+0x8c/0xc0
> [...] wb_writeback+0x276/0x310
> [...] ? get_nr_dirty_inodes+0x4d/0x80
> [...] wb_workfn+0x2d4/0x3b0
> [...] process_one_work+0x149/0x360
> [...] worker_thread+0x4d/0x3c0
> [...] kthread+0x109/0x140
> [...] ? rescuer_thread+0x380/0x380
> [...] ? kthread_park+0x60/0x60
> [...] ret_from_fork+0x25/0x30
>
>
> I bisected upstream kernel. And I found that the following commit lead
> the issue.
>
> commit c5cb83bb337c25caae995d992d1cdf9b317f83de
> Author: Thomas Gleixner <[email protected]>
> Date: Tue Jun 20 01:37:51 2017 +0200
>
> genirq/cpuhotplug: Handle managed IRQs on CPU hotplug

Can you please post your /proc/interrupts and details of which
interrupt you think goes wrong? This backtrace is not telling us much
in terms of where to start looking...

Thanks,

M.
--
Without deviation from the norm, progress is not possible.

2017-08-09 19:09:40

by YASUAKI ISHIMATSU

Subject: Re: system hung up when offlining CPUs

Hi Marc,

On 08/09/2017 07:42 AM, Marc Zyngier wrote:
> On Tue, 8 Aug 2017 15:25:35 -0400
> YASUAKI ISHIMATSU <[email protected]> wrote:
>
>> Hi Thomas,
>>
>> When offlining all CPUs except cpu0, system hung up with the following message.
>>
>> [...] INFO: task kworker/u384:1:1234 blocked for more than 120 seconds.
>> [...] Not tainted 4.12.0-rc6+ #19
>> [...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [...] kworker/u384:1 D 0 1234 2 0x00000000
>> [...] Workqueue: writeback wb_workfn (flush-253:0)
>> [...] Call Trace:
>> [...] __schedule+0x28a/0x880
>> [...] schedule+0x36/0x80
>> [...] schedule_timeout+0x249/0x300
>> [...] ? __schedule+0x292/0x880
>> [...] __down_common+0xfc/0x132
>> [...] ? _xfs_buf_find+0x2bb/0x510 [xfs]
>> [...] __down+0x1d/0x1f
>> [...] down+0x41/0x50
>> [...] xfs_buf_lock+0x3c/0xf0 [xfs]
>> [...] _xfs_buf_find+0x2bb/0x510 [xfs]
>> [...] xfs_buf_get_map+0x2a/0x280 [xfs]
>> [...] xfs_buf_read_map+0x2d/0x180 [xfs]
>> [...] xfs_trans_read_buf_map+0xf5/0x310 [xfs]
>> [...] xfs_btree_read_buf_block.constprop.35+0x78/0xc0 [xfs]
>> [...] xfs_btree_lookup_get_block+0x88/0x160 [xfs]
>> [...] xfs_btree_lookup+0xd0/0x3b0 [xfs]
>> [...] ? xfs_allocbt_init_cursor+0x41/0xe0 [xfs]
>> [...] xfs_alloc_ag_vextent_near+0xaf/0xaa0 [xfs]
>> [...] xfs_alloc_ag_vextent+0x13c/0x150 [xfs]
>> [...] xfs_alloc_vextent+0x425/0x590 [xfs]
>> [...] xfs_bmap_btalloc+0x448/0x770 [xfs]
>> [...] xfs_bmap_alloc+0xe/0x10 [xfs]
>> [...] xfs_bmapi_write+0x61d/0xc10 [xfs]
>> [...] ? kmem_zone_alloc+0x96/0x100 [xfs]
>> [...] xfs_iomap_write_allocate+0x199/0x3a0 [xfs]
>> [...] xfs_map_blocks+0x1e8/0x260 [xfs]
>> [...] xfs_do_writepage+0x1ca/0x680 [xfs]
>> [...] write_cache_pages+0x26f/0x510
>> [...] ? xfs_vm_set_page_dirty+0x1d0/0x1d0 [xfs]
>> [...] ? blk_mq_dispatch_rq_list+0x305/0x410
>> [...] ? deadline_remove_request+0x7d/0xc0
>> [...] xfs_vm_writepages+0xb6/0xd0 [xfs]
>> [...] do_writepages+0x1c/0x70
>> [...] __writeback_single_inode+0x45/0x320
>> [...] writeback_sb_inodes+0x280/0x570
>> [...] __writeback_inodes_wb+0x8c/0xc0
>> [...] wb_writeback+0x276/0x310
>> [...] ? get_nr_dirty_inodes+0x4d/0x80
>> [...] wb_workfn+0x2d4/0x3b0
>> [...] process_one_work+0x149/0x360
>> [...] worker_thread+0x4d/0x3c0
>> [...] kthread+0x109/0x140
>> [...] ? rescuer_thread+0x380/0x380
>> [...] ? kthread_park+0x60/0x60
>> [...] ret_from_fork+0x25/0x30
>>
>>
>> I bisected upstream kernel. And I found that the following commit lead
>> the issue.
>>
>> commit c5cb83bb337c25caae995d992d1cdf9b317f83de
>> Author: Thomas Gleixner <[email protected]>
>> Date: Tue Jun 20 01:37:51 2017 +0200
>>
>> genirq/cpuhotplug: Handle managed IRQs on CPU hotplug
>
> Can you please post your /proc/interrupts and details of which
> interrupt you think goes wrong? This backtrace is not telling us much
> in terms of where to start looking...

Thank you for the advice.

The issue is easily reproduced on a physical or virtual machine by offlining all CPUs
except cpu0 (a minimal repro sketch follows the interrupt table below). Here is my
/proc/interrupts on a KVM guest before reproducing the issue. When I offlined cpu1,
the issue occurred; when I offlined cpu0, it did not.

CPU0 CPU1
0: 127 0 IO-APIC 2-edge timer
1: 10 0 IO-APIC 1-edge i8042
4: 227 0 IO-APIC 4-edge ttyS0
6: 3 0 IO-APIC 6-edge floppy
8: 0 0 IO-APIC 8-edge rtc0
9: 0 0 IO-APIC 9-fasteoi acpi
10: 10822 0 IO-APIC 10-fasteoi ehci_hcd:usb1, uhci_hcd:usb2, virtio3
11: 23 0 IO-APIC 11-fasteoi uhci_hcd:usb3, uhci_hcd:usb4, qxl
12: 15 0 IO-APIC 12-edge i8042
14: 218 0 IO-APIC 14-edge ata_piix
15: 0 0 IO-APIC 15-edge ata_piix
24: 0 0 PCI-MSI 49152-edge virtio0-config
25: 359 0 PCI-MSI 49153-edge virtio0-input.0
26: 1 0 PCI-MSI 49154-edge virtio0-output.0
27: 0 0 PCI-MSI 114688-edge virtio2-config
28: 1 3639 PCI-MSI 114689-edge virtio2-req.0
29: 0 0 PCI-MSI 98304-edge virtio1-config
30: 4 0 PCI-MSI 98305-edge virtio1-virtqueues
31: 189 0 PCI-MSI 65536-edge snd_hda_intel:card0
NMI: 0 0 Non-maskable interrupts
LOC: 16115 12845 Local timer interrupts
SPU: 0 0 Spurious interrupts
PMI: 0 0 Performance monitoring interrupts
IWI: 0 0 IRQ work interrupts
RTR: 0 0 APIC ICR read retries
RES: 3016 2135 Rescheduling interrupts
CAL: 3666 557 Function call interrupts
TLB: 65 12 TLB shootdowns
TRM: 0 0 Thermal event interrupts
THR: 0 0 Threshold APIC interrupts
DFR: 0 0 Deferred Error APIC interrupts
MCE: 0 0 Machine check exceptions
MCP: 1 1 Machine check polls
ERR: 0
MIS: 0
PIN: 0 0 Posted-interrupt notification event
NPI: 0 0 Nested posted-interrupt event
PIW: 0 0 Posted-interrupt wakeup event


Thanks,
Yasuaki Ishimatsu

>
> Thanks,
>
> M.
>

2017-08-10 11:54:39

by Marc Zyngier

Subject: Re: system hung up when offlining CPUs

+ Christoph, since he's the one who came up with the idea

On 09/08/17 20:09, YASUAKI ISHIMATSU wrote:
> Hi Marc,
>
> On 08/09/2017 07:42 AM, Marc Zyngier wrote:
>> On Tue, 8 Aug 2017 15:25:35 -0400
>> YASUAKI ISHIMATSU <[email protected]> wrote:
>>
>>> Hi Thomas,
>>>
>>> When offlining all CPUs except cpu0, system hung up with the following message.
>>>
>>> [...] INFO: task kworker/u384:1:1234 blocked for more than 120 seconds.
>>> [...] Not tainted 4.12.0-rc6+ #19
>>> [...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [...] kworker/u384:1 D 0 1234 2 0x00000000
>>> [...] Workqueue: writeback wb_workfn (flush-253:0)
>>> [...] Call Trace:
>>> [...] __schedule+0x28a/0x880
>>> [...] schedule+0x36/0x80
>>> [...] schedule_timeout+0x249/0x300
>>> [...] ? __schedule+0x292/0x880
>>> [...] __down_common+0xfc/0x132
>>> [...] ? _xfs_buf_find+0x2bb/0x510 [xfs]
>>> [...] __down+0x1d/0x1f
>>> [...] down+0x41/0x50
>>> [...] xfs_buf_lock+0x3c/0xf0 [xfs]
>>> [...] _xfs_buf_find+0x2bb/0x510 [xfs]
>>> [...] xfs_buf_get_map+0x2a/0x280 [xfs]
>>> [...] xfs_buf_read_map+0x2d/0x180 [xfs]
>>> [...] xfs_trans_read_buf_map+0xf5/0x310 [xfs]
>>> [...] xfs_btree_read_buf_block.constprop.35+0x78/0xc0 [xfs]
>>> [...] xfs_btree_lookup_get_block+0x88/0x160 [xfs]
>>> [...] xfs_btree_lookup+0xd0/0x3b0 [xfs]
>>> [...] ? xfs_allocbt_init_cursor+0x41/0xe0 [xfs]
>>> [...] xfs_alloc_ag_vextent_near+0xaf/0xaa0 [xfs]
>>> [...] xfs_alloc_ag_vextent+0x13c/0x150 [xfs]
>>> [...] xfs_alloc_vextent+0x425/0x590 [xfs]
>>> [...] xfs_bmap_btalloc+0x448/0x770 [xfs]
>>> [...] xfs_bmap_alloc+0xe/0x10 [xfs]
>>> [...] xfs_bmapi_write+0x61d/0xc10 [xfs]
>>> [...] ? kmem_zone_alloc+0x96/0x100 [xfs]
>>> [...] xfs_iomap_write_allocate+0x199/0x3a0 [xfs]
>>> [...] xfs_map_blocks+0x1e8/0x260 [xfs]
>>> [...] xfs_do_writepage+0x1ca/0x680 [xfs]
>>> [...] write_cache_pages+0x26f/0x510
>>> [...] ? xfs_vm_set_page_dirty+0x1d0/0x1d0 [xfs]
>>> [...] ? blk_mq_dispatch_rq_list+0x305/0x410
>>> [...] ? deadline_remove_request+0x7d/0xc0
>>> [...] xfs_vm_writepages+0xb6/0xd0 [xfs]
>>> [...] do_writepages+0x1c/0x70
>>> [...] __writeback_single_inode+0x45/0x320
>>> [...] writeback_sb_inodes+0x280/0x570
>>> [...] __writeback_inodes_wb+0x8c/0xc0
>>> [...] wb_writeback+0x276/0x310
>>> [...] ? get_nr_dirty_inodes+0x4d/0x80
>>> [...] wb_workfn+0x2d4/0x3b0
>>> [...] process_one_work+0x149/0x360
>>> [...] worker_thread+0x4d/0x3c0
>>> [...] kthread+0x109/0x140
>>> [...] ? rescuer_thread+0x380/0x380
>>> [...] ? kthread_park+0x60/0x60
>>> [...] ret_from_fork+0x25/0x30
>>>
>>>
>>> I bisected upstream kernel. And I found that the following commit lead
>>> the issue.
>>>
>>> commit c5cb83bb337c25caae995d992d1cdf9b317f83de
>>> Author: Thomas Gleixner <[email protected]>
>>> Date: Tue Jun 20 01:37:51 2017 +0200
>>>
>>> genirq/cpuhotplug: Handle managed IRQs on CPU hotplug
>>
>> Can you please post your /proc/interrupts and details of which
>> interrupt you think goes wrong? This backtrace is not telling us much
>> in terms of where to start looking...
>
> Thank you for giving advise.
>
> The issue is easily reproduced on physical/virtual machine by offling CPUs except cpu0.
> Here are my /proc/interrupts on kvm guest before reproducing the issue. And when offlining
> cpu1, the issue occurred. But when offling cpu0, the issue didn't occur.
>
> CPU0 CPU1
> 0: 127 0 IO-APIC 2-edge timer
> 1: 10 0 IO-APIC 1-edge i8042
> 4: 227 0 IO-APIC 4-edge ttyS0
> 6: 3 0 IO-APIC 6-edge floppy
> 8: 0 0 IO-APIC 8-edge rtc0
> 9: 0 0 IO-APIC 9-fasteoi acpi
> 10: 10822 0 IO-APIC 10-fasteoi ehci_hcd:usb1, uhci_hcd:usb2, virtio3
> 11: 23 0 IO-APIC 11-fasteoi uhci_hcd:usb3, uhci_hcd:usb4, qxl
> 12: 15 0 IO-APIC 12-edge i8042
> 14: 218 0 IO-APIC 14-edge ata_piix
> 15: 0 0 IO-APIC 15-edge ata_piix
> 24: 0 0 PCI-MSI 49152-edge virtio0-config
> 25: 359 0 PCI-MSI 49153-edge virtio0-input.0
> 26: 1 0 PCI-MSI 49154-edge virtio0-output.0
> 27: 0 0 PCI-MSI 114688-edge virtio2-config
> 28: 1 3639 PCI-MSI 114689-edge virtio2-req.0
> 29: 0 0 PCI-MSI 98304-edge virtio1-config
> 30: 4 0 PCI-MSI 98305-edge virtio1-virtqueues
> 31: 189 0 PCI-MSI 65536-edge snd_hda_intel:card0
> NMI: 0 0 Non-maskable interrupts
> LOC: 16115 12845 Local timer interrupts
> SPU: 0 0 Spurious interrupts
> PMI: 0 0 Performance monitoring interrupts
> IWI: 0 0 IRQ work interrupts
> RTR: 0 0 APIC ICR read retries
> RES: 3016 2135 Rescheduling interrupts
> CAL: 3666 557 Function call interrupts
> TLB: 65 12 TLB shootdowns
> TRM: 0 0 Thermal event interrupts
> THR: 0 0 Threshold APIC interrupts
> DFR: 0 0 Deferred Error APIC interrupts
> MCE: 0 0 Machine check exceptions
> MCP: 1 1 Machine check polls
> ERR: 0
> MIS: 0
> PIN: 0 0 Posted-interrupt notification event
> NPI: 0 0 Nested posted-interrupt event
> PIW: 0 0 Posted-interrupt wakeup event

I was able to reproduce this with an arm64 VM pretty easily. The issue
is that (in the above case), IRQ27 is affine to CPU0, and IRQ28 to CPU1.
If you have more CPUs, IRQ27 is affine to the first half of the CPUs,
and IRQ28 to the others.

When CPU1 is offlined, the fact that we have a "managed" interrupt
affinity prevents the interrupt from being moved to CPU0, and you lose
your disk. I don't think that's the expected effect... The opposite case
(offlining CPU0) only "works" because you're not getting any config
interrupt (IOW, you're lucky).

I'm not sure how this is supposed to work. Shutting down the interrupt
in migrate_one_irq() just breaks everything (unless the driver somehow
knows about it, which doesn't seem to be the case).

Christoph?

M.
--
Jazz is not dead. It just smells funny...

2017-08-21 12:07:40

by Christoph Hellwig

Subject: Re: system hung up when offlining CPUs

Hi Marc,

in general the driver should know not to use the queue / irq,
as blk-mq will never schedule I/O to queues that have no online
cpus.

The real bug seems to be that we're using affinity for a device
that only has one real queue (as the config queue should not
have affinity). Let me dig into what's going on here with virtio.
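
(For context: the usual way to express this is the pre_vectors field of
struct irq_affinity, which excludes the leading vectors from the managed
affinity spreading. A rough, illustrative sketch of a driver keeping its
config interrupt out of the spread set follows; the actual virtio fix is in
the next mail. 'pdev' and 'nvectors' are assumed to come from the driver's
probe path.)

	#include <linux/pci.h>
	#include <linux/interrupt.h>

	/* Illustrative sketch only: vector 0 (the config interrupt) is kept
	 * out of the managed spreading, so only the real queue vectors get
	 * per-CPU affinity.
	 */
	static int example_alloc_vectors(struct pci_dev *pdev, int nvectors)
	{
		struct irq_affinity desc = { .pre_vectors = 1 };

		return pci_alloc_irq_vectors_affinity(pdev, nvectors, nvectors,
						      PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
						      &desc);
	}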

2017-08-21 13:18:13

by Christoph Hellwig

Subject: Re: system hung up when offlining CPUs

Can you try the patch below please?

---
From d5f59cb7a629de8439b318e1384660e6b56e7dd8 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <[email protected]>
Date: Mon, 21 Aug 2017 14:24:11 +0200
Subject: virtio_pci: fix cpu affinity support

Commit 0b0f9dc5 ("Revert "virtio_pci: use shared interrupts for
virtqueues"") removed the adjustment of the pre_vectors for the virtio
MSI-X vector allocation which was added in commit fb5e31d9 ("virtio:
allow drivers to request IRQ affinity when creating VQs"). This will
lead to an incorrect assignment of MSI-X vectors, and potential
deadlocks when offlining cpus.

Signed-off-by: Christoph Hellwig <[email protected]>
Fixes: 0b0f9dc5 ("Revert "virtio_pci: use shared interrupts for virtqueues")
Reported-by: YASUAKI ISHIMATSU <[email protected]>
---
drivers/virtio/virtio_pci_common.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index 007a4f366086..1c4797e53f68 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -107,6 +107,7 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
 	const char *name = dev_name(&vp_dev->vdev.dev);
+	unsigned flags = PCI_IRQ_MSIX;
 	unsigned i, v;
 	int err = -ENOMEM;
 
@@ -126,10 +127,13 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
 				GFP_KERNEL))
 			goto error;
 
+	if (desc) {
+		flags |= PCI_IRQ_AFFINITY;
+		desc->pre_vectors++;	/* virtio config vector */
+	}
+
 	err = pci_alloc_irq_vectors_affinity(vp_dev->pci_dev, nvectors,
-			nvectors, PCI_IRQ_MSIX |
-			(desc ? PCI_IRQ_AFFINITY : 0),
-			desc);
+			nvectors, flags, desc);
 	if (err < 0)
 		goto error;
 	vp_dev->msix_enabled = 1;
--
2.11.0

2017-08-21 13:38:02

by Marc Zyngier

Subject: Re: system hung up when offlining CPUs

On 21/08/17 14:18, Christoph Hellwig wrote:
> Can you try the patch below please?
>
> ---
> From d5f59cb7a629de8439b318e1384660e6b56e7dd8 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <[email protected]>
> Date: Mon, 21 Aug 2017 14:24:11 +0200
> Subject: virtio_pci: fix cpu affinity support
>
> Commit 0b0f9dc5 ("Revert "virtio_pci: use shared interrupts for
> virtqueues"") removed the adjustment of the pre_vectors for the virtio
> MSI-X vector allocation which was added in commit fb5e31d9 ("virtio:
> allow drivers to request IRQ affinity when creating VQs"). This will
> lead to an incorrect assignment of MSI-X vectors, and potential
> deadlocks when offlining cpus.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> Fixes: 0b0f9dc5 ("Revert "virtio_pci: use shared interrupts for virtqueues")
> Reported-by: YASUAKI ISHIMATSU <[email protected]>

Just gave it a go on an arm64 VM, and the behaviour seems much saner
(the virtio queue affinity now spans the whole system).

Tested-by: Marc Zyngier <[email protected]>

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2017-09-07 20:23:46

by YASUAKI ISHIMATSU

Subject: Re: system hung up when offlining CPUs

Hi Marc and Christoph,

Sorry for the late reply. I appreciate that you fixed the issue in the KVM environment,
but the issue still occurs on a physical server.

Here is the megasas IRQ information that I summarized from /proc/interrupts
and /proc/irq/*/smp_affinity_list on my server:

---
IRQ affinity_list IRQ_TYPE
42 0-5 IR-PCI-MSI 1048576-edge megasas
43 0-5 IR-PCI-MSI 1048577-edge megasas
44 0-5 IR-PCI-MSI 1048578-edge megasas
45 0-5 IR-PCI-MSI 1048579-edge megasas
46 0-5 IR-PCI-MSI 1048580-edge megasas
47 0-5 IR-PCI-MSI 1048581-edge megasas
48 0-5 IR-PCI-MSI 1048582-edge megasas
49 0-5 IR-PCI-MSI 1048583-edge megasas
50 0-5 IR-PCI-MSI 1048584-edge megasas
51 0-5 IR-PCI-MSI 1048585-edge megasas
52 0-5 IR-PCI-MSI 1048586-edge megasas
53 0-5 IR-PCI-MSI 1048587-edge megasas
54 0-5 IR-PCI-MSI 1048588-edge megasas
55 0-5 IR-PCI-MSI 1048589-edge megasas
56 0-5 IR-PCI-MSI 1048590-edge megasas
57 0-5 IR-PCI-MSI 1048591-edge megasas
58 0-5 IR-PCI-MSI 1048592-edge megasas
59 0-5 IR-PCI-MSI 1048593-edge megasas
60 0-5 IR-PCI-MSI 1048594-edge megasas
61 0-5 IR-PCI-MSI 1048595-edge megasas
62 0-5 IR-PCI-MSI 1048596-edge megasas
63 0-5 IR-PCI-MSI 1048597-edge megasas
64 0-5 IR-PCI-MSI 1048598-edge megasas
65 0-5 IR-PCI-MSI 1048599-edge megasas
66 24-29 IR-PCI-MSI 1048600-edge megasas
67 24-29 IR-PCI-MSI 1048601-edge megasas
68 24-29 IR-PCI-MSI 1048602-edge megasas
69 24-29 IR-PCI-MSI 1048603-edge megasas
70 24-29 IR-PCI-MSI 1048604-edge megasas
71 24-29 IR-PCI-MSI 1048605-edge megasas
72 24-29 IR-PCI-MSI 1048606-edge megasas
73 24-29 IR-PCI-MSI 1048607-edge megasas
74 24-29 IR-PCI-MSI 1048608-edge megasas
75 24-29 IR-PCI-MSI 1048609-edge megasas
76 24-29 IR-PCI-MSI 1048610-edge megasas
77 24-29 IR-PCI-MSI 1048611-edge megasas
78 24-29 IR-PCI-MSI 1048612-edge megasas
79 24-29 IR-PCI-MSI 1048613-edge megasas
80 24-29 IR-PCI-MSI 1048614-edge megasas
81 24-29 IR-PCI-MSI 1048615-edge megasas
82 24-29 IR-PCI-MSI 1048616-edge megasas
83 24-29 IR-PCI-MSI 1048617-edge megasas
84 24-29 IR-PCI-MSI 1048618-edge megasas
85 24-29 IR-PCI-MSI 1048619-edge megasas
86 24-29 IR-PCI-MSI 1048620-edge megasas
87 24-29 IR-PCI-MSI 1048621-edge megasas
88 24-29 IR-PCI-MSI 1048622-edge megasas
89 24-29 IR-PCI-MSI 1048623-edge megasas
---

On my server, IRQ#66-89 are routed to CPU#24-29. If I offline CPU#24-29,
I/O stops working and the following messages appear.

---
[...] sd 0:2:0:0: [sda] tag#1 task abort called for scmd(ffff8820574d7560)
[...] sd 0:2:0:0: [sda] tag#1 CDB: Read(10) 28 00 0d e8 cf 78 00 00 08 00
[...] sd 0:2:0:0: task abort: FAILED scmd(ffff8820574d7560)
[...] sd 0:2:0:0: [sda] tag#0 task abort called for scmd(ffff882057426560)
[...] sd 0:2:0:0: [sda] tag#0 CDB: Write(10) 2a 00 0d 58 37 00 00 00 08 00
[...] sd 0:2:0:0: task abort: FAILED scmd(ffff882057426560)
[...] sd 0:2:0:0: target reset called for scmd(ffff8820574d7560)
[...] sd 0:2:0:0: [sda] tag#1 megasas: target reset FAILED!!
[...] sd 0:2:0:0: [sda] tag#0 Controller reset is requested due to IO timeout
[...] SCSI command pointer: (ffff882057426560) SCSI host state: 5 SCSI
[...] IO request frame:
[...]
<snip>
[...]
[...] megaraid_sas 0000:02:00.0: [ 0]waiting for 2 commands to complete for scsi0
[...] INFO: task auditd:1200 blocked for more than 120 seconds.
[...] Not tainted 4.13.0+ #15
[...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[...] auditd D 0 1200 1 0x00000000
[...] Call Trace:
[...] __schedule+0x28d/0x890
[...] schedule+0x36/0x80
[...] io_schedule+0x16/0x40
[...] wait_on_page_bit_common+0x109/0x1c0
[...] ? page_cache_tree_insert+0xf0/0xf0
[...] __filemap_fdatawait_range+0x127/0x190
[...] ? __filemap_fdatawrite_range+0xd1/0x100
[...] file_write_and_wait_range+0x60/0xb0
[...] xfs_file_fsync+0x67/0x1d0 [xfs]
[...] vfs_fsync_range+0x3d/0xb0
[...] do_fsync+0x3d/0x70
[...] SyS_fsync+0x10/0x20
[...] entry_SYSCALL_64_fastpath+0x1a/0xa5
[...] RIP: 0033:0x7f0bd9633d2d
[...] RSP: 002b:00007f0bd751ed30 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[...] RAX: ffffffffffffffda RBX: 00005590566d0080 RCX: 00007f0bd9633d2d
[...] RDX: 00005590566d1260 RSI: 0000000000000000 RDI: 0000000000000005
[...] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000017
[...] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
[...] R13: 00007f0bd751f9c0 R14: 00007f0bd751f700 R15: 0000000000000000
---

Thanks,
Yasuaki Ishimatsu

On 08/21/2017 09:37 AM, Marc Zyngier wrote:
> On 21/08/17 14:18, Christoph Hellwig wrote:
>> Can you try the patch below please?
>>
>> ---
>> From d5f59cb7a629de8439b318e1384660e6b56e7dd8 Mon Sep 17 00:00:00 2001
>> From: Christoph Hellwig <[email protected]>
>> Date: Mon, 21 Aug 2017 14:24:11 +0200
>> Subject: virtio_pci: fix cpu affinity support
>>
>> Commit 0b0f9dc5 ("Revert "virtio_pci: use shared interrupts for
>> virtqueues"") removed the adjustment of the pre_vectors for the virtio
>> MSI-X vector allocation which was added in commit fb5e31d9 ("virtio:
>> allow drivers to request IRQ affinity when creating VQs"). This will
>> lead to an incorrect assignment of MSI-X vectors, and potential
>> deadlocks when offlining cpus.
>>
>> Signed-off-by: Christoph Hellwig <[email protected]>
>> Fixes: 0b0f9dc5 ("Revert "virtio_pci: use shared interrupts for virtqueues")
>> Reported-by: YASUAKI ISHIMATSU <[email protected]>
>
> Just gave it a go on an arm64 VM, and the behaviour seems much saner
> (the virtio queue affinity now spans the whole system).
>
> Tested-by: Marc Zyngier <[email protected]>
>
> Thanks,
>
> M.
>

2017-09-12 18:15:58

by YASUAKI ISHIMATSU

Subject: Re: system hung up when offlining CPUs

+ linux-scsi and maintainers of megasas

When offlining CPUs, I/O stops. Do you have any ideas?

On 09/07/2017 04:23 PM, YASUAKI ISHIMATSU wrote:
> Hi Mark and Christoph,
>
> Sorry for the late reply. I appreciated that you fixed the issue on kvm environment.
> But the issue still occurs on physical server.
>
> Here ares irq information that I summarized megasas irqs from /proc/interrupts
> and /proc/irq/*/smp_affinity_list on my server:
>
> ---
> IRQ affinity_list IRQ_TYPE
> 42 0-5 IR-PCI-MSI 1048576-edge megasas
> 43 0-5 IR-PCI-MSI 1048577-edge megasas
> 44 0-5 IR-PCI-MSI 1048578-edge megasas
> 45 0-5 IR-PCI-MSI 1048579-edge megasas
> 46 0-5 IR-PCI-MSI 1048580-edge megasas
> 47 0-5 IR-PCI-MSI 1048581-edge megasas
> 48 0-5 IR-PCI-MSI 1048582-edge megasas
> 49 0-5 IR-PCI-MSI 1048583-edge megasas
> 50 0-5 IR-PCI-MSI 1048584-edge megasas
> 51 0-5 IR-PCI-MSI 1048585-edge megasas
> 52 0-5 IR-PCI-MSI 1048586-edge megasas
> 53 0-5 IR-PCI-MSI 1048587-edge megasas
> 54 0-5 IR-PCI-MSI 1048588-edge megasas
> 55 0-5 IR-PCI-MSI 1048589-edge megasas
> 56 0-5 IR-PCI-MSI 1048590-edge megasas
> 57 0-5 IR-PCI-MSI 1048591-edge megasas
> 58 0-5 IR-PCI-MSI 1048592-edge megasas
> 59 0-5 IR-PCI-MSI 1048593-edge megasas
> 60 0-5 IR-PCI-MSI 1048594-edge megasas
> 61 0-5 IR-PCI-MSI 1048595-edge megasas
> 62 0-5 IR-PCI-MSI 1048596-edge megasas
> 63 0-5 IR-PCI-MSI 1048597-edge megasas
> 64 0-5 IR-PCI-MSI 1048598-edge megasas
> 65 0-5 IR-PCI-MSI 1048599-edge megasas
> 66 24-29 IR-PCI-MSI 1048600-edge megasas
> 67 24-29 IR-PCI-MSI 1048601-edge megasas
> 68 24-29 IR-PCI-MSI 1048602-edge megasas
> 69 24-29 IR-PCI-MSI 1048603-edge megasas
> 70 24-29 IR-PCI-MSI 1048604-edge megasas
> 71 24-29 IR-PCI-MSI 1048605-edge megasas
> 72 24-29 IR-PCI-MSI 1048606-edge megasas
> 73 24-29 IR-PCI-MSI 1048607-edge megasas
> 74 24-29 IR-PCI-MSI 1048608-edge megasas
> 75 24-29 IR-PCI-MSI 1048609-edge megasas
> 76 24-29 IR-PCI-MSI 1048610-edge megasas
> 77 24-29 IR-PCI-MSI 1048611-edge megasas
> 78 24-29 IR-PCI-MSI 1048612-edge megasas
> 79 24-29 IR-PCI-MSI 1048613-edge megasas
> 80 24-29 IR-PCI-MSI 1048614-edge megasas
> 81 24-29 IR-PCI-MSI 1048615-edge megasas
> 82 24-29 IR-PCI-MSI 1048616-edge megasas
> 83 24-29 IR-PCI-MSI 1048617-edge megasas
> 84 24-29 IR-PCI-MSI 1048618-edge megasas
> 85 24-29 IR-PCI-MSI 1048619-edge megasas
> 86 24-29 IR-PCI-MSI 1048620-edge megasas
> 87 24-29 IR-PCI-MSI 1048621-edge megasas
> 88 24-29 IR-PCI-MSI 1048622-edge megasas
> 89 24-29 IR-PCI-MSI 1048623-edge megasas
> ---
>
> In my server, IRQ#66-89 are sent to CPU#24-29. And if I offline CPU#24-29,
> I/O does not work, showing the following messages.
>
> ---
> [...] sd 0:2:0:0: [sda] tag#1 task abort called for scmd(ffff8820574d7560)
> [...] sd 0:2:0:0: [sda] tag#1 CDB: Read(10) 28 00 0d e8 cf 78 00 00 08 00
> [...] sd 0:2:0:0: task abort: FAILED scmd(ffff8820574d7560)
> [...] sd 0:2:0:0: [sda] tag#0 task abort called for scmd(ffff882057426560)
> [...] sd 0:2:0:0: [sda] tag#0 CDB: Write(10) 2a 00 0d 58 37 00 00 00 08 00
> [...] sd 0:2:0:0: task abort: FAILED scmd(ffff882057426560)
> [...] sd 0:2:0:0: target reset called for scmd(ffff8820574d7560)
> [...] sd 0:2:0:0: [sda] tag#1 megasas: target reset FAILED!!
> [...] sd 0:2:0:0: [sda] tag#0 Controller reset is requested due to IO timeout
> [...] SCSI command pointer: (ffff882057426560) SCSI host state: 5 SCSI
> [...] IO request frame:
> [...]
> <snip>
> [...]
> [...] megaraid_sas 0000:02:00.0: [ 0]waiting for 2 commands to complete for scsi0
> [...] INFO: task auditd:1200 blocked for more than 120 seconds.
> [...] Not tainted 4.13.0+ #15
> [...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [...] auditd D 0 1200 1 0x00000000
> [...] Call Trace:
> [...] __schedule+0x28d/0x890
> [...] schedule+0x36/0x80
> [...] io_schedule+0x16/0x40
> [...] wait_on_page_bit_common+0x109/0x1c0
> [...] ? page_cache_tree_insert+0xf0/0xf0
> [...] __filemap_fdatawait_range+0x127/0x190
> [...] ? __filemap_fdatawrite_range+0xd1/0x100
> [...] file_write_and_wait_range+0x60/0xb0
> [...] xfs_file_fsync+0x67/0x1d0 [xfs]
> [...] vfs_fsync_range+0x3d/0xb0
> [...] do_fsync+0x3d/0x70
> [...] SyS_fsync+0x10/0x20
> [...] entry_SYSCALL_64_fastpath+0x1a/0xa5
> [...] RIP: 0033:0x7f0bd9633d2d
> [...] RSP: 002b:00007f0bd751ed30 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
> [...] RAX: ffffffffffffffda RBX: 00005590566d0080 RCX: 00007f0bd9633d2d
> [...] RDX: 00005590566d1260 RSI: 0000000000000000 RDI: 0000000000000005
> [...] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000017
> [...] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
> [...] R13: 00007f0bd751f9c0 R14: 00007f0bd751f700 R15: 0000000000000000
> ---
>
> Thanks,
> Yasuaki Ishimatsu
>
> On 08/21/2017 09:37 AM, Marc Zyngier wrote:
>> On 21/08/17 14:18, Christoph Hellwig wrote:
>>> Can you try the patch below please?
>>>
>>> ---
>>> From d5f59cb7a629de8439b318e1384660e6b56e7dd8 Mon Sep 17 00:00:00 2001
>>> From: Christoph Hellwig <[email protected]>
>>> Date: Mon, 21 Aug 2017 14:24:11 +0200
>>> Subject: virtio_pci: fix cpu affinity support
>>>
>>> Commit 0b0f9dc5 ("Revert "virtio_pci: use shared interrupts for
>>> virtqueues"") removed the adjustment of the pre_vectors for the virtio
>>> MSI-X vector allocation which was added in commit fb5e31d9 ("virtio:
>>> allow drivers to request IRQ affinity when creating VQs"). This will
>>> lead to an incorrect assignment of MSI-X vectors, and potential
>>> deadlocks when offlining cpus.
>>>
>>> Signed-off-by: Christoph Hellwig <[email protected]>
>>> Fixes: 0b0f9dc5 ("Revert "virtio_pci: use shared interrupts for virtqueues")
>>> Reported-by: YASUAKI ISHIMATSU <[email protected]>
>>
>> Just gave it a go on an arm64 VM, and the behaviour seems much saner
>> (the virtio queue affinity now spans the whole system).
>>
>> Tested-by: Marc Zyngier <[email protected]>
>>
>> Thanks,
>>
>> M.
>>

2017-09-13 11:14:13

by Hannes Reinecke

Subject: Re: system hung up when offlining CPUs

On 09/12/2017 08:15 PM, YASUAKI ISHIMATSU wrote:
> + linux-scsi and maintainers of megasas
>
> When offlining CPU, I/O stops. Do you have any ideas?
>
> On 09/07/2017 04:23 PM, YASUAKI ISHIMATSU wrote:
>> Hi Mark and Christoph,
>>
>> Sorry for the late reply. I appreciated that you fixed the issue on kvm environment.
>> But the issue still occurs on physical server.
>>
>> Here ares irq information that I summarized megasas irqs from /proc/interrupts
>> and /proc/irq/*/smp_affinity_list on my server:
>>
>> ---
>> IRQ affinity_list IRQ_TYPE
>> 42 0-5 IR-PCI-MSI 1048576-edge megasas
>> 43 0-5 IR-PCI-MSI 1048577-edge megasas
>> 44 0-5 IR-PCI-MSI 1048578-edge megasas
>> 45 0-5 IR-PCI-MSI 1048579-edge megasas
>> 46 0-5 IR-PCI-MSI 1048580-edge megasas
>> 47 0-5 IR-PCI-MSI 1048581-edge megasas
>> 48 0-5 IR-PCI-MSI 1048582-edge megasas
>> 49 0-5 IR-PCI-MSI 1048583-edge megasas
>> 50 0-5 IR-PCI-MSI 1048584-edge megasas
>> 51 0-5 IR-PCI-MSI 1048585-edge megasas
>> 52 0-5 IR-PCI-MSI 1048586-edge megasas
>> 53 0-5 IR-PCI-MSI 1048587-edge megasas
>> 54 0-5 IR-PCI-MSI 1048588-edge megasas
>> 55 0-5 IR-PCI-MSI 1048589-edge megasas
>> 56 0-5 IR-PCI-MSI 1048590-edge megasas
>> 57 0-5 IR-PCI-MSI 1048591-edge megasas
>> 58 0-5 IR-PCI-MSI 1048592-edge megasas
>> 59 0-5 IR-PCI-MSI 1048593-edge megasas
>> 60 0-5 IR-PCI-MSI 1048594-edge megasas
>> 61 0-5 IR-PCI-MSI 1048595-edge megasas
>> 62 0-5 IR-PCI-MSI 1048596-edge megasas
>> 63 0-5 IR-PCI-MSI 1048597-edge megasas
>> 64 0-5 IR-PCI-MSI 1048598-edge megasas
>> 65 0-5 IR-PCI-MSI 1048599-edge megasas
>> 66 24-29 IR-PCI-MSI 1048600-edge megasas
>> 67 24-29 IR-PCI-MSI 1048601-edge megasas
>> 68 24-29 IR-PCI-MSI 1048602-edge megasas
>> 69 24-29 IR-PCI-MSI 1048603-edge megasas
>> 70 24-29 IR-PCI-MSI 1048604-edge megasas
>> 71 24-29 IR-PCI-MSI 1048605-edge megasas
>> 72 24-29 IR-PCI-MSI 1048606-edge megasas
>> 73 24-29 IR-PCI-MSI 1048607-edge megasas
>> 74 24-29 IR-PCI-MSI 1048608-edge megasas
>> 75 24-29 IR-PCI-MSI 1048609-edge megasas
>> 76 24-29 IR-PCI-MSI 1048610-edge megasas
>> 77 24-29 IR-PCI-MSI 1048611-edge megasas
>> 78 24-29 IR-PCI-MSI 1048612-edge megasas
>> 79 24-29 IR-PCI-MSI 1048613-edge megasas
>> 80 24-29 IR-PCI-MSI 1048614-edge megasas
>> 81 24-29 IR-PCI-MSI 1048615-edge megasas
>> 82 24-29 IR-PCI-MSI 1048616-edge megasas
>> 83 24-29 IR-PCI-MSI 1048617-edge megasas
>> 84 24-29 IR-PCI-MSI 1048618-edge megasas
>> 85 24-29 IR-PCI-MSI 1048619-edge megasas
>> 86 24-29 IR-PCI-MSI 1048620-edge megasas
>> 87 24-29 IR-PCI-MSI 1048621-edge megasas
>> 88 24-29 IR-PCI-MSI 1048622-edge megasas
>> 89 24-29 IR-PCI-MSI 1048623-edge megasas
>> ---
>>
>> In my server, IRQ#66-89 are sent to CPU#24-29. And if I offline CPU#24-29,
>> I/O does not work, showing the following messages.
>>
>> ---
>> [...] sd 0:2:0:0: [sda] tag#1 task abort called for scmd(ffff8820574d7560)
>> [...] sd 0:2:0:0: [sda] tag#1 CDB: Read(10) 28 00 0d e8 cf 78 00 00 08 00
>> [...] sd 0:2:0:0: task abort: FAILED scmd(ffff8820574d7560)
>> [...] sd 0:2:0:0: [sda] tag#0 task abort called for scmd(ffff882057426560)
>> [...] sd 0:2:0:0: [sda] tag#0 CDB: Write(10) 2a 00 0d 58 37 00 00 00 08 00
>> [...] sd 0:2:0:0: task abort: FAILED scmd(ffff882057426560)
>> [...] sd 0:2:0:0: target reset called for scmd(ffff8820574d7560)
>> [...] sd 0:2:0:0: [sda] tag#1 megasas: target reset FAILED!!
>> [...] sd 0:2:0:0: [sda] tag#0 Controller reset is requested due to IO timeout
>> [...] SCSI command pointer: (ffff882057426560) SCSI host state: 5 SCSI
>> [...] IO request frame:
>> [...]
>> <snip>
>> [...]
>> [...] megaraid_sas 0000:02:00.0: [ 0]waiting for 2 commands to complete for scsi0
>> [...] INFO: task auditd:1200 blocked for more than 120 seconds.
>> [...] Not tainted 4.13.0+ #15
>> [...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [...] auditd D 0 1200 1 0x00000000
>> [...] Call Trace:
>> [...] __schedule+0x28d/0x890
>> [...] schedule+0x36/0x80
>> [...] io_schedule+0x16/0x40
>> [...] wait_on_page_bit_common+0x109/0x1c0
>> [...] ? page_cache_tree_insert+0xf0/0xf0
>> [...] __filemap_fdatawait_range+0x127/0x190
>> [...] ? __filemap_fdatawrite_range+0xd1/0x100
>> [...] file_write_and_wait_range+0x60/0xb0
>> [...] xfs_file_fsync+0x67/0x1d0 [xfs]
>> [...] vfs_fsync_range+0x3d/0xb0
>> [...] do_fsync+0x3d/0x70
>> [...] SyS_fsync+0x10/0x20
>> [...] entry_SYSCALL_64_fastpath+0x1a/0xa5
>> [...] RIP: 0033:0x7f0bd9633d2d
>> [...] RSP: 002b:00007f0bd751ed30 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
>> [...] RAX: ffffffffffffffda RBX: 00005590566d0080 RCX: 00007f0bd9633d2d
>> [...] RDX: 00005590566d1260 RSI: 0000000000000000 RDI: 0000000000000005
>> [...] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000017
>> [...] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
>> [...] R13: 00007f0bd751f9c0 R14: 00007f0bd751f700 R15: 0000000000000000
>> ---
>>
>> Thanks,
>> Yasuaki Ishimatsu
>>

This indeed looks like a problem.
We're going to great lengths to submit and complete I/O on the same CPU,
so if the CPU is offlined while I/O is in flight we won't be getting a
completion for this particular I/O.
However, the megasas driver should be able to cope with this situation;
after all, the firmware maintains completion queues, so it would be
dead easy to look at _other_ completion queues, too, if a timeout occurs.
Also the IRQ affinity looks bogus (we should spread IRQs to _all_ CPUs,
not just a subset), and the driver should make sure to receive
completions even if the respective CPUs are offlined.
Alternatively it should not try to submit a command abort via an
offlined CPU; that's guaranteed to run into the same problems.

So it looks more like a driver issue to me...
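
(To make the completion-queue point concrete, the pattern would be roughly
the one sketched below, with made-up names rather than actual megaraid_sas
code; the real driver loop is quoted in the reply that follows.)

	/* Sketch only: on an IO timeout, drain every reply queue, not just
	 * the one whose interrupt was affine to the now-offline CPU.
	 * 'struct my_ctrl' and my_process_completions() are hypothetical.
	 */
	struct my_ctrl {
		unsigned int reply_queue_count;
		/* ... */
	};

	void my_process_completions(struct my_ctrl *ctrl, unsigned int queue);

	static void my_timeout_poll_all_queues(struct my_ctrl *ctrl)
	{
		unsigned int q;

		for (q = 0; q < ctrl->reply_queue_count; q++)
			my_process_completions(ctrl, q);
	}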

Cheers,

Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

2017-09-13 11:36:02

by Kashyap Desai

Subject: RE: system hung up when offlining CPUs

>
> On 09/12/2017 08:15 PM, YASUAKI ISHIMATSU wrote:
> > + linux-scsi and maintainers of megasas
> >
> > When offlining CPU, I/O stops. Do you have any ideas?
> >
> > On 09/07/2017 04:23 PM, YASUAKI ISHIMATSU wrote:
> >> Hi Mark and Christoph,
> >>
> >> Sorry for the late reply. I appreciated that you fixed the issue on kvm
> environment.
> >> But the issue still occurs on physical server.
> >>
> >> Here ares irq information that I summarized megasas irqs from
> >> /proc/interrupts and /proc/irq/*/smp_affinity_list on my server:
> >>
> >> ---
> >> IRQ affinity_list IRQ_TYPE
> >> 42 0-5 IR-PCI-MSI 1048576-edge megasas
> >> 43 0-5 IR-PCI-MSI 1048577-edge megasas
> >> 44 0-5 IR-PCI-MSI 1048578-edge megasas
> >> 45 0-5 IR-PCI-MSI 1048579-edge megasas
> >> 46 0-5 IR-PCI-MSI 1048580-edge megasas
> >> 47 0-5 IR-PCI-MSI 1048581-edge megasas
> >> 48 0-5 IR-PCI-MSI 1048582-edge megasas
> >> 49 0-5 IR-PCI-MSI 1048583-edge megasas
> >> 50 0-5 IR-PCI-MSI 1048584-edge megasas
> >> 51 0-5 IR-PCI-MSI 1048585-edge megasas
> >> 52 0-5 IR-PCI-MSI 1048586-edge megasas
> >> 53 0-5 IR-PCI-MSI 1048587-edge megasas
> >> 54 0-5 IR-PCI-MSI 1048588-edge megasas
> >> 55 0-5 IR-PCI-MSI 1048589-edge megasas
> >> 56 0-5 IR-PCI-MSI 1048590-edge megasas
> >> 57 0-5 IR-PCI-MSI 1048591-edge megasas
> >> 58 0-5 IR-PCI-MSI 1048592-edge megasas
> >> 59 0-5 IR-PCI-MSI 1048593-edge megasas
> >> 60 0-5 IR-PCI-MSI 1048594-edge megasas
> >> 61 0-5 IR-PCI-MSI 1048595-edge megasas
> >> 62 0-5 IR-PCI-MSI 1048596-edge megasas
> >> 63 0-5 IR-PCI-MSI 1048597-edge megasas
> >> 64 0-5 IR-PCI-MSI 1048598-edge megasas
> >> 65 0-5 IR-PCI-MSI 1048599-edge megasas
> >> 66 24-29 IR-PCI-MSI 1048600-edge megasas
> >> 67 24-29 IR-PCI-MSI 1048601-edge megasas
> >> 68 24-29 IR-PCI-MSI 1048602-edge megasas
> >> 69 24-29 IR-PCI-MSI 1048603-edge megasas
> >> 70 24-29 IR-PCI-MSI 1048604-edge megasas
> >> 71 24-29 IR-PCI-MSI 1048605-edge megasas
> >> 72 24-29 IR-PCI-MSI 1048606-edge megasas
> >> 73 24-29 IR-PCI-MSI 1048607-edge megasas
> >> 74 24-29 IR-PCI-MSI 1048608-edge megasas
> >> 75 24-29 IR-PCI-MSI 1048609-edge megasas
> >> 76 24-29 IR-PCI-MSI 1048610-edge megasas
> >> 77 24-29 IR-PCI-MSI 1048611-edge megasas
> >> 78 24-29 IR-PCI-MSI 1048612-edge megasas
> >> 79 24-29 IR-PCI-MSI 1048613-edge megasas
> >> 80 24-29 IR-PCI-MSI 1048614-edge megasas
> >> 81 24-29 IR-PCI-MSI 1048615-edge megasas
> >> 82 24-29 IR-PCI-MSI 1048616-edge megasas
> >> 83 24-29 IR-PCI-MSI 1048617-edge megasas
> >> 84 24-29 IR-PCI-MSI 1048618-edge megasas
> >> 85 24-29 IR-PCI-MSI 1048619-edge megasas
> >> 86 24-29 IR-PCI-MSI 1048620-edge megasas
> >> 87 24-29 IR-PCI-MSI 1048621-edge megasas
> >> 88 24-29 IR-PCI-MSI 1048622-edge megasas
> >> 89 24-29 IR-PCI-MSI 1048623-edge megasas
> >> ---
> >>
> >> In my server, IRQ#66-89 are sent to CPU#24-29. And if I offline
> >> CPU#24-29, I/O does not work, showing the following messages.
> >>
> >> ---
> >> [...] sd 0:2:0:0: [sda] tag#1 task abort called for
> >> scmd(ffff8820574d7560) [...] sd 0:2:0:0: [sda] tag#1 CDB: Read(10) 28
> >> 00 0d e8 cf 78 00 00 08 00 [...] sd 0:2:0:0: task abort: FAILED
> >> scmd(ffff8820574d7560) [...] sd 0:2:0:0: [sda] tag#0 task abort
> >> called for scmd(ffff882057426560) [...] sd 0:2:0:0: [sda] tag#0 CDB:
> >> Write(10) 2a 00 0d 58 37 00 00 00 08 00 [...] sd 0:2:0:0: task abort:
> >> FAILED scmd(ffff882057426560) [...] sd 0:2:0:0: target reset called
> >> for scmd(ffff8820574d7560) [...] sd 0:2:0:0: [sda] tag#1 megasas:
> >> target
> reset FAILED!!
> >> [...] sd 0:2:0:0: [sda] tag#0 Controller reset is requested due to IO
> >> timeout
> >> [...] SCSI command pointer: (ffff882057426560) SCSI host state: 5
> >> SCSI
> >> [...] IO request frame:
> >> [...]
> >> <snip>
> >> [...]
> >> [...] megaraid_sas 0000:02:00.0: [ 0]waiting for 2 commands to
> >> complete for scsi0 [...] INFO: task auditd:1200 blocked for more than
> >> 120
> seconds.
> >> [...] Not tainted 4.13.0+ #15
> >> [...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
> >> [...] auditd D 0 1200 1 0x00000000
> >> [...] Call Trace:
> >> [...] __schedule+0x28d/0x890
> >> [...] schedule+0x36/0x80
> >> [...] io_schedule+0x16/0x40
> >> [...] wait_on_page_bit_common+0x109/0x1c0
> >> [...] ? page_cache_tree_insert+0xf0/0xf0 [...]
> >> __filemap_fdatawait_range+0x127/0x190
> >> [...] ? __filemap_fdatawrite_range+0xd1/0x100
> >> [...] file_write_and_wait_range+0x60/0xb0
> >> [...] xfs_file_fsync+0x67/0x1d0 [xfs] [...]
> >> vfs_fsync_range+0x3d/0xb0 [...] do_fsync+0x3d/0x70 [...]
> >> SyS_fsync+0x10/0x20 [...] entry_SYSCALL_64_fastpath+0x1a/0xa5
> >> [...] RIP: 0033:0x7f0bd9633d2d
> >> [...] RSP: 002b:00007f0bd751ed30 EFLAGS: 00000293 ORIG_RAX:
> >> 000000000000004a [...] RAX: ffffffffffffffda RBX: 00005590566d0080
> >> RCX: 00007f0bd9633d2d [...] RDX: 00005590566d1260 RSI:
> >> 0000000000000000 RDI: 0000000000000005 [...] RBP: 0000000000000000
> >> R08: 0000000000000000 R09: 0000000000000017 [...] R10:
> >> 0000000000000000 R11: 0000000000000293 R12: 0000000000000000 [...]
> >> R13: 00007f0bd751f9c0 R14: 00007f0bd751f700 R15: 0000000000000000
> >> ---
> >>
> >> Thanks,
> >> Yasuaki Ishimatsu
> >>
>
> This indeed looks like a problem.
> We're going to great lengths to submit and complete I/O on the same CPU,
> so
> if the CPU is offlined while I/O is in flight we won't be getting a
> completion for
> this particular I/O.
> However, the megasas driver should be able to cope with this situation;
> after
> all, the firmware maintains completions queues, so it would be dead easy
> to
> look at _other_ completions queues, too, if a timeout occurs.
In case of an IO timeout, the megaraid_sas driver checks the other queues as well.
That is why the IO completed in this case and further IOs resumed.

The driver completes commands with the code below, executed from
megasas_wait_for_outstanding_fusion():

	for (MSIxIndex = 0 ; MSIxIndex < count; MSIxIndex++)
		complete_cmd_fusion(instance, MSIxIndex);

Because the above code is executed in the driver, we see only one message like
the following in these logs:

	megaraid_sas 0000:02:00.0: [ 0]waiting for 2 commands to complete for scsi0

As per the link below, CPU hotplug should take care of this: "All interrupts
targeted to this CPU are migrated to a new CPU."
https://www.kernel.org/doc/html/v4.11/core-api/cpu_hotplug.html

BTW - we are also able to reproduce this issue locally. The reason for the IO
timeout is that the IO is completed, but the corresponding interrupt never
arrives on an online CPU; it may be missed because the CPU is in the transient
state of being offlined. I am not sure which component should take care of this.

Question - what happens once __cpu_disable is called and some of the queued
interrupts have affinity to that particular CPU?
I assume those pending/queued interrupts should ideally be migrated to the
remaining online CPUs. They should not go unhandled if we want to avoid such
IO timeouts.

Kashyap

> Also the IRQ affinity looks bogus (we should spread IRQs to _all_ CPUs,
> not
> just a subset), and the driver should make sure to receive completions
> even if
> the respective CPUs are offlined.
> Alternatively it should not try to submit a command abort via an offlined
> CPUs; that's guaranteed to run into the same problems.
>
> So it looks more like a driver issue to me...
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke Teamlead Storage & Networking
> [email protected] +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284
> (AG Nürnberg)

2017-09-13 13:33:20

by Thomas Gleixner

Subject: RE: system hung up when offlining CPUs

On Wed, 13 Sep 2017, Kashyap Desai wrote:
> > On 09/12/2017 08:15 PM, YASUAKI ISHIMATSU wrote:
> > > + linux-scsi and maintainers of megasas

> > >> In my server, IRQ#66-89 are sent to CPU#24-29. And if I offline
> > >> CPU#24-29, I/O does not work, showing the following messages.

....

> > This indeed looks like a problem.
> > We're going to great lengths to submit and complete I/O on the same CPU,
> > so
> > if the CPU is offlined while I/O is in flight we won't be getting a
> > completion for
> > this particular I/O.
> > However, the megasas driver should be able to cope with this situation;
> > after
> > all, the firmware maintains completions queues, so it would be dead easy
> > to
> > look at _other_ completions queues, too, if a timeout occurs.
> In case of IO timeout, megaraid_sas driver is checking other queues as well.
> That is why IO was completed in this case and further IOs were resumed.
>
> Driver complete commands as below code executed from
> megasas_wait_for_outstanding_fusion().
> for (MSIxIndex = 0 ; MSIxIndex < count; MSIxIndex++)
> complete_cmd_fusion(instance, MSIxIndex);
>
> Because of above code executed in driver, we see only one print as below in
> this logs.
> megaraid_sas 0000:02:00.0: [ 0]waiting for 2 commands to complete for scsi0
>
> As per below link CPU hotplug will take care- "All interrupts targeted to
> this CPU are migrated to a new CPU"
> https://www.kernel.org/doc/html/v4.11/core-api/cpu_hotplug.html
>
> BTW - We are also able reproduce this issue locally. Reason for IO timeout
> is -" IO is completed, but corresponding interrupt did not arrived on Online
> CPU. Either missed due to CPU is in transient state of being OFFLINED. I am
> not sure which component should take care this."
>
> Question - "what happens once __cpu_disable is called and some of the queued
> interrupt has affinity to that particular CPU ?"
> I assume ideally those pending/queued Interrupt should be migrated to
> remaining online CPUs. It should not be unhandled if we want to avoid such
> IO timeout.

Can you please provide the following information, before and after
offlining the last CPU in the affinity set:

# cat /proc/irq/$IRQNUM/smp_affinity_list
# cat /proc/irq/$IRQNUM/effective_affinity
# cat /sys/kernel/debug/irq/irqs/$IRQNUM

The last one requires: CONFIG_GENERIC_IRQ_DEBUGFS=y

Thanks,

tglx

2017-09-14 16:28:18

by YASUAKI ISHIMATSU

Subject: Re: system hung up when offlining CPUs



On 09/13/2017 09:33 AM, Thomas Gleixner wrote:
> On Wed, 13 Sep 2017, Kashyap Desai wrote:
>>> On 09/12/2017 08:15 PM, YASUAKI ISHIMATSU wrote:
>>>> + linux-scsi and maintainers of megasas
>
>>>>> In my server, IRQ#66-89 are sent to CPU#24-29. And if I offline
>>>>> CPU#24-29, I/O does not work, showing the following messages.
>
> ....
>
>>> This indeed looks like a problem.
>>> We're going to great lengths to submit and complete I/O on the same CPU,
>>> so
>>> if the CPU is offlined while I/O is in flight we won't be getting a
>>> completion for
>>> this particular I/O.
>>> However, the megasas driver should be able to cope with this situation;
>>> after
>>> all, the firmware maintains completions queues, so it would be dead easy
>>> to
>>> look at _other_ completions queues, too, if a timeout occurs.
>> In case of IO timeout, megaraid_sas driver is checking other queues as well.
>> That is why IO was completed in this case and further IOs were resumed.
>>
>> Driver complete commands as below code executed from
>> megasas_wait_for_outstanding_fusion().
>> for (MSIxIndex = 0 ; MSIxIndex < count; MSIxIndex++)
>> complete_cmd_fusion(instance, MSIxIndex);
>>
>> Because of above code executed in driver, we see only one print as below in
>> this logs.
>> megaraid_sas 0000:02:00.0: [ 0]waiting for 2 commands to complete for scsi0
>>
>> As per below link CPU hotplug will take care- "All interrupts targeted to
>> this CPU are migrated to a new CPU"
>> https://www.kernel.org/doc/html/v4.11/core-api/cpu_hotplug.html
>>
>> BTW - We are also able reproduce this issue locally. Reason for IO timeout
>> is -" IO is completed, but corresponding interrupt did not arrived on Online
>> CPU. Either missed due to CPU is in transient state of being OFFLINED. I am
>> not sure which component should take care this."
>>
>> Question - "what happens once __cpu_disable is called and some of the queued
>> interrupt has affinity to that particular CPU ?"
>> I assume ideally those pending/queued Interrupt should be migrated to
>> remaining online CPUs. It should not be unhandled if we want to avoid such
>> IO timeout.
>
> Can you please provide the following information, before and after
> offlining the last CPU in the affinity set:
>
> # cat /proc/irq/$IRQNUM/smp_affinity_list
> # cat /proc/irq/$IRQNUM/effective_affinity
> # cat /sys/kernel/debug/irq/irqs/$IRQNUM
>
> The last one requires: CONFIG_GENERIC_IRQ_DEBUGFS=y

Here is the information for one of the megasas IRQs:

- Before offlining CPU#24-29
/proc/irq/70/smp_affinity_list
24-29

/proc/irq/70/effective_affinity
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,3f000000

/sys/kernel/debug/irq/irqs/70
handler: handle_edge_irq
status: 0x00004000
istate: 0x00000000
ddepth: 0
wdepth: 0
dstate: 0x00609200
IRQD_ACTIVATED
IRQD_IRQ_STARTED
IRQD_MOVE_PCNTXT
IRQD_AFFINITY_SET
IRQD_AFFINITY_MANAGED
node: 1
affinity: 24-29
effectiv: 24-29
pending:
domain: INTEL-IR-MSI-0-2
hwirq: 0x100018
chip: IR-PCI-MSI
flags: 0x10
IRQCHIP_SKIP_SET_WAKE
parent:
domain: INTEL-IR-0
hwirq: 0x400000
chip: INTEL-IR
flags: 0x0
parent:
domain: VECTOR
hwirq: 0x46
chip: APIC
flags: 0x0

- After offlining CPU#24-29
/proc/irq/70/smp_affinity_list
29

/proc/irq/70/effective_affinity
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,20000000

/sys/kernel/debug/irq/irqs/70
handler: handle_edge_irq
status: 0x00004000
istate: 0x00000000
ddepth: 1
wdepth: 0
dstate: 0x00a39000
IRQD_IRQ_DISABLED
IRQD_IRQ_MASKED
IRQD_MOVE_PCNTXT
IRQD_AFFINITY_SET
IRQD_AFFINITY_MANAGED
IRQD_MANAGED_SHUTDOWN
node: 1
affinity: 29
effectiv: 29
pending:
domain: INTEL-IR-MSI-0-2
hwirq: 0x100018
chip: IR-PCI-MSI
flags: 0x10
IRQCHIP_SKIP_SET_WAKE
parent:
domain: INTEL-IR-0
hwirq: 0x400000
chip: INTEL-IR
flags: 0x0
parent:
domain: VECTOR
hwirq: 0x46
chip: APIC
flags: 0x0

Thanks,
Yasuaki Ishimatsu

>
> Thanks,
>
> tglx
>

2017-09-16 10:16:21

by Thomas Gleixner

Subject: Re: system hung up when offlining CPUs

On Thu, 14 Sep 2017, YASUAKI ISHIMATSU wrote:
> On 09/13/2017 09:33 AM, Thomas Gleixner wrote:
> >> Question - "what happens once __cpu_disable is called and some of the queued
> >> interrupt has affinity to that particular CPU ?"
> >> I assume ideally those pending/queued Interrupt should be migrated to
> >> remaining online CPUs. It should not be unhandled if we want to avoid such
> >> IO timeout.
> >
> > Can you please provide the following information, before and after
> > offlining the last CPU in the affinity set:
> >
> > # cat /proc/irq/$IRQNUM/smp_affinity_list
> > # cat /proc/irq/$IRQNUM/effective_affinity
> > # cat /sys/kernel/debug/irq/irqs/$IRQNUM
> >
> > The last one requires: CONFIG_GENERIC_IRQ_DEBUGFS=y
>
> Here are one irq's info of megasas:
>
> - Before offline CPU
> /proc/irq/70/smp_affinity_list
> 24-29
>
> /proc/irq/70/effective_affinity
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,3f000000
>
> /sys/kernel/debug/irq/irqs/70
> handler: handle_edge_irq
> status: 0x00004000
> istate: 0x00000000
> ddepth: 0
> wdepth: 0
> dstate: 0x00609200
> IRQD_ACTIVATED
> IRQD_IRQ_STARTED
> IRQD_MOVE_PCNTXT
> IRQD_AFFINITY_SET
> IRQD_AFFINITY_MANAGED

So this uses managed affinity, which means that once the last CPU in the
affinity mask goes offline, the interrupt is shut down by the irq core
code, which is the case:

> dstate: 0x00a39000
> IRQD_IRQ_DISABLED
> IRQD_IRQ_MASKED
> IRQD_MOVE_PCNTXT
> IRQD_AFFINITY_SET
> IRQD_AFFINITY_MANAGED
> IRQD_MANAGED_SHUTDOWN <---------------

So the irq core code works as expected, but something in the
driver/scsi/block stack seems to fiddle with that shut-down queue.

I can only tell you about the inner workings of the irq code; I have no
clue about the rest.
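
(For reference, the behaviour described above is implemented in
migrate_one_irq() in kernel/irq/cpuhotplug.c. A trimmed paraphrase of the
relevant branch, not the verbatim source, looks roughly like this:)

	/* Paraphrase of migrate_one_irq(): if no online CPU is left in the
	 * affinity mask and the interrupt is managed, it is shut down and
	 * marked IRQD_MANAGED_SHUTDOWN instead of being moved; non-managed
	 * interrupts have their affinity broken to any online CPU instead.
	 */
	if (cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids) {
		if (irqd_affinity_is_managed(d)) {
			irq_shutdown(desc);
			return false;
		}
		affinity = cpu_online_mask;
		brokeaff = true;
	}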

Thanks,

tglx


2017-09-16 15:03:00

by Thomas Gleixner

Subject: Re: system hung up when offlining CPUs

On Sat, 16 Sep 2017, Thomas Gleixner wrote:
> On Thu, 14 Sep 2017, YASUAKI ISHIMATSU wrote:
> > Here are one irq's info of megasas:
> >
> > - Before offline CPU
> > /proc/irq/70/smp_affinity_list
> > 24-29
> >
> > /proc/irq/70/effective_affinity
> > 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,3f000000
> >
> > /sys/kernel/debug/irq/irqs/70
> > handler: handle_edge_irq
> > status: 0x00004000
> > istate: 0x00000000
> > ddepth: 0
> > wdepth: 0
> > dstate: 0x00609200
> > IRQD_ACTIVATED
> > IRQD_IRQ_STARTED
> > IRQD_MOVE_PCNTXT
> > IRQD_AFFINITY_SET
> > IRQD_AFFINITY_MANAGED
>
> So this uses managed affinity, which means that once the last CPU in the
> affinity mask goes offline, the interrupt is shut down by the irq core
> code, which is the case:
>
> > dstate: 0x00a39000
> > IRQD_IRQ_DISABLED
> > IRQD_IRQ_MASKED
> > IRQD_MOVE_PCNTXT
> > IRQD_AFFINITY_SET
> > IRQD_AFFINITY_MANAGED
> > IRQD_MANAGED_SHUTDOWN <---------------
>
> So the irq core code works as expected, but something in the
> driver/scsi/block stack seems to fiddle with that shut down queue.
>
> I only can tell about the inner workings of the irq code, but I have no
> clue about the rest.

Though there is something wrong here:

> affinity: 24-29
> effectiv: 24-29

and after offlining:

> affinity: 29
> effectiv: 29

But that should be:

affinity: 24-29
effectiv: 29

because the irq core code preserves 'affinity'. It merely updates
'effective', which is where your interrupts are routed to.

Is the driver issuing any set_affinity() calls? If so, that's wrong.

Which driver are we talking about?

Thanks,

tglx