2023-09-28 15:42:18

by Wei Gong

[permalink] [raw]
Subject: [PATCH v3] genirq: avoid long loops in handle_edge_irq

When there are a large number of interrupts occurring on the tx
queue(irq smp_affinity=1) of the network card, changing the CPU
affinity of the tx queue (echo 2 > /proc/irq/xx/smp_affinity)
will cause handle_edge_irq to loop for a long time in the
do {} while() loop.

After setting the IRQ CPU affinity, the next interrupt will only
be activated when it arrives. Therefore, the next interrupt will
still be on CPU 0. When a new CPU affinity is activated on CPU 0,
subsequent interrupts will be processed on CPU 1.

cpu 0 cpu 1
- handle_edge_irq
- apic_ack_irq
- irq_do_set_affinity
- handle_edge_irq
- do {
- handle_irq_event
- istate &= ~IRQS_PENDIN
- IRQD_IRQ_INPROGRESS
- spin_unlock()
- spin_lock()
- istate |= IRQS_PENDIN
- handle_irq_event_percpu - mask_ack_irq()
- spin_unlock()
- spin_unlock

} while(IRQS_PENDIN &&
!irq_disable)

Therefore, when determining whether to continue looping, we add a check
to see if the current CPU belongs to the affinity table of the interrupt.

Signed-off-by: Wei Gong <[email protected]>
---
kernel/irq/chip.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index dc94e0bf2c94..a457490bd965 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -831,7 +831,9 @@ void handle_edge_irq(struct irq_desc *desc)
handle_irq_event(desc);

} while ((desc->istate & IRQS_PENDING) &&
- !irqd_irq_disabled(&desc->irq_data));
+ !irqd_irq_disabled(&desc->irq_data) &&
+ cpumask_test_cpu(smp_processor_id(),
+ irq_data_get_effective_affinity_mask(&desc->irq_data)));

out_unlock:
raw_spin_unlock(&desc->lock);
--
2.32.1 (Apple Git-133)


2023-10-09 14:32:36

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3] genirq: avoid long loops in handle_edge_irq

On Thu, Sep 28 2023 at 18:06, Wei Gong wrote:
> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -831,7 +831,9 @@ void handle_edge_irq(struct irq_desc *desc)
> handle_irq_event(desc);
>
> } while ((desc->istate & IRQS_PENDING) &&
> - !irqd_irq_disabled(&desc->irq_data));
> + !irqd_irq_disabled(&desc->irq_data) &&
> + cpumask_test_cpu(smp_processor_id(),
> + irq_data_get_effective_affinity_mask(&desc->irq_data)));

Ok. So now that mask part is correct, but what guarantees that this does
not lose interrupts?

Assume the following scenario:

CPU 0 CPU 1

interrupt
set IN_PROGRESS
do {
change_affinity_to(CPU1);
handle_irq_event()
ack_in_device()
interrupt
set PENDING
} while (COND)

Now $COND is not true due to the affinity change and the edge handler
returns. As a consequence nothing acks the device and no further
interrupts are sent by the device.

That might not be true for your case, but that's a generic function and the
zoo of hardware which uses that is massive.

So no, we are not taking a risk here.

Thanks,

tglx


2023-10-12 13:40:15

by Wei Gong

[permalink] [raw]
Subject: Re: [PATCH v3] genirq: avoid long loops in handle_edge_irq

O Mon, Oct 09, 2023 at 04:32:10PM +0200, Thomas Gleixner wrote:
> On Thu, Sep 28 2023 at 18:06, Wei Gong wrote:
> > --- a/kernel/irq/chip.c
> > +++ b/kernel/irq/chip.c
> > @@ -831,7 +831,9 @@ void handle_edge_irq(struct irq_desc *desc)
> > handle_irq_event(desc);
> >
> > } while ((desc->istate & IRQS_PENDING) &&
> > - !irqd_irq_disabled(&desc->irq_data));
> > + !irqd_irq_disabled(&desc->irq_data) &&
> > + cpumask_test_cpu(smp_processor_id(),
> > + irq_data_get_effective_affinity_mask(&desc->irq_data)));
>
> Ok. So now that mask part is correct, but what guarantees that this does
> not lose interrupts?
>
> Assume the following scenario:
>
> CPU 0 CPU 1
>
> interrupt
> set IN_PROGRESS
> do {
> change_affinity_to(CPU1);
> handle_irq_event()
> ack_in_device()
> interrupt
> set PENDING
> } while (COND)
>
> Now $COND is not true due to the affinity change and the edge handler
> returns. As a consequence nothing acks the device and no further
> interrupts are sent by the device.
>
> That might not be true for your case, but that's a generic function and the
> zoo of hardware which uses that is massive.
>
> So no, we are not taking a risk here.
>
> Thanks,
>
> tglx
>
>
By maintaining the original loop exit condition, if a mask mismatch is
detected within the loop, we will not perform the unmask_irq operation.
Instead, we will wait until the loop exits before executing unmask_irq.
Could this approach potentially solve the issue of lost interrupts?

Thanks,
Wei Gong

2023-10-13 08:44:50

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3] genirq: avoid long loops in handle_edge_irq

On Thu, Oct 12 2023 at 21:39, Wei Gong wrote:
> O Mon, Oct 09, 2023 at 04:32:10PM +0200, Thomas Gleixner wrote:
>> Now $COND is not true due to the affinity change and the edge handler
>> returns. As a consequence nothing acks the device and no further
>> interrupts are sent by the device.
>>
>> That might not be true for your case, but that's a generic function and the
>> zoo of hardware which uses that is massive.
>>
>> So no, we are not taking a risk here.
>>
>> Thanks,
>>
>> tglx
>>
>>
> By maintaining the original loop exit condition, if a mask mismatch is
> detected within the loop, we will not perform the unmask_irq operation.
> Instead, we will wait until the loop exits before executing unmask_irq.
> Could this approach potentially solve the issue of lost interrupts?

How so exactly?

2023-10-19 07:02:50

by Wei Gong

[permalink] [raw]
Subject: Re: [PATCH v3] genirq: avoid long loops in handle_edge_irq

O Fri, Oct 13, 2023 at 10:44:36AM +0200, Thomas Gleixner wrote:
> On Thu, Oct 12 2023 at 21:39, Wei Gong wrote:
> > O Mon, Oct 09, 2023 at 04:32:10PM +0200, Thomas Gleixner wrote:
> >> Now $COND is not true due to the affinity change and the edge handler
> >> returns. As a consequence nothing acks the device and no further
> >> interrupts are sent by the device.
> >>
> >> That might not be true for your case, but that's a generic function and the
> >> zoo of hardware which uses that is massive.
> >>
> >> So no, we are not taking a risk here.
> >>
> >> Thanks,
> >>
> >> tglx
> >>
> >>
> > By maintaining the original loop exit condition, if a mask mismatch is
> > detected within the loop, we will not perform the unmask_irq operation.
> > Instead, we will wait until the loop exits before executing unmask_irq.
> > Could this approach potentially solve the issue of lost interrupts?
>
> How so exactly?
>


Attachments:
(No filename) (958.00 B)
p.diff.txt (857.00 B)
Download all attachments

2023-10-19 08:29:20

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3] genirq: avoid long loops in handle_edge_irq

On Thu, Oct 19 2023 at 15:02, Wei Gong wrote:
> O Fri, Oct 13, 2023 at 10:44:36AM +0200, Thomas Gleixner wrote:
>> > By maintaining the original loop exit condition, if a mask mismatch is
>> > detected within the loop, we will not perform the unmask_irq operation.
>> > Instead, we will wait until the loop exits before executing unmask_irq.
>> > Could this approach potentially solve the issue of lost interrupts?
>>
>> How so exactly?
>>
> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
> index 8f8f1d62f..b846659ce 100644
> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -823,7 +823,9 @@ void handle_edge_irq(struct irq_desc *desc)
> */
> if (unlikely(desc->istate & IRQS_PENDING)) {
> if (!irqd_irq_disabled(&desc->irq_data) &&
> - irqd_irq_masked(&desc->irq_data))
> + irqd_irq_masked(&desc->irq_data) &&
> + cpumask_test_cpu(smp_processor_id(),
> + irq_data_get_effective_affinity_mask(&desc->irq_data)))
> unmask_irq(desc);
> }
>
> @@ -832,6 +834,10 @@ void handle_edge_irq(struct irq_desc *desc)
> } while ((desc->istate & IRQS_PENDING) &&
> !irqd_irq_disabled(&desc->irq_data));
>
> +if (!irqd_irq_disabled(&desc->irq_data) &&
> + irqd_irq_masked(&desc->irq_data))
> + unmask_irq(desc);

What is this supposed to solve? The last interrupt has been delivered
already. It's the one which sets the PENDING bit.

Again:

CPU 0 CPU 1

interrupt #1
set IN_PROGRESS
do {
change_affinity_to(CPU1);
handle_irq_event()
ack_in_device(interrupt #1)

interrupt #2
set PENDING
mask();
} while (COND)

unmask();

The unmask does not make the interrupt #2 magically reappear. This is
edge, which is a fire and forget mechanism. That's why we have the
PENDING bit logic for edge to ensure that no interrupt is lost.

With your change interrupt #2 is lost forever.

So if the device depends on ack_in_device() for being able to send the
next interrupt, then the lack of ack_in_device() for interrupt #2 makes
it stale.

It may well be that your particular device does not need the
ack_in_device() sequence, but this is generic core code which has to
work for everything.

I'm still having a hard time to understand why this is such a big
issue at all. What's the practical impact of processing a bunch of
interrupts on CPU0 after changing the affinity to CPU1?

It's obviously suboptimal in terms of locality, but that's a temporary
problem which resolves itself. It's suboptimal, but correct behaviour.

Your attempts of "fixing" it at the core edge handler level result in a
fundamental correctness problem.

There are certainly ways to solve it at that level, but for that you
have to fully understand and accept the fundamental properties and
intricacies of edge triggered interrupts in the first place.

Whether the resulting complexity is worth it, is a different
question. As long as there is not a compelling reason to do so, the
answer is simply no.

Thanks,

tglx

2023-10-19 09:37:34

by Wei Gong

[permalink] [raw]
Subject: Re: [PATCH v3] genirq: avoid long loops in handle_edge_irq

O Thu, Oct 19, 2023 at 10:28:58AM +0200, Thomas Gleixner wrote:
> On Thu, Oct 19 2023 at 15:02, Wei Gong wrote:
> > O Fri, Oct 13, 2023 at 10:44:36AM +0200, Thomas Gleixner wrote:
> >> > By maintaining the original loop exit condition, if a mask mismatch is
> >> > detected within the loop, we will not perform the unmask_irq operation.
> >> > Instead, we will wait until the loop exits before executing unmask_irq.
> >> > Could this approach potentially solve the issue of lost interrupts?
> >>
> >> How so exactly?
> >>
> > diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
> > index 8f8f1d62f..b846659ce 100644
> > --- a/kernel/irq/chip.c
> > +++ b/kernel/irq/chip.c
> > @@ -823,7 +823,9 @@ void handle_edge_irq(struct irq_desc *desc)
> > */
> > if (unlikely(desc->istate & IRQS_PENDING)) {
> > if (!irqd_irq_disabled(&desc->irq_data) &&
> > - irqd_irq_masked(&desc->irq_data))
> > + irqd_irq_masked(&desc->irq_data) &&
> > + cpumask_test_cpu(smp_processor_id(),
> > + irq_data_get_effective_affinity_mask(&desc->irq_data)))
> > unmask_irq(desc);
> > }
> >
> > @@ -832,6 +834,10 @@ void handle_edge_irq(struct irq_desc *desc)
> > } while ((desc->istate & IRQS_PENDING) &&
> > !irqd_irq_disabled(&desc->irq_data));
> >
> > +if (!irqd_irq_disabled(&desc->irq_data) &&
> > + irqd_irq_masked(&desc->irq_data))
> > + unmask_irq(desc);
>
> What is this supposed to solve? The last interrupt has been delivered
> already. It's the one which sets the PENDING bit.
>
> Again:
>
> CPU 0 CPU 1
>
> interrupt #1
> set IN_PROGRESS
> do {
> change_affinity_to(CPU1);
> handle_irq_event()
> ack_in_device(interrupt #1)
>
> interrupt #2
> set PENDING
> mask();
> } while (COND)
>
> unmask();
>
> The unmask does not make the interrupt #2 magically reappear. This is
> edge, which is a fire and forget mechanism. That's why we have the
> PENDING bit logic for edge to ensure that no interrupt is lost.
>
> With your change interrupt #2 is lost forever.
>
> So if the device depends on ack_in_device() for being able to send the
> next interrupt, then the lack of ack_in_device() for interrupt #2 makes
> it stale.
>
> It may well be that your particular device does not need the
> ack_in_device() sequence, but this is generic core code which has to
> work for everything.
>
> I'm still having a hard time to understand why this is such a big
> issue at all. What's the practical impact of processing a bunch of
> interrupts on CPU0 after changing the affinity to CPU1?
>
> It's obviously suboptimal in terms of locality, but that's a temporary
> problem which resolves itself. It's suboptimal, but correct behaviour.
>
> Your attempts of "fixing" it at the core edge handler level result in a
> fundamental correctness problem.
>
> There are certainly ways to solve it at that level, but for that you
> have to fully understand and accept the fundamental properties and
> intricacies of edge triggered interrupts in the first place.
>
> Whether the resulting complexity is worth it, is a different
> question. As long as there is not a compelling reason to do so, the
> answer is simply no.
>
> Thanks,
>
> tglx

I wholeheartedly concur with your viewpoint.

The problem I'm facing is that when a multitude of interrupts are
processed on CPU0, it results in CPU0 being extensively occupied.
This prevents it from addressing other tasks, thereby leading to a softlockup.

92266.035199] NMI backtrace for cpu 1
[92266.035200] CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Tainted: G
[92266.035201] RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0
[92266.035203] Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4
09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8
b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00
[92266.035203] RSP: 0018:ff6012b446364e78 EFLAGS: 00000002
[92266.035204] RAX: 0000000000000101 RBX: ff41817bd9738e00 RCX: 00000000ffffffff
[92266.035204] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff41817bd9738ea4
[92266.035205] RBP: ff41817bd9738ea4 R08: 00006cd77aeca1ca R09: 0000000000000000
[92266.035205] R10: 0000000000000000 R11: 0000000000000000 R12: ff41817bd9738e00
[92266.035206] R13: 0000000000000028 R14: 0000000000000000 R15: 0000000000000000
[92266.035206] FS: 0000000000000000(0000) GS:ff41818a49640000(0000) knlGS:0000000000000000
[92266.035206] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[92266.035207] CR2: 00007f9592422220 CR3: 0000000dc3010006 CR4: 0000000000771ee0
[92266.035207] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[92266.035208] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[92266.035208] PKRU: 55555554
[92266.035208] Call Trace:
[92266.035208] <IRQ>
[92266.035209] _raw_spin_lock+0x1c/0x20
[92266.035209] handle_edge_irq+0x19/0x190
[92266.035209] handle_irq+0x1c/0x30
[92266.035210] do_IRQ+0x49/0xd0
[92266.035210] common_interrupt+0xf/0xf
[92266.035210] RIP: 0010:__do_softirq+0x72/0x2d6
[92266.035211] Code: 12 52 ff 65 81 05 b2 5b 81 75 00 01 00 00 c7 44 24 10 0a 00
00 00 48 c7 c0 80 99 02 00 65 66 c7 00 00 00 fb 66 0f 1f 44 00 00 <b8> ff ff ff
ff 48 c7 c3 00 61 20 8b 41 0f bc c7 89 c6 83 c6 01 89
[92266.035212] RSP: 0018:ff6012b446364f98 EFLAGS: 00000286 ORIG_RAX: ffffffffffffffd7
[92266.035212] RAX: 0000000000029980 RBX: 0000000000000000 RCX: 00000000ffffffff
[92266.035213] RDX: 00000000000000ed RSI: 0000000000000254 RDI: ffffffff8c601040
[92266.035213] RBP: 0000000000000000 R08: 00006cd740f1f9b6 R09: 0000000000000000
[92266.035214] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[92266.035214] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000080
[92266.035215] ? common_interrupt+0xa/0xf
[92266.035215] ? __do_softirq+0x4b/0x2d6
[92266.035215] irq_exit+0xf7/0x100
[92266.035216] reschedule_interrupt+0xf/0x20
[92266.035216] </IRQ>
[92266.035216] RIP: 0010:native_safe_halt+0xe/0x10
[92266.035217] Code: ff ff 7f c3 65 48 8b 04 25 40 5c 01 00 f0 80 48 02 20 48 8b 00 a8
08 75 c4 eb 80 90 e9 07 00 00 00 0f 00 2d 76 9c 48 00 fb f4 <c3> 90 e9 07 00 00 00 0f
00 2d 66 9c 48 00 f4 c3 90 90 0f 1f 44 00
[92266.035217] RSP: 0018:ff6012b4462b7ea0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff02
[92266.035218] RAX: ffffffff8a57fd80 RBX: 0000000000000001 RCX: 0000000000000000
[92266.035218] RDX: 000000000ab164e2 RSI: 0000000000000087 RDI: 0000000000000001
[92266.035219] RBP: 0000000000000001 R08: ff41818a4965d8a0 R09: 0000000000000000
[92266.035219] R10: 0000000000000001 R11: 0000000000000000 R12: ffffffffffffffff
[92266.035220] R13: 0000000000000000 R14: 0000000000000000 R15: ff41817bc6855ac0
[92266.035220] ? __sched_text_end+0x7/0x7
[92266.035220] default_idle+0xa/0x10
[92266.035221] default_idle_call+0x40/0xf0
[92266.035221] do_idle+0x1f4/0x260
[92266.035221] cpu_startup_entry+0x6f/0x80
[92266.035221] start_secondary+0x19b/0x1e0
[92266.035222] secondary_startup_64_no_verify+0xc2/0xcb
[92266.035223] NMI backtrace for cpu 0
[92266.035223] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Tainted: G
[92266.035224] RIP: 0010:__pci_msix_desc_mask_irq+0x3b/0x40
[92266.035225] Code: f6 47 55 02 75 24 0f b7 57 56 c1 e2 04 48 63 d2 48 03 57 60
74 14 8b 47 50 83 e0 fe 89 c1 83 c9 01 83 e6 01 0f 45 c1 89 42 0c <c3> 0f 1f 40
00 0f 1f 44 00 00 53 48 8b 47 10 48 8b 58 10 f6 43 54
[92266.035225] RSP: 0018:ff6012b440003f80 EFLAGS: 00000046
[92266.035226] RAX: 0000000000000000 RBX: ff41817bc6ca3c00 RCX: 0000000000000001
[92266.035226] RDX: ff6012b4464a5020 RSI: 0000000000000000 RDI: ff41817bc6ca3c00
[92266.035226] RBP: ff41817bd9738ea4 R08: 000000000000da73 R09: 0000000000007ffe
[92266.035226] R10: 0000000000000068 R11: 000000000000004c R12: ff41817bd9738e00
[92266.035227] R13: 0000000000000024 R14: 0000000000000000 R15: 0000000000000000
[92266.035227] FS: 0000000000000000(0000) GS:ff41818a49600000(0000) knlGS:0000000000000000
[92266.035227] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[92266.035227] CR2: 00007fcfde8a40c0 CR3: 0000000dc3010004 CR4: 0000000000771ef0
[92266.035228] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[92266.035228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[92266.035228] PKRU: 55555554
[92266.035228] Call Trace:
[92266.035229] <IRQ>
[92266.035229] msi_set_mask_bit+0x1c/0x50
[92266.035229] unmask_irq.part.42+0x1f/0x30
[92266.035229] handle_edge_irq+0xdb/0x190
[92266.035229] handle_irq+0x1c/0x30
[92266.035230] do_IRQ+0x49/0xd0
[92266.035230] common_interrupt+0xf/0xf
[92266.035230] </IRQ>
[92266.035230] RIP: 0010:native_safe_halt+0xe/0x10
[92266.035231] Code: ff ff 7f c3 65 48 8b 04 25 40 5c 01 00 f0 80 48 02 20 48 8b 00
a8 08 75 c4 eb 80 90 e9 07 00 00 00 0f 00 2d 76 9c 48 00 fb f4 <c3> 90 e9 07 00 00
00 0f 00 2d 66 9c 48 00 f4 c3 90 90 0f 1f 44 00
[92266.035231] RSP: 0018:ffffffff8b203e80 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffdb
[92266.035231] RAX: ffffffff8a57fd80 RBX: 0000000000000000 RCX: ff41817bfb4f5058
[92266.035232] RDX: 000000005a65557a RSI: 0000000000000000 RDI: 000053e53530b240
[92266.035232] RBP: 0000000000000000 R08: 00006cce9cd00c88 R09: ff41817d0c085b40
[92266.035232] R10: 0000000000000007 R11: ff41817d9ad7505b R12: ffffffffffffffff
[92266.035233] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffff8b218840
[92266.035233] ? __sched_text_end+0x7/0x7
[92266.035233] default_idle+0xa/0x10
[92266.035233] default_idle_call+0x40/0xf0
[92266.035234] do_idle+0x1f4/0x260
[92266.035234] cpu_startup_entry+0x6f/0x80
[92266.035234] start_kernel+0x51d/0x53d
[92266.035234] secondary_startup_64_no_verify+0xc2/0xcb
[92292.771711] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [swapper/1:0]

Thanks,

Wei Gong