2018-10-18 13:24:48

by Lukas Wunner

Subject: [PATCH] genirq: Fix race on spurious interrupt detection

Commit 1e77d0a1ed74 ("genirq: Sanitize spurious interrupt detection of
threaded irqs") made detection of spurious interrupts work for threaded
handlers by:

a) incrementing a counter every time the thread returns IRQ_HANDLED, and
b) checking whether that counter has increased every time the thread is
woken.

However for oneshot interrupts, the commit unmasks the interrupt before
incrementing the counter. If another interrupt occurs right after
unmasking but before the counter is incremented, that interrupt is
incorrectly considered spurious:

time
| irq_thread()
| irq_thread_fn()
| action->thread_fn()
| irq_finalize_oneshot()
| unmask_threaded_irq() /* interrupt is unmasked */
|
| /* interrupt fires, incorrectly deemed spurious */
|
| atomic_inc(&desc->threads_handled); /* counter is incremented */
v

I am seeing this with a hi3110 CAN controller receiving data at high
volume (from a separate machine sending with "cangen -g 0 -i -x"):
The controller signals a huge number of interrupts (hundreds of millions
per day) and every second there are about a dozen which are deemed
spurious. The issue is benign in this case, mostly just an irritation,
but I'm worried that under high CPU load and in the presence of higher
priority tasks, the number of incorrectly detected spurious interrupts
might increase beyond the 99,900 threshold and cause disablement of the
IRQ.

Fix this by moving the counter increment from irq_thread() to
irq_thread_fn() and irq_forced_thread_fn(), before the invocation of
irq_finalize_oneshot().

Fixes: 1e77d0a1ed74 ("genirq: Sanitize spurious interrupt detection of threaded irqs")
Signed-off-by: Lukas Wunner <[email protected]>
Cc: [email protected] # v3.16+
---
kernel/irq/manage.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index fb86146037a7..9dbdccab3b6a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -927,6 +927,9 @@ irq_forced_thread_fn(struct irq_desc *desc, struct irqaction *action)

local_bh_disable();
ret = action->thread_fn(action->irq, action->dev_id);
+ if (ret == IRQ_HANDLED)
+ atomic_inc(&desc->threads_handled);
+
irq_finalize_oneshot(desc, action);
local_bh_enable();
return ret;
@@ -943,6 +946,9 @@ static irqreturn_t irq_thread_fn(struct irq_desc *desc,
irqreturn_t ret;

ret = action->thread_fn(action->irq, action->dev_id);
+ if (ret == IRQ_HANDLED)
+ atomic_inc(&desc->threads_handled);
+
irq_finalize_oneshot(desc, action);
return ret;
}
@@ -1020,8 +1026,6 @@ static int irq_thread(void *data)
irq_thread_check_affinity(desc, action);

action_ret = handler_fn(desc, action);
- if (action_ret == IRQ_HANDLED)
- atomic_inc(&desc->threads_handled);
if (action_ret == IRQ_WAKE_THREAD)
irq_wake_secondary(desc, action);

--
2.19.1



2018-10-19 14:33:34

by Thomas Gleixner

Subject: Re: [PATCH] genirq: Fix race on spurious interrupt detection

On Thu, 18 Oct 2018, Lukas Wunner wrote:
> Commit 1e77d0a1ed74 ("genirq: Sanitize spurious interrupt detection of
> threaded irqs") made detection of spurious interrupts work for threaded
> handlers by:
>
> a) incrementing a counter every time the thread returns IRQ_HANDLED, and
> b) checking whether that counter has increased every time the thread is
> woken.
>
> However for oneshot interrupts, the commit unmasks the interrupt before
> incrementing the counter. If another interrupt occurs right after
> unmasking but before the counter is incremented, that interrupt is
> incorrectly considered spurious:
>
> time
> | irq_thread()
> | irq_thread_fn()
> | action->thread_fn()
> | irq_finalize_oneshot()
> | unmask_threaded_irq() /* interrupt is unmasked */
> |
> | /* interrupt fires, incorrectly deemed spurious */
> |
> | atomic_inc(&desc->threads_handled); /* counter is incremented */
> v
>
> I am seeing this with a hi3110 CAN controller receiving data at high
> volume (from a separate machine sending with "cangen -g 0 -i -x"):
> The controller signals a huge number of interrupts (hundreds of millions
> per day) and every second there are about a dozen which are deemed
> spurious. The issue is benign in this case, mostly just an irritation,
> but I'm worried that under high CPU load and in the presence of higher
> priority tasks, the number of incorrectly detected spurious interrupts
> might increase beyond the 99,900 threshold and cause disablement of the
> IRQ.

I doubt that this can happen in reality, so I'd rather reword that
paragraph slightly:

In theory, with high CPU load and in the presence of higher priority tasks,
the number of incorrectly detected spurious interrupts might increase beyond
the 99,900 threshold and cause disablement of the interrupt.

In practice it just increments the spurious interrupt count. But that can
cause people to waste time investigating it over and over.

Hmm?

Thanks,

tglx

2018-10-19 15:26:03

by Lukas Wunner

Subject: Re: [PATCH] genirq: Fix race on spurious interrupt detection

On Fri, Oct 19, 2018 at 04:31:30PM +0200, Thomas Gleixner wrote:
> On Thu, 18 Oct 2018, Lukas Wunner wrote:
> > Commit 1e77d0a1ed74 ("genirq: Sanitize spurious interrupt detection of
> > threaded irqs") made detection of spurious interrupts work for threaded
> > handlers by:
> >
> > a) incrementing a counter every time the thread returns IRQ_HANDLED, and
> > b) checking whether that counter has increased every time the thread is
> > woken.
> >
> > However for oneshot interrupts, the commit unmasks the interrupt before
> > incrementing the counter. If another interrupt occurs right after
> > unmasking but before the counter is incremented, that interrupt is
> > incorrectly considered spurious:
> >
> > time
> > | irq_thread()
> > | irq_thread_fn()
> > | action->thread_fn()
> > | irq_finalize_oneshot()
> > | unmask_threaded_irq() /* interrupt is unmasked */
> > |
> > | /* interrupt fires, incorrectly deemed spurious */
> > |
> > | atomic_inc(&desc->threads_handled); /* counter is incremented */
> > v
> >
> > I am seeing this with a hi3110 CAN controller receiving data at high
> > volume (from a separate machine sending with "cangen -g 0 -i -x"):
> > The controller signals a huge number of interrupts (hundreds of millions
> > per day) and every second there are about a dozen which are deemed
> > spurious. The issue is benign in this case, mostly just an irritation,
> > but I'm worried that under high CPU load and in the presence of higher
> > priority tasks, the number of incorrectly detected spurious interrupts
> > might increase beyond the 99,900 threshold and cause disablement of the
> > IRQ.
>
> I doubt that this can happen in reality, so I'd rather reword that
> paragraph slightly:
>
> In theory, with high CPU load and in the presence of higher priority tasks,
> the number of incorrectly detected spurious interrupts might increase beyond
> the 99,900 threshold and cause disablement of the interrupt.
>
> In practice it just increments the spurious interrupt count. But that can
> cause people to waste time investigating it over and over.
>
> Hmm?

Sure, fine by me. Would you prefer me to resend with that change or
can you fold it in when applying?

FWIW I did manage to reach the 99,900 threshold once because I had
added copious amounts of printk() to the hi3110 IRQ thread to debug
another issue. But I never experienced that without those printk()'s.
Here's the resulting splat:

irq 194: nobody cared (try booting with the "irqpoll" option)
CPU: 0 PID: 1929 Comm: candump Tainted: G O 4.9.76-rt60-v7+ #1
Hardware name: BCM2835
[<8011106c>] (unwind_backtrace) from [<8010cdd8>] (show_stack+0x20/0x24)
[<8010cdd8>] (show_stack) from [<8047cb2c>] (dump_stack+0xc8/0x10c)
[<8047cb2c>] (dump_stack) from [<8018192c>] (__report_bad_irq+0x3c/0xdc)
[<8018192c>] (__report_bad_irq) from [<80181d94>] (note_interrupt+0x29c/0x2ec)
[<80181d94>] (note_interrupt) from [<8017ec9c>] (handle_irq_event_percpu+0x78/0x84)
[<8017ec9c>] (handle_irq_event_percpu) from [<8017ed20>] (handle_irq_event+0x78/0xbc)
[<8017ed20>] (handle_irq_event) from [<80182ad8>] (handle_edge_irq+0x13c/0x1e8)
[<80182ad8>] (handle_edge_irq) from [<8017db64>] (generic_handle_irq+0x34/0x44)
[<8017db64>] (generic_handle_irq) from [<804ae1a0>] (bcm2835_gpio_irq_handle_bank+0x88/0xac)
[<804ae1a0>] (bcm2835_gpio_irq_handle_bank) from [<804ae2ac>] (bcm2835_gpio_irq_handler+0xe8/0x154)
[<804ae2ac>] (bcm2835_gpio_irq_handler) from [<8017db64>] (generic_handle_irq+0x34/0x44)
[<8017db64>] (generic_handle_irq) from [<804a7720>] (bcm2836_chained_handle_irq+0x38/0x50)
[<804a7720>] (bcm2836_chained_handle_irq) from [<8017db64>] (generic_handle_irq+0x34/0x44)
[<8017db64>] (generic_handle_irq) from [<8017e144>] (__handle_domain_irq+0x6c/0xc4)
[<8017e144>] (__handle_domain_irq) from [<8010155c>] (bcm2836_arm_irqchip_handle_irq+0xac/0xb0)
[<8010155c>] (bcm2836_arm_irqchip_handle_irq) from [<80775dec>] (__irq_usr+0x4c/0x60)
Exception stack(0xb6b15fb0 to 0xb6b15ff8)
5fa0: 76ec1d50 0000000a 011f8026 fbad2aa4
5fc0: 76ec1d50 0000000a 76f1a000 000263bc 00000000 000263fc 00000001 7ec3f470
5fe0: 00000444 7ec3f2e8 76de5b90 76def528 40000010 ffffffff
handlers:
[<8017edc0>] irq_default_primary_handler threaded [<7f37c734>] hi3110_can_ist [hi311x]
Disabling IRQ #194

Thanks,

Lukas

2018-10-19 15:29:06

by Thomas Gleixner

Subject: Re: [PATCH] genirq: Fix race on spurious interrupt detection

On Fri, 19 Oct 2018, Lukas Wunner wrote:
> On Fri, Oct 19, 2018 at 04:31:30PM +0200, Thomas Gleixner wrote:
> > I doubt that this can happen in reality, so I'd rather reword that
> > paragraph slightly:
> >
> > In theory, with high CPU load and in the presence of higher priority tasks,
> > the number of incorrectly detected spurious interrupts might increase beyond
> > the 99,900 threshold and cause disablement of the interrupt.
> >
> > In practice it just increments the spurious interrupt count. But that can
> > cause people to waste time investigating it over and over.
> >
> > Hmm?
>
> Sure, fine by me. Would you prefer me to resend with that change or
> can you fold it in when applying?

I'll fold it. No problem.

> FWIW I did manage to reach the 99,900 threshold once because I had
> added copious amounts of printk() to the hi3110 IRQ thread to debug
> another issue. But I never experienced that without those printk()'s.

Cute.

Thanks

tglx

Subject: [tip:irq/core] genirq: Fix race on spurious interrupt detection

Commit-ID: 746a923b863a1065ef77324e1e43f19b1a3eab5c
Gitweb: https://git.kernel.org/tip/746a923b863a1065ef77324e1e43f19b1a3eab5c
Author: Lukas Wunner <[email protected]>
AuthorDate: Thu, 18 Oct 2018 15:15:05 +0200
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 19 Oct 2018 17:31:00 +0200

genirq: Fix race on spurious interrupt detection

Commit 1e77d0a1ed74 ("genirq: Sanitize spurious interrupt detection of
threaded irqs") made detection of spurious interrupts work for threaded
handlers by:

a) incrementing a counter every time the thread returns IRQ_HANDLED, and
b) checking whether that counter has increased every time the thread is
woken.

However for oneshot interrupts, the commit unmasks the interrupt before
incrementing the counter. If another interrupt occurs right after
unmasking but before the counter is incremented, that interrupt is
incorrectly considered spurious:

time
| irq_thread()
| irq_thread_fn()
| action->thread_fn()
| irq_finalize_oneshot()
| unmask_threaded_irq() /* interrupt is unmasked */
|
| /* interrupt fires, incorrectly deemed spurious */
|
| atomic_inc(&desc->threads_handled); /* counter is incremented */
v

This is observed with a hi3110 CAN controller receiving data at high volume
(from a separate machine sending with "cangen -g 0 -i -x"): The controller
signals a huge number of interrupts (hundreds of millions per day) and
every second there are about a dozen which are deemed spurious.

In theory with high CPU load and the presence of higher priority tasks, the
number of incorrectly detected spurious interrupts might increase beyond
the 99,900 threshold and cause disablement of the interrupt.

In practice it just increments the spurious interrupt count. But that can
cause people to waste time investigating it over and over.

Fix it by moving the accounting before the invocation of
irq_finalize_oneshot().

[ tglx: Folded change log update ]

Fixes: 1e77d0a1ed74 ("genirq: Sanitize spurious interrupt detection of threaded irqs")
Signed-off-by: Lukas Wunner <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Mathias Duckeck <[email protected]>
Cc: Akshay Bhat <[email protected]>
Cc: Casey Fitzpatrick <[email protected]>
Cc: [email protected] # v3.16+
Link: https://lkml.kernel.org/r/1dfd8bbd16163940648045495e3e9698e63b50ad.1539867047.git.lukas@wunner.de

---
kernel/irq/manage.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index fb86146037a7..9dbdccab3b6a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -927,6 +927,9 @@ irq_forced_thread_fn(struct irq_desc *desc, struct irqaction *action)

local_bh_disable();
ret = action->thread_fn(action->irq, action->dev_id);
+ if (ret == IRQ_HANDLED)
+ atomic_inc(&desc->threads_handled);
+
irq_finalize_oneshot(desc, action);
local_bh_enable();
return ret;
@@ -943,6 +946,9 @@ static irqreturn_t irq_thread_fn(struct irq_desc *desc,
irqreturn_t ret;

ret = action->thread_fn(action->irq, action->dev_id);
+ if (ret == IRQ_HANDLED)
+ atomic_inc(&desc->threads_handled);
+
irq_finalize_oneshot(desc, action);
return ret;
}
@@ -1020,8 +1026,6 @@ static int irq_thread(void *data)
irq_thread_check_affinity(desc, action);

action_ret = handler_fn(desc, action);
- if (action_ret == IRQ_HANDLED)
- atomic_inc(&desc->threads_handled);
if (action_ret == IRQ_WAKE_THREAD)
irq_wake_secondary(desc, action);


2018-10-19 16:15:23

by David Laight

Subject: RE: [tip:irq/core] genirq: Fix race on spurious interrupt detection

From: Lukas Wunner
> Sent: 19 October 2018 16:34
>
> genirq: Fix race on spurious interrupt detection
>
> Commit 1e77d0a1ed74 ("genirq: Sanitize spurious interrupt detection of
> threaded irqs") made detection of spurious interrupts work for threaded
> handlers by:
>
> a) incrementing a counter every time the thread returns IRQ_HANDLED, and
> b) checking whether that counter has increased every time the thread is
> woken.

That seems horribly broken.
What is it trying to achieve?

There are (at least) two common cases where IRQ_HANDLED doesn't get returned.
(Unless the driver always returns it to avoid the message.)

1) The IOW that causes the hardware to drop a level sensitive IRQ is posted
on the bus (etc) and happens late enough that the IRQ line is still
asserted when the iret executes.
If this happens all the time you need to flush the IOW, but if only
occasionally it doesn't matter and you don't want a message.

2) Typically an ethernet driver ISR has to enable the interrupt and then
check the ring for work before returning from the interrupt.
If a packet arrives at this time it might be processed by the 'old'
ISR invocation but still generate another interrupt.
If no more packets arrive the second ISR invocation will find no work.
Again this is normal behaviour.
(Deferring everything with NAPI might make this not happen - but other
interrupts end up working the same way.)

If you are really trying to detect 'stuck' interrupts then you probably
want to count un-handled ones and zero the count on handled ones.
I'm also pretty sure you don't need an atomic counter.

David


2018-10-19 18:42:20

by Thomas Gleixner

Subject: RE: [tip:irq/core] genirq: Fix race on spurious interrupt detection

David,

On Fri, 19 Oct 2018, David Laight wrote:
> From: Lukas Wunner
> > Sent: 19 October 2018 16:34
> >
> > genirq: Fix race on spurious interrupt detection
> >
> > Commit 1e77d0a1ed74 ("genirq: Sanitize spurious interrupt detection of
> > threaded irqs") made detection of spurious interrupts work for threaded
> > handlers by:
> >
> > a) incrementing a counter every time the thread returns IRQ_HANDLED, and
> > b) checking whether that counter has increased every time the thread is
> > woken.
>
> That seems horribly broken.
> What is it trying to achieve?
>
> There are (at least) two common cases where IRQ_HANDLED doesn't get returned.
> (Unless the driver always returns it to avoid the message.)
>
> 1) The IOW that causes the hardware to drop a level sensitive IRQ is posted
> on the bus (etc) and happens late enough that the IRQ line is still
> asserted when the iret executes.
> If this happens all the time you need to flush the IOW, but if only
> occasionally it doesn't matter and you don't want a message.
>
> 2) Typically an ethernet driver ISR has to enable the interrupt and then
> check the ring for work before returning from the interrupt.
> If a packet arrives at this time it might be processed by the 'old'
> ISR invocation but still generate another interrupt.
> If no more packets arrive the second ISR invocation will find no work.
> Again this is normal behaviour.
> (Deferring everything with NAPI might make this not happen - but other
> interrupts end up working the same way.)
>
> If you are really trying to detect 'stuck' interrupts then you probably
> want to count un-handled ones and zero the count on handled ones.
> I'm also pretty sure you don't need an atomic counter.

Care to look at the logic which handles all of this including the
interaction with threaded interrupt handlers?

Thanks,

tglx

2018-10-19 20:27:16

by Thomas Gleixner

Subject: RE: [tip:irq/core] genirq: Fix race on spurious interrupt detection

David,

On Fri, 19 Oct 2018, Thomas Gleixner wrote:
> On Fri, 19 Oct 2018, David Laight wrote:
> > From: Lukas Wunner
> > > Sent: 19 October 2018 16:34
> > >
> > > genirq: Fix race on spurious interrupt detection
> > >
> > > Commit 1e77d0a1ed74 ("genirq: Sanitize spurious interrupt detection of
> > > threaded irqs") made detection of spurious interrupts work for threaded
> > > handlers by:
> > >
> > > a) incrementing a counter every time the thread returns IRQ_HANDLED, and
> > > b) checking whether that counter has increased every time the thread is
> > > woken.
> >
> > That seems horribly broken.
> > What is it trying to achieve?
> >
> > There are (at least) two common cases where IRQ_HANDLED doesn't get returned.
> > (Unless the driver always returns it to avoid the message.)
> >
> > 1) The IOW that causes the hardware to drop a level sensitive IRQ is posted
> > on the bus (etc) and happens late enough that the IRQ line is still
> > asserted when the iret executes.
> > If this happens all the time you need to flush the IOW, but if only
> > occasionally it doesn't matter and you don't want a message.
> >
> > 2) Typically an ethernet driver ISR has to enable the interrupt and then
> > check the ring for work before returning from the interrupt.
> > If a packet arrives at this time it might be processed by the 'old'
> > ISR invocation but still generate another interrupt.
> > If no more packets arrive the second ISR invocation will find no work.
> > Again this is normal behaviour.
> > (Deferring everything with NAPI might make this not happen - but other
> > interrupts end up working the same way.)
> >
> > If you are really trying to detect 'stuck' interrupts then you probably
> > want to count un-handled ones and zero the count on handled ones.
> > I'm also pretty sure you don't need an atomic counter.
>
> Care to look at the logic which handles all of this including the
> interaction with threaded interrupt handlers?

Sorry for being dense. Wanted to postpone and hit send ...

The thing is that the spurious detector has worked perfectly fine for
years in the scenarios you described. It does proper handled vs. unhandled
accounting.

What Lukas is addressing is not the normal interrupt case, it's the
threaded handler case, where we did the accounting check after reenabling
the interrupt, which causes spurious detection to trigger occasionally.
That does not lead to disabling the interrupt unless you force it with a
gazillion of printks.

The atomic counter is necessary in that case because for certain scenarios
the hard interrupt handler can race with the threaded handler legitimately.

Thanks,

tglx