2024-01-16 07:37:37

by Leonardo Bras

Subject: [RESEND RFC PATCH v1 1/2] irq/spurious: Reset irqs_unhandled if an irq_thread handles one IRQ request

When IRQs are threaded, the part of the handler that runs in interrupt
context can be pretty fast, by design, leaving the slow part to run in
the thread handler.

In some cases, IRQs can be triggered so fast that the irq_thread cannot
keep up with handling every request.

If two requests happen before any irq_thread handler is able to finish,
threads_handled is not incremented, so threads_handled and
threads_handled_last remain equal, which ends up causing irqs_unhandled
to be incremented in note_interrupt().

Once irqs_unhandled gets to ~100k, the IRQ line gets disabled, disrupting
the device's operation.

As of today, the only way to reset irqs_unhandled before the IRQ line is
disabled is to go 100ms without any increment to irqs_unhandled, which is
unlikely to happen if the IRQ is very busy.

On top of that, some irq_thread handlers can handle requests in batches,
effectively incrementing threads_handled only once despite dealing with a
lot of requests, which makes irqs_unhandled reach 100k pretty fast if the
IRQ is getting a lot of requests.
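
The accounting described above can be illustrated with a small
user-space model. This is only a sketch, not the kernel code: the
model_* names, the window of 100000 interrupts and the 99900 limit are
assumptions meant to approximate note_interrupt(), the 100ms reset is
left out, and the batching ratio passed to model_run() is an arbitrary
illustrative value.

```c
#include <stdbool.h>

#define MODEL_DEFERRED 0x80000000u /* stands in for SPURIOUS_DEFERRED */
#define MODEL_WINDOW   100000u     /* assumed interrupts per evaluation window */
#define MODEL_LIMIT    99900u      /* assumed unhandled count that disables the line */

struct model_desc {
	unsigned int threads_handled;      /* bumped by the thread handler */
	unsigned int threads_handled_last; /* snapshot taken in hard IRQ context */
	unsigned int irqs_unhandled;
	unsigned int irq_count;
	bool disabled;
};

/* One pass of the thread handler: it may drain many requests,
 * but the counter only moves by one. */
static void model_thread_ran(struct model_desc *d)
{
	d->threads_handled++;
}

/* One hard interrupt whose primary handler only woke the thread. */
static void model_hard_irq(struct model_desc *d)
{
	unsigned int handled;

	/* First interrupt: just arm the deferred check. */
	if (!(d->threads_handled_last & MODEL_DEFERRED)) {
		d->threads_handled_last |= MODEL_DEFERRED;
		return;
	}

	handled = d->threads_handled | MODEL_DEFERRED;
	if (handled != d->threads_handled_last)
		d->threads_handled_last = handled; /* thread made progress */
	else
		d->irqs_unhandled++;               /* looks unhandled */

	if (++d->irq_count < MODEL_WINDOW)
		return;

	/* End of the window: decide whether the line looks stuck. */
	d->irq_count = 0;
	if (d->irqs_unhandled > MODEL_LIMIT)
		d->disabled = true;
	d->irqs_unhandled = 0;
}

/* Fire total_irqs interrupts; the thread completes one pass
 * every irqs_per_pass interrupts. Returns true if the line got disabled. */
static bool model_run(unsigned int total_irqs, unsigned int irqs_per_pass)
{
	struct model_desc d = {0};

	for (unsigned int i = 1; i <= total_irqs && !d.disabled; i++) {
		model_hard_irq(&d);
		if (i % irqs_per_pass == 0)
			model_thread_ran(&d);
	}
	return d.disabled;
}
```

In this model, a thread that completes one pass per 2000 interrupts
gets the line disabled even though every request is eventually served,
while a thread that completes one pass per interrupt never does.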

This IRQ line disable bug can be easily reproduced with a serial8250
console on a PREEMPT_RT kernel: the user only needs to print a lot of
text to the console (or to ttyS0); around 300k chars should be enough.

To fix this bug, reset irqs_unhandled whenever irq_thread handles at least
one IRQ request.

This fix makes it possible to avoid disabling IRQs whose irq_thread
handlers can take long (while the IRQ line is under heavy use), without
losing the ability to disable IRQs that actually go unhandled for too long.
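
The intended effect of the reset can be sketched in user space as
follows. This is a simplified model under stated assumptions, not the
kernel code: the sketch_* names, the 100000 window and the 99900 limit
are illustrative stand-ins, and the 100ms reset is not modeled. The
reset_on_progress flag toggles the one-line change from this patch.

```c
#include <stdbool.h>

#define SKETCH_DEFERRED 0x80000000u /* stands in for SPURIOUS_DEFERRED */
#define SKETCH_WINDOW   100000u     /* assumed interrupts per evaluation window */
#define SKETCH_LIMIT    99900u      /* assumed unhandled count that disables the line */

struct sketch_desc {
	unsigned int threads_handled;
	unsigned int threads_handled_last;
	unsigned int irqs_unhandled;
	unsigned int irq_count;
	bool disabled;
};

/* One hard interrupt that only woke the thread. */
static void sketch_note(struct sketch_desc *d, bool reset_on_progress)
{
	unsigned int handled;

	if (!(d->threads_handled_last & SKETCH_DEFERRED)) {
		d->threads_handled_last |= SKETCH_DEFERRED;
		return;
	}

	handled = d->threads_handled | SKETCH_DEFERRED;
	if (handled != d->threads_handled_last) {
		d->threads_handled_last = handled;
		if (reset_on_progress)
			d->irqs_unhandled = 0; /* the change proposed here */
	} else {
		d->irqs_unhandled++;
	}

	if (++d->irq_count < SKETCH_WINDOW)
		return;

	d->irq_count = 0;
	if (d->irqs_unhandled > SKETCH_LIMIT)
		d->disabled = true;
	d->irqs_unhandled = 0;
}

/* The thread completes one pass every irqs_per_pass interrupts;
 * irqs_per_pass == 0 means the thread never handles anything. */
static bool sketch_line_disabled(unsigned int total_irqs,
				 unsigned int irqs_per_pass,
				 bool reset_on_progress)
{
	struct sketch_desc d = {0};

	for (unsigned int i = 1; i <= total_irqs && !d.disabled; i++) {
		sketch_note(&d, reset_on_progress);
		if (irqs_per_pass && i % irqs_per_pass == 0)
			d.threads_handled++;
	}
	return d.disabled;
}
```

Under this model, a slow but working thread no longer trips the limit
once the reset is enabled, while a thread that never handles anything
still gets the line disabled.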

Signed-off-by: Leonardo Bras <[email protected]>
---
kernel/irq/spurious.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index 02b2daf074414..b60748f89738a 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -339,6 +339,14 @@ void note_interrupt(struct irq_desc *desc, irqreturn_t action_ret)
 			handled |= SPURIOUS_DEFERRED;
 			if (handled != desc->threads_handled_last) {
 				action_ret = IRQ_HANDLED;
+
+				/*
+				 * If the thread handlers handle
+				 * one IRQ reset the unhandled
+				 * IRQ counter.
+				 */
+				desc->irqs_unhandled = 0;
+
 				/*
 				 * Note: We keep the SPURIOUS_DEFERRED
 				 * bit set. We are handling the
--
2.43.0



2024-01-17 22:08:58

by Thomas Gleixner

Subject: Re: [RESEND RFC PATCH v1 1/2] irq/spurious: Reset irqs_unhandled if an irq_thread handles one IRQ request

On Tue, Jan 16 2024 at 04:36, Leonardo Bras wrote:
> This IRQ line disable bug can be easily reproduced with a serial8250
> console on a PREEMPT_RT kernel: it only takes the user to print a lot
> of text to the console (or to ttyS0): around 300k chars should be
> enough.

That has nothing to do with RT, it's a problem of force threaded
interrupts in combination with an edge type interrupt line and a
hardware which keeps firing interrupts forever.

> To fix this bug, reset irqs_unhandled whenever irq_thread handles at least
> one IRQ request.

This papers over the symptom and makes runaway detection way weaker for
all interrupts or breaks it completely.

The problem with edge type interrupts is that we cannot mask them like
we do with level type interrupts in the hard interrupt handler and
unmask them once the threaded handler finishes.

So yes, we need special rules here when:

1) The interrupt handler is force threaded

2) The interrupt line is edge type

3) The accumulated unhandled interrupts are within a sane margin
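
One possible reading of the three conditions above, written as a
predicate. This is a hypothetical sketch only: struct irq_facts, its
fields and special_rules_apply() do not exist in the kernel, and the
"sane margin" is left as a parameter because no value is given here.

```c
#include <stdbool.h>

/* Hypothetical summary of what would need to be known about a line. */
struct irq_facts {
	bool force_threaded;            /* 1) handler is force threaded */
	bool edge_type;                 /* 2) line is edge type */
	unsigned int unhandled_pending; /* 3) accumulated unhandled interrupts */
};

/* Returns true when all three conditions hold, i.e. when the special
 * rules would apply instead of the normal spurious accounting. */
static bool special_rules_apply(const struct irq_facts *f,
				unsigned int sane_margin)
{
	return f->force_threaded &&
	       f->edge_type &&
	       f->unhandled_pending <= sane_margin;
}
```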

Thanks,

tglx

2024-01-17 22:47:29

by Leonardo Bras

Subject: Re: [RESEND RFC PATCH v1 1/2] irq/spurious: Reset irqs_unhandled if an irq_thread handles one IRQ request

On Wed, Jan 17, 2024 at 11:08:44PM +0100, Thomas Gleixner wrote:
> On Tue, Jan 16 2024 at 04:36, Leonardo Bras wrote:
> > This IRQ line disable bug can be easily reproduced with a serial8250
> > console on a PREEMPT_RT kernel: it only takes the user to print a lot
> > of text to the console (or to ttyS0): around 300k chars should be
> > enough.
>
> That has nothing to do with RT, it's a problem of force threaded
> interrupts in combination with an edge type interrupt line and a
> hardware which keeps firing interrupts forever.

Hello Thomas, thanks for your feedback!

I agree it has nothing to do with RT.
I just mentioned PREEMPT_RT as my test case scenario, since it enables
force-threaded IRQs.

>
> > To fix this bug, reset irqs_unhandled whenever irq_thread handles at least
> > one IRQ request.
>
> This papers over the symptom and makes runaway detection way weaker for
> all interrupts or breaks it completely.

This change is supposed to only touch threaded interrupts, since it will
reach the added line only if (action_ret == IRQ_WAKE_THREAD) and if
desc->threads_handled has changed since the last IRQ request.

This increment also happens only in irq_forced_thread_fn() and
irq_thread_fn(), which are called only from irq_thread().

But I get the overall worry about this making runaway detection way
weaker for all threaded interrupts.

I have previously worked on a solution that can be more precise and be an
opt-in for drivers instead of a general solution:

It required a change in the IRQ interface that lets the handlers report
how many IRQs were actually handled (batching). This number would then be
added to desc->threads_handled (in irq_*thread_fn(), just changing the
atomic_inc() to atomic_add()), and then subtracted from irqs_unhandled
in note_interrupt().

In the serial8250 case, the driver would be changed to use that interface,
since it's already able to process multiple IRQs, and the bug just
vanishes.
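
A rough user-space sketch of that batching idea, using C11 atomics in
place of the kernel's atomic_t. The batch_* names are hypothetical, the
SPURIOUS_DEFERRED handling is left out, and clamping the subtraction at
zero is an assumption of this sketch rather than something taken from
the actual branch.

```c
#include <stdatomic.h>

struct batch_desc {
	atomic_uint threads_handled;       /* total requests handled by threads */
	unsigned int threads_handled_last; /* value seen by the last hard IRQ */
	unsigned int irqs_unhandled;
};

/* Thread side: the handler reports how many requests it processed,
 * so the counter moves by n instead of by one. */
static void batch_thread_handled(struct batch_desc *d, unsigned int n)
{
	atomic_fetch_add(&d->threads_handled, n);
}

/* Hard IRQ side: credit the requests handled since the last check
 * against the unhandled count. */
static void batch_note_interrupt(struct batch_desc *d)
{
	unsigned int handled = atomic_load(&d->threads_handled);
	unsigned int delta = handled - d->threads_handled_last;

	if (delta == 0) {
		/* No progress since the last interrupt. */
		d->irqs_unhandled++;
		return;
	}

	d->threads_handled_last = handled;
	/* Subtract the handled batch, never going below zero. */
	if (delta > d->irqs_unhandled)
		delta = d->irqs_unhandled;
	d->irqs_unhandled -= delta;
}
```

With this shape, a handler that drains several requests in one pass
cancels out the same number of interrupts that were previously counted
as unhandled, instead of just one.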

This also solved the serial driver issue, but required a deeper change in
the code, which caused me to consider a simpler solution first.

This solution sure does give better runaway detection. Do you think it
would be better than the one I sent in this patch?

>
> The problem with edge type interrupts is that we cannot mask them like
> we do with level type interrupts in the hard interrupt handler and
> unmask them once the threaded handler finishes.
>
> So yes, we need special rules here when:
>
> 1) The interrupt handler is force threaded
>
> 2) The interrupt line is edge type
>
> 3) The accumulated unhandled interrupts are within a sane margin
>
> Thanks,
>
> tglx
>

Completely agree, that's why I am suggesting dealing with threaded
interrupts in a different way: resetting the unhandled count when the
thread handles a request.

I am not sure how force threaded and just threaded are different in this
scenario. Could you help me understand?

Thanks!
Leo


2024-01-18 09:24:45

by Leonardo Bras

Subject: Re: [RESEND RFC PATCH v1 1/2] irq/spurious: Reset irqs_unhandled if an irq_thread handles one IRQ request

On Wed, Jan 17, 2024 at 07:46:28PM -0300, Leonardo Bras wrote:
> On Wed, Jan 17, 2024 at 11:08:44PM +0100, Thomas Gleixner wrote:
> > On Tue, Jan 16 2024 at 04:36, Leonardo Bras wrote:
> > > This IRQ line disable bug can be easily reproduced with a serial8250
> > > console on a PREEMPT_RT kernel: it only takes the user to print a lot
> > > of text to the console (or to ttyS0): around 300k chars should be
> > > enough.
> >
> > That has nothing to do with RT, it's a problem of force threaded
> > interrupts in combination with an edge type interrupt line and a
> > hardware which keeps firing interrupts forever.
>
> Hello Thomas, thanks for your feedback!
>
> I agree it has nothing to do with RT.
> I just mentioned PREEMPT_RT as my test case scenario, since it enables
> force-threaded IRQs.
>
> >
> > > To fix this bug, reset irqs_unhandled whenever irq_thread handles at least
> > > one IRQ request.
> >
> > This papers over the symptom and makes runaway detection way weaker for
> > all interrupts or breaks it completely.
>
> This change is supposed to only touch threaded interrupts, since it will
> reach the added line only if (action_ret == IRQ_WAKE_THREAD) and if
> desc->threads_handled has changed since the last IRQ request.
>
> This increment also happens only in irq_forced_thread_fn() and
> irq_thread_fn(), which are called only from irq_thread().
>
> But I get the overall worry about this making runaway detection way
> weaker for all threaded interrupts.
>
> I have previously worked on a solution that can be more precise and be an
> opt-in for drivers instead of a general solution:
>
> It required a change in the IRQ interface that lets the handlers report
> how many IRQs were actually handled (batching). This number would then be
> added to desc->threads_handled (in irq_*thread_fn(), just changing the
> atomic_inc() to atomic_add()), and then subtracted from irqs_unhandled
> in note_interrupt().
>
> In the serial8250 case, the driver would be changed to use that interface,
> since it's already able to process multiple IRQs, and the bug just
> vanishes.
>
> This also solved the serial driver issue, but required a deeper change in
> the code, which caused me to consider a simpler solution first.
>
> This solution sure does give better runaway detection. Do you think it
> would be better than the one I sent in this patch?

For reference, this is the alternative:
https://gitlab.com/LeoBras/linux/-/commits/serial8250

Please let me know if you think this one is better.

Thanks!
Leo

>
> >
> > The problem with edge type interrupts is that we cannot mask them like
> > we do with level type interrupts in the hard interrupt handler and
> > unmask them once the threaded handler finishes.
> >
> > So yes, we need special rules here when:
> >
> > 1) The interrupt handler is force threaded
> >
> > 2) The interrupt line is edge type
> >
> > 3) The accumulated unhandled interrupts are within a sane margin
> >
> > Thanks,
> >
> > tglx
> >
>
> Completely agree, that's why I am suggesting dealing with threaded
> interrupts in a different way: resetting the unhandled count when the
> thread handles a request.
>
> I am not sure how force threaded and just threaded are different in this
> scenario. Could you help me understand?
>
> Thanks!
> Leo