2006-11-08 11:30:49

by Thomas Gleixner

Subject: Re: 2.6.19-rc5: known regressions

On Wed, 2006-11-08 at 09:52 +0100, Adrian Bunk wrote:
> Subject : SMP kernel can not generate ISA irq properly
> References : http://lkml.org/lkml/2006/10/22/15
> Submitter : Komuro <[email protected]>
> Handled-By : Thomas Gleixner <[email protected]>
> Status : Thomas is investigating

The problem is not reproducible on any of my boxen.

Komuro,

is this still happening on -rc5? If yes, can you please provide the
boot log with "apic=verbose" on the command line?

tglx



2006-11-08 16:04:06

by Linus Torvalds

Subject: Re: Re: 2.6.19-rc5: known regressions



On Wed, 8 Nov 2006, Komuro wrote:
>
> Intel ISA PCIC probe:
> Intel i82365sl B step ISA-to-PCMCIA at port 0x3e0 ofs 0x00, 2 sockets
> host opts [0]: none
> host opts [1]: none
> ISA irqs (scanned) = 3,4,5,7,9,11,15 status change on irq 15

This definitely means that the IRQ subsystem works, at least here. That
"scanned" means that the PCMCIA driver actually tested those interrupts,
and they worked.

At that point, at least.

Of course, the "they worked" test is fairly simple, so it's by no means
foolproof, but in general, it does sound like it all really should be ok.

Komuro, if you're a git user (or are willing to learn), and it's reliable
with one particular card, it really would make most sense to bisect it.
Just start off with

git bisect start
git bisect good v2.6.18
git bisect bad v2.6.19-rc1

and off you go. That's a lot of commits (about 5000), but even if you
don't want to do the 12 or 13 kernel compiles and reboots that are needed
for a full bisection, doing just 4-5 would cut the number down a lot, and
then you can send the bisection log out.

But testing 2.6.19-rc5 is still worth it. The APIC fixes might fix it, or
some other changes might.

Linus






> warning: process `date' used the removed sysctl system call
> EXT3 FS on hda1, internal journal
> Adding 257032k swap on /dev/hda2. Priority:-1 extents:1 across:257032k
> warning: process `ls' used the removed sysctl system call
> warning: process `sleep' used the removed sysctl system call
> cs: IO port probe 0x100-0x3af: excluding 0x170-0x177 0x290-0x297 0x370-0x37f
> cs: IO port probe 0x3e0-0x4ff: excluding 0x4d0-0x4d7
> cs: IO port probe 0x820-0x8ff: clean.
> cs: IO port probe 0xc00-0xcf7: clean.
> cs: IO port probe 0xa00-0xaff: clean.
> cs: IO port probe 0x100-0x3af: excluding 0x170-0x177 0x290-0x297 0x370-0x37f
> cs: IO port probe 0x3e0-0x4ff: excluding 0x4d0-0x4d7
> cs: IO port probe 0x820-0x8ff: clean.
> cs: IO port probe 0xc00-0xcf7: clean.
> cs: IO port probe 0xa00-0xaff: clean.
>
> Best Regards
> Komuro
>
>

2006-11-10 12:42:53

by Komuro

Subject: Re: Re: Re: 2.6.19-rc5: known regressions :SMP kernel can not generate ISA irq

Hi,

>> Intel ISA PCIC probe:
>> Intel i82365sl B step ISA-to-PCMCIA at port 0x3e0 ofs 0x00, 2 sockets
>> host opts [0]: none
>> host opts [1]: none
>> ISA irqs (scanned) = 3,4,5,7,9,11,15 status change on irq 15
>
>This definitely means that the IRQ subsystem works, at least here. That
>"scanned" means that the PCMCIA driver actually tested those interrupts,
>and they worked.
>
>At that point, at least.
>
>Of course, the "they worked" test is fairly simple, so it's by no means
>foolproof, but in general, it does sound like it all really should be ok.
>
>
>But testing 2.6.19-rc5 is still worth it. The APIC fixes might fix it, or
>some other changes might.
>
> Linus

I tried 2.6.19-rc5; the problem still happens.

But if I remove the disable_irq_nosync()/enable_irq() calls
from linux/drivers/net/pcmcia/axnet_cs.c,
the interrupt is generated properly.

So I think enable_irq() does not enable the irq.

Thanks!

Best Regards
Komuro



2006-11-13 16:07:14

by Linus Torvalds

Subject: Re: Re: Re: 2.6.19-rc5: known regressions :SMP kernel can not generate ISA irq


On Fri, 10 Nov 2006, Komuro wrote:
>
> I tried 2.6.19-rc5; the problem still happens.

Ok, that's good data, and especially:

> But if I remove the disable_irq_nosync()/enable_irq() calls
> from linux/drivers/net/pcmcia/axnet_cs.c,
> the interrupt is generated properly.

All RIGHT. That's a very good clue. The major difference between PCI and
ISA irq's is that they have different trigger types (they also have
different polarity, but that tends to be just a small detail). In
particular, ISA IRQ's are edge-triggered, and PCI IRQ's are level-
triggered.

Now, edge-triggered interrupts are a _lot_ harder to mask, because the
Intel APIC is an unbelievable piece of sh*t, and has the edge-detect logic
_before_ the mask logic, so if an edge happens _while_ the device is
masked, you'll never ever see the edge ever again (unmasking will not
cause a new edge, so you simply lost the interrupt).

So when you "mask" an edge-triggered IRQ, you can't really mask it at all,
because if you did that, you'd lose it forever if the IRQ comes in while
you masked it. Instead, we're supposed to leave it active, and set a flag,
and IF the IRQ comes in, we just remember it, and mask it at that point
instead, and then on unmasking, we have to replay it by sending a
self-IPI.

Maybe that part got broken by some of the IRQ changes by Eric.

Eric, can you please double-check this all? I suspect you disable
edge-triggered interrupts when moving them, or something, and maybe you
didn't realize that if you disable them on the IO-APIC level, they can be
gone forever.

[ Note: this is true EVEN IF we are in the interrupt handler right then -
if we get another edge while in the interrupt handler, the interrupt
will normally be _delayed_ until we've ACK'ed it, but if we have
_masked_ it, it will simply be lost entirely. So a simple "mask"
operation is always incorrect for edge-triggered interrupts.

One option might be to do a simple mask, and on unmask, turn the edge
trigger into a level trigger at the same time. Then, the first time you
get the interrupt, you turn it back into an edge trigger _before_ you
call the interrupt handlers. That might actually be simpler than doing
the "irq replay" dance with self-IPI, because we can't actually just
fake the IRQ handling - when enable_irq() is called, irq's are normally
disabled on the CPU, so we can't just call the irq handler at that
point: we really do need to "replay" the dang thing.

Did I mention that the Intel APIC's are a piece of cr*p already? ]

> So I think enable_irq() does not enable the irq.

It probably does enable it (that's the easy part), but see above: if any
of the support structure for the APIC crapola is subtly broken, we'll have
lost the IRQ anyway.

(Many other IRQ controllers get this right: the "old and broken" Intel
i8259 interrupt controller was a much better IRQ controller than the APIC
in this regard, because it simply had the edge-detect logic after the
masking logic, so if you unmasked an active interrupt that had been
masked, you would always see it as an edge, and the i8259 controller needs
none of the subtle code at _all_. It just works.)

Anyway, if you _can_ bisect the exact point where this started happening,
that would be good. But I would not be surprised in the least if this is
all introduced by Eric Biederman's dynamic IRQ handling.

Eric?

Linus

2006-11-13 17:13:59

by Eric W. Biederman

Subject: Re: 2.6.19-rc5: known regressions :SMP kernel can not generate ISA irq

Linus Torvalds <[email protected]> writes:

> On Fri, 10 Nov 2006, Komuro wrote:
>>
>> I tried 2.6.19-rc5; the problem still happens.
>
> Ok, that's good data, and especially:
>
>> But if I remove the disable_irq_nosync()/enable_irq() calls
>> from linux/drivers/net/pcmcia/axnet_cs.c,
>> the interrupt is generated properly.
>
> All RIGHT. That's a very good clue. The major difference between PCI and
> ISA irq's is that they have different trigger types (they also have
> different polarity, but that tends to be just a small detail). In
> particular, ISA IRQ's are edge-triggered, and PCI IRQ's are level-
> triggered.
>
> Now, edge-triggered interrupts are a _lot_ harder to mask, because the
> Intel APIC is an unbelievable piece of sh*t, and has the edge-detect logic
> _before_ the mask logic, so if an edge happens _while_ the device is
> masked, you'll never ever see the edge ever again (unmasking will not
> cause a new edge, so you simply lost the interrupt).
>
> So when you "mask" an edge-triggered IRQ, you can't really mask it at all,
> because if you did that, you'd lose it forever if the IRQ comes in while
> you masked it. Instead, we're supposed to leave it active, and set a flag,
> and IF the IRQ comes in, we just remember it, and mask it at that point
> instead, and then on unmasking, we have to replay it by sending a
> self-IPI.
>
> Maybe that part got broken by some of the IRQ changes by Eric.

Hmm. The other possibility is that this is a genirq migration issue.

Yep. That looks like it. In the genirq migration the edge and
level triggered cases got merged and previously disable_edge_ioapic
was a noop. Ouch.

Darn, I missed this one in my review of Ingo's changes.

I'm not at all certain what the correct fix here is.
- Do we make the generic code aware of this messed-up case? I believe
  it is already aware of part of the don't-disable-edge-triggered
  interrupts logic.
- Do we modify the disable logic so it doesn't actually disable the
  irq?
- Do we do as Linus suggests and make the enable logic pass through a
  level-triggered state?
- Do we split the edge- and level-triggered cases apart again on i386
  and x86_64?

And how do we make it drop-dead clear what we are doing, so that
someone doesn't break this again in the future by accident? That, I
suspect, was the real problem: that stupid vector == irq case had
introduced so many levels of abstraction that it was nearly impossible
to read the code.

Can we get this abstraction right, so that we can write obviously
correct code here and still handle all of the weird cases?

> Eric, can you please double-check this all? I suspect you disable
> edge-triggered interrupts when moving them, or something, and maybe you
> didn't realize that if you disable them on the IO-APIC level, they can be
> gone forever.

Sure. So the hypothesis is that something near commit
e7b946e98a456077dd6897f726f3d6197bd7e3b9 is causing the problem.

Anything I have changed in this area should affect both i386 and
x86_64.

> [ Note: this is true EVEN IF we are in the interrupt handler right then -
> if we get another edge while in the interrupt handler, the interrupt
> will normally be _delayed_ until we've ACK'ed it, but if we have
> _masked_ it, it will simply be lost entirely. So a simple "mask"
> operation is always incorrect for edge-triggered interrupts.
>
> One option might be to do a simple mask, and on unmask, turn the edge
> trigger into a level trigger at the same time. Then, the first time you
> get the interrupt, you turn it back into an edge trigger _before_ you
> call the interrupt handlers. That might actually be simpler than doing
> the "irq replay" dance with self-IPI, because we can't actually just
> fake the IRQ handling - when enable_irq() is called, irq's are normally
> disabled on the CPU, so we can't just call the irq handler at that
> point: we really do need to "replay" the dang thing.
>
> Did I mention that the Intel APIC's are a piece of cr*p already? ]

Ok. After a quick skim it appears that there is a disable/enable pair
in the irq migration path for edge-triggered interrupts. But we do
that work while the irq is pending, and it doesn't look like I changed
that part of the code - just the level-triggered irq migration.

>> So I think enable_irq() does not enable the irq.
>
> It probably does enable it (that's the easy part), but see above: if any
> of the support structure for the APIC crapola is subtly broken, we'll have
> lost the IRQ anyway.
>
> (Many other IRQ controllers get this right: the "old and broken" Intel
> i8259 interrupt controller was a much better IRQ controller than the APIC
> in this regard, because it simply had the edge-detect logic after the
> masking logic, so if you unmasked an active interrupt that had been
> masked, you would always see it as an edge, and the i8259 controller needs
> none of the subtle code at _all_. It just works.)
>
> Anyway, if you _can_ bisect the exact point where this started happening,
> that would be good. But I would not be surprised in the least if this is
> all introduced by Eric Biederman's dynamic IRQ handling.

I will share the credit, because I missed this in code review, but
this is really Ingo's generic irq code.

Eric

2006-11-13 20:46:45

by Ingo Molnar

Subject: Re: 2.6.19-rc5: known regressions :SMP kernel can not generate ISA irq

On Mon, 2006-11-13 at 10:11 -0700, Eric W. Biederman wrote:
> > So when you "mask" an edge-triggered IRQ, you can't really mask it at all,
> > because if you did that, you'd lose it forever if the IRQ comes in while
> > you masked it. Instead, we're supposed to leave it active, and set a flag,
> > and IF the IRQ comes in, we just remember it, and mask it at that point
> > instead, and then on unmasking, we have to replay it by sending a
> > self-IPI.
> >
> > Maybe that part got broken by some of the IRQ changes by Eric.
>
> Hmm. The other possibility is that this is a genirq migration issue.
>
> Yep. That looks like it. In the genirq migration the edge and
> level triggered cases got merged and previously disable_edge_ioapic
> was a noop. Ouch.

hm, that should be solved by the generic edge-triggered flow handler as
well: we never mask an IRQ first time around, we only mask it if
we /already/ have the 'soft' IRQ_PENDING flag set. (in that case the
lost edge is not an issue because we have the information already - and
the masking will prevent a screaming edge source)

but maybe this concept has not been pushed through to the disable/enable
irq logic itself? (it's only present in the flow handler) Thomas, do you
concur?

Ingo

2006-11-13 21:12:45

by Eric W. Biederman

Subject: Re: 2.6.19-rc5: known regressions :SMP kernel can not generate ISA irq

Ingo Molnar <[email protected]> writes:

> On Mon, 2006-11-13 at 10:11 -0700, Eric W. Biederman wrote:
>> > So when you "mask" an edge-triggered IRQ, you can't really mask it at all,
>> > because if you did that, you'd lose it forever if the IRQ comes in while
>> > you masked it. Instead, we're supposed to leave it active, and set a flag,
>> > and IF the IRQ comes in, we just remember it, and mask it at that point
>> > instead, and then on unmasking, we have to replay it by sending a
>> > self-IPI.
>> >
>> > Maybe that part got broken by some of the IRQ changes by Eric.
>>
>> Hmm. The other possibility is that this is a genirq migration issue.
>>
>> Yep. That looks like it. In the genirq migration the edge and
>> level triggered cases got merged and previously disable_edge_ioapic
>> was a noop. Ouch.
>
> hm, that should be solved by the generic edge-triggered flow handler as
> well: we never mask an IRQ first time around, we only mask it if
> we /already/ have the 'soft' IRQ_PENDING flag set. (in that case the
> lost edge is not an issue because we have the information already - and
> the masking will prevent a screaming edge source)
>
> but maybe this concept has not been pushed through to the disable/enable
> irq logic itself? (it's only present in the flow handler) Thomas, do you
> concur?

I just looked. I think the logic is actually in there as well.
I keep forgetting disable != mask.

It looks like what is really missing is that we aren't setting
IRQ_DELAYED_DISABLE.

So I think what we really need to do is just set IRQ_DELAYED_DISABLE.
Does the patch below look right?


Eric


diff --git a/arch/x86_64/kernel/io_apic.c b/arch/x86_64/kernel/io_apic.c
index 41bfc49..14654e6 100644
--- a/arch/x86_64/kernel/io_apic.c
+++ b/arch/x86_64/kernel/io_apic.c
@@ -790,9 +790,11 @@ static void ioapic_register_intr(int irq
 	    trigger == IOAPIC_LEVEL)
 		set_irq_chip_and_handler_name(irq, &ioapic_chip,
 					      handle_fasteoi_irq, "fasteoi");
-	else
+	else {
+		irq_desc[irq].status |= IRQ_DELAYED_DISABLE;
 		set_irq_chip_and_handler_name(irq, &ioapic_chip,
 					      handle_edge_irq, "edge");
+	}
 }

static void __init setup_IO_APIC_irqs(void)

2006-11-14 08:16:10

by Ingo Molnar

Subject: [patch] irq: do not mask interrupts by default

On Mon, 2006-11-13 at 14:11 -0700, Eric W. Biederman wrote:
> -	else
> +	else {
> +		irq_desc[irq].status |= IRQ_DELAYED_DISABLE;
> 		set_irq_chip_and_handler_name(irq, &ioapic_chip,
> 					      handle_edge_irq, "edge");
> +	}
>  }

yeah. Komuro, could you try my patch below - Eric's patch only updates
x86_64 while your failure was on the i386 kernel.

Note, i also took another approach to fix this problem, that should
cover both the case found by Komuro, and some other cases as well. The
theory is this:

1) disable_irq() is relatively rare (used in about 10% of drivers, but
there it's overwhelmingly used in some slowpath), so it's not
performance-critical.

2) missing an IRQ while the line is masked is often a lethal regression
to the user. An IRQ could be missed even if we think that the IRQ line
is 'level-triggered'.

so my patch changes the default irq-disable logic of /all/ controllers
to "delayed disable". (IRQ chips can still override this by providing a
different chip->disable method that just clones their ->mask method, if
it is absolutely sure that no IRQs can be lost while masked)

So this patch has the worst-case effect of getting at most one 'extra'
interrupt after the IRQ line has been 'disabled' - at which point the
line will be masked for real (by the flow handler). (I updated the
fasteoi and the simple irq flow handlers to mask the IRQ for real if an
IRQ triggers and the line was disabled.)

It's a bit late in the -rc cycle for a change like this, but i'm fairly
positive about it. I booted it on a couple of boxes and saw no badness.
(neither did i see any increase in IRQ rates)

Ingo

NOTE: this also means that the old IRQ_DELAYED_DISABLE bit can probably
be scrapped - i'll do that later on in a separate mail, if this patch
works out fine.

------------>
Subject: irq: do not mask interrupts by default
From: Ingo Molnar <[email protected]>

never mask interrupts immediately upon request. Disabling
interrupts in high-performance codepaths is rare, and on
the other hand this change could recover lost edges (or
even other types of lost interrupts) by conservatively
only masking interrupts after they happen. (NOTE: with
this change the highlevel irq-disable code still soft-disables
this IRQ line - and if such an interrupt happens then the
IRQ flow handler keeps the IRQ masked.)

Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/irq/chip.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)

Index: linux/kernel/irq/chip.c
===================================================================
--- linux.orig/kernel/irq/chip.c
+++ linux/kernel/irq/chip.c
@@ -202,10 +202,6 @@ static void default_enable(unsigned int
  */
 static void default_disable(unsigned int irq)
 {
-	struct irq_desc *desc = irq_desc + irq;
-
-	if (!(desc->status & IRQ_DELAYED_DISABLE))
-		desc->chip->mask(irq);
 }
 
 /*
@@ -272,8 +268,11 @@ handle_simple_irq(unsigned int irq, stru
 	kstat_cpu(cpu).irqs[irq]++;
 
 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED)))
+	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+		if (desc->chip->mask)
+			desc->chip->mask(irq);
 		goto out_unlock;
+	}
 
 	desc->status |= IRQ_INPROGRESS;
 	spin_unlock(&desc->lock);
@@ -366,11 +365,13 @@ handle_fasteoi_irq(unsigned int irq, str
 
 	/*
 	 * If its disabled or no action available
-	 * keep it masked and get out of here
+	 * then mask it and get out of here:
 	 */
 	action = desc->action;
 	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
 		desc->status |= IRQ_PENDING;
+		if (desc->chip->mask)
+			desc->chip->mask(irq);
 		goto out;
 	}



2006-11-14 08:20:40

by Arjan van de Ven

Subject: Re: [patch] irq: do not mask interrupts by default


> so my patch changes the default irq-disable logic of /all/ controllers
> to "delayed disable". (IRQ chips can still override this by providing a
> different chip->disable method that just clones their ->mask method, if
> it is absolutely sure that no IRQs can be lost while masked)
>
> So this patch has the worst-case effect of getting at most one 'extra'
> interrupt after the IRQ line has been 'disabled' - at which point the
> line will be masked for real (by the flow handler). (I updated the
> fasteoi and the simple irq flow handlers to mask the IRQ for real if an
> IRQ triggers and the line was disabled.)

since disable_irq() is used as locking against interrupt context by
several drivers (*cough* ne2000 *cough*) I am not entirely convinced
this is a good idea....



2006-11-14 12:43:26

by Komuro

Subject: Re: [patch] irq: do not mask interrupts by default

Dear Ingo

I tried your patch with 2.6.19-rc5.

The irq is generated properly.


Thanks!

Best Regards
Komuro



2006-11-14 16:14:17

by Linus Torvalds

Subject: Re: [patch] irq: do not mask interrupts by default



On Tue, 14 Nov 2006, Ingo Molnar wrote:
>
> 1) disable_irq() is relatively rare (used in about 10% of drivers, but
> there it's overwhelmingly used in some slowpath), so it's not
> performance-critical.

Well, the thing is, the _replay_ if it does happen, is going to be really
really slow compared to the masking. So at that point, it may well be a
net performance downside if the masking is going to almost always have an
interrupt happen while the thing is masked. I dunno.

There's another thing too:

For level-triggered interrupts, I _really_ don't think we should do this.
The code inside the masked region is sometimes "setup code", which will do
things that _will_ raise an interrupt, but may read the status register or
whatever to then unraise it. So in that case, your patch will generate
different behaviour, something that I really don't want to introduce at
this point in the 2.6.19 series.

> 2) missing an IRQ while the line is masked is often a lethal regression
> to the user. An IRQ could be missed even if we think that the IRQ line
> is 'level-triggered'.

If it's level-triggered, it's going to be missed only if it's de-asserted
by code inside the masked region, and that is what we have always done on
purpose, so "missing" it is the right thing to do. It's what we have
tested all level-triggered interrupts with for the last 15+ years, and
it's been part of the semantics for masking.

So I absolutely do _not_ think your change is improved semantics. It's new
semantics, and illogical. If the driver masked the irq line, did some
testing that raises and clears it again ("let's check if this version of
the chip raises the interrupt when we do XYZZY"), then the logical thing
to do would be to not cause the interrupt to happen.

Of course, for edge-triggered APIC interrupts, we _have_ to replay the irq
(since we don't have any way of even *knowing* whether we might get it
again), but for level-triggered and for the old legacy i8259 controller
that gets it right for edges anyway, we should _not_ send the spurious
interrupt that is no longer active.

And a lot of code has been tested with either just the i8259 (old machines
without any APIC) or with PCI-only devices (which are always level-
triggered), so the fact that edge-triggered things have always seen the
potential for spurious interrupts is not a reason to say "well, they have
to handle it anyway". True PCI drivers generally do _not_ have to handle
the crazy case, and generally have never seen it.

> so my patch changes the default irq-disable logic of /all/ controllers
> to "delayed disable". (IRQ chips can still override this by providing a
> different chip->disable method that just clones their ->mask method, if
> it is absolutely sure that no IRQs can be lost while masked)

I really think we should do this just for APIC edge triggered interrupts,
ie keep the old behaviour.

Also, I worry a bit about the patch:

> @@ -272,8 +268,11 @@ handle_simple_irq(unsigned int irq, stru
> 	kstat_cpu(cpu).irqs[irq]++;
>
> 	action = desc->action;
> -	if (unlikely(!action || (desc->status & IRQ_DISABLED)))
> +	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
> +		if (desc->chip->mask)
> +			desc->chip->mask(irq);
> 		goto out_unlock;
> +	}

The simple-irq case too? That's not even going to replay the thing? So now
you just mask (without replaying) simple irqs, but then the other irqs you
mask and replay.. See above on why I don't think this is necessarily a bug
(since masking is almost always the right thing _anyway_), but now it will
*STILL* depend on some internal implementation decision on whether the
replay happens at all. I'd much rather have the replay decision be based
on hard physical data: we replay _only_ for edge-triggered interrupts, and
_only_ for controllers that need it.

In other words, I think we should just make APIC-edge have the "please
delay masking and replay" bit, and nobody else.

Can you send that patch (for both x86 and x86-64), and we can ask Komuro
to test it. That would be the "same behaviour as we've always had" thing,
which I think is also the _right_ behaviour.

Linus

2006-11-14 17:53:14

by Eric W. Biederman

Subject: [PATCH] Use delayed disable mode of ioapic edge triggered interrupts


Linus Torvalds <[email protected]> writes:

> Of course, for edge-triggered APIC interrupts, we _have_ to replay the irq
> (since we don't have any way of even *knowing* whether we might get it
> again), but for level-triggered and for the old legacy i8259 controller
> that gets it right for edges anyway, we should _not_ send the spurious
> interrupt that is no longer active.
>
> And a lot of code has been tested with either just the i8259 (old machines
> without any APIC) or with PCI-only devices (which are always level-
> triggered), so the fact that edge-triggered things have always seen the
> potential for spurious interrupts is not a reason to say "well, they have
> to handle it anyway". True PCI drivers generally do _not_ have to handle
> the crazy case, and generally have never seen it.
>
> In other words, I think we should just make APIC-edge have the "please
> delay masking and replay" bit, and nobody else.
>
> Can you send that patch (for both x86 and x86-64), and we can ask Komuro
> to test it. That would be the "same behaviour as we've always had" thing,
> which I think is also the _right_ behaviour.

Hopefully this is the trivial patch that solves the problem.

Signed-off-by: Eric W. Biederman <[email protected]>

diff --git a/arch/i386/kernel/io_apic.c b/arch/i386/kernel/io_apic.c
index ad84bc2..3b7a63e 100644
--- a/arch/i386/kernel/io_apic.c
+++ b/arch/i386/kernel/io_apic.c
@@ -1287,9 +1287,11 @@ static void ioapic_register_intr(int irq
 	    trigger == IOAPIC_LEVEL)
 		set_irq_chip_and_handler_name(irq, &ioapic_chip,
 					      handle_fasteoi_irq, "fasteoi");
-	else
+	else {
+		irq_desc[irq].status |= IRQ_DELAYED_DISABLE;
 		set_irq_chip_and_handler_name(irq, &ioapic_chip,
 					      handle_edge_irq, "edge");
+	}
 	set_intr_gate(vector, interrupt[irq]);
 }

diff --git a/arch/x86_64/kernel/io_apic.c b/arch/x86_64/kernel/io_apic.c
index 41bfc49..14654e6 100644
--- a/arch/x86_64/kernel/io_apic.c
+++ b/arch/x86_64/kernel/io_apic.c
@@ -790,9 +790,11 @@ static void ioapic_register_intr(int irq
 	    trigger == IOAPIC_LEVEL)
 		set_irq_chip_and_handler_name(irq, &ioapic_chip,
 					      handle_fasteoi_irq, "fasteoi");
-	else
+	else {
+		irq_desc[irq].status |= IRQ_DELAYED_DISABLE;
 		set_irq_chip_and_handler_name(irq, &ioapic_chip,
 					      handle_edge_irq, "edge");
+	}
 }

static void __init setup_IO_APIC_irqs(void)

2006-11-14 23:38:43

by Linus Torvalds

Subject: Re: [PATCH] Use delayed disable mode of ioapic edge triggered interrupts



On Tue, 14 Nov 2006, Eric W. Biederman wrote:
>
> Hopefully this is the trivial patch that solves the problem.

Komuro, can you check this patch _instead_ of the one from Ingo (ie not
together with, since that combination won't tell us anything new - if
Ingo's patch is there too, the new patch will basically be a no-op).

Linus

2006-11-15 01:21:42

by Linus Torvalds

Subject: Re: [PATCH] Use delayed disable mode of ioapic edge triggered interrupts



On Tue, 14 Nov 2006, Eric W. Biederman wrote:
>
> Hopefully this is the trivial patch that solves the problem.

Ok, having looked more at this, I have to say that the whole
"IRQ_DELAYED_DISABLE" thing seems very fragile indeed.

It looks like we should do it not only for APIC edge-triggered
interrupts, but for HT and MSI interrupts too, as far as I can tell
(at least they also use the "handle_edge_irq" routine).

So I'm wondering how many other cases there are that are missing this.

In that sense, Ingo's patch was a lot safer, although I still dislike it
for all the other reasons I mentioned - it's simply wrong to re-send a
level-triggered irq.

I don't know MSI and HT interrupts well enough to tell whether they will
re-trigger on their own when we unmask them, but the point is, this
_looks_ like it might be incomplete.

I think part of the problem is a bad interface. We should simply never set
the IRQ handler on its own. It should be a field in the "irq_chip"
structure, and we should use _different_ irq chip structures for level and
edge-triggered. Then we should also add the "flags" thing there, and you
could do something like

static struct irq_chip level_ioapic_chip = {
..

instead of making the insane decision to use the "same" chip for all
ioapic things.

Ingo? Eric? Comments?

Linus

2006-11-15 05:15:51

by Eric W. Biederman

Subject: Re: [PATCH] Use delayed disable mode of ioapic edge triggered interrupts

Linus Torvalds <[email protected]> writes:

> On Tue, 14 Nov 2006, Eric W. Biederman wrote:
>>
>> Hopefully this is the trivial patch that solves the problem.
>
> Ok, having looked more at this, I have to say that the whole
> "IRQ_DELAYED_DISABLE" thing seems very fragile indeed.
>
> It looks like we should do it not only for APIC edge-triggered interrupts,
> but for HT and MSI interrupts too, as far as I can tell (at least they
> also use the "handle_edge_irq" routine).
>
> So I'm wondering how many other cases there are that are missing this.

I think it is a good question.

The big one I did not set it on is interrupts that come in
through ExtInt. I assume the 8259 is sane but I may be wrong.

> In that sense, Ingo's patch was a lot safer, although I still dislike it
> for all the other reasons I mentioned - it's simply wrong to re-send a
> level-triggered irq.
>
> I don't know MSI and HT interrupts well enough to tell whether they will
> re-trigger on their own when we unmask them, but the point is, this
> _looks_ like it might be incomplete.

Yes. I think there is an interrupt status bit there.
For at least one case in MSI we don't have a disable at all.

The truth is in practice I don't think it matters because I don't
think anyone actually disables MSI or hypertransport interrupts.

If it was going to change it would probably change per card.

But the real truth is that the hardware device knows what is going on.
The interrupt message is sent by the hardware device or it is not.
This isn't a case of whether we can detect an interrupt raised by the
device while the interrupt is disabled at the controller; with these
interrupts we disable the interrupt at the device itself. So I think
the whole question of detecting an interrupt raised by the device
while we have disabled interrupts on the device is silly.

So until I learn more I am going to assume that MSI and hypertransport
interrupts are sane like 8259 interrupts. If that makes sense.

> I think part of the problem is a bad interface. We should simply never set
> the IRQ handler on its own. It should be a field in the "irq_chip"
> structure, and we should use _different_ irq chip structures for level and
> edge-triggered. Then we should also add the "flags" thing there, and you
> could do something like
>
> static struct irq_chip level_ioapic_chip = {
> ..
>
> instead of making the insane decision to use the "same" chip for all
> ioapic things.

I think there is probably a sensible case for a separate structure.

At this point I have two questions.
- What is the easiest path to get us to a stable 2.6.19 where
everything works?

I don't think that is backing out genirq. But I haven't looked at all
of these corner cases.

I think for 2.6.19 we can get away with just my stupid patch, or
some simple variation of it.

- What is the sanest thing for long-term maintenance of irqs?

genirq is less code to maintain overall (a plus).
genirq helps us do things across architectures, which is nice.
genirq is also a little convoluted to read and to use (a downside).

My gut feel is that there is room for a lot more cleanup in this
area but we probably need to stabilize what we have.

Since you aren't complaining about what the code actually does but
rather how the interface looks, I have a proposal. I assert that
the interface for registering an irq is much too general and broad.


Instead of having:

irq_desc[irq].status |= IRQ_DELAYED_DISABLE;
set_irq_chip_and_handler_name(irq, &ioapic_chip,
handle_edge_irq, "edge");

We should have a set of helper functions one for each common type
of interrupt.

set_irq_edge_lossy(irq, &ioapic_chip);
set_irq_edge(irq, &ioapic_chip);
set_irq_level(irq, &ioapic_chip);

The more stupid parameters we have to set, the more likely
an implementor is to get it wrong.

Although I do agree to some extent it has been a bit of a strain
having both edge and level triggered interrupts with the same
methods. So if our goal is to make an even simpler interface
than what we have now I will be happy. Hopefully we can do all
of this in helper functions instead of having to rip up all of
the interrupt infrastructure one more time.

I really don't know. I'm tired and I want to see this code work.

Eric

2006-11-15 12:40:40

by Komuro

[permalink] [raw]
Subject: Re: [PATCH] Use delayed disable mode of ioapic edge triggered interrupts

Hi,

I tried Eric's patch instead of Ingo's
with 2.6.19-rc5.


The interrupt is generated properly.

Thanks!

Best Regards
Komuro


>
>Hopefully this is the trivial patch that solves the problem.
>
>Signed-off-by: Eric W. Biederman <[email protected]>
>
>diff --git a/arch/i386/kernel/io_apic.c b/arch/i386/kernel/io_apic.c
>index ad84bc2..3b7a63e 100644
>--- a/arch/i386/kernel/io_apic.c
>+++ b/arch/i386/kernel/io_apic.c
>@@ -1287,9 +1287,11 @@ static void ioapic_register_intr(int irq
> trigger == IOAPIC_LEVEL)
> set_irq_chip_and_handler_name(irq, &ioapic_chip,
> handle_fasteoi_irq, "fasteoi");
>- else
>+ else {
>+ irq_desc[irq].status |= IRQ_DELAYED_DISABLE;
> set_irq_chip_and_handler_name(irq, &ioapic_chip,
> handle_edge_irq, "edge");
>+ }
> set_intr_gate(vector, interrupt[irq]);
> }
>
>diff --git a/arch/x86_64/kernel/io_apic.c b/arch/x86_64/kernel/io_apic.c
>index 41bfc49..14654e6 100644
>--- a/arch/x86_64/kernel/io_apic.c
>+++ b/arch/x86_64/kernel/io_apic.c
>@@ -790,9 +790,11 @@ static void ioapic_register_intr(int irq
> trigger == IOAPIC_LEVEL)
> set_irq_chip_and_handler_name(irq, &ioapic_chip,
> handle_fasteoi_irq, "fasteoi");
>- else
>+ else {
>+ irq_desc[irq].status |= IRQ_DELAYED_DISABLE;
> set_irq_chip_and_handler_name(irq, &ioapic_chip,
> handle_edge_irq, "edge");
>+ }
> }
>
> static void __init setup_IO_APIC_irqs(void)
>

2006-11-15 16:09:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Use delayed disable mode of ioapic edge triggered interrupts



On Tue, 14 Nov 2006, Eric W. Biederman wrote:
>
> The big one I did not set it on is interrupts that come in
> through ExtInt. I assume the 8259 is sane but I may be wrong.

Yes, ExtInt is ok, if you actually mask it at the 8259. As mentioned
earlier in the thread, the i8259 has its edge detect logic _after_ the
masking logic, so if the irq is still active, and you unmask it, it will
see an edge, and re-assert the interrupt in hardware.

So the i8259 is a good interrupt controller, and does not need delayed
disable and software logic to re-assert the irq.

> The truth is in practice I don't think it matters because I don't
> think anyone actually disables MSI or hypertransport interrupts.

Fair enough, at least for a 2.6.19 kind of release timeframe (and that is
what I worry about most, at least right now).

> At this point I have two questions.
> - What is the easiest path to get us to a stable 2.6.19 where
> everything works?

If people don't expect HT and MSI interrupts to be masked (and I can well
imagine that), then I think your two-liner patch is good to go. Komuro
seems to have acked it already, and in many ways that's the "minimal
change" for 2.6.19 right now.

I do like Ingo's patch because it seems "safe" (even if I think it might
be a bit _overly_ safe), but it changes semantics enough that I don't like
it for 2.6.19. Even his second version definitely changes semantics for
level-triggered PCI interrupts, even though he fixed ExtInt/i8259 ones.

So I think I'll go with your patch for now, and we can re-visit Ingo's
thing after 2.6.19.

> - What is the sanest thing for long-term maintenance of irqs?
>
> genirq is less code to maintain overall (a plus).

Oh, I absolutely think genirq is the right thing to do. No question at
all. I just think that we might want to refactor the code somewhat, and in
particular I suspect that many irq controller drivers should use separate
"struct irq_chip" entries for edge and level, because they are
fundamentally different.

> My gut feel is that there is room for a lot more cleanup in this
> area but we probably need to stabilize what we have.

Exactly. Baby steps. Make it work. Then clean up. Slowly.

> Since you aren't complaining about what the code actually does but
> rather how the interface looks, I have a proposal. I assert that
> the interface for registering an irq is much too general and broad.
>
> Instead of having:
>
> irq_desc[irq].status |= IRQ_DELAYED_DISABLE;
> set_irq_chip_and_handler_name(irq, &ioapic_chip,
> handle_edge_irq, "edge");
>
> We should have a set of helper functions one for each common type
> of interrupt.
>
> set_irq_edge_lossy(irq, &ioapic_chip);
> set_irq_edge(irq, &ioapic_chip);
> set_irq_level(irq, &ioapic_chip);

Yeah, that might be a fine way too. That's largely what we do for the IO
schedulers, and it's been fairly successful. Start out by setting common
defaults, and then allow chip drivers to specify particular details
explicitly.

Ingo?

Linus

2006-11-15 16:59:09

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] Use delayed disable mode of ioapic edge triggered interrupts

Linus Torvalds <[email protected]> writes:

> On Tue, 14 Nov 2006, Eric W. Biederman wrote:
>
>> The truth is in practice I don't think it matters because I don't
>> think anyone actually disables MSI or hypertransport interrupts.
>
> Fair enough, at least for a 2.6.19 kind of release timeframe (and that is
> what I worry about most, at least right now).
>
>> At this point I have two questions.
>> - What is the easiest path to get us to a stable 2.6.19 where
>> everything works?
>
> If people don't expect HT and MSI interrupts to be masked (and I can well
> imagine that), then I think your two-liner patch is good to go. Komuro
> seems to have acked it already, and in many ways that's the "minimal
> change" for 2.6.19 right now.

Well, I just double-checked this assertion. The one driver that uses
the hypertransport irqs doesn't call disable_irq. On the msi side
at least the forcedeth driver does call disable_irq when in msi mode.

I just double-checked the historical behavior of the msi code and
it has never done the delayed disable thing. So not doing it there
is not a regression.

The MSI case is different. MSI is fundamentally about non-shared
interrupts, and interrupts that don't race with your DMAs. So with
MSI you don't need a status register read to process the interrupt.

In the context of Ingo's patch I don't like the idea of saddling MSI
interrupts with best-in-class workarounds for a completely
different hardware interrupt model. Although I don't doubt MSI will
get its own set of workarounds as we come to know it better.

> I do like Ingo's patch because it seems "safe" (even if I think it might
> be a bit _overly_ safe), but it changes semantics enough that I don't like
> it for 2.6.19. Even his second version definitely changes semantics for
> level-triggered PCI interrupts, even though he fixed ExtInt/i8259 ones.
>
> So I think I'll go with your patch for now, and we can re-visit Ingo's
> thing after 2.6.19.

Sounds like a plan.

Eric