2007-08-02 17:03:42

by Gabriel C

[permalink] [raw]
Subject: Re: [patch] genirq: temporary fix for level-triggered IRQ resend

Ingo Molnar wrote:
> Linus,
>
> with -rc2 approaching i think we should apply the minimal fix below to
> get Marcin's ne2k-pci networking back in working order. The
> WARN_ON_ONCE() will not prevent the system from working and it will be a
> reminder.
>
> a better workaround would be to inhibit the resent vector via the
> IO-APIC irqchip - but i'd still like to have the patch below because the
> ne2k driver _should_ be able to survive the spurious irq that happens.
> (even on Marcin's system that ne2k-pci irq line is shared with another
> networking card, so an irq could happen at any moment - it's just that
> with the delayed-disable logic it happens _all the time_.)
>

I get a warning on each boot now with this patch ..

[ 63.686613] WARNING: at kernel/irq/resend.c:70 check_irq_resend()
[ 63.686636] [<c013c55c>] check_irq_resend+0x8c/0xa0
[ 63.686653] [<c013c15f>] enable_irq+0xad/0xb3
[ 63.686662] [<e886481e>] vortex_timer+0x20c/0x3d5 [3c59x]
[ 63.686675] [<c01164b9>] scheduler_tick+0x154/0x273
[ 63.686685] [<c012fed1>] getnstimeofday+0x34/0xe3
[ 63.686697] [<c0121f4a>] run_timer_softirq+0x137/0x197
[ 63.686709] [<e8864612>] vortex_timer+0x0/0x3d5 [3c59x]
[ 63.686720] [<c011ed09>] __do_softirq+0x75/0xe1
[ 63.686729] [<c011edac>] do_softirq+0x37/0x3d
[ 63.686735] [<c011ef85>] irq_exit+0x7c/0x7e
[ 63.686740] [<c010e013>] smp_apic_timer_interrupt+0x59/0x84
[ 63.686751] [<c0103428>] apic_timer_interrupt+0x28/0x30
[ 63.686759] [<c0101355>] default_idle+0x0/0x3f
[ 63.686767] [<c0101385>] default_idle+0x30/0x3f
[ 63.686773] [<c0100c19>] cpu_idle+0x5e/0x8e
[ 63.686779] [<c03fdc5f>] start_kernel+0x2d7/0x368


That means ?:)


> Ingo
>


Gabriel


2007-08-02 20:11:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] genirq: temporary fix for level-triggered IRQ resend


* Gabriel C <[email protected]> wrote:

> I get a warning on each boot now with this patch ..
>
> [ 63.686613] WARNING: at kernel/irq/resend.c:70 check_irq_resend()
> [ 63.686636] [<c013c55c>] check_irq_resend+0x8c/0xa0
> [ 63.686653] [<c013c15f>] enable_irq+0xad/0xb3
> [ 63.686662] [<e886481e>] vortex_timer+0x20c/0x3d5 [3c59x]
> [ 63.686675] [<c01164b9>] scheduler_tick+0x154/0x273
> [ 63.686685] [<c012fed1>] getnstimeofday+0x34/0xe3
> [ 63.686697] [<c0121f4a>] run_timer_softirq+0x137/0x197
> [ 63.686709] [<e8864612>] vortex_timer+0x0/0x3d5 [3c59x]
> [ 63.686720] [<c011ed09>] __do_softirq+0x75/0xe1
> [ 63.686729] [<c011edac>] do_softirq+0x37/0x3d
> [ 63.686735] [<c011ef85>] irq_exit+0x7c/0x7e
> [ 63.686740] [<c010e013>] smp_apic_timer_interrupt+0x59/0x84
> [ 63.686751] [<c0103428>] apic_timer_interrupt+0x28/0x30
> [ 63.686759] [<c0101355>] default_idle+0x0/0x3f
> [ 63.686767] [<c0101385>] default_idle+0x30/0x3f
> [ 63.686773] [<c0100c19>] cpu_idle+0x5e/0x8e
> [ 63.686779] [<c03fdc5f>] start_kernel+0x2d7/0x368
>
>
> That means ?:)

if your network still works fine then you can ignore it :-)

we are still trying to figure out what happens with ne2k-pci. The
message will vanish soon.

Ingo

2007-08-03 05:58:24

by Jarek Poplawski

[permalink] [raw]
Subject: [patch] genirq: fix simple and fasteoi irq handlers

On Thu, Aug 02, 2007 at 10:11:26PM +0200, Ingo Molnar wrote:
>
> * Gabriel C <[email protected]> wrote:
>
> > I get a warning on each boot now with this patch ..
> >
> > [ 63.686613] WARNING: at kernel/irq/resend.c:70 check_irq_resend()
...
> we are still trying to figure out what happens with ne2k-pci. The
> message will vanish soon.

Hi,

I can't guarantee this is all needed to fix this bug, but I think
this patch is necessary here.

Regards,
Jarek P.
------------>

Subject: genirq: fix simple and fasteoi irq handlers

After the "genirq: do not mask interrupts by default" patch interrupts
should be disabled not immediately upon request, but after they happen.
But, handle_simple_irq() and handle_fasteoi_irq() can skip this once or
more if an irq is just serviced (IRQ_INPROGRESS), possibly disrupting a
driver's work.

The main reason of problems here, pointing the broken patch and making
the first patch which can fix this was done by Marcin Slusarz.
Additional test patches of Thomas Gleixner and Ingo Molnar tested by
Marcin Slusarz helped to narrow possible reasons even more. Thanks.

PS: this patch fixes only one evident error here, but there could be
more places affected by above-mentioned change in irq handling.


Signed-off-by: Jarek Poplawski <[email protected]>

---

diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
--- 2.6.23-rc1-/kernel/irq/chip.c 2007-07-09 01:32:17.000000000 +0200
+++ 2.6.23-rc1/kernel/irq/chip.c 2007-08-02 20:42:38.000000000 +0200
@@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru

spin_lock(&desc->lock);

- if (unlikely(desc->status & IRQ_INPROGRESS))
- goto out_unlock;
kstat_cpu(cpu).irqs[irq]++;

action = desc->action;
- if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+ if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+ IRQ_DISABLED)))) {
if (desc->chip->mask)
desc->chip->mask(irq);
desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
@@ -392,18 +391,16 @@ handle_fasteoi_irq(unsigned int irq, str

spin_lock(&desc->lock);

- if (unlikely(desc->status & IRQ_INPROGRESS))
- goto out;
-
desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
kstat_cpu(cpu).irqs[irq]++;

/*
- * If its disabled or no action available
+ * If it's running, disabled or no action available
* then mask it and get out of here:
*/
action = desc->action;
- if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+ if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+ IRQ_DISABLED)))) {
desc->status |= IRQ_PENDING;
if (desc->chip->mask)
desc->chip->mask(irq);

2007-08-03 08:04:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] genirq: fix simple and fasteoi irq handlers


* Jarek Poplawski <[email protected]> wrote:

> I can't guarantee this is all needed to fix this bug, but I think this
> patch is necessary here.

hmmm ... very interesting! Now _this_ is something we'd like to see
tested. Could you send a patch to Marcin that also undoes the workaround
we have in place now, so that he could check whether ne2k-pci works fine
with your fix alone?

Ingo

2007-08-03 08:47:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] genirq: fix simple and fasteoi irq handlers


* Ingo Molnar <[email protected]> wrote:

> * Jarek Poplawski <[email protected]> wrote:
>
> > I can't guarantee this is all needed to fix this bug, but I think
> > this patch is necessary here.
>
> hmmm ... very interesting! Now _this_ is something we'd like to see
> tested. Could you send a patch to Marcin that also undoes the
> workaround we have in place now, so that he could check whether
> ne2k-pci works fine with your fix alone?

or it would be nice if Marcin could test pure 2.6.22 plus your fix
(without any other patches applied).

Ingo

2007-08-03 09:00:50

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [patch] genirq: fix simple and fasteoi irq handlers

On Fri, Aug 03, 2007 at 10:04:08AM +0200, Ingo Molnar wrote:
>
> * Jarek Poplawski <[email protected]> wrote:
>
> > I can't guarantee this is all needed to fix this bug, but I think this
> > patch is necessary here.
>
> hmmm ... very interesting! Now _this_ is something we'd like to see
> tested. Could you send a patch to Marcin that also undoes the workaround
> we have in place now, so that he could check whether ne2k-pci works fine
> with your fix alone?

I'm not sure this is needed... Marcin got this patch, I hope, and I
don't have another possibility to contact with him. Since he managed
with this bisection and all the previous patches I don't think there
could be any problems, so:

Marcin! I'd be very glad if you could test this patch alone; this
should apply without any problems to 2.6.21 (with some offset) and
later "vanilla" versions (or try to revert Ingo's last patch with
patch -p1 -R). Please, contact me on any problems (alas not during
the weekend...).

Thanks,
Jarek P.

PS: of course, I'm very curious of this testing too, but, on the other
hand, as I've written earlier, I think this patch is needed for logical
reasons only, and it really doesn't look like it could make any damage
here.

2007-08-03 11:57:16

by Marcin Ślusarz

[permalink] [raw]
Subject: Re: [patch] genirq: fix simple and fasteoi irq handlers

2007/8/3, Jarek Poplawski <[email protected]>:
> On Fri, Aug 03, 2007 at 10:04:08AM +0200, Ingo Molnar wrote:
> >
> > * Jarek Poplawski <[email protected]> wrote:
> >
> > > I can't guarantee this is all needed to fix this bug, but I think this
> > > patch is necessary here.
> >
> > hmmm ... very interesting! Now _this_ is something we'd like to see
> > tested. Could you send a patch to Marcin that also undoes the workaround
> > we have in place now, so that he could check whether ne2k-pci works fine
> > with your fix alone?
>
> I'm not sure this is needed... Marcin got this patch, I hope, and I
> don't have another possibility to contact with him. Since he managed
> with this bisection and all the previous patches I don't think there
> could be any problems, so:
>
> Marcin! I'd be very glad if you could test this patch alone; this
> should apply without any problems to 2.6.21 (with some offset) and
> later "vanilla" versions (or try to revert Ingo's last patch with
> patch -p1 -R). Please, contact me on any problems (alas not during
> the weekend...).

I'll test this patch tomorrow (and confirm that the last one from Ingo
works fine) and report results on monday (sorry, no internet at home
since I moved out of city :|).

Marcin

2007-08-03 12:26:46

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [patch] genirq: fix simple and fasteoi irq handlers

On Fri, Aug 03, 2007 at 01:57:00PM +0200, Marcin ?lusarz wrote:
...
> I'll test this patch tomorrow (and confirm that the last one from Ingo
> works fine) and report results on monday (sorry, no internet at home
> since I moved out of city :|).

So, you are a lucky guy! I have only no internet at home.
...and time for dreaming about moving out of a city...

Cheers,
Jarek P.

2007-08-06 06:07:14

by Jarek Poplawski

[permalink] [raw]
Subject: [patch (take 2)] genirq: fix simple and fasteoi irq handlers

Hi,

Sorry, guys! I tried to be so logical that I forgot to unmask my eyes.

Regards,
Jarek P.
------------>
(take 2)

Subject: genirq: fix simple and fasteoi irq handlers

After the "genirq: do not mask interrupts by default" patch interrupts
should be disabled not immediately upon request, but after they happen.
But, handle_simple_irq() and handle_fasteoi_irq() can skip this once or
more if an irq is just serviced (IRQ_INPROGRESS), possibly disrupting a
driver's work.

The main reason of problems here, pointing the broken patch and making
the first patch which can fix this was done by Marcin Slusarz.
Additional test patches of Thomas Gleixner and Ingo Molnar tested by
Marcin Slusarz helped to narrow possible reasons even more. Thanks.

PS: this patch fixes only one evident error here, but there could be
more places affected by above-mentioned change in irq handling.

PS 2:
After rethinking, IMHO, there are two most probable scenarios here:

1. After hw resend there could be a conflict between retriggered
edge type irq and the next level type one: e.g. if this level type
irq (io_apic is enabled then) is triggered while retriggered irq is
serviced (IRQ_INPROGRESS) there is goto out with eoi, and probably
the next such levels are triggered and looping, so probably kind of
flood in io_apic until this retriggered edge service has ended.
2. There is something wrong with ioapic_retrigger_irq (less probable
because this should be probably seen with 'normal' edge retriggers,
but on the other hand, they could be less common).

So, if there is #1, this fixed patch should work.

But, since level types don't need this retriggers too much I think
this "don't mask interrupts by default" idea should be rethinked:
is there enough gain to risk such hard to diagnose errors?

So, IMHO, there should be at least possibility to turn this off for
level types in config (it should be a visible option, so people could
find & try this before writing for help or changing a network card).


Signed-off-by: Jarek Poplawski <[email protected]>

---

diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
--- 2.6.23-rc1-/kernel/irq/chip.c 2007-07-09 01:32:17.000000000 +0200
+++ 2.6.23-rc1/kernel/irq/chip.c 2007-08-05 21:49:46.000000000 +0200
@@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru

spin_lock(&desc->lock);

- if (unlikely(desc->status & IRQ_INPROGRESS))
- goto out_unlock;
kstat_cpu(cpu).irqs[irq]++;

action = desc->action;
- if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+ if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+ IRQ_DISABLED)))) {
if (desc->chip->mask)
desc->chip->mask(irq);
desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
@@ -318,6 +317,8 @@ handle_simple_irq(unsigned int irq, stru

spin_lock(&desc->lock);
desc->status &= ~IRQ_INPROGRESS;
+ if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+ desc->chip->unmask(irq);
out_unlock:
spin_unlock(&desc->lock);
}
@@ -392,18 +393,16 @@ handle_fasteoi_irq(unsigned int irq, str

spin_lock(&desc->lock);

- if (unlikely(desc->status & IRQ_INPROGRESS))
- goto out;
-
desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
kstat_cpu(cpu).irqs[irq]++;

/*
- * If its disabled or no action available
+ * If it's running, disabled or no action available
* then mask it and get out of here:
*/
action = desc->action;
- if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+ if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+ IRQ_DISABLED)))) {
desc->status |= IRQ_PENDING;
if (desc->chip->mask)
desc->chip->mask(irq);
@@ -420,6 +419,8 @@ handle_fasteoi_irq(unsigned int irq, str

spin_lock(&desc->lock);
desc->status &= ~IRQ_INPROGRESS;
+ if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+ desc->chip->unmask(irq);
out:
desc->chip->eoi(irq);

2007-08-06 06:15:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch (take 2)] genirq: fix simple and fasteoi irq handlers


* Jarek Poplawski <[email protected]> wrote:

> Subject: genirq: fix simple and fasteoi irq handlers
>
> After the "genirq: do not mask interrupts by default" patch interrupts
> should be disabled not immediately upon request, but after they
> happen. But, handle_simple_irq() and handle_fasteoi_irq() can skip
> this once or more if an irq is just serviced (IRQ_INPROGRESS),
> possibly disrupting a driver's work.

nice fix. I think this is exactly the type of bug we were hoping to be
able to identify and fix, and it could explain the regression in its
entirety. The big question - does it fix Marcin's regression?

Ingo

2007-08-06 07:06:12

by Marcin Ślusarz

[permalink] [raw]
Subject: Re: [patch] genirq: fix simple and fasteoi irq handlers

2007/8/3, Jarek Poplawski <[email protected]>:
> Hi,
>
> I can't guarantee this is all needed to fix this bug, but I think
> this patch is necessary here.
>
> Regards,
> Jarek P.
> ------------>
>
> Subject: genirq: fix simple and fasteoi irq handlers
>
> After the "genirq: do not mask interrupts by default" patch interrupts
> should be disabled not immediately upon request, but after they happen.
> But, handle_simple_irq() and handle_fasteoi_irq() can skip this once or
> more if an irq is just serviced (IRQ_INPROGRESS), possibly disrupting a
> driver's work.
>
> The main reason of problems here, pointing the broken patch and making
> the first patch which can fix this was done by Marcin Slusarz.
> Additional test patches of Thomas Gleixner and Ingo Molnar tested by
> Marcin Slusarz helped to narrow possible reasons even more. Thanks.
>
> PS: this patch fixes only one evident error here, but there could be
> more places affected by above-mentioned change in irq handling.
>
>
> Signed-off-by: Jarek Poplawski <[email protected]>
>
> ---
>
> diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
> --- 2.6.23-rc1-/kernel/irq/chip.c 2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.23-rc1/kernel/irq/chip.c 2007-08-02 20:42:38.000000000 +0200
> @@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru
>
> spin_lock(&desc->lock);
>
> - if (unlikely(desc->status & IRQ_INPROGRESS))
> - goto out_unlock;
> kstat_cpu(cpu).irqs[irq]++;
>
> action = desc->action;
> - if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
> + if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
> + IRQ_DISABLED)))) {
> if (desc->chip->mask)
> desc->chip->mask(irq);
> desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
> @@ -392,18 +391,16 @@ handle_fasteoi_irq(unsigned int irq, str
>
> spin_lock(&desc->lock);
>
> - if (unlikely(desc->status & IRQ_INPROGRESS))
> - goto out;
> -
> desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
> kstat_cpu(cpu).irqs[irq]++;
>
> /*
> - * If its disabled or no action available
> + * If it's running, disabled or no action available
> * then mask it and get out of here:
> */
> action = desc->action;
> - if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
> + if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
> + IRQ_DISABLED)))) {
> desc->status |= IRQ_PENDING;
> if (desc->chip->mask)
> desc->chip->mask(irq);
>
This patch didn't fix my NIC (tried on 2.6.22.1)

Marcin

2007-08-06 07:08:17

by Marcin Ślusarz

[permalink] [raw]
Subject: Re: [patch (take 2)] genirq: fix simple and fasteoi irq handlers

2007/8/6, Ingo Molnar <[email protected]>:
>
> * Jarek Poplawski <[email protected]> wrote:
>
> > Subject: genirq: fix simple and fasteoi irq handlers
> >
> > After the "genirq: do not mask interrupts by default" patch interrupts
> > should be disabled not immediately upon request, but after they
> > happen. But, handle_simple_irq() and handle_fasteoi_irq() can skip
> > this once or more if an irq is just serviced (IRQ_INPROGRESS),
> > possibly disrupting a driver's work.
>
> nice fix. I think this is exactly the type of bug we were hoping to be
> able to identify and fix, and it could explain the regression in its
> entirety. The big question - does it fix Marcin's regression?
I'll try this patch tomorrow.

Marcin

2007-08-06 07:19:21

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [patch (take 2)] genirq: fix simple and fasteoi irq handlers

On Mon, Aug 06, 2007 at 08:14:59AM +0200, Ingo Molnar wrote:
>
> * Jarek Poplawski <[email protected]> wrote:
>
> > Subject: genirq: fix simple and fasteoi irq handlers
> >
> > After the "genirq: do not mask interrupts by default" patch interrupts
> > should be disabled not immediately upon request, but after they
> > happen. But, handle_simple_irq() and handle_fasteoi_irq() can skip
> > this once or more if an irq is just serviced (IRQ_INPROGRESS),
> > possibly disrupting a driver's work.
>
> nice fix. I think this is exactly the type of bug we were hoping to be
> able to identify and fix, and it could explain the regression in its
> entirety. The big question - does it fix Marcin's regression?

Alas, there still could be something more... To be more sure, even
with this, there should be some debug printk (which could mess too),
but I don't know how much patience (and similar boxes...) Marcin has.
Of course, this "temporary fix" in -rc2 should give us more time.
But, I think you should confirm this gain with levels (I mean there
could be some saving on flag setting/ checking too). E.g. I've thought
about adding another ioapic_chip struct for fasteoi without .retrigger
(and maybe with .disable = .mask) maybe with some #ifdef CONFIG_...,
but maybe there could be reconsidered IRQ_DELAYED_DISABLE too (but
with this, there probably was a possibility to run this hw ->retrigger
'by chance' too, so with some strange IO-APICS there would be still
an unnecessary risk here).

The big question for me is still why this isn't more common: it seems
some (most of?) IO-APICS have some safety against this?

BTW: Marcin, if you're still willing to test anything (and your box is
alive after my previous 'could not make any damage' patch - sorry!),
this should be done with something before -rc2, so 2.6.22 or .23-rc1.

Jarek P.

PS: I've just read Marcin's messages - so, happily, it seems
everybody's alive! Thanks.