2007-01-22 10:07:22

by Luigi Genoni

Subject: System crash after "No irq handler for vector" linux 2.6.19

Hi,
last night a Linux server with 8 dual-core Opteron 2600 MHz CPUs crashed just
after logging this message

Jan 22 04:48:28 frey kernel: do_IRQ: 1.98 No irq handler for vector

I have no other logs, and I possibly lost an oops since I have no netconsole
set up.

As I said, the system is running Linux 2.6.19 compiled with gcc 4.1.1 for AMD
Opteron (see the attached .config), with no kernel preemption except BKL
preemption. glibc 2.4.

System has 16 GB RAM and 8 dual-core 2600 MHz Opterons.

I am running irqbalance 0.55.

Any hints on what happened?

thanx

regards

Luigi Genoni


Attachments:
cpuinfo (10.26 kB)
lspci (19.83 kB)
interrupts (3.01 kB)
config.gz (9.09 kB)

2007-01-22 17:14:28

by Eric W. Biederman

Subject: Re: System crash after "No irq handler for vector" linux 2.6.19

"Luigi Genoni" <[email protected]> writes:

> (e-mail resent because not delivered using my other e-mail account)
>
> Hi,
> this night a linux server 8 dual core CPU Optern 2600Mhz crashed just after
> giving this message
>
> Jan 22 04:48:28 frey kernel: do_IRQ: 1.98 No irq handler for vector

Ok. This indicates that the hardware is doing something we didn't expect.
We don't know which irq the hardware was trying to deliver when it
sent vector 0x98 to cpu 1.

> I have no other logs, and I eventually lost the OOPS since I have no net
> console setled up.

If you had an oops, it may have meant the above message was a secondary
symptom. Groan. If the system stayed up long enough to give an oops, then
there is a chance the above message, appearing only once, had nothing
to do with the actual crash.

How long had the system been up?

> As I said sistem is running linux 2.6.19 compiled with gcc 4.1.1 for AMD
> Opteron (attached see .config), no kernel preemption excepted the BKL
> preemption. glibc 2.4.
>
> System has 16 GB RAM and 8 dual core Opteron 2600Mhz.
>
> I am running irqbalance 0.55.
>
> any hints on what has happened?

Three guesses.

- A race triggered by irq migration (but I would expect more people to be yelling).
The code path where that message comes from is new in 2.6.19 so it may not have
had all of the bugs found yet :(
- A weird hardware or BIOS setup.
- A secondary symptom triggered by some other bug.

If this winds up being reproducible we should be able to track it down.
If not, this may end up in the file of "crap: something bad happened that
we don't understand."

The one condition I know how to test for (if you are willing) is an
irq migration race: simply trigger irq migration much more often,
and thus increase our chances of hitting the problem.

Stopping irqbalance and running something like:

for irq in 0 24 28 29 44 45 60 68 ; do
        while :; do
                for mask in 1 2 4 8 10 20 40 80 100 200 400 800 1000 2000 4000 8000 ; do
                        echo $mask > /proc/irq/$irq/smp_affinity
                        sleep 1
                done
        done &
done

This should force every irq to migrate once a second, and removing the sleep 1
is even harsher, although we migrate at most once per irq received.

If some variation of the above loop does not trigger the do_IRQ "No irq
handler for vector" message, chances are it isn't a race in irq migration.

If we can rule out the race scenario, it will at least point us in the right
direction for guessing what went wrong with your box.

Eric

2007-01-23 10:19:16

by Luigi Genoni

Subject: Re: System crash after "No irq handler for vector" linux 2.6.19

Reproduced.
It took more or less one hour to reproduce it. I could reproduce it only by
also running irqbalance 0.55 and commenting out the sleep 1. The message in
syslog is the same, and then, after a few seconds I think, KABOM! System
crash and reboot.

I also tested a similar system that has 4 dual-core Opteron 2600 MHz CPUs. On
this system (Linux sees 8 CPUs, but it is the same kernel, same gcc, same
config, same glibc, same active services) I could not reproduce it, even
running irqbalance 0.55, in almost 1 hour. Maybe I could have reproduced it by
waiting longer, but my users need to do their work, so I could not have a
longer test window. So on 16 CPUs I had the crash; on 8 CPUs I had no crash.

I need to give the system back to the users, so if you need other tests,
please tell me as soon as possible.

thanx

Luigi Genoni


2007-01-23 10:36:16

by Luigi Genoni

Subject: Re: System crash after "No irq handler for vector" linux 2.6.19

On Mon, 22 Jan 2007, Eric W. Biederman wrote:

> "Luigi Genoni" <[email protected]> writes:
>
>> Hi,
>> this night a linux server 8 dual core CPU Optern 2600Mhz crashed just
after
>> giving this message
>>
>> Jan 22 04:48:28 frey kernel: do_IRQ: 1.98 No irq handler for vector
>
> Ok. This indicates that the hardware is doing something we didn't expect.
> We don't know which irq the hardware was trying to deliver when it
> sent vector 0x98 to cpu 1.
>
>> I have no other logs, and I eventually lost the OOPS since I have no net
>> console setled up.
>
> If you had an oops it may have meant the above message was a secondary
> symptom. Groan. If it stayed up long enough to give an OOPS then
> there is a chance the above message appearing only once had nothing
> to do with the actual crash.
>
>
> How long had the system been up?

Sorry, my English is bad, so I could not really express what I wanted to
say.

I didn't get an OOPS. I could not see one on the console (nor in the logs).
I do not have netconsole.

The system had been up and running for 52 days.

>
>> As I said sistem is running linux 2.6.19 compiled with gcc 4.1.1 for AMD
>> Opteron (attached see .config), no kernel preemption excepted the BKL
>> preemption. glibc 2.4.
>>
>> System has 16 GB RAM and 8 dual core Opteron 2600Mhz.
>>
>> I am running irqbalance 0.55.
>>
>> any hints on what has happened?
>
> Three guesses.
>
> - A race triggered by irq migration (but I would expect more people to be
>   yelling). The code path where that message comes from is new in 2.6.19
>   so it may not have had all of the bugs found yet :(
> - A weird hardware or BIOS setup.
> - A secondary symptom triggered by some other bug.
>
> If this winds up being reproducible we should be able to track it down.
> If not this may end up in the files of crap something bad happened that
> we don't understand.
>
> The one condition I know how to test for (if you are willing) is an
> irq migration race. Simply by triggering irq migration much more often,
> and thus increasing our chances of hitting a problem.

OK, I will try tomorrow morning (I am at home right now; in Italy it is night).

>
> Stopping irqbalance and running something like:
> for irq in 0 24 28 29 44 45 60 68 ; do
>         while :; do
>                 for mask in 1 2 4 8 10 20 40 80 100 200 400 800 1000 2000 4000 8000 ; do
>                         echo mask > /proc/irq/$irq/smp_affinity
>                         sleep 1
>                 done
>         done &
> done
>
> Should force every irq to migrate once a second, and removing the sleep 1
> is even harsher, although we max at one irq migration by irq received.
>
> If some variation of the above loop does not trigger the do_IRQ "No irq
> handler for vector" message chances are it isn't a race in irq migration.
>
> If we can rule out the race scenario it will at least put us in the
> right direction for guessing what went wrong with your box.

BTW, the server is running TIBCO (bwengine, imse and adr3; all the TIBCO
software is multithreaded), which uses multicast and stresses the network
cards continuously, sending and receiving multicast and UDP packets while
talking to a remote Oracle DB on eth2 (gigabit ethernet). TIBCO continuously
creates and deletes small files (often less than 1 KB, thousands per minute
when it has a lot of work to do) on three SAN LUNs (IBM DS8000) of 33 GB
each, accessed through two 2 Gbit fibre channel paths (Linux dm-multipath)
with LVM2 volumes (reiserfs, since it gives us the best performance with all
those small files, and has never given us trouble).

In reality the system sees 21 LUNs on two paths, and those LUNs are also seen
by 4 other Linux servers identical to this one, clustered together with this
system using ServiceGuard (which is userspace-only cluster software, no
kernel modules; eth0 and eth4 are heartbeat and eth6 is the card on the
public LAN). The volume group accessed by this server is reserved using LVM2
tags.

TIBCO also does normal I/O (about 2 MB/s) on a TCP NFSv3-mounted volume using
eth2. We have never had trouble with this NFS mount.

Client nfs v3 (after 10 hour of uptime):
null getattr setattr lookup access readlink
0 0% 2770320 78% 122 0% 546401 15% 19672 0% 133 0%
read write create mkdir symlink mknod
40673 1% 3338 0% 58 0% 0 0% 0 0% 0 0%
remove rmdir rename link readdir readdirplus
32 0% 0 0% 0 0% 0 0% 144213 4% 552 0%
fsstat fsinfo pathconf commit
316 0% 872 0% 0 0% 16 0%


So the network cards and SAN cards do work a lot. I include the boot
messages; maybe they will help you figure out how this server is configured.

What more... ah, usually the system is not working much during Sunday night
(not working at all, really), and it crashed on a Sunday night. Looking at
the sar output I see strange statistics...

sar -I SUM
          INTR    intr/s
04:40:01  sum     839.40
05:00:01  sum      62.10  <---
05:10:01  sum     245.63


sar -a
CPU %user %nice %system %iowait %idle
04:40:01 all 1.14 0.00 0.44 0.06 98.36
05:00:01 all 287.15 288.31 287.73 288.17 0.00 <---
05:10:01 all 0.00 0.00 0.00 0.01 99.98

but normally during the day:
10:30:01 all 88.41 0.00 0.81 0.05 10.73)

sar -W
          cswch/s
04:40:01          7738.69
05:00:01  478986467937.98  <---
05:10:01           194.36

(but normally during the day:
10:20:01 61264.85)

sar -b
          tps     rtps    wtps    bread/s  bwrtn/s
04:40:01  187.05    0.02  187.03     0.27  1777.81
05:00:01  106.24  111.47  106.29   108.40    59.48  <---
05:10:01    2.48    0.02    2.47     0.27    28.40

sar -c
          proc/s
04:40:01  0.34
05:00:01  478986468210.56  <--- this is absolutely abnormal
05:10:01  0.03

That's all I can say about the race scenario. If you need more tests done,
please tell me.

>
> Eric
>

Thanx

Luigi Genoni


Attachments:
boot.msg (45.81 kB)

2007-01-23 12:19:08

by Eric W. Biederman

Subject: Re: System crash after "No irq handler for vector" linux 2.6.19

"Luigi Genoni" <[email protected]> writes:

> reproduced.
> it took more or less one hour to reproduce it. I could reproduce it olny
> running also irqbalance 0.55 and commenting out the sleep 1. The message in
> syslog is the same and then, after a few seconds I think, KABOM! system crash
> and reboot.
>
> I tested also a similar system that has 4 dual core CPU Opteron 2600MHZ. On
> this system (linux sees 8 CPU, but it is the same kernel, same gcc, same
> config, same glibc, same active services) I could not reproduce it even
> running irqbalance 0.55 in almost 1 hour. Maybe I could reproduce it waiting
> for more time, but my users need to do their work, so I could not have a
> longer test window. So on 16 CPU I had the crash, on 8 CPU I had no crash.
>
> I need to give back the system to the users, so if you need other tests,
> please, tell me as soon.

Ok. Since it seems to be the irq code, I'm going to need to get a dump of
the state of the apics, basically the output of print_IO_APIC from the time
this happens.

I'm not really out of bed yet so no patch but a heads up. Once I've finished
sleeping I'll look at putting a debugging patch together.

Getting a boot log of the identical system that would not crash at 8 cpus
would be interesting. In part this is because the number of apic modes
available for use is much smaller when we have 8 or more cpus,
so it would be interesting to see if it is actually using the same code
for interrupt delivery.

If you have a few minutes to try it, it would be interesting to know
if forcing the migration from the command line (as I was suggesting)
would reproduce this faster than irqbalance.

Hopefully I can think up things to fix this when I wake up.

The time zone difference is going to be a pain :(

Eric

2007-01-31 08:40:09

by Eric W. Biederman

Subject: Re: System crash after "No irq handler for vector" linux 2.6.19

<[email protected]> writes:

> I have an interesting update, at least I suppose I have.

It was, at least as another data point.

> I do not know very well what happens with irq stuff migrating shared irq, but I
> suppose this has something to do with this crash.

The fact that the irq was shared should have no bearing on this crash
scenario. A shared irq is not at all helpful in a performance sense, but this
problem is low-level enough that a shared irq should have made no difference
at all, except for the frequency with which the interrupt fired and was
migrated. And a high interrupt frequency and a high migration rate do tend
to cause this problem.

> Right now I stopped irqbalance and puff! load average is back to normal, and
> under the same workload notthing similar is happening for the moment.

Yes. That sounds like a good work around until this problem is sorted out.

> Lesson number one I learnt: avoid shared IRQ on this systems (but to reconfigure
> HW cabling right now is not so easy).

Right. Because the only sharing should be because the traces on the
motherboard are shared.

> I hope this helps

It has all helped.

I have been tracking some easier problems, keeping this one on my back
burner. The good/bad news is that by restricting the set of vectors I can
choose from in the kernel, running ping -f to another machine, and migrating
the single irq for my NIC, I have been able to reproduce this in about 5
minutes.

I haven't root caused it yet, but the fact that I can reproduce this on a
dual-socket motherboard suggests that the reason it took you an hour to
reproduce the problem is simply that you had so few irqs on your system, and
that the extreme latency of 8-socket Opterons is not to blame.

Hopefully now that I can reproduce this I will be able to root cause
this and then fix this bug.

Eric

2007-02-01 06:00:20

by Eric W. Biederman

Subject: Re: System crash after "No irq handler for vector" linux 2.6.19

"Luigi Genoni" <[email protected]> writes:

> OK,
> willing to test any patch.

Sure. After I get things working on this end I will copy you on any fixes,
so you can confirm they work for you.

I am still root causing this but I have found a small fix that should
keep the system from going down when this problem occurs.

If you could confirm that it keeps your system from going down I'd
appreciate it.

Eric

2007-02-01 07:20:57

by Eric W. Biederman

Subject: Re: System crash after "No irq handler for vector" linux 2.6.19

"Luigi Genoni" <[email protected]> writes:

> OK,
> willing to test any patch.

Ok. I've finally figured out what is going on. The code is
race free but the programmer was an idiot.

In the local apic there are two relevant registers:
ISR (in-service register), describing all of the
interrupts that the cpu is in the process of handling;
IRR (interrupt request register), which lists all of
the interrupts that are currently pending.

Well, it happens that IRR is used to catch the case
where we are servicing an interrupt and that same interrupt
comes in again. When that happens, as soon as we are
done servicing the interrupt, that same interrupt fires again.

We perform interrupt migration in an interrupt handler, so
we can be race free.

It turns out that if I'm performing migration (updating all
of the data structures and hardware registers) while IRR
is set, the interrupt will happen at the old location immediately after
my migration work is complete. And since the kernel is not
set up to deal with it, we get an ugly error message.

Anyway now that I know what is going on I'm going to have to think
about this a little bit more to figure out how to fix this. My hunch
is the easy fix will be simply not to migrate until I have an
interrupt instance when IRR is clear.

Anyway, with a little luck, tomorrow I will be able to figure it out;
it's off to bed with me now.

Eric

2007-02-01 13:33:30

by Chris Rankin

[permalink] [raw]
Subject: Re: System crash after "No irq handler for vector" linux 2.6.19

> Ok. I've finally figured out what is going on. The code is race free but the programmer was an
> idiot.

Hi,

Could this IRQ problem account for this bug as well, please? Or is yours
strictly a 2.6.19.x issue?

http://bugzilla.kernel.org/show_bug.cgi?id=7847

I have a dual P4 Xeon box (HT enabled) so there is a lot of scope for IRQ migration. I was playing
WoW when this bug occurred, so there would have been a lot of IRQs needing handling between both
the video and sound cards.

Cheers,
Chris





2007-02-02 18:02:46

by Eric W. Biederman

Subject: Re: System crash after "No irq handler for vector" linux 2.6.19

"Luigi Genoni" <[email protected]> writes:

> I tested the patch, but I could not really stress the HW.
> anyway no crash, but load average is somehow abnormal, higher than it should
> be.

Thanks. Did you get any nasty messages about "No irq handler for vector"?
If not, then you never even hit the problem condition.

Given how rare the trigger condition is, it would probably take forced irq
migration to even trigger the "No irq handler for vector" message.

That patch I'm not really worried about: I've tested it, and it meets the
"obviously correct" condition :) But it should let people stop being afraid
of irq migration.

I'm slowly working my way towards a real fix. I know what I have to code,
now I just have to figure out how. :)

Eric

2007-02-02 18:32:41

by Eric W. Biederman

[permalink] [raw]
Subject: Re: System crash after "No irq handler for vector" linux 2.6.19

"Luigi Genoni" <[email protected]> writes:

> the message appeared just once, but no crash.
> anyway the load average was really abnormal.

Good. You tested it and it worked!

High load average is interesting, because it has causes similar to those of
the "No irq handler for vector" message, but technically they are completely
independent.

In practice surviving a high load average is a very useful property
though.

Eric

2007-02-03 00:40:37

by Eric W. Biederman

Subject: Re: System crash after "No irq handler for vector" linux 2.6.19


Luigi,

Unless you have a completely different cause I believe the patches I
just posted will fix the issue. If you can test and confirm this that
would be great.

Eric