2005-01-12 19:17:16

by Justin Piszcz

Subject: Question regarding ERR in /proc/interrupts.

Is there any way to log each ERR to a file, or a way to find out what caused
each ERR?

For example, I know this is the cause of a few of them:
spurious 8259A interrupt: IRQ7.

But that does not account for all 20. Is there any available option to do this?

$ cat /proc/interrupts
CPU0
0: 887759057 XT-PIC timer
1: 3138 XT-PIC i8042
2: 0 XT-PIC cascade
5: 5811 XT-PIC Crystal audio controller
9: 265081861 XT-PIC ide4, eth1, eth2
10: 9087912 XT-PIC ide6, ide7
11: 837707 XT-PIC ide2, ide3
12: 13854 XT-PIC i8042
14: 63373075 XT-PIC eth0
NMI: 0
ERR: 20
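
The kernel exposes only the running ERR total, so one userspace approach to "logging each ERR" is to poll /proc/interrupts and record when the total changes. The following is a minimal sketch under that assumption; the helper names, the log file name `err.log`, and the one-second polling interval are all hypothetical choices, and it can only say when the counter moved, not which device was responsible.

```python
import time


def read_err_count(text):
    """Return the integer on the 'ERR:' line of /proc/interrupts text, or None."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "ERR:":
            return int(fields[1])
    return None


def log_err_changes(logfile="err.log", interval=1.0, path="/proc/interrupts"):
    """Poll `path` forever and append a timestamped line whenever ERR increases.

    `logfile` and `interval` are illustrative defaults, not kernel settings.
    """
    with open(path) as f:
        last = read_err_count(f.read())
    while True:
        time.sleep(interval)
        with open(path) as f:
            now = read_err_count(f.read())
        if now is not None and last is not None and now > last:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            with open(logfile, "a") as out:
                out.write(f"{stamp} ERR {last} -> {now}\n")
        last = now
```

Calling `log_err_changes()` while running the disk and NFS transfers would at least timestamp each increment, which can then be compared against the kernel log.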


2005-01-12 19:40:52

by Randy.Dunlap

Subject: Re: Question regarding ERR in /proc/interrupts.

Justin Piszcz wrote:
> Is there anyway to log each ERR to a file or way to find out what caused
> each ERR?
>
> For example, I know this is the cause of a few of them:
> spurious 8259A interrupt: IRQ7.
>
> But not all 20, is there any available option to do this?

Are you sure about that?

MOTD: what kernel version?

2.6.10 (and probably all versions) prints such a message one time for
each "spurious" IRQ, sets a flag for that IRQ, and then doesn't
print the message for that IRQ any more (i.e., so that the
log isn't spammed). Each distinct spurious IRQ should be
logged (one time). If you want more, you'll need to patch
a source file and rebuild the kernel (attached, for i8259
PIC, not for APIC, since that's what you seem to have).
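
The print-once behaviour described above can be modelled in a few lines. This is a userspace Python sketch of the idea only, not the kernel's C code; the class and attribute names are hypothetical, and the per-IRQ flag is assumed to be kept as one bit per IRQ in a mask.

```python
class SpuriousIrqLog:
    """Toy model: count every spurious IRQ, but log each IRQ number only once."""

    def __init__(self):
        self.seen_mask = 0   # one bit per IRQ that has already been reported
        self.err_count = 0   # plays the role of the ERR total
        self.messages = []   # stands in for the kernel log

    def report(self, irq):
        bit = 1 << irq
        if not (self.seen_mask & bit):
            # First spurious interrupt on this IRQ: log it and set the flag.
            self.messages.append(f"spurious 8259A interrupt: IRQ{irq}.")
            self.seen_mask |= bit
        # The counter is bumped every time, logged or not.
        self.err_count += 1
```

In this model, four spurious interrupts on IRQs 7, 7, 15 and 7 produce two log messages and a count of four, which is why the ERR total can be far larger than the number of messages in the log.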

> $ cat /proc/interrupts
> CPU0
> 0: 887759057 XT-PIC timer
> 1: 3138 XT-PIC i8042
> 2: 0 XT-PIC cascade
> 5: 5811 XT-PIC Crystal audio controller
> 9: 265081861 XT-PIC ide4, eth1, eth2
> 10: 9087912 XT-PIC ide6, ide7
> 11: 837707 XT-PIC ide2, ide3
> 12: 13854 XT-PIC i8042
> 14: 63373075 XT-PIC eth0
> NMI: 0
> ERR: 20

--
~Randy


Attachments:
irq_err_msg.patch (749.00 B)

2005-01-12 20:01:43

by Justin Piszcz

Subject: Re: Re: Question regarding ERR in /proc/interrupts.

The kernel is 2.6.10.

The patch would effectively increment the ERR counter for each ERR,
correct?

Is there any way to trace the path or cause of an ERR?

For instance, I know I can make one occur like this:

I have 3 Promise boards in a box. When I am doing multiple transfers
across 2-3 drives and doing an NFS transfer, I may hear the IBM or Hitachi
disk click, and then either the ERR count will increment or there is just
a long pause. Also, I have used the IBM drive for 4-5+ yrs and never had
any data corruption. The disks themselves are not bad. It would just be
nice to understand why such spurious interrupts occur.

Dell Setup:

PCI SLOT 1 = PCI1

The PCI slots are on a riser board (Dell GX1p)

PCI1 = Closest to motherboard.

PCI1 = Intel GigE Nic
PCI2 = Promise ATA/100
PCI3 = Maxtor Promise ATA/133
PCI4 = Maxtor Promise ATA/133
PCI5 = 4 Port 10/100 NIC
ISA1 = Empty
ISA2 = Empty
ISA3 = Empty

Note: Nothing is attached to the system's IDE ports, they are disabled.
I also turned off ACPI/stuff I do not use.



On Wed, 12 Jan 2005, Randy.Dunlap wrote:

> Justin Piszcz wrote:
>> Is there anyway to log each ERR to a file or way to find out what caused
>> each ERR?
>>
>> For example, I know this is the cause of a few of them:
>> spurious 8259A interrupt: IRQ7.
>>
>> But not all 20, is there any available option to do this?
>
> Are you sure about that?
>
> MOTD: what kernel version?
>
> 2.6.10 (and probably all) prints such message one time for each
> "spurious" IRQ, sets a flag for that IRQ, and then doesn't
> print such message for that IRQ any more (i.e., so that
> log isn't spammed). Each distinct spurious IRQ should be
> logged (one time). If you want more, you'll need to patch
> a source file and rebuild the kernel (attached, for i8259
> PIC, not for APIC, since that's what you seem to have).
>
>> $ cat /proc/interrupts
>> CPU0
>> 0: 887759057 XT-PIC timer
>> 1: 3138 XT-PIC i8042
>> 2: 0 XT-PIC cascade
>> 5: 5811 XT-PIC Crystal audio controller
>> 9: 265081861 XT-PIC ide4, eth1, eth2
>> 10: 9087912 XT-PIC ide6, ide7
>> 11: 837707 XT-PIC ide2, ide3
>> 12: 13854 XT-PIC i8042
>> 14: 63373075 XT-PIC eth0
>> NMI: 0
>> ERR: 20
>
> --
> ~Randy
>

2005-01-12 20:26:21

by Randy.Dunlap

Subject: Re: Question regarding ERR in /proc/interrupts.

Justin Piszcz wrote:
> The kernel is 2.6.10.
>
> The patch would effectively increment the ERR counter for each ERR correct?

No, that is already incremented for each ERR; the patch just makes
each and every one of them be printed. (warning)

> Is there anyway to trace the path or cause of an ERR?

Just the interrupt number and hence what device it is used by
(in /proc/interrupts). However, the 8259 PIC reports spurious
interrupts on IRQ 7. That's "normal" for it.

> For instance, I know I can make one occur like this:
>
> I have 3 promise boards in a box, when I am doing multiple transfers
> across 2-3 drives and doing an NFS transfer, I may hear the IBM or
> Hitachi disk click and the ERR will increment or just a long pause.
> Also, I have used the IBM drive for 4-5+ yrs, never had any data
> corruption. The disks themselves are not bad. It would just be nice to
> understand why such spurious interrupts occur.

No idea, sorry. I've seen a few problems with riser boards (in
general, mostly timing related), but I don't know anything about
this one.

Did this start happening recently?

Have you tried asking the drives if they have any SMART data
(problems) logged?

> Dell Setup:
>
> PCI SLOT 1 = PCI1
>
> The PCI slots are on a riser board (Dell GX1p)
>
> PCI1 = Closest to motherboard.
>
> PCI1 = Intel GigE Nic
> PCI2 = Promise ATA/100
> PCI3 = Maxtor Promise ATA/133
> PCI4 = Maxtor Promise ATA/133
> PCI5 = 4 Port 10/100 NIC
> ISA1 = Empty
> ISA2 = Empty
> ISA3 = Empty
>
> Note: Nothing is attached to the system's IDE ports, they are disabled.
> I also turned off ACPI/stuff I do not use.
>
>
>
> On Wed, 12 Jan 2005, Randy.Dunlap wrote:
>
>> Justin Piszcz wrote:
>>
>>> Is there anyway to log each ERR to a file or way to find out what
>>> caused each ERR?
>>>
>>> For example, I know this is the cause of a few of them:
>>> spurious 8259A interrupt: IRQ7.
>>>
>>> But not all 20, is there any available option to do this?
>>
>>
>> Are you sure about that?
>>
>> MOTD: what kernel version?
>>
>> 2.6.10 (and probably all) prints such message one time for each
>> "spurious" IRQ, sets a flag for that IRQ, and then doesn't
>> print such message for that IRQ any more (i.e., so that
>> log isn't spammed). Each distinct spurious IRQ should be
>> logged (one time). If you want more, you'll need to patch
>> a source file and rebuild the kernel (attached, for i8259
>> PIC, not for APIC, since that's what you seem to have).
>>
>>> $ cat /proc/interrupts
>>> CPU0
>>> 0: 887759057 XT-PIC timer
>>> 1: 3138 XT-PIC i8042
>>> 2: 0 XT-PIC cascade
>>> 5: 5811 XT-PIC Crystal audio controller
>>> 9: 265081861 XT-PIC ide4, eth1, eth2
>>> 10: 9087912 XT-PIC ide6, ide7
>>> 11: 837707 XT-PIC ide2, ide3
>>> 12: 13854 XT-PIC i8042
>>> 14: 63373075 XT-PIC eth0
>>> NMI: 0
>>> ERR: 20

--
~Randy

2005-01-12 20:25:19

by linux-os

Subject: Re: Question regarding ERR in /proc/interrupts.

On Wed, 12 Jan 2005, Justin Piszcz wrote:

> Is there anyway to log each ERR to a file or way to find out what caused each
> ERR?
>
> For example, I know this is the cause of a few of them:
> spurious 8259A interrupt: IRQ7.
>
> But not all 20, is there any available option to do this?
>
> $ cat /proc/interrupts
> CPU0
> 0: 887759057 XT-PIC timer
> 1: 3138 XT-PIC i8042
> 2: 0 XT-PIC cascade
> 5: 5811 XT-PIC Crystal audio controller
> 9: 265081861 XT-PIC ide4, eth1, eth2
> 10: 9087912 XT-PIC ide6, ide7
> 11: 837707 XT-PIC ide2, ide3
> 12: 13854 XT-PIC i8042
> 14: 63373075 XT-PIC eth0
> NMI: 0
> ERR: 20
>

I'm not sure you really want to do that! The ERR value is a
spurious interrupt total. You will never learn where
it comes from because it comes from nowhere, which is
why it is called "spurious". Spurious interrupts are
really caused by the CPU, not a particular interrupt
controller. When the INT line is raised, the hardware
is supposed to put an address on the bus so the CPU can
branch to the handler (via some indirection). The
INT pin to the CPU is supposed to be manipulated
by a controller, either the PIC or IO-APIC.

Suppose the controller didn't raise an interrupt, but
the CPU thought it did. In that case, when the CPU signals
the controller to output the vector, the controller says:
"Dohhh... WTF. It's not me...". But the CPU needs some
address to complete the cycle, so the controller puts its
last, lowest-priority vector on the bus to complete the
cycle. The CPU branches to the code and the code checks
for a possible printer interrupt (IRQ7). If the printer
didn't signal, it used to write a nasty-gram to the log
before acknowledging the interrupt. Recent kernels only
write it once. However, the number of such instances
is totaled for your review. If you have a lot of them,
it generally means you have:

(1) Too much crosstalk on the motherboard.
(2) Power supplies out of specification.
(3) Too hot so timing gets skewed.
(4) Etc.

It's NEVER the interrupt controller! NEVER. The spurious
interrupt proves that the controller did its job by completing
the hardware handshake with the CPU. Don't kill the messenger.
It's just doing its job!

FYI, 20 spurious interrupts out of the bazillion shown isn't
too bad. It just shows that your hardware isn't perfect.
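
To put a number on that, the ERR total can be compared with the sum of the per-IRQ counts. A small sketch, assuming the single-CPU layout posted in this thread (only the first count column is read); the function name is hypothetical.

```python
SAMPLE = """\
CPU0
0: 887759057 XT-PIC timer
1: 3138 XT-PIC i8042
2: 0 XT-PIC cascade
5: 5811 XT-PIC Crystal audio controller
9: 265081861 XT-PIC ide4, eth1, eth2
10: 9087912 XT-PIC ide6, ide7
11: 837707 XT-PIC ide2, ide3
12: 13854 XT-PIC i8042
14: 63373075 XT-PIC eth0
NMI: 0
ERR: 20
"""


def spurious_fraction(text):
    """Return (err, total, err/total) for single-CPU /proc/interrupts text."""
    total = 0
    err = 0
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # header line such as "CPU0"
        label = fields[0].rstrip(":")
        if label.isdigit():
            total += int(fields[1])  # numbered IRQ line: add CPU0's count
        elif label == "ERR":
            err = int(fields[1])
    return err, total, (err / total if total else 0.0)


err, total, fraction = spurious_fraction(SAMPLE)
print(f"{err} spurious out of {total} interrupts ({fraction:.2e})")
```

For the table posted above this works out to roughly one spurious interrupt per 61 million serviced.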

Cheers,
Dick Johnson
Penguin : Linux version 2.6.10 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by Dictator Bush.
98.36% of all statistics are fiction.