Hi all,
First of all, I know the Abit BP6 is infamous for its APIC, but I
would like to make sure there's absolutely no solution for this except
disabling the APIC.
I have been experiencing problems for a long time now, which are always
related to the NIC in the box (probably due to it being a device that
generates a lot of interrupts). The NIC has changed a couple of times
(from 3com 10 Mbit to Intel eepro100 to 3Com PCI 3c905B Cyclone
100baseTx now), and it's NOT placed in the infamous (I believe 3rd) PCI
slot of the board (mentioned in the manual). Also, /proc/interrupts
shows NO sharing with
another device. The running kernel is 2.4.19-pre8-ac5 SMP, though many
kernels have preceded it, with the same results.
The problems appear once in a while (on the order of days/weeks). They
always start with an "unexpected IRQ trap at vector 7d", followed
within a minute by chaos in the network driver. I found the message of
the 3com driver to be the clearest one, see the snippet below. When I
boot with "noapic", the problems go away.
Is there a solution that does not require disabling the APIC as a whole
or is this just too flaky hardware?
Thanks in advance,
- Robbert Kouprie
PS. Please CC me in answers, as I'm not on the list.
Jun 12 23:47:56 radium kernel: unexpected IRQ trap at vector 7d
Jun 12 23:47:56 radium kernel: unexpected IRQ trap at vector 7d
Jun 12 23:48:29 radium kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jun 12 23:48:29 radium kernel: eth0: transmit timed out, tx_status 00 status e681.
Jun 12 23:48:29 radium kernel: diagnostics: net 0cf2 media 8880 dma 0000003a fifo 8800
Jun 12 23:48:29 radium kernel: eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
Jun 12 23:48:29 radium kernel: Flags; bus-master 1, dirty 16264012(12) current 16264012(12)
Jun 12 23:48:29 radium kernel: Transmit list 00000000 vs. c133e500.
Jun 12 23:48:29 radium kernel: 0: @c133e200 length 800005cc status 000105cc
Jun 12 23:48:29 radium kernel: 1: @c133e240 length 800005cc status 000105cc
Jun 12 23:48:29 radium kernel: 2: @c133e280 length 800005cc status 000105cc
Jun 12 23:48:29 radium kernel: 3: @c133e2c0 length 800005cc status 000105cc
Jun 12 23:48:29 radium kernel: 4: @c133e300 length 8000001e status 0001001e
Jun 12 23:48:29 radium kernel: 5: @c133e340 length 800005cc status 000105cc
Jun 12 23:48:29 radium kernel: 6: @c133e380 length 80000045 status 00010045
Jun 12 23:48:29 radium kernel: 7: @c133e3c0 length 80000045 status 00010045
Jun 12 23:48:29 radium kernel: 8: @c133e400 length 80000045 status 00010045
Jun 12 23:48:29 radium kernel: 9: @c133e440 length 80000045 status 00010045
Jun 12 23:48:29 radium kernel: 10: @c133e480 length 80000045 status 80010045
Jun 12 23:48:29 radium kernel: 11: @c133e4c0 length 800005cc status 800105cc
Jun 12 23:48:29 radium kernel: 12: @c133e500 length 80000042 status 00010042
Jun 12 23:48:29 radium kernel: 13: @c133e540 length 80000042 status 00010042
Jun 12 23:48:29 radium kernel: 14: @c133e580 length 80000042 status 00010042
Jun 12 23:48:29 radium kernel: 15: @c133e5c0 length 80000042 status 00010042
Jun 12 23:48:29 radium kernel: eth0: Resetting the Tx ring pointer.
Jun 12 23:48:29 radium kernel: NETDEV WATCHDOG: eth0: transmit timed out
Raphael Manfredi wrote:
> What I also know is that it works for me. ;-)
I just triggered the bug using a couple of simultaneous "ping -f -s 10
host" commands, and the patched kernel indeed recovers from the bug with
a "kernel: Kicking IO-APIC IRQ 17:" message :)
Now if only we could call the recovery mechanism from the point where
the "unexpected IRQ trap at vector" message gets printed (in
arch/i386/kernel/irq.c:ack_none), then we would have a lot more generic
code for all kinds of devices. If we then surround it by an #ifdef
CONFIG_BROKEN_APIC like Helge suggested, there's more chance this patch
gets accepted.
Problem now is, in the ack_none function we only know about the
(illegal) vector we are getting, and not about the interrupt we need to
reset. Could there be some kind of link between these, so that
kick_IO_APIC_irq can be called from there?
- Robbert
On Tue, 18 Jun 2002, Robbert Kouprie wrote:
> Raphael Manfredi wrote:
>
> I just triggered the bug using a couple of simultaneous "ping -f -s 10
> host" commands, and the patched kernel indeed recovers from the bug with
> a "kernel: Kicking IO-APIC IRQ 17:" message :)
> Now if only we could call the recovery mechanism from the point where
> the "unexpected IRQ trap at vector" message gets printed (in
> arch/i386/kernel/irq.c:ack_none), then we would have a lot more generic
> code for all kinds of devices. If we then surround it by an #ifdef
> CONFIG_BROKEN_APIC like Helge suggested, there's more chance this patch
> gets accepted.
>
> Problem now is, in the ack_none function we only know about the
> (illegal) vector we are getting, and not about the interrupt we need to
> reset. Could there be some kind of link between these, so that
> kick_IO_APIC_irq can be called from there?
Interesting, I haven't come across this problem before, but it sounds like
the vector isn't getting delivered when the interrupt gets asserted and
only gets triggered later followed by an EOI... or something. Then again,
it's probably been beaten around a couple of times by now, so I'm probably
not adding anything new.
arch/i386/kernel/io_apic.c:irq_vector seems to be what you're looking for.
Good luck,
Zwane Mwaikambo
--
http://function.linuxpower.ca
On Tue, 18 Jun 2002, Robbert Kouprie wrote:
> Problem now is, in the ack_none function we only know about the
> (illegal) vector we are getting, and not about the interrupt we need to
> reset. Could there be some kind of link between these, so that
> kick_IO_APIC_irq can be called from there?
You get an invalid vector delivered due to massive transmission errors on
the inter-APIC bus. The errors are a serious hardware problem that cannot
and should not be fixed in software.
I'm told getting a better PSU may help, though.
--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +
> The errors are a serious hardware problem that cannot
> and should not be fixed in software.
I know the hardware sucks bad, but what's wrong with trying to work
around the problem, provided no one else is bugged by the workaround?
> I'm told getting a better PSU may help, though.
The box already has an (overkill) 300W PSU, yet I'm still seeing
problems.
- Robbert Kouprie
On Wed, 19 Jun 2002 15:23:13 +0200,
"Robbert Kouprie" <[email protected]> wrote:
>I know the hardware sucks bad, but what's wrong with trying to work
>around the problem, provided no one else is bugged by the workaround?
You do not have the data required to (a) detect the problem and (b)
recover even if you could detect the problem. The APIC bus has a
single-bit checksum; the APIC hardware detects single-bit errors and
does a retransmission. It _cannot_ detect double-bit errors; the bad
data is accepted and processed with undefined side effects.
What you see in the logs for a BP6 are error messages for single bit
errors that were recovered by the hardware. You will never see
messages for double bit errors, just unexplained oops and/or machine
hangs.
Yes, I have a BP6 :(.
On Wed, 19 Jun 2002, Robbert Kouprie wrote:
> I know the hardware sucks bad, but what's wrong with trying to work
> around the problem, provided no one else is bugged by the workaround?
The reliability of the hardware is next to null. You are not able to
recover from that. You may succeed in recovering from a subset of malformed
APIC messages, especially as losing an interrupt is often negligible, but
sooner or later you'll get hit by a corrupted IPI message, such as a TLB
flush, and the system will either crash or produce wrong results. Note
that APIC hardware already performs consistency checks on messages
exchanged, which are capable of filtering out damaged ones. If the hardware
fails to do that, the message must have been seriously damaged.
Similarly you wouldn't like to work around occasional corruptions on your
host data bus, would you?
> The box already has an (overkill) 300W PSU, but yet I'm still seeing
> problems.
Too bad.
--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +
On Thu, 20 Jun 2002, Keith Owens wrote:
> You do not have the data required to (a) detect the problem and (b)
> recover even if you could detect the problem. The APIC bus has a
> single bit checksum, the APIC hardware detects single bit errors and
> does a retransmission. It _cannot_ detect double bit errors, the bad
> data is accepted and processed with undefined side effects.
Thanks to the way the checksum is calculated (a two-bit cumulative sum),
about 75% of double-bit errors are detected as well.
--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +
Keith Owens wrote:
> You do not have the data required to (a) detect the problem and (b)
> recover even if you could detect the problem.
Maciej W. Rozycki wrote:
> The reliability of the hardware is next to null. You are not able to
> recover from that.
Okay, you guys convinced me that some hardware can suck *really* bad. I
think I'm just going to stop my effort on this, stay with Raphael
Manfredi's hack to avoid most of the hangs on my BP6 for now, and get a
new board ASAP.
Thanks for all the help,
- Robbert Kouprie
Robbert Kouprie wrote:
>
> Hi all,
>
> First of all, I know the Abit BP6 is infamous for its APIC, but I
> would like to make sure there's absolutely no solution for this except
> disabling the APIC.
>
> I have been experiencing problems for a long time now, which are always
> related to the NIC in the box (probably due to it being a device that
> generates a lot of interrupts). The NIC has changed a couple of times
> (from 3com 10 Mbit to Intel eepro100 to 3Com PCI 3c905B Cyclone
> 100baseTx now), and it's NOT placed in the infamous (I believe 3rd) PCI
> slot of the board (mentioned in the manual). Also, /proc/interrupts
> shows NO sharing with
> another device. The running kernel is 2.4.19-pre8-ac5 SMP, though many
> kernels have preceded it, with the same results.
>
> The problems appear once in a while (on the order of days/weeks). They
> always start with an "unexpected IRQ trap at vector 7d", followed
> within a minute by chaos in the network driver. I found the message of
> the 3com driver to be the clearest one, see the snippet below. When I
> boot with "noapic", the problems go away.
>
> Is there a solution that does not require disabling the APIC as a whole
> or is this just too flaky hardware?
>
> Thanks in advance,
> - Robbert Kouprie
>
> PS. Please CC me in answers, as I'm not on the list.
>
> Jun 12 23:47:56 radium kernel: unexpected IRQ trap at vector 7d
> Jun 12 23:47:56 radium kernel: unexpected IRQ trap at vector 7d
It _can_ be solved - rebooting cures it, so assuming the problem
is autodetectable it _can_ be solved by doing whatever it is
a reboot (or driver reload) does to the APIC.
My guess is that the APIC setup for that IRQ has to be reprogrammed.
You could do that as a quirk for the BP6.
The first question is if there is a reliable way to detect this
condition. "No interrupts from a device" could simply mean that
it isn't used much at the time. You get an unexpected IRQ trap - does
the problem always manifest itself this way?
The second question is if all the PCI card drivers out there
survive a lost interrupt handled outside the driver.
If not, you have to close+reopen the device, and that involves
userspace.
A network card will need reinitialization, a disk controller
remounting...
Helge Hafting
Helge Hafting wrote:
> > Jun 12 23:47:56 radium kernel: unexpected IRQ trap at vector 7d
> > Jun 12 23:47:56 radium kernel: unexpected IRQ trap at vector 7d
>
> It _can_ be solved - rebooting cures it, so assuming the problem
> is autodetectable it _can_ be solved by doing whatever it is
> a reboot (or driver reload) does to the APIC.
True.
> My guess is that the APIC setup for that IRQ has to be reprogrammed.
> You could do that as a quirk for the BP6.
> The first question is if there is a reliable way to detect this
> condition. "No interrupts from a device" could simply mean that
> it isn't used much at the time. You get an unexpected IRQ trap - does
> the problem always manifest itself this way?
Yes, I always get the "unexpected IRQ trap at vector 7d" message. This
is the same message even with different NICs (though they were placed in
the same PCI slot). About 30-120 seconds after this message (depending
on some driver timeout value I guess) the NETDEV watchdog kicks in with
an "eth0: transmit timed out".
> The second question is if all the PCI card drivers out there
> survive a lost interrupt handled outside the driver.
> If not, you have to close+reopen the device, and that involves
> userspace.
> A network card will need reinitialization, a disk controller
> remounting...
That could indeed be a problem. But this will become clear pretty soon
once this APIC reprogramming workaround is actually implemented in the
kernel. Then I will be able to test that. Any ideas what this workaround
in the kernel would look like?
Thanks for the help,
- Robbert Kouprie
Robbert Kouprie wrote:
> That could indeed be a problem. But this will become clear pretty soon
> once this APIC reprogramming workaround is actually implemented in the
> kernel. Then I will be able to test that. Any ideas what this workaround
> in the kernel would look like?
Not many. Take a look at what happens in the kernel
when a PCI device driver allocates an irq, and what happens
when it releases it.
What you probably have to do is release the (broken) irq
without disturbing the driver's internal data. Then
claim it again immediately on behalf of the driver. You
have now treated the APIC the same way as a close/open does.
No interrupt from that device should happen in the middle
of this - but you should be fine as the irq supposedly is dead.
And this is something you'll have to do wherever the error
is detected, i.e. near the code that prints that message.
Helge Hafting
Quoting Robbert Kouprie <[email protected]> from ml.linux.kernel:
:I have been experiencing problems for a long time now, which are always
:related to the NIC in the box (probably due to it being a device that
:generates a lot of interrupts). The NIC has changed a couple of times
:(from 3com 10 Mbit to Intel eepro100 to 3Com PCI 3c905B Cyclone
:100baseTx now), and it's NOT placed in the infamous (I believe 3rd) PCI
:slot of the board (mentioned in the manual). Also, /proc/interrupts
:shows NO sharing with
:another device. The running kernel is 2.4.19-pre8-ac5 SMP, though many
:kernels have preceded it, with the same results.
Here's my own solution for it, in an old article. I've been running
with this patch since then, and transmit timeouts have never been a problem.
I run 2.4.18-pre7 nowadays, and the patch below applied without problem.
Raphael
------------------------------------------------------------------------
From: Raphael Manfredi <[email protected]>
To: [email protected]
Subject: [PATCH] Recover from lockups after "eth0: transmit timed out"
Date: Thu, 08 Nov 2001 19:42:05 +0100
Message-ID: <[email protected]>
This is my second take at the fix. I've tested it on my ABIT BP6 with
linux 2.4.12-ac3, but it should apply fine on more recent versions.
I've verified that I indeed recovered from a timeout situation where
I had to reboot before.
The fix assumes that the "NETDEV WATCHDOG" will only run when there is
an APIC, so it's OK to call apic routines. If this assumption is wrong,
then could someone more knowledgeable than me protect the call correctly
so we don't address missing hardware?
This fix is driver-independent, contrary to my first fix. It's also
shorter, as it re-uses existing macros in io_apic.c instead of expanding
them.
Please apply, and if rejected, let me know why.
Raphael
--- linux-2.4.12-ac3/arch/i386/kernel/io_apic.c.orig Mon Oct 29 19:34:42 2001
+++ linux-2.4.12-ac3/arch/i386/kernel/io_apic.c Sun Nov 4 15:53:05 2001
@@ -1616,3 +1616,35 @@
check_timer();
print_IO_APIC();
}
+
+/*
+ * The purpose of this routine is to recover from hopeless situations,
+ * where the IO-APIC level interrupt no longer happens, despite the use
+ * of end_level_ioapic_irq().
+ *
+ * This happens mainly with Ethernet cards under heavy network traffic,
+ * on boxes with streams of APIC errors. The visible symptom is a message:
+ *
+ * NETDEV WATCHDOG: eth0: transmit timed out
+ *
+ * At this point, a driver-specific TX timeout routine is called. Upon
+ * return, the watchdog calls:
+ *
+ * kick_IO_APIC_irq(dev->irq)
+ *
+ * to re-enable the interrupt source, or the machine may be stuck without
+ * network, until rebooted.
+ *
+ * Idea was suggested by Manfred Spraul, implemented by Raphael Manfredi.
+ */
+void kick_IO_APIC_irq(int irq)
+{
+	printk(KERN_CRIT "Kicking IO-APIC IRQ %d:\n", irq);
+
+	spin_lock(&ioapic_lock);
+	__mask_and_edge_IO_APIC_irq(irq);
+	udelay(10);
+	__unmask_and_level_IO_APIC_irq(irq);
+	spin_unlock(&ioapic_lock);
+}
+
--- linux-2.4.12-ac3/net/sched/sch_generic.c.orig Sun Nov 4 15:47:10 2001
+++ linux-2.4.12-ac3/net/sched/sch_generic.c Sun Nov 4 15:51:14 2001
@@ -153,6 +153,7 @@
(jiffies - dev->trans_start) > dev->watchdog_timeo) {
printk(KERN_INFO "NETDEV WATCHDOG: %s: transmit timed out\n", dev->name);
dev->tx_timeout(dev);
+ kick_IO_APIC_irq(dev->irq); /* Added by RAM */
}
if (!mod_timer(&dev->watchdog_timer, jiffies + dev->watchdog_timeo))
dev_hold(dev);
Raphael Manfredi wrote:
> Here's my own solution for it, in an old article. I've been running
> with this patch since then, and transmit timeouts have never been a
> problem.
>
> I run 2.4.18-pre7 nowadays, and the patch below applied without
> problem.
Thanks very much! This looks very promising. I just patched
2.4.19-pre10-ac2 with it and booted it up on my BP6. I will report back
any failure or success of APIC kicking ;)
BTW, did you get any explanation why this wasn't applied in -ac or main
kernel?
- Robbert Kouprie
Quoting Robbert Kouprie <[email protected]> from ml.linux.kernel:
:BTW, did you get any explanation why this wasn't applied in -ac or main
:kernel?
None.
But I know that this patch is dirty because it attacks a hardware-dependent
layer from a rather generic one. This may be why it's rejected. And it
may also be completely APIC-BP6 specific.
What I also know is that it works for me. ;-)
Raphael