Hi,
I am using for my Internet-Gateway a dual Pentium MMX 200Mhz with a
Gigabyte 586DX Motherboard (with the Intel 430HX Chipset). The last year
I used Linux-2.2.16,2.2.17 with it and had several hangs of the network
and ISDN subsystem.
The network dies like this (when copying huge amount of data):
Feb 3 11:58:03 violin kernel: eth0: Interrupt posted but not delivered
-- IRQ blocked by another device?
Feb 3 11:58:03 violin kernel: Flags; bus-master 1, full 0; dirty 16
current 16.
Feb 3 11:58:03 violin kernel: Transmit list 00000000 vs. c13dfa00.
Feb 3 11:58:03 violin kernel: 0: @c13dfa00 length 8000006e status
0001006e
...
and the isdn subsystem like this:
Jan 27 00:41:15 violin kernel: isdn_tx_timeout dev ippp0 dialstate 0
Jan 27 00:41:28 violin kernel: ippp0: dialing 2 194040...
Jan 27 00:41:28 violin kernel: isdn: HiSax,ch0 cause: E001B
Although there is no direct hint to an APIC problem, I read in several
newsgroup articles that these two errors refer to APIC errors.
They system is still usable after such an error, only that eth0/isdn is
not accessible, even if I reload the modules. The only solution
is a reboot.
Well - some days ago I tried to switch to 2.4.3, hoping that these
errors will be gone then. The first thing that I noticed was that I got
thousands of lines like this:
Apr 22 16:19:31 violin kernel: APIC error on CPU0: 04(00)
At first I ignored it and started to test the stability by copying huge
amounts of data via NFS over eth0. After around 500MB (and 45000 APIC
Errors!) the isdn subsystem died:
Apr 18 16:32:12 violin kernel: isdn_tx_timeout dev ippp0 dialstate 0
Apr 18 16:32:12 violin kernel: ippp0: all channels busy - requeuing!
...
After around 10 minutes the whole
system crashed, leaving this in the syslog:
Apr 18 16:59:15 violin kernel: APIC error on CPU1: 02(00)
Apr 18 16:59:15 violin kernel: APIC error on CPU0?230?:'217?8226^C^@^@^
A^@H^@R*??f^M?j225L`c۽235?^A^@^@^@203$^A^@?!^C9?!^C9?????
?8^A^@+??8,??8?230?:"217?8^^^K^@^@^A^@H^@a??2009?B1kp^BĬ:?
...
I did a little statistic of the occurence of the APIC-errors:
APIC error on CPU0: 01(00) 650
APIC error on CPU0: 02(00) 9037
APIC error on CPU0: 04(00) 916
APIC error on CPU0: 06(00) 1
APIC error on CPU0: 06(04) 1
APIC error on CPU0: 08(00) 5369
APIC error on CPU0: 08(02) 1
APIC error on CPU0: 09(00) 3
APIC error on CPU0: 0a(00) 4
APIC error on CPU0: 0c(00) 1
APIC error on CPU0: 40(00) 60
APIC error on CPU0: 48(00) 23
APIC error on CPU1: 00(00) 13
APIC error on CPU1: 01(00) 5398
APIC error on CPU1: 02(00) 9533
APIC error on CPU1: 02(02) 5
APIC error on CPU1: 04(00) 6861
APIC error on CPU1: 08(00) 7836
APIC error on CPU1: 08(01) 7
APIC error on CPU1: 08(02) 1
APIC error on CPU1: 08(04) 1
APIC error on CPU1: 08(08) 1
APIC error on CPU1: 09(00) 5
APIC error on CPU1: 0a(00) 17
APIC error on CPU1: 0c(00) 23
Following the advice of Donald Becker he gave in some newsgroup I
restarted the
kernel with the "noapic" parameter. The strange thing is that the APIC
errors are still there, at least there are a lot less than before,
moreover the system seems slower but at least more stable. BTW, why are
there still APIC errors although there are no interrupts assigned to
CPU1 (as seen in /proc/interrupts).
I next tried to find out what triggers these APIC errors:
Without "noapic" kernel parameter:
The Errors are triggered by a certain amount of interrupts, whatever
device produces interrupts.
With "noapic":
It seems as if those errors are mostly triggered by NFS. When I copy the
same
amount of data with FTP, there are a lot less Errors. (E.g. for 500MB
there
are 40 with NFS and only 2 with FTP).
What I wonder is why linux outputs a line like this (with noapic):
<4>Intel MultiProcessor Specification v1.1
<4> Virtual Wire compatibility mode.
although the board seems to be capable of MPS 1.4 (as there is a Bios
option "MPS 1.4 for single Processor).
The following Hardware is in the system:
3com905b, ISDN AVM Fritz!, 128MB RAM, IBM 36GB HD, some SCSI-Devices
(HD,CDROM,Tape in an external case) and a very old monochrome graphics
card. Perhaps this old graphics adapter is a problem?
It would be somehow sad to give away the board or use it as a single CPU
board, so do you perhaps have any clue of how to get rid of these
problems?
If you need any further information that would help to fix this, so
please tell me.
Best Regards,
Hermann Himmelbauer
--
,_,
(O,O) "There is more to life than increasing its speed."
( ) -- Gandhi
-"-"--------------------------------------------------------------
On Sat, Apr 21, 2001 at 03:07:22PM +0200, Hermann Himmelbauer wrote:
> Hi,
> I am using for my Internet-Gateway a dual Pentium MMX 200Mhz with a
> Gigabyte 586DX Motherboard (with the Intel 430HX Chipset). The last year
> I used Linux-2.2.16,2.2.17 with it and had several hangs of the network
> and ISDN subsystem.
>
> The network dies like this (when copying huge amount of data):
> Feb 3 11:58:03 violin kernel: eth0: Interrupt posted but not delivered
> -- IRQ blocked by another device?
> Feb 3 11:58:03 violin kernel: Flags; bus-master 1, full 0; dirty 16
> current 16.
> Feb 3 11:58:03 violin kernel: Transmit list 00000000 vs. c13dfa00.
> Feb 3 11:58:03 violin kernel: 0: @c13dfa00 length 8000006e status
> 0001006e
> ...
>
> and the isdn subsystem like this:
> Jan 27 00:41:15 violin kernel: isdn_tx_timeout dev ippp0 dialstate 0
> Jan 27 00:41:28 violin kernel: ippp0: dialing 2 194040...
> Jan 27 00:41:28 violin kernel: isdn: HiSax,ch0 cause: E001B
>
> Although there is no direct hint to an APIC problem, I read in several
> newsgroup articles that these two errors refer to APIC errors.
For the ISDN one:
E001B - EURO ISDN cause Out of order mean, that here is no answer from
the exchange while trying to establish a D-channel L2 connection.
This may be have various reasons: broken cable, wrong addresses, no
IRQs. The no IRQ may (but don't must) related to APIC errors.
I have here the same board with 2*233 MMX and don't see this kind of ISDN
error on recent 2.2 kernels, but got also lot of APIC errors with the
2.3/2.4, because the APIC errors are only reported in 2.3/4.
> They system is still usable after such an error, only that eth0/isdn is
> not accessible, even if I reload the modules. The only solution
> is a reboot.
>
> Well - some days ago I tried to switch to 2.4.3, hoping that these
> errors will be gone then. The first thing that I noticed was that I got
> thousands of lines like this:
>
> Apr 22 16:19:31 violin kernel: APIC error on CPU0: 04(00)
No the kernel cannot change this, since it is a hardware problem.
The GA586DX is known that it produce lot of checksum errors on the APIC
bus, in 2.4 these are reported in 2.2 they are simple ignored, but also
here. These errors itself are not a problem since the APIC bus detect
it and recover, but if here are double errors in a way that the checksum
is OK, the APIC may run in trouble.
> Errors!) the isdn subsystem died:
> Apr 18 16:32:12 violin kernel: isdn_tx_timeout dev ippp0 dialstate 0
> Apr 18 16:32:12 violin kernel: ippp0: all channels busy - requeuing!
Yes that is also a hint that the IRQ of the card is blocked.
> Following the advice of Donald Becker he gave in some newsgroup I
> restarted the
> kernel with the "noapic" parameter. The strange thing is that the APIC
> errors are still there, at least there are a lot less than before,
> moreover the system seems slower but at least more stable. BTW, why are
> there still APIC errors although there are no interrupts assigned to
> CPU1 (as seen in /proc/interrupts).
>
Yes, no APIC means all IRQ are handled by one CPU only, so communication
errors about IRQ events on the APIC bus don't care.
> I next tried to find out what triggers these APIC errors:
>
> Without "noapic" kernel parameter:
> The Errors are triggered by a certain amount of interrupts, whatever
> device produces interrupts.
>
> With "noapic":
> It seems as if those errors are mostly triggered by NFS. When I copy the
> same
> amount of data with FTP, there are a lot less Errors. (E.g. for 500MB
> there
> are 40 with NFS and only 2 with FTP).
I don't know all kinds of events the APIC bus is used for, it is not only
for the IRQs.
> What I wonder is why linux outputs a line like this (with noapic):
> <4>Intel MultiProcessor Specification v1.1
> <4> Virtual Wire compatibility mode.
>
> although the board seems to be capable of MPS 1.4 (as there is a Bios
> option "MPS 1.4 for single Processor).
>
One or 2 years ago I was playing with these options, it seemed that setting
it to 1.1 reduce the error count a little bit, but this maybe a
misinterpretation.
--
Karsten Keil
SuSE Labs
ISDN development
> here. These errors itself are not a problem since the APIC bus detect
> it and recover, but if here are double errors in a way that the checksum
> is OK, the APIC may run in trouble.
Also nothing but recent -ac kernels in the 2.4 range handle the replay of
IPI's sometimes caused by this. That patch is a post 2.4.4 thing to sort out.
> I don't know all kinds of events the APIC bus is used for, it is not only
> for the IRQs.
Interrupts from I/O devices and interrupts sent between processors. The latter
are used to tell the other cpus to do things like flush TLB entries, change
an MTRR value etc
Alan
Karsten Keil wrote:
>
> I have here the same board with 2*233 MMX and don't see this kind of ISDN
> error on recent 2.2 kernels, but got also lot of APIC errors with the
> 2.3/2.4, because the APIC errors are only reported in 2.3/4.
Right - same behavior here, no APIC errors with 2.2 (as they are not
reported). The ISDN error happens very seldom (4 times last year) and is
not reproducable - which is not so with the eth0 errors (as eth0 locks
at around 500-1000MB while copying data).
> > kernel with the "noapic" parameter. The strange thing is that the APIC
> > errors are still there, at least there are a lot less than before,
> > moreover the system seems slower but at least more stable. BTW, why are
> > there still APIC errors although there are no interrupts assigned to
> > CPU1 (as seen in /proc/interrupts).
> >
>
> Yes, no APIC means all IRQ are handled by one CPU only, so communication
> errors about IRQ events on the APIC bus don't care.
Hmmm, so does that mean that those checksum errors have no effect on the
stability of my system?
> > What I wonder is why linux outputs a line like this (with noapic):
> > <4>Intel MultiProcessor Specification v1.1
> > <4> Virtual Wire compatibility mode.
> >
> > although the board seems to be capable of MPS 1.4 (as there is a Bios
> > option "MPS 1.4 for single Processor).
> >
>
> One or 2 years ago I was playing with these options, it seemed that setting
> it to 1.1 reduce the error count a little bit, but this maybe a
> misinterpretation.
How did you do that? The BIOS Option only enables the use of MPS 1.4 for
single CPU but I could not find an option for switching between 1.1/1.4.
Is there a way to force the Linux kernel to use 1.4?
Many thanks for your quick answer!
Best Regards,
Hermann
--
,_,
(O,O) "There is more to life than increasing its speed."
( ) -- Gandhi
-"-"--------------------------------------------------------------
> > Yes, no APIC means all IRQ are handled by one CPU only, so communication
> > errors about IRQ events on the APIC bus don't care.
>
> Hmmm, so does that mean that those checksum errors have no effect on the
> stability of my system?
If you have a lot of them eventually it will get you.
> How did you do that? The BIOS Option only enables the use of MPS 1.4 for
> single CPU but I could not find an option for switching between 1.1/1.4.
> Is there a way to force the Linux kernel to use 1.4?
It can only use what the BIOS offers.
Alan Cox wrote:
>
> > here. These errors itself are not a problem since the APIC bus detect
> > it and recover, but if here are double errors in a way that the checksum
> > is OK, the APIC may run in trouble.
>
> Also nothing but recent -ac kernels in the 2.4 range handle the replay of
> IPI's sometimes caused by this. That patch is a post 2.4.4 thing to sort out.
Hmmm, that's a little too technical for me ;-)
Does that mean that this patch would perhaps increase the stability of
my board as this code tries to prevent those double errors?
If yes, where could I get this patch to try it out?
What do you think of the following suggestion:
-Implement two runtime kernel variables like
/proc/sys/kernel/print_apic_errors
This would simply disable those "APIC error" kernel logs, so that the
logfile is not flooded. (45000 log entries in 1 hour are quite a lot).
Anyway once you know that your board has this problem, IMHO there is no
further use in those messages.
/proc/sys/kernel/enable_apic
The second one would enable/disable the APIC code for testing purposes -
like the "noapic" parameter during boottime. But as I have no knowledge
about those kernel internals, perhaps this wish is impossible to
implement...
Once again, thank you for your help!
Best Regards,
Hermann
--
,_,
(O,O) "There is more to life than increasing its speed."
( ) -- Gandhi
-"-"--------------------------------------------------------------
> /proc/sys/kernel/print_apic_errors
> This would simply disable those "APIC error" kernel logs, so that the
> logfile is not flooded. (45000 log entries in 1 hour are quite a lot).
> Anyway once you know that your board has this problem, IMHO there is no
> further use in those messages.
'My computer is broken, please dont tell me'. At 45,000 an hour you are asking
to get real problems.
> /proc/sys/kernel/enable_apic
> The second one would enable/disable the APIC code for testing purposes -
> like the "noapic" parameter during boottime. But as I have no knowledge
> about those kernel internals, perhaps this wish is impossible to
> implement...
That one is actually very tricky to do. The decision is made at boot time and
rather hard to flip between them.
Alan
On Sun, Apr 22, 2001 at 11:22:24AM +0200, Hermann Himmelbauer wrote:
> Karsten Keil wrote:
> >
> > I have here the same board with 2*233 MMX and don't see this kind of ISDN
> > error on recent 2.2 kernels, but got also lot of APIC errors with the
> > 2.3/2.4, because the APIC errors are only reported in 2.3/4.
>
> Right - same behavior here, no APIC errors with 2.2 (as they are not
> reported). The ISDN error happens very seldom (4 times last year) and is
> not reproducable - which is not so with the eth0 errors (as eth0 locks
> at around 500-1000MB while copying data).
I had a similar problem, but with less RAM than you have, I think.
And it hung the whole machine that heavy, that not even SysRq was
responding.
On that machine I had no swap installed and only 64MB of RAM.
Adding just another 64MB of RAM made it go away.
This might be an VM-skb-interaction-issue, but I saw no solution
so far.
The problem persistent with several processor (Cyrix III, Intel
Pentium (MXX), AMD Duron), several Chipsets (VIA-598, Intel BX)
and 3 different NICs (Realtek 8139, 3c509TX, Ether Express Pro)
and only under 100MBit.
I could copy MANY files (smb, scp, ftp), but ONE single file with
about 60MB or more (I tried to receive ISO images) killed the
machine. The behavior was also very random. Twice I got a
panic, but had problems writing it down due to the screen
darkening because of APM or setting "reboot on panic" :-(
Just FYI.
I don't know, why adding 64MB made it go away. I tried very hard
to reproduce it with 128MB, but really couldn't :-(
Regards
Ingo Oeser
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< been there and had much fun >>>>>>>>>>>>