2002-01-20 19:16:49

by Robbert Kouprie

Subject: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)

Hi all,

I have an Abit BP6 Dual Celeron 433, 192 Mb RAM, Intel NIC on 100 Mbit,
running Debian Woody with Linux kernel 2.4.17 from source. This weekend
the network card totally locked up. No network connections were possible
anymore and system logs were full of NETDEV errors (included below).
Note the "unexpected IRQ trap" which started this. Soft rebooting the
system was not possible anymore as the system hung on these NETDEV
errors after issuing the "reboot" command. I performed a little search
on the error message and found an earlier lkml message which looks like
exactly the same problem:
http://lists.kernelnotes.de/linux-kernel/Week-of-Mon-20010618/026269.html

Did anyone ever find out what the problem was here?

Regards,
- Robbert Kouprie


radium:/$ lspci -vx -d 8086:1229
00:0d.0 Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 09)
Subsystem: Intel Corp.: Unknown device 0011
Flags: bus master, medium devsel, latency 32, IRQ 17
Memory at da020000 (32-bit, non-prefetchable) [size=4K]
I/O ports at c800 [size=64]
Memory at da000000 (32-bit, non-prefetchable) [size=128K]
Expansion ROM at <unassigned> [disabled] [size=1M]
Capabilities: [dc] Power Management version 2
00: 86 80 29 12 07 00 90 02 09 00 00 02 08 20 00 00
10: 00 00 02 da 01 c8 00 00 00 00 00 da 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 11 00
30: 00 00 00 00 dc 00 00 00 00 00 00 00 0a 01 08 38
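
For reference, the four hex rows that close the lspci -vx output are the first 64 bytes of the device's PCI configuration space. The Python sketch below decodes a few standard header fields from that dump. The helper names (parse_dump, decode_header) are invented for illustration and do not belong to any tool mentioned in this thread.

```python
import struct

# First 64 bytes of config space, copied from the lspci -vx dump above.
DUMP = """
00: 86 80 29 12 07 00 90 02 09 00 00 02 08 20 00 00
10: 00 00 02 da 01 c8 00 00 00 00 00 da 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 11 00
30: 00 00 00 00 dc 00 00 00 00 00 00 00 0a 01 08 38
"""


def parse_dump(text):
    """Turn 'offset: xx xx ...' rows into a single bytes object."""
    data = bytearray()
    for row in text.strip().splitlines():
        _, _, hexbytes = row.partition(":")
        data += bytes.fromhex(hexbytes.strip())
    return bytes(data)


def decode_header(cfg):
    """Decode a few fields of a standard (type 0) PCI config header."""
    vendor, device, command, status = struct.unpack_from("<4H", cfg, 0x00)
    bars = struct.unpack_from("<3I", cfg, 0x10)  # first three base address registers
    return {
        "vendor": vendor,
        "device": device,
        "command": command,
        "status": status,
        "revision": cfg[0x08],
        "class": cfg[0x0B],
        "latency": cfg[0x0D],
        "bars": bars,
        "irq_line": cfg[0x3C],  # interrupt line register
        "irq_pin": cfg[0x3D],   # 1 = INTA#
    }


hdr = decode_header(parse_dump(DUMP))
print(f"vendor {hdr['vendor']:04x} device {hdr['device']:04x} rev {hdr['revision']:02x}")
print("BARs:", ", ".join(f"{bar:08x}" for bar in hdr["bars"]))
print("interrupt line:", hdr["irq_line"], "pin:", hdr["irq_pin"])
```

The decoded vendor/device pair (8086:1229), revision 09, latency 32 and base addresses agree with the text lspci printed. The interrupt line register holds 0x0a (10) while lspci reports IRQ 17; presumably lspci shows the IRQ the kernel assigned through the IO-APIC rather than the value left in the register.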

Jan 19 01:27:16 radium kernel: unexpected IRQ trap at vector 7d
Jan 19 01:29:11 radium kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 19 01:29:11 radium kernel: eth0: Transmit timed out: status f048 0c00 at 15835657/15835685 command 000ca000.
Jan 19 01:29:11 radium kernel: eth0: Tx ring dump, Tx queue 15835685 / 15835657:
Jan 19 01:29:11 radium kernel: eth0: 0 200ca000.
Jan 19 01:29:11 radium kernel: eth0: 1 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 2 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 3 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 4 400ca000.
Jan 19 01:29:11 radium kernel: eth0: = 5 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 6 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 7 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 8 200ca000.
Jan 19 01:29:11 radium kernel: eth0: * 9 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 10 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 11 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 12 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 13 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 14 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 15 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 16 200ca000.
Jan 19 01:29:11 radium kernel: eth0: 17 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 18 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 19 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 20 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 21 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 22 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 23 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 24 200ca000.
Jan 19 01:29:11 radium kernel: eth0: 25 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 26 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 27 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 28 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 29 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 30 000ca000.
Jan 19 01:29:11 radium kernel: eth0: 31 000ca000.
Jan 19 01:29:11 radium kernel: eth0: Printing Rx ring (next to receive into 13805956, dirty index 13805956).
Jan 19 01:29:11 radium kernel: eth0: 0 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 1 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 2 0000a020.
Jan 19 01:29:11 radium kernel: eth0: l 3 c000a020.
Jan 19 01:29:11 radium kernel: eth0: *= 4 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 5 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 6 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 7 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 8 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 9 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 10 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 11 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 12 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 13 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 14 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 15 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 16 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 17 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 18 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 19 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 20 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 21 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 22 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 23 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 24 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 25 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 26 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 27 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 28 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 29 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 30 0000a020.
Jan 19 01:29:11 radium kernel: eth0: 31 0000a020.

Jan 19 01:30:17 radium kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 19 01:30:17 radium kernel: eth0: Transmit timed out: status f048 0c00 at 15835685/15835713 command 0001a000.
Jan 19 01:30:17 radium kernel: eth0: Tx ring dump, Tx queue 15835713 / 15835685:
Jan 19 01:30:17 radium kernel: eth0: 0 600ca000.
Jan 19 01:30:17 radium kernel: eth0: = 1 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 2 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 3 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 4 400ca000.
Jan 19 01:30:17 radium kernel: eth0: * 5 0001a000.
Jan 19 01:30:17 radium kernel: eth0: 6 0002a000.
Jan 19 01:30:17 radium kernel: eth0: 7 0003a000.
Jan 19 01:30:17 radium kernel: eth0: 8 200ca000.
Jan 19 01:30:17 radium kernel: eth0: 9 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 10 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 11 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 12 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 13 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 14 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 15 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 16 200ca000.
Jan 19 01:30:17 radium kernel: eth0: 17 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 18 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 19 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 20 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 21 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 22 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 23 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 24 200ca000.
Jan 19 01:30:17 radium kernel: eth0: 25 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 26 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 27 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 28 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 29 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 30 000ca000.
Jan 19 01:30:17 radium kernel: eth0: 31 000ca000.
Jan 19 01:30:17 radium kernel: eth0: Printing Rx ring (next to receive into 13805956, dirty index 13805956).
Jan 19 01:30:17 radium kernel: eth0: 0 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 1 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 2 0000a020.
Jan 19 01:30:17 radium kernel: eth0: l 3 c000a020.
Jan 19 01:30:17 radium kernel: eth0: *= 4 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 5 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 6 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 7 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 8 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 9 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 10 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 11 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 12 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 13 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 14 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 15 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 16 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 17 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 18 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 19 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 20 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 21 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 22 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 23 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 24 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 25 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 26 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 27 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 28 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 29 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 30 0000a020.
Jan 19 01:30:17 radium kernel: eth0: 31 0000a020.

... And so on (2 Mb of logs available ;))
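
The two counters on each "Tx queue" line can be related to the slot markers in the ring dumps with simple modular arithmetic. The sketch below assumes a ring of 32 entries (the dumps list slots 0 to 31) and assumes the counters are free-running positions that wrap onto the ring; neither point is stated in this thread. The function names are illustrative only.

```python
import re

RING_SIZE = 32  # assumption: the dumps above list slots 0..31


def slot(counter, ring_size=RING_SIZE):
    """Map a free-running ring counter onto its slot in the ring."""
    return counter % ring_size


def tx_queue(line):
    """Extract the two counters from a 'Tx queue A / B' log line."""
    match = re.search(r"Tx queue (\d+)\s*/\s*(\d+)", line)
    return (int(match.group(1)), int(match.group(2))) if match else None


dumps = [
    "eth0: Tx ring dump, Tx queue 15835685 / 15835657:",
    "eth0: Tx ring dump, Tx queue 15835713 / 15835685:",
]

for line in dumps:
    first, second = tx_queue(line)
    print(f"{first} -> slot {slot(first)}, {second} -> slot {slot(second)}, "
          f"difference {first - second}")
```

For the first dump this gives slots 5 and 9, the entries flagged "=" and "*"; for the second it gives slots 1 and 5, again the flagged entries. In both dumps the two counters are 28 apart, so under the assumptions above 28 of the 32 ring entries were outstanding each time the watchdog fired.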


2002-01-20 21:53:32

by Robbert Kouprie

Subject: RE: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)

Hi Kenny,

Thanks for the quick reply :) Just checked it, and it's in slot 2, so
that's not the problem. It doesn't share the HPT366 IRQ. This is my
/proc/interrupts:

radium:/$ cat /proc/interrupts
           CPU0       CPU1
  0:    1207342    1186359    IO-APIC-edge  timer
  1:          3          1    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  4:         68         55    IO-APIC-edge  serial
  8:          2          0    IO-APIC-edge  rtc
 14:          6         12    IO-APIC-edge  ide0
 17:    1723020    1719230   IO-APIC-level  eth0
 18:      33419      33452   IO-APIC-level  ide2
 19:          0          0   IO-APIC-level  es1371
NMI:          0          0
LOC:    2393677    2393676
ERR:         14
MIS:          0

Regards,
- Robbert
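
To compare snapshots like the one above over time, the /proc/interrupts text can be parsed into per-IRQ counts. The following Python sketch is a minimal parser under the assumption that each row is "label: one count per CPU, then a description"; the function name parse_interrupts is hypothetical, and the sample is pasted from this message rather than read from a live system.

```python
# Snapshot pasted from the message above (whitespace is not significant).
SNAPSHOT = """\
CPU0 CPU1
0: 1207342 1186359 IO-APIC-edge timer
1: 3 1 IO-APIC-edge keyboard
2: 0 0 XT-PIC cascade
4: 68 55 IO-APIC-edge serial
8: 2 0 IO-APIC-edge rtc
14: 6 12 IO-APIC-edge ide0
17: 1723020 1719230 IO-APIC-level eth0
18: 33419 33452 IO-APIC-level ide2
19: 0 0 IO-APIC-level es1371
NMI: 0 0
LOC: 2393677 2393676
ERR: 14
MIS: 0
"""


def parse_interrupts(text):
    """Return {label: (per_cpu_counts, description)} for each row."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row names the CPUs
    rows = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpu]:
            if not field.isdigit():
                break  # rows such as ERR carry fewer counts than CPUs
            counts.append(int(field))
        rows[label.strip()] = (counts, " ".join(fields[len(counts):]))
    return rows


rows = parse_interrupts(SNAPSHOT)
for label, (counts, desc) in rows.items():
    if desc.endswith("eth0"):
        print(f"IRQ {label} ({desc}): per-CPU {counts}, total {sum(counts)}")
print("ERR:", rows["ERR"][0][0])
```

On this snapshot the eth0 interrupts are split almost evenly between the two CPUs and IRQ 17 is not shared with another device. ERR, which the kernel's /proc documentation describes as counting IO-APIC bus errors, stands at 14.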

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Kenneth MacDonald
> Sent: zondag 20 januari 2002 22:18
> To: Robbert Kouprie
> Subject: Re: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)
>
>
> Do you have the NIC in slot 3? Read your motherboard manual,
> page 1-4.
>
> Hope that helps, just in case!
> --
> Kenny
>
> ADML Support, EUCS, The University of Edinburgh, Scotland.
>

2002-01-21 10:07:32

by Robbert Kouprie

Subject: Re: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)


I know the bad reputation of the BP6. I have seen APIC errors every day
since the specific printk was added in 2.4.0, but the system is fairly
stable. The last time it was unstable, the cause was the Intel eepro100,
which turned out to suffer from a chipset bug that the eepro100.c driver
now works around.

I have run 2.4 kernels with this NIC and the corrected driver since 2.4.14
or so and never had problems. 2.4.16 ran for 40 days without trouble.

My BP6 BIOS is the most recent (RU revision). I can't really say whether
"noapic" helps, as I have no way to reproduce this. It was just a one-time
message, very similar to the earlier bug report on lkml that I mentioned,
which is why I reported it.

If someone has any clues on how to reproduce this I am happy to do some
testing.

Regards,
- Robbert


On Sun, 20 Jan 2002, Mark Hahn wrote:

> > I have an Abit BP6 Dual Celeron 433, 192 Mb RAM, Intel NIC on 100 Mbit,
>
> I assume you know that the BP6 is quite notorious for its
> rather ignoble irq and apic handling? is it safe to assume
> you have a recent bios installed? does booting "noapic" help?
>
>

2002-01-21 18:53:42

by Jussi Laako

Subject: Re: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)

Robbert Kouprie wrote:
>
> Thanks for the quick reply :) Just checked it, and it's in slot 2, so
> that's not the problem. It doesn't share the HPT366 IRQ. This is my
> /proc/interrupts:

Driver is eepro100? I suspect there is something in eepro100 driver that
should be protected by a spinlock but is not. I haven't got time to analyze
it further, yet...


- Jussi Laako

--
PGP key fingerprint: 161D 6FED 6A92 39E2 EB5B 39DD A4DE 63EB C216 1E4B
Available at PGP keyservers

2002-01-22 08:52:17

by Robbert Kouprie

Subject: Re: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)


Jussi Laako wrote:

> Robbert Kouprie wrote:
> >
> > Thanks for the quick reply :) Just checked it, and it's in slot 2, so
> > that's not the problem. It doesn't share the HPT366 IRQ. This is my
> > /proc/interrupts:
>
> Driver is eepro100? I suspect there is something in eepro100 driver that
> should be protected by a spinlock but is not. I haven't got time to
> analyze it further, yet...
>
> - Jussi Laako

Yes, eepro100.c. Let me know if I can test something, although I would
also need a reproducible test case. I'm still running some tests under high
network load, as that caused the similar lockup in the other thread.

- Robbert

2002-01-22 14:54:46

by James Bourne

Subject: Re: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)

On Tue, 22 Jan 2002, Robbert Kouprie wrote:

>
> Jussi Laako wrote:
>
> > Robbert Kouprie wrote:
> > >
> > > Thanks for the quick reply :) Just checked it, and it's in slot 2, so
> > > that's not the problem. It doesn't share the HPT366 IRQ. This is my
> > > /proc/interrupts:
> >
> > Driver is eepro100? I suspect there is something in eepro100 driver that
> > should be protected by a spinlock but is not. I haven't got time to
> > analyze it further, yet...
> >
> > - Jussi Laako
>
> Yes, eepro100.c. Let me know if I can test something, although I would
> need a reproducible testcase also. Still doing some tests with high
> network load, as this caused the similar lockup in the other thread.
>
> - Robbert
>

Perhaps this will help. Yesterday we had a strange error on an eepro100
NIC. System is 4-way Xeon, 4G RAM, 4 eepro100 nics (2 in use), Dell PE6400.
Kernel is 2.4.17, no additional patches. The system has not locked up
though.

The error was
eth0: can't fill rx buffer (force 0)!
eth0: Tx ring dump, Tx queue 3013060 / 3013060:
eth0: 0 200ca000.
eth0: 1 000ca000.
eth0: 2 000ca000.
eth0: 3 400ca000.
eth0: *= 4 000ca000.
eth0: 5 000ca000.
[... all the same ...]
eth0: 30 000ca000.
eth0: 31 000ca000.
eth0: Printing Rx ring (next to receive into 2522947, dirty index 2522946).
eth0: 0 00000001.
eth0: l 1 c0000001.
eth0: * 2 00000000.
eth0: = 3 00000001.
eth0: 4 00000001.
eth0: 5 00000001.
[... all the same ...]
eth0: 30 00000001.
eth0: 31 00000001.

System has only 4 days uptime, eth0 output is:
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:3049032 errors:0 dropped:0 overruns:0 frame:0
TX packets:3566542 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:580129470 (553.2 Mb) TX bytes:2775810991 (2647.2 Mb)
Interrupt:26

/proc/interrupts
loki:bash# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:   10332192   10317829   10310969   10381268    IO-APIC-edge  timer
  1:        292        307        250        273    IO-APIC-edge  keyboard
  2:          0          0          0          0          XT-PIC  cascade
  8:         26         32         30         27    IO-APIC-edge  rtc
 17:          6          4          3          3   IO-APIC-level  aic7xxx
 18:          2          5          2          7   IO-APIC-level  aic7xxx
 22:     766030     765530     764892     765127   IO-APIC-level  eth1
 23:     388900     388509     388022     388376   IO-APIC-level  megaraid
 26:    1395108    1395240    1394961    1396555   IO-APIC-level  eth0
NMI:          0          0          0          0
LOC:   41347538   41347536   41347536   41347494
ERR:          0
MIS:          0

lspci
loki:bash# lspci
00:00.0 Host bridge: ServerWorks CNB20HE Host Bridge (rev 21)
00:00.1 Host bridge: ServerWorks CNB20HE Host Bridge (rev 01)
00:00.2 Host bridge: ServerWorks: Unknown device 0006
00:00.3 Host bridge: ServerWorks: Unknown device 0006
00:04.0 VGA compatible controller: ATI Technologies Inc 3D Rage IIC (rev 7a)
00:05.0 SCSI storage controller: Adaptec 7899P (rev 01)
00:05.1 SCSI storage controller: Adaptec 7899P (rev 01)
00:08.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
00:0f.0 ISA bridge: ServerWorks OSB4 South Bridge (rev 50)
00:0f.1 IDE interface: ServerWorks OSB4 IDE Controller
00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 04)
03:08.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
03:09.0 PCI bridge: Digital Equipment Corporation DECchip 21154 (rev 05)
03:0a.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
03:0b.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
04:00.0 PCI bridge: Digital Equipment Corporation DECchip 21154 (rev 05)
04:01.0 SCSI storage controller: Q Logic QLA12160 (rev 06)
05:00.0 RAID bus controller: American Megatrends Inc. MegaRAID (rev 20)


Although I haven't had much time to track this yet (was planning later
today) I thought it might be related to the above... If any other
information would help, please let me know.

Regards
James Bourne

--
James Bourne, Supervisor Data Centre Operations
Mount Royal College, Calgary, AB, CA
http://www.mtroyal.ab.ca

******************************************************************************
This communication is intended for the use of the recipient to which it is
addressed, and may contain confidential, personal, and or privileged
information. Please contact the sender immediately if you are not the
intended recipient of this communication, and do not copy, distribute, or
take action relying on it. Any communication received in error, or
subsequent reply, should be deleted or destroyed.
******************************************************************************

2002-01-30 19:29:56

by Robbert Kouprie

Subject: RE: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)

Not much new, but still:

Today I got the same problem again with 2.4.18-pre3-ac2. Network
connections stuck, NFS mounts stuck. Bringing the interface down and up
doesn't help. It seems the NIC is really in trouble here. Only a
reboot brings the NIC back into use.

Still no test case though, and I have no idea how to investigate this :(
Can anyone give a hint as to where to look?

Regards,
- Robbert Kouprie

> -----Original Message-----
> From: James Bourne [mailto:[email protected]]
> Sent: dinsdag 22 januari 2002 15:54
> To: Robbert Kouprie
> Cc: [email protected]; [email protected]
> Subject: Re: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)
>
>
> On Tue, 22 Jan 2002, Robbert Kouprie wrote:
>
> >
> > Jussi Laako wrote:
> >
> > > Robbert Kouprie wrote:
> > > >
> > > > Thanks for the quick reply :) Just checked it, and it's in slot 2, so
> > > > that's not the problem. It doesn't share the HPT366 IRQ. This is my
> > > > /proc/interrupts:
> > >
> > > Driver is eepro100? I suspect there is something in eepro100 driver that
> > > should be protected by a spinlock but is not. I haven't got time to
> > > analyze it further, yet...
> > >
> > > - Jussi Laako
> >
> > Yes, eepro100.c. Let me know if I can test something, although I would
> > need a reproducible testcase also. Still doing some tests with high
> > network load, as this caused the similar lockup in the other thread.
> >
> > - Robbert
> >
>
> Perhaps this will help. Yesterday we had a strange error on an eepro100
> NIC. System is 4-way Xeon, 4G RAM, 4 eepro100 nics (2 in use), Dell PE6400.
> Kernel is 2.4.17, no additional patches. The system has not locked up
> though.
>
> The error was
> eth0: can't fill rx buffer (force 0)!
> eth0: Tx ring dump, Tx queue 3013060 / 3013060:
> eth0: 0 200ca000.
> eth0: 1 000ca000.
> eth0: 2 000ca000.
> eth0: 3 400ca000.
> eth0: *= 4 000ca000.
> eth0: 5 000ca000.
> [... all the same ...]
> eth0: 30 000ca000.
> eth0: 31 000ca000.
> eth0: Printing Rx ring (next to receive into 2522947, dirty index 2522946).
> eth0: 0 00000001.
> eth0: l 1 c0000001.
> eth0: * 2 00000000.
> eth0: = 3 00000001.
> eth0: 4 00000001.
> eth0: 5 00000001.
> [... all the same ...]
> eth0: 30 00000001.
> eth0: 31 00000001.
>
> System has only 4 days uptime, eth0 output is:
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:3049032 errors:0 dropped:0 overruns:0 frame:0
> TX packets:3566542 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:100
> RX bytes:580129470 (553.2 Mb) TX bytes:2775810991 (2647.2 Mb)
> Interrupt:26
>
> /proc/interrupts
> loki:bash# cat /proc/interrupts
>            CPU0       CPU1       CPU2       CPU3
>   0:   10332192   10317829   10310969   10381268    IO-APIC-edge  timer
>   1:        292        307        250        273    IO-APIC-edge  keyboard
>   2:          0          0          0          0          XT-PIC  cascade
>   8:         26         32         30         27    IO-APIC-edge  rtc
>  17:          6          4          3          3   IO-APIC-level  aic7xxx
>  18:          2          5          2          7   IO-APIC-level  aic7xxx
>  22:     766030     765530     764892     765127   IO-APIC-level  eth1
>  23:     388900     388509     388022     388376   IO-APIC-level  megaraid
>  26:    1395108    1395240    1394961    1396555   IO-APIC-level  eth0
> NMI:          0          0          0          0
> LOC:   41347538   41347536   41347536   41347494
> ERR:          0
> MIS:          0
>
> lspci
> loki:bash# lspci
> 00:00.0 Host bridge: ServerWorks CNB20HE Host Bridge (rev 21)
> 00:00.1 Host bridge: ServerWorks CNB20HE Host Bridge (rev 01)
> 00:00.2 Host bridge: ServerWorks: Unknown device 0006
> 00:00.3 Host bridge: ServerWorks: Unknown device 0006
> 00:04.0 VGA compatible controller: ATI Technologies Inc 3D Rage IIC (rev 7a)
> 00:05.0 SCSI storage controller: Adaptec 7899P (rev 01)
> 00:05.1 SCSI storage controller: Adaptec 7899P (rev 01)
> 00:08.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
> 00:0f.0 ISA bridge: ServerWorks OSB4 South Bridge (rev 50)
> 00:0f.1 IDE interface: ServerWorks OSB4 IDE Controller
> 00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 04)
> 03:08.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
> 03:09.0 PCI bridge: Digital Equipment Corporation DECchip 21154 (rev 05)
> 03:0a.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
> 03:0b.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
> 04:00.0 PCI bridge: Digital Equipment Corporation DECchip 21154 (rev 05)
> 04:01.0 SCSI storage controller: Q Logic QLA12160 (rev 06)
> 05:00.0 RAID bus controller: American Megatrends Inc. MegaRAID (rev 20)
>
>
> Although I haven't had much time to track this yet (was planning later
> today) I thought it might be related to the above... If any other
> information would help, please let me know.
>
> Regards
> James Bourne
>
> --
> James Bourne, Supervisor Data Centre Operations
> Mount Royal College, Calgary, AB, CA
> http://www.mtroyal.ab.ca
>
>

2002-01-30 21:08:18

by Stephan von Krawczynski

Subject: Re: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)

On Wed, 30 Jan 2002 20:29:15 +0100
"Robbert Kouprie" <[email protected]> wrote:

> Not much new, but still:
>
> Today I got the same problem again with 2.4.18-pre3-ac2. Network
> connections stuck, NFS mounts stuck. Bringing down/up the interface
> doesn't help. Seems like the NIC is really in trouble here. Only a
> reboot would bring the NIC back in use.
>
> Still no testcase though, and I have no idea on how to investigate this
> :(
> Can anyone give a hint as where to seek?

How about http://www.kernel.org? Download _latest_ kernel-patch (-pre7) and tell us
about it. As long as you are trying only old pre's there is not much of
a chance any important brain will listen to you.

Regards,
Stephan
(no important brain)

2002-01-31 00:28:25

by Robbert Kouprie

Subject: RE: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)

Thanks for your reaction Stephan, but I seriously doubt the change below
would fix the problem... Also, as the problem appears randomly, and
usually only after some uptime, I obviously cannot tell whether it is fixed
if I constantly upgrade the kernel. I'd rather wait and see whether it
reappears after a kernel upgrade than try every -pre while there's no
mention on the mailing list of such a bug being fixed.

Anyway, I just rebooted with 2.4.18-pre7-ac1, we'll see if it helps.

Regards,
- Robbert


radium:/usr/src# diff -u linux.18p3-ac2/drivers/net/eepro100.c linux.18p7-ac1/drivers/net/eepro100.c
--- linux.18p3-ac2/drivers/net/eepro100.c Fri Dec 21 18:41:54 2001
+++ linux.18p7-ac1/drivers/net/eepro100.c Thu Jan 31 00:35:56 2002
@@ -28,7 +28,7 @@
 */

 static const char *version =
-"eepro100.c:v1.09j-t 9/29/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html\n"
+"eepro100.c:v1.09j-t 9/29/99 Donald Becker http://www.scyld.com/network/eepro100.html\n"
 "eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <[email protected]> and others\n";

 /* A few user-configurable values that apply to all boards.


> -----Original Message-----
> From: Stephan von Krawczynski [mailto:[email protected]]
> Sent: woensdag 30 januari 2002 22:07
> To: Robbert Kouprie
> Cc: [email protected]
> Subject: Re: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)
>
>
> On Wed, 30 Jan 2002 20:29:15 +0100
> "Robbert Kouprie" <[email protected]> wrote:
>
> > Not much new, but still:
> >
> > Today I got the same problem again with 2.4.18-pre3-ac2. Network
> > connections stuck, NFS mounts stuck. Bringing down/up the interface
> > doesn't help. Seems like the NIC is really in trouble here. Only a
> > reboot would bring the NIC back in use.
> >
> > Still no testcase though, and I have no idea on how to investigate this
> > :(
> > Can anyone give a hint as where to seek?
>
> How about http://www.kernel.org? Download _latest_ kernel-patch (-pre7) and tell us
> about it. As long as you are trying only old pre's there is not much of
> a chance any important brain will listen to you.
>
> Regards,
> Stephan
> (no important brain)
>

2002-01-31 12:36:10

by Stephan von Krawczynski

Subject: Re: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)

On Thu, 31 Jan 2002 01:27:47 +0100
"Robbert Kouprie" <[email protected]> wrote:

> Thanks for your reaction Stephan, but I seriously doubt the change below
> would fix the problem... Also, as the problem appears randomly, and
> usually after some uptime, I obviously can not know about it being fixed
> if I constantly upgrade the kernel. I'd rather wait and see if it
> appears again in time after I did a kernel upgrade, and not trying every
> -pre while there's no mention on the mailing list of such bug being
> fixed.
>
> Anyway, I just rebooted with 2.4.18-pre7-ac1, we'll see if it helps.

Hello Robbert,

Well, I know the changes to the driver are rather ... small :-)
But on the other hand, I would not be all that sure that the bug is a
hundred percent related to the driver itself.
I run a working config with eepro100-driver, btw.

Regards,
Stephan


2002-01-31 16:33:47

by Ben Greear

Subject: Re: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)

What does the rest of the hardware config look like? Is
the NIC attached to a 10BaseT hub? Are you using PCI riser
cards?

Stephan von Krawczynski wrote:

> On Thu, 31 Jan 2002 01:27:47 +0100
> "Robbert Kouprie" <[email protected]> wrote:
>
>
>>Thanks for your reaction Stephan, but I seriously doubt the change below
>>would fix the problem... Also, as the problem appears randomly, and
>>usually after some uptime, I obviously can not know about it being fixed
>>if I constantly upgrade the kernel. I'd rather wait and see if it
>>appears again in time after I did a kernel upgrade, and not trying every
>>-pre while there's no mention on the mailing list of such bug being
>>fixed.
>>
>>Anyway, I just rebooted with 2.4.18-pre7-ac1, we'll see if it helps.
>>
>
> Hello Robbert,
>
> Well, I know the changes to the driver are rather ... small :-)
> But on the other hand, I would not be all that sure that the bug is a
> hundred percent related to the driver itself.
> I run a working config with eepro100-driver, btw.
>
> Regards,
> Stephan
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
>


--
Ben Greear <[email protected]> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear


2002-01-31 16:55:09

by Robbert Kouprie

Subject: Re: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)


The box is an Abit BP6 with Dual Celerons 433 and 192 Mb RAM. No
PCI-Riser cards. It is connected at 100 Mbit full duplex to a 100
Mbit switch. APIC is enabled. No kind of power management is enabled.

Below is my /proc/interrupts, lspci -vx and dmesg output.

Regards,
- Robbert



radium:/# cat /proc/interrupts
           CPU0       CPU1
  0:    2944301    2940065    IO-APIC-edge  timer
  1:         39         41    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  4:       3074       3208    IO-APIC-edge  serial
  8:          2          0    IO-APIC-edge  rtc
 14:         20         29    IO-APIC-edge  ide0
 17:     627932     628166   IO-APIC-level  eth0
 18:     121201     121973   IO-APIC-level  ide2
 19:     522304     521928   IO-APIC-level  es1371
NMI:          0          0
LOC:    5884708    5884706
ERR:        170
MIS:          0

radium:/# lspci -vx
00:00.0 Host bridge: Intel Corp. 440BX/ZX - 82443BX/ZX Host bridge (rev 03)
Flags: bus master, medium devsel, latency 32
Memory at d0000000 (32-bit, prefetchable) [size=64M]
Capabilities: [a0] AGP version 1.0
00: 86 80 90 71 06 00 10 22 03 00 00 06 00 20 00 00
10: 08 00 00 d0 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 a0 00 00 00 00 00 00 00 00 00 00 00

00:01.0 PCI bridge: Intel Corp. 440BX/ZX - 82443BX/ZX AGP bridge (rev 03) (prog-if 00 [Normal decode])
Flags: bus master, 66Mhz, medium devsel, latency 64
Bus: primary=00, secondary=01, subordinate=01, sec-latency=32
Memory behind bridge: d4000000-d7ffffff
Prefetchable memory behind bridge: d8000000-d8ffffff
00: 86 80 91 71 07 01 20 02 03 00 04 06 00 40 01 00
10: 00 00 00 00 00 00 00 00 00 01 01 20 f0 00 a0 22
20: 00 d4 f0 d7 00 d8 f0 d8 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 88 00

00:07.0 ISA bridge: Intel Corp. 82371AB PIIX4 ISA (rev 02)
Flags: bus master, medium devsel, latency 0
00: 86 80 10 71 0f 00 80 02 02 00 01 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:07.1 IDE interface: Intel Corp. 82371AB PIIX4 IDE (rev 01) (prog-if 80 [Master])
Flags: bus master, medium devsel, latency 32
I/O ports at f000 [size=16]
00: 86 80 11 71 05 00 80 02 01 80 01 01 00 20 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 01 f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:07.2 USB Controller: Intel Corp. 82371AB PIIX4 USB (rev 01) (prog-if 00 [UHCI])
Flags: bus master, medium devsel, latency 32, IRQ 19
I/O ports at c000 [size=32]
00: 86 80 12 71 05 00 80 02 01 00 03 0c 00 20 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 01 c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 0c 04 00 00

00:07.3 Bridge: Intel Corp. 82371AB PIIX4 ACPI (rev 02)
Flags: medium devsel, IRQ 9
00: 86 80 13 71 03 00 80 02 02 00 80 06 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:09.0 Multimedia audio controller: Ensoniq ES1371 [AudioPCI-97] (rev 06)
Subsystem: Ensoniq Creative Sound Blaster AudioPCI64V, AudioPCI128
Flags: bus master, slow devsel, latency 32, IRQ 19
I/O ports at c400 [size=64]
Capabilities: [dc] Power Management version 1
00: 74 12 71 13 05 01 10 34 06 00 01 04 00 20 00 00
10: 01 c4 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 74 12 71 13
30: 00 00 00 00 dc 00 00 00 00 00 00 00 0c 01 0c 80

00:0d.0 Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 09)
Subsystem: Intel Corp.: Unknown device 0011
Flags: bus master, medium devsel, latency 32, IRQ 17
Memory at da020000 (32-bit, non-prefetchable) [size=4K]
I/O ports at c800 [size=64]
Memory at da000000 (32-bit, non-prefetchable) [size=128K]
Expansion ROM at <unassigned> [disabled] [size=1M]
Capabilities: [dc] Power Management version 2
00: 86 80 29 12 07 00 90 02 09 00 00 02 08 20 00 00
10: 00 00 02 da 01 c8 00 00 00 00 00 da 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 11 00
30: 00 00 00 00 dc 00 00 00 00 00 00 00 0a 01 08 38

00:13.0 Unknown mass storage controller: Triones Technologies, Inc. HPT366 / HPT370 (rev 01)
Flags: bus master, medium devsel, latency 120, IRQ 18
I/O ports at cc00 [size=8]
I/O ports at d000 [size=4]
I/O ports at d400 [size=256]
Expansion ROM at <unassigned> [disabled] [size=128K]
00: 03 11 04 00 05 00 00 02 01 00 80 01 08 78 80 00
10: 01 cc 00 00 01 d0 00 00 00 00 00 00 00 00 00 00
20: 01 d4 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 0b 01 08 08

00:13.1 Unknown mass storage controller: Triones Technologies, Inc. HPT366 / HPT370 (rev 01)
Flags: bus master, medium devsel, latency 120, IRQ 18
I/O ports at d800 [size=8]
I/O ports at dc00 [size=4]
I/O ports at e000 [size=256]
00: 03 11 04 00 07 00 00 02 01 00 80 01 08 78 80 00
10: 01 d8 00 00 01 dc 00 00 00 00 00 00 00 00 00 00
20: 01 e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 0b 02 08 08

01:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA G200 AGP (rev 01) (prog-if 00 [VGA])
Subsystem: Matrox Graphics, Inc. Millennium G200 AGP
Flags: bus master, medium devsel, latency 32, IRQ 16
Memory at d8000000 (32-bit, prefetchable) [size=16M]
Memory at d4000000 (32-bit, non-prefetchable) [size=16K]
Memory at d5000000 (32-bit, non-prefetchable) [size=8M]
Expansion ROM at <unassigned> [disabled] [size=64K]
Capabilities: [dc] Power Management version 1
Capabilities: [f0] AGP version 1.0
00: 2b 10 21 05 07 00 90 02 01 00 00 03 08 20 00 00
10: 08 00 00 d8 00 00 00 d4 00 00 00 d5 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 2b 10 03 ff
30: 00 00 00 00 dc 00 00 00 00 00 00 00 09 01 10 20
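
For anyone reading the hex dumps above: the 64 bytes that `lspci -x` prints per device are the standard PCI configuration header, so the IDs and the interrupt registers can be read straight out of them. The following is only a minimal sketch and not part of the original report. The names `parse_dump` and `decode_header` are hypothetical; the offsets are the standard header ones (vendor at 0x00, device at 0x02, revision at 0x08, latency timer at 0x0d, subsystem IDs at 0x2c/0x2e, interrupt line at 0x3c, interrupt pin at 0x3d).

```python
# Config-space dump of the NIC (00:0d.0), copied from the lspci output above.
NIC_DUMP = """\
00: 86 80 29 12 07 00 90 02 09 00 00 02 08 20 00 00
10: 00 00 02 da 01 c8 00 00 00 00 00 da 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 11 00
30: 00 00 00 00 dc 00 00 00 00 00 00 00 0a 01 08 38
"""


def parse_dump(text):
    """Turn 'lspci -x' lines such as '00: 86 80 ...' into a bytes object."""
    data = bytearray()
    for line in text.strip().splitlines():
        offset, _, hexbytes = line.partition(":")
        # Each line is labelled with the offset of its first byte.
        assert int(offset, 16) == len(data), "dump lines must be contiguous"
        data += bytes.fromhex(hexbytes)
    return bytes(data)


def decode_header(cfg):
    """Pick a few standard fields out of a 64-byte PCI config header."""
    def le16(off):
        # Multi-byte config registers are stored little-endian.
        return int.from_bytes(cfg[off:off + 2], "little")

    return {
        "vendor": le16(0x00),
        "device": le16(0x02),
        "revision": cfg[0x08],
        "class": cfg[0x0B],
        "latency": cfg[0x0D],
        "subsys_vendor": le16(0x2C),
        "subsys_device": le16(0x2E),
        "irq_line": cfg[0x3C],
        "irq_pin": cfg[0x3D],
    }


hdr = decode_header(parse_dump(NIC_DUMP))
print(f"{hdr['vendor']:04x}:{hdr['device']:04x} rev {hdr['revision']:02x}")
print(f"latency {hdr['latency']}, irq line {hdr['irq_line']}, pin {hdr['irq_pin']}")
```

For the NIC this gives 8086:1229, revision 09 and latency 32, which agrees with the decoded lines lspci printed. The interrupt line register holds 0x0a (10) while lspci reports IRQ 17. My assumption is that lspci shows the number the kernel routed through the IO-APIC, and that the register still holds whatever the BIOS wrote there.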

radium:/# cat /var/log/dmesg
Linux version 2.4.18-pre7-ac1 (root@radium) (gcc version 2.95.4 (Debian prerelease)) #1 SMP Thu Jan 31 01:06:59 CET 2002
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 00000000000a0000 (usable)
BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000000c000000 (usable)
BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
found SMP MP-table at 000f5ae0
hm, page 000f5000 reserved twice.
hm, page 000f6000 reserved twice.
hm, page 000f1000 reserved twice.
hm, page 000f2000 reserved twice.
On node 0 totalpages: 49152
zone(0): 4096 pages.
zone(1): 45056 pages.
zone(2): 0 pages.
Intel MultiProcessor Specification v1.4
Virtual Wire compatibility mode.
OEM ID: OEM00000 Product ID: PROD00000000 APIC at: 0xFEE00000
Processor #0 Pentium(tm) Pro APIC version 17
Processor #1 Pentium(tm) Pro APIC version 17
I/O APIC #2 Version 17 at 0xFEC00000.
Processors: 2
Kernel command line: auto BOOT_IMAGE=Linux ro root=2101
Initializing CPU#0
Detected 434.324 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 865.07 BogoMIPS
Memory: 191360k/196608k available (1316k kernel code, 4864k reserved, 383k data, 240k init, 0k highmem)
Dentry-cache hash table entries: 32768 (order: 6, 262144 bytes)
Inode-cache hash table entries: 16384 (order: 5, 131072 bytes)
Mount-cache hash table entries: 4096 (order: 3, 32768 bytes)
Buffer-cache hash table entries: 16384 (order: 4, 65536 bytes)
Page-cache hash table entries: 65536 (order: 6, 262144 bytes)
CPU: Before vendor init, caps: 0183fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 128K
CPU: After vendor init, caps: 0183fbff 00000000 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: After generic, caps: 0183fbff 00000000 00000000 00000000
CPU: Common caps: 0183fbff 00000000 00000000 00000000
Enabling fast FPU save and restore... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
mtrr: v1.40 (20010327) Richard Gooch ([email protected])
mtrr: detected mtrr type: Intel
CPU: Before vendor init, caps: 0183fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 128K
CPU: After vendor init, caps: 0183fbff 00000000 00000000 00000000
Intel machine check reporting enabled on CPU#0.
CPU: After generic, caps: 0183fbff 00000000 00000000 00000000
CPU: Common caps: 0183fbff 00000000 00000000 00000000
CPU0: Intel Celeron (Mendocino) stepping 05
per-CPU timeslice cutoff: 365.86 usecs.
enabled ExtINT on CPU#0
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Booting processor 1/1 eip 2000
Initializing CPU#1
masked ExtINT on CPU#1
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Calibrating delay loop... 868.35 BogoMIPS
CPU: Before vendor init, caps: 0183fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 128K
CPU: After vendor init, caps: 0183fbff 00000000 00000000 00000000
Intel machine check reporting enabled on CPU#1.
CPU: After generic, caps: 0183fbff 00000000 00000000 00000000
CPU: Common caps: 0183fbff 00000000 00000000 00000000
CPU1: Intel Celeron (Mendocino) stepping 05
Total of 2 processors activated (1733.42 BogoMIPS).
ENABLING IO-APIC IRQs
Setting 2 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 2 ... ok.
init IO_APIC IRQs
IO-APIC (apicid-pin) 2-0, 2-9, 2-10, 2-11, 2-12, 2-20, 2-21, 2-22, 2-23 not connected.
..TIMER: vector=0x31 pin1=2 pin2=0
number of MP IRQ sources: 19.
number of IO-APIC #2 registers: 24.
testing the IO APIC.......................

IO APIC #2......
.... register #00: 02000000
....... : physical APIC id: 02
.... register #01: 00170011
....... : max redirection entries: 0017
....... : PRQ implemented: 0
....... : IO APIC version: 0011
.... register #02: 00000000
....... : arbitration: 00
.... IRQ redirection table:
NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
00 000 00 1 0 0 0 0 0 0 00
01 003 03 0 0 0 0 0 1 1 39
02 003 03 0 0 0 0 0 1 1 31
03 003 03 0 0 0 0 0 1 1 41
04 003 03 0 0 0 0 0 1 1 49
05 003 03 0 0 0 0 0 1 1 51
06 003 03 0 0 0 0 0 1 1 59
07 003 03 0 0 0 0 0 1 1 61
08 003 03 0 0 0 0 0 1 1 69
09 000 00 1 0 0 0 0 0 0 00
0a 000 00 1 0 0 0 0 0 0 00
0b 000 00 1 0 0 0 0 0 0 00
0c 000 00 1 0 0 0 0 0 0 00
0d 003 03 0 0 0 0 0 1 1 71
0e 003 03 0 0 0 0 0 1 1 79
0f 003 03 0 0 0 0 0 1 1 81
10 003 03 1 1 0 1 0 1 1 89
11 003 03 1 1 0 1 0 1 1 91
12 003 03 1 1 0 1 0 1 1 99
13 003 03 1 1 0 1 0 1 1 A1
14 000 00 1 0 0 0 0 0 0 00
15 000 00 1 0 0 0 0 0 0 00
16 000 00 1 0 0 0 0 0 0 00
17 000 00 1 0 0 0 0 0 0 00
IRQ to pin mappings:
IRQ0 -> 0:2
IRQ1 -> 0:1
IRQ3 -> 0:3
IRQ4 -> 0:4
IRQ5 -> 0:5
IRQ6 -> 0:6
IRQ7 -> 0:7
IRQ8 -> 0:8
IRQ13 -> 0:13
IRQ14 -> 0:14
IRQ15 -> 0:15
IRQ16 -> 0:16
IRQ17 -> 0:17
IRQ18 -> 0:18
IRQ19 -> 0:19
.................................... done.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 434.3005 MHz.
..... host bus clock speed is 66.8152 MHz.
cpu: 0, clocks: 668152, slice: 222717
CPU0<T0:668144,T1:445424,D:3,S:222717,C:668152>
cpu: 1, clocks: 668152, slice: 222717
CPU1<T0:668144,T1:222704,D:6,S:222717,C:668152>
checking TSC synchronization across CPUs: passed.
Waiting on wait_init_idle (map = 0x2)
All processors have done init_idle
mtrr: your CPUs had inconsistent fixed MTRR settings
mtrr: probably your BIOS does not setup all CPUs
PCI: PCI BIOS revision 2.10 entry at 0xfb5c0, last bus=1
PCI: Using configuration type 1
PCI: Probing PCI hardware
Unknown bridge resource 0: assuming transparent
PCI: Using IRQ router PIIX [8086/7110] at 00:07.0
PCI->APIC IRQ transform: (B0,I7,P3) -> 19
PCI->APIC IRQ transform: (B0,I9,P0) -> 19
PCI->APIC IRQ transform: (B0,I13,P0) -> 17
PCI->APIC IRQ transform: (B0,I19,P0) -> 18
PCI->APIC IRQ transform: (B0,I19,P1) -> 18
PCI->APIC IRQ transform: (B1,I0,P0) -> 16
Limiting direct PCI/PCI transfers.
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
IA-32 Microcode Update Driver: v1.09 <[email protected]>
Starting kswapd
VFS: Diskquotas version dquot_6.5.0 initialized
Journalled Block Device driver loaded
parport0: PC-style at 0x378 (0x778) [PCSPP,TRISTATE,EPP]
parport0: irq 7 detected
parport0: Legacy device
i2c-core.o: i2c core module
i2c-dev.o: i2c /dev entries driver module
i2c-core.o: driver i2c-dev dummy driver registered.
i2c-proc.o version 2.6.1 (20010825)
pty: 256 Unix98 ptys configured
Serial driver version 5.05c (2001-07-08) with MANY_PORTS SHARE_IRQ SERIAL_PCI enabled
ttyS00 at 0x03f8 (irq = 4) is a 16550A
ttyS01 at 0x02f8 (irq = 3) is a 16550A
lp0: using parport0 (polling).
Real Time Clock Driver v1.10e
block: 128 slots per queue, batch=32
Uniform Multi-Platform E-IDE driver Revision: 6.31
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
PIIX4: IDE controller on PCI bus 00 dev 39
PIIX4: chipset revision 1
PIIX4: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:DMA, hdb:pio
ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:pio, hdd:pio
HPT366: onboard version of chipset, pin1=1 pin2=2
HPT366: IDE controller on PCI bus 00 dev 98
HPT366: chipset revision 1
HPT366: not 100% native mode: will probe irqs later
ide2: BM-DMA at 0xd400-0xd407, BIOS settings: hde:DMA, hdf:pio
HPT366: IDE controller on PCI bus 00 dev 99
HPT366: chipset revision 1
HPT366: not 100% native mode: will probe irqs later
ide3: BM-DMA at 0xe000-0xe007, BIOS settings: hdg:pio, hdh:pio
hda: Maxtor 91728D8, ATA DISK drive
hde: ST38421A, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide2 at 0xcc00-0xcc07,0xd002 on irq 18
hda: 33750864 sectors (17280 MB) w/512KiB Cache, CHS=2100/255/63, UDMA(33)
hde: 16498944 sectors (8447 MB) w/256KiB Cache, CHS=16368/16/63, UDMA(66)
Partition check:
hda: hda1
hde: [PTBL] [1027/255/63] hde1 hde2 < hde5 >
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
eepro100.c:v1.09j-t 9/29/99 Donald Becker http://www.scyld.com/network/eepro100.html
eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <[email protected]> and others
eth0: OEM i82557/i82558 10/100 Ethernet, 00:D0:B7:E8:A2:02, IRQ 17.
Board assembly 749658-005, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
General self-test: passed.
Serial sub-system self-test: passed.
Internal registers self-test: passed.
ROM checksum self-test: passed (0xdbd8681d).
PPP generic driver version 2.4.1
PPP Deflate Compression module registered
PPP BSD Compression module registered
Linux agpgart interface v0.99 (c) Jeff Hartmann
agpgart: Maximum main memory to use for agp memory: 150M
agpgart: Detected Intel 440BX chipset
agpgart: AGP aperture is 64M @ 0xd0000000
[drm] AGP 0.99 on Intel 440BX @ 0xd0000000 64MB
[drm] Initialized mga 3.0.2 20010321 on minor 0
es1371: version v0.30 time 01:10:14 Jan 31 2002
es1371: found chip, vendor id 0x1274 device id 0x1371 revision 0x06
es1371: found es1371 rev 6 at io 0xc400 irq 19
es1371: features: joystick 0x0
ac97_codec: AC97 Audio codec, id: 0x8384:0x7609 (SigmaTel STAC9721/23)
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 2048 buckets, 16Kbytes
TCP: Hash tables configured (established 16384 bind 16384)
ip_conntrack (1536 buckets, 12288 max)
ip_tables: (C) 2000-2002 Netfilter core team
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 240k freed
Adding Swap: 72256k swap-space (priority -1)
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.17, 10 Jan 2002 on ide0(3,1), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
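
To make the interrupt routing in the dmesg above easier to see, here is a small sketch that groups the "PCI->APIC IRQ transform" lines by IRQ. It is an illustration only. The name `irq_users` is hypothetical, and I am assuming that B, I and P in those lines stand for bus, device (slot) number and zero-based interrupt pin, with the device number printed in decimal (so I13 is 00:0d.0, the NIC).

```python
import re
from collections import defaultdict

# Lines copied from the dmesg output above.
DMESG = """\
PCI->APIC IRQ transform: (B0,I7,P3) -> 19
PCI->APIC IRQ transform: (B0,I9,P0) -> 19
PCI->APIC IRQ transform: (B0,I13,P0) -> 17
PCI->APIC IRQ transform: (B0,I19,P0) -> 18
PCI->APIC IRQ transform: (B0,I19,P1) -> 18
PCI->APIC IRQ transform: (B1,I0,P0) -> 16
"""

PATTERN = re.compile(r"\(B(\d+),I(\d+),P(\d+)\) -> (\d+)")


def irq_users(log):
    """Return {irq: [(bus, device, pin), ...]} for every transform line."""
    users = defaultdict(list)
    for bus, dev, pin, irq in PATTERN.findall(log):
        users[int(irq)].append((int(bus), int(dev), int(pin)))
    return dict(users)


users = irq_users(DMESG)
for irq in sorted(users):
    # Print the slot in lspci notation (hex device number) for comparison.
    slots = ", ".join(f"{b:02x}:{d:02x} pin {p}" for b, d, p in users[irq])
    shared = " (shared)" if len(users[irq]) > 1 else ""
    print(f"IRQ {irq}: {slots}{shared}")
```

Read this way, the NIC sits alone on IRQ 17, while the USB controller and the ES1371 share IRQ 19 and the two HPT366 functions share IRQ 18. That matches the IRQ numbers in the lspci listing.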

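The IO-APIC redirection table in the same dmesg can be checked against the "unexpected IRQ trap at vector 7d" line from the first message of this thread. The sketch below is illustrative and rests on assumptions: the name `parse_redirection_table` is hypothetical, and I read Trig=1 as level-triggered and Pol=1 as active-low, following the usual IO-APIC register layout. Note also that this table comes from 2.4.18-pre7-ac1, while the trap was logged under 2.4.17, so the vector assignment may not have been the same.

```python
# Redirection table rows copied from the dmesg above.
# Columns: NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect
TABLE = """\
00 000 00 1 0 0 0 0 0 0 00
01 003 03 0 0 0 0 0 1 1 39
02 003 03 0 0 0 0 0 1 1 31
03 003 03 0 0 0 0 0 1 1 41
04 003 03 0 0 0 0 0 1 1 49
05 003 03 0 0 0 0 0 1 1 51
06 003 03 0 0 0 0 0 1 1 59
07 003 03 0 0 0 0 0 1 1 61
08 003 03 0 0 0 0 0 1 1 69
09 000 00 1 0 0 0 0 0 0 00
0a 000 00 1 0 0 0 0 0 0 00
0b 000 00 1 0 0 0 0 0 0 00
0c 000 00 1 0 0 0 0 0 0 00
0d 003 03 0 0 0 0 0 1 1 71
0e 003 03 0 0 0 0 0 1 1 79
0f 003 03 0 0 0 0 0 1 1 81
10 003 03 1 1 0 1 0 1 1 89
11 003 03 1 1 0 1 0 1 1 91
12 003 03 1 1 0 1 0 1 1 99
13 003 03 1 1 0 1 0 1 1 A1
14 000 00 1 0 0 0 0 0 0 00
15 000 00 1 0 0 0 0 0 0 00
16 000 00 1 0 0 0 0 0 0 00
17 000 00 1 0 0 0 0 0 0 00
"""


def parse_redirection_table(text):
    """Map pin number -> a few fields of its redirection entry."""
    entries = {}
    for line in text.strip().splitlines():
        f = line.split()
        entries[int(f[0], 16)] = {
            "masked": f[3] == "1",
            "level_triggered": f[4] == "1",  # assumption: Trig 1 = level
            "active_low": f[6] == "1",       # assumption: Pol 1 = active low
            "vector": int(f[10], 16),
        }
    return entries


table = parse_redirection_table(TABLE)
# Vector 00 marks pins that were not set up, so leave those out.
programmed = {e["vector"] for e in table.values() if e["vector"]}

nic = table[17]  # lspci reports IRQ 17 for the NIC
print(f"pin 17: vector {nic['vector']:#x}, level={nic['level_triggered']}")
print("0x7d programmed:", 0x7D in programmed)
```

In this dump pin 17 carries vector 0x91, and 0x7d is not programmed on any pin. I would not read more into that than that the trapped vector does not correspond to any entry in this particular table.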



On Thu, 31 Jan 2002, Ben Greear wrote:

> What does the rest of the hardware-config look like? Is
> the NIC attached to a 10bt hub? Are you using PCI-Riser
> cards?
>
> Ben Greear <[email protected]> <Ben_Greear AT excite.com>
> President of Candela Technologies Inc http://www.candelatech.com
> ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear
>
>
>

2002-01-31 17:55:05

by Ben Greear

Subject: Re: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)



Robbert Kouprie wrote:

> The box is an Abit BP6 with Dual Celerons 433 and 192 Mb RAM. No
> PCI-Riser cards. It is connected at 100 Mbit full duplex to a 100
> Mbit switch. APIC is enabled. No kind of power management is enabled.


The only lockup problems I have run into are connecting some eepro nics to
a 10bt hub, and using (cheap arsed, it appears) PCI riser cards. I have
heard of some SMP related issues, but nothing concrete, and I don't
have any SMP systems personally. You could try the e100, but I have
no idea if it will be better or worse for your particular problem.


--
Ben Greear <[email protected]> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear


2002-01-31 19:18:00

by Robbert Kouprie

Subject: RE: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)

I experienced the 10 Mbit half duplex problems too with this card, but
they seemed to go away after a bugfix from Alan Cox somewhere in 2.4.
Some time later I upgraded to 100 Mbit full duplex and never
experienced problems again until 2.4.17.

I think I'm going to try some older kernels and look through the diffs
if I have time.

- Robbert

2002-02-01 17:53:39

by Edward S. Marshall

Subject: PCI Problems [was Re: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)]

On Thu, 2002-01-31 at 11:54, Ben Greear wrote:
> The only lockup problems I have run into are connecting some eepro nics to
> a 10bt hub, and using (cheap arsed, it appears) PCI riser cards. I have
> heard of some SMP related issues, but nothing concrete, and I don't
> have any SMP systems personally. You could try the e100, but I have
> no idea if it will be better or worse for your particular problem.

I was running into the same problems here; SMP system w/PCI riser card
(HP NetServer LPr), connected to a 10/100 switch. I'd get
"wait_for_command_timeout" errors all the time under moderate network
load. Switching to the e100 driver didn't help in the slightest.
Eventually, I'd experience a complete system lockup.

Replacing the card with a 3c59x-based card put the machine back in
service (I've completely written eepro100s off as viable cards now),
although I still saw occasional PCI-related issues. Specifically:

Jan 23 10:11:37 x kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
Jan 23 10:11:37 x kernel: eth0: Host error, FIFO diagnostic register 0000.
Jan 23 10:11:37 x kernel: eth0: PCI bus error, bus status 80000020
Jan 23 10:11:37 x kernel: You probably have a hardware problem with your RAM chips
Jan 23 10:11:37 x kernel: eth0: Host error, FIFO diagnostic register 0000.
Jan 23 10:11:37 x kernel: eth0: PCI bus error, bus status 80000020

The last two messages will repeat indefinitely, usually with a hit to
the disk for each pair of log entries (resulting in a very distinctive
drive grinding). Memory problems don't seem to be the issue; a fairly
extensive run of memtest86 came back clean.

Taking a few minutes to try and rectify the situation, I started
shutting down services and manually unloading modules to see what was
causing the problem. Unloading usbcore did the trick:

Jan 26 18:41:24 x kernel: eth0: Host error, FIFO diagnostic register 0000.
Jan 26 18:41:24 x kernel: eth0: PCI bus error, bus status 80000020
Jan 26 18:41:24 x kernel: eth0: Too much work in interrupt, status e003.
Jan 26 18:41:24 x kernel: usb.c: USB disconnect on device 1
Jan 26 18:41:24 x kernel: USB bus 1 deregistered

I've rebooted the machine since then, but have always unloaded usb-uhci
and usbcore after booting. The issue hasn't cropped up again, although
it happened every couple of days previously.

The kernel in question is Red Hat's kernel-smp-2.4.9-21 build.

--
Edward S. Marshall <[email protected]>
http://esm.logic.net/
-------------------------------------------------------------------------------
[ Felix qui potuit rerum cognoscere causas. ]

2002-02-01 18:16:47

by Nick

Subject: Re: PCI Problems [was Re: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)]

Odd, I've got an HP LPr with an Ethernet controller: Intel Corporation
82557 [Ethernet Pro 100] (rev 8) on the riser. Works fine for me under
the Debian SMP kernel Linux version 2.4.5-686-smp (herbert@gondolin) (gcc
version 2.95.4 20010319 (Debian prerelease)) #1 SMP Sun May 27 18:32:54
EST 2001. If you'd like me to test a workload or similar, let me know; the
system is relatively low on memory, though.
Nick

2002-02-01 19:45:11

by Ken Brownfield

Subject: Re: PCI Problems [was Re: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)]

I've had LPr, LP1000r, LP2000r, and LH6000s in *heavy* production for
two years straight with nary a whimper from the eepro100, e100, or e1000
drivers.* This is SMP with 2-6 procs, 256MB-4GB RAM, all 2.2 and 2.4
kernels under RH6.2. On 100/1000, never 10Mb though.

Not to say that the HPs smell like roses, but I would highly suspect bad
hardware or a suspect BIOS/PCB revision, etc. in this case.

Just my US$0.02,
--
Ken.
[email protected]

* besides a quirky arp issue on boot that seemed to go away on its own
and wasn't card-specific. And the long-standing I/O APIC issues. ;)

2002-02-01 23:41:43

by Alan

Subject: Re: PCI Problems [was Re: NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)]

> Jan 23 10:11:37 x kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
> Jan 23 10:11:37 x kernel: eth0: Host error, FIFO diagnostic register 0000.
> Jan 23 10:11:37 x kernel: eth0: PCI bus error, bus status 80000020
> Jan 23 10:11:37 x kernel: You probably have a hardware problem with your RAM chips
> Jan 23 10:11:37 x kernel: eth0: Host error, FIFO diagnostic register 0000.
> Jan 23 10:11:37 x kernel: eth0: PCI bus error, bus status 80000020

Your machine took an NMI and a PCI bus diagnostic. That generally points
hard to a bus problem.