Hi!
We've got a lot of machines with the Intel eepro100 onboard, and when
we try to stress-test the network (running bonnie++ on an NFS-shared
directory on a machine), the network card logs "eth0: Card reports no
resources" to dmesg, and then the "line" appears dead for some time (one
minute or more). What can be done to remove this error? NFS times out with
this error (obviously)...
--
Thomas
I actually have the same issue but I am not seeing any performance
loss. I do extensive NFS transfers as this box also stores a software raid
array, and aside from the kernel message, I am unaffected.
From what I understand the problem is a hardware bug, and I believe I read
somewhere that forcing the network card to use its own IRQ, rather than
sharing one, will alleviate this problem.
Hope this helps ....
On a side note, I run this NIC on about 10 production web servers running
FreeBSD 3.5 that receive extensive traffic loads, and I have no problems with
them at all.
Jim
============================
They that give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
--Benjamin Franklin,
Historical Review of Pennsylvania, 1759
On Mon, 29 Oct 2001, Thomas Langås wrote:
> Hi!
> We've got a lot of machines with the Intel eepro100 onboard, and when
> we try to stress-test the network (running bonnie++ on an NFS-shared
> directory on a machine), the network card logs "eth0: Card reports no
> resources" to dmesg, and then the "line" appears dead for some time (one
> minute or more). What can be done to remove this error? NFS times out with
> this error (obviously)...
> --
> Thomas
Thomas Langås wrote:
> Hi!
>
> We've got a lot of machines with the eepro 100 from intel onboard, and when
> we try to stress-test the network (running bonnie++ on a nfs-shared
> directory on a machine), the network-card says "eth0: Card reports no
> resources" to dmesg, and then the "line" appear dead for some time (one
> minutte or more). What can be done to remove this error? NFS timesout with
> this error (obviously)...
We found that using the intel e100 driver
instead of the eepro100 eliminates these
errors - YMMV of course -
cu
jjs
> directory on a machine), the network-card says "eth0: Card reports no
> resources" to dmesg, and then the "line" appear dead for some time (one
> minutte or more). What can be done to remove this error? NFS timesout with
> this error (obviously)...
Which kernel version, which eepro100 chip ?
Hi,
We have exactly the same problem with 2.4.9, 2.4.10 and 2.4.13, so
we had to switch to Intel's driver.
From 'cat /proc/pci':
Bus 1, device 1, function 0:
Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 8).
IRQ 9.
Master Capable. Latency=64. Min Gnt=8.Max Lat=56.
Non-prefetchable 32 bit memory at 0xff8fe000 [0xff8fefff].
I/O at 0xdf00 [0xdf3f].
Non-prefetchable 32 bit memory at 0xff600000 [0xff6fffff].
It is an Intel i810 motherboard with the NIC onboard,
but Intel's driver (e100-1.6.22) says on boot:
eth0: Intel(R) 82559 Fast Ethernet LAN on Motherboard
the chip is:
GD82559
L021LP51
We have this problem when the NIC is under heavy traffic.
Is there any other information that can help you track down the problem?
P.S. I can reproduce this problem any time.
On Mon, Oct 29, 2001 at 10:44:41AM +0000, Alan Cox wrote:
> > directory on a machine), the network-card says "eth0: Card reports no
> > resources" to dmesg, and then the "line" appear dead for some time (one
> > minutte or more). What can be done to remove this error? NFS timesout with
> > this error (obviously)...
>
> Which kernel version, which eepro100 chip ?
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
Best regards.
--
Michael Rozhavsky Tel: +972-4-9936248
[email protected] Fax: +972-4-9890564
Optical Access
Senior Software Engineer http://www.opticalaccess.com
> We have exactly the same problem with 2.4.9, 2.4.10 and 2.4.13, so
> We had to switch to Intel's driver.
10Mbit half duplex ?
On Mon, Oct 29, 2001 at 11:49:14AM +0000, Alan Cox wrote:
> > We have exactly the same problem with 2.4.9, 2.4.10 and 2.4.13, so
> > We had to switch to Intel's driver.
>
> 10Mbit half duplex ?
10Mbit but Full duplex.
Best regards.
--
Michael Rozhavsky Tel: +972-4-9936248
[email protected] Fax: +972-4-9890564
Optical Access
Senior Software Engineer http://www.opticalaccess.com
Alan Cox:
> Which kernel version, which eepro100 chip ?
All kernels so far, starting with 2.4.0 (the first one we tested), and we've
now come to 2.4.13 and the error is still there.
Output from lspci -vvvxx:
02:04.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
Subsystem: Dell Computer Corporation: Unknown device 009b
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 32 (2000ns min, 14000ns max), cache line size 08
Interrupt: pin A routed to IRQ 16
Region 0: Memory at fe900000 (32-bit, non-prefetchable) [size=4K]
Region 1: I/O ports at bcc0 [size=64]
Region 2: Memory at fe500000 (32-bit, non-prefetchable) [size=1M]
Expansion ROM at fe600000 [disabled] [size=1M]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=2 PME-
00: 86 80 29 12 17 01 90 02 08 00 00 02 08 20 00 00
10: 00 00 90 fe c1 bc 00 00 00 00 50 fe 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 28 10 9b 00
30: 00 00 60 fe dc 00 00 00 00 00 00 00 05 01 08 38
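As a rough illustration, the hex dump above can be decoded by hand or with a few lines of Python. This is only a sketch: the offsets are those of the standard PCI type-0 configuration header, and the helper names `parse_hexdump` and `decode_header` are made up for this example. It shows where the vendor/device IDs, revision, latency timer and Dell subsystem IDs reported by lspci come from.

```python
import struct

# The four rows printed by "lspci -vvvxx" above, copied verbatim.
ROWS = [
    "00: 86 80 29 12 17 01 90 02 08 00 00 02 08 20 00 00",
    "10: 00 00 90 fe c1 bc 00 00 00 00 50 fe 00 00 00 00",
    "20: 00 00 00 00 00 00 00 00 00 00 00 00 28 10 9b 00",
    "30: 00 00 60 fe dc 00 00 00 00 00 00 00 05 01 08 38",
]


def parse_hexdump(rows):
    """Turn 'offset: xx xx ...' rows into one bytes object."""
    data = bytearray()
    for row in rows:
        _, _, hex_part = row.partition(":")
        data.extend(int(tok, 16) for tok in hex_part.split())
    return bytes(data)


def decode_header(cfg):
    """Decode a few little-endian fields of a type-0 PCI header."""
    vendor, device = struct.unpack_from("<HH", cfg, 0x00)
    subsys_vendor, subsys_id = struct.unpack_from("<HH", cfg, 0x2C)
    return {
        "vendor": vendor,
        "device": device,
        "revision": cfg[0x08],
        "class": cfg[0x0B],
        "subclass": cfg[0x0A],
        "latency_timer": cfg[0x0D],
        "subsys_vendor": subsys_vendor,
        "subsys_id": subsys_id,
        "interrupt_pin": cfg[0x3D],
    }


if __name__ == "__main__":
    for name, value in decode_header(parse_hexdump(ROWS)).items():
        print(f"{name:14s} 0x{value:04x}")
```

The decoded latency timer (0x20 = 32) and subsystem IDs (0x1028, 0x009b) agree with the human-readable lines lspci printed above the dump.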
Output from dmesg:
eepro100.c:v1.09j-t 9/29/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html
eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <[email protected]> and others
eth0: Intel Corporation 82557 [Ethernet Pro 100], 00:B0:D0:F0:8B:65, IRQ 16.
Receiver lock-up bug exists -- enabling work-around.
Board assembly 07195d-000, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
General self-test: passed.
Serial sub-system self-test: passed.
Internal registers self-test: passed.
ROM checksum self-test: passed (0x04f4518b).
Receiver lock-up workaround activated.
I'd gladly help you track down and fix this problem, and if you need
any more info (or testing of patches) just tell me :)
--
Thomas
Thomas Langås wrote:
>
> Hi!
>
> We've got a lot of machines with the eepro 100 from intel onboard, and when
> we try to stress-test the network (running bonnie++ on a nfs-shared
> directory on a machine), the network-card says "eth0: Card reports no
> resources" to dmesg, and then the "line" appear dead for some time (one
> minutte or more). What can be done to remove this error? NFS timesout with
> this error (obviously)...
>
> --
> Thomas
We have almost the same problem, except it totally locks up the
computer. Light network utilization is OK, but heavy traffic triggers the
lock-up.
No syslog reports, even the keyboard LEDs won't light up (Num Lock, etc.).
Rebooting helps for a while. We had to install another network card as
a workaround. I've tried kernels 2.4.10 and 2.4.12.
The network card is integrated on the motherboard.
dmesg:
----
eepro100.c:v1.09j-t 9/29/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html
eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <[email protected]> and others
PCI: Found IRQ 11 for device 01:08.0
eth1: Intel Corporation 82801BA(M) Ethernet, 00:03:47:A2:F8:81, IRQ 11.
Board assembly 000000-000, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
General self-test: passed.
Serial sub-system self-test: passed.
Internal registers self-test: passed.
ROM checksum self-test: passed (0x04f4518b).
-----
[obelix:/root]> eepro100-diag -ee -f -vv
eepro100-diag.c:v2.05 6/13/2001 Donald Becker ([email protected])
http://www.scyld.com/diag/index.html
Index #1: Found a Intel i82562 Pro/100 V adapter at 0xde80.
i82557 chip registers at 0xde80:
00000000 00000000 00000000 00080002 183f0000 00000000
No interrupt sources are pending.
The transmit unit state is 'Idle'.
The receive unit state is 'Idle'.
This status is unusual for an activated interface.
EEPROM contents, size 64x16:
00: 0300 a247 81f8 1a03 0000 0201 4701 0000
0x08: 0000 0000 49b0 3013 8086 007f ffff ffff
0x10: ffff ffff ffff ffff ffff ffff ffff ffff
0x18: ffff ffff ffff ffff ffff ffff ffff ffff
0x20: ffff ffff ffff ffff ffff ffff ffff ffff
0x28: ffff ffff ffff ffff ffff ffff ffff ffff
0x30: 0000 ffff ffff ffff ffff ffff ffff ffff
0x38: ffff ffff ffff 0000 ffff ffff ffff 35dd
The EEPROM checksum is correct.
Intel EtherExpress Pro 10/100 EEPROM contents:
Station address 00:03:47:A2:F8:81.
Board assembly 000000-000, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
MII PHY #1 transceiver registers:
3100 7809 02a8 0330 05e1 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
2004 0000 0000 0000 0000 0000 0000 0000
0000 0000 0ce0 0000 0010 0000 0000 0000.
[obelix:/root]>
----
Oh, eepro100-diag reported 'Sleep mode is enabled', which could cause
something like this, so I disabled it, but with no positive effect.
Any similar problems?
Thanks,
Jarmo
>We have almost the same problem, except it totally locks up the
>computer. Light network utilization is ok, but heavy traffic does the
>effect.
>No syslog reports, even keyboards leds won't light up (numlock, etc).
>Rebooting helps for a while. We had to install another network card for
>a workaround. I've tried kernels 2.4.10 and 2.4.12.
>
>The network card is integrated at the motherboard.
>
>dmesg:
>----
>*snip*
>-----
>*snip*
>----
>Oh, eepro100-diag reported 'Sleep mode is enabled', which could do
>something like this -> I disabled it, but no positive effect.
>
>
>Any similar problems?
Sounds like this might be the same problem that we are experiencing
here. The NIC does get a high load of traffic immediately after it has
booted up.
No messages about anything being even remotely wrong, even after
setting the highest debug level in the eepro100 driver.
-=Dead2=-
*my previous message to this list about this issue*
> Tested now with another motherboard with the same results.
> MSI 6321 Pro 1.0
>
> Both these motherboards use VIA dual-cpu chipsets.
> Same results with 2.4.13-Pre6 on both motherboards.
>
> > I have an Asus CUV266-d motherboard, and want to use my Intel NIC's..
> >
> > 2.4.10 & 2.4.12 hangs while "Setting up routing"
> > No error messages appear.
> >
> > 2.4.x(4 maybe?) has both 'e100' drivers and the 'eepro100' drivers.
> > When loading the 'eepro100', it hangs just like with todays kernels.
> > When loading the 'e100', everything works just fine for a short while..
> > 20-40seconds I guess.. Then the computer hangs.
> >
> > When not loading any NIC drivers, everything works just fine.
> >
> > The NIC's i've tried are named "Intel(R) PRO/100+ Dual Port Server
> Adapter"
> > Have also tried a "Intel(R) PRO/100+ Adapter"
> >
> > Any ideas of what to test?
> > I have the latest bios and have tried just about all bios settings.
> > 'noapic' doesn't help.
I searched a bit, and it seems some users have had the same kind of problems
on 10Mbit networks with a high number of collisions, just like ours.
Jarmo
J Sloan:
> We found that using the intel e100 driver
> instead of the eepro100 eliminates these
> errors - YMMV of course -
I've now tried the Intel driver; no help, I still get the NFS timeouts (the
Intel driver doesn't output anything to dmesg, so there's no way of telling if
the same thing occurs as with the stock-kernel eepro100 driver).
This is how I do the test:
1. NFS-share a filesystem.
2. NFS-mount it on another box (not running an Intel e100 NIC).
3. Start bonnie++ on the box that has mounted the NFS share.
After 10-20 minutes, the first NFS timeout comes (which means the card is out
of resources, and "halts" for a bit). When the card runs out of resources,
it seems to take a few minutes before it comes online again, though I don't
know why.
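While such a test runs, it can help to tally how often the stock driver logs the message. A minimal sketch, assuming the dmesg text has already been captured into a string; the function name `count_no_resources` and the sample lines are invented for illustration, and only the message text itself comes from this thread.

```python
import re
from collections import Counter

# Matches the message quoted in this thread, e.g.
# "eth0: Card reports no resources" (case-insensitive on purpose).
PATTERN = re.compile(r"(eth\d+): card reports no resources", re.IGNORECASE)


def count_no_resources(dmesg_text):
    """Count 'Card reports no resources' lines per interface."""
    return Counter(m.group(1) for m in PATTERN.finditer(dmesg_text))


if __name__ == "__main__":
    # Hypothetical sample; in practice feed in the saved output of dmesg.
    sample = (
        "eth0: Card reports no resources\n"
        "nfs: server not responding\n"
        "eth0: Card reports no resources\n"
    )
    print(dict(count_no_resources(sample)))
```

This only helps with the eepro100 driver, since the Intel driver logs nothing to dmesg in this situation.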
Has anyone got any suggestions on how to start tracking down, and maybe
fixing, this problem? Or is this a hardware error? Or maybe a firmware
error? Should I start contacting Dell and tell them that there's a
possible error in their PowerEdge 2550 series?
--
Thomas
Hi,
On Wed, Oct 31, 2001 at 09:01:25AM +0100, Thomas Langås wrote:
>
> I've now tried the Intel driver, no help, still get the NFS timeouts (the
> intel driver doesn't output anything to dmesg, so it's no way of telling if
> the same things occur as in the eepro100 stock-kernel driver).
>
> This is how I do the test:
>
> NFS share a filesystem
> NFS mount it on another box (not running intel e100 nic)
> Start bonnie++ on the box that has mounted the nfs share
>
> After 10-20mins, the first NFS timeout comes (which means the card is out of
> resources, and "halts" for a bit). When the card becomes out of resources,
> it seems like it uses a few minutes before it comes online again, no wonder
> why, tho.
>
> Has anyone got any suggestions on how to start tracking down, and maybe
> fixing this problem? Or, is this a hardware error? Or maybe a firmware
Well, with eepro100 a good start may be the following:
1. When the card stalls, start ping from that host.
This way you ensure that you have something in the transmit ring.
If it's transmitting that stalls, you'll get a message from the netdev watchdog.
2. If ping works, then your problem appears to be a pure NFS one, i.e. the
inability of NFS to recover from a disruption of network operation.
3. If ping is able to transmit, but not receive (you may check it with
tcpdump), then we have a receiver problem.
We'll think about what to do then.
4. In any case, running eepro100-diag from scyld.com at the moment of the
stall may give some useful information.
5. In any case, searching the eepro100 mailing list archive on scyld.com is a
good idea; you may learn what other people observe/do.
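Steps 1 to 3 above amount to a small decision tree, which can be sketched as follows. The function `classify_stall` and its two boolean inputs are hypothetical: `tx_ok` means frames were seen leaving the interface (e.g. in tcpdump), and `rx_ok` means ping replies came back.

```python
def classify_stall(tx_ok, rx_ok):
    """Map ping/tcpdump observations during a stall to steps 1-3."""
    if not tx_ok:
        # Step 1: transmit is stuck, so the netdev watchdog should complain.
        return "transmit stall: expect a netdev watchdog message"
    if rx_ok:
        # Step 2: the network works, so NFS is failing to recover.
        return "ping works: looks like a pure NFS recovery problem"
    # Step 3: packets go out but nothing comes back in.
    return "transmits but does not receive: receiver problem"


if __name__ == "__main__":
    for tx_ok, rx_ok in [(False, False), (True, True), (True, False)]:
        print(tx_ok, rx_ok, "->", classify_stall(tx_ok, rx_ok))
```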
Andrey
> error? Should I start contacting Dell and tell them that's there's a
> possible error in their PowerEdge 2550-series?
Andrey Savochkin wrote:
>
> Hi,
>
> On Wed, Oct 31, 2001 at 09:01:25AM +0100, Thomas Lang?s wrote:
> >
> > I've now tried the Intel driver, no help, still get the NFS timeouts (the
> > intel driver doesn't output anything to dmesg, so it's no way of telling if
> > the same things occur as in the eepro100 stock-kernel driver).
> >
> > This is how I do the test:
> >
> > NFS share a filesystem
> > NFS mount it on another box (not running intel e100 nic)
> > Start bonnie++ on the box that has mounted the nfs share
> >
> > After 10-20mins, the first NFS timeout comes (which means the card is out of
> > resources, and "halts" for a bit). When the card becomes out of resources,
> > it seems like it uses a few minutes before it comes online again, no wonder
> > why, tho.
> >
> > Has anyone got any suggestions on how to start tracking down, and maybe
> > fixing this problem? Or, is this a hardware error? Or maybe a firmware
>
> Well, with eepro100 the start may be the following:
> 1. When the card stalls, start ping from that host.
> This way you ensure that you have something in transmit ring.
> If it's transmitting that stalls, you'll get a message from netdev watchdog.
> 2. If ping works, then your problem appear to be pure NFS one, i.e. inability
> of NFS to recover from network operation disruption.
> 3. If ping is able to transmit, but not receive (you may check it by
> tcpdump), then we have a receiver problem.
> We'll think what to do then.
>
> 4. In any case, running eepro100-diag from scyld.com at the moment of the
> stall may give some useful information.
> 5. In any case, searching eepro100 mailing list archive on scyld.com is a
> good idea, you may learn what other people observe/do.
>
> Andrey
>
> > error? Should I start contacting Dell and tell them that's there's a
> > possible error in their PowerEdge 2550-series?
Guys, this is the networking section of my config:
#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_MMAP is not set
# CONFIG_NETLINK is not set
# CONFIG_NETFILTER is not set
# CONFIG_FILTER is not set
CONFIG_UNIX=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
# CONFIG_IP_ADVANCED_ROUTER is not set
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_IP_MROUTE is not set
# CONFIG_INET_ECN is not set
# CONFIG_SYN_COOKIES is not set
# CONFIG_IPV6 is not set
# CONFIG_KHTTPD is not set
# CONFIG_ATM is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_DECNET is not set
# CONFIG_BRIDGE is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_LLC is not set
# CONFIG_NET_DIVERT is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_NET_FASTROUTE is not set
# CONFIG_NET_HW_FLOWCONTROL is not set
#
# QoS and/or fair queueing
#
# CONFIG_NET_SCHED is not set
I am running this config (2.4.13) now and my machine has eepro100.o loaded.
I am testing it now. The problem appears when some options in the IP section
are enabled. I can't say yet which of them (I think SYN or MROUTE, but that's
only my assumption).
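One way to narrow this down is to diff a working config against a failing one and list the options that are only set in the failing one. A minimal sketch: the helper names are invented, the "failing" config below is purely hypothetical, and it assumes "SYN" above refers to CONFIG_SYN_COOKIES.

```python
def parse_config(text):
    """Return {option: value} for every option that is set."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        # "# CONFIG_FOO is not set" lines start with '#' and are skipped.
        if line.startswith("CONFIG_") and "=" in line:
            name, _, value = line.partition("=")
            opts[name] = value
    return opts


def newly_enabled(working, failing):
    """Options set in the failing config but not in the working one."""
    good, bad = parse_config(working), parse_config(failing)
    return sorted(name for name in bad if name not in good)


# Excerpt of the working config posted above.
WORKING = """
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
# CONFIG_IP_MROUTE is not set
# CONFIG_SYN_COOKIES is not set
"""

# Hypothetical failing config, for illustration only.
FAILING = """
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_MROUTE=y
CONFIG_SYN_COOKIES=y
"""

if __name__ == "__main__":
    print(newly_enabled(WORKING, FAILING))
```

Enabling the listed options one at a time would then show which of them, if any, triggers the problem.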
On Wednesday, 31 October 2001 09:01, Thomas Langås wrote:
> J Sloan:
> > We found that using the intel e100 driver
> > instead of the eepro100 eliminates these
> > errors - YMMV of course -
>
> I've now tried the Intel driver, no help, still get the NFS timeouts (the
> intel driver doesn't output anything to dmesg, so it's no way of telling if
> the same things occur as in the eepro100 stock-kernel driver).
I had some trouble with an Intel STL 2 board and the onboard EEPRO100.
Samba worked OK but it always got stuck on NFS transfers.
There was a bug in the older BMC firmware, so the eepro100 detected
some NFS frames as "TCO" packets.
(http://support.intel.com/support/motherboards/server/ta_353-1.htm)
If you use the e100 driver, you can look at
/proc/net/PRO_LAN_Adapters/eth0.info
If the "Tx_TCO_Packets" entry isn't zero after NFS times out,
this may be your problem.
With the eepro100 driver you will only see overruns with ifconfig.
If this is the case, you may want to check for a BMC (board management
controller) software update.
...Juergen
Andrey Savochkin:
> Well, with eepro100 the start may be the following:
> 1. When the card stalls, start ping from that host.
> This way you ensure that you have something in transmit ring.
> If it's transmitting that stalls, you'll get a message from netdev watchdog.
From the server, or the client? I've already tried pinging from the server
when I get the error-message in dmesg, but it's unresponsive to anything.
And, I mean anything, network-wise. There seems to be a timeout somewhere,
because after some time, everything resumes back to normal again.
> 4. In any case, running eepro100-diag from scyld.com at the moment of the
> stall may give some useful information.
OK, I'll do the test again, and run the eepro100-diag. Any special options
you want me to specify?
> 5. In any case, searching eepro100 mailing list archive on scyld.com is a
> good idea, you may learn what other people observe/do.
OK, I'll search... :)
--
Thomas
Juergen Hasch:
> If you use the e100 driver, you can look at
> /proc/net/PRO_LAN_ADAPTERS/eth0.info
> If the "Tx_TCO_Packets" entry isn't zero after NFS times out,
> this may be your problem.
> With the eepro100 driver you will only see overruns with ifconfig.
Here's the full /proc/net/PRO_LAN_Adapters/eth0.info output (after NFS
timeouts):
gekko:~# cat /proc/net/PRO_LAN_Adapters/eth0.info
Description Intel(R) 8255x-based Ethernet Adapter
Driver_Name e100
Driver_Version 1.6.22
PCI_Vendor 0x8086
PCI_Device_ID 0x1229
PCI_Subsystem_Vendor 0x1028
PCI_Subsystem_ID 0x009b
PCI_Revision_ID 0x0008
PCI_Bus 2
PCI_Slot 4
IRQ 16
System_Device_Name eth0
Current_HWaddr 00:B0:D0:F0:8B:65
Permanent_HWaddr 00:B0:D0:F0:8B:65
Part_Number 07195d-000
Link up
Speed 100
Duplex full
State up
Rx_Packets 27747043
Tx_Packets 25999146
Rx_Bytes 1730389022
Tx_Bytes 21884644
Rx_Errors 0
Tx_Errors 0
Rx_Dropped 0
Tx_Dropped 0
Multicast 0
Collisions 0
Rx_Length_Errors 0
Rx_Over_Errors 0
Rx_CRC_Errors 0
Rx_Frame_Errors 0
Rx_FIFO_Errors 0
Rx_Missed_Errors 0
Tx_Aborted_Errors 0
Tx_Carrier_Errors 0
Tx_FIFO_Errors 0
Tx_Heartbeat_Errors 0
Tx_Window_Errors 0
Rx_TCP_Checksum_Good 0
Rx_TCP_Checksum_Bad 0
Tx_TCP_Checksum_Good 0
Tx_TCP_Checksum_Bad 0
Tx_Abort_Late_Coll 0
Tx_Deferred_Ok 0
Tx_Single_Coll_Ok 0
Tx_Multi_Coll_Ok 0
Rx_Long_Length_Errors 0
Rx_Align_Errors 0
Tx_Flow_Control_Pause 0
Rx_Flow_Control_Pause 0
Rx_Flow_Control_Unsup 0
Tx_TCO_Packets 0
Rx_TCO_Packets 1
scbp = 0xf89da000 bddp = 0xf77568c0
--
Thomas
> Here's the full /proc/net/PRO_LAN_Adapters/eth0.info output (after NFS
> timeouts):
>
> gekko:~# cat /proc/net/PRO_LAN_Adapters/eth0.info
> Description Intel(R) 8255x-based Ethernet Adapter
> Driver_Name e100
> Driver_Version 1.6.22
> PCI_Vendor 0x8086
> PCI_Device_ID 0x1229
> PCI_Subsystem_Vendor 0x1028
> PCI_Subsystem_ID 0x009b
> PCI_Revision_ID 0x0008
> PCI_Bus 2
> PCI_Slot 4
> IRQ 16
> System_Device_Name eth0
> Current_HWaddr 00:B0:D0:F0:8B:65
> Permanent_HWaddr 00:B0:D0:F0:8B:65
> Part_Number 07195d-000
>
> Link up
> Speed 100
> Duplex full
> State up
>
> Rx_Packets 27747043
> Tx_Packets 25999146
> Rx_Bytes 1730389022
> Tx_Bytes 21884644
> Rx_Errors 0
> Tx_Errors 0
> Rx_Dropped 0
> Tx_Dropped 0
> Multicast 0
> Collisions 0
> Rx_Length_Errors 0
> Rx_Over_Errors 0
> Rx_CRC_Errors 0
> Rx_Frame_Errors 0
> Rx_FIFO_Errors 0
> Rx_Missed_Errors 0
> Tx_Aborted_Errors 0
> Tx_Carrier_Errors 0
> Tx_FIFO_Errors 0
> Tx_Heartbeat_Errors 0
> Tx_Window_Errors 0
>
> Rx_TCP_Checksum_Good 0
> Rx_TCP_Checksum_Bad 0
> Tx_TCP_Checksum_Good 0
> Tx_TCP_Checksum_Bad 0
>
> Tx_Abort_Late_Coll 0
> Tx_Deferred_Ok 0
> Tx_Single_Coll_Ok 0
> Tx_Multi_Coll_Ok 0
> Rx_Long_Length_Errors 0
> Rx_Align_Errors 0
>
> Tx_Flow_Control_Pause 0
> Rx_Flow_Control_Pause 0
> Rx_Flow_Control_Unsup 0
>
> Tx_TCO_Packets 0
> Rx_TCO_Packets 1
> scbp = 0xf89da000 bddp = 0xf77568c0
Well, this doesn't look exactly the same as on the system I had problems with.
But your Rx_TCO_Packets counter is 1, so this may be related
(I also got Rx overrun errors). It may be that your BMC receives the packet
and simply chooses to ignore it because it is not a valid server management
packet.
Could you run another test and take a look at eth0.info?
I could reproduce the problem when copying a large file over NFS, but not
when transferring it via FTP. Try this a few times.
If you can reproduce your network card getting stuck only when using NFS,
with Rx_TCO_Packets > 0 after it is stuck, this is it.
Then you either need to upgrade your BMC firmware or add another network card
which doesn't eat NFS packets.
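The check described above can be scripted against the eth0.info text. This is a sketch only: `parse_info` and `tco_suspect` are made-up names, and the sample is a short excerpt of the output posted earlier in the thread. A non-zero counter is just a hint; as said above, it only points at the BMC if the stalls also happen with NFS and not with FTP.

```python
def parse_info(text):
    """Parse 'Name value' lines of eth0.info into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        # Skip blank lines and the trailing 'scbp = ... bddp = ...' line.
        if len(parts) == 2 and "=" not in line:
            fields[parts[0]] = parts[1].strip()
    return fields


def tco_suspect(fields):
    """True if the Rx_TCO_Packets counter is non-zero."""
    return int(fields.get("Rx_TCO_Packets", "0")) > 0


# Excerpt of the eth0.info output posted earlier in this thread.
SAMPLE = """\
Description Intel(R) 8255x-based Ethernet Adapter
Driver_Name e100
Driver_Version 1.6.22
Rx_Over_Errors 0
Tx_TCO_Packets 0
Rx_TCO_Packets 1
scbp = 0xf89da000 bddp = 0xf77568c0
"""

if __name__ == "__main__":
    info = parse_info(SAMPLE)
    print("Rx_TCO_Packets:", info["Rx_TCO_Packets"])
    print("TCO suspect:", tco_suspect(info))
```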
...Juergen
Juergen Hasch:
> But your Rx_TCO_Packets counter is 1, so this may be related
> (I also got Rx overrun errors). It may be that your BMC receives the packet
> and simply chooses to ignore it because it is no valid server management
> packet.
> Could you make another test and take a look at the eth0.info ?
> I could reproduce the problem when copying a large file over NFS, but not
> when transferring it via ftp. Try this a few times.
> If you can reproduce you network card being stuck only when using NFS and
> having Rx_TCO_Packets > 0 after it is stuck, this is it.
> Then you either need tu upgrade your BMC firmware or add another network card,
> which doesn't eat NFS packets.
I'm testing now; however, running eepro100-diag gave me some interesting
output:
Sleep mode is enabled. This is not recommended. Under high load the card
may not respond to PCI requests, and thus cause a master abort.
How do I disable sleep mode? I've never even enabled it.
--
Thomas
On Thu, Nov 01, 2001 at 08:55:23AM +0100, Thomas Langås wrote:
> Andrey Savochkin:
> > Well, with eepro100 the start may be the following:
> > 1. When the card stalls, start ping from that host.
> > This way you ensure that you have something in transmit ring.
> > If it's transmitting that stalls, you'll get a message from netdev watchdog.
>
> From the server, or the client? I've already tried pinging from the server
From the computer where the network card hangs and where you see messages in
dmesg. The network card hangs on only one side, right?
> when I get the error-message in dmesg, but it's unresponsive to anything.
> And, I mean anything, network-wise. There seems to be a timeout somewhere,
> because after some time, everything resumes back to normal again.
If the operations stall for just a few seconds, it's perfectly OK.
If after a stop of a few seconds the card itself resumes normal operation, but
NFS operations are blocked for a much longer time, it's an NFS problem.
If the card itself stops operating for a long time, it needs to be fixed.
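These three cases can be written down as a tiny classifier. This is a sketch under assumptions: the function name and the 5-second cutoff `brief_s` are invented here, since "a few seconds" and "a long time" are not given exact values above.

```python
def classify_by_duration(card_stall_s, nfs_block_s, brief_s=5.0):
    """Classify a stall from how long the card and NFS were blocked.

    brief_s is an assumed cutoff for "a few seconds"; adjust as needed.
    """
    if card_stall_s > brief_s:
        return "card"  # the card itself stays down: driver/hardware issue
    if nfs_block_s > brief_s:
        return "nfs"   # card recovered quickly, NFS did not
    return "ok"        # short stall all round


if __name__ == "__main__":
    print(classify_by_duration(2.0, 3.0))
    print(classify_by_duration(2.0, 120.0))
    print(classify_by_duration(90.0, 120.0))
```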
Andrey
On Thursday, 1 November 2001 10:06, Thomas Langås wrote:
> Juergen Hasch:
>
> I'm testing now, however, running eepro100-diag gave me some interessting
> output:
>
> Sleep mode is enabled. This is not recommended. Under high load the card
> may not respond to PCI requests, and thus cause a master abort.
>
> How do I disable sleepmode? I've never even enabled it.
The sleep bit is sometimes enabled by default (it was for me).
You can clear it with eepro100-diag (I think it was the -Gww option).
The documentation for eepro100-diag is somewhat sparse, but
clearing the sleep bit was discussed in great detail on the eepro100 mailing
list at scyld.com. You might want to browse the archives there.
...Juergen
Andrey Savochkin:
> >From the computer where the network card hangs and where you see messages in
> dmesg. The network card hangs on only one side, right?
Yep, and sorry, I meant that I tried pinging from the client side.
> If the operations stall just for few seconds, it's perfectly ok.
> If after a few second stop the card itself resumes to operate normally, but
> NFS operations are blocked for much longer time, it's NFS problem.
> If the card itself stops operation for a long time, it needs to be fixed.
OK, it seems like the stock kernel driver hangs much longer than the
Intel driver (the Intel driver only hung for a few seconds when I tried just
now).
--
Thomas
On Wed, Oct 31, 2001 at 07:10:49PM +0100, Juergen Hasch wrote:
>
> I had some trouble with an Intel STL 2 board and the onboard EEPRO100.
> Samba worked OK but it always got stuck on NFS transfers.
>
> There was a bug in the older BMC firmware, so the eepro100 detected
> some NFS frames as "TCO" packets.
> (http://support.intel.com/support/motherboards/server/ta_353-1.htm)
>
> If you use the e100 driver, you can look at
> /proc/net/PRO_LAN_ADAPTERS/eth0.info
> If the "Tx_TCO_Packets" entry isn't zero after NFS times out,
> this may be your problem.
> With the eepro100 driver you will only see overruns with ifconfig.
It should be Rx_TCO_Packets, not Tx.
The problem described in Intel's advisory is related to incorrect processing
of received packets.
Andrey
Andrey Savochkin:
> It should be Rx_TCO_Packets, not Tx.
> The problem described in Intel's advisory is related to incorrect processing
> of receiving packets.
But if it's this bug that's triggered by NFS traffic, then the counter
should increase with every timeout, right? Not just one time. I get a
lot of timeouts and the counter is still just 1.
I'm going out to buy another NIC and run the tests a bit more systematically,
and will report back with the results afterwards.
--
Thomas
On Thursday, 1 November 2001 13:00, Thomas Langås wrote:
> Andrey Savochkin:
> > It should be Rx_TCO_Packets, not Tx.
> > The problem described in Intel's advisory is related to incorrect
> > processing of receiving packets.
>
> But if it's this bug that's triggered with NFS-traffic, then the counter
> should be increasing with every timeout, right? Not just one time. I get a
> lot of timeout and the counter is still just 1.
>
> I'm going out to buy me another NIC and try tests a bit systematically, and
> report back with the results afterwards.
The Rx_TCO_Packets counter should increase at each timeout you get,
so this looks like another problem.
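To make that check systematic, one could record the counter after each timeout and test whether it keeps pace. A minimal sketch: the function name and the sample numbers are illustrative only, not measurements from this thread.

```python
def counter_tracks_timeouts(samples):
    """samples: (timeouts_seen, rx_tco_packets) pairs in time order.

    Returns False as soon as the number of timeouts grew between two
    samples while the counter did not. With fewer than two samples there
    is nothing to compare, so the result is True.
    """
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 > t0 and c1 <= c0:
            return False
    return True


if __name__ == "__main__":
    # Illustrative: many timeouts, counter stuck at 1 (not the TCO bug).
    print(counter_tracks_timeouts([(0, 0), (1, 1), (5, 1)]))
    # Illustrative: counter rises with every timeout (consistent with it).
    print(counter_tracks_timeouts([(0, 0), (1, 1), (2, 2)]))
```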
I have got two servers with two different EEPRO100 network cards.
One works better with the eepro100 driver, the other one seems to favour the
e100 driver :-)
Both cards are working flawlessly now; however, I was close to buying new NICs
because of problems like command timeouts, "no resources" messages and NFS
timeouts.
...Juergen
[email protected] (Juergen Hasch) writes:
>> I've now tried the Intel driver, no help, still get the NFS timeouts (the
>> intel driver doesn't output anything to dmesg, so it's no way of telling if
>> the same things occur as in the eepro100 stock-kernel driver).
>I had some trouble with an Intel STL 2 board and the onboard EEPRO100.
>Samba worked OK but it always got stuck on NFS transfers.
A datapoint that might be interesting:
I run four of these buggers with eepros as Internet interfaces for
heavy traffic (30-80 MBit sustained 24/7) under 2.2.19. Not a single
glitch on any of these boxes. The machines have two PIII/1GHz each
and a (custom built) SMP kernel based off RH 2.2.19-6.2.7.
boot message:
eepro100.c:v1.09j-t 9/29/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html
eepro100.c: $Revision: 1.20.2.10 $ 2000/05/31 Modified by Andrey V. Savochkin <[email protected]> and others
eepro100.c: VA Linux custom, Dragan Stancevic <[email protected]> 2000/11/15
eth0: Intel PCI EtherExpress Pro100 82557, 00:D0:B7:A8:67:EC, I/O at 0x2c00, IRQ 21.
Board assembly 000000-000, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
General self-test: passed.
Serial sub-system self-test: passed.
Internal registers self-test: passed.
ROM checksum self-test: passed (0x04f4518b).
So there may be a change between 2.2 and 2.4 that triggers the problems.
Regards
Henning
--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH [email protected]
Am Schwabachgrund 22 Fon.: 09131 / 50654-0 [email protected]
D-91054 Buckenhof Fax.: 09131 / 50654-20