2001-11-29 15:25:56

by Sven Heinicke

Subject: 2.4.16 freezed up with eepro100 module


The 2.4.16 kernel finally makes my clients happy with memory
management. The system that froze up is a Dell of some sort or other
with two 1GHz Pentium IIIs and 4GB of memory. But now I seem to be
having Ethernet problems with an eepro100 card:

Bus 0, device 4, function 0:
Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 8).
IRQ 16.
Master Capable. Latency=32. Min Gnt=8.Max Lat=56.
Non-prefetchable 32 bit memory at 0xfeb02000 [0xfeb02fff].
I/O at 0xfcc0 [0xfcff].
Non-prefetchable 32 bit memory at 0xfe900000 [0xfe9fffff].

With the driver loaded as a module and the card being used heavily, the
system froze with nothing on the console when I saw it. The log messages
were normal until:

Nov 28 22:03:31 ps1 kernel: eth0: can't fill rx buffer (force 0)!
Nov 28 22:05:03 ps1 kernel: 0001.
Nov 28 22:05:03 ps1 kernel: eth0: can't fill rx buffer (force 1)!
Nov 28 22:05:04 ps1 kernel: eth0: can't fill rx buffer (force 0)!
Nov 28 22:05:05 ps1 kernel: eth0: can't fill rx buffer (force 0)!
Nov 28 22:05:06 ps1 kernel: eth0: can't fill rx buffer (force 1)!
Nov 28 22:05:06 ps1 kernel: eth0: can't fill rx buffer (force 0)!
Nov 28 22:05:07 ps1 kernel: eth0: can't fill rx buffer (force 1)!
Nov 28 22:05:08 ps1 kernel: eth0: can't fill rx buffer (force 1)!
Nov 28 22:05:09 ps1 kernel: eth0: can't fill rx buffer (force 0)!
Nov 28 22:05:17 ps1 last message repeated 10 times
Nov 28 22:05:18 ps1 kernel: KERNEL: assertion (flags&MSG_PEEK) failed at tcp.c(1463):tcp_recvmsg
Nov 28 22:07:48 ps1 kernel: eth0: card reports no resources.
Nov 28 22:08:19 ps1 last message repeated 19 times
Nov 28 22:09:20 ps1 last message repeated 56 times
...
Nov 29 03:57:34 ps1 last message repeated 5 times
Nov 29 03:58:36 ps1 last message repeated 4 times
Nov 29 03:59:41 ps1 last message repeated 5 times
Nov 29 04:00:44 ps1 last message repeated 4 times
Nov 29 04:01:47 ps1 last message repeated 6 times
Nov 29 09:54:13 ps1 syslogd 1.4-0: restart.

Then I hit the reset key shortly before 10am. I'm going to start digging
through the code (I guess it will be more of a learning experience for
me than actually being able to help with the code). So any suggestions
will be helpful.

---
Sven Heinicke <[email protected]> Princeton, NJ
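
To see how the two eepro100 messages in a log like the one above build up over time, the lines can be tallied with a short script. This is a minimal sketch, not a tool anyone in the thread used: the function names and the sample text are illustrative, and it assumes the classic syslog layout shown above, where "last message repeated N times" refers to the kernel line immediately before it.

```python
import re
from collections import Counter

# A few lines copied from the log excerpt above, used as sample input.
SAMPLE_LOG = """\
Nov 28 22:05:08 ps1 kernel: eth0: can't fill rx buffer (force 1)!
Nov 28 22:05:09 ps1 kernel: eth0: can't fill rx buffer (force 0)!
Nov 28 22:05:17 ps1 last message repeated 10 times
Nov 28 22:07:48 ps1 kernel: eth0: card reports no resources.
Nov 28 22:08:19 ps1 last message repeated 19 times
"""

# "Mon DD HH:MM:SS host kernel: message"
KERNEL_MSG = re.compile(r"^\w{3}\s+\d+ [\d:]{8} \S+ kernel: (?P<msg>.*)$")
REPEATED = re.compile(r"last message repeated (?P<n>\d+) times$")


def normalize(msg):
    """Fold the "(force 0)" and "(force 1)" variants into one bucket."""
    return re.sub(r"\(force \d\)", "(force N)", msg)


def count_messages(lines):
    """Count kernel messages, expanding syslog's repeat summaries."""
    counts = Counter()
    last = None  # most recent kernel message seen
    for line in lines:
        line = line.rstrip("\n")
        repeat = REPEATED.search(line)
        if repeat and last is not None:
            counts[last] += int(repeat.group("n"))
            continue
        match = KERNEL_MSG.match(line)
        if match:
            last = normalize(match.group("msg"))
            counts[last] += 1
    return counts


if __name__ == "__main__":
    for msg, n in count_messages(SAMPLE_LOG.splitlines()).most_common():
        print(n, msg)
```

Run against a full /var/log/messages (the path is an assumption and varies by distribution), the same function would show when the "can't fill rx buffer" lines give way to "card reports no resources".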


2001-11-29 15:51:50

by Nathan Poznick

Subject: Re: 2.4.16 freezed up with eepro100 module

Thus spake Sven Heinicke:
>
> The 2.4.16 kernel finally makes my clients happy with memory
> management. The system that froze up is a Dell of some sort or other
> with two 1GHz Pentium IIIs and 4GB of memory. But now I seem to be
> having Ethernet problems with an eepro100 card:

I've encountered the same problem, with the same hardware setup (I
believe it's a Dell 2400, or something like that), on 2.4.14+xfs. For
me, however, it didn't lock up the entire machine; it only seemed to
kill the network. I was able to reboot the machine cleanly once I got
to the console. (See my message from yesterday with the subject 'failed
assertion in tcp.c'.) I, too, am open to suggestions :-)

--
Nathan Poznick <[email protected]>
PGP Key: http://drunkmonkey.org/pgpkey.txt

Curiosity has its own reason for existing.
-- Albert Einstein

2001-11-29 16:15:00

by Sven Heinicke

Subject: Re: 2.4.16 freezed up with eepro100 module

Nathan Poznick writes:
> Thus spake Sven Heinicke:
> >
> > The 2.4.16 kernel finally makes my clients happy with memory
> > management. The system that froze up is a Dell of some sort or other
> > with two 1GHz Pentium IIIs and 4GB of memory. But now I seem to be
> > having Ethernet problems with an eepro100 card:
>
> I've encountered the same problem, with the same hardware setup (I
> believe it's a Dell 2400, or something like that), on 2.4.14+xfs. For
> me, however, it didn't lock up the entire machine; it only seemed to
> kill the network. I was able to reboot the machine cleanly once I got
> to the console. (See my message from yesterday with the subject 'failed
> assertion in tcp.c'.) I, too, am open to suggestions :-)
>

I suspect that I would have been able to reboot it if I had been at work
in the middle of the night. I am unable to try older kernels because I
had memory issues until 2.4.16. The process that was generating so much
eth0 traffic ran for about 3 days before the freeze.

Sven

2001-11-30 04:50:14

by J Sloan

Subject: Re: 2.4.16 freezed up with eepro100 module

Nathan Poznick wrote:

> Thus spake Sven Heinicke:
> >
> > The 2.4.16 kernel finally makes my clients happy with memory
> > management. The system that froze up is a Dell of some sort or other
> > with two 1GHz Pentium IIIs and 4GB of memory. But now I seem to be
> > having Ethernet problems with an eepro100 card:
>
> I've encountered the same problem, with the same hardware setup (I
> believe it's a Dell 2400, or something like that), on 2.4.14+xfs. For
> me, however, it didn't lock up the entire machine; it only seemed to
> kill the network. I was able to reboot the machine cleanly once I got
> to the console. (See my message from yesterday with the subject 'failed
> assertion in tcp.c'.) I, too, am open to suggestions :-)

Similar experience here - the network connectivity
would go away, but the machine was still alive.

Using the e100 driver instead seemed to solve the
problem on the dell servers here.

But I didn't have to reboot - just stopped networking,
unloaded the eepro100 drivers, loaded the e100
drivers and started networking.

cu

jjs
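
The driver swap described above (stop networking, unload eepro100, load e100, start networking) can be written down as an ordered command list. This is only a sketch under assumptions: the message does not say which commands were used, so `ifdown`/`ifup` and the interface name `eth0` (taken from the logs earlier in the thread) are guesses that depend on the distribution, and the e100 module has to be built and installed already. The function names are illustrative. It defaults to a dry run that only prints the plan.

```python
import subprocess


def driver_swap_commands(iface="eth0", old="eepro100", new="e100"):
    """Return the commands for swapping NIC drivers, in order."""
    return [
        ["ifdown", iface],   # stop networking on the interface (assumed command)
        ["rmmod", old],      # unload the old driver module
        ["modprobe", new],   # load the replacement driver module
        ["ifup", iface],     # bring the interface back up (assumed command)
    ]


def swap_driver(dry_run=True, **kwargs):
    """Print the plan, or execute it when dry_run is False (needs root)."""
    commands = driver_swap_commands(**kwargs)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the sequence at the first failing command.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    swap_driver(dry_run=True)
```

Running the real sequence over a session that goes through the same interface would cut that session off partway, so it is safer from the console.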


2001-11-30 05:46:33

by Anuradha Ratnaweera

Subject: Re: 2.4.16 freezed up with eepro100 module

On Thu, Nov 29, 2001 at 08:49:48PM -0800, J Sloan wrote:
> Nathan Poznick wrote:
>
> > Thus spake Sven Heinicke:
> > >
> > > The 2.4.16 kernel finally makes my clients happy with memory
> > > management. The system that froze up is a Dell of some sort or other
> > > with two 1GHz Pentium IIIs and 4GB of memory. But now I seem to be
> > > having Ethernet problems with an eepro100 card:
> >
> > I've encountered the same problem, with the same hardware setup (I
> > believe it's a Dell 2400, or something like that), on 2.4.14+xfs. For
> >
> > [...]
>
> Using the e100 driver instead seemed to solve the
> problem on the dell servers here.

Has anybody got the same issue with non-Dell machines?

I am running 2.4.16 on a Compaq ProLiant ML 370 without problems (the machine
has only been up for 2+ days with the new kernel, though). Traffic is not very
high.

The driver is built into the kernel.

/proc/pci shows

Bus 0, device 2, function 0:
Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 8).
IRQ 5.
Master Capable. Latency=64. Min Gnt=8.Max Lat=56.
Non-prefetchable 32 bit memory at 0xc4fff000 [0xc4ffffff].
I/O at 0x2400 [0x243f].
Non-prefetchable 32 bit memory at 0xc4e00000 [0xc4efffff].
Bus 0, device 5, function 0:
Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (#2) (rev 8).
IRQ 10.
Master Capable. Latency=64. Min Gnt=8.Max Lat=56.
Non-prefetchable 32 bit memory at 0xc4dfd000 [0xc4dfdfff].
I/O at 0x2c00 [0x2c3f].
Non-prefetchable 32 bit memory at 0xc4c00000 [0xc4cfffff].

Regards,

Anuradha

--

Debian GNU/Linux (kernel 2.4.16)

First Law of Bicycling:
No matter which way you ride, it's uphill and against the wind.
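
For comparing /proc/pci listings like the ones posted in this thread, a small parser can pull out the fields that differ between machines. This is a rough sketch: the sample strings are abbreviated copies of the Dell and Compaq entries quoted in the thread, the function and key names are illustrative, and the regular expressions only cover the lines shown here, not every /proc/pci variant.

```python
import re

# Abbreviated entries copied from the thread (Dell first, then Compaq).
DELL = """\
Bus 0, device 4, function 0:
Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 8).
IRQ 16.
Master Capable. Latency=32. Min Gnt=8.Max Lat=56.
"""
COMPAQ = """\
Bus 0, device 2, function 0:
Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 8).
IRQ 5.
Master Capable. Latency=64. Min Gnt=8.Max Lat=56.
"""

HEADER = re.compile(r"Bus\s+(\d+), device\s+(\d+), function\s+(\d+):")
IRQ = re.compile(r"IRQ (\d+)\.")
LATENCY = re.compile(r"Latency=(\d+)\.")


def parse_proc_pci(text):
    """Parse /proc/pci-style text into a list of per-device dicts."""
    devices = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            bus, dev, func = (int(g) for g in header.groups())
            current = {"bus": bus, "device": dev, "function": func}
            devices.append(current)
        elif current is None:
            continue  # ignore anything before the first device header
        elif "description" not in current and ": " in line:
            # First "class: description" line after the header.
            current["class"], current["description"] = line.split(": ", 1)
        elif IRQ.match(line):
            current["irq"] = int(IRQ.match(line).group(1))
        else:
            latency = LATENCY.search(line)
            if latency:
                current["latency"] = int(latency.group(1))
    return devices


if __name__ == "__main__":
    for label, text in (("Dell", DELL), ("Compaq", COMPAQ)):
        for d in parse_proc_pci(text):
            print(label, d["description"], "IRQ", d["irq"], "latency", d["latency"])
```

Both machines report the same 82557 rev 8 chip; the visible differences are the IRQ and the PCI latency timer. Nothing in the thread establishes that either one is related to the hang.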

2001-11-30 05:58:15

by David Rees

Subject: Re: 2.4.16 freezed up with eepro100 module

On Fri, Nov 30, 2001 at 11:45:06AM +0600, Anuradha Ratnaweera wrote:
>
> Has anybody got the same issue with non-Dell machines?
>
> I am running 2.4.16 on a Compaq ProLiant ML 370 without problems (the machine
> has only been up for 2+ days with the new kernel, though). Traffic is not very
> high.

I don't have any non-Dell machines with the eepro100, but I did put one of
our Dells on 2.2.16 35 hours ago with the eepro100 driver. I don't know the
exact model, but it's an older dual 500MHz PIII machine. Traffic is light,
with only approximately 100MB transferred over the network so far.

Is there a workload that can reproduce the hang? If so, I might be able to
do a bit of testing...

I've also got a couple Dell 2400s, but those are still running 2.4.9.
Unfortunately those are production machines, so I don't want to mess with
them right now.

-Dave

2001-11-30 06:07:25

by Ramaraj Pandian

Subject: SBP2 Support for multiple LUNs - Changers ??

I would like to use a FireWire DVD jukebox in Linux with the latest
kernel.
The current SBP2 driver supports only one LUN. The DVD jukebox has three
LUNs (two for the drives and one for the jukebox device itself).
Linux finds only one DVD-ROM drive out of the two drives and the jukebox.

How do I make use of the other LUNs through the SBP2 module?
How can I make it work?

I work on Windows device drivers and am learning Linux now.

Your help will be greatly appreciated.
Thanks
Ramaraj

2001-11-30 14:23:36

by Nathan Poznick

Subject: Re: 2.4.16 freezed up with eepro100 module


(forgot to cc lkml on my reply)

> Has anybody got the same issue with non-Dell machines?

All I have to test with are Dell machines, so I haven't been able to
try.

> I am running 2.4.16 on a Compaq ProLiant ML 370 without problems (the machine
> has only been up for 2+ days with the new kernel, though). Traffic is not very
> high.

The trigger seems to be a combination of high network load, and high
system load. The times it's happened to me, it's been while running
an app that has a couple of hundred threads, uses about a gig and a
half or so of memory, and does pretty heavy disk and network I/O. I'm
still trying to find a job that can reproduce it reliably (or even
semi-reliably), and when I can, I'm going to try a switch over to the
e100 driver as some people have suggested, to see if that stops it
from happening.

--
Nathan <[email protected]>
PGP Key: http://drunkmonkey.org/pgpkey.txt

"Competitiveness: the 8th deadly sin."
--Phantom

2001-11-30 16:05:07

by Sven Heinicke

Subject: Re: 2.4.16 freezed up with eepro100 module


I have eepro100s on other systems and never had a problem. They
have never been made to work as hard as the DELLs, though. I am
trying the same DELL with a 3C996-T 1000Bt card using the driver from
3COM (we plan on moving that system to a 1000Bt system, but the switch
hasn't arrived yet), and it is running at 100Bt with the same
software. If you don't hear from me, assume it survived. It has been up
a day so far; it took the DELL about 3 days of heavy use to crash before.

Sven

> Has anybody got the same issue with non-Dell machines?
>
> I am running 2.4.16 on a Compaq ProLiant ML 370 without problems (the machine
> has only been up for 2+ days with the new kernel, though). Traffic is not very
> high.
>
> The driver is built into the kernel.
>
> /proc/pci shows
>
> Bus 0, device 2, function 0:
> Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 8).
> IRQ 5.
> Master Capable. Latency=64. Min Gnt=8.Max Lat=56.
> Non-prefetchable 32 bit memory at 0xc4fff000 [0xc4ffffff].
> I/O at 0x2400 [0x243f].
> Non-prefetchable 32 bit memory at 0xc4e00000 [0xc4efffff].
> Bus 0, device 5, function 0:
> Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (#2) (rev 8).
> IRQ 10.
> Master Capable. Latency=64. Min Gnt=8.Max Lat=56.
> Non-prefetchable 32 bit memory at 0xc4dfd000 [0xc4dfdfff].
> I/O at 0x2c00 [0x2c3f].
> Non-prefetchable 32 bit memory at 0xc4c00000 [0xc4cfffff].
>
> Regards,
>
> Anuradha
>
> --
>
> Debian GNU/Linux (kernel 2.4.16)
>
> First Law of Bicycling:
> No matter which way you ride, it's uphill and against the wind.
>

2001-11-30 22:32:38

by Nathan Poznick

Subject: Re: 2.4.16 freezed up with eepro100 module

Thus spake Sven Heinicke:
>
> I have eepro100s on other systems and never had a problem. They
> have never been made to work as hard as the DELLs, though. I am
> trying the same DELL with a 3C996-T 1000Bt card using the driver from
> 3COM (we plan on moving that system to a 1000Bt system, but the switch
> hasn't arrived yet), and it is running at 100Bt with the same
> software. If you don't hear from me, assume it survived. It has been up
> a day so far; it took the DELL about 3 days of heavy use to crash before.

Ok, I finally had a chance to work on this, and here's what I know:

1) I found a workload under which I was able to reliably make the
network on the machine die (a few hundred of the "eth0: card reports
no resources." errors showed up which continued until I took down the
network and removed the module). Unfortunately, the workload was with
an in-house app, so all I can describe are the conditions associated
with it: 2 processes with a total of about 600 threads, 1.5GB of
memory, about 500 network connections, and a lot of disk and network
I/O.

2) I switched from using the eepro100 module to using Intel's e100
module, and I was unable to reproduce the problem, even under a
heavier load than before. Haven't seen so much as a peep about eth0
problems in the logs since I switched over.

So for now, I'll be sticking with the e100 driver, since it appears to
have solved my problem (at least for now).

--
Nathan Poznick <[email protected]>
PGP Key: http://drunkmonkey.org/pgpkey.txt

"This is wild, I swear..."
-Tom Servo (as Hercules). #410

2001-12-01 00:17:49

by Mike Fedyk

Subject: Re: 2.4.16 freezed up with eepro100 module

Added Jeff & Andrey to cc list because they were the last two to modify the
driver according to the comments at the top of eepro100.c

On Fri, Nov 30, 2001 at 04:31:31PM -0600, Nathan Poznick wrote:
> Thus spake Sven Heinicke:
> >
> > I have eepro100s on other systems and never had a problem. They
> > have never been made to work as hard as the DELLs, though. I am
> > trying the same DELL with a 3C996-T 1000Bt card using the driver from
> > 3COM (we plan on moving that system to a 1000Bt system, but the switch
> > hasn't arrived yet), and it is running at 100Bt with the same
> > software. If you don't hear from me, assume it survived. It has been up
> > a day so far; it took the DELL about 3 days of heavy use to crash before.
>
> Ok, I finally had a chance to work on this, and here's what I know:
>
> 1) I found a workload under which I was able to reliably make the
> network on the machine die (a few hundred of the "eth0: card reports
> no resources." errors showed up which continued until I took down the
> network and removed the module). Unfortunately, the workload was with
> an in-house app, so all I can describe are the conditions associated
> with it: 2 processes with a total of about 600 threads, 1.5GB of
> memory, about 500 network connections, and a lot of disk and network
> I/O.
>
You can run the test against eepro100 with tcpdump redirected to a log file,
and post that on the web somewhere. That would probably be helpful.

Some sort of profiling would probably help as well.

Jeff, Andrey, can you comment?

2001-12-01 10:10:22

by Andrey Savochkin

Subject: Re: 2.4.16 freezed up with eepro100 module

Hi,

On Fri, Nov 30, 2001 at 04:17:17PM -0800, Mike Fedyk wrote:
>
> On Fri, Nov 30, 2001 at 04:31:31PM -0600, Nathan Poznick wrote:
> > Thus spake Sven Heinicke:
> > >
> > > I have eepro100s on other systems and never had a problem. They
> > > have never been made to work as hard as the DELLs, though. I am
> > > trying the same DELL with a 3C996-T 1000Bt card using the driver from
> > > 3COM (we plan on moving that system to a 1000Bt system, but the switch
> > > hasn't arrived yet), and it is running at 100Bt with the same
> > > software. If you don't hear from me, assume it survived. It has been up
> > > a day so far; it took the DELL about 3 days of heavy use to crash before.
> >
> > Ok, I finally had a chance to work on this, and here's what I know:
> >
> > 1) I found a workload under which I was able to reliably make the
> > network on the machine die (a few hundred of the "eth0: card reports
> > no resources." errors showed up which continued until I took down the
> > network and removed the module). Unfortunately, the workload was with
> > an in-house app, so all I can describe are the conditions associated
> > > with it: 2 processes with a total of about 600 threads, 1.5GB of
> > memory, about 500 network connections, and a lot of disk and network
> > I/O.

Do you see "can't fill rx buffer" messages?
If so, then your load is too big, and memory management is incapable of
freeing memory in time.
Right now the kernel doesn't allow you to increase the atomic allocation
reservation (which is a serious misfeature), so you need to hack the
kernel and change the reservation there.

If the network doesn't come alive when you remove the load, it's a second
problem, a bug in the driver. I've seen such reports, but they aren't
frequent. On my computer, the driver resumes operation well.
Why the driver can't do so for some people needs deep investigation.

> >
> You can run the test against eepro100 with tcpdump redirected to a log file,
> and post that on the web somewhere. That would probably be helpful.

tcpdumps won't help.

Andrey
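
The first problem described above, buffer refills failing because memory cannot be freed in time, can be illustrated with a toy model of a receive ring. This is not code from eepro100.c and does not model the second problem (the driver failing to resume). The ring size, the one-packet-per-step arrival pattern, and the `alloc_success_rate` parameter are all hypothetical. The model assumes that a failed refill corresponds to a "can't fill rx buffer" message and that a packet arriving at an empty ring corresponds to "card reports no resources".

```python
import random


def simulate_rx_ring(ring_size, steps, alloc_success_rate, seed=0):
    """Toy model: one packet arrives per step, then the driver refills."""
    rng = random.Random(seed)
    free_buffers = ring_size  # receive buffers currently posted to the card
    stats = {"received": 0, "refill_failures": 0, "no_resources": 0}
    for _ in range(steps):
        # Packet arrival: needs one posted buffer, otherwise it is lost.
        if free_buffers == 0:
            stats["no_resources"] += 1
        else:
            free_buffers -= 1
            stats["received"] += 1
        # Refill: keep allocating until the ring is full or one attempt fails.
        while free_buffers < ring_size:
            if rng.random() < alloc_success_rate:
                free_buffers += 1
            else:
                stats["refill_failures"] += 1
                break
    return stats


if __name__ == "__main__":
    for rate in (1.0, 0.9, 0.5, 0.0):
        print(rate, simulate_rx_ring(ring_size=32, steps=10000,
                                     alloc_success_rate=rate))
```

With every allocation succeeding the ring never drains. With none succeeding, the first `ring_size` packets are received and every later packet finds the ring empty, which matches the log pattern in this thread where refill failures come first and "no resources" follows.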

2001-12-04 01:45:52

by Nathan Poznick

Subject: Re: 2.4.16 freezed up with eepro100 module

Thus spake Andrey Savochkin:

> Do you see "can't fill rx buffer" messages?
> If so, then your load is too big, and memory management is incapable of
> freeing memory in time.
> Right now the kernel doesn't allow you to increase the atomic allocation
> reservation (which is a serious misfeature), so you need to hack the
> kernel and change the reservation there.

Yes, I saw a combination of the "can't fill rx buffer" messages and
"card reports no resources" messages, and after a while it went to
just a whole bunch (few hundred) of the "card reports no resources"
messages, which continued to scroll across the console at the rate of
one every second or so until I took down networking and removed the
eepro100 module.

> If the network doesn't come alive when you remove the load, it's a second
> problem, a bug in the driver. I've seen such reports, but they aren't
> frequent. On my computer, the driver resumes operation well.
> Why the driver can't do so for some people needs deep investigation.

After I removed the load, I gave it about 10 minutes or so to see if
it would pick back up, but it didn't.

--
Nathan Poznick <[email protected]>
PGP Key: http://drunkmonkey.org/pgpkey.txt

"I think everyone ought to come in and have a hot cup of cocoa and
come inside and be nice and snuggly."
-Crow (as Dr. Herly). #201