LinuxLists.cc - 2.4.4: Kernel crash, possibly tcp related

2001-04-29 14:29:46

Subject: 2.4.4: Kernel crash, possibly tcp related

Greetings,

A possibly tcp-related bug causing a kernel crash, possible to trigger
from an unprivileged user.

Kernel 2.4.4, no patches applied.

The problem appeared when performing some network-performance tests with a
program called tcpblast. tcpblast has an option to set its "block size".
The block size is the size of the buffer passed to the write function.
The problem appears when this value is set to 40481 or higher. For ex:
$ tcpblast -d0 -s 40481 another_host 9000
With this block size the following message spammed:
tcp/udpblast send:: No such file or directory
Trying the same command with a 2.2.18 kernel gave:
tcp/udpblast send:: Bad address
The first part is from tcpblast, the second is printed via perror.
Well, if the machine then has "some" other work running a kernel
crash occurs (note that this only applies to 2.4.4, 2.2.18 didn't
seem to have the problem):

KERNEL: assertion (!skb_queue_empty(&sk->write_queue)) failed at tcp_timer.c(327):
tcp_retransmit_timer
Unable to handle kernel NULL pointer dereference...
.
.
.
Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing

Then the machine is completely locked up, no vt-changing or ctrl->scroll_lock etc
works.

The most efficient way I found to produce "some load" to trigger the bug while running
tcpblast was to use a simple forkbomb:
int main() { while(1) fork(); }

If you need more information, just ask.

regards,
/Ralf Nyr?n

System information:

cat /proc/version
Linux version 2.4.4 (plumbum@client2) (gcc version 2.95.2 20000220 (Debian GNU/Linux))
#4 Sat Apr 28 15:47:17 CEST 2001

cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 3
model name : Pentium II (Klamath)
stepping : 4
cpu MHz : 232.349
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov mmx
bogomips : 463.66

cat /proc/modules
vfat 8688 0 (unused)
fat 30272 0 [vfat]

cat /proc/ioports
0000-001f : dma1
0020-003f : pic1
0040-005f : timer
0060-006f : keyboard
0070-007f : rtc
0080-008f : dma page reg
00a0-00bf : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : ide1
01f0-01f7 : ide0
02f8-02ff : serial(auto)
0376-0376 : ide1
03c0-03df : vga+
03f6-03f6 : ide0
03f8-03ff : serial(auto)
0cf8-0cff : PCI conf1
4000-403f : Intel Corporation 82371AB PIIX4 ACPI
5000-501f : Intel Corporation 82371AB PIIX4 ACPI
6400-641f : Intel Corporation 82371AB PIIX4 USB
6800-687f : VIA Technologies, Inc. VT86C100A [Rhine 10/100]
6800-687f : via-rhine
e000-efff : PCI Bus #01
e000-e0ff : ATI Technologies Inc 3D Rage LT Pro AGP-133
f000-f00f : Intel Corporation 82371AB PIIX4 IDE
f000-f007 : ide0
f008-f00f : ide1

cat /proc/iomem
00000000-0009fbff : System RAM
0009fc00-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000f0000-000fffff : System ROM
00100000-03ffffff : System RAM
00100000-001d160b : Kernel code
001d160c-0021a957 : Kernel data
a8000000-afffffff : PCI Bus #01
d8000000-dfffffff : PCI Bus #01
d8000000-d8ffffff : ATI Technologies Inc 3D Rage LT Pro AGP-133
d9000000-d9000fff : ATI Technologies Inc 3D Rage LT Pro AGP-133
e0000000-e3ffffff : Intel Corporation 440LX/EX - 82443LX/EX Host bridge
e4000000-e4ffffff : 3Dfx Interactive, Inc. Voodoo 2
e5000000-e500007f : VIA Technologies, Inc. VT86C100A [Rhine 10/100]
e5000000-e500007f : via-rhine
ffff0000-ffffffff : reserved

2001-04-30 05:11:28

by David Miller

[permalink] [raw]

Subject: Re: 2.4.4: Kernel crash, possibly tcp related

Ralf Nyren writes:
> The problem appears when this value is set to 40481 or higher. For ex:
> $ tcpblast -d0 -s 40481 another_host 9000
...
> KERNEL: assertion (!skb_queue_empty(&sk->write_queue)) failed at tcp_timer.c(327):
> tcp_retransmit_timer
> Unable to handle kernel NULL pointer dereference...

I'm having a devil of a time finding the tcpblast sources on the
net, can you point me to where I can get them? The one reference
I saw to get the original sources was:

ftp://ftp.xlink.net/pub/network/tcpblast.shar.gz

But even that directory no longer exists.

The kernel error you see is a gross fatal error, the TCP retransmit
timer has fired yet there are no packets on the transmit queue :-)

My current theory is that tcpblast does something erratic when the
error occurs.

Later,
David S. Miller
[email protected]

2001-04-30 06:42:41

by J Sloan

[permalink] [raw]

Subject: Re: 2.4.4: Kernel crash, possibly tcp related

"David S. Miller" schrieb:

> I'm having a devil of a time finding the tcpblast sources on the
> net, can you point me to where I can get them? The one reference
> I saw to get the original sources was:
>
> ftp://ftp.xlink.net/pub/network/tcpblast.shar.gz
>
> But even that directory no longer exists.

Try ftp://wintermute.toyota.com/pub/utils/tcpblast.tar

cu

jjs

2001-04-30 06:58:44

by David Miller

[permalink] [raw]

Subject: Re: 2.4.4: Kernel crash, possibly tcp related

Andrew Morton writes:
> "David S. Miller" wrote:
> >
> > I'm having a devil of a time finding the tcpblast sources on the
> > net, can you point me to where I can get them?
>
> I seem to have a copy.
>
> http://www.zip.com.au/~akpm/tcpblast-19990504.tar.gz

Thanks to everyone who pointed me at this and the debian copy :-)

Anyways, I just tried to reproduce Ralf's problem on two of my
machines. One was an SMP sparc64 system, and the other was my
uniprocessor Athlon.

What kind of machine are you reproducing this on Ralf? I'm not
even getting the very strange errors from tcpblast on the command
line, it is functioning perfectly fine and sending a stream of
data to the other machine. Are you doing something weird like
making the remote machine the local machine in your tcpblast run?

Later,
David S. Miller
[email protected]

2001-04-30 14:41:51

by Ralf Nyren

[permalink] [raw]

Subject: Re: 2.4.4: Kernel crash, possibly tcp related

On Sun, 29 Apr 2001, David S. Miller wrote:

[snip]
>
> Anyways, I just tried to reproduce Ralf's problem on two of my
> machines. One was an SMP sparc64 system, and the other was my
> uniprocessor Athlon.
>
> What kind of machine are you reproducing this on Ralf? I'm not
> even getting the very strange errors from tcpblast on the command
> line, it is functioning perfectly fine and sending a stream of
> data to the other machine. Are you doing something weird like
> making the remote machine the local machine in your tcpblast run?
>
> Later,
> David S. Miller
> [email protected]
>

Sorry for not including a reference to the software. I used the
tcpblast program from Debian (unstable). It can be found in the
netdiag package:
http://ftp.debian.org/debian/dists/woody/main/source/net/netdiag_0.7.orig.tar.gz

Since this problem seemed a bit hard to reproduce I tested it on another
machine too. It needed some more load, but eventually crashed.
This machine is a PII 400MHz, 128MB, 440BX/ZX, PIIX. 3c905B network card.
For more information like .config, System.map, ver_linux etc see:
http://www.educ.umu.se/~plumbum/kernel/panic_2.4.4_20010430/

Regarding the strange error msg: tcp/udpblast send:: No such file or directory
both the precompiled binary and one compiled from the source produced
this message. Although I noticed that the min blocksize triggering the message
changed from 40481 to 39841. Probably some compiletime feature :)

Making remote machine the local machine... no, I send from my machine
to another. Both with 100Mbps network connections.

Reproduction procedure:
./tcpblast -d0 -s 200000 _another_host_ 9000
./forkbomb
wait...

The so called "forkbomb" shouldn't really be necessary, some heavy load
making use of scheduler, memory and swap seems to do the thing.

Hope this information could be helpful.

regards,
/Ralf

2001-04-30 16:49:01

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: 2.4.4: Kernel crash, possibly tcp related

On Sun, Apr 29, 2001 at 11:58:20PM -0700, David S. Miller wrote:
>
> Andrew Morton writes:
> > "David S. Miller" wrote:
> > >
> > > I'm having a devil of a time finding the tcpblast sources on the
> > > net, can you point me to where I can get them?
> >
> > I seem to have a copy.
> >
> > http://www.zip.com.au/~akpm/tcpblast-19990504.tar.gz
>
> Thanks to everyone who pointed me at this and the debian copy :-)
>
> Anyways, I just tried to reproduce Ralf's problem on two of my
> machines. One was an SMP sparc64 system, and the other was my
> uniprocessor Athlon.
>
> What kind of machine are you reproducing this on Ralf? I'm not

JFYI: I reproduced too on my UP athlon. I run:

tcpblast -d0 -s 40481 another_host 9000

two times and after the second it locked hard. I didn't had any fork
bomb at the same time but there was an high computing load in the
background.

the nic is:

Ethernet controller: Advanced Micro Devices [AMD] 79c970 [PCnet LANCE] (rev 36)

Andrea

2001-04-30 17:01:12

by Alexey Kuznetsov

[permalink] [raw]

Subject: Re: 2.4.4: Kernel crash, possibly tcp related

Hello!

> My current theory is that tcpblast does something erratic when the
> error occurs.

It has buffer size of 32K, so that it faults at enough large chunk sizes.

Erratic errno is because this applet prints errno on partial write.

Oops is apparently because I did something wrong in do_fault yet.
Seems, you were right telling that this place looks dubious. 8)

Alexey

2001-04-30 17:22:32

by Ingo Oeser

[permalink] [raw]

Subject: Re: 2.4.4: Kernel crash, possibly tcp related

On Mon, Apr 30, 2001 at 06:46:33PM +0200, Andrea Arcangeli wrote:
> On Sun, Apr 29, 2001 at 11:58:20PM -0700, David S. Miller wrote:
> > Andrew Morton writes:
> > > "David S. Miller" wrote:
> > Anyways, I just tried to reproduce Ralf's problem on two of my
> > machines. One was an SMP sparc64 system, and the other was my
> > uniprocessor Athlon.
> >
> > What kind of machine are you reproducing this on Ralf? I'm not
>
> JFYI: I reproduced too on my UP athlon. I run:
>
> tcpblast -d0 -s 40481 another_host 9000
>
> two times and after the second it locked hard. I didn't had any fork
> bomb at the same time but there was an high computing load in the
> background.

I tried sth. else with 2.4.3-ac13, which could be related:

Machine: 1GB RAM, Dual PIII, ServerWorks LE chipset (Asus CUR-DLS board).
NIC: [Ethernet Pro 100] (rev 08) (driven by eepro100)

0. Run several kernel compiles and the like to fill up caches.
1. copy an complete iso image into /tmp (which is tmpfs)
2. ftp that over 100Mbit network to an machine.

I got a lot of spikes and a message "mm: critical shortage of
bounce buffers", while doing 1.

And I get a LOT of that messages, while doing 2. But I have a lot
of memory in pagecache and only 100MB allocated for other
processes. And I still have swap free (I have 2GB of swap as
recommended).

So could we please check, double check and triple check the
allocations in the net layer?

Another machine of mine needs now 128MB with the new kernel and
will lock up hard otherwise on full saturated 100Mbit network
load[1] with TCP, but needed only 32MB before. sth. has to be
wrong here...

More info on request.

I have both machines at hand and they are both ready for testing
as long, as my file systems stay repairable by fsck.ext2 ;-)

Both machines are not running X, frame buffers and no fancy multi
media stuff.

Regards

Ingo Oeser

[1] Tested cards: RTL 8139, Intel Etherexpress Pro 100, 3com
3c509TX, so I guess it's NOT the NIC ;-)
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< been there and had much fun >>>>>>>>>>>>