2000-11-25 23:10:06

by Clayton Weaver

Subject: reproducible 2.2.1x nethangs

kernel versions 2.2.17 and 2.2.18-pre23 (same behavior)
monolithic kernel
i21143 tulip card (may or may not be significant, stock kernel driver)
egcs-1.1.2, glibc-2.1.3, binutils-2.9.1.0.25

I can reliably hang either 2.2.17 or 2.2.18-pre23 (same way, same
circumstances) with httpd over eth0. It is not a particularly exotic
kernel config (ethernet, tulip, dummy, ppp, aha154x, scsi hd/cd/tape,
pio ide, i486, generic pci driving an sis496 pci 2.0 bus, no pci bridge
optimization, firewall enabled, no masq, no proxy, no adv routing). All of
this hardware and the network are stable on 2.0.38 (ie the tcp/ip over
ethernet hang never happens there). It happens without any ipchains
rules installed (the support is there, but it's not configured).

It doesn't seem to happen with ftp (although I may simply not have
pushed it hard enough). It can handle hundreds of MB in a single ftp
session without falling over, but a rapid sequence of httpd requests will
knock it over every time.
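
A minimal sketch of the kind of load described above: many short-lived
http connections opened back to back from another machine on the LAN. The
function name hammer(), the 10-second timeout and the HTTP/1.0 request line
are illustrative assumptions, not details from the original test.

```python
import socket


def hammer(host, port, count, path="/"):
    """Open up to `count` short-lived HTTP/1.0 connections, one after another.

    Returns how many requests received a response before the first
    failure, so a stall shows up as a return value smaller than `count`.
    """
    request = f"GET {path} HTTP/1.0\r\n\r\n".encode("ascii")
    completed = 0
    for _ in range(count):
        received = 0
        try:
            # One connect, one request, read until the server closes.
            with socket.create_connection((host, port), timeout=10) as sock:
                sock.sendall(request)
                while True:
                    chunk = sock.recv(4096)
                    if not chunk:
                        break
                    received += len(chunk)
        except OSError:
            break  # timeout, refusal or reset: stop and report progress
        if not received:
            break  # connection closed without sending any data
        completed += 1
    return completed
```

Calling hammer() with the server's address and a count of 50 or so would
approximate the "rapid sequence of httpd requests"; the connection number
at which it stops is the figure to compare between kernel versions.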

Minor points of evidence:

* on one test, "strace -ff ..." showed the second argument to accept()
scribbled over (6-7 lines of "^@^@..." in the child) about three forks
before it deadlocked. I saw the same thing at the bottom of the
httpd server's log after an earlier hang.

* It doesn't simply stop; it suddenly gets really slow on the connect
where it is going to hang. The last html page downloaded on one test
ended up as a partial document, so it sometimes starts the data
transfer but can't complete it (the kernel/network stack is already
on its way to the twilight zone when the download starts, and it only
manages to squeeze out a few packets before it gets there).

* When it happens, it takes the keyboard with it, and you can't ping it.

* It's not the hd filesystems. Using lynx on the same host where the
httpd server is running, I can browse the same html files that it hangs
on over eth0 for hours without reproducing the kernel hang. I can move
GBs of data around on those filesystems without errors and without
filesystem corruption.

Since the error can be reproduced so reliably, it should be possible to
debug it, if I know where/how to enable verbose logging.

Suggestions?

Regards,

Clayton Weaver
<mailto:[email protected]>
(Seattle)

"Everybody's ignorant, just in different subjects." Will Rogers




2000-11-27 10:08:51

by Clayton Weaver

Subject: Re: reproducible 2.2.1x nethangs

(from layer above device driver, imho)

The fact that the http hang does not happen when connecting to
the httpd server from an http client running on the same host
as the server implicates the ethernet interface, but I would be
shocked to find that the cause is a bug in specifically the tulip driver
driving a real tulip card.

You can hang it with http with a tiny fraction of the packets that
are transferred during an ftp session that doesn't bother it at all, so
the device driver streams packets just fine. The http hang has many
more individual connects and forks and connection shutdowns, however, so I
would guess that somewhere in the interface between tcp/ip stack and
device driver bottom half calls is where the bug hits.

I doubt that it matters at all which ethernet device driver it is
exactly, other than perhaps different latencies affecting the timing on
interrupt races (ie any card with the same average latency as an i21143
tulip card will probably see the same problem in the same kernel
versions).

So, what code is different between a socket connection from a listening
daemon to a pci ethernet device driver and a socket connection from the
same listening daemon to a client connected via localhost? There is
a race or other bug in the first that isn't in the second, and it is
a race/bug that is not in 2.0.38. I can't knock 2.0.38 over at all with
http over the same ethernet lan from the same client (but 2.0.38
doesn't work with ipchains and doesn't have the dentry cache, vm
improvements, etc, so this is worth fixing).

(Note: i486, no SMP)

--

Regards,

Clayton Weaver
<mailto:[email protected]>
(Seattle)

"Everybody's ignorant, just in different subjects." Will Rogers



2000-11-28 19:50:53

by Clayton Weaver

Subject: Re: reproducible 2.2.1x nethangs

I retract the comment about "accept() 2nd argument scribbled over
in the child". That was a misinterpretation of the strace log.
strace shows the struct sockaddr * scribble in the parent after a restart
of the accept() call. Also, the firewalling code is eliminated from
consideration. I compiled it and ppp out, and the only difference was that
the hang happened sooner, in about a dozen http connects instead of
around 30.

What happens, according to strace, is that accept() gets interrupted when
the SIGCHLD signal is delivered after the child that handled the previous
connect exits, and strace shows accept() ending with ERESTARTSYS.

For the first N connects, this is not problematic. accept() is called
again, it returns normally, the fork() happens, everything is
copacetic. But eventually the restarted accept() call shows NUL bytes in
the struct sockaddr *addr arg value in the strace output after
restarting accept, and a few connects later the kernel hangs.
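
The pattern described above (a parent blocked in accept() while SIGCHLD
arrives from the child that served the previous connect) can be sketched
as a minimal fork-per-connection server. This is an illustration only, not
the httpd used in the tests; the names serve() and reap_children() and the
canned response are assumptions. In Python 3.5 and later the interpreter
retries the interrupted accept() once the handler returns, which stands in
here for the restart that strace reports.

```python
import os
import signal
import socket


def reap_children(signum, frame):
    """SIGCHLD handler: collect every exited child without blocking."""
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left at all
        if pid == 0:
            return  # children exist, but none has exited yet


def serve(listener, max_connections):
    """Accept `max_connections` clients, forking one child per connect."""
    signal.signal(signal.SIGCHLD, reap_children)
    handled = 0
    while handled < max_connections:
        # SIGCHLD from the previous child may arrive while blocked here;
        # the handler runs and accept() is then retried.
        conn, _peer = listener.accept()
        pid = os.fork()
        if pid == 0:
            # Child: answer this one connection and exit, raising SIGCHLD
            # in the parent. (A real server would read the request first.)
            listener.close()
            conn.sendall(b"HTTP/1.0 200 OK\r\n\r\nhello\n")
            conn.close()
            os._exit(0)
        conn.close()  # parent keeps only the listening socket
        handled += 1
    return handled
```

Running a server of this shape under "strace -ff" while a remote client
makes rapid connects is one way to check whether the interrupted accept()
alone is enough to trigger the problem, independent of any particular
httpd.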

If the socket itself is truly ozoned when strace says the accept() struct
sockaddr * argument has changed in between interrupt and restart, the
subsequent forked child shouldn't be able to send at all, but that is not
what happens. That html document gets there.

Yet the kernel always hangs within a few connects of seeing that, so I'd
be suspicious of the internal kernel data associated with that socket
when it comes time to deallocate it. The accept() code in the context of
interrupts that cause it to be stopped and restarted is at least worth a
look for possible races.

Regards,

Clayton Weaver
<mailto:[email protected]>
(Seattle)

"Everybody's ignorant, just in different subjects." Will Rogers