2000-11-04 02:33:38

by Thomas Pollinger

[permalink] [raw]
Subject: Re: [BUG REPORT] TCP/IP weirdness in 2.2.15

Hi Stephen

I've been experiencing similar problems with what would seem at first a
completely unrelated problem. To make things short:

I have the following configuration:
- CVS server (vers. 1.10.4) running on HPUX B.11.00
- different CVS clients (running different cvs client versions: 1.10.4/7/8):
- Win NT (PIII, 512MB, 3c905 10/100)
- Debian 2.2.17 (PII, 256MB, 3c905, main client tester)
- Mandrake 2.2.x (3c905)
- RedHat 6.2 (Laptop, different Ethernet card)
- SuSE 2.2.10 (PII, 256MB, 3c905)

Running a 'cvs get' on the Linux clients of a larger source tree eventually
hangs the client in the middle of the get process. The hang is *always*
reproduceable (however it does not always hang at the same place, sometimes
after 1', sometimes after 5' to 10'). Several runs on Win NT did not show
this problem.

Running a strace on the client and server reveals that the client gets
stuck in a read(3, resp. recv(3, call (a recompiled the cvs client to not
use file descriptors for an additional test) and blocks forever (I left it
once during one night). The server tries first to send to the client, but
gets an EWOULDBLOCK error back. The server tries to send to the client for
about 5' during which no packets are accepted by the client. tcpdump
revealed that after the hang, sometimes the client did not accept any
packets (ack 0), sometimes it replied . ack 7678261 back (this will go
forever if the client won't be killed). The server replies: P
7678261:76784009 (1448) ack 612 win 32768 ... indefinitely.

To keep the mail short, I can run a test and send all related logs if needed.

I first thought that I got not a recent network driver and compiled a more
recent one (v0.99H 12Jun00 was old, new: 16Aug00 <2.2.17-pre17>) - but the
behaviour was the same.

At last, I tried to do the same on another HP-UX box and there was no
blocking at all.
What is interesting to know is that between the HP-UX server and the HP
client is a Linux router with a 3c905 card, 2.2.14 kernel.

Regards,
Pollinger Thomas


--------------------------------------------------------------------------------------------------------------
Below the original mail from Stephen (as it has been a while ago this
message was posted):

Hi Folks,
I'm seeing some very strange behaviour here with TCP/IP connections
(or at least, it looks strange to me) - under heavy load, it appears
connections are spuriously dropped.
Bear with me while I give a little background...
...I'm actually trying to perform some SPECweb99 benchmarking; I have
a test bed of 5 clients and 1 server. For those of you who are not
familiar with SPECweb99, let's just say that it tries to simulate
'real world' traffic - a mixture of HTTP GET, POST, static and dynamic
GET, keepalive connections, non-keepalive connections... basically
hammer the server as hard as you can until it grinds to a halt. Each
client has one controlling (parent) process which forks off N children
to generate the HTTP load. One client is a designated 'prime client'
which synchronises the rest of the clients. If you're interested in
more SPEC info, there's a very informative whitepaper on their
website, have a look at http://www.spec.org/osg/web99
The "dropped connection" manifests itself by the SPEC run 'hanging' -
at the end of the test there is one or more child processes on one or
more clients stuck in a read(2), trying to read from a
socket. (usually, one child process on one client).
If I do a netstat on the client I see an ESTABLISHED connection:
[stephenl@spec1 stephenl]$ netstat -t
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
[...]
tcp 0 0 spec1.specweb.zeus:6426 spec6.specweb.zeus.:www ESTABLISHED
[...]
[stephenl@spec1 stephenl]$
(spec1 is the client, spec6 is the server. As it happens, spec1 is
also my prime client - this is just a coincidence in this particular
lockup)
If I hop over to the server and do a netstat there, I
see... nothing. No ESTABLISHED connection, no TIME_WAIT, no
CLOSE_WAIT. Nothing.
Question: is this valid? (I think it may be, if the server
crashed. Which it didn't - all machines have enjoyed a nice long
uptime :-)
I've done more digging. I captured tcpdumps of the entire SPEC run, on
all machines. I added lots of extra debugging logs to the client code,
I ran 'netstat -tn' every 10 seconds on all machines for the duration
of the test. Oh yes, and the webserver logs.
Here is what I've found...
Debugging logfile
-----------------
The client creates a socket, binds a port to it and connects to the
webserver. It write()s a HTTP GET request, and read()s a file back. So
far so good. Without closing the socket, another request is sent and
another file is received (on keepalives, there are a random number of
HTTP requests performed - on average, 10). Now here is where it gets
interesting... still without closing the socket, write() another HTTP
request and go to read() the file... the read() sits there and this
child is effectively dead. I've observed it to hang on anywhere from
the 2nd request to the 7th or 8th request.
tcpdump logs
------------
This matches perfectly what I've just detailed above... I see all the
previous requests on this port, I see the server send the previous
file, I see the client ACK this, I see the client send the final,
fatal GET... I see the server ACK this. I see this on both the client
AND server tcpdump logfiles - they both perfectly match. After the
final, fatal ACK I see... nothing. The client never sends any more
packets, the server never sends any more packets. Specifically I see
no FIN from either end.
webserver logs
--------------
I see all requests from the offending client up to and including the
previous request. The offending GET never makes it to the server logs.
netstat logs
------------
Running a 'netstat -tn' every 10 seconds on the client, I see the
connection spring into existence somewhere through the test (sometimes
near the beginning of the test, sometimes near the end. On average
about halfway through, on a 10-minute run). Once the connection has
been made, it is reported as 'ESTABLISHED' until I terminate the stuck
process... Running a 'netstat -tn' every 10 seconds on the server, I
see the connection spring into existence at the same time; I see the
same connection 10 seconds later; 20 seconds later it has vanished. No
CLOSE, no CLOSE_WAIT. Not, even, a TIME_WAIT. According to netstat,
the connection on the server simply vanished. I'm *sure* this is just
plain wrong - but it's what I see.
misc logs
---------
No error messages in the webserver log; nothing in 'dmesg', nothing in
/var/spool/*
---

I've searched the archives best as I can but couldn't find anything
relevant (actually I've been loosely following linux-kernel for the
past 2 years, and I would expect to remember if I'd seen this before,
but I'm sure better minds can advise!)

Please help me folks, I've spent hours tracking it down this far, and
it's starting to wear me down :-/

I'd really appreciate info on how to go about debugging from
here... or a pointer as to what I'm doing wrong (am I doing something
wrong?)

---

I'm deliberately not going to include .config files etc here - this
mail is getting long enough as it is. Of course all information is
available - just ask. But I will give a concise system description:

All clients are identical, Celeron 500's with IDE subsystem and twin
EEPRO10/100 NICs (actually I'm only using 1 NIC for my tests at
present).

The server is basically the same as the client, except it has a PIII
700.

All systems were originally installed with RedHat 6.0, the only thing
that has changed is now ALL machines have linux kernel 2.2.15 (and I
hoped that would fix my problems :-)

Webserver software is Zeus 3.3.6.

The server has had /proc/sys/fs/{file,inode}-max raised to cope with
the big fileset (and the PAM limits raised)

That's all I can think of for now, so I'll stop typing here. Any
feedback would be deeply appreciated,

Best regards,
Stephen


2000-11-06 09:39:15

by Stephen Landamore

[permalink] [raw]
Subject: Re: [BUG REPORT] TCP/IP weirdness in 2.2.15

Hi Thomas,

On Fri, 3 Nov 2000, Thomas Pollinger wrote:
> Running a 'cvs get' on the Linux clients of a larger source tree
> eventually hangs the client in the middle of the get process. The
> hang is *always* reproduceable (however it does not always hang at
> the same place, sometimes after 1', sometimes after 5' to
> 10'). Several runs on Win NT did not show this problem.
[...]
> At last, I tried to do the same on another HP-UX box and there was
> no blocking at all.
>
> What is interesting to know is that between the HP-UX server and the
> HP client is a Linux router with a 3c905 card, 2.2.14 kernel.

Let me see if I understand this correct:

Client Router Server
------------------------------------
Linux Linux HPUX Bad
WinNT Linux HPUX Good
HPUX Linux HPUX Good

With all Linux boxes version 2.2.something

Is this correct?

This is slightly different from my situation; I had a Linux client and
Linux server directly connected to the same switch (3Com Superstack
II, FWIW)

I found the client always hung in my test; this was only visible at
the end of a SPEC run. Analysing tcpdump etc showed that in fact it
died somewhere early in the test, almost always before 10 minutes
worth of run.

In the end I 'fixed' the problem by using a HPUX server :-/ ... I
observe that there _are_ some SPECweb99 submissions with Linux (the
Tux result), I'm curious to know if the people who actually did the
test run experienced any problems...

Perhaps if I get time I'll repeat the experiment with (a) the most
recent Alan pre-2.2 kernel and (b) the most recent 2.4-test kernel...

I'll re-iterate my original request, which was not "it's broke - can
you fix it" but was "okay, how do I go about tracking this one down?"

cheers,
stephen

--
Stephen Landamore, <[email protected]> Zeus Technology
Tel: +44 1223 525000 Universally Serving the Net
Fax: +44 1223 525100 http://www.zeus.com
Zeus Technology, Zeus House, Cowley Road, Cambridge, CB4 0ZT, ENGLAND