2001-12-24 17:01:58

by Jan-Benedict Glaw

Subject: Data sitting and remaining in Send-Q

Hi!

I've got some problem with a freshly installed Debian sid system.
It's running with 2.4.16, 2.4.17-rc2 and 2.4.17 (the problem
appears on all these kernels) and something seems to break ssh.

When ssh'ing to this box (only this box, regardless of which client),
the connection breaks if I request more than a few dozen bytes
at a time: it will break at 'ls -l' with more than 10 files,
'cat /etc/passwd' will break, and calling 'vi' will also break because
it redraws the whole screen.

When strace'ing ssh client and server, I can see that both of them
are in a select() loop. On the broken server, netstat shows some
(kilo)bytes of data remaining in the Send-Q. However, this data
is actually *never* sent over the wire, so the connection dies.
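
A minimal sketch of how the stuck queue could be watched on the server,
assuming the usual Linux 'netstat -tn' column layout (Proto, Recv-Q,
Send-Q, Local Address, Foreign Address, State). The function name
stuck_sendq is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Print Send-Q, local and remote address for every TCP socket whose
# Send-Q (3rd column of `netstat -tn`) is non-zero.
# Reads netstat-style lines on stdin, so saved output works as well.
stuck_sendq() {
    awk '$1 ~ /^tcp/ && $3 > 0 { print $3, $4, $5 }'
}

# Possible use on the server, repeated every few seconds:
#   netstat -tn | stuck_sendq
```

If the same connection keeps showing the same non-zero Send-Q over
several runs, the peer is not acknowledging that data.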

Can anybody give me some hint on how to solve this?

Merry Christmas, JBG

--
Jan-Benedict Glaw . [email protected] . +49-172-7608481
-- New APT-Proxy written in shell script --
http://lug-owl.de/~jbglaw/software/ap2/


2001-12-24 18:11:02

by Jose Luis Domingo Lopez

Subject: Re: Data sitting and remaining in Send-Q

On Monday, 24 December 2001, at 18:01:42 +0100,
Jan-Benedict Glaw wrote:

> I've got some problem with a freshly installed Debian sid system.
> It's running with 2.4.16, 2.4.17-rc2 and 2.4.17 (the problem
> appears on all these kernels) and something seems to break ssh.
>
I don't know if this has something to do with your problem, but
bugs.debian.org has a _long_ list of reported bugs for ssh, many of them
with respect to ssh's X-forwarding.

My own experience with Debian's ssh is that, sooner or later,
X-forwarding fails, with Send-Q (or Recv-Q) in the server side
completely full. The server side was Debian Sid, and client side was
Debian Woody, and it happened with both a simple xclock and gkrellm (ssh
remoteserver xclock, ssh remoteserver gkrellm).

However, interactive shells didn't seem to show this problem.

--
José Luis Domingo López
Linux Registered User #189436 Debian Linux Woody (P166 64 MB RAM)

jdomingo AT internautas DOT org => Spam at your own risk

2001-12-24 19:00:45

by Jan-Benedict Glaw

Subject: Re: Data sitting and remaining in Send-Q

On Mon, 2001-12-24 19:10:32 +0100, José Luis Domingo López <[email protected]>
wrote in message <20011224181031.GA7934@localhost>:
> On Monday, 24 December 2001, at 18:01:42 +0100,
> Jan-Benedict Glaw wrote:
>
> > I've got some problem with a freshly installed Debian sid system.
> > It's running with 2.4.16, 2.4.17-rc2 and 2.4.17 (the problem
> > appears on all these kernels) and something seems to break ssh.
> >
> I don't know if this has something to do with your problem, but
> bugs.debian.org has a _long_ list of reported bugs for ssh, many of them
> with respect to ssh's X-forwarding.

Yes, I know, and it's not only connected to X forwarding, but also
(this is the majority of filed bugs) to ssh's exit behaviour
when any processes were started in the background. However, I've
got this problem with the running, interactive session. If I make
the server send more than maybe 200 bytes or so, the session
will hang, with both sides sitting in select(), and data in the
server's Send-Q...

> My own experience with Debian's ssh is that, sooner or later,
> X-forwarding fails, with Send-Q (or Recv-Q) in the server side
> completely full. The server side was Debian Sid, and client side was
> Debian Woody, and it happened with both a simple xclock and gkrellm (ssh
> remoteserver xclock, ssh remoteserver gkrellm).

Well, my understanding is that, if there's data in any of the queues,
these bytes should be delivered. In this case, data is *not* sent
over the wire. Is this a kernel bug? ...or is data only transmitted
if we're in position to also set the PUSH bit?

> However, interactive shells didn't seem to show this problem.

Mine does :-( And this is quite annoying, because I'm due to present
some software on the box in question in a few days. But with
no ssh on a (so far) headless box, I'll be in some trouble...

MfG, JBG

--
Jan-Benedict Glaw . [email protected] . +49-172-7608481
-- New APT-Proxy written in shell script --
http://lug-owl.de/~jbglaw/software/ap2/

2001-12-24 19:38:51

by Jan-Benedict Glaw

Subject: Re: Data sitting and remaining in Send-Q

On Mon, 2001-12-24 19:10:32 +0100, José Luis Domingo López <[email protected]>
wrote in message <20011224181031.GA7934@localhost>:
> On Monday, 24 December 2001, at 18:01:42 +0100,
> Jan-Benedict Glaw wrote:
> > I've got some problem with a freshly installed Debian sid system.
> > It's running with 2.4.16, 2.4.17-rc2 and 2.4.17 (the problem
> > appears on all these kernels) and something seems to break ssh.
>
> My own experience with Debian's ssh is that, sooner or later,
> X-forwarding fails, with Send-Q (or Recv-Q) in the server side
> completely full. The server side was Debian Sid, and client side was
> Debian Woody, and it happened with both a simple xclock and gkrellm (ssh
> remoteserver xclock, ssh remoteserver gkrellm).

Seems to be a more general problem. I just installed ftpd and telnetd.
*Both* of them show exactly the same behaviour: 'ls -l' via telnet
blocks as well. I could get a 635-byte file via ftp, but fetching a
69294-byte file stalled. (This time, strace shows that ftpd is
sitting in write(5, ...data..., 56262), and there are
13032 bytes in Send-Q for ftpd...)

So what is this? It seems that there's a general TCP I/O problem with
the current software versions in Debian unstable. A libc
problem? Could a lousy network card cause this? Are there any
debugging hints for me?

MfG, JBG

--
Jan-Benedict Glaw . [email protected] . +49-172-7608481
-- New APT-Proxy written in shell script --
http://lug-owl.de/~jbglaw/software/ap2/

2001-12-24 20:09:45

by Mr. James W. Laferriere

Subject: Re: Data sitting and remaining in Send-Q


Hello Jan, is this possibly related to an ECN-enabled host with a
non-ECN-enabled device (or a Cisco router) somewhere in between?
Just a thought, JimL

On Mon, 24 Dec 2001, Jan-Benedict Glaw wrote:

> On Mon, 2001-12-24 19:10:32 +0100, José Luis Domingo López <[email protected]>
> wrote in message <20011224181031.GA7934@localhost>:
> > On Monday, 24 December 2001, at 18:01:42 +0100,
> > Jan-Benedict Glaw wrote:
> > > I've got some problem with a freshly installed Debian sid system.
> > > It's running with 2.4.16, 2.4.17-rc2 and 2.4.17 (the problem
> > > appears on all these kernels) and something seems to break ssh.
> >
> > My own experience with Debian's ssh is that, sooner or later,
> > X-forwarding fails, with Send-Q (or Recv-Q) in the server side
> > completely full. The server side was Debian Sid, and client side was
> > Debian Woody, and it happened with both a simple xclock and gkrellm (ssh
> > remoteserver xclock, ssh remoteserver gkrellm).
>
> Seems to be a more general problem. I just installed ftpd and telnetd.
> *Both* of them show exactly the same behaviour: 'ls -l' via telnet
> blocks also. I could get a 635 byte file via ftp, but fetching a
> 69294 bytes long file stalled. (This time, strace shows that ftpd is
> sitting in write(5, ...data..., 56262), and there are
> 13032 bytes in Send-Q for ftpd...)
>
> So what is this? Seems that there's a general TCP I/O problem with
> the current software versions in Debian unstable. libc
> problem? Could a lousy network card cause this? Are there any
> debugging hints for me?
>
> MfG, JBG
>
> --
> Jan-Benedict Glaw . [email protected] . +49-172-7608481
> -- New APT-Proxy written in shell script --
> http://lug-owl.de/~jbglaw/software/ap2/

+------------------------------------------------------------------+
| James W. Laferriere | System Techniques | Give me VMS |
| Network Engineer | P.O. Box 854 | Give me Linux |
| [email protected] | Coudersport PA 16915 | only on AXP |
+------------------------------------------------------------------+

2001-12-24 20:17:46

by Jan-Benedict Glaw

Subject: Re: Data sitting and remaining in Send-Q

On Mon, 2001-12-24 15:09:07 -0500, Mr. James W. Laferriere <[email protected]>
wrote in message <Pine.LNX.4.43.0112241507550.31883-100000@filesrv1.baby-dragons.com>:
>
> Hello Jan, is this possibly related to an ECN-enabled host with a
> non-ECN-enabled device (or a Cisco router) somewhere in between?

That would give a different result: either "functional TCP connections" or
"non-functional TCP connections". Mine are somewhere in between. If data gets
sent in small chunks, everything is fine, but if it's a larger
transfer (more than one Ethernet frame can carry?), write()
stalls (or a non-blocking write returns), and the data is kept in
Send-Q rather than being sent down to the client.

Well, my setup is a LAN, and everything here is fully functional wrt.
ECN. I've never switched ECN off, and 2.4.x has been running for ages
on the boxes around here. So it's definitely *not* ECN in this case :-(

MfG, JBG

--
Jan-Benedict Glaw . [email protected] . +49-172-7608481
-- New APT-Proxy written in shell script --
http://lug-owl.de/~jbglaw/software/ap2/

Subject: Re: Data sitting and remaining in Send-Q

> That would give a different result: "functional TCP connections" or
> "non-functional TCP connections". Mine are between that. If data gets
> sent in small chunks, everything is fine, but if it's a larger
> transfer (more than one ethernet frame may transport???), write()
> stalls (or non-blocking write returns), but data is kept in
> Send-Q rather than being sent down to the client.

Just to check the completely obvious:

Difficult / impossible to tell without a tcpdump, but last time I
saw something like this, one end was silently dropping packets
exactly equal to the MTU size (or up to 3 bytes smaller), but
transmitting all other packets (in this instance it was a bizarre
802.11 problem).

What happens is that small files get through, as do files small enough
that the TCP window has not yet grown, as do interactive sessions
(frequently), but large FTP transfers appear to die; in fact, if you leave
them long enough, they recover after a long stall.

This is far easier to diagnose if both devices are on the same segment
(remember it can be an L1/L2 device in the way that does the drop
though).

If you have an L3 device (router etc.) in the middle, you can get
a similar effect if the device does not fragment data correctly
(for instance the Cisco into ip tunnels bug - now fixed I think),
or, if you are using PMTU discovery (probably), if some evil device,
or the end nodes, are filtering out ICMP (or doing something else
which breaks PMTU discovery, such as some types of address filtering
if there is NAT in the way).

If you run tcpdump on both boxes, then for each packet transmitted
you should either see a received packet at the other end, or an ICMP
reply (immediately); if you see a long pause and get a "reassembly
failed" message back, or a retransmit, you will know it's this.
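
A rough sketch of running the same capture on both boxes. The helper
capture_cmd is hypothetical; it only prints a tcpdump command line, and
the interface name, peer address and port are placeholders to be
replaced with real values. The filter keeps the TCP connection under
test plus any ICMP between the two hosts.

```shell
#!/bin/sh
# Print a tcpdump invocation that captures one TCP port plus ICMP
# to/from the given peer. Arguments: INTERFACE PEER PORT (all placeholders).
capture_cmd() {
    iface=$1 peer=$2 port=$3
    printf 'tcpdump -n -i %s host %s and \\( tcp port %s or icmp \\)\n' \
        "$iface" "$peer" "$port"
}

# Example: print the command, then run it as root on each box:
#   capture_cmd eth0 192.0.2.10 1111
```

Comparing the two captures packet by packet shows which segments leave
one box but never arrive at the other.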

I'd recommend testing tcpspray or something simple before looking
at the gory internals of ssh buffer handling (openssh seems cleaner).
I'd also recommend, if you are in an environment that can stand
it, putting the two machines on a common L2 network, close together,
and removing all filters (iptables etc.) and checking that works.
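
One simple test along these lines is a ping sweep over packet sizes
with the don't-fragment bit set. This is only a sketch, assuming Linux
iputils ping (its -M do, -s, -c and -W options) and IPv4 without
options; the function names and the HOST argument are made up for
illustration.

```shell
#!/bin/sh
# ICMP payload size that yields an IPv4 packet of the given total length:
# 20 bytes IPv4 header (no options) + 8 bytes ICMP header = 28 bytes overhead.
icmp_payload_for() {
    echo $(( $1 - 28 ))
}

# Ping HOST with IP packet sizes FROM..TO in steps of STEP, DF bit set,
# and report which sizes get an answer.
sweep_sizes() {
    host=$1 from=$2 to=$3 step=$4
    size=$from
    while [ "$size" -le "$to" ]; do
        if ping -c 3 -W 2 -M do -s "$(icmp_payload_for "$size")" "$host" \
            >/dev/null 2>&1; then
            echo "$size ok"
        else
            echo "$size FAILED"
        fi
        size=$(( size + step ))
    done
}

# Example (HOST is a placeholder):
#   sweep_sizes 192.0.2.10 800 1500 50
```

A sharp cutoff at some size, well below the interface MTU, points at
the path or the hardware rather than at ssh or any other application.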

--
Alex Bligh

2001-12-24 21:35:16

by Thorsten Kranzkowski

Subject: Re: Data sitting and remaining in Send-Q

On Mon, Dec 24, 2001 at 08:44:37PM -0000, Alex Bligh - linux-kernel wrote:
> > That would give a different result: "functional TCP connections" or
> > "non-functional TCP connections". Mine are between that. If data gets
> > sent in small chunks, everything is fine, but if it's a larger
> > transfer (more than one ethernet frame may transport???), write()
> > stalls (or non-blocking write returns), but data is kept in
> > Send-Q rather than being sent down to the client.
>
> Just to check the completely obvious:
>
[...]
>
> If you have an L3 device (router etc.) in the middle, you can get
> a similar effect if the device does not fragment data correctly
> (for instance the Cisco into ip tunnels bug - now fixed I think),
> or, if you are using PMTU discovery (probably), if some evil device,

Jan,
do you have some DSL Modem in between?

Thorsten

--
| Thorsten Kranzkowski Internet: [email protected] |
| Mobile: ++49 170 1876134 Snail: Niemannsweg 30, 49201 Dissen, Germany |
| Ampr: dl8bcu@db0lj.#rpl.deu.eu, [email protected] [44.130.8.19] |

2001-12-24 21:57:14

by Jan-Benedict Glaw

Subject: Re: Data sitting and remaining in Send-Q

On Mon, 2001-12-24 20:44:37 -0000, Alex Bligh - linux-kernel <[email protected]>
wrote in message <1062462662.1009226676@[195.224.237.69]>:
> >That would give a different result: "functional TCP connections" or
> >"non-functional TCP connections". Mine are between that. If data gets
> >sent in small chunks, everything is fine, but if it's a larger
> >transfer (more than one ethernet frame may transport???), write()
> >stalls (or non-blocking write returns), but data is kept in
> >Send-Q rather than being sent down to the client.

Well, some testing done. I've written a small microserver bound to
port 1111/tcp via inetd:
--------------------------------------
#!/bin/sh
# Send LEN zero bytes in a single block, then wait a second before closing.
LEN="$(cat /root/size)"

dd bs="$LEN" if=/dev/zero count=1 2>/dev/null
sleep 1
exit 0
------------------------------------

I can control its output via a file. It seems that I can always
transmit up to ~920 bytes at a time, but never more than 940.
Values between these bounds work more or less,
depending on their size (smaller packets == high chance of reaching the client,
larger packets == small chance of reaching the destination).
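
Finding such a limit by hand is tedious, so here is a hedged sketch of
a binary search over the size. It assumes that success is monotone in
the size (which, given the flaky range between ~920 and 940, is only
approximately true here) and that the lower bound works. largest_ok and
the probe command are hypothetical names; the probe must return 0 when
a transfer of the given size gets through.

```shell
#!/bin/sh
# largest_ok LOW HIGH PROBE [ARGS...]
# Binary-search the largest SIZE in [LOW, HIGH] for which
# `PROBE ARGS... SIZE` exits with status 0. Assumes LOW succeeds.
largest_ok() {
    lo=$1 hi=$2
    shift 2
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))   # round up so the loop always shrinks
        if "$@" "$mid"; then
            lo=$mid                    # mid works: the answer is >= mid
        else
            hi=$(( mid - 1 ))          # mid fails: the answer is < mid
        fi
    done
    echo "$lo"
}
```

A probe for the microserver above might set /root/size on the server,
connect to port 1111 and compare the number of bytes received with the
requested size; repeating each probe a few times would help with the
unreliable sizes near the limit.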

> Just to check the completely obvious:
>
> Difficult / impossible to tell without a tcpdump, but last time I
> saw something like this, one end was silently dropping packets
> exactly equal to the MTU size (or up to 3 bytes smaller), but
> transmitting all other packets (in this instance it was a bizarre
> 802.11 problem).

It's quite a problem to do tcpdumping on a host from which you
never can get more than ~920 bytes at a time, neither by ftp, nor
by ssh or telnet or whatever:-)

Well, I've tcpdumped now, and it seems my old WaveSwitch is
to blame. The "bad" server actually transmits everything
(and also tries retransmits etc.), but that never leaves the
switch again... I've changed the switch port as well as the
cable. It seems the switch and that network card don't
like each other...

I've now replaced the network card, everything is fine now.

I had never seen a NIC fail partially; I've learned a lot
this evening...

Thank you very much (to all who sent me notes) and have a nice
X-Mas...

MfG, JBG

--
Jan-Benedict Glaw . [email protected] . +49-172-7608481
-- New APT-Proxy written in shell script --
http://lug-owl.de/~jbglaw/software/ap2/

2001-12-24 21:59:14

by Jan-Benedict Glaw

Subject: Re: Data sitting and remaining in Send-Q

On Mon, 2001-12-24 21:34:52 +0000, Thorsten Kranzkowski <[email protected]>
wrote in message <[email protected]>:
> On Mon, Dec 24, 2001 at 08:44:37PM -0000, Alex Bligh - linux-kernel wrote:
> > If you have an L3 device (router etc.) in the middle, you can get
> > a similar effect if the device does not fragment data correctly
> > (for instance the Cisco into ip tunnels bug - now fixed I think),
> > or, if you are using PMTU discovery (probably), if some evil device,
>
> Jan,
> do you have some DSL Modem in between?

Hi Thorsten!

No, it's not the famous MTU-too-large-and-a-lot-of-fragmentation-needed
problem. It was a broken NIC, unwilling to send frames > ~960 bytes...

MfG, JBG

--
Jan-Benedict Glaw . [email protected] . +49-172-7608481
-- New APT-Proxy written in shell script --
http://lug-owl.de/~jbglaw/software/ap2/

Subject: Re: Data sitting and remaining in Send-Q

> Well, I've tcpdumped now, and it seems my old WaveSwitch is
> to blame. The "bad" server actually transmits everything
> (and also tries retransmits etc.), but that never leaves the
> switch again... I've changed the switch port as well as the
> cable. It seems the switch and that network card don't
> like each other...
>
> I've now replaced the network card, everything is fine now.
>
> I've never seen a NIC failing partially, I've learned a lot
> this evening...

Well, I dunno if a WaveSwitch is 802.11 (sounds like it might
be), but if it is, I had an identical problem - look under wireless
ethernet at
http://www.alex.org.uk/T23
Various firmware upgrades and, crucially, a settings change
fixed it. Your symptoms sound identical to mine
(and if so it's the base station you have to fix). Short answer
is to change to RFC 1042 encapsulation from 802.1H, which (seemingly
illegally) works at 1500 byte MTUs only between some hardware
pairs.

--
Alex Bligh