2003-01-24 18:33:54

by David C Niemi

Subject: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX


I have been experiencing some baffling SSH client hangs under 2.5.59 (and
55) in which the session totally hangs up after I have typed (typically)
10-100 characters. Right before it hangs permanently, a character is
echo'd back to the screen several seconds late. Interestingly, data due
back for my client which is initiated by the server side does make it, I
just can't type anything further.

To reproduce this: ssh in to a somewhat distant host. At a command
prompt, hold down a letter key for a couple of minutes, or just type text
in. If you cut'n'paste text, it rarely hangs (my guess is that this
requires a lot fewer round trips than interactive typing). It should hang
before you get a screenful (sometimes the sessions hang even before they
are set up).

The system involved is a new Dell desktop with a P4/2.6 CPU and an
integrated Intel E1000 NIC, being used at 100Mb full duplex
(autonegotiated). Sessions go through a Cisco PIX on their way to
anywhere useful. The problem doesn't seem to occur if the SSH client and
server are on the same subnet; I'm not sure whether the PIX is an
essential cause of this or if any old router would do the same thing.
I've also reproduced it while attached to different 100TX switches,
so I think the problem is higher-level.

As for networking options, I see the problem both using the (rather
extensive) default options, and a stripped down set of options with no QOS
or netfilter or anything else fancy.

Neither "ifconfig" nor dmesg show *any* errors whatsoever.

Anyone else seeing SSH client hangs to nonlocal hosts under 2.5.59?

David C Niemi


2003-01-24 19:41:24

by David Miller

Subject: Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

On Fri, 2003-01-24 at 10:43, David C Niemi wrote:
> The system involved is a new Dell desktop with a P4/2.6 CPU and an
> integrated Intel E1000 NIC, being used at 100Mb full duplex
> (autonegotiated).

What happens if you comment out the enabling of
NETIF_F_TSO in drivers/net/e1000/e1000_main.c around
line 428? Does the problem persist?

2003-01-24 20:30:55

by lost

Subject: Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

On Fri, 24 Jan 2003, David C Niemi wrote:

> I have been experiencing some baffling SSH client hangs under 2.5.59 (and
> 55) in which the session totally hangs up after I have typed (typically)
> 10-100 characters. Right before it hangs permanently, a character is
> echo'd back to the screen several seconds late. Interestingly, data due
> back for my client which is initiated by the server side does make it, I
> just can't type anything further.

<snip>

> Neither "ifconfig" nor dmesg show *any* errors whatsoever.
>
> Anyone else seeing SSH client hangs to nonlocal hosts under 2.5.59?

I'm seeing the same problem with a D-Link NIC (8139too driver). Exact same
symptoms - a delayed echo followed by no further echos. Checking netstat
shows an output queue for the socket but it never transmits anything.
Messages echoed by the remote server also make it through the connection.

The same problem does not occur using "telnet" to connect to the remote
host.


William Astle
finger [email protected] for further information

Geek Code V3.12: GCS/M/S d- s+:+ !a C++ UL++++$ P++ L+++ !E W++ !N w--- !O
!M PS PE V-- Y+ PGP t+@ 5++ X !R tv+@ b+++@ !DI D? G e++ h+ y?

2003-01-24 21:06:12

by Christopher Faylor

Subject: Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

On Fri, Jan 24, 2003 at 01:46:10PM -0700, [email protected] wrote:
>On Fri, 24 Jan 2003, David C Niemi wrote:
>
>> I have been experiencing some baffling SSH client hangs under 2.5.59 (and
>> 55) in which the session totally hangs up after I have typed (typically)
>> 10-100 characters. Right before it hangs permanently, a character is
>> echo'd back to the screen several seconds late. Interestingly, data due
>> back for my client which is initiated by the server side does make it, I
>> just can't type anything further.
>
><snip>
>
>> Neither "ifconfig" nor dmesg show *any* errors whatsoever.
>>
>> Anyone else seeing SSH client hangs to nonlocal hosts under 2.5.59?
>
>I'm seeing the same problem with a D-Link NIC (8139too driver). Exact same
>symptoms - a delayed echo followed by no further echos. Checking netstat
>shows an output queue for the socket but it never transmits anything.
>Messages echoed by the remote server also make it through the connection.

I hate "me toos" but maybe this will provide some useful data.

I'm seeing the same thing with a 3c59x driver. I couldn't reproduce the
problem with a tulip driver when I connect my laptop directly to my
cable modem. The problem only occurs when going through the laptop
(which acts as a firewall, running netfilter) to a remote site, in my
case the site is sources.redhat.com.

cgf

2003-01-25 01:51:34

by Christopher Faylor

Subject: Re: SSH Hangs in 2.5.59 and 2.5.55 (TCP_NODELAY?)

On Fri, Jan 24, 2003 at 04:15:23PM -0500, Christopher Faylor wrote:
>On Fri, Jan 24, 2003 at 01:46:10PM -0700, [email protected] wrote:
>>On Fri, 24 Jan 2003, David C Niemi wrote:
>>
>>> I have been experiencing some baffling SSH client hangs under 2.5.59 (and
>>> 55) in which the session totally hangs up after I have typed (typically)
>>> 10-100 characters. Right before it hangs permanently, a character is
>>> echo'd back to the screen several seconds late. Interestingly, data due
>>> back for my client which is initiated by the server side does make it, I
>>> just can't type anything further.
>>
>><snip>
>>
>>> Neither "ifconfig" nor dmesg show *any* errors whatsoever.
>>>
>>> Anyone else seeing SSH client hangs to nonlocal hosts under 2.5.59?
>>
>>I'm seeing the same problem with a D-Link NIC (8139too driver). Exact same
>>symptoms - a delayed echo followed by no further echos. Checking netstat
>>shows an output queue for the socket but it never transmits anything.
>>Messages echoed by the remote server also make it through the connection.
>
>I hate "me toos" but maybe this will provide some useful data.
>
>I'm seeing the same thing with a 3c59x driver. I couldn't reproduce the
>problem with a tulip driver when I connect my laptop directly to my
>cable modem. The problem only occurs when going through the laptop
>(which acts as a firewall, running netfilter) to a remote site, in my
>case the site is sources.redhat.com.

Checking the strace log between telnet and ssh, I noticed that ssh does this:
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0

while telnet doesn't.

If I introduce that call into telnet, it seems to hang eventually too in the
same way as ssh.

cgf
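
For reference, the setsockopt() call seen in the strace log can be reproduced in a few lines of user-space C. This is only a minimal sketch: the helper names set_nodelay() and get_nodelay() are made up for illustration, and IPPROTO_TCP is used in place of SOL_TCP (the two have the same value on Linux).

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: turn TCP_NODELAY on (1) or off (0) for a TCP socket.
 * Returns 0 on success, -1 on error, exactly as setsockopt() does. */
int set_nodelay(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}

/* Hypothetical helper: read the option back.
 * Returns 1 if set, 0 if clear, -1 on error. */
int get_nodelay(int fd)
{
    int on = 0;
    socklen_t len = sizeof(on);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, &len) < 0)
        return -1;
    return on != 0;
}
```

With the option set, small writes such as single keystrokes are sent immediately instead of being coalesced, which would fit the observation that interactive typing triggers the hang far more readily than pasted text.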

2003-01-27 14:18:21

by David C Niemi

Subject: Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

On Fri, 24 Jan 2003, David S. Miller wrote:
> What happens if you comment out the enabling of
> NETIF_F_TSO in drivers/net/e1000/e1000_main.c around
> line 428? Does the problem persist?

Yes, the problem persists.

Interesting that it seems to happen on a variety of Ethernet cards, I
wonder if the problem's in the TCP area. Interestingly it seems like on
the *unafflicted* systems I can still see the "delayed character" symptom,
but eventually the outstanding characters do get echoed back to the
screen. Whereas on the afflicted 2.5.5x systems, as soon as there is a
delay (perhaps due to a retransmission) all outstanding characters (after
the delayed one) are lost or permanently hung up somewhere.

DCN

2003-01-27 18:09:23

by David Miller

Subject: Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX


David, this email address you use "[email protected]" always
bounces for me. Maybe you have it fixed now, but for the first
two replies I've sent you on this issue I've gotten a "user unknown"
bounce. This gets annoying after a while when you're trying to help
someone fix a problem. :(

2003-01-27 18:13:57

by David Miller

Subject: Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

   From: David C Niemi <[email protected]>
   Date: Mon, 27 Jan 2003 09:27:25 -0500 (EST)

   On Fri, 24 Jan 2003, David S. Miller wrote:
   > What happens if you comment out the enabling of
   > NETIF_F_TSO in drivers/net/e1000/e1000_main.c around
   > line 428? Does the problem persist?

   Yes, the problem persists.

   Interesting that it seems to happen on a variety of Ethernet cards,
   I wonder if the problem's in the TCP area.

I think the clue in this thread is the TCP_NODELAY socket option.
The one post claimed that by turning this on in telnet, it made
telnet exhibit the same problems SSH shows.

2003-01-27 18:20:00

by Anders Gustafsson

Subject: Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

On Mon, Jan 27, 2003 at 10:11:28AM -0800, David S. Miller wrote:
>
> I think the clue in this thread is the TCP_NODELAY socket option.
> The one post claimed that by turning this on in telnet, it made
> telnet exhibit the same problems SSH shows.

This is a "me too", well actually not me, but some friends are seeing this.
If I remember correctly the data was actually sent to the server and only
the response was lost (seen by stracing the shell on the server). Someone
suggested that it might be the sequence number being screwed up.


--
Anders Gustafsson - [email protected] - http://0x63.nu/

2003-01-27 21:21:32

by Bill Davidsen

Subject: Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

On Fri, 24 Jan 2003, David C Niemi wrote:

>
> I have been experiencing some baffling SSH client hangs under 2.5.59 (and
> 55) in which the session totally hangs up after I have typed (typically)
> 10-100 characters. Right before it hangs permanently, a character is
> echo'd back to the screen several seconds late. Interestingly, data due
> back for my client which is initiated by the server side does make it, I
> just can't type anything further.
>
> To reproduce this: ssh in to a somewhat distant host. At a command
> prompt, hold down a letter key for a couple of minutes, or just type text
> in. If you cut'n'paste text, it rarely hangs (my guess is that this
> requires a lot fewer round trips than interactive typing). It should hang
> before you get a screenful (sometimes the sessions hang even before they
> are set up).

Sorry to say I sometimes see this on 2.4 kernels as well, even on PPP
dialed connections. The symptoms are that the local ssh client just stops
sending packets. That's very easy to tell with an external modem:-) The
connection is still fine, if I have multiple connections to the host only
one hangs, and I believe it's a client issue in ssh.

What I see may or may not be related to your problem.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-01-27 22:39:07

by David Miller

Subject: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX


Hey guys, can you all see if this patch makes the problem go away in
2.5.x? It is merely a guess, but it is worth enough to experiment.

Alexey, this piece of code was buggy first time it was coded, and it
may still have some holes. :-)))

--- net/ipv4/tcp_output.c.~1~ Mon Jan 27 14:45:49 2003
+++ net/ipv4/tcp_output.c Mon Jan 27 14:46:33 2003
@@ -889,7 +889,7 @@
 	if (atomic_read(&sk->wmem_alloc) > min(sk->wmem_queued+(sk->wmem_queued>>2),sk->sndbuf))
 		return -EAGAIN;

-	if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
+	if (0 && before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
 		if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
 			BUG();
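
The condition the patch disables relies on wraparound-safe sequence number comparison. As a rough user-space illustration (not the kernel source; seq_before() and head_already_acked() are names invented for this sketch), the test "seq is before snd_una" can be written with a signed 32-bit difference:

```c
#include <stdint.h>

/* Sketch of a wraparound-safe "seq1 is earlier than seq2" test for
 * 32-bit TCP sequence numbers: the unsigned difference is reinterpreted
 * as signed, so values on either side of the 2^32 wrap compare sanely. */
int seq_before(uint32_t seq1, uint32_t seq2)
{
    return (int32_t)(seq1 - seq2) < 0;
}

/* Hypothetical helper mirroring the patched condition: true when a queued
 * segment starts below snd_una, i.e. its head has already been acked. */
int head_already_acked(uint32_t seq, uint32_t snd_una)
{
    return seq_before(seq, snd_una);
}
```

For example, seq_before(0xFFFFFFF0, 0x10) is true even though the first value is numerically larger, because the second lies just past the wrap.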

2003-01-28 02:09:04

by lost

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

On Mon, 27 Jan 2003, David S. Miller wrote:

> Hey guys, can you all see if this patch makes the problem go away in
> 2.5.x? It is merely a guess, but it is worth enough to experiment.
>
> Alexey, this piece of code was buggy first time it was coded, and it
> may still have some holes. :-)))

It seems to have cleared up the problem for me. I've been running an SSH
session for the past hour without any lock up problems with the patch
installed. Without it, the lock up happened quite reliably within a few
minutes.

William Astle
finger [email protected] for further information

Geek Code V3.12: GCS/M/S d- s+:+ !a C++ UL++++$ P++ L+++ !E W++ !N w--- !O
!M PS PE V-- Y+ PGP t+@ 5++ X !R tv+@ b+++@ !DI D? G e++ h+ y?

2003-01-28 02:48:55

by Alexey Kuznetsov

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x,

Hello!

> Alexey, this piece of code was buggy first time it was coded, and it
> may still have some holes. :-)))

To my shame, I cannot say "no". It was written sort of too fast. :-)

Did the reporters see packets with wrong checksum on wire or wrong tcp
headers or something like that?

Alexey

2003-01-28 03:12:53

by Christopher Faylor

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x,

On Tue, Jan 28, 2003 at 05:57:55AM +0300, [email protected] wrote:
>>Alexey, this piece of code was buggy first time it was coded, and it
>>may still have some holes. :-)))
>
>To my shame, I cannot say "no". It was written sort of too fast. :-)
>
>Did the reporters see packets with wrong checksum on wire or wrong tcp
>headers or something like that?

My knowledge of TCP/IP is extremely minimal but the sequence number looked
weird when the stall occurred. It looked like the sequence numbers you get
with the -S option to tcpdump. All of the other packets had small sequence
numbers and what I assume was the bad packet had a large one.

I'm sorry if this is gibberish and makes no sense. I don't know how to tell
if the checksum was wrong or not.

cgf
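
Some background on the observation above: by default tcpdump prints sequence numbers relative to the initial sequence number of each direction, while -S prints the absolute values, so a single "large" number among small ones suggests a packet tcpdump could not match to the flow's expected sequence space. A minimal sketch of the conversion, with rel_seq() being a made-up helper name:

```c
#include <stdint.h>

/* Sketch: convert an absolute 32-bit TCP sequence number to one that is
 * relative to the initial sequence number (ISN) of that direction.
 * Unsigned arithmetic makes the result correct across the 2^32 wrap. */
uint32_t rel_seq(uint32_t abs_seq, uint32_t isn)
{
    return abs_seq - isn;
}
```

With an illustrative ISN of 990, absolute sequence number 1000 is shown as 10; an ISN of 0xFFFFFFFB and an absolute value of 5 also give 10, because the subtraction wraps.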

2003-01-28 03:29:43

by Christopher Faylor

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

On Mon, Jan 27, 2003 at 02:36:25PM -0800, David S. Miller wrote:
>Hey guys, can you all see if this patch makes the problem go away in
>2.5.x? It is merely a guess, but it is worth enough to experiment.
>
>Alexey, this piece of code was buggy first time it was coded, and it
>may still have some holes. :-)))
>
>--- net/ipv4/tcp_output.c.~1~ Mon Jan 27 14:45:49 2003
>+++ net/ipv4/tcp_output.c Mon Jan 27 14:46:33 2003
>@@ -889,7 +889,7 @@
> if (atomic_read(&sk->wmem_alloc) > min(sk->wmem_queued+(sk->wmem_queued>>2),sk->sndbuf))
> return -EAGAIN;
>
>- if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
>+ if (0 && before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
> if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
> BUG();

Sorry, but this doesn't do it for me. I still get a hang.

cgf

2003-01-28 03:46:25

by Alexey Kuznetsov

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

Hello!

> Sorry, but this doesn't do it for me. I still get a hang.

Can you make tcpdump of this session which looks like tcpdump with -S? :-)


Alexey

2003-01-28 07:00:41

by Eric Dumazet

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

From: <[email protected]>
> Can you make tcpdump of this session which looks like tcpdump with -S? :-)
>
>
> Alexey

Hello Alexey

I do have a lot of such hangs too.

client is a linux 2.4.20 machine, but the hangs occur with clients running other OSes as well.
server is a linux-2.5.59 SMP machine, traffic control in operation,

07:28:28.176006 client.1022 > server.22: S 3603063450:3603063450(0) win 5840 <mss 1460,sackOK,timestamp 359716783 0,nop,wscale 0> (DF)
07:28:28.176151 server.22 > client.1022: S 420199885:420199885(0) ack 3603063451 win 5840 <mss 1460> (DF)
07:28:28.580529 client.1022 > server.22: . ack 420199886 win 5840 (DF)
07:28:28.583078 server.22 > client.1022: P 420199886:420199909(23) ack 3603063451 win 5840 (DF)
07:28:29.006494 client.1022 > server.22: . ack 420199909 win 5840 (DF) [tos 0x10]
07:28:29.007644 client.1022 > server.22: P 3603063451:3603063472(21) ack 420199909 win 5840 (DF) [tos 0x10]
07:28:29.007743 server.22 > client.1022: . ack 3603063472 win 5840 (DF)
07:28:29.008246 server.22 > client.1022: P 420199909:420200185(276) ack 3603063472 win 5840 (DF)
07:28:29.575223 client.1022 > server.22: P 3603063472:3603063628(156) ack 420200185 win 6432 (DF) [tos 0x10]
07:28:29.576148 client.1022 > server.22: P 3603063628:3603063680(52) ack 420200185 win 6432 (DF) [tos 0x10]
07:28:29.596173 server.22 > client.1022: P 420200185:420200197(12) ack 3603063680 win 5840 (DF)
07:28:30.020642 client.1022 > server.22: P 3603063680:3603063700(20) ack 420200197 win 6432 (DF) [tos 0x10]
07:28:30.059904 server.22 > client.1022: . ack 3603063700 win 5840 (DF)
07:28:31.407346 client.1022 > server.22: P 3603063680:3603063700(20) ack 420200197 win 6432 (DF) [tos 0x10]
07:28:31.407464 server.22 > client.1022: . ack 3603063700 win 5840 (DF)
07:28:34.369344 server.22 > client.1022: P 420200197:420200209(12) ack 3603063700 win 5840 (DF)
07:28:34.784326 client.1022 > server.22: P 3603063700:3603063840(140) ack 420200209 win 6432 (DF) [tos 0x10]
07:28:34.784398 server.22 > client.1022: . ack 3603063840 win 6432 (DF)
07:28:34.786551 server.22 > client.1022: P 420200209:420200221(12) ack 3603063840 win 6432 (DF)
07:28:35.268516 client.1022 > server.22: . ack 420200221 win 6432 (DF) [tos 0x10]
07:28:38.597375 client.1022 > server.22: P 3603063840:3603063868(28) ack 420200221 win 6432 (DF) [tos 0x10]
07:28:38.605463 server.22 > client.1022: P 420200221:420200233(12) ack 3603063868 win 6432 (DF)
07:28:39.096028 client.1022 > server.22: . ack 420200233 win 6432 (DF) [tos 0x10]
07:28:39.099880 client.1022 > server.22: P 3603063868:3603064016(148) ack 420200233 win 6432 (DF) [tos 0x10]
07:28:39.101205 server.22 > client.1022: P 420200233:420200245(12) ack 3603064016 win 6432 (DF)
07:28:39.665364 client.1022 > server.22: P 3603064016:3603064028(12) ack 420200245 win 6432 (DF) [tos 0x10]
07:28:39.681917 server.22 > client.1022: P 420200245:420200377(132) ack 3603064028 win 6432 (DF) [tos 0x10]
07:28:39.957853 server.22 > client.1022: P 420200377:420200413(36) ack 3603064028 win 6432 (DF) [tos 0x10]
07:28:40.191540 client.1022 > server.22: . ack 420200377 win 7504 (DF) [tos 0x10]
07:28:40.432214 client.1022 > server.22: . ack 420200413 win 7504 (DF) [tos 0x10]
07:28:41.928298 client.1022 > server.22: P 3603064028:3603064048(20) ack 420200413 win 7504 (DF) [tos 0x10]
07:28:41.938316 server.22 > client.1022: P 420200413:420200457(44) ack 3603064048 win 6432 (DF) [tos 0x10]
07:28:42.123677 client.1022 > server.22: P 3603064048:3603064068(20) ack 420200413 win 7504 (DF) [tos 0x10]
07:28:42.134272 server.22 > client.1022: P 420200457:420200501(44) ack 3603064068 win 6432 (DF) [tos 0x10]
07:28:42.483256 client.1022 > server.22: P 3603064068:3603064108(40) ack 420200457 win 7504 (DF) [tos 0x10]
07:28:42.494187 server.22 > client.1022: P 420200501:420200569(68) ack 3603064108 win 6432 (DF) [tos 0x10]
07:28:42.679902 client.1022 > server.22: . ack 420200501 win 7504 (DF) [tos 0x10]
07:28:42.792933 client.1022 > server.22: P 3603064108:3603064128(20) ack 420200501 win 7504 (DF) [tos 0x10]
07:28:42.803135 server.22 > client.1022: P 420200569:420200613(44) ack 3603064128 win 6432 (DF) [tos 0x10]
07:28:42.825978 client.1022 > server.22: P 3603064128:3603064148(20) ack 420200501 win 7504 (DF) [tos 0x10]
07:28:42.836109 server.22 > client.1022: P 420200613:420200657(44) ack 3603064148 win 6432 (DF) [tos 0x10]
07:28:43.408817 client.1022 > server.22: P 3603064148:3603064188(40) ack 420200501 win 7504 (DF) [tos 0x10]
07:28:43.461886 server.22 > client.1022: . ack 3603064188 win 6432 (DF) [tos 0x10]
07:28:43.589866 server.22 > client.1022: P 420200501:420200569(68) ack 3603064188 win 6432 (DF) [tos 0x10]
07:28:44.087198 client.1022 > server.22: . ack 420200569 win 7504 (DF) [tos 0x10]
07:28:45.410465 server.22 > client.1022: P ack 3603064188 win 6432 (DF) [tos 0x10]
07:28:49.050628 server.22 > client.1022: P ack 3603064188 win 6432 (DF) [tos 0x10]
07:28:56.328994 server.22 > client.1022: P ack 3603064188 win 6432 (DF) [tos 0x10]
07:29:10.886716 server.22 > client.1022: P ack 3603064188 win 6432 (DF) [tos 0x10]
07:29:40.003185 server.22 > client.1022: P ack 3603064188 win 6432 (DF) [tos 0x10]
07:30:38.235034 server.22 > client.1022: P ack 3603064188 win 6432 (DF) [tos 0x10]
07:32:34.696797 server.22 > client.1022: P ack 3603064188 win 6432 (DF) [tos 0x10]
07:34:34.678762 server.22 > client.1022: P ack 3603064188 win 6432 (DF) [tos 0x10]
07:36:34.660724 server.22 > client.1022: P ack 3603064188 win 6432 (DF) [tos 0x10]
07:38:34.642703 server.22 > client.1022: P ack 3603064188 win 6432 (DF) [tos 0x10]
07:40:34.623666 server.22 > client.1022: P ack 3603064188 win 6432 (DF) [tos 0x10]
07:42:34.605632 server.22 > client.1022: P ack 3603064188 win 6432 (DF) [tos 0x10]
07:44:34.587601 server.22 > client.1022: P ack 3603064188 win 6432 (DF) [tos 0x10]
07:46:34.569566 server.22 > client.1022: P ack 3603064188 win 6432 (DF) [tos 0x10]

netstat -an on the server tells us that 156 bytes are waiting in the send
queue.

Thanks for your help
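
A note on reading the tail of the trace: the gaps between the server's repeated "P ack" packets roughly double each time (about 3.6 s, 7.3 s, 14.6 s, 29.1 s, 58.2 s, 116.5 s) and then settle at about 120 s, which is the pattern exponential retransmission backoff with a 120-second ceiling would produce. The following is only an illustrative sketch of that schedule; RTO_MAX_MS and next_rto_ms() are names chosen here, not kernel identifiers.

```c
/* Assumed ceiling for the retransmission timeout, in milliseconds. */
#define RTO_MAX_MS 120000UL

/* Sketch: after each unanswered retransmission the timeout doubles,
 * but never exceeds the ceiling. */
unsigned long next_rto_ms(unsigned long rto_ms)
{
    unsigned long doubled = rto_ms * 2;

    return doubled > RTO_MAX_MS ? RTO_MAX_MS : doubled;
}
```

Starting from 3640 ms this yields 7280, 14560, 29120, 58240, 116480 and then 120000 ms for every later step, close to the spacing seen in the capture.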

2003-01-28 12:26:49

by Sebastian Benoit

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

dada1([email protected])@2003.01.28 08:08:10 +0000:
> From: <[email protected]>
> > Can you make tcpdump of this session which looks like tcpdump with -S? :-)

This might help you:

I still have a similar problem (ssh hang with other traffic) that i reported
in november on netdev:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103641051419994&w=2

(this post includes a tcpdump and a description of how to reproduce it)

I did not follow up on that because i did not have the time and
i ran into hardware problems then...

(You can find the 'socket' program i used here:
http://www.jnickelsen.de/socket/socket-1.2.html)

/Sebastian
--
Sebastian Benoit <[email protected]>
My mail is GnuPG signed -- Unsigned ones are bogus -- http://www.gnupg.org/
GnuPG 0x5BA22F00 2001-07-31 2999 9839 6C9E E4BF B540 C44B 4EC4 E1BE 5BA2 2F00

"After writing for fifteen years it struck me I had no talent for writing.
But I couldn't give it up: by that time I was already famous." -- Mark Twain



2003-01-28 14:00:16

by Alexey Kuznetsov

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

Hello!

> http://marc.theaimsgroup.com/?l=linux-kernel&m=103641051419994&w=2

Thank you. Christopher also gave something similar.

Dave, look:

23:24:05.617819 trixie.bosbc.com.32793 > sources.redhat.com.22: P [bad tcp cksum 3770!] 5136:5136(0) ack 32369 win 45144 <nop,nop,timestamp 122553 80640703> (DF) [tos 0x10] (ttl 64, id 9958, len 52)
23:24:06.093754 trixie.bosbc.com.32793 > sources.redhat.com.22: P [bad tcp cksum 5b6e!] 5136:5136(0) ack 32369 win 45144 <nop,nop,timestamp 123029 80640703> (DF) [tos 0x10] (ttl 64, id 9959, len 52)
23:24:07.045603 trixie.bosbc.com.32793 > sources.redhat.com.22: P [bad tcp cksum a36a!] 5136:5136(0) ack 32369 win 45144 <nop,nop,timestamp 123981 80640703> (DF) [tos 0x10] (ttl 64, id 9960, len 52)


We apparently have segment of zero length in queue. :-)

Well, that chunk cannot be responsible for this directly, I am afraid
we somewhat arrived to attempt to retransmit already acked segment.
It is weird enough to be good hint. :-)

Alexey

2003-01-28 18:39:13

by David Miller

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

   From: [email protected]
   Date: Tue, 28 Jan 2003 17:09:09 +0300 (MSK)

   We apparently have segment of zero length in queue. :-)

   Well, that chunk cannot be responsible for this directly, I am afraid
   we somewhat arrived to attempt to retransmit already acked segment.

Hmmm, it is one of few places where sequence numbers of already
sent packet are mangled. :-)

Good set of debug checks would be the following:

--- net/ipv4/tcp_output.c.~1~ Mon Jan 27 14:46:33 2003
+++ net/ipv4/tcp_output.c Tue Jan 28 10:47:08 2003
@@ -441,6 +441,9 @@
 	TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq;
 	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq;

+	BUG_TRAP(TCP_SKB_CB(buff)->seq != TCP_SKB_CB(buff)->end_seq);
+	BUG_TRAP(TCP_SKB_CB(skb)->seq != TCP_SKB_CB(skb)->end_seq);
+
 	/* PSH and FIN should only be set in the second packet. */
 	flags = TCP_SKB_CB(skb)->flags;
 	TCP_SKB_CB(skb)->flags = flags & ~(TCPCB_FLAG_FIN|TCPCB_FLAG_PSH);
@@ -524,6 +527,7 @@
 	}

 	TCP_SKB_CB(skb)->seq += len;
+	BUG_TRAP(TCP_SKB_CB(skb)->seq != TCP_SKB_CB(skb)->end_seq);
 	skb->ip_summed = CHECKSUM_HW;
 	return 0;
 }
@@ -796,6 +800,7 @@

 		/* Update sequence range on original skb. */
 		TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq;
+		BUG_TRAP(TCP_SKB_CB(skb)->seq != TCP_SKB_CB(skb)->end_seq);

 		/* Merge over control information. */
 		flags |= TCP_SKB_CB(next_skb)->flags; /* This moves PSH/FIN etc. over */
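
The invariant these debug checks assert is simply that a queued data segment never has seq equal to end_seq. A small user-space sketch of the same idea follows; struct seg and the helper names are invented for illustration and are not the kernel's data structures.

```c
#include <stdint.h>

/* Simplified stand-in for the sequence range carried by a queued segment:
 * it covers the bytes [seq, end_seq). */
struct seg {
    uint32_t seq;
    uint32_t end_seq;
};

/* Number of sequence-space bytes the segment covers (wraparound-safe). */
uint32_t seg_len(const struct seg *s)
{
    return s->end_seq - s->seq;
}

/* The condition the BUG_TRAP() lines would complain about. */
int seg_is_empty(const struct seg *s)
{
    return s->seq == s->end_seq;
}

/* Sketch of trimming already-acked bytes off the head. Unlike the patched
 * kernel code, this refuses a trim that would leave an empty segment and
 * returns -1 instead of only reporting it afterwards. */
int seg_trim_head(struct seg *s, uint32_t len)
{
    if (len >= seg_len(s))
        return -1;
    s->seq += len;
    return 0;
}
```

A segment printed by tcpdump as 5136:5136(0) corresponds to seq == end_seq == 5136 here, i.e. seg_is_empty() would be true for it.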

2003-01-28 19:07:28

by Sebastian Benoit

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

David S. Miller([email protected])@2003.01.28 10:35:34 +0000:
> From: [email protected]
> Date: Tue, 28 Jan 2003 17:09:09 +0300 (MSK)
>
> We apparently have segment of zero length in queue. :-)
>
> Well, that chunk cannot be responsible for this directly, I am afraid
> we somewhat arrived to attempt to retransmit already acked segment.
>
> Hmmm, it is one of few places where sequence numbers of already
> sent packet are mangled. :-)
>
> Good set of debug checks would be the following:

no output, i did 4 tests, every time i was able to lock the ssh-connection
within a few seconds. kernel 2.5.59 + your debug-patch.

tcpdump of one:

20:07:30.788431 ronja.fluchtwagenfahrer.de.32774 > turing.fb12.de.ssh: . ack 4359 win 13888 <nop,nop,timestamp 591456 50952833> (DF) [tos 0x10]
20:07:31.054101 ronja.fluchtwagenfahrer.de.32774 > turing.fb12.de.ssh: P 2927:2975(48) ack 4359 win 13888 <nop,nop,timestamp 591722 50952833> (DF) [tos 0x10]
20:07:31.119062 turing.fb12.de.ssh > ronja.fluchtwagenfahrer.de.32774: P 4359:4407(48) ack 2975 win 10944 <nop,nop,timestamp 50952865 591722> (DF) [tos 0x10]
20:07:31.119102 ronja.fluchtwagenfahrer.de.32774 > turing.fb12.de.ssh: . ack 4407 win 13888 <nop,nop,timestamp 591787 50952865> (DF) [tos 0x10]
20:07:31.132819 turing.fb12.de.ssh > ronja.fluchtwagenfahrer.de.32774: P 4407:4487(80) ack 2975 win 10944 <nop,nop,timestamp 50952865 591722> (DF) [tos 0x10]
20:07:31.132842 ronja.fluchtwagenfahrer.de.32774 > turing.fb12.de.ssh: . ack 4487 win 13888 <nop,nop,timestamp 591801 50952865> (DF) [tos 0x10]
20:07:31.132930 turing.fb12.de.ssh > ronja.fluchtwagenfahrer.de.32774: P 4487:4551(64) ack 2975 win 10944 <nop,nop,timestamp 50952866 591722> (DF) [tos 0x10]
20:07:31.132951 ronja.fluchtwagenfahrer.de.32774 > turing.fb12.de.ssh: . ack 4551 win 13888 <nop,nop,timestamp 591801 50952866> (DF) [tos 0x10]
20:07:31.602060 ronja.fluchtwagenfahrer.de.32774 > turing.fb12.de.ssh: P 2975:3023(48) ack 4551 win 13888 <nop,nop,timestamp 592270 50952866> (DF) [tos 0x10]
20:07:31.687764 ronja.fluchtwagenfahrer.de.32774 > turing.fb12.de.ssh: P 3023:3071(48) ack 4551 win 13888 <nop,nop,timestamp 592356 50952866> (DF) [tos 0x10]
20:07:31.834730 ronja.fluchtwagenfahrer.de.32774 > turing.fb12.de.ssh: P 2975:3023(48) ack 4551 win 13888 <nop,nop,timestamp 592503 50952866> (DF) [tos 0x10]
20:07:31.888875 turing.fb12.de.ssh > ronja.fluchtwagenfahrer.de.32774: P 4551:4599(48) ack 3023 win 10944 <nop,nop,timestamp 50952942 592503> (DF) [tos 0x10]

---- here it hangs ---

20:07:31.888910 ronja.fluchtwagenfahrer.de.32774 > turing.fb12.de.ssh: . ack 4599 win 13888 <nop,nop,timestamp 592557 50952942> (DF) [tos 0x10]
20:07:32.300653 ronja.fluchtwagenfahrer.de.32774 > turing.fb12.de.ssh: P ack 4599 win 13888 <nop,nop,timestamp 592969 50952942> (DF) [tos 0x10]
20:07:33.232614 ronja.fluchtwagenfahrer.de.32774 > turing.fb12.de.ssh: P ack 4599 win 13888 <nop,nop,timestamp 593901 50952942> (DF) [tos 0x10]
20:07:35.096334 ronja.fluchtwagenfahrer.de.32774 > turing.fb12.de.ssh: P ack 4599 win 13888 <nop,nop,timestamp 595765 50952942> (DF) [tos 0x10]
20:07:37.269116 ronja.fluchtwagenfahrer.de.32773 > turing.fb12.de.ssh: P ack 1 win 34800 <nop,nop,timestamp 597938 50948566> (DF) [tos 0x10]


/B.
--
Sebastian Benoit <[email protected]>
My mail is GnuPG signed -- Unsigned ones are bogus -- http://www.gnupg.org/
GnuPG 0x5BA22F00 2001-07-31 2999 9839 6C9E E4BF B540 C44B 4EC4 E1BE 5BA2 2F00

Repetition causes insanity. Repetition causes insanity. Repetition causes
insanity. Repetition causes insanity. Repetition causes insanity. Repetition
causes insanity. Repetition causes insanity. Repetition causes insanity.
Repetition causes insanity. Repetition causes insanity. Repetition causes
insanity. Repetition causes insanity. Repetition causes insanity. Repetition
causes insanity. Repetition causes ins



2003-01-28 20:37:37

by David Miller

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

   From: Sebastian Benoit <[email protected]>
   Date: Tue, 28 Jan 2003 20:16:45 +0100

   David S. Miller([email protected])@2003.01.28 10:35:34 +0000:
   > Good set of debug checks would be the following:

   no output, i did 4 tests, everytime i was able to lock the ssh-connection
   within a few seconds. kernel 2.5.59 + your debug-patch.

Thanks for testing, how about this new patch at the end of this email?
Does it make the problem go away?

Alexey, most solid report is that 2.5.43-bk1 makes bug appear.
This is good because it sort of narrows things down. What is
contained there in networking is:

1) initial stackable dst logic, should not cause problems
2) addition of UDP sendfile and ip_append_*() logic
3) fix to tcp_check_req() "fix" :-) it only changes behavior
on connect so should not be a problem

I heavily, therefore, suspect #2 which is why I am poking around
in the tcp.c changes to change checksumming and copying semantics.

--- net/ipv4/tcp.c.~1~ Tue Jan 28 12:40:09 2003
+++ net/ipv4/tcp.c Tue Jan 28 12:41:48 2003
@@ -1089,11 +1089,13 @@
 				if (!skb)
 					goto wait_for_memory;

+#if 0
 				/*
 				 * Check whether we can use HW checksum.
 				 */
 				if (sk->route_caps & (NETIF_F_IP_CSUM|NETIF_F_NO_CSUM|NETIF_F_HW_CSUM))
 					skb->ip_summed = CHECKSUM_HW;
+#endif

 				skb_entail(sk, tp, skb);
 				copy = mss_now;

2003-01-28 21:51:39

by Christopher Faylor

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

On Tue, Jan 28, 2003 at 12:34:13PM -0800, David S. Miller wrote:
> From: Sebastian Benoit <[email protected]>
> Date: Tue, 28 Jan 2003 20:16:45 +0100
>
> David S. Miller([email protected])@2003.01.28 10:35:34 +0000:
> > Good set of debug checks would be the following:
>
> no output, i did 4 tests, everytime i was able to lock the ssh-connection
> within a few seconds. kernel 2.5.59 + your debug-patch.
>
>Thanks for testing, how about this new patch at the end of this email?
>Does it make the problem go away?

It does for me, yes. I tried very hard to make ssh hang but I couldn't do
so.

cgf

2003-01-28 22:02:43

by Sebastian Benoit

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

David S. Miller([email protected])@2003.01.28 12:34:13 +0000:
> Thanks for testing, how about this new patch at the end of this email?
> Does it make the problem go away?

this does it!

/B.


> --- net/ipv4/tcp.c.~1~ Tue Jan 28 12:40:09 2003
> +++ net/ipv4/tcp.c Tue Jan 28 12:41:48 2003
> @@ -1089,11 +1089,13 @@
> if (!skb)
> goto wait_for_memory;
>
> +#if 0
> /*
> * Check whether we can use HW checksum.
> */
> if (sk->route_caps & (NETIF_F_IP_CSUM|NETIF_F_NO_CSUM|NETIF_F_HW_CSUM))
> skb->ip_summed = CHECKSUM_HW;
> +#endif
>
> skb_entail(sk, tp, skb);
> copy = mss_now;
>

--
Sebastian Benoit <[email protected]>
My mail is GnuPG signed -- Unsigned ones are bogus -- http://www.gnupg.org/
GnuPG 0x5BA22F00 2001-07-31 2999 9839 6C9E E4BF B540 C44B 4EC4 E1BE 5BA2 2F00

The dyslexic agnostic with insomnia laid awake all night wondering if there
really was a dog.



2003-01-28 23:24:28

by David Miller

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x, through Cisco PIX

From: Sebastian Benoit <[email protected]>
Date: Tue, 28 Jan 2003 23:12:01 +0100

David S. Miller([email protected])@2003.01.28 12:34:13 +0000:
> Thanks for testing, how about this new patch at the end of this email?
> Does it make the problem go away?

this does it!

Alexey, my current suspect is skb->csum state on retransmit.

BTW, how come tcp_trim_head() can just set skb->ip_summed
blindly to CHECKSUM_HW and not setup skb->csum? Even if you
can depend upon net/core/dev.c to do the checksum for
you, you still would need to setup skb->csum properly.

2003-01-28 23:48:08

by Alexey Kuznetsov

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x,

Hello!

> Alexey, most solid report is that 2.5.43-bk1 makes bug appear.
> This is good because it sort of narrows things down.

Now I do not think so. It looks like some old beast just got manifested.

It happens when 2 short consecutive segments are lost.
Funny things happen when retransmitting.
First, I do not see collapsing, which must be successful in this case.
So, the first segment is retransmitted alone, but the second is never
retransmitted, tcp even prefers to retransmit the third one. Something
is already bad, queue is broken in an interesting way, the impression is
that... that... that tcp did collapsing, but "forgot" to modify skb length.

Hey! Interesting thing has just happened, it is the first time when I found
the bug formulating a sentence while writing e-mail not while peering
to code. :-)

Shheit, look into tcp_retrans_try_collapse():

	if (skb->ip_summed != CHECKSUM_HW) {
		memcpy(skb_put(skb, next_skb_size), next_skb->data, nex$
		skb->csum = csum_block_add(skb->csum, next_skb->csum, s$
	}


WHERE IS skb_put and copy when skb->ip_summed==CHECKSUM_HW??!!

So, the fix is move of memcpy() line out of if clause.

Alexey
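
The failure mode described above can be sketched in userspace. This is a toy model, not kernel code: `struct toy_skb`, `toy_put()`, `collapse_buggy()` and `collapse_fixed()` are hypothetical names made up for illustration, and the checksum update itself is left out. It only shows how the sequence range grows while the payload length does not when the copy sits inside the software-checksum branch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's checksum states. */
enum csum_mode { CSUM_NONE, CSUM_HW };

/* Toy segment: a fixed buffer plus the fields that matter here. */
struct toy_skb {
	unsigned char data[64];
	size_t len;			/* payload bytes held in data[] */
	enum csum_mode ip_summed;
	unsigned int seq, end_seq;	/* sequence range the segment claims */
};

/* Reserve n bytes at the tail and return a pointer to them (skb_put-like). */
static unsigned char *toy_put(struct toy_skb *skb, size_t n)
{
	unsigned char *tail = skb->data + skb->len;

	assert(skb->len + n <= sizeof(skb->data));
	skb->len += n;
	return tail;
}

/* Buggy shape: the copy only happens on the software-checksum path. */
static void collapse_buggy(struct toy_skb *skb, const struct toy_skb *next)
{
	if (next->ip_summed == CSUM_HW)
		skb->ip_summed = CSUM_HW;

	if (skb->ip_summed != CSUM_HW) {
		memcpy(toy_put(skb, next->len), next->data, next->len);
		/* the software checksum update would go here */
	}

	skb->end_seq = next->end_seq;	/* range grows even if no data was copied */
}

/* Fixed shape: the copy is unconditional, only the checksum work is optional. */
static void collapse_fixed(struct toy_skb *skb, const struct toy_skb *next)
{
	memcpy(toy_put(skb, next->len), next->data, next->len);

	if (next->ip_summed == CSUM_HW)
		skb->ip_summed = CSUM_HW;
	/* the software checksum update would run only if ip_summed != CSUM_HW */

	skb->end_seq = next->end_seq;
}
```

With two 2-byte segments that both use CSUM_HW, `collapse_buggy()` leaves `len` at 2 while `end_seq - seq` becomes 4, which matches the "did collapsing but forgot to modify skb length" impression; `collapse_fixed()` keeps the two in step.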

2003-01-28 23:53:59

by Alexey Kuznetsov

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x,

Hello!

> BTW, how come tcp_trim_head() can just set skb->ip_summed
> blindly to CHECKSUM_HW and not setup skb->csum?

When skb->ip_summed is CHECKSUM_HW skb->csum is ignored and
initialized at the moment when segment is transmitted in
tcp_v*_send_check().

Alexey

2003-01-29 00:00:27

by Alexey Kuznetsov

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x,

Hello!

The proposed fix is enclosed. Please, check.

Alexey

===== net/ipv4/tcp_output.c 1.19 vs edited =====
--- 1.19/net/ipv4/tcp_output.c Fri Oct 25 15:46:21 2002
+++ edited/net/ipv4/tcp_output.c Wed Jan 29 03:07:26 2003
@@ -786,13 +786,13 @@
 		/* Ok. We will be able to collapse the packet. */
 		__skb_unlink(next_skb, next_skb->list);

+		memcpy(skb_put(skb, next_skb_size), next_skb->data, next_skb_size);
+
 		if (next_skb->ip_summed == CHECKSUM_HW)
 			skb->ip_summed = CHECKSUM_HW;

-		if (skb->ip_summed != CHECKSUM_HW) {
-			memcpy(skb_put(skb, next_skb_size), next_skb->data, next_skb_size);
+		if (skb->ip_summed != CHECKSUM_HW)
 			skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size);
-		}

 		/* Update sequence range on original skb. */
 		TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq;
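
The patch keeps the `csum_block_add(skb->csum, next_skb->csum, skb_size)` call on the software-checksum path. As a rough illustration of why that call takes an offset, here is a userspace sketch of combining two Internet checksums: when the second block starts at an odd byte offset, its bytes land in the opposite halves of the 16-bit words, so its partial sum has to be byte-swapped before it is added. The helper names (`fold16`, `sum16`, `swab16`, `toy_block_add`) are assumptions for this sketch, and it works on folded 16-bit big-endian sums rather than the kernel's 32-bit accumulators.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones'-complement sum. */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones'-complement sum of len bytes read as big-endian 16-bit words. */
static uint16_t sum16(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(p != NULL || len == 0);
	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)(p[i] << 8 | p[i + 1]);
	if (len & 1)
		sum += (uint32_t)p[len - 1] << 8;	/* pad the odd byte with zero */
	return fold16(sum);
}

/* Swap the two bytes of a 16-bit value. */
static uint16_t swab16(uint16_t v)
{
	return (uint16_t)(v << 8 | v >> 8);
}

/*
 * Combine the sum of a first block with the sum of a second block that
 * starts `offset` bytes into the combined buffer.
 */
static uint16_t toy_block_add(uint16_t csum, uint16_t csum2, size_t offset)
{
	if (offset & 1)
		csum2 = swab16(csum2);
	return fold16((uint32_t)csum + csum2);
}
```

In this model, summing a 3-byte block and a 4-byte block separately and combining them with offset 3 gives the same value as summing the 7 bytes in one pass.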

2003-01-29 00:37:23

by Sebastian Benoit

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x,

[email protected]([email protected])@2003.01.29 03:09:21 +0000:
> Hello!
>
> The proposed fix is enclosed. Please, check.

okay, this seems to be a solution.
i can't get the ssh session to lock up with this patch.

thanks,
B.
--
Sebastian Benoit <[email protected]>
My mail is GnuPG signed -- Unsigned ones are bogus -- http://www.gnupg.org/
GnuPG 0x5BA22F00 2001-07-31 2999 9839 6C9E E4BF B540 C44B 4EC4 E1BE 5BA2 2F00

I'm not as think as you stoned I am.



2003-01-29 01:20:38

by David Miller

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x,

From: [email protected]
Date: Wed, 29 Jan 2003 02:56:41 +0300 (MSK)

Hey! Interesting thing has just happened, it is the first time when I found
the bug formulating a sentence while writing e-mail not while peering
to code. :-)

Congratulations :-)

Shheit, look into tcp_retrans_try_collapse():

	if (skb->ip_summed != CHECKSUM_HW) {
		memcpy(skb_put(skb, next_skb_size), next_skb->data, nex$
		skb->csum = csum_block_add(skb->csum, next_skb->csum, s$
	}


WHERE IS skb_put and copy when skb->ip_summed==CHECKSUM_HW??!!

So, the fix is move of memcpy() line out of if clause.

Indeed, this bug exists in 2.4 as well of course.

This bug is 2.4.3 vintage :-) It got added as part of initial
zerocopy merge in fact.

Here is 2.4.x version of fix, 2.5.x is identical sans some line
number differences. I will push this all to Linus/Marcelo.

BTW, Alexey, please please explain to me how that trick made
by tcp_trim_head() works. :-) I am talking about how it is
setting ip_summed to CHECKSUM_HW blindly and not even
bothering to set skb->csum correctly.

--- net/ipv4/tcp_output.c.~1~ Tue Jan 28 16:12:39 2003
+++ net/ipv4/tcp_output.c Tue Jan 28 16:14:18 2003
@@ -721,10 +721,9 @@
 		if (next_skb->ip_summed == CHECKSUM_HW)
 			skb->ip_summed = CHECKSUM_HW;

-		if (skb->ip_summed != CHECKSUM_HW) {
-			memcpy(skb_put(skb, next_skb_size), next_skb->data, next_skb_size);
+		memcpy(skb_put(skb, next_skb_size), next_skb->data, next_skb_size);
+		if (skb->ip_summed != CHECKSUM_HW)
 			skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size);
-		}

 		/* Update sequence range on original skb. */
 		TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq;

2003-01-29 03:05:53

by Alexey Kuznetsov

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x,

Hello!

> BTW, Alexey, please please explain to me how that trick made
> by tcp_trim_head() works. :-) I am talking about how it is
> setting ip_summed to CHECKSUM_HW blindly and not even
> bothering to set skb->csum correctly.

skb->csum is not used inside TCP when skb->ip_summed==CHECKSUM_HW:

void tcp_v4_send_check(struct sock *sk, struct tcphdr *th, int len,
		       struct sk_buff *skb)
{
	struct inet_opt *inet = inet_sk(sk);

	if (skb->ip_summed == CHECKSUM_HW) {
		th->check = ~tcp_v4_check(th, len, inet->saddr, inet->daddr, 0);
		skb->csum = offsetof(struct tcphdr, check);

And when pushing segment down to IP, it is initialized to offset of th->check.

So, it is safe to make skb->ip_summed := CHECKSUM_HW any moment when
we are lazy to recalculate checksum. Frankly speaking, it is not very good,
I was confused _a_ _lot_ when seeing wrong checksums on those bogus
zero-length packets in tcpdumps made by Christopher. But saves some
source lines.

Alexey
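
The split of work described above can be modelled in userspace. This is a sketch under assumptions, not the kernel's or any driver's code: `stack_prepare()` plays the role of the CHECKSUM_HW branch (it seeds the checksum field with the pseudo-header sum and reports the field's offset), and `device_finish()` is a hypothetical stand-in for the device, which is assumed to sum the segment from the start of the TCP header and store the complemented result at that offset. All helper names are invented for illustration, and values are handled as explicit big-endian bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHECK_OFFSET 16	/* offset of the checksum field in a TCP header */

/* Fold a 32-bit accumulator into a 16-bit ones'-complement sum. */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Add len bytes, read as big-endian 16-bit words, to an accumulator. */
static uint32_t sum_bytes(const uint8_t *p, size_t len, uint32_t sum)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)(p[i] << 8 | p[i + 1]);
	if (len & 1)
		sum += (uint32_t)p[len - 1] << 8;
	return sum;
}

/* Sum of the TCP pseudo-header: addresses, protocol 6, segment length. */
static uint32_t pseudo_sum(uint32_t saddr, uint32_t daddr, uint16_t tcp_len)
{
	return (saddr >> 16) + (saddr & 0xffff) +
	       (daddr >> 16) + (daddr & 0xffff) + 6 + tcp_len;
}

/* Full software checksum; the checksum field in seg must be zero. */
static uint16_t sw_checksum(const uint8_t *seg, size_t len,
			    uint32_t saddr, uint32_t daddr)
{
	uint32_t sum = pseudo_sum(saddr, daddr, (uint16_t)len);

	return (uint16_t)~fold16(sum_bytes(seg, len, sum));
}

/* Stack side: seed the field with the pseudo-header sum, return its offset. */
static size_t stack_prepare(uint8_t *seg, size_t len,
			    uint32_t saddr, uint32_t daddr)
{
	uint16_t seed = fold16(pseudo_sum(saddr, daddr, (uint16_t)len));

	assert(len >= CHECK_OFFSET + 2);
	seg[CHECK_OFFSET] = (uint8_t)(seed >> 8);
	seg[CHECK_OFFSET + 1] = (uint8_t)seed;
	return CHECK_OFFSET;
}

/* Device side (assumed behaviour): sum the segment, store the complement. */
static void device_finish(uint8_t *seg, size_t len, size_t csum_offset)
{
	uint16_t c = (uint16_t)~fold16(sum_bytes(seg, len, 0));

	seg[csum_offset] = (uint8_t)(c >> 8);
	seg[csum_offset + 1] = (uint8_t)c;
}
```

Because the seed already sits in the checksum field when the device sums the segment, the stored result should equal what `sw_checksum()` computes over the same bytes with a zeroed field. Nothing the stack computed earlier is consulted, which is the reason a stale `skb->csum` is harmless in that mode.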

2003-01-29 04:02:50

by Christopher Faylor

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x,

On Wed, Jan 29, 2003 at 01:46:42AM +0100, Sebastian Benoit wrote:
>[email protected]([email protected])@2003.01.29 03:09:21 +0000:
>>The proposed fix is enclosed. Please, check.
>
>okay, this seems to be a solution. i can't get the ssh session to lock
>up with this patch.

Ditto for me.

Thank you!
cgf

2003-01-29 06:55:30

by David Miller

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x,

From: [email protected]
Date: Wed, 29 Jan 2003 03:09:21 +0300 (MSK)

The proposed fix is enclosed. Please, check.

Installed locally and I will propagate everywhere as soon
as possible.

Thanks a lot Alexey.

2003-01-29 07:36:19

by David Miller

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x,

From: [email protected]
Date: Wed, 29 Jan 2003 06:14:55 +0300 (MSK)

skb->csum is not used inside TCP when skb->ip_summed==CHECKSUM_HW:
...
So, it is safe to make skb->ip_summed := CHECKSUM_HW any moment when
we are lazy to recalculate checksum.

I see, clever trick as I had suspected.

Thanks for the explanation.

2003-01-29 14:03:45

by David C Niemi

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x,


On Tue, 28 Jan 2003, David S. Miller wrote:
> From: [email protected]
> Date: Wed, 29 Jan 2003 02:56:41 +0300 (MSK)
>
> Hey! Interesting thing has just happened, it is the first time when I
> found the bug formulating a sentence while writing e-mail not while
> peering to code. :-)
>
> Congratulations :-)

Just to confirm, this fix works for me as well.

...
> Indeed, this bug exists in 2.4 as well of course.
>
> This bug is 2.4.3 vintage :-) It got added as part of initial
> zerocopy merge in fact.

Odd, then, that I was unable to reproduce the SSH hangs under 2.4.18
even once, despite heavily using it for several days under the same
circumstances. Is there any reason 2.4.x would be better able to recover?
2.5.59 with the fix seems to feel a bit less balky than 2.4.18 without the
fix, so it seemed to me that 2.4.18 had some way of recovering at the cost
of a several second pause in the session.

DCN


2003-01-29 14:15:48

by Alexey Kuznetsov

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x,

Hello!

> Odd, then, that I was unable to reproduce the SSH hangs under 2.4.18

The bug is there, but it cannot be triggered with ssh.
In 2.4 it can happen only on sockets which use sendfile().

Alexey

2003-02-02 15:34:17

by Bill Davidsen

Subject: Re: [TEST FIX] Re: SSH Hangs in 2.5.59 and 2.5.55 but not 2.4.x,

On Wed, 29 Jan 2003, David C Niemi wrote:

>
> On Tue, 28 Jan 2003, David S. Miller wrote:
> > From: [email protected]
> > Date: Wed, 29 Jan 2003 02:56:41 +0300 (MSK)
> >
> > Hey! Interesting thing has just happened, it is the first time when I
> > found the bug formulating a sentence while writing e-mail not while
> > peering to code. :-)
> >
> > Congratulations :-)
>
> Just to confirm, this fix works for me as well.
>
> ...
> > Indeed, this bug exists in 2.4 as well of course.
> >
> > This bug is 2.4.3 vintage :-) It got added as part of initial
> > zerocopy merge in fact.
>
> Odd, then, that I was unable to reproduce the SSH hangs under 2.4.18
> even once, despite heavily using it for several days under the same
> circumstances. Is there any reason 2.4.x would be better able to recover?
> 2.5.59 with the fix seems to feel a bit less balky than 2.4.18 without the
> fix, so it seemed to me that 2.4.18 had some way of recovering at the cost
> of a several second pause in the session.

The problem which I have been seeing with some regularity is not the hang
you describe (I see that infrequently) but rather a hang after I exit an
ssh connection. I open several dozen windows at a time to a cluster when I
do admin, and when I close almost always at least one doesn't drop without
"~." to help. So far in an hour I haven't seen that.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.