2001-02-27 10:07:10

by Russell King

[permalink] [raw]
Subject: rsync over ssh on 2.4.2 to 2.2.18

Hi,

I'm seeing odd behaviour with rsync over ssh between two x86 machines -
one if the is an UP PIII (Katmai) running 2.4.2 (isdn-gw) and the other
is an UP Pentium 75-200 (pilt-gw) running 2.2.15pre13 with some custom
serial driver hacks (for running Amplicon cards with their ISA interrupt-
sharing scheme).

Please note: although I am using 2.2.15pre13, it is _not_ the cause of
this problem, and I don't particularly want to touch the 2.2.15pre13
machine at the present time, which has been happy for the past year or
so.

netstat on isdn-gw shows the following:

Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 72868 0 isdn-gw.piltdown.a:1023 pilt-gw.piltdown.at:ssh ESTABLISHED

netstat on pilt-gw shows the following:

Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 14372 pilt-gw.piltdown.at:ssh isdn-gw.piltdown.a:1023 ESTABLISHED

Looking on the network reveals:

09:49:23.215852 P pilt-gw.ssh > isdn-gw.1023: . 1119640864:1119640864(0) ack 1175090336 win 31856 <nop,nop,timestamp 114834687 8389267> (DF) [tos 0x8]
09:49:23.216095 P isdn-gw.1023 > pilt-gw.ssh: . 1:1(0) ack 1 win 0 <nop,nop,timestamp 8401267 114186976> (DF) [tos 0x8]

So isdn-gw is saying it can't accept more data, which ties up with what
netstat is indicating. Note that this situation has persisted for around
12 hours with neither end moving forward.

When I straced the ssh process on isdn-gw this morning, things start moving,
and in this instance everything completed normally, even after I stopped
stracing ssh. However, there are times when I have to leave strace running
to get the rsync to complete.

select(4, [3], [3], NULL, NULL) = 2 (in [3], out [3])
read(3, "\0\0\20\t\215R\241\355\201_o\227\4\371\340+\333_\333\34"..., 8192) = 8192
write(3, "\370\325|\355\232\256\247Kf\202\v\272\334@\231\317]\322"..., 135940) = 24616

On this instance, it appears that sshd has gotten stuck in select().
Maybe there is a race condition or missing wakeup in the TCP code?

Note that this problem also exists on 2.4.0 (I upgraded from 2.4.0 to
2.4.2 yesterday to see if the behaviour has changed).

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html


2001-02-27 10:19:10

by Rasmus Andersen

[permalink] [raw]
Subject: Re: rsync over ssh on 2.4.2 to 2.2.18

On Tue, Feb 27, 2001 at 10:02:11AM +0000, Russell King wrote:
> Hi,
>
> I'm seeing odd behaviour with rsync over ssh between two x86 machines -
> one if the is an UP PIII (Katmai) running 2.4.2 (isdn-gw) and the other
> is an UP Pentium 75-200 (pilt-gw) running 2.2.15pre13 with some custom
> serial driver hacks (for running Amplicon cards with their ISA interrupt-
> sharing scheme).

[...]

I have seen behaviour similar to this (descibed in
http://marc.theaimsgroup.com/?l=linux-kernel&m=98262067309185&w=2)
locally on a 2.4.X machine where the rsync completes but fails to
terminate, apparently because the child does not receive the KILLUSR1
(wild speculation)? Anyways, the parent process waits in wait4 and
the child loops, waiting for the signal. This is not reproducable
in 2.2.X (for me).

I have found that in order to reproduce this problem I need to
make rsync do a certain amount of work. Ie., rsync linux/drivers
new-dir will make rsync hang, but rsync linux/drivers/[a-p]* does
not. Does the same thing apply for you? stracing the rsync does
make it all work if I specify that rsync should trace the child,
too.


Regards,
Rasmus

2001-02-27 10:51:45

by David Miller

[permalink] [raw]
Subject: Re: rsync over ssh on 2.4.2 to 2.2.18


Russell King writes:
> Please note: although I am using 2.2.15pre13, it is _not_ the cause of
> this problem

How do you know this? There are so many deadly TCP bugs fixed
since 2.2.15pre13 I don't know how you can assert this.

Later,
David S. Miller
[email protected]

2001-02-27 10:59:05

by bert hubert

[permalink] [raw]
Subject: Re: rsync over ssh on 2.4.2 to 2.2.18

On Tue, Feb 27, 2001 at 11:18:39AM +0100, Rasmus Andersen wrote:

> terminate, apparently because the child does not receive the KILLUSR1
> (wild speculation)? Anyways, the parent process waits in wait4 and
> the child loops, waiting for the signal. This is not reproducable
> in 2.2.X (for me).

rsync problems of this type abound across several UNIX platforms - I had
problems like this on FreeBSD, but never under Linux. I suspect that rsync
bends or at least stretches certain limits.

Regards,

bert

--
http://www.PowerDNS.com Versatile DNS Services
Trilab The Technology People
'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet

2001-02-27 13:22:02

by Anton Blanchard

[permalink] [raw]
Subject: Re: rsync over ssh on 2.4.2 to 2.2.18


Hi,

> I'm seeing odd behaviour with rsync over ssh between two x86 machines -
> one if the is an UP PIII (Katmai) running 2.4.2 (isdn-gw) and the other
> is an UP Pentium 75-200 (pilt-gw) running 2.2.15pre13 with some custom
> serial driver hacks (for running Amplicon cards with their ISA interrupt-
> sharing scheme).

What version of ssh are you using? Older versions would use blocking IO
which would result in deadlocks (and angry emails wrongly blaming rsync :)

Anton

2001-02-27 16:49:34

by Russell King

[permalink] [raw]
Subject: Re: rsync over ssh on 2.4.2 to 2.2.18

Anton Blanchard writes:
> > I'm seeing odd behaviour with rsync over ssh between two x86 machines -
> > one if the is an UP PIII (Katmai) running 2.4.2 (isdn-gw) and the other
> > is an UP Pentium 75-200 (pilt-gw) running 2.2.15pre13 with some custom
> > serial driver hacks (for running Amplicon cards with their ISA interrupt-
> > sharing scheme).
>
> What version of ssh are you using? Older versions would use blocking IO
> which would result in deadlocks (and angry emails wrongly blaming rsync :)

Note that I proved that it was sitting in select(), and therefore can't be
blocking.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2001-02-27 20:35:43

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: rsync over ssh on 2.4.2 to 2.2.18

Hello!

> netstat on isdn-gw shows the following:
>
> Proto Recv-Q Send-Q Local Address Foreign Address State
> tcp 72868 0 isdn-gw.piltdown.a:1023 pilt-gw.piltdown.at:ssh ESTABLISHED
plus
> select(4, [3], [3], NULL, NULL) = 2 (in [3], out [3])


> Maybe there is a race condition or missing wakeup in the TCP code?

Moreover, even not _one_ wakeup is missing. At least two, because
wakeups in read and write are separate and you have stuck in both directions.
8)8)

Well, if it was one I would start to dig ground inside tcp instantly.
But as soon as two of them are missing, I have to suspect wake_up itself.
At least, we had such bugs there until 2.4.0.

Alexey

2001-02-27 20:36:37

by Russell King

[permalink] [raw]
Subject: Re: rsync over ssh on 2.4.2 to 2.2.18

David S. Miller writes:
> How do you know this? There are so many deadly TCP bugs fixed
> since 2.2.15pre13 I don't know how you can assert this.

Through the evidence I gave.

Firstly, if the receiving side is saying that it has a window of 0, then
it is not going to accept any more data.

Secondly, the receiving side has data waiting in the receive queue.

Thirdly, the receiving process is selecting on the socket, and dispite
there being data waiting, select is saying that there is no data waiting.

All the data is pointing at the 2.4 kernel as being the culprit.

I'm surprised at your response given the amount of hard evidence I gave that
supports my assertion.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2001-02-27 22:56:35

by Russell King

[permalink] [raw]
Subject: Re: rsync over ssh on 2.4.2 to 2.2.18

[email protected] writes:
> Moreover, even not _one_ wakeup is missing. At least two, because
> wakeups in read and write are separate and you have stuck in both directions.
> 8)8)

I'll see if I can strace it from the start until it hangs tomorrow.

> Well, if it was one I would start to dig ground inside tcp instantly.
> But as soon as two of them are missing, I have to suspect wake_up itself.
> At least, we had such bugs there until 2.4.0.

I was running at one point a 2.4.0-test kernel, but I didn't see these
effects back then (but then I wasn't monitoring the system as closely as
I am at the moment).

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2001-02-27 23:11:35

by Eugene Crosser

[permalink] [raw]
Subject: Re: rsync over ssh on 2.4.2 to 2.2.18

In article <[email protected]>,
Russell King <[email protected]> writes:

> I'm seeing odd behaviour with rsync over ssh between two x86 machines -

I've seen hanging rsync over ssh more than once, while sending much data
from an x86 running Linux (late 2.3.x) to Sparc/Solaris2.5.1 over ssh
1.2.26. I had a feeling that it was triggered by certain data patterns
because it often stopped at the same spot after restart (and therefore
I attributed it to a bug in rsync). I did not investigate further.

Eugene

2001-02-28 20:21:09

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: rsync over ssh on 2.4.2 to 2.2.18

Hello!

> I've seen hanging rsync over ssh more than once, while sending much data
> from an x86 running Linux (late 2.3.x) to Sparc/Solaris2.5.1

I remember this your report. However, recent news force to suspect
that the reason was in Solaris yet. Actually, if you send tcpdump of
failed session, this question can be answered.


Also REMEMBER!

rsync has __no__ chances to work, if one of sides is Linux-2.2 before 2.2.17
or Solaris of today (sunos-2.5.1 is enough old, probably, it still was right
unlike subsequent kernels). These stacks have conciding set of fatal bugs,
which prevent transmits with closed window.

To use rsync it is necessary to upgrade to >=2.2.17.

Alexey

2001-02-28 20:27:49

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: rsync over ssh on 2.4.2 to 2.2.18

Hello!

> I'll see if I can strace it from the start until it hangs tomorrow.

Please...

Also, try to make binary tcpdump.


> I was running at one point a 2.4.0-test kernel, but I didn't see these

Yes, it did not result in full stall. Lost wakeups were recovered
f.e. by any keyboard activity. 8)

Wow! 2.2.15? rsync surely does work with 2.2.15. The first 2.2 fixing
the same bugs, which solaris has now, was 2.2.17.

Alexey

2001-03-19 08:07:16

by Russell King

[permalink] [raw]
Subject: Re: rsync over ssh on 2.4.2 to 2.2.18

On Wed, Feb 28, 2001 at 11:20:22PM +0300, [email protected] wrote:
> I remember this your report. However, recent news force to suspect
> that the reason was in Solaris yet. Actually, if you send tcpdump of
> failed session, this question can be answered.

Well, since I moved the rsync to 5pm, and then back to 9pm, I haven't
seen this problem - everything is again working as expected (touch wood)
with 2.2.15pre13 and 2.4.0.

This is odd, since it wasn't a one-off problem, but something that happened
each and every day of a particular week. Anyway, if it starts happening
again, I'll get a tcpdump of the session.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2001-03-19 18:58:10

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: rsync over ssh on 2.4.2 to 2.2.18

Hello!

> Well, since I moved the rsync to 5pm, and then back to 9pm, I haven't
> seen this problem - everything is again working as expected (touch wood)
> with 2.2.15pre13 and 2.4.0.
>
> This is odd, since it wasn't a one-off problem, but something that happened
> each and every day of a particular week. Anyway, if it starts happening
> again, I'll get a tcpdump of the session.

Well... I can reproduce this not depending of day of week now. :-(

If I understood Andrew's mail correctly, rsync freezes when
large amount of errors happen. Particularly, here ssh always freezes
trying to write to stdout (pipe linked to rsync) some chunk of data (>64K),
consisting of reports sort of "stat(usr/X11R6/bin/xmbind) : No such file"
(my /usr has huge amount of stray symlinks...). rsync does not read
this pipe, trying to write some more data to pipe, which is
pipe linked to rsync stdin. Dead lock.

Note, that doing strace you do not see this write()! write() is interrupted
by strace and you see succesful select(). You can see real place of lockup
via ps axl or starting strace before lockup happened.

Andrew said about one more common case, when large amount of errors happens:
wrong permission on some target directory.

I have impression that long ago I observed the same affect,
when disk space at target exhausted.

Why do I tall this? Well, probably, something changed at fs,
wich you rsync and rsync generates less errors.

Alexey

2001-03-23 12:42:25

by Rusty Russell

[permalink] [raw]
Subject: Re: rsync over ssh on 2.4.2 to 2.2.18

In message <[email protected]> you write:
> If I understood Andrew's mail correctly, rsync freezes when
> large amount of errors happen. Particularly, here ssh always freezes

Known hard-to-fix bug in rsync; too many errors in the pipe, and it
locks up.

Rusty.
--
Premature optmztion is rt of all evl. --DK