I have a fairly repeatable rsync over ssh stall that I'm seeing between
two Linux boxes, both running identical 2.4.1 kernels. The stall is
fairly easy to repeat in our environment -- it can happen up to several
times per minute, and usually happens at least once per minute. It
doesn't really seem to be data-sensitive. The stall will last until the
session times out *unless* I take one of two steps to "unstall" it. The
easiest way to do this is to run 'strace -p $PID' against the sending ssh
process. As soon as the strace is started, rsync starts working again,
but will stall again (even with strace still running) after a short period
of time.
We've seen this bug (or a *very* similar one) with 2.2.16 and 2.4.[01]. I
haven't tried a newer 2.2.x or 2.4.2 or -acX.
One system is a P2/400, the other is a P3/800. The two boxes are
communicating over a mostly idle Ethernet, through 3 switches. One end is
an EEPro 100, the other end is an Acenic, although that shouldn't matter.
During a stall, the sending end shows a lot of data stuck in the Recv-Q:
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp    72848      0 ref.lab.ocp.interna:840 ref-0.sys.pnap.net:ssh  ESTABLISHED
The receiving end shows a similar problem, but on the sending queue:
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0  28960 ref-0.sys.pnap.net:ssh  ref.lab.ocp.interna:840 ESTABLISHED
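For anyone who wants to watch these two counters from inside a program instead of through netstat, here is a minimal sketch. It assumes a Linux system with the SIOCINQ and SIOCOUTQ ioctls from <linux/sockios.h>; the helper name socket_queue_depths is made up for illustration and is not part of ssh, rsync or netstat. The numbers it returns are the kernel's view of unread and still-queued data for one descriptor, which is roughly what the Recv-Q and Send-Q columns above report for a TCP socket.

```c
#include <linux/sockios.h>   /* SIOCINQ, SIOCOUTQ (Linux-specific) */
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: report how much data is queued on a socket.
 * *recvq receives the amount of data the kernel has accepted but the
 * application has not read yet; *sendq receives the kernel's count of
 * data still held on the sending side.
 * Returns 0 on success, -1 if either ioctl fails (errno is set). */
int socket_queue_depths(int fd, int *recvq, int *sendq)
{
    if (ioctl(fd, SIOCINQ, recvq) < 0)
        return -1;
    if (ioctl(fd, SIOCOUTQ, sendq) < 0)
        return -1;
    return 0;
}
```

A large and non-shrinking receive count on a descriptor that the process is supposed to be reading is the same symptom as the 72848 bytes sitting in Recv-Q above.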
Like I said, I don't believe that this is a network issue, because I can
un-stall the rsync by either stracing the *sending* ssh process, or by
putting the sending rsync into the background with ^Z and then popping it
back into the foreground. I have tcpdumps that I can send, but they look
pretty straightforward to me -- the window fills, so data stops flowing.
Strace doesn't seem to be particularly informative:
<blocked, strace starts>
select(4, [0], [1], NULL, NULL) = 1 (out [1])
write(1, "xxxxxxxxxxxxx"..., 66156) = 66156
...
select(4, [0], [1 3], NULL, NULL) = 2 (out [1 3])
write(1, "\0\0\0\0\274\2\0\0\0\0\0\0\271\30\0\0\0\0\0\0\274\2\0\0"..., 69526
<blocked again>
Strace on the receiving end shows the obvious -- it's sitting in select
waiting for data to arrive.
According to 'ps l', the ssh process is waiting in 'sock_wait_for_wmem'.
We've tried changing versions of rsync and ssh without any success. FWIW,
this kernel was compiled with GCC 2.95.2, from Debian potato.
Scott
On Thu, Mar 01, 2001 at 04:41:01PM -0800, Scott Laird wrote:
> I have a fairly repeatable rsync over ssh stall that I'm seeing between
> two Linux boxes, both running identical 2.4.1 kernels. The stall is
> fairly easy to repeat in our environment -- it can happen up to several
> times per minute, and usually happens at least once per minute. It
> doesn't really seem to be data-sensitive. The stall will last until the
> session times out *unless* I take one of two steps to "unstall" it. The
> easiest way to do this is to run 'strace -p $PID' against the sending ssh
> process. As soon as the strace is started, rsync starts working again,
> but will stall again (even with strace still running) after a short period
> of time.
>...
> According to 'ps l', the ssh process is waiting in 'sock_wait_for_wmem'.
I've also reported this recently, and got told that it was because I was
running 2.2.15pre13 on one end. Thanks for confirming that 2.2.15pre13
is not the cause.
--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html
Hello!
> I've also reported
The report by Scott Laird is sane, unlike yours.
It can be explained by a bug rather than only by a poltergeist. 8)
> Thanks for confirming that 2.2.15pre13 is not the cause.
Russel, you are warned that kernels<2.2.17 and rsync is an incompatible
combination.
Alexey
[email protected] writes:
> Russel, you are warned that kernels<2.2.17 and rsync is an incompatible
> combination.
So, what you're saying is that because these kernels have known problems
with rsync, the fact that my symptoms on 2.4.0 are 100% _precisely_ the
same means it's not the same bug?
In addition, the fact that the tcp _retries_ indicate that both sides
are behaving correctly _in this instance_ means that it's not a 2.4 bug?
If you still insist that it is purely a 2.2.15pre13 bug despite the
growing evidence against this, then I shall see if I can get everything
together to put 2.2.18 on this machine. I can't guarantee when I'll
be able to do this though.
Also, as I pointed out, since the machines are 40+ miles away for
most of the week, and are without a reasonable net connection, I
can only comment on what is _currently_ running, and I thought it at
least useful to indicate that both my and Scott's symptoms are identical.
PS, could you please spell my name correctly?
PPS, rather than arguing about this, can people proceed to investigate
Scott's problem, and I'll "tag along" to see if my problem gets fixed.
Thanks.
--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html
On Fri, Mar 02, 2001 at 04:31:07PM +0000, Russell King wrote:
> [email protected] writes:
> > Russel, you are warned that kernels<2.2.17 and rsync is an incompatible
> > combination.
>
> So, what you're saying is that because these kernels have known problems
> with rsync, the fact that my symptoms on 2.4.0 are 100% _precisely_ the
> same means it's not the same bug?
Well, I can tell you that going from a 2.4.2pre2 sparc64 box via rsync
over ssh to a 2.4.2 or 2.4.1-pre8 i686 gives me the same problems.
However with slight differences. With the 2.4.1-pre8 kernel on the i686
I see "protocol error, different version of rsync?", and with the 2.4.2
kernel I get segv's in the remote rsync (I'm running the rsync -e ssh
from the sparc64).
Both systems are running IDE, ext2 only on both, no special config
options (pretty bare to be honest).
So no, this is not a 2.2.x interaction bug.
--
-----------=======-=-======-=========-----------=====------------=-=------
/ Ben Collins -- ...on that fantastic voyage... -- Debian GNU/Linux \
` [email protected] -- [email protected] -- [email protected] '
`---=========------=======-------------=-=-----=-===-======-------=--=---'
Hello!
> same means it's not the same bug?
It is the same, I think.
> If you still insist that it is purely a 2.2.15pre13 bug
I never said this. I said that your strace is _wrong_, how can I be
sure that tcpdump is not wrong too? You could understand this. 8)
> together to put 2.2.18 on this machine. I can't guarantee when I'll
> be able to do this though.
You planned to make more accurate strace on Monday, if I remember correctly.
Now it is not necessary, Scott's one is enough to understand that
some problem exists and cannot be explained by buggy 2.2.15.
> PS, could you please spell my name correctly?
My apologies.
Alexey
On Fri, 2 Mar 2001 [email protected] wrote:
> > together to put 2.2.18 on this machine. I can't guarantee when I'll
> > be able to do this though.
>
> You planned to make more accurate strace on Monday, if I remember correctly.
> Now it is not necessary, Scott's one is enough to understand that
> some problem exists and cannot be explained by buggy 2.2.15.
One data point on my hang -- I increased
/proc/sys/net/core/wmem_{max,default} from 64k to 256k, and then increased
/proc/sys/net/ipv4/tcp_wmem from "4096 16384 131072" to "16384 65536
262144", and the hangs seem to have either stopped or (more likely)
drastically reduced in frequency. I was able to rsync a couple GB without
stalling.
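To record what a box is actually running with before and after a change like this, a minimal sketch follows. It assumes the "min default max" format quoted above for /proc/sys/net/ipv4/tcp_wmem; the function names parse_wmem_triple and read_wmem_triple are made up for illustration. It only reads the values and does not change any setting.

```c
#include <stdio.h>

/* Hypothetical helper: parse a "min default max" triple in the format
 * used by /proc/sys/net/ipv4/tcp_wmem (whitespace-separated numbers).
 * Returns 0 on success, -1 if the text does not hold three numbers. */
int parse_wmem_triple(const char *text, long out[3])
{
    if (sscanf(text, "%ld %ld %ld", &out[0], &out[1], &out[2]) != 3)
        return -1;
    return 0;
}

/* Hypothetical helper: read the first line of a proc file such as
 * /proc/sys/net/ipv4/tcp_wmem and parse it.
 * Returns 0 on success, -1 on any read or parse error. */
int read_wmem_triple(const char *path, long out[3])
{
    char line[128];
    FILE *fp = fopen(path, "r");
    int got_line;

    if (fp == NULL)
        return -1;
    got_line = fgets(line, sizeof line, fp) != NULL;
    fclose(fp);
    return got_line ? parse_wmem_triple(line, out) : -1;
}
```

Logging the triple on both ends alongside each test run makes it easier to tell whether a disappearing stall is tied to the buffer sizes or to something else that changed.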
I can perform more tests, if anyone has anything in particular that they'd
like to see.
Scott
On Fri, Mar 02, 2001 at 10:12:36AM +0000, Russell King wrote:
> On Thu, Mar 01, 2001 at 04:41:01PM -0800, Scott Laird wrote:
> > I have a fairly repeatable rsync over ssh stall that I'm seeing between
> > two Linux boxes, both running identical 2.4.1 kernels. The stall is
> > fairly easy to repeat in our environment -- it can happen up to several
> > times per minute, and usually happens at least once per minute. It
> > doesn't really seem to be data-sensitive. The stall will last until the
> > session times out *unless* I take one of two steps to "unstall" it. The
> > easiest way to do this is to run 'strace -p $PID' against the sending ssh
> > process. As soon as the strace is started, rsync starts working again,
> > but will stall again (even with strace still running) after a short period
> > of time.
> >...
> > According to 'ps l', the ssh process is waiting in 'sock_wait_for_wmem'.
>
> I've also reported this recently, and got told that it was because I was
> running 2.2.15pre13 on one end. Thanks for confirming that 2.2.15pre13
> is not the cause.
>
Be very careful here. He did nothing of the sort. He merely indicated that
there is at least one problem running rsync over ssh between 2.4.1 systems.
There is no guarantee that your problem and his are identical. As Alexey
pointed out, there are bad bugs in 2.2.15 which can cause a TCP connection to
get stuck. Given that you are running 2.2.15, you'd need a tcpdump to
determine whether you hit one of these or not.
I've been bitten too many times assuming something was one big problem only
to find out later it was actually several smaller ones.
Regards,
Tim
--
Tim Wright - [email protected] or [email protected] or [email protected]
IBM Linux Technology Center, Beaverton, Oregon
Interested in Linux scalability ? Look at http://lse.sourceforge.net/
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI
Hello!
> this kernel was compiled with GCC 2.95.2,
This is a hint.
Could you do the following:
1. Disassemble tcp_poll() (the easiest way is to run gdb on vmlinux, type
x/i tcp_poll and hold Enter long enough, copying the screen to a file)
and send the result to me.
2. Apply the enclosed patchlet.
3. If 2 does not change anything, recompile with egcs-1.1.2.
Alexey
--- ../vger3-010223/linux/net/ipv4/tcp.c Fri Feb 23 21:28:34 2001
+++ linux/net/ipv4/tcp.c Sat Mar 3 18:37:22 2001
@@ -442,6 +443,8 @@
set_bit(SOCK_ASYNC_NOSPACE, &sk->socket->flags);
set_bit(SOCK_NOSPACE, &sk->socket->flags);
+ barrier();
+
/* Race breaker. If space is freed after
* wspace test but before the flags are set,
* IO signal will be lost.
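For readers who do not have the kernel source to hand, the pattern that the comment in this hunk describes can be sketched in user space as follows. This is only an illustration and not the kernel code: struct fake_sock, poll_writable and compiler_barrier are made-up names, and the re-test after the barrier is an assumption drawn from the "Race breaker" comment. The macro uses the GCC inline-asm form of a compiler-only barrier, which stops the compiler from reordering memory accesses across it but emits no CPU fence.

```c
#include <stdbool.h>

/* Compiler-only barrier (GCC/Clang syntax): memory accesses may not be
 * moved across this point by the compiler. No instruction is emitted. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical stand-in for the socket state involved. */
struct fake_sock {
    int free_space;   /* write space currently available */
    int min_space;    /* space needed before the socket counts as writable */
    bool nospace;     /* "wake me when space is freed" flag */
};

/* Returns true if the socket is writable now. Otherwise sets the nospace
 * flag so that whoever frees space knows to send a wakeup, then tests
 * again in case the space was freed between the first test and the
 * flag being set. */
bool poll_writable(struct fake_sock *sk)
{
    if (sk->free_space >= sk->min_space)
        return true;

    sk->nospace = true;

    /* Keep the flag store ahead of the re-test in the generated code. */
    compiler_barrier();

    return sk->free_space >= sk->min_space;
}
```

If the compiler were to hoist the second test above the flag store, space freed in between would neither be seen by the re-test nor trigger a wakeup, and the writer would sleep with nobody left to wake it.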
Notice also that by default ssh opens stdin/stdout blocking, and can
relatively easily deadlock if the pipes it talks over really want to do
a write before a read or the other way round.
You can try compiling the following file, putting it in the same directory
as ssh, and then running rsync over this instead of plain ssh (in fact I
use it in all places where I connect to ssh over pipes).
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#ifndef HAVE_NO_UNISTD_H
# include <unistd.h>
#endif /* HAVE_NO_UNISTD_H */
#include <fcntl.h>

static char ssh[] = "ssh";

/* Put the descriptor behind fp into non-blocking mode.
 * Terminals are left alone. Returns 0 on success, 1 on failure. */
int unblock(FILE *fp) {
    int fd, rc, flags;

    fd = fileno(fp);
    if (isatty(fd)) return 0;
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) {
        fprintf(stderr, "Could not query fd %d: %s\n", fd, strerror(errno));
        return 1;
    }
    rc = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    if (rc < 0) {
        fprintf(stderr, "Could not unblock fd %d: %s\n", fd, strerror(errno));
        return 1;
    }
    return 0;
}

int main(int argc, char **argv) {
    int rc;
    char *ptr, *work;

    if (unblock(stdin)) return 1;
    if (unblock(stdout)) return 1;
    if (unblock(stderr)) return 1;

    /* Find the last path component of argv[0] ... */
    ptr = strrchr(argv[0], '/');
    if (ptr == NULL) ptr = argv[0];
    else ptr++;

    /* ... and replace it with "ssh", keeping any directory prefix. */
    work = malloc(ptr-argv[0]+sizeof(ssh));
    if (!work) {
        fprintf(stderr, "Out of memory. Buy more ?\n");
        return 1;
    }
    memcpy(work, argv[0], ptr-argv[0]);
    memcpy(work+(ptr-argv[0]), ssh, sizeof(ssh));
    argv[0] = work;

    /* execvp only returns if the exec failed. */
    rc = execvp(work, argv);
    fprintf(stderr, "Could not exec %.300s: %s\n", work, strerror(errno));
    return rc;
}