by Stephen D. Williams

Subject: Rsync SSH session hang, AGAIN - Help! Deadlock debugging needed.

This has been a recurring problem for a couple of years, which I and
others have experienced. I was free of it for a while, but after upgrading
OpenSSL/OpenSSH to avoid the recent exploit it is back and highly
repeatable. It has been persistent enough that I am going to start
with the assumption that it may be a kernel bug, or at least that it is
probably debuggable definitively only by a proficient kernel developer.
We have got to squash this once and for all; SSH is used everywhere and
it needs to be reliable. There is probably a race condition in ssh, as
mentioned below, but it must be subtle.

rsync/ssh transfers from local system to local system work perfectly.
Between the systems, there are nearly always long delays at certain
points and usually a complete hang. After a long period, this often
produces a timeout. These systems are on 100baseT on the same switch.
One system appears to have mild packet loss (400 out of 400,000 on
both send and receive, as frame/carrier errors). BTW, running a cpio
through the SSH connections causes a nearly immediate hang, so it is
unlikely to be a problem with rsync.
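
For scale, the loss figure above (400 errors per 400,000 packets) works out to about 0.1%. A minimal sketch of that arithmetic, assuming the error and packet counts are read by hand from the interface statistics; the helper name error_rate is hypothetical, not part of any tool:

```shell
# error_rate ERRORS PACKETS
# Print interface errors as a percentage of total packets.
# (Hypothetical helper; the counts are supplied by hand.)
error_rate() {
    awk -v e="$1" -v p="$2" 'BEGIN {
        if (p <= 0) { print "n/a"; exit }   # avoid dividing by zero
        printf "%.3f\n", 100 * e / p
    }'
}

error_rate 400 400000   # the counts quoted above
```

Running it against both the send and receive counters before and after a hung transfer would show whether the errors climb during the hang or stay flat.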

Both systems work fine receiving rsync/ssh from my laptop over a 400Kb
DSL connection with:
OpenSSH 3.1p1
openssl 0.9.6c
rsync 2.5.4
gcc 2.96
kernel 2.4.19

(systems are a combination of Suse and Redhat 7.3, upgraded variously by

My standard rsync/ssh script looks like:

brsyncndz (backup rsync no delete or compression):
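# Default to the standard SSH port when PORT is unset or empty.
if [ -z "$PORT" ]; then PORT=22; fi
# "$@" forwards each argument intact, even if it contains spaces.
rsync -vv -HpogDtSxlra --partial --progress --stats -e "ssh -p $PORT" "$@"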
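
A side note on wrappers like this one: how the arguments are forwarded matters when a path contains spaces. Unquoted $* re-splits every argument on whitespace, while "$@" passes each one through unchanged. A small sketch of the difference; count_args, forward_star and forward_at are hypothetical helpers for illustration only, and the paths are made up:

```shell
# Report how many arguments were received.
count_args() { echo "$#"; }

# Forward with unquoted $*: arguments are re-split on whitespace.
forward_star() { count_args $*; }

# Forward with "$@": each original argument stays intact.
forward_at() { count_args "$@"; }

forward_star "my dir/" host:/backup   # prints 3
forward_at "my dir/" host:/backup     # prints 2
```

This is unrelated to the hang itself; it only affects which files rsync is asked to transfer.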

On both sides:

On 'old' system:
gcc 2.95.2
kernel 2.4.3

On 'new' system:
gcc 2.96
kernel 2.4.20-pre8

References to past discussions: (Tried the TCP buffers tuning.)
Haven't tried this code yet:

