I didn't think any of the tools that I know, trust, and have used for
years could be failing me, but I hadn't seen quite this failure mode before.
Even though the 'old' system below was reachable via web, ssh command
line sessions, etc., any connection between it and another box on the
same network at the ISP failed after 100-200KB. While there were a few
ethernet errors, the counters barely moved, if at all, with each ssh
hang. Nevertheless, the culprit was a bad cable or hub/switch at the
ISP. My theory now is that the hub/switch, seeing an unacceptably noisy
port, would shut that port down, but only when burst transfers ran
between local boxes; everything remote was spaced widely enough to stay
below that threshold. After a few seconds of no activity, the port was
apparently unlocked again. I knew that switches did bad-port sensing,
but had never seen a system on the fence like this.
It didn't help that these machines were 500 miles away in a colo with
only daytime staff.
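In hindsight, a quick way to catch this would have been to snapshot the
per-interface error counters around each hang. A minimal sketch, reading
the standard Linux /proc/net/dev counters (the interface name is an
assumption; substitute whatever NIC actually faces the suspect switch
port):

```shell
#!/bin/sh
# Print the error counters for one interface from /proc/net/dev, so the
# output can be compared before and after a transfer that hangs.  The
# interface defaults to "lo" here; pass e.g. "eth0" for the NIC on the
# suspect switch port.
IF=${1:-lo}
awk -v ifc="$IF" '$0 ~ "^ *" ifc ":" {
    sub(/.*:/, "")   # strip the "eth0:" prefix (colon may abut the numbers)
    # fields after the prefix -- rx: bytes pkts errs drop fifo frame ...
    #                            tx: bytes pkts errs drop fifo colls carrier ...
    split($0, f)
    printf "%s rx_errs=%s rx_frame=%s tx_errs=%s tx_carrier=%s\n",
           ifc, f[3], f[6], f[11], f[15]
}' /proc/net/dev
```

Run before and after a hanging transfer, the diff would show whether
frame/carrier errors actually tick up with each hang, which is exactly
the correlation I could not establish at the time.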
Apologies for the premature post.
Stephen D. Williams wrote:
> This has been a recurring problem for a couple of years, which I and
> others have experienced. I was free of it for a while, but after
> upgrading OpenSSL/OpenSSH to avoid the recent exploit it is back and
> highly repeatable. It has been persistent enough that I am going to
> start from the assumption that it may be a kernel bug, or at least
> that it can probably be debugged definitively only by a proficient
> kernel developer. We have got to squash this once and for all; SSH is
> used everywhere and it needs to be reliable. There is probably a race
> condition in ssh, as mentioned below, but it must be subtle.
> rsync/ssh transfers from local system to local system work perfectly.
> Between the systems, there are nearly always large delays at certain
> times and usually a complete hang. After a long period, this often
> produces a timeout. These systems are on 100baseT on the same switch.
> One system appears to have mild packet loss (400 errors out of
> 400,000 packets, on both send and receive, reported as frame/carrier
> errors). BTW, running a cpio through the SSH connection causes a
> nearly immediate hang, so the problem is unlikely to be in rsync.
> Both systems work fine receiving rsync/ssh from my laptop over a 400Kb
> DSL connection with:
> OpenSSH 3.1p1
> openssl 0.9.6c
> rsync 2.5.4
> gcc 2.96
> kernel 2.4.19
> (systems are a combination of Suse and Redhat 7.3, upgraded variously
> by hand)
> My standard rsync/ssh script looks like:
> brsyncndz (backup rsync no delete or compression):
> if [ "$PORT" = "" ]; then PORT=22; fi
> rsync -vv -HpogDtSxlra --partial --progress --stats -e "ssh -p $PORT" "$@"
> On both sides:
> On 'old' system:
> gcc 2.95.2
> kernel 2.4.3
> On 'new' system:
> gcc 2.96
> kernel 2.4.20-pre8
> References to past discussions: (Tried the TCP buffer tuning.)
> Haven't tried this code yet: