2007-10-07 16:47:56

by Malte Schröder

Subject: Strange network related data corruption

Hello,
I am encountering some strange data corruption when transferring
data from one of my PCs that I use as a file-server.

on the server:
FILE=<large file>; | cut -d" " -f1 | nc -lp5000 -q0; while nc
-lp5000 -q0 < $FILE; do : ; done

on the client:
H=<server>; SUM=$(nc -q0 $H 5000);sleep 1s; while nc -q0 $H 5000 |
sha1sum | (grep -v $SUM || echo -n .); do sleep 1s ;done

(output looks somewhat like this:
..............6dd5fb1ce29d270acdfbb02d00921bf75d141773 -
...
)

I would expect the sha1sum to be the same in every pass (assuming the
source file does not change). But every few passes (with no apparent
pattern) there is a different sum returned. I first noticed this when
transferring large files (backups) with SMB and NFS (v3 and v4), but
to rule that out I tried netcat in the way noted above.
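
For readability, the per-pass check in the client loop could also be written as a small shell function. This is only a sketch, not the exact one-liner I ran: the name check_pass is made up for illustration, and it assumes sha1sum from coreutils.

```shell
# check_pass EXPECTED (hypothetical helper): hash stdin and print "." when the
# SHA-1 matches EXPECTED, otherwise print the differing sum on its own line.
check_pass() {
    expected=$1
    actual=$(sha1sum | cut -d" " -f1)
    if [ "$actual" = "$expected" ]; then
        printf '.'
    else
        printf '\n%s\n' "$actual"
    fi
}
```

Inside the loop it would take the place of the sha1sum/grep stage, e.g. nc -q0 $H 5000 | check_pass $SUM.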

When I have the server do the sha1sum of the file locally the problem
is not reproducible. When I do this with a small file that easily fits
into the cache the problem stays reproducible.
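
The local check is roughly the following sketch: hash the same file several times on the server and count how often the sum differs from the first pass. The function name count_mismatches is hypothetical, and the number of passes is arbitrary.

```shell
# count_mismatches FILE PASSES (hypothetical helper): hash FILE PASSES times in
# total and print how many passes differ from the first one.
count_mismatches() {
    file=$1
    passes=$2
    reference=$(sha1sum "$file" | cut -d" " -f1)
    bad=0
    i=1
    while [ "$i" -lt "$passes" ]; do
        current=$(sha1sum "$file" | cut -d" " -f1)
        [ "$current" = "$reference" ] || bad=$((bad + 1))
        i=$((i + 1))
    done
    echo "$bad"
}
```

A result of 0 means every pass agreed with the first one, which is what I see locally.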

Another thing I did was to use dd to transfer data in 1GiB chunks from
/dev/zero and generate the sha1sum on the client. There I was not able
to reproduce the problem.
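
As a sketch of that /dev/zero test (the helper name zero_sum is made up here): the function below hashes a given number of MiB of zeros locally. Over the network, the dd output would instead be piped into nc on the server and the sum taken on the client.

```shell
# zero_sum MIB (hypothetical helper): print the SHA-1 of MIB mebibytes of zeros.
zero_sum() {
    dd if=/dev/zero bs=1048576 count="$1" 2>/dev/null | sha1sum | cut -d" " -f1
}
```

Since the input is constant, zero_sum 1024 should print the same sum on every run for a 1GiB chunk.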


The server is an Athlon64 3400+ (good old Clawhammer) with 1GiB RAM. I
use 4 SATA drives in a software RAID5 configuration, attached to a
Promise TX4 300 SATA-II controller. The filesystem is ext3 without
special mount-options. The dist is Debian/Sid for AMD64 with
self-compiled kernel 2.6.23-rc9 (.config attached).

The clients I tried are a Core2Duo 6600 with 3GiB of RAM, also
Debian/Sid AMD64 (kernel 2.6.23-rc9) and a Centrino notebook with
Pentium M and 1GiB of RAM (Debian/Sid i386, kernel 2.6.23-rc7).

All PCs mentioned have gigabit ethernet and are connected via a gigabit
switch.

I tried these tests between the clients and could not reproduce the
problem there.

I had the server run memtest86+ for 20 passes without problems.

I tried several kernel versions on the server (from .18 to .23-rc9), all
showed the problem. I suspect a hardware problem, but I cannot isolate
the part responsible. I tried another ethernet adapter (the 3Com 905C in the
lspci output) and I also tried the onboard SATA controller(s) (2 ports
VIA and 2 ports Promise TX2).

I don't know if this is a kernel problem or just me and my setup, but
maybe someone on this list has an idea where I could look next.

Thanks and regards
Malte
--
---------------------------------------
Malte Schröder
[email protected]
ICQ# 68121508
---------------------------------------


Attachments:
(No filename) (2.46 kB)
config-2.6.23-rc9 (243.00 B)

2007-10-08 13:01:49

by Denys Vlasenko

Subject: Re: Strange network related data corruption

On Sunday 07 October 2007 17:47, Malte Schröder wrote:
> Hello,
> I am encountering some strange data corruption when transferring
> data from one of my PCs that I use as a file-server.
>
> on the server:
> FILE=<large file>; | cut -d" " -f1 | nc -lp5000 -q0; while nc
> -lp5000 -q0 < $FILE; do : ; done

$ cat z
FILE=z; | cut -d" " -f1 | nc -lp5000 -q0; while nc -lp5000 -q0 < $FILE; do : ; done

$ sh z
z: line 1: syntax error near unexpected token `|'
z: line 1: `FILE=z; | cut -d" " -f1 | nc -lp5000 -q0; while nc -lp5000 -q0 < $FILE; do : ; done'

> on the client:
> H=<server>; SUM=$(nc -q0 $H 5000);sleep 1s; while nc -q0 $H 5000 |
> sha1sum | (grep -v $SUM || echo -n .); do sleep 1s ;done
>
> (output looks somewhat like this:
> ..............6dd5fb1ce29d270acdfbb02d00921bf75d141773 -
> ...
> )
>
> I would expect the sha1sum to be the same in every pass (assuming the
> source file does not change). But every few passes (with no apparent
> pattern) there is a different sum returned. I first noticed this when
> transferring large files (backups) with SMB and NFS (v3 and v4), but
> to rule that out I tried netcat in the way noted above.

Does it happen over loopback?

tcpdump / tcpflow may help to see whether there is some corruption
(TCP checksumming should have caught that, but it is worth looking into).
Basically, you wait for "wrong" checksum to appear, then
you stop script and look into tcpdump/tcpflow logs.
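
Once a bad pass has been saved to a file (by tcpflow, or simply by redirecting nc's output), one way to see where it differs from the original is cmp -l. This is only a sketch; the helper names are made up for illustration.

```shell
# first_diffs GOOD BAD [N] (hypothetical helper): print up to N (default 10)
# differing bytes as "offset good_octal bad_octal"; offsets are 1-based.
first_diffs() {
    cmp -l "$1" "$2" | head -n "${3:-10}"
}

# count_diffs GOOD BAD (hypothetical helper): print how many byte positions differ.
count_diffs() {
    cmp -l "$1" "$2" | wc -l | tr -d ' '
}
```

Whether the differences are single flipped bits or whole blocks, and at which offsets they fall, may hint at where the corruption comes from.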
--
vda

2007-10-09 11:26:17

by Malte Schröder

Subject: Re: Strange network related data corruption

On Mon, 8 Oct 2007 14:01:32 +0100
Denys Vlasenko <[email protected]> wrote:

> On Sunday 07 October 2007 17:47, Malte Schröder wrote:
> > Hello,
> > I am encountering some strange data corruption when transferring
> > data from one of my PCs that I use as a file-server.
> >
> > on the server:
> > FILE=<large file>; | cut -d" " -f1 | nc -lp5000 -q0; while nc
> > -lp5000 -q0 < $FILE; do : ; done
>
> $ cat z
> FILE=z; | cut -d" " -f1 | nc -lp5000 -q0; while nc -lp5000 -q0 < $FILE; do : ; done

Sorry, there is a copy'n'paste error in my post, correct line would be:
FILE=z ; sha1sum $FILE | cut -d" " -f1 | nc -lp5000 -q0; while nc -lp5000 -q0 < $FILE; do : ;done

>
> $ sh z
> z: line 1: syntax error near unexpected token `|'
> z: line 1: `FILE=z; | cut -d" " -f1 | nc -lp5000 -q0; while nc -lp5000 -q0 < $FILE; do : ; done'
>
> > on the client:
> > H=<server>; SUM=$(nc -q0 $H 5000);sleep 1s; while nc -q0 $H 5000 |
> > sha1sum | (grep -v $SUM || echo -n .); do sleep 1s ;done
> >
> > (output looks somewhat like this:
> > ..............6dd5fb1ce29d270acdfbb02d00921bf75d141773 -
> > ...
> > )
> >
> > I would expect the sha1sum to be the same in every pass (assuming the
> > source file does not change). But every few passes (with no apparent
> > pattern) there is a different sum returned. I first noticed this when
> > transferring large files (backups) with SMB and NFS (v3 and v4), but
> > to rule that out I tried netcat in the way noted above.
>
> Does it happen over loopback?

I just tried a few times and yes, it also happens on loopback, but
much less frequently. Now I am really confused ...

>
> tcpdump / tcpflow may help seeing whether there is some corruption
> (TCP checksumming should have caught that, but it is worth looking into).
> Basically, you wait for "wrong" checksum to appear, then
> you stop script and look into tcpdump/tcpflow logs.
> --
> vda
>


Attachments:
signature.asc (189.00 B)

2007-10-09 11:57:38

by Denys Vlasenko

Subject: Re: Strange network related data corruption

On Tuesday 09 October 2007 12:25, Malte Schröder wrote:
> > Does it happen over loopback?
>
> I just tried a few times and yes, it also happens on loopback, but
> much less frequently. Now I am really confused ...

Actually, that eliminates a lot of cases.

Run memtest86 overnight ("bad hardware" theory),
try older kernel versions ("kernel bug" theory).
--
vda

2007-10-13 07:55:50

by Malte Schröder

Subject: Re: Strange network related data corruption

On Tue, 9 Oct 2007 12:57:20 +0100
Denys Vlasenko <[email protected]> wrote:

> On Tuesday 09 October 2007 12:25, Malte Schröder wrote:
> > > Does it happen over loopback?
> >
> > I just tried a few times and yes, it also happens on loopback, but
> > much less frequently. Now I am really confused ...
>
> Actually, that eliminates a lot of cases.
>
> Run memtest86 overnight ("bad hardware" theory),

Well, that did not show any problems. But I took the PC apart, removed dust
and so on.
Now it doesn't even boot anymore (i.e. no BIOS, not even a PC speaker beep
when I boot without RAM). So I declare the mainboard dead.

> try older kernel versions ("kernel bug" theory).

Done. But it is the hardware.


Thanks for the advice :)

> --
> vda
>


--
---------------------------------------
Malte Schröder
[email protected]
ICQ# 68121508
---------------------------------------


Attachments:
signature.asc (189.00 B)