Message-ID: <3FFDEB94.4000606@bogonomicon.net>
Date: Thu, 08 Jan 2004 17:45:24 -0600
From: Bryan Andersen <bryan@bogonomicon.net>
Organization: Bogonomicon
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5) Gecko/20031107 Debian/1.5-3
MIME-Version: 1.0
To: Bryan Andersen <bryan@nerdvest.com>
CC: linux-kernel@vger.kernel.org
Subject: Re: 2.4.23+atalib2 sil3112A write errors
References: <3FFD4371.7020309@nerdvest.com>
In-Reply-To: <3FFD4371.7020309@nerdvest.com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3049
Lines: 60

Sorry for the false alarm.  I now suspect hardware errors.  I took
a look at the actual errors introduced into the files.  When I octal
dump both the source file and destination file and diff them, I see
only a short string of bytes changed.  This tells me I'm getting
multibit errors slipping through.  Time to start swapping cables and
hardware.  Is there a way to get the driver to output messages only
when errors are seen, rather than messages for all transfers?

This is an example of the differences seen.

72922,72923c72922,72923
< 4357500 064047 120335 134421 006545 137622 023477 043620 164561
< 4357520 106514 023757 026270 043466 051034 117400 174451 127107
---
> 4357500 064047 120335 134421 006545 136566 035207 015425 133770
> 4357520 012031 012055 171154 045114 015406 000160 017147 035214
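
For anyone wanting to repeat this comparison without od and diff, here
is a minimal Python sketch that reports each contiguous run of differing
bytes between two files.  It is an illustration only, not the procedure
used above: the function name diff_runs, the chunk size, and the 512-byte
sector size used in the printout are all assumptions.  In the od output
above the first four 16-bit words of the first line match, so the run
shown covers at least 24 bytes.

```python
import sys


def diff_runs(src_path, dst_path, chunk_size=1 << 20):
    """Yield (offset, length) for each contiguous run of differing bytes.

    Only the common prefix is compared if the files differ in length.
    """
    offset = 0          # file offset of the start of the current chunk
    run_start = None    # start offset of the run currently open, if any
    with open(src_path, "rb") as src, open(dst_path, "rb") as dst:
        while True:
            a = src.read(chunk_size)
            b = dst.read(chunk_size)
            if not a or not b:
                break
            n = min(len(a), len(b))
            if a[:n] == b[:n]:
                # Fast path: identical chunk closes any open run.
                if run_start is not None:
                    yield run_start, offset - run_start
                    run_start = None
            else:
                for i in range(n):
                    if a[i] != b[i]:
                        if run_start is None:
                            run_start = offset + i
                    elif run_start is not None:
                        yield run_start, offset + i - run_start
                        run_start = None
            offset += n
    if run_start is not None:
        yield run_start, offset - run_start


if __name__ == "__main__" and len(sys.argv) == 3:
    for off, length in diff_runs(sys.argv[1], sys.argv[2]):
        # Octal offset is printed so it can be matched against od output.
        print(f"offset {off} (octal {off:o}): {length} bytes differ, "
              f"{off % 512} bytes into an assumed 512-byte sector")
```

Run it as "python diffruns.py source_file dest_file" (the script name is
hypothetical).  Run lengths and their position within a sector may help
tell a dropped sector from a short burst of corrupted bytes.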

- Bryan

Bryan Andersen wrote:
> I'm seeing silent write errors with two Seagate 160GB drives on a
> sil3112A SATA controller on an ASUS A7N8X-Deluxe motherboard, but I'm
> not seeing any read errors.  Each drive is on its own cable.  Kernel is
> 2.4.23 release with the 2.4.23-libata2 patch and some patches for using
> MythTV applied.  (I also see the same problem under 2.4.24 with only
> libata applied.)  I'm also not seeing any error messages in any log
> files or the kernel dmesg output.  The error rate looks to be around
> 1 in a million blocks.  My current guess is that the block or blocks
> just didn't get written by the drive, since when the system reads in
> the blocks with the bad data I'm not seeing any read error messages.
> 
> I am now running the tests again using jfs as the filesystem rather than 
> ext3 to rule out the file system as the cause.  As part of this test I'm 
> also zeroing the disks before the test and then checking for non zero 
> data and how large the non zero data blocks are.  Given enough time I 
> may run this test a couple of times to see if the positions of the bad 
> data move about or stay put.
> 
> How the first tests were run:
> 
> I used "cp -a * /data10" as root to copy the data (150GB worth) to the
> disk under test.  Then I created md5sum lists for all files in both the
> source and destination, sorted them, and compared the lists.
> Differences were found between the lists: some files and directories
> were missing or corrupted.  On fscking the disks some inodes were found
> corrupted.  I then copied only the corrupted files, and after a few
> iterations I finally got a copy that was the same as the source.  I
> then ran a program to repeatedly md5sum all the files and check them
> against the source.  It did not see any differences in 10 cycles.
> 
> I've attached my .config file, but don't have dmesg output to include. I 
> will grab that when I reboot after the current tests are finished.
> 
> - Bryan
