2012-06-02 09:55:24

by Stefan Richter

Subject: Silent data corruption with kernel 3.4 and FireWire disks

About a week ago I noticed silent data corruptions of files on FireWire
disks: Mount disk, read lots of data and e.g. compute their md5sum,
unmount disk, mount disk again, read and md5sum the same files again ->
MD5s may differ.
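
The check above can be sketched in Python roughly as follows. This is only
an illustration, not the commands actually used for the report: the function
names md5_manifest and changed_files are made up here, and unmounting and
remounting the disk between the two passes (so that the second read comes
from the device rather than from the page cache) is left to the operator.

```python
import hashlib
import os


def md5_manifest(root):
    """Return {relative_path: md5 hex digest} for every regular file under root."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip dangling symlinks, sockets, etc.
            digest = hashlib.md5()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(path, root)] = digest.hexdigest()
    return manifest


def changed_files(before, after):
    """Return sorted paths whose digest differs or which exist in only one manifest."""
    return sorted(p for p in before.keys() | after.keys()
                  if before.get(p) != after.get(p))
```

Run md5_manifest() on the mount point once, unmount and remount, run it
again, and pass both results to changed_files(); a non-empty list names the
files whose contents read back differently.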

Defects in files that were written in May hint that not only reading from
but also writing to FireWire disks resulted in corrupt data. This was
silent corruption without any error messages from the PCI, firewire, SCSI,
block, or filesystem subsystems.

Affected:
- kernel 3.4
- kernel 3.4-rc5
Not affected:
- kernel 3.3.1 (which I have been running now for the last 6 days)

I used these three kernels with the same patchlevel of FireWire drivers,
namely roughly those which are about to be released in 3.5-rc1. FireWire
disks with different 1394-to-SATA or -IDE bridge chips are affected. I
noticed the problem at first on an Agere FW643e PCIe 1394 controller which
sits behind a PLX PEX 8505 PCIe switch.

MPEG2TS video reception through the same 1394 controller and PCIe switch
never showed any noticeable sign of corruption.

I have not yet had time to systematically test
- whether all of my FireWire controllers are affected,
- whether SATA or USB disks are affected (SATA probably not, USB not
used yet),
- whether my secondary Linux PC is affected.

Kernel 3.4 and 3.4-rc5 exhibited another (seemingly harmless but
suspicious) issue on my primary PC: frequent transmit queue time-outs of
an RTL8111/8168B Ethernet interface,
http://www.spinics.net/lists/netdev/msg197032.html

Being busy at work lately and not having Linux available at work, I will
be slow to look further into it. With enough spare time, it should be
possible to identify the regression by bisection between kernel 3.3 and
3.4-rc, but I have no estimate of when I will be able to spend that time.
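
For illustration, the search that bisection performs can be sketched in
Python as below. In practice git bisect would do this on the kernel tree; the
function first_bad and the is_bad callback (build and boot a revision, then
rerun the checksum test) are hypothetical names. The sketch assumes a single
good-to-bad transition and a reliable test, which an intermittent corruption
may not provide.

```python
def first_bad(revisions, is_bad):
    """Return the first revision for which is_bad() is true.

    Assumes revisions are ordered oldest to newest, revisions[0] is known
    good, revisions[-1] is known bad, and there is exactly one switch from
    good to bad in between.
    """
    good, bad = 0, len(revisions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(revisions[mid]):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression was introduced after mid
    return revisions[bad]
```

Each round halves the remaining range, so the number of kernels to build and
test grows only with the logarithm of the number of commits in the range.
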
--
Stefan Richter
-=====-===-- -==- ---=-
http://arcgraph.de/sr/


2012-06-05 00:30:22

by Jonathan Woithe

Subject: Re: Silent data corruption with kernel 3.4 and FireWire disks

On Mon, Jun 04, 2012 at 07:28:50PM +0000, Stefan Richter wrote:
> About a week ago I noticed silent data corruptions of files on FireWire
> disks: Mount disk, read lots of data and e.g. compute their md5sum,
> unmount disk, mount disk again, read and md5sum the same files again ->
> MD5s may differ.
>
> Defects in files that were written in May hint that not only reading from
> but also writing to FireWire disks resulted in corrupt data. This was
> silent corruption without any error messages from the PCI, firewire, SCSI,
> block, or filesystem subsystems.
>
> Affected:
> - kernel 3.4
> - kernel 3.4-rc5
> Not affected:
> - kernel 3.3.1 (which I have been running now for the last 6 days)

Hmm, funny you should mention this. Over the past few months I have also
been experiencing silent corruption of a firewire disc, although I suspect
it may be for a different reason. The corruptions started occurring soon
after I upgraded a machine to kernel 2.6.39 in May 2011. The filesystem was
xfs, and when corruption occurred it generally took out the entire
filesystem (on repair, everything would be bundled unsorted into
lost+found).

The disc is written to once a day using rsync.

I removed the drive from its enclosure and ran various SMART tests on it
directly (the enclosure prevents SMART from operating). The drive showed no
pre-fail signs, passed all self-tests and didn't show any problems under
badblocks tests (read-write or destructive write).

On 18 May this year I upgraded the kernel to 3.3.6 and thus far I have not
had a repeat of the corruption. Under 2.6.39 I was usually seeing a
corruption event well within 2 weeks of recreating the filesystem, although
sometimes it took longer. Although it's early days it seems that 3.3.6 is
so far behaving better than 2.6.39.

Combined with Stefan's observations, this would indicate that there were
issues with 2.6.39, they weren't present in 3.3.x and then reappeared in
3.4. It's the disappearance and reappearance which has me thinking that
perhaps we are seeing two different problems, one of which has been fixed.

> FireWire disks with different 1394-to-SATA or -IDE bridge chips are
> affected. I noticed the problem at first on an Agere FW643e PCIe 1394
> controller which sits behind a PLX PEX 8505 PCIe switch.

In my case the enclosure was one based on the Oxford Semiconductor chipset
(911?). The drive is a PATA Western Digital 500 GB drive (00AAKB-00H8A0 - I
think from memory it's a Green drive). The firewire card is reported to be

VIA Technologies, Inc. IEEE 1394 Host Controller (rev 46)
Subsystem: VIA Technologies, Inc. IEEE 1394 Host Controller

(vendor/device ID: 1106:3044, subsystem: 1106:3044).

> - whether SATA or USB disks are affected (SATA probably not, USB not
> used yet),

The system concerned uses SATA discs for the system drives, driven by:

RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller (rev 80)
Subsystem: ASUSTeK Computer Inc. A7V600/K8V Deluxe/K8V-X/A8V Deluxe motherboard

I have seen no corruption on these. Once a week I am also writing to
alternating external USB2 drives (again, using rsync) and none of those have
seen this corruption either. The USB host is reported to be

USB Controller: VIA Technologies, Inc. USB 2.0 (rev 86) (prog-if 20 [EHCI])
Subsystem: ASUSTeK Computer Inc. A7V600/K8V-X/A8V Deluxe motherboard

As I said, since there seems to be a working kernel between the version in
which I saw the problem and the one in which Stefan experienced an issue,
it's possible that these are two different issues (one fixed, one still
lurking). I throw the above out there in case it helps.

Regards
jonathan

PS: I'm not subscribed to lkml, but am to ieee1394-devel.