2008-08-01 17:30:44

by Linas Vepstas

Subject: amd64 sata_nv (massive) memory corruption

Hi,

I'm seeing strong, easily reproducible (and silent) corruption on a
sata-attached disk drive on an amd64 board. It might be the disk
itself, but I doubt it; googling suggests that it's somehow
iommu-related, but I cannot confirm this.

quickie summary:
-- disk is a brand new WDC WD5000AAKS-00YGA0 500GB disk (well, it
was brand new a few months ago -- unused, at any rate)
-- passes smartmon with flying colors, including many repeated short and long
self-tests. Been passing for months. No hint of bad sectors or other errors
in smartctl -a display
-- no ide, sata errors in syslog -- no block device errors, no fs errors, etc.
-- No oopses anywhere to be found
-- system works flawlessly with an old PATA disk. (although I'm running it
with dma turned off with hdparm, out of paranoia)
-- system is amd64 dual core, ASUS M2N-E mobo, 4GB RAM
Northbridge is nVidia Corporation MCP55 Memory Controller (rev a3)
-- I tried moving the sata cable around to other ports, no effect; also tried
reseating it on hard drive, no effect.

Corruption is *easily* observed copying files with cp or dd. Also,
filesystem metadata is typically corrupted too. Creating even a small
ext2 filesystem, say 1GB, then copying 300MB of files onto it,
unmounting it, and running fsck will return many dozens of errors.
Rerunning e2fsck over and over (as e2fsck -f -y /dev/sda6) will report
new errors about 1 out of every 3 times (on small filesystems -- on
big ones it will find new errors every time).
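
A minimal sketch of this kind of write-then-verify test (the file
name, sizes and seed are arbitrary; unmount and remount the filesystem
between the two phases so the verify pass reads from disk rather than
from the page cache):

/*
 * Write a deterministic pseudorandom pattern to a file, then re-read
 * and compare it against the regenerated pattern.
 *
 *   disk-check write /mnt/test/bigfile
 *   (umount and remount /mnt/test here)
 *   disk-check verify /mnt/test/bigfile
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define CHUNK  (1 << 20)   /* 1MB per chunk  */
#define CHUNKS 300         /* ~300MB total   */
#define SEED   12345       /* any fixed seed */

int main(int argc, char **argv)
{
	static unsigned char buf[CHUNK], chk[CHUNK];
	long long i, j;
	int fd, writing;

	if (argc != 3) {
		fprintf(stderr, "usage: %s write|verify <file>\n", argv[0]);
		return 1;
	}
	writing = !strcmp(argv[1], "write");
	fd = open(argv[2], writing ? O_WRONLY | O_CREAT | O_TRUNC
				   : O_RDONLY, 0644);
	if (fd < 0) {
		perror(argv[2]);
		return 1;
	}
	srand(SEED);	/* identical pattern in both phases */
	for (i = 0; i < CHUNKS; i++) {
		for (j = 0; j < CHUNK; j++)
			buf[j] = rand() & 0xff;
		if (writing) {
			if (write(fd, buf, CHUNK) != CHUNK) {
				perror("write");
				return 1;
			}
		} else {
			if (read(fd, chk, CHUNK) != CHUNK) {
				perror("read");
				return 1;
			}
			for (j = 0; j < CHUNK; j++)
				if (chk[j] != buf[j])
					printf("mismatch at offset %lld\n",
					       i * CHUNK + j);
		}
	}
	if (writing)
		fsync(fd);
	close(fd);
	return 0;
}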

This behaviour has been observed with two different kernels:
2.6.23.9, compiled for 32-bit, and also 2.6.26, compiled for 64-bit.

Googling this uncovers some Dec 2006 LKML emails suggesting an
iommu problem, which I explored:
-- My default boot complains
Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
This costs you 64 MB of RAM
-- I cannot find any option in BIOS that even vaguely hints at IOMMU-like
function; at best, I can assign interrupts to PCI slots, but that's it.
There's a bunch of IO options for olde-fashioned superio-like stuff:
serial and parallel ports, USB stuff, etc., but that's all.
-- booting with iommu=soft does get rid of the aperture memory hole
message, but does not solve the corruption problem.
-- booting with iommu=force seems to have no effect.
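
For reference, iommu= is a kernel command-line parameter; with GRUB
legacy it would be passed on the kernel line of a menu.lst stanza
roughly like the following (the kernel version, root device and paths
here are illustrative only):

title  Linux 2.6.26 (iommu=soft)
root   (hd0,0)
kernel /boot/vmlinuz-2.6.26 root=/dev/sda1 ro iommu=soft
initrd /boot/initrd.img-2.6.26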

I'm running the powernow-k8 cpu frequency regulator. On a hunch,
I wondered if this might be the source of the problem; however,
using the "performance" regulator to keep the clock speed nailed
at maximum had no effect on the corruption bug.

Also of note:
-- problem was observed earlier, when system had 3GB RAM in it.
-- The integrated nvidia ethernet seems to work great, no errors, etc.
-- A different PCI ethernet card works great too.
-- I'm running graphics on an ancient matrox card in a PCI slot, and
there's no hint of trouble there.
-- I'm using this system as my day-to-day desktop, and there seem to
be no other problems. This suggests that if it's some pci iommu
wackiness, it's certainly not affecting anything that isn't sata.

I really doubt the problem is the hard-drive; but I'll have to buy another
one to rule this out. It's possible that there's some problem with the
sata_nv driver, but there have been historical reports of corruption
on amd64 with other sata controllers. I can buy another sata controller
if needed, to experiment.

Other than that, any ideas for any further experiments? What can
I do to narrow the problem?

-- Linas Vepstas


2008-08-01 21:21:50

by John Stoffel

Subject: Re: amd64 sata_nv (massive) memory corruption


Linas> I'm seeing strong, easily reproducible (and silent) corruption
Linas> on a sata-attached disk drive on an amd64 board. It might be
Linas> the disk itself, but I doubt it; googling suggests that its
Linas> somehow iommu-related but I cannot confirm this.

Interesting. I've got the same motherboard and chipset and memory and
I'm NOT seeing errors. I just did a quick setup of a 10GB partition
on a Seagate 250GB disk at the end, copied over the latest kernel tree
along with the ubuntu-7.10 ISO image. No errors on an ext2
filesystem.

Linas> quickie summary:
Linas> -- disk is a brand new WDC WD5000AAKS-00YGA0 500GB disk (well, it
Linas> was brand new a few months ago -- unusued, at any rate)
Linas> -- passes smartmon with flying colors, including many repeated short and long
Linas> self-tests. Been passing for months. No hint of bad sectors or other errors
Linas> in smartctl -a display
Linas> -- no ide, sata errors in syslog -- no block device errors, no
Linas> fs errors, etc.
Linas> -- No oopses anywhere to be found
Linas> -- system works flawlessly with an old PATA disk. (although I'm
Linas> running it with dma turned off with hdparm, out of paranoia)
Linas> -- system is amd64 dual core, ASUS M2N-E mobo, 4GB RAM
Linas> Northbridge is nVidia Corporation MCP55 Memory Controller
Linas> (rev a3)

Are you running the latest BIOS? As I recall, my motherboard is an
M2N-SLI Deluxe, which is slightly different from yours.

Linas> -- I tried moving the sata cable around to other ports, no
Linas> effect; also tried reseating it on hard drive, no effect.

Linas> corruption is *easily* observed copying files with cp or
Linas> dd. Also, typically filesystem metadata is corrupted
Linas> too. Creating even a small ext2 filesystem, say 1GB, then
Linas> copying 300MB of files onto it, unmounting it, and running fsk
Linas> will return many dozens of errors. Rerunning e2fsck over and
Linas> over (as e2fsck -f -y /dev/sda6) will report new errors about 1
Linas> out of every 3 times (on small fs'es -- on big one's it will
Linas> find new errors every time)

Linas> This behaviour has been observed with two different kernels:
Linas> with 2.6.23.9, compiled for 32-bit, and also 2.6.26 complied
Linas> for 64-bit.

I've been running a variety of RC kernels since mid-February 2008 on my
box and I have not been seeing problems.

Linas> Googling this uncovers some Dec 2006 LKML emails suggesting an
Linas> iommu problem, which I explored:
Linas> -- My default boot complains
Linas> Your BIOS doesn't leave a aperture memory hole
Linas> Please enable the IOMMU option in the BIOS setup
Linas> This costs you 64 MB of RAM
Linas> -- I cannot find any option in BIOS that even vaguely hints at
Linas> IOMMU-like function; at best, I can assign interrupts to
Linas> PCI slots, but that's it. There's a bunch of IO options
Linas> for olde-fashioned superio-like stuff: serial,parallel
Linas> ports, USB stuff, etc. but that's all.
Linas> -- booting with iommu=soft does get rid of the aperature memory hole
Linas> messsage, but does not solve the corruption problem.
Linas> -- booting with iommu=force seems to have no effect.

Linas> I'm running the powernow-k8 cpu frequency regulator. On a hunch,
Linas> I wondered if this might be the source of the problem; however,
Linas> using the "performance" regulator to keep the clock speed nailed
Linas> at maximum had no effect on the corruption bug.

I'm running the same freq regulator, but I let mine float up and down
from 1GHz to 2.6GHz (my max, not overclocked at all).

Linas> Also of note:
Linas> -- problem was observed earlier, when system had 3GB RAM in it.

What did you do to upgrade to 4GB of RAM? Just pull the second pair
of 512MB DIMMs and put in fresh 1GB DIMMs? I've got a pair of 2GB
DIMMs in my box. I suspect you are seeing memory problems of some
sort.

Linas> -- The integrated nvidia ethernet seems to work great, no errors, etc.

Same here.

Linas> -- A different PCI ethernet card works great too.

Never bothered to try.

Linas> -- I'm running graphics on an anceint matrox card in a PCI
Linas> slot, and there's no hint of trouble there.

I could do this too as a test, but I'm running a PCIe Radeon X1600
without problems either.

Linas> -- I'm using this system as my day-to-day desktop, and there seem to
Linas> be no other problems. This suggests that if its some pci iommu
Linas> wackiness, it certainly not affecting anything that isn't sata.

Linas> I really doubt the problem is the hard-drive; but I'll have to
Linas> buy another one to rule this out. Its possible that there's
Linas> some problem with the sata_nv driver, but there have been
Linas> historical reports of corruption on amd64 with other sata
Linas> controllers. I can buy another sata controller if needed, to
Linas> experiment.

Linas> Other than that, any ideas for any further experiments? What can
Linas> I do to narrow the problem?

Pull all your old memory, just put in the bare minimum and see if the
problem repeats.

Also, what kind of power supply do you have installed? Not that I
think you're overloading it with what you list.

Next, I'd upgrade the BIOS to the latest release, and then reset the
BIOS to the factory default or safe settings to see if that helps.

Good luck! Let me know if you need me to run tests or get BIOS
information.

John

2008-08-01 22:19:19

by Alistair John Strachan

Subject: Re: amd64 sata_nv (massive) memory corruption

On Friday 01 August 2008 18:30:34 Linas Vepstas wrote:
> Hi,
>
> I'm seeing strong, easily reproducible (and silent) corruption on a
> sata-attached
> disk drive on an amd64 board. It might be the disk itself, but I
> doubt it; googling
> suggests that its somehow iommu-related but I cannot confirm this.

Nowhere do you explicitly say you have memtest86'ed the RAM. Checking 4GB of
RAM will take some time (probably several hours) but it will mostly eliminate
bad memory as the cause of the corruption.

IME these kinds of bugs are almost always bad RAM. Since the part of the RAM
that is bad may never be used by kernel code, you may experience no crashes.
This is especially true of machines with a lot of RAM. However since your
filesystem cache can easily consume all 4GB over time, you could see this kind
of corruption when copying files.

--
Cheers,
Alistair.

2008-08-02 02:52:11

by Linas Vepstas

Subject: Re: amd64 sata_nv (massive) memory corruption

2008/8/1 Alistair John Strachan <[email protected]>:
> On Friday 01 August 2008 18:30:34 Linas Vepstas wrote:
>> Hi,
>>
>> I'm seeing strong, easily reproducible (and silent) corruption on a
>> sata-attached
>> disk drive on an amd64 board. It might be the disk itself, but I
>> doubt it; googling
>> suggests that its somehow iommu-related but I cannot confirm this.
>
> Nowhere do you explicitly say you have memtest86'ed the RAM.

It passes memtest86+ just fine. The system has been in heavy
use doing big science calculations on big datasets (multi-gigabyte)
for months; these do not get corrupted when copied/moved around
on the old parallel IDE disk, nor moving/copying on an NFS mount
to a file server. Only the SATA disk is misbehaving.

--linas

2008-08-02 03:06:43

by Linas Vepstas

Subject: Re: amd64 sata_nv (massive) memory corruption

2008/8/1 John Stoffel <[email protected]>:

> Linas> -- system is amd64 dual core, ASUS M2N-E mobo, 4GB RAM
> Linas> Northbridge is nVidia Corporation MCP55 Memory Controller
> Linas> (rev a3)
>
> Are you running the latest BIOS? As I recall, my motherboard is an
> M2N-SLI Deluxe, which is slightly different from yours.

It's recent; I bought the thing only some number of months ago.
The basic mobo design is 2-3 years old, though; it's not bleeding edge,
it was meant to be a conservative, stable, functional choice.
I'd hope that they'd have things like this debugged by now. Sigh.

> Linas> Also of note:
> Linas> -- problem was observed earlier, when system had 3GB RAM in it.
>
> What did you do to upgrade to 4gb of ram? Just pull the second pair
> of 512mb DIMMs and put in fresh 1gb DIMMs? I've got a pair of 2gb
> DIMMs in my box. I suspect you are seeing memory problems of some
> sort.

No. See other email. memtest86 passes fine, and the system has been in
heavy use as a compute server on large datasets. No problems at all,
spotless record. I've been manipulating multi-gigabyte files just fine
on the IDE disk, without any corruption at all. I can move them
around on NFS, too. They only get corrupted on the SATA disk,
where it's immediate and widespread, and takes less than a minute
to occur.

> Next, I'd upgraded the BIOS to the latest release, and then reset the
> BIOS to the factory default or safe settings to see if that helps.

I'll give that a whirl. BIOS settings are still at factory defaults; I
had no reason to mess with them.

--linas

2008-08-02 20:09:18

by John Stoffel

Subject: Re: amd64 sata_nv (massive) memory corruption

>>>>> "Linas" == Linas Vepstas <[email protected]> writes:

Linas> 2008/8/1 Alistair John Strachan <[email protected]>:
>> On Friday 01 August 2008 18:30:34 Linas Vepstas wrote:
>>> Hi,
>>>
>>> I'm seeing strong, easily reproducible (and silent) corruption on a
>>> sata-attached
>>> disk drive on an amd64 board. It might be the disk itself, but I
>>> doubt it; googling
>>> suggests that its somehow iommu-related but I cannot confirm this.
>>
>> Nowhere do you explicitly say you have memtest86'ed the RAM.

Linas> It passes memtest86+ just fine. The system has been in heavy
Linas> use doing big science calculations on big datasets (multi-gigabyte)
Linas> for months; these do not get corrupted when copied/moved around
Linas> on the old parallel IDE disk, nor moving/copying on an NFS mount
Linas> to a file server. Only the SATA disk is misbehaving.

Can you post the output of dmesg after a boot, so we can see which
driver is being used? I assume the new Libata stuff, but maybe you
can also turn on debugging in there as well. Stuff like SCSI_DEBUG
(in the SCSI menus) might show us more details here.

Also, have you tried a new SATA cable by any chance? That's obviously
the cheaper path than getting a new disk...

Good luck,
John

2008-08-02 21:55:52

by Roger Heflin

Subject: Re: amd64 sata_nv (massive) memory corruption

Linas Vepstas wrote:
> 2008/8/1 Alistair John Strachan <[email protected]>:
>> On Friday 01 August 2008 18:30:34 Linas Vepstas wrote:
>>> Hi,
>>>
>>> I'm seeing strong, easily reproducible (and silent) corruption on a
>>> sata-attached
>>> disk drive on an amd64 board. It might be the disk itself, but I
>>> doubt it; googling
>>> suggests that its somehow iommu-related but I cannot confirm this.
>> Nowhere do you explicitly say you have memtest86'ed the RAM.
>
> It passes memtest86+ just fine. The system has been in heavy
> use doing big science calculations on big datasets (multi-gigabyte)
> for months; these do not get corrupted when copied/moved around
> on the old parallel IDE disk, nor moving/copying on an NFS mount
> to a file server. Only the SATA disk is misbehaving.

That MB uses DDR2, so I don't know if this is useful or not; I saw the
issue on MBs using DDR.

I have seen issues when using all 4 DIMM slots on a number of MBs that
only appear to show up on DMA when using fast dual-core CPUs. If the
CPU is slower, things work just fine, and if you don't make heavy use
of the network or disk, things are just fine. And these machines would
pass memtest without any issues.

You might try slowing the CPU down to its slowest speed and see if you
can still duplicate it; if you cannot, bring the speed up a step and
retest. If it only happens at the highest speed, it might be something
similar. In the end the solution was to have the MB maker add an
option in the BIOS to slow down the RAM. In the DDR case we had 4
double-sided DIMMs (8 loads on the CPU), and AMD documents said DDR
memory with 6 or more loads needed to be running at 333 and not 400;
as I said, I don't know if this also applies to DDR2 in a similar way.
Note that if we used a slower dual-core CPU it did not push things
hard enough to show the error either -- I believe we had the issues
with 280/285s but not with 275s and lower (these were dual-socket
boards, with 4 DIMMs on each CPU, 8 loads per CPU).

Roger

2008-08-02 22:02:20

by Linas Vepstas

Subject: Re: amd64 sata_nv (massive) memory corruption

2008/8/2 John Stoffel <[email protected]>:
>>>>>> "Linas" == Linas Vepstas <[email protected]> writes:
>
> Linas> 2008/8/1 Alistair John Strachan <[email protected]>:
>>> On Friday 01 August 2008 18:30:34 Linas Vepstas wrote:
>>>> Hi,
>>>>
>>>> I'm seeing strong, easily reproducible (and silent) corruption on a
>>>> sata-attached
>>>> disk drive on an amd64 board. It might be the disk itself, but I
>>>> doubt it; googling
>>>> suggests that its somehow iommu-related but I cannot confirm this.
>
> Can you post the output of dmesg after a boot, so we can see which
> driver is being used? I assume the new Libata stuff, but maybe you
> can also turn on debugging in there as well. Stuff like SCSI_DEBUG
> (in the SCSI menus) might show us more details here.
>
> Also, have you tried a new SATA cable by any chance? That's obviously
> the cheaper path than getting a new disk...

I took the problematic hard drive (and its cable) to another computer
with sata ports on it, and ran my file-copy/compare/fsck tests there,
and saw no problems; so the drive itself and its cable get a clean bill
of health.

Then, rather stupidly, I flashed the latest BIOS for the motherboard
and now have a dead motherboard (it hangs on its way through BIOS,
well before the bootloader.) So I'm off to buy a new mobo today.

I'll send the dmesg from the older boots later today, if all goes well.
I'm pretty sure I had the new libata on, and the old off -- but it's
possible that the .config somehow managed to pull in parts of the
old libata code anyway. I say this because, besides the SATA, the
blown motherboard had an IDE connector in use, and I also had
another PCI IDE card plugged in and in use. I'm imagining that
perhaps the PCI IDE .config might have pulled in old code, maybe
via header file, and thus mangled some lock that the sata side
was using. Just a wild guess. -- Most people on this mobo hadn't
seen problems, and unlike most people, I had the PCI IDE card
in it.

--linas

2008-08-03 02:41:42

by John Stoffel

Subject: Re: amd64 sata_nv (massive) memory corruption

>>>>> "Linas" == Linas Vepstas <[email protected]> writes:

Linas> 2008/8/2 John Stoffel <[email protected]>:
>>>>>>> "Linas" == Linas Vepstas <[email protected]> writes:
>>
Linas> 2008/8/1 Alistair John Strachan <[email protected]>:
>>>> On Friday 01 August 2008 18:30:34 Linas Vepstas wrote:
>>>>> Hi,
>>>>>
>>>>> I'm seeing strong, easily reproducible (and silent) corruption on a
>>>>> sata-attached
>>>>> disk drive on an amd64 board. It might be the disk itself, but I
>>>>> doubt it; googling
>>>>> suggests that its somehow iommu-related but I cannot confirm this.
>>
>> Can you post the output of dmesg after a boot, so we can see which
>> driver is being used? I assume the new Libata stuff, but maybe you
>> can also turn on debugging in there as well. Stuff like SCSI_DEBUG
>> (in the SCSI menus) might show us more details here.
>>
>> Also, have you tried a new SATA cable by any chance? That's obviously
>> the cheaper path than getting a new disk...

Linas> I took the problematic hard drive (and its cable) to another
Linas> computer with sata ports on it, and ran my
Linas> file-copy/compare/fsck tests there, and saw no problems; so the
Linas> drive itself and its cable get a clean bill of health.

Well that's a good sign.

Linas> Then, rather stupidly, I flashed the latest BIOS for the
Linas> motherboard and now have a dead motherboard (it hangs on its
Linas> way through BIOS, well before the bootloader.) So I'm off to
Linas> buy a new mobo today.

Awww fuckies. Sorry to suggest this path to you. You might be able
to get it back by clearing the CMOS as well. And hey, it could have
been a bad Mobo in the end too.

Linas> I'll send the dmesg from the older boots later today, if all
Linas> goes well. I'm pretty sure I had the new libata on, and the
Linas> old off -- but its possible that the .config somehow managed to
Linas> pull in parts of the old libata code anyway. I say this
Linas> because, besides the SATA, the blown motherboard had an IDE
Linas> connector in use, and I also had another PCI IDE card plugged
Linas> in and in use. I'm imagining that perhaps the PCI IDE .config
Linas> might have pulled in old code, maybe via header file, and thus
Linas> mangled some lock that the sata side was using. Just a wild
Linas> guess. -- Most people on this mobo hadn't seen problems, and
Linas> unlike most people, I had the PCI IDE card in it.

Hmmm... I've sorta run into this, but on my old system where I have
the following: Adaptec SCSI built in (boot drive), LSI scsi PCI card
(tape library and drives), PATA on board (for DVD), SIL SATA PCI card
(data disks), HighPoint PCI card, two scratch disks. Total pain in
the butt figuring out the right mix of libATA SATA/PATA drivers vs the
old plain PATA drivers. Once I got it working with pretty much all
/dev/sd* devices, I just leave it alone. :] Oh yeah, an 8 port
serial card and a Gigabit ethernet card as well. It's full to the
gills.

My new system is mostly my desktop, not my server, so I haven't pushed
it as hard bus wise.

Good luck, sorry I can't help directly. Do you want to see my dmesg
output as a comparison?

John

2008-08-03 22:23:39

by Linas Vepstas

Subject: Re: amd64 sata_nv (massive) memory corruption

2008/8/2 John Stoffel <[email protected]>:

>> Linas

>>> Can you post the output of dmesg after a boot,

I found the problem, and it's not in dmesg.

> Linas> Then, rather stupidly, I flashed the latest BIOS for the
> Linas> motherboard and now have a dead motherboard (it hangs on its
> Linas> way through BIOS, well before the bootloader.) So I'm off to
> Linas> buy a new mobo today.
>
> Awww fuckies. Sorry to suggest this path to you. You might be able
> to get it back by clearing the CMOS as well.

That fixed it, but only *after* the machine cooled off overnight!
While it was warm, it was so unstable I couldn't even pilot around in
BIOS without it hanging. After cooling off, it was still unstable, but
held together long enough for me to ask for "factory defaults" -- and
that fixed it. (Grrr. What are these BIOS people thinking?)

I then did some more debugging, and isolated the original data corruption
problem to a bad pair of RAM sticks. But this was subtle, so let me recap:

-- The bad ram passes memtest86+
-- It's been in heavy use for some 3-4 months, for memory-intensive compute
and memory-intensive SQL, with not the slightest hint of any stability or
corruption problems. Uptime might have been around 3-4 months.
-- Corruption was prompt and widespread on the sata interface.

I removed the bad RAM, and now the sata interface appears to be stable.
I've been doing file copies and diffs for hours without a hint of trouble.

This would seem to bring this chapter to a close.

======================================================

What I don't like is that the corruption was utterly silent -- and disastrous:
Originally, I had the sata disk paired to a pata disk in a RAID array, and the
raid array was getting corrupted -- corrupted system files would get worse,
as I tried reinstalling them. It took a while to realize that it was the sata
disk, and it took a bit longer to realize it wasn't the disk itself, but the
bad-ram-on-sata-channel.

So I'm wondering: can we devise a test to validate system-bus interactions
like this? Clearly, the memtest86 test validates the RAM and the northbridge
bus between CPU and system RAM, so that seems OK.

I assume the sata controller is attached via pci or pci-e -- although the pci
controller and the sata controller are on the same chip (nVidia nForce 570
chipset), so it may be an 'emulated' pci bus of some sort. The problem would
seem to be some sort of bus timing issue between this particular RAM,
and the pci bus in the chipset -- bad "eyes" on some signal line, or ground
bounce or whatever, or maybe a rare chipset bug.

So the question is: is there some sort of sata (or pci) "loopback mode",
where we could pump data through all of the busses and controllers, up
near to the point where it would normally go out to the serdes to the disk,
but instead have it loop back, so that we could test the buses between
endpoints? I've never heard of a pci/pci-e loopback, but that doesn't mean it
doesn't exist. I have no clue about SATA. Is there possibly some ide or
scsi command that can be used to loop-back? Some sort of "send bytes
to disk, but don't actually write them to platter" command? Maybe just
a write to some scratch ram on the disk drive itself? Even just a few bytes
would be enough to implement a loopback test. Maybe some sort of
"queue this block, but don't write it yet", followed by a "give me dump of
the command queue" -- such a loopback test would have found my problem
pretty quickly I suspect.

Ideas solicited.

--linas

p.s. the corruption appears to be single bits -- the rest of the word, and
surrounding words, seem fine.

2008-08-03 22:33:53

by Alan

Subject: Re: amd64 sata_nv (massive) memory corruption

> I then did some more debugging, and isolated the original data corruption
> problem to a bad pair of RAM sticks. But this was subtle, so let me recap:
>
> -- The bad ram passes memtest86+

You are assuming bad RAM then, not bad bus loadings or corrosion on
the pins..?

> So I'm wondering: can we devise a test to validate system-bus interactions
> like this? Clearly, the memtest86 test validates the RAM and the northbridge
> bus between CPU and system RAM, so that seems OK.

If you have a good enough pile of hardware and the right monitoring
stuff loaded, then you should get EDAC event logs from the PCI/PCI-X
for PCI card logged traps, MCE errors for the higher-level busses,
L1/L2 cache or CPU parity errors, and ECC traps for memory problems,
either via EDAC or MCE.

If you are using a generic end user motherboard then you don't.

> doesn't exist. I have no clue about SATA. Is there possibly some ide or
> scsi command that can be used to loop-back? Some sort of "send bytes
> to disk, but don't actually write them to platter" command? Maybe just
> a write to some scratch ram on the disk drive itself? Even just a few bytes

Yes for SCSI. In theory yes for ATA but I've never tested to see what the
level of actual support is, and I'm not sure you can test it in DMA mode.
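
For the SCSI case, the relevant mechanism is the WRITE BUFFER / READ
BUFFER command pair, which moves data to and from the drive's internal
buffer without touching the media. A rough userspace sketch via the
SG_IO ioctl follows; whether a particular drive (or the libata
SCSI-to-ATA translation) actually honours these commands is exactly
the untested part:

/*
 * Buffer loopback probe: write a pattern to the drive's scratch
 * buffer with WRITE BUFFER (0x3B), read it back with READ BUFFER
 * (0x3C), both in mode 0x02 ("data"), buffer ID 0, and compare.
 * Nothing is written to the platters.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

#define LEN 512

static int buf_cmd(int fd, unsigned char op, unsigned char *data, int dir)
{
	unsigned char cdb[10] = { op, 0x02, 0,	/* opcode, mode, buffer ID */
				  0, 0, 0,	/* buffer offset */
				  0, LEN >> 8, LEN & 0xff,	/* length */
				  0 };
	unsigned char sense[32];
	struct sg_io_hdr io;

	memset(&io, 0, sizeof(io));
	io.interface_id = 'S';
	io.cmd_len = sizeof(cdb);
	io.cmdp = cdb;
	io.dxfer_direction = dir;
	io.dxfer_len = LEN;
	io.dxferp = data;
	io.sbp = sense;
	io.mx_sb_len = sizeof(sense);
	io.timeout = 5000;	/* milliseconds */

	if (ioctl(fd, SG_IO, &io) < 0 || io.status != 0)
		return -1;
	return 0;
}

int main(int argc, char **argv)
{
	unsigned char out[LEN], in[LEN];
	int i, fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s /dev/sdX\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror(argv[1]);
		return 1;
	}
	for (i = 0; i < LEN; i++)
		out[i] = i & 0xff;
	if (buf_cmd(fd, 0x3B, out, SG_DXFER_TO_DEV) ||
	    buf_cmd(fd, 0x3C, in, SG_DXFER_FROM_DEV)) {
		fprintf(stderr, "drive rejected WRITE/READ BUFFER\n");
		return 1;
	}
	printf(memcmp(out, in, LEN) ? "loopback MISMATCH\n" : "loopback ok\n");
	return 0;
}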

> would be enough to implement a loopback test. Maybe some sort of

For the simpler cases perhaps. The more interesting approaches I think
are the fs level ones where you accept the fact that hardware sucks and
do end to end checksumming from the fs or even the app in some
situations. We don't yet have that functionality mainstream although it
might make an interesting device mapper module ...


Alan

2008-08-04 03:22:18

by Robert Hancock

Subject: Re: amd64 sata_nv (massive) memory corruption

Linas Vepstas wrote:
> What I don't like is that the corruption was utterly silent -- and disastereous:
> Originally, I had the sata disk paired to a pata disk in a RAID array, and the
> raid array was getting corrupted -- corrupted system files would get worse,
> as I tried reinstalling them. It took a while to realize that it was the sata
> disk, and it took a bit longer to realize it wasn't the disk itself, but the
> bad-ram-on-sata-channel.
>
> So I'm wondering: can we devise a test to validate system-bus interactions
> like this? Clearly, the memtest86 test validates the RAM and the northbridge
> bus between CPU and system RAM, so that seems OK.

I wouldn't be sure about that..

>
> I assume the sata controller is attached via pci or pci-e -- although the pci
> controller and the sata controller are on the same chip, (nVidia nForce 570
> chipset) so it may be an 'emulated' pci bus of some sort. The problem would
> seem to be some sort of bus timing issue between this particular RAM,
> and the pci bus in the chipset -- bad "eyes" on some signal line, or ground
> bounce or whatever, or maybe a rare chipset bug.

The SATA controller is part of the chipset, and I think it talks
HyperTransport directly; it only looks like PCI or PCI Express. These
systems have an on-die memory controller in the CPU, so the SATA
controller has to talk HyperTransport to the CPU, which then
physically accesses the DIMMs.

In theory the DIMMs have no idea whether the accesses are from the CPU
itself or from the chipset. However, it's possible that the particular
timing or burst sizes of the transfers done by the SATA controller
triggered a problem with marginal timing on the DIMMs and caused the
data corruption.

>
> So the question is: is there some sort of sata (or pci) "loopback mode",
> where we could pump data through all of the busses and controllers, up
> near to the point where it would normally go out to the serdes to the disk,
> but instead have it loop back, so that we could test the buses between
> endpoints? I've never heard of a pci/pci-e loopback, but that doesn't mean it
> doesn't exist. I have no clue about SATA. Is there possibly some ide or
> scsi command that can be used to loop-back? Some sort of "send bytes
> to disk, but don't actually write them to platter" command? Maybe just
> a write to some scratch ram on the disk drive itself? Even just a few bytes
> would be enough to implement a loopback test. Maybe some sort of
> "queue this block, but don't write it yet", followed by a "give me dump of
> the command queue" -- such a loopback test would have found my problem
> pretty quickly I suspect.
>
> Ideas solicited.

I don't imagine that would be very useful in this case. The SATA link,
PCI Express bus, and HyperTransport bus all have parity or CRC error
checking, so presumably they are unlikely to cause undetected
errors. The transitions between them could cause problems, and most
desktop machines don't have ECC memory which could catch memory timing
problems or bad RAM (which is rather unfortunate), so those are the most
likely places for a problem to show up.

>
> --linas
>
> p.s. the corruption appears to be single bits -- the rest of the word, and
> surrounding words, seem fine.

2008-08-05 05:45:35

by Linas Vepstas

Subject: Re: amd64 sata_nv (massive) memory corruption

2008/8/3 Robert Hancock <[email protected]>:
> Linas Vepstas wrote:
>>
>> What I don't like is that the corruption was utterly silent --
[...]
>> So the question is: is there some sort of sata (or pci) "loopback mode",
>> where we could pump data through all of the busses and controllers, up
>> near to the point where it would normally go out to the serdes to the
>> disk,
>> but instead have it loop back, so that we could test the buses between
>> endpoints?
>
> I don't imagine that would be very useful in this case. The SATA link, PCI
> Express bus, HyperTransport bus all have parity or CRC error checking, so
> presumably they couldn't be likely to cause undetected errors. The
> transitions between them could cause problems,

Well, but I suffered badly from an undetected error, in the sense
that the operating system had no knowledge of it, and it corrupted
data on disk as a result. As Alan Cox suggests, perhaps I didn't
have EDAC turned on, or something ... I'm investigating now.
But this is moot -- if there is software that already exists that
could have reported the error to the kernel, then this software
should have been installed/enabled/operating by default.

> and most desktop machines
> don't have ECC memory which could catch memory timing problems or bad RAM

I'm unclear on ECC memory: if a motherboard "supports ECC",
does it mean it actually uses ECC bits in the bus between the
memory controller and the RAM? Or does it simply mean that
it won't hang if I plug in ECC RAM (but otherwise ignore the bits)?

Personally I'm ready to pop $$$ for ECC if it will actually do
something for me; this has been painful.

--linas

2008-08-05 06:36:22

by Robert Hancock

Subject: Re: amd64 sata_nv (massive) memory corruption

Linas Vepstas wrote:
> 2008/8/3 Robert Hancock <[email protected]>:
>> Linas Vepstas wrote:
>>> What I don't like is that the corruption was utterly silent --
> [...]
>>> So the question is: is there some sort of sata (or pci) "loopback mode",
>>> where we could pump data through all of the busses and controllers, up
>>> near to the point where it would normally go out to the serdes to the
>>> disk,
>>> but instead have it loop back, so that we could test the buses between
>>> endpoints?
>> I don't imagine that would be very useful in this case. The SATA link, PCI
>> Express bus, HyperTransport bus all have parity or CRC error checking, so
>> presumably they couldn't be likely to cause undetected errors. The
>> transitions between them could cause problems,
>
> Well, but I suffered badly from an undetected error, in the sense
> that the operating system had no knowledge of it, and it corrupted
> data on disk as a result. As Alan Cox suggests, perhaps I didn't
> have EDAC turned on, or something ... I'm investigating now.
> But this is moot -- if there is software that already exists that
> could have reported the error to the kernel, then this software
> should have been installed/enabled/operating by default.

EDAC is mainly useful for detecting non-fatal problems (i.e. things like
corrupted transfers that were detected and retried, or ECC errors that
could be corrected) which might indicate a problem but might go
unnoticed otherwise. Usually, fatal problems that get detected by
hardware wouldn't be unnoticeable - they would typically raise NMIs and
cause funny kernel messages, cause machine check exceptions or just lock
up or reset the machine.

Of course, if you don't have ECC memory, and you have bad RAM or memory
timing problems, nothing can detect this at all, and EDAC wouldn't help you.

>
>> and most desktop machines
>> don't have ECC memory which could catch memory timing problems or bad RAM
>
> I'm unclear on ECC memory: if a motherboard "supports ECC",
> does it mean it actually uses ECC bits in the bus between the
> memory controller and the RAM? Or does it simply mean that
> it won't hang if I plug in ECC RAM (but otherwise ignore the bits)?

I think just about all chipsets (more specifically, memory controllers,
this includes AMD CPUs) support ECC and use the bits, at least if all
the installed memory supports it. I've never heard of a board that
couldn't handle ECC at all.

>
> Personally I'm ready to pop $$$ for ECC it if will actually do
> something for me, this has been painful.
>
> --linas
>

2008-08-05 12:47:23

by Alan

Subject: Re: amd64 sata_nv (massive) memory corruption

> have EDAC turned on, or something ... I'm investigating now.
> But this is moot -- if there is software that already exists that
> could have reported the error to the kernel, then this software
> should have been installed/enabled/operating by default.

That gets you into arguments with the people who care about performance,
but it's really a distribution-level debate, and I suspect the answer is
itself distro-specific depending on usage.

> Personally I'm ready to pop $$$ for ECC it if will actually do
> something for me, this has been painful.

On a decent system ECC will do something. A modern server PC actually has
pretty good coverage on CPU L1, L2 and optionally RAM. I/O controllers
and disk internal caches seem to be a bit more variable, which is one
reason big HPC cluster projects often checksum end to end - when you
produce terabytes of data, all the one-in-a-hundred-billion error stats
start to look less than reassuring.

Alan

2008-08-05 17:02:29

by Linas Vepstas

Subject: Re: amd64 sata_nv (massive) memory corruption

2008/8/3 Alan Cox <[email protected]>:

>> -- The bad ram passes memtest86+
>
> You are assuming bad RAM then not bad bus loadings, corrosion on the
> pins.. ?

Yes, probably bad timing due to bus loading or bad impedance
due to bad connector, or whatever.

> If you have a good enough pile of hardware and the right monitoring stuff
> loaded then you should get EDAC event logs

I've got the AMD 570 chipset, which is older than the
amd76x that edac supports. The latest MB's seem to have
the AMD 790 chipset, which is also not currently supported.

Can anyone get me the portion of the AMD 570 (nVidia
nForce 570) chip specs that describe the RAM ECC
error event counters? (I assume that this chip has some
sort of error reporting or counting registers) I can sign
NDA if needed.

> The more interesting approaches I think
> are the fs level ones where you accept the fact that hardware sucks and
> do end to end checksumming from the fs or even the app in some
> situations. We don't yet have that functionality mainstream although it
> might make an interesting device mapper module ...

I'm game. Care to guide me through? So: on every write, this
new device mapper module computes a checksum and stores
it somewhere. On every read, it computes a checksum and
compares to the stored value. Easy enough I guess.
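
As a userspace sketch of that idea (not a dm module; the block size,
CRC choice, and keeping the checksums in a separate file are all
arbitrary decisions here -- compare Pavel's loop-plus-checksum-partition
variant later in the thread):

/*
 * One CRC32 per 4K block, kept in a separate file.
 *   "record" computes and stores a checksum for every block;
 *   "check" recomputes them and reports mismatches.
 * A partial block at the end of the device is ignored.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BLK 4096

static uint32_t crc32_blk(const unsigned char *p, size_t n)
{
	uint32_t crc = ~0u;
	size_t i;
	int b;

	for (i = 0; i < n; i++) {
		crc ^= p[i];
		for (b = 0; b < 8; b++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
	}
	return ~crc;
}

int main(int argc, char **argv)
{
	unsigned char buf[BLK];
	uint32_t sum, stored;
	unsigned long long blk = 0, bad = 0;
	FILE *dev, *sums;
	int checking;

	if (argc != 4) {
		fprintf(stderr, "usage: %s record|check <device> <sumfile>\n",
			argv[0]);
		return 1;
	}
	checking = !strcmp(argv[1], "check");
	dev = fopen(argv[2], "rb");
	sums = fopen(argv[3], checking ? "rb" : "wb");
	if (!dev || !sums) {
		perror("open");
		return 1;
	}
	while (fread(buf, 1, BLK, dev) == BLK) {
		sum = crc32_blk(buf, BLK);
		if (!checking)
			fwrite(&sum, sizeof(sum), 1, sums);
		else if (fread(&stored, sizeof(stored), 1, sums) == 1 &&
			 stored != sum) {
			printf("block %llu: stored %08x, got %08x\n",
			       blk, stored, sum);
			bad++;
		}
		blk++;
	}
	printf("%llu blocks, %llu mismatches\n", blk, bad);
	return 0;
}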

Several hard parts:
-- where to store the checksums?
-- what to do (besides print to dmesg) if there's a mismatch?
-- on an md raid-1, if there's a checksum error on one of the
disks, then one could check the other disk to see if it's good.
This suggests a new API (illustrative prototypes follow this list):

++ "is this block device an md device?"
++ "if yes to above, then give me alternate block"
++ "invalidate copy n of block x"
(this last, because presumably one wants to tell md that
one of its copies is bad.)

(Actually, above API would be interesting for fsck too ..
if fsck is failing with one copy from a raid set, it would
be interesting to see if an alternate copy passes fsck.)

-- but perhaps the storage containing the checksums themselves
was corrupted. Not sure what to do then. If the checksums
are corrupted, I don't want to accidentally flag large portions
of a block device as bad when it's actually good.
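
Sketched as prototypes, the proposed API might look like the
following; nothing like this exists in md today, and the names are
invented purely for illustration:

/* Hypothetical md helpers for checksum-driven recovery; invented
 * names, for illustration only.  In the kernel, sector_t and
 * struct block_device come from <linux/types.h> and <linux/fs.h>. */
typedef unsigned long long sector_t;
struct block_device;

/* "is this block device an md device" (with redundant copies)? */
int bdev_is_md_mirrored(struct block_device *bdev);

/* "give me the alternate block": read 'sector' again, from a copy
 * other than mirror 'avoid', into 'buf'. */
int md_read_alternate(struct block_device *bdev, sector_t sector,
		      void *buf, int avoid);

/* "invalidate copy n of block x": tell md that mirror 'copy' holds a
 * bad version of 'sector' and should be rewritten from a good copy. */
int md_invalidate_copy(struct block_device *bdev, sector_t sector,
		       int copy);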

An alternative would be file-level checksums built into the
file system. I'm not thrilled by this, because it fails to focus
on errors caused by bad hardware. It's also too close to
tripwire-like functionality, and I don't want to get into conversations
about security, etc.

I'm paranoid enough to be willing to implement something like
this .. is the above design on the right track?

--linas

2008-08-05 17:38:44

by Alan

Subject: Re: amd64 sata_nv (massive) memory corruption

> I've got the AMD 570 chipset, which is older than the
> amd76x that edac supports. The latest MB's seem to have
> the AMD 790 chipset, which is also not currently supported.

AMD76x is very early 32bit so probably not..

The later AMD chipsets don't appear in the chipset-specific code, as the
HyperTransport-era processors have on-processor memory controllers and
use MCE reporting for that, providing you have suitable memory etc.
Instead mcelog will decode them for you. The generic EDAC support for PCI
error scanning still applies.

> Can anyone get me the portion of the AMD 570 (nVidia
> nForce 570) chip specs that describe the RAM ECC
> error event counters? (I assume that this chip has some
> sort of error reporting or counting registers) I can sign
> NDA if needed.

C|N>K I've never even been able to extract IDE controller docs from
nVidia..

> I'm game. Care to guide me through? So: on every write, this
> new device mapper module computes a checksum and stores
> it somewhere. On every read, it computes a checksum and
> compares to the stored value. Easy enough I guess.
>
> Several hard parts:
> -- where to store the checksums?

That is the million dollar question - plus you can argue it is the fs
that should do it. There is stuff crawling through the standards world to
provide a small per-block additional info area on disk sectors.

> -- what to do (besides print to dmesg) if there's a mismatch?

Configurable - panic/offline/warn ?

> This suggests a new API:
>
> ++ "is this block device an md device?"
> ++ "if yes to above, then give me alternate block"
> ++ "invalidate copy n of block x"
> (this last, because presumably one wants to tell md that
> one of its copies is bad.)

It's the same as dm RAID hitting a physical read error. In the former
case you got the data back but it is wrong (so useless); in the latter
you got nothing back.

> I'm paranoid enough to be willing to implement something like
> this .. is the above design on the right track?

Yes. If you can figure out where to keep the checksums without ruining
performance (and of course if there isn't one lurking in device mapper
world not yet submitted).

2008-08-06 21:39:46

by Linas Vepstas

Subject: Re: amd64 sata_nv (massive) memory corruption

2008/8/5 Alan Cox <[email protected]>:

>> I'm game. Care to guide me through? So: on every write, this
>> new device mapper module computes a checksum and stores
>> it somewhere. On every read, it computes a checksum and
>> compares to the stored value. Easy enough I guess.
>>
>> Several hard parts:
>> -- where to store the checksums?
>
> That is the million dollar question - plus you can argue it is the fs
> that should do it. There is stuff crawling through the standards world to
> provide a small per block additional info area on disk sectors.

My objection to fs-layer checksums (e.g. in some user-space
file system) is that it doesn't leverage the extra info that RAID
has. If a block is bad, RAID can probably fetch another one
that is good. You can't do this at the file-system level.

I assume I can layer device-mappers anywhere, right?
Layering one *underneath* md-raid would allow it to
reject/discard bad blocks, and then let the raid layer
try to find a good block somewhere else.

I assume that a device mapper can alter the number
of blocks-in to the number of blocks-out; that it doesn't
have to be 1-1. Then for every 10 sectors of data, it
would use 11 sectors of storage, one holding the
checksum. I'm very naive about how the block layer
works, so I don't know what snags there might be.
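
The remapping arithmetic for such a 10-plus-1 layout is simple enough;
a sketch (the group size and the layout itself are of course
arbitrary):

/* Every group of 10 logical sectors is stored as 11 physical
 * sectors, the 11th holding the group's checksums. */
#define GROUP 10

typedef unsigned long long sector_t;

/* physical sector backing logical sector 'lsec' */
static sector_t data_sector(sector_t lsec)
{
	return (lsec / GROUP) * (GROUP + 1) + (lsec % GROUP);
}

/* physical sector holding the checksums for 'lsec's group */
static sector_t csum_sector(sector_t lsec)
{
	return (lsec / GROUP) * (GROUP + 1) + GROUP;
}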

The downside of this is that the disk wouldn't be
naively readable unless the specific mapper module
was in place -- so one would need a superblock of
some sort indicating the type of checksumming used,
etc. Is there any "standardized" way of managing
superblocks for use by the device mapper? I guess
the encrypting dm has to store meta-information
somewhere, too, specifying what kind of encryption
was used. I'll look at that.

> Yes. If you can figure out where to keep the checksums without ruining
> performance

Heh. Unlikely. The act of checksumming will impact
performance. It should end up similar to the impact
from encryption (maybe not quite as bad), or comparable
to raid-5 (which computes various kinds of parity).

> (and of course if there isn't one lurking in device mapper
> world not yet submitted).

I'm googling, but I don't see anything. However, I now see,
for the first time, pending work for 2.6.27 for a field in bio
called "blk_integrity". I cannot figure out if this work requires
special-whiz-bang disk drives to be purchased.

Also, it seems to be limited to 8 bytes of checksums per 512
byte block? This is reasonable for checksumming, I guess,
but one could get even fancier and run ECC-type sums, if
one could store, say, an additional 50 bytes for every 512
bytes. I'm cc'ing Martin Petersen, the developer, for
comments.


--linas

2008-08-07 03:04:54

by Martin K. Petersen

Subject: Re: amd64 sata_nv (massive) memory corruption

>>>>> "Linas" == Linas Vepstas <[email protected]> writes:

[I got added to the CC: late in the game so I don't have the
background on this discussion]

Linas> My objection to fs-layer checksums (e.g. in some user-space
Linas> file system) is that it doesn't leverage the extra info that
Linas> RAID has. If a block is bad, RAID can probably fetch another
Linas> one that is good. You can't do this at the file-system level.

ZFS and btrfs both support redundancy within the filesystem. They can
fetch the good copy and fix the bad one. And they have much more
context available for recovery than a RAID would.


Linas> I assume that a device mapper can alter the number of blocks-in
Linas> to the number of blocks-out; that it doesn't have to be
Linas> 1-1. Then for every 10 sectors of data, it would use 11 sectors
Linas> of storage, one holding the checksum. I'm very naive about how
Linas> the block layer works, so I don't know what snags there might
Linas> be.

I did a proof of concept of this a couple of years ago. And
performance was pretty poor. I also have a commercial device that
implements DIF on a SATA drive by doing the same thing. It also
suffers. It works reasonably well for what it was designed for,
namely RAID arrays where there is much more control over I/O staging
than we can provide in a general purpose operating system.

The elegant part about filesystem checksums is that they are stored in
the metadata blocks which are read anyway. So there are no additional
seeks, nor read-modify-write on a 10 sector + 1 blob of data.


Linas> I'm googling, but I don't see anything. However, I now see,
Linas> for the first time, pending workd for 2.6.27 for a field in bio
Linas> called "blk_integrity". I cannot figure out if this work
Linas> requires special-whiz-bang disk drives to be purchased.

There are two parts to this:

1. SCSI Data Integrity Field or DIF adds 8 bytes of stuff (referred to
as protection information) to each sector. The contents of each
8-byte tuple is well-defined.

2. Data Integrity Extensions is a set of knobs that allow us to DMA
the DIF protection information to and from host memory. That enables
us to provide end-to-end data integrity protection.

We can generate the protection information either up in the
application, attach it in a library or inside the kernel. HBAs, RAID
heads, disk drives and potentially SAN switches can verify the
integrity of the I/O before it gets passed on in the stack.

So, yes. You need special hardware. Controller and disk need to
support DIX and DIF respectively. This has been in the works for a
while and hardware is starting to materialize. Expect this to become
standard fare in the SCSI/SAS/FC market segment.

The T13 committee is currently working on a proposal called External
Path Protection which is essentially DIF for ATA. Will probably
happen in nearline drives first.


Linas> Also, it seems to be limited to 8 bytes of checksums per 512
Linas> byte block? This is reasonable for checksumming, I guess, but
Linas> one could get even fancier and run ECC-type sums, if one could
Linas> store, say, an addtional 50 bytes for every 512 bytes. I'm
Linas> cc'ing Martin Petersen, the developer, for comments.

The 8-byte DIF tuple is split into 3 sections:

- a 16-bit CRC of the 512 bytes of data

- a 16-bit application tag

- a 32-bit reference tag that in most cases needs to match the lower
32 bits of the sector LBA

The neat thing about DIF is that all nodes in the I/O path can verify
the contents. I.e. the drive can check that the CRC and LBA match the
data before it physically writes to disk. This allows us to catch
corruptions up front instead of when data is eventually read back.
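
In C terms, the tuple just described looks like this (all fields are
big-endian on the wire):

#include <stdint.h>

/* One 8-byte DIF tuple per 512-byte sector. */
struct dif_tuple {
	uint16_t guard_tag;	/* 16-bit CRC of the 512 data bytes */
	uint16_t app_tag;	/* 16-bit application tag */
	uint32_t ref_tag;	/* matches low 32 bits of the sector LBA */
};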

So I mainly consider DIX/DIF a means to protect data integrity while
the I/O is in flight.

However, there is one feature that is of benefit in a more persistent
manner, namely the application tag. This gives us two bytes of extra
storage per sector. Given the small size it has very limited use at
the sector level. However, I have implemented it so that filesystems
can attach whatever they please, and the SCSI layer will interleave
the (meta-?)metadata attached to a logical block between the physical
sectors (This obviously implies FS block size > sector size and that's
about to change with 4KB sectors. There's work in progress to allow 8
bytes of DIF per 512 bytes of data regardless of physical sector size,
though).

The application tag space can be used to attach checksums to
filesystem logical blocks without changing the on-disk format. Or
DM/MD can use the extra space for their own housekeeping (and signal
to the filesystems that the app tag is not available).

DIF/DIX are somewhat convoluted and hard to cover in an email. I
suggest you read my recent OLS paper and my "Proactively Preventing
Data Corruption" article. Both can be found at the URL below.

http://oss.oracle.com/projects/data-integrity/documentation/

--
Martin K. Petersen Oracle Linux Engineering

2008-08-07 04:32:20

by Linas Vepstas

Subject: Re: amd64 sata_nv (massive) memory corruption

2008/8/6 Martin K. Petersen <[email protected]>:
>>>>>> "Linas" == Linas Vepstas <[email protected]> writes:
>
> [I got added to the CC: late in the game so I don't have the
> background this discussion]

You haven't missed anything, other than I've had my
umpteenth instance of data corruption in some years,
and am up to my eyeballs in consumer-grade hardware
from which I would like to get enterprise-grade reliability.
Of course, being a cheapskate is what gets me into this
mess.

> ZFS and btrfs both support redundancy within the filesystem. They can
> fetch the good copy and fix the bad one. And they have much more
> context available for recovery than a RAID would.

My problem is that the corruption I see is "silent": so
redundancy is useless, as I cannot distinguish good blocks
from bad. I'm running RAID, one of the two disks returns
bad data. Without checksums, I can't tell which version of
a block is the good one.

> Linas> I assume that a device mapper can alter the number of blocks-in
> Linas> to the number of blocks-out; that it doesn't have to be
> Linas> 1-1. Then for every 10 sectors of data, it would use 11 sectors
> Linas> of storage, one holding the checksum. I'm very naive about how
> Linas> the block layer works, so I don't know what snags there might
> Linas> be.
>
> I did a proof of concept of this a couple of years ago ago. And
> performance was pretty poor.

Yes, I'm not surprised. For a home-use system, though,
I think I'm ready to sacrifice performance in exchange for
reliability. Much of what I do does not hit the disk hard.

There is also an interesting possibility that offers a middle
ground between raw performance and safety: instead of
verifying checksums on *every* read access, it could be
enough to verify only every so often -- say, only one out
of every 10 reads, or maybe triggered by a cron job in
the middle of the night: turn on verification, touch a bunch
of files for an hour or two, turn off verification before 6AM.
This would be enough to trigger timely ill-health warnings,
without impacting daytime use. (Much as I dislike the
corruption I suffered, I dislike even more that I had no
warning of it)

> The elegant part about filesystem checksums is that they are stored in
> the metadata blocks which are read anyway.

Yes.

> So there are no additional
> seeks, nor read-modify-write on a 10 sector + 1 blob of data.

I guess that, instead of writing 10+1 sectors, with the seek
penalty, it might be faster to copy data in the kernel, so as
to be able to store the checksum in the same sector as the
data.

> So, yes. You need special hardware. Controller and disk need to
> support DIX and DIF respectively. This has been in the works for a
> while and hardware is starting to materialize. Expect this to become
> standard fare in the SCSI/SAS/FC market segment.

Yes, well, my HBA is soldered onto my MB, and I'm buying
$80 hard drives one at a time at Fry's Electronics, so it could
be 5-10 years before DIX/DIF trickles down to consumer-grade
electronics. And I don't want to wait 5-10 years ...

Thus, a "tactical" solution seems to be pure-software
check-summing in a kernel device-mapper module,
performance be damned.

--linas

2008-08-07 07:46:43

by Pavel Machek

Subject: Re: amd64 sata_nv (massive) memory corruption

Hi!

> >> I'm game. Care to guide me through? So: on every write, this
> >> new device mapper module computes a checksum and stores
> >> it somewhere. On every read, it computes a checksum and
> >> compares to the stored value. Easy enough I guess.
> >>
> >> Several hard parts:
> >> -- where to store the checksums?
> >
> > That is the million dollar question - plus you can argue it is the fs
> > that should do it. There is stuff crawling through the standards world to
> > provide a small per block additional info area on disk sectors.
>
> My objection to fs-layer checksums (e.g. in some user-space
> file system) is that it doesn't leverage the extra info that RAID
> has. If a block is bad, RAID can probably fetch another one
> that is good. You can't do this at the file-system level.
>
> I assume I can layer device-mappers anywhere, right?
> Layering one *underneath* md-raid would allow it to
> reject/discard bad blocks, and then let the raid layer
> try to find a good block somewhere else.
>
> I assume that a device mapper can alter the number
> of blocks-in to the number of blocks-out; that it doesn't
> have to be 1-1. Then for every 10 sectors of data, it
> would use 11 sectors of storage, one holding the
> checksum. I'm very naive about how the block layer
> works, so I don't know what snags there might be.

I did something like that a long time ago -- with loop, and a separate
partition for checksums.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-08-07 16:47:16

by Martin K. Petersen

Subject: Re: amd64 sata_nv (massive) memory corruption

>>>>> "Linas" == Linas Vepstas <[email protected]> writes:

Linas> My problem is that the corruption I see is "silent": so
Linas> redundancy is useless, as I cannot distinguish good blocks from
Linas> bad. I'm running RAID, one of the two disks returns bad data.
Linas> Without checksums, I can't tell which version of a block is the
Linas> good one.

But btrfs can.


Linas> There is also in interesting possibility that offers a middle
Linas> ground between raw performance and safety: instead of verifying
Linas> checksums on *every* read access, it could be enough to verify
Linas> only every so often -- say, only one out of every 10 reads, or
Linas> maybe triggered by a cron job in the middle of the night: turn
Linas> on verification, touch a bunch of files for an hour or two,
Linas> turn off verification before 6AM.

All evidence suggests that scrubbing is a good way to keep your data
healthy.

A common corruption scenario a few years ago was bleed to adjacent
tracks due to a frequently written hot spot on disk. Scrubbing in
RAID arrays helped fix that. Modern drives actually maintain an
internal list of hot spots and will automatically schedule refreshes
of adjacent blocks to prevent bleed.

But there are obviously other corruption scenarios that scrubbing can
help alleviate -- including genuine bit rot on the platter.


Linas> Yes, well, my HBA is soldered onto my MB, and I'm buying $80
Linas> hard drives one at a time at Frye's electronics, so it could be
Linas> 5-10 years before DIX/DIF trickles down to consumer-grade
Linas> electronics. And I don't want to wait 5-10 years ...

I doubt it's going to take *that* long.

Corruption of in-flight data has been a problem for years. And it is
a problem that RAID and FS checksums can't fix.

Oracle has been providing customers with in-flight integrity
protection on high-end arrays for many years using a proprietary
technology called HARD. Array vendors license it from us and HARD is
mandatory in a lot of business and government deployments.

DIF/DIX is our attempt to make integrity protection available on mid-
to low-range equipment. We decided to embrace and extend an existing,
open standard and are working with standards bodies to nudge them in
the right direction in terms of new features. It has taken about two
years from conception to product in a highly conservative, slow-moving
industry.

As as I mentioned earlier, T13 is working on EPP which is essentially
DIF for SATA. The protection format is the same which means we can
prepare one type of integrity information regardless of whether the
target drive is SCSI or SATA.

Once External Path Protection is ratified I'm expecting drives to
appear fairly quickly. The turnaround time should be short as SATA
drive generations don't last nearly as long as SCSI.


Linas> Thus, a "tactical" solution seems to be pure-software
Linas> check-summing in a kernel device-mapper module, performance be
Linas> damned.

What I don't understand is why you are so focused on fixing this at
the RAID level. I think your time would be better spent contributing
to btrfs which gives you checksums and redundancy on consumer grade
hardware today. It's only a few months away from GA. So why not
implement scrubbing in btrfs instead of spending time on a kludgy
device mapper module with crappy performance?

--
Martin K. Petersen Oracle Linux Engineering

2008-08-07 17:24:12

by Linas Vepstas

Subject: Re: amd64 sata_nv (massive) memory corruption

2008/8/7 Martin K. Petersen <[email protected]>:

> Linas> Thus, a "tactical" solution seems to be pure-software
> Linas> check-summing in a kernel device-mapper module, performance be
> Linas> damned.
>
> What I don't understand is why you are so focused on fixing this at
> the RAID level. I think your time would be better spent contributing
> to btrfs which gives you checksums and redundancy on consumer grade
> hardware today. It's is only a few months away from GA. So why not
> implement scrubbing in btrfs instead of spending time on a kludgy
> device mapper module with crappy performance?

Let me count the ways:
-- Time is an important consideration; I'd do this at home,
spare-time, sandwiched between my formal commitments.
-- The device mapper interfaces are modular enough, and
there are enough other device-mapper modules to serve
as example code, that I could probably knock off the basic
function in less than a week, and then spend a few months
polishing and enhancing at leisure.
-- Learning the ins and outs of btrfs would take more than
a week.
-- Timeliness, modularity -- suppose I am pressed for time,
and it takes me 6 months to develop a dm module. I am
pretty sure that the APIs won't change out from under me,
and my patches will still apply. By contrast, btrfs will likely
undergo major changes by then, so slow-moving patches
would likely be rotten/superseded by then.
-- I'm architecturally conservative. Looking at the btrfs page
does not make me comfortable: it seems to have a do-it-all
set of goals. When I was younger, I created some do-it-all
projects, and found that 1) the community was never as
excited as I was, and 2) the list of desired features was
overwhelmingly large, which means most didn't get done,
and most of the rest were little more than proof-of-concept.
My gut-sense, irrational vibe from the btrfs page is that it's
over-reaching -- examples of over-reaching projects that
got in trouble were evms and reiserfs -- and so wait-n-see
is my prudent response.
-- By contrast, raid+lvm already does most of what I need;
I'd be happier seeing a project that leverages the existing
raid+lvm infrastructure than one that fundamentally
rethinks/redesigns everything. Nothing wrong with
fundamental architectural re-thinks; it's just that they're
much riskier.

I dunno, I'm not even sure I have the time to do a dm module;
I'm still tossing around the idea.

--linas

2008-08-07 18:53:48

by John Stoffel

Subject: Re: amd64 sata_nv (massive) memory corruption

>>>>> "Martin" == Martin K Petersen <[email protected]> writes:

>>>>> "Linas" == Linas Vepstas <[email protected]> writes:
Linas> My problem is that the corruption I see is "silent": so
Linas> redundancy is useless, as I cannot distinguish good blocks from
Linas> bad. I'm running RAID, one of the two disks returns bad data.
Linas> Without checksums, I can't tell which version of a block is the
Linas> good one.

Martin> But btrfs can.

Maybe. I'd not trust btrfs even now because the on-disk format is
going to change yet again from the currently released version. I'm
personally interested in it, but not quite enough to use it. :]

Linas> There is also in interesting possibility that offers a middle
Linas> ground between raw performance and safety: instead of verifying
Linas> checksums on *every* read access, it could be enough to verify
Linas> only every so often -- say, only one out of every 10 reads, or
Linas> maybe triggered by a cron job in the middle of the night: turn
Linas> on verification, touch a bunch of files for an hour or two,
Linas> turn off verification before 6AM.

If you're reading the file off disk, it doesn't cost anything to
verify it then, esp if the checksum is either in the metadata or next
to the blocks themselves.

It's corruption in files which aren't read which turns into a
problem.

Martin> All evidence suggests that scrubbing is a good way to keep
Martin> your data healthy.

Yup. And mirroring anything you think is important. Disk is cheap,
mirroring is good.

Heck, I'd pay good money for a SATA disk which mirrored inside itself
or which joined two separate spindle/head assemblies into one and did
all the error correction at a low level.