2009-08-24 11:19:01

by Florian Weimer

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

* Pavel Machek:

> +Linux block-backed filesystems can only work correctly when several
> +conditions are met in the block layer and below (disks, flash
> +cards). Some of them are obvious ("data on media should not change
> +randomly"), some are less so.

You should make clear that the file lists per-file-system rules and
that some file systems can recover from some of the error conditions.

> +* don't damage the old data on a failed write (ATOMIC-WRITES)
> +
> + (Thrash may get written into sectors during powerfail. And
> + ext3 handles this surprisingly well at least in the
> + catastrophic case of garbage getting written into the inode
> + table, since the journal replay often will "repair" the
> + garbage that was written into the filesystem metadata blocks.

Isn't this by design? In other words, if the metadata doesn't survive
non-atomic writes, wouldn't it be an ext3 bug?

--
Florian Weimer <[email protected]>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99


2009-08-24 13:01:25

by Theodore Ts'o

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Mon, Aug 24, 2009 at 11:19:01AM +0000, Florian Weimer wrote:
> * Pavel Machek:
>
> > +Linux block-backed filesystems can only work correctly when several
> > +conditions are met in the block layer and below (disks, flash
> > +cards). Some of them are obvious ("data on media should not change
> > +randomly"), some are less so.
>
> You should make clear that the file lists per-file-system rules and
> that some file systems can recover from some of the error conditions.

The only one that falls into that category is the one about not being
able to handle failed writes, and the way most failures take place,
they generally fail the ATOMIC-WRITES criterion in any case. That is,
when a write fails, an attempt to read from that sector will generally
result in either (a) an error, or (b) data other than what was there
before the write was attempted.

> > +* don't damage the old data on a failed write (ATOMIC-WRITES)
> > +
> > + (Thrash may get written into sectors during powerfail. And
> > + ext3 handles this surprisingly well at least in the
> > + catastrophic case of garbage getting written into the inode
> > + table, since the journal replay often will "repair" the
> > + garbage that was written into the filesystem metadata blocks.
>
> Isn't this by design? In other words, if the metadata doesn't survive
> non-atomic writes, wouldn't it be an ext3 bug?

Part of the problem here is that "atomic-writes" is confusing; it
doesn't mean what many people think it means. The assumption which
many naive filesystem designers make is that writes succeed or they
don't. If they don't succeed, they don't change the previously
existing data in any way.

So in the case of journalling, the assumption which gets made is that
when the power fails, the disk either writes a particular disk block,
or it doesn't. The problem here is that, as with humans and animals, death
is not an event, it is a process. When the power fails, the system
doesn't just stop functioning; the power on the +5 and +12 volt rails
starts dropping to zero, and different components fail at different
times. Specifically, DRAM, being the most voltage sensitive, tends to
fail before the DMA subsystem, the PCI bus, and the hard drive do.
So as a result, garbage can get written out to disk as part of the
failure. That's just the way hardware works.

Now consider a file system which does logical journalling. It has
written to the journal, using a compact encoding, "the i_blocks field
is now 25, and i_size is 13000", and the journal transaction has
committed. So now, it's time to update the inode on disk; but at that
precise moment, the power fails, and garbage is written to the
inode table. Oops! The entire sector containing the inode is
trashed. But the only thing recorded in the journal is the new
value of i_blocks and i_size. So a journal replay won't help file
systems that do logical block journalling.
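
To make that distinction concrete, here is a rough sketch of the two
kinds of journal record (the field names are illustrative, not jbd2's
actual on-disk format): a logical record describes only the change, so
replay still depends on the rest of the inode table block being intact,
while a physical record carries a full image of the block, so replay
simply overwrites whatever garbage landed there.

    /* Illustrative sketch only -- not the real jbd2 layout. */
    #include <stdint.h>

    #define BLOCK_SIZE 4096

    /* Logical journalling: the record describes the change.  Replay must
     * read-modify-write the current inode table block, which may now be
     * garbage after a botched write at powerfail. */
    struct logical_rec {
        uint32_t inum;       /* which inode */
        uint32_t i_blocks;   /* new value, e.g. 25 */
        uint64_t i_size;     /* new value, e.g. 13000 */
    };

    /* Physical block journalling (ext3/ext4/ocfs2 style): the record is a
     * complete image of the metadata block.  Replay rewrites the whole
     * block, papering over garbage written to that sector at powerfail. */
    struct physical_rec {
        uint64_t blocknr;            /* which on-disk block */
        uint8_t  data[BLOCK_SIZE];   /* full new contents of that block */
    };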

Is that a file system "bug"? Well, it's better to call that a
mismatch between the assumptions made of physical devices, and of the
file system code. On Irix, SGI hardware had a powerfail interrupt,
and a power supply with extra-big capacitors, so that when a powerfail
interrupt came in, Irix would run around frantically shutting
down pending DMA transfers to prevent this failure mode from causing
problems. PC class hardware (according to Ted's law) is cr*p, and
doesn't have a powerfail interrupt, so it's not something that we
have.

Ext3, ext4, and ocfs2 do physical block journalling, so as long as
journal truncate hasn't taken place right before the failure, the
replay of the physical block journal tends to repair most (but
not necessarily all) cases of "garbage is written right before power
failure". People who care about this should really use a UPS, and
wire up the USB and/or serial cable from the UPS to the system, so
that the OS can do a controlled shutdown if the UPS is close to
shutting down due to an extended power failure.


There is another kind of non-atomic write that nearly all file systems
are subject to, however, and to give an example of this, consider what
happens if a laptop is subjected to a sudden shock while it is
writing a sector, and the hard drive doesn't have an accelerometer which
tries to anticipate such shocks. (nb, these things aren't
fool-proof; even if a HDD has one of these sensors, they only work if
they can detect the transition to free-fall, and the hard drive has
time to retract the heads before the actual shock hits; if you have a
sudden shock, the g-shock sensors won't have time to react and save
the hard drive).

Depending on how severe the shock happens to be, the head could end up
impacting the platter, destroying the medium (which used to be
iron-oxide; hence the term "spinning rust platters") at that spot.
This will obviously cause a write failure, and the previous contents
of the sector will be lost. This is also considered a failure of the
ATOMIC-WRITE property, and no, ext3 doesn't handle this case
gracefully. Very few file systems do. (It is possible for an OS that
doesn't have fixed metadata to immediately write the inode table to a
different location on the disk, and then update the pointers to the
inode table to point to the new location on disk; but very few
filesystems do this, and even those that do usually rely on the
superblock being available on a fixed location on disk. It's much
simpler to assume that hard drives usually behave sanely, and that
writes very rarely fail.)
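
A minimal sketch of that relocate-instead-of-overwrite idea (roughly the
shadow-paging / copy-on-write approach); write_block() and flush() here
are hypothetical stand-ins for real block I/O, not a kernel API:

    #include <stdint.h>

    struct superblock { uint64_t inode_table_blocknr; /* ... */ };

    extern void write_block(uint64_t blocknr, const void *data); /* assumed helper */
    extern void flush(void);                                     /* assumed barrier */

    void cow_update_inode_table(struct superblock *sb, const void *new_table,
                                uint64_t free_blocknr)
    {
        /* 1. Write the new copy to unused space; a failed or garbage write
         *    here only damages a block nothing points to yet. */
        write_block(free_blocknr, new_table);
        flush();

        /* 2. Only then swing the pointer to the new copy.  Until this one
         *    write lands, readers still find the old, intact table.  Note
         *    that this still assumes the fixed-location superblock write
         *    itself doesn't get trashed. */
        sb->inode_table_blocknr = free_blocknr;
        write_block(0 /* fixed superblock location */, sb);
        flush();
    }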

It's for this reason that I've never been completely sure how useful
Pavel's proposed treatise about file system expectations really is
--- because all storage subsystems *usually* provide these guarantees,
but it is the very rare storage system that *always* provides these
guarantees.

We could just as easily have several kilobytes of explanation in
Documentation/* explaining how we assume that DRAM always returns the
same value that was stored in it previously --- and yet most PC class
hardware still does not use ECC memory, and cosmic rays are a reality.
That means that most Linux systems run on systems that are vulnerable
to this kind of failure --- and the world hasn't ended.

As I recall, the main problem which Pavel had was when he was using
ext3 on a *really* trashy flash drive, on a *really* trashy laptop
where the flash card stuck out slightly, and any jostling of the
netbook would cause the flash card to become disconnected from the
laptop, and cause write errors, very easily and very frequently. In
those circumstances, it's highly unlikely that ***any*** file system
would have been able to survive such an unreliable storage system.


One of the problems I have with the breakdown which Pavel has used is
that it doesn't break things down according to probability; the chance
of a storage subsystem scribbling garbage on its last write during a
power failure is very different from the chance that the hard drive
fails due to a shock, or due to spilled printer toner near the
disk drive which somehow manages to find its way inside the enclosure
containing the spinning platters, versus the other forms of random
failures that lead to write failures. All of these fall into the
category of a failure of the property he has named "ATOMIC-WRITE", but
in fact the ways in which the filesystem might try to protect itself are
varied, and it isn't necessarily all or nothing. One can imagine a
file system which can handle write failures for data blocks, but not
for metadata blocks; given that data blocks outnumber metadata blocks
by hundreds to one, and that write failures are relatively rare
(unless you have said trashy laptop with a trashy flash card), a file
system that can gracefully deal with data block failures would be a
useful advancement.

But these things are never absolute, mainly because people aren't
willing to pay either the cost of superior hardware (consider the
cost of ECC memory, which isn't *that* much more expensive; and yet
most PC class systems don't use it) or the cost in software overhead
(historically many file system designers have eschewed the use of
physical block journalling because it really hurts on meta-data
intensive benchmarks). So talking about absolute requirements for
ATOMIC-WRITE isn't all that useful --- because nearly all hardware
doesn't provide these guarantees, and nearly all filesystems require
them. To call out ext2 and ext3 for requiring them, without making
clear that pretty much *all* file systems require them, ends up
causing people to switch over to some other file system that,
ironically enough, might end up being *more* vulnerable, but which
didn't earn Pavel's displeasure because he didn't try using, say, XFS
on his flashcard on his trashy laptop.

- Ted

2009-08-24 13:50:35

by Theodore Ts'o

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Mon, Aug 24, 2009 at 11:19:01AM +0000, Florian Weimer wrote:
> > +* don't damage the old data on a failed write (ATOMIC-WRITES)
> > +
> > + (Thrash may get written into sectors during powerfail. And
> > + ext3 handles this surprisingly well at least in the
> > + catastrophic case of garbage getting written into the inode
> > + table, since the journal replay often will "repair" the
> > + garbage that was written into the filesystem metadata blocks.
>
> Isn't this by design? In other words, if the metadata doesn't survive
> non-atomic writes, wouldn't it be an ext3 bug?

So I got confused when I quoted your note, which I had assumed was
exactly what Pavel had written in his documentation. In fact, what he
had written was this:

+Don't damage the old data on a failed write (ATOMIC-WRITES)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+....

So he had explicitly stated that he only cared about the whole sector
being written (or not written) in the power fail case, and not any
other. I'd suggest changing ATOMIC-WRITES to
ATOMIC-WRITE-ON-POWERFAIL, since the one-line summary, "Don't damage
the old data on a failed write", is also singularly misleading.

- Ted

2009-08-24 14:55:53

by Artem Bityutskiy

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

Hi Theodore,

thanks for the insightful writing.

On 08/24/2009 04:01 PM, Theodore Tso wrote:

...snip ...

> It's for this reason that I've never been completely sure how useful
> Pavel's proposed treatise about file systems expectations really are
> --- because all storage subsystems *usually* provide these guarantees,
> but it is the very rare storage system that *always* provides these
> guarantees.

There is a thing called eMMC (embedded MMC) in the embedded world. You
may consider it as a non-removable MMC. This thing is a block device from
the Linux POV, and you may mount ext3 on top of it. And people do this.

The device seems to have a decent FTL, and does not look bad.

However, there are subtle things which mortals never think about. In
the case of eMMC, power cuts may make some sectors unreadable - eMMC returns
ECC errors on reads. Namely, the sectors which were being written at
the very moment when the power cut happened may become unreadable.
And this makes ext3 refuse to mount the file-system, and makes
fsck.ext3 refuse to check it. This should be fixable in
software, but we have not found time to do it so far.
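
For what it's worth, a minimal sketch of such a software fix, assuming
the usual FTL behaviour that rewriting an unreadable LBA remaps it and
clears the ECC error (the device path, the 512-byte granularity and the
zero-fill policy are all just illustrative; whatever was being written
at the power cut is of course lost):

    /* Sketch: scan a block device and zero-fill any sector that returns a
     * read error, so that the file-system can be mounted and fsck'ed again.
     * Assumes the FTL clears the ECC error once the LBA is rewritten. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define SECTOR 512

    int main(int argc, char **argv)
    {
        unsigned char buf[SECTOR], zeroes[SECTOR];
        off_t off = 0;
        int fd = open(argc > 1 ? argv[1] : "/dev/mmcblk0", O_RDWR);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        memset(zeroes, 0, sizeof(zeroes));
        for (;;) {
            ssize_t n = pread(fd, buf, SECTOR, off);
            if (n == 0)
                break;          /* end of device */
            if (n < 0) {
                fprintf(stderr, "unreadable sector at offset %lld, rewriting\n",
                        (long long)off);
                if (pwrite(fd, zeroes, SECTOR, off) != SECTOR)
                    perror("pwrite");
            }
            off += SECTOR;
        }
        close(fd);
        return 0;
    }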

Anyway, my point is that documenting subtle things like this is a very
good thing to do, just because nowadays we are trying to use existing
software with flash-based storage devices, which may violate these
subtle assumptions, or introduce other ones.

Probably, Pavel did too good a job of generalizing things, and it could be
better to make a doc about HDD vs SSD or HDD vs Flash-based-storage.
Not sure. But the idea of documenting subtle FS assumptions is good, IMO.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-08-24 18:39:02

by Pavel Machek

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

Hi!

> > +Linux block-backed filesystems can only work correctly when several
> > +conditions are met in the block layer and below (disks, flash
> > +cards). Some of them are obvious ("data on media should not change
> > +randomly"), some are less so.
>
> You should make clear that the file lists per-file-system rules and
> that some file systems can recover from some of the error conditions.

Ok, I added a "Not all filesystems require all of these
to be satisfied for safe operation" sentence there.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-24 18:48:48

by Pavel Machek

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

Hi!

> > > +* don't damage the old data on a failed write (ATOMIC-WRITES)
> > > +
> > > + (Thrash may get written into sectors during powerfail. And
> > > + ext3 handles this surprisingly well at least in the
> > > + catastrophic case of garbage getting written into the inode
> > > + table, since the journal replay often will "repair" the
> > > + garbage that was written into the filesystem metadata blocks.
> >
> > Isn't this by design? In other words, if the metadata doesn't survive
> > non-atomic writes, wouldn't it be an ext3 bug?
>
> So I got confused when I quoted your note, which I had assumed was
> exactly what Pavel had written in his documentation. In fact, what he
> had written was this:
>
> +Don't damage the old data on a failed write (ATOMIC-WRITES)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> +....
>
> So he had explicitly stated that he only cared about the whole sector
> being written (or not written) in the power fail case, and not any
> other. I'd suggest changing ATOMIC-WRITES to
> ATOMIC-WRITE-ON-POWERFAIL, since the one-line summary, "Don't damage
> the old data on a failed write", is also singularly misleading.

Ok, something like this?

Don't damage the old data on a powerfail (ATOMIC-WRITES-ON-POWERFAIL)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Either whole sector is correctly written or nothing is written during
powerfail.


Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-24 19:52:03

by Pavel Machek

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

Hi!

> > Isn't this by design? In other words, if the metadata doesn't survive
> > non-atomic writes, wouldn't it be an ext3 bug?
>
> Part of the problem here is that "atomic-writes" is confusing; it
> doesn't mean what many people think it means. The assumption which
> many naive filesystem designers make is that writes succeed or they
> don't. If they don't succeed, they don't change the previously
> existing data in any way.
>
> So in the case of journalling, the assumption which gets made is that
> when the power fails, the disk either writes a particular disk block,
> or it doesn't. The problem here is as with humans and animals, death
> is not an event, it is a process. When the power fails, the system
> just doesn't stop functioning; the power on the +5 and +12 volt rails
> start dropping to zero, and different components fail at different
> times. Specifically, DRAM, being the most voltage sensitive, tends to
> fail before the DMA subsystem, the PCI bus, and the hard drive fails.
> So as a result, garbage can get written out to disk as part of the
> failure. That's just the way hardware works.

Yep, and at that point you lost data. You had "silent data corruption"
from fs point of view, and that's bad.

It will be probably very bad on XFS, probably okay on Ext3, and
certainly okay on Ext2: you do filesystem check, and you should be
able to repair any damage. So yes, physical journaling is good, but
fsck is better.

> Is that a file system "bug"? Well, it's better to call that a
> mismatch between the assumptions made of physical devices, and of the
> file system code. On Irix, SGI hardware had a powerfail interrupt,

If those filesystem assumptions were not documented, I'd call it
filesystem bug. So better document them ;-).

> There is another kind of non-atomic write that nearly all file systems
> are subject to, however, and to give an example of this, consider what
> happens if a laptop is subjected to a sudden shock while it is
> writing a sector, and the hard drive doesn't have an accelerometer which
...
> Depending on how severe the shock happens to be, the head could end up
> impacting the platter, destroying the medium (which used to be
> iron-oxide; hence the term "spinning rust platters") at that spot.
> This will obviously cause a write failure, and the previous contents
> of the sector will be lost. This is also considered a failure of the
> ATOMIC-WRITE property, and no, ext3 doesn't handle this case
> gracefully. Very few file systems do. (It is possible for an OS
> that

Actually, ext2 should be able to survive that, no? Error writing ->
remount ro -> fsck on next boot -> drive relocates the sectors.

> It's for this reason that I've never been completely sure how useful
> Pavel's proposed treatise about file system expectations really is
> --- because all storage subsystems *usually* provide these guarantees,
> but it is the very rare storage system that *always* provides these
> guarantees.

Well... there's very big difference between harddrives and flash
memory. Harddrives usually work, and flash memory never does.

> We could just as easily have several kilobytes of explanation in
> Documentation/* explaining how we assume that DRAM always returns the
> same value that was stored in it previously --- and yet most PC class
> hardware still does not use ECC memory, and cosmic rays are a reality.
> That means that most Linux systems run on systems that are vulnerable
> to this kind of failure --- and the world hasn't ended.

There's a difference. In case of cosmic rays, hardware is clearly
buggy. I have one machine with bad DRAM (about 1 error in 2 days),
and I still use it. I will not complain if ext3 trashes that.

In case of degraded raid-5, even with perfect hardware, and with
ext3 on top of that, you'll get silent data corruption. Nice, eh?

Clearly, Linux is buggy there. It could be argued it is raid-5's
fault, or maybe it is ext3's fault, but... linux is still buggy.

> As I recall, the main problem which Pavel had was when he was using
> ext3 on a *really* trashy flash drive, on a *really* trashy laptop
> where the flash card stuck out slightly, and any jostling of the
> netbook would cause the flash card to become disconnected from the
> laptop, and cause write errors, very easily and very frequently. In
> those circumstances, it's highly unlikely that ***any*** file system
> would have been able to survive such an unreliable storage system.

Well well well. Before I pulled that flash card, I assumed that doing
so is safe, because flashcard is presented as block device and ext3
should cope with sudden disk disconnects.

And I was wrong wrong wrong. (No one told me at the university. I guess
I should want my money back).

Plus note that it is not only my trashy laptop and one trashy MMC
card; every USB thumb drive I've seen is affected. (OTOH USB disks should
be safe AFAICT).

Ext3 is unsuitable for flash cards and RAID arrays, plain and
simple. It is not documented anywhere :-(. [ext2 should work better --
at least you'll not get silent data corruption.]

> One of the problems I have with the break down which Pavel has used is
> that it doesn't break things down according to probability; the chance
> of a storage subsystem scribbling garbage on its last write during a

Can you suggest better patch? I'm not saying we should redesign ext3,
but... someone should have told me that ext3+USB thumb drive=problems.

> But these things are never absolute, mainly because people aren't
> willing to pay for either the cost of superior hardware (consider the
> cost of ECC memory, which isn't *that* much more expensive; and yet
> most PC class systems don't use it) or in terms of software overhead
> (historically many file system designers have eschewed the use of
> physical block journalling because it really hurts on meta-data
> intensive benchmarks), talking about absolute requirements for
> ATOMIC-WRITE isn't all that useful --- because nearly all hardware
> doesn't provide these guarantees, and nearly all filesystems require
> them. So to call out ext2 and ext3 for requiring them, without
> making

ext3+raid5 will fail even if you have perfect hardware.

> clear that pretty much *all* file systems require them, ends up
> causing people to switch over to some other file system that
> ironically enough, might end up being *more* vulnerable, but which
> didn't earn Pavel's displeasure because he didn't try using, say, XFS
> on his flashcard on his trashy laptop.

I hold ext2/ext3 to higher standards than other filesystem in
tree. I'd not use XFS/VFAT etc.

I would not want people to migrate towards XFS/VFAT, and yes I believe
XFSs/VFATs/... requirements should be documented, too. (But I know too
little about those filesystems).

If you can suggest better wording, please help me. But... those
requirements are non-trivial, commonly not met and the result is data
loss. It has to be documented somehow. Make it as innocent-looking as
you can...

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-24 20:24:28

by Ric Wheeler

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

Pavel Machek wrote:
> Hi!
>
>
>>> Isn't this by design? In other words, if the metadata doesn't survive
>>> non-atomic writes, wouldn't it be an ext3 bug?
>>>
>> Part of the problem here is that "atomic-writes" is confusing; it
>> doesn't mean what many people think it means. The assumption which
>> many naive filesystem designers make is that writes succeed or they
>> don't. If they don't succeed, they don't change the previously
>> existing data in any way.
>>
>> So in the case of journalling, the assumption which gets made is that
>> when the power fails, the disk either writes a particular disk block,
>> or it doesn't. The problem here is as with humans and animals, death
>> is not an event, it is a process. When the power fails, the system
>> just doesn't stop functioning; the power on the +5 and +12 volt rails
>> start dropping to zero, and different components fail at different
>> times. Specifically, DRAM, being the most voltage sensitive, tends to
>> fail before the DMA subsystem, the PCI bus, and the hard drive fails.
>> So as a result, garbage can get written out to disk as part of the
>> failure. That's just the way hardware works.
>>
>
> Yep, and at that point you lost data. You had "silent data corruption"
> from fs point of view, and that's bad.
>
> It will be probably very bad on XFS, probably okay on Ext3, and
> certainly okay on Ext2: you do filesystem check, and you should be
> able to repair any damage. So yes, physical journaling is good, but
> fsck is better.
>

I don't see why you think that. In general, fsck (for any fs) only
checks metadata. If you have silent data corruption that corrupts things
that are fixable by fsck, you most likely have silent corruption hitting
things users care about like their data blocks inside of files. Fsck
will not fix (or notice) any of that, that is where things like full
data checksums can help.

Also note (from first-hand experience): unless you check and validate
your data, you can have data corruptions that will not get flagged as IO
errors, so data signing or scrubbing is a critical part of data integrity.
>
>> Is that a file system "bug"? Well, it's better to call that a
>> mismatch between the assumptions made of physical devices, and of the
>> file system code. On Irix, SGI hardware had a powerfail interrupt,
>>
>
> If those filesystem assumptions were not documented, I'd call it
> filesystem bug. So better document them ;-).
>
>
I think that we need to help people understand the full spectrum of data
concerns, starting with reasonable best practices that will help most
people suffer *less* (not no) data loss. And make very sure that they
are not falsely assured that by following any specific script they
can skip backups, remote backups, etc :-)

Nothing in our code in any part of the kernel deals well with every
disaster or odd event.

>> There is another kind of non-atomic write that nearly all file systems
>> are subject to, however, and to give an example of this, consider what
>> happens if a laptop is subjected to a sudden shock while it is
>> writing a sector, and the hard drive doesn't have an accelerometer which
>>
> ...
>
>> Depending on how severe the shock happens to be, the head could end up
>> impacting the platter, destroying the medium (which used to be
>> iron-oxide; hence the term "spinning rust platters") at that spot.
>> This will obviously cause a write failure, and the previous contents
>> of the sector will be lost. This is also considered a failure of the
>> ATOMIC-WRITE property, and no, ext3 doesn't handle this case
>> gracefully. Very few file systems do. (It is possible for an OS
>> that
>>
>
> Actually, ext2 should be able to survive that, no? Error writing ->
> remount ro -> fsck on next boot -> drive relocates the sectors.
>

I think that the example and the response are both off base. If your
head ever touches the platter, you won't be reading from a huge part of
your drive ever again (usually, you have 2 heads per platter, 3-4
platters, impact would kill one head and a corresponding percentage of
your data).

No file system will recover that data although you might be able to
scrape out some remaining useful bits and bytes.

More common causes of silent corruption would be bad DRAM in things like
the drive write cache, hot spots (that cause adjacent track data
errors), etc. Note in this last case, your most recently written data
is fine, just the data you wrote months/years ago is toast!
>
>> It's for this reason that I've never been completely sure how useful
>> Pavel's proposed treatise about file system expectations really is
>> --- because all storage subsystems *usually* provide these guarantees,
>> but it is the very rare storage system that *always* provides these
>> guarantees.
>>
>
> Well... there's very big difference between harddrives and flash
> memory. Harddrives usually work, and flash memory never does.
>

It is hard for anyone to see the real data without looking in detail at
large numbers of parts. Back at EMC, we looked at failures for lots of
parts so we got a clear grasp on trends. I do agree that flash/SSD
parts are still very young so we will have interesting and unexpected
failure modes to learn to deal with....
>
>> We could just as easily have several kilobytes of explanation in
>> Documentation/* explaining how we assume that DRAM always returns the
>> same value that was stored in it previously --- and yet most PC class
>> hardware still does not use ECC memory, and cosmic rays are a reality.
>> That means that most Linux systems run on systems that are vulnerable
>> to this kind of failure --- and the world hasn't ended.
>>
>
> There's a difference. In case of cosmic rays, hardware is clearly
> buggy. I have one machine with bad DRAM (about 1 error in 2 days),
> and I still use it. I will not complain if ext3 trashes that.
>
> In case of degraded raid-5, even with perfect hardware, and with
> ext3 on top of that, you'll get silent data corruption. Nice, eh?
>
> Clearly, Linux is buggy there. It could be argued it is raid-5's
> fault, or maybe it is ext3's fault, but... linux is still buggy.
>

Nothing is perfect. It is still a trade off between storage utilization
(how much storage we give users for say 5 2TB drives), performance and
costs (throw away any disks over 2 years old?).
>
>> As I recall, the main problem which Pavel had was when he was using
>> ext3 on a *really* trashy flash drive, on a *really* trashy laptop
>> where the flash card stuck out slightly, and any jostling of the
>> netbook would cause the flash card to become disconnected from the
>> laptop, and cause write errors, very easily and very frequently. In
>> those circumstances, it's highly unlikely that ***any*** file system
>> would have been able to survive such an unreliable storage system.
>>
>
> Well well well. Before I pulled that flash card, I assumed that doing
> so is safe, because flashcard is presented as block device and ext3
> should cope with sudden disk disconnects.
>
> And I was wrong wrong wrong. (No one told me at the university. I guess
> I should want my money back).
>
> Plus note that it is not only my trashy laptop and one trashy MMC
> card; every USB thumb drive I've seen is affected. (OTOH USB disks should
> be safe AFAICT).
>
> Ext3 is unsuitable for flash cards and RAID arrays, plain and
> simple. It is not documented anywhere :-(. [ext2 should work better --
> at least you'll not get silent data corruption.]
>

ext3 is used on lots of raid arrays without any issue.
>
>> One of the problems I have with the break down which Pavel has used is
>> that it doesn't break things down according to probability; the chance
>> of a storage subsystem scribbling garbage on its last write during a
>>
>
> Can you suggest better patch? I'm not saying we should redesign ext3,
> but... someone should have told me that ext3+USB thumb drive=problems.
>
>
>> But these things are never absolute, mainly because people aren't
>> willing to pay for either the cost of superior hardware (consider the
>> cost of ECC memory, which isn't *that* much more expensive; and yet
>> most PC class systems don't use it) or in terms of software overhead
>> (historically many file system designers have eschewed the use of
>> physical block journalling because it really hurts on meta-data
>> intensive benchmarks), talking about absolute requirements for
>> ATOMIC-WRITE isn't all that useful --- because nearly all hardware
>> doesn't provide these guarantees, and nearly all filesystems require
>> them. So to call out ext2 and ext3 for requiring them, without
>> making
>>
>
> ext3+raid5 will fail even if you have perfect hardware.
>
>
>> clear that pretty much *all* file systems require them, ends up
>> causing people to switch over to some other file system that
>> ironically enough, might end up being *more* vulnerable, but which
>> didn't earn Pavel's displeasure because he didn't try using, say, XFS
>> on his flashcard on his trashy laptop.
>>
>
> I hold ext2/ext3 to higher standards than other filesystem in
> tree. I'd not use XFS/VFAT etc.
>
> I would not want people to migrate towards XFS/VFAT, and yes I believe
> XFSs/VFATs/... requirements should be documented, too. (But I know too
> little about those filesystems).
>
> If you can suggest better wording, please help me. But... those
> requirements are non-trivial, commonly not met and the result is data
> loss. It has to be documented somehow. Make it as innocent-looking as
> you can...
>
> Pavel
>

I think that you really need to step back and look harder at real
failures - not just your personal experience - but a larger set of real
world failures. Many papers have been published recently about that (the
google paper, the Bianca paper from FAST, Netapp, etc).

Regards,

Ric



2009-08-24 20:52:09

by Pavel Machek

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

Hi!

>> Yep, and at that point you lost data. You had "silent data corruption"
>> from fs point of view, and that's bad.
>>
>> It will be probably very bad on XFS, probably okay on Ext3, and
>> certainly okay on Ext2: you do filesystem check, and you should be
>> able to repair any damage. So yes, physical journaling is good, but
>> fsck is better.
>
> I don't see why you think that. In general, fsck (for any fs) only
> checks metadata. If you have silent data corruption that corrupts things
> that are fixable by fsck, you most likely have silent corruption hitting
> things users care about like their data blocks inside of files. Fsck
> will not fix (or notice) any of that, that is where things like full
> data checksums can help.

Ok, but in case of data corruption, at least your filesystem does not
degrade further.

>> If those filesystem assumptions were not documented, I'd call it
>> filesystem bug. So better document them ;-).
>>
> I think that we need to help people understand the full spectrum of data
> concerns, starting with reasonable best practices that will help most
> people suffer *less* (not no) data loss. And make very sure that they
> are not falsely assured that by following any specific script that they
> can skip backups, remote backups, etc :-)
>
> Nothing in our code in any part of the kernel deals well with every
> disaster or odd event.

I can reproduce data loss with ext3 on flashcard in about 40
seconds. I'd not call that "odd event". It would be nice to handle
that, but that is hard. So ... can we at least get that documented
please?


>> Actually, ext2 should be able to survive that, no? Error writing ->
>> remount ro -> fsck on next boot -> drive relocates the sectors.
>>
>
> I think that the example and the response are both off base. If your
> head ever touches the platter, you won't be reading from a huge part of
> your drive ever again (usually, you have 2 heads per platter, 3-4
> platters, impact would kill one head and a corresponding percentage of
> your data).

Ok, that's obviously game over.

>>> It's for this reason that I've never been completely sure how useful
>>> Pavel's proposed treatise about file system expectations really is
>>> --- because all storage subsystems *usually* provide these guarantees,
>>> but it is the very rare storage system that *always* provides these
>>> guarantees.
>>
>> Well... there's very big difference between harddrives and flash
>> memory. Harddrives usually work, and flash memory never does.
>
> It is hard for anyone to see the real data without looking in detail at
> large numbers of parts. Back at EMC, we looked at failures for lots of
> parts so we got a clear grasp on trends. I do agree that flash/SSD
> parts are still very young so we will have interesting and unexpected
> failure modes to learn to deal with....

_Maybe_ SSDs, being HDD replacements are better. I don't know.

_All_ flash cards (MMC, USB, SD) had the problems. You don't need to
get clear grasp on trends. Those cards just don't meet ext3
expectations, and if you pull them, you get data loss.

>>> We could just as easily have several kilobytes of explanation in
>>> Documentation/* explaining how we assume that DRAM always returns the
>>> same value that was stored in it previously --- and yet most PC class
>>> hardware still does not use ECC memory, and cosmic rays are a reality.
>>> That means that most Linux systems run on systems that are vulnerable
>>> to this kind of failure --- and the world hasn't ended.

>> There's a difference. In case of cosmic rays, hardware is clearly
>> buggy. I have one machine with bad DRAM (about 1 error in 2 days),
>> and I still use it. I will not complain if ext3 trashes that.
>>
>> In case of degraded raid-5, even with perfect hardware, and with
>> ext3 on top of that, you'll get silent data corruption. Nice, eh?
>>
>> Clearly, Linux is buggy there. It could be argued it is raid-5's
>> fault, or maybe it is ext3's fault, but... linux is still buggy.
>
> Nothing is perfect. It is still a trade off between storage utilization
> (how much storage we give users for say 5 2TB drives), performance and
> costs (throw away any disks over 2 years old?).

"Nothing is perfect"?! That's design decision/problem in raid5/ext3. I
believe that should be at least documented. (And understand why ZFS is
interesting thing).

>> Ext3 is unsuitable for flash cards and RAID arrays, plain and
>> simple. It is not documented anywhere :-(. [ext2 should work better --
>> at least you'll not get silent data corruption.]
>
> ext3 is used on lots of raid arrays without any issue.

And I still use my zaurus with crappy DRAM.

I would not trust raid5 array with my data, for multiple
reasons. The fact that degraded raid5 breaks ext3 assumptions should
really be documented.

>> I hold ext2/ext3 to higher standards than other filesystem in
>> tree. I'd not use XFS/VFAT etc.
>>
>> I would not want people to migrate towards XFS/VFAT, and yes I believe
>> XFSs/VFATs/... requirements should be documented, too. (But I know too
>> little about those filesystems).
>>
>> If you can suggest better wording, please help me. But... those
>> requirements are non-trivial, commonly not met and the result is data
>> loss. It has to be documented somehow. Make it as innocent-looking as
>> you can...

>
> I think that you really need to step back and look harder at real
> failures - not just your personal experience - but a larger set of real
> world failures. Many papers have been published recently about that (the
> google paper, the Bianca paper from FAST, Netapp, etc).

The papers show failures in "once a year" range. I have "twice a
minute" failure scenario with flashdisks.

Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
but I bet it would be on "once a day" scale.

We should document those.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-24 21:09:09

by Ric Wheeler

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

Pavel Machek wrote:
> Hi!
>
>
>>> Yep, and at that point you lost data. You had "silent data corruption"
>>> from fs point of view, and that's bad.
>>>
>>> It will be probably very bad on XFS, probably okay on Ext3, and
>>> certainly okay on Ext2: you do filesystem check, and you should be
>>> able to repair any damage. So yes, physical journaling is good, but
>>> fsck is better.
>>>
>> I don't see why you think that. In general, fsck (for any fs) only
>> checks metadata. If you have silent data corruption that corrupts things
>> that are fixable by fsck, you most likely have silent corruption hitting
>> things users care about like their data blocks inside of files. Fsck
>> will not fix (or notice) any of that, that is where things like full
>> data checksums can help.
>>
>
> Ok, but in case of data corruption, at least your filesystem does not
> degrade further.
>
>
Even worse, your data is potentially gone and you have not noticed
it... This is why array vendors and archival storage products do
periodic scans of all stored data (read all the bytes, compare them to a
digital signature, etc).
>>> If those filesystem assumptions were not documented, I'd call it
>>> filesystem bug. So better document them ;-).
>>>
>>>
>> I think that we need to help people understand the full spectrum of data
>> concerns, starting with reasonable best practices that will help most
>> people suffer *less* (not no) data loss. And make very sure that they
>> are not falsely assured that by following any specific script that they
>> can skip backups, remote backups, etc :-)
>>
>> Nothing in our code in any part of the kernel deals well with every
>> disaster or odd event.
>>
>
> I can reproduce data loss with ext3 on flashcard in about 40
> seconds. I'd not call that "odd event". It would be nice to handle
> that, but that is hard. So ... can we at least get that documented
> please?
>

Part of documenting best practices is to put down very specific things
that do/don't work. What I worry about is producing too much detail to
be of use for real end users.

I have to admit that I have not paid enough attention to the specifics
of your ext3 + flash card issue - is it the ftl stuff doing out of order
IO's?
>
>
>>> Actually, ext2 should be able to survive that, no? Error writing ->
>>> remount ro -> fsck on next boot -> drive relocates the sectors.
>>>
>>>
>> I think that the example and the response are both off base. If your
>> head ever touches the platter, you won't be reading from a huge part of
>> your drive ever again (usually, you have 2 heads per platter, 3-4
>> platters, impact would kill one head and a corresponding percentage of
>> your data).
>>
>
> Ok, that's obviously game over.
>

This is when you start seeing lots of READ and WRITE errors :-)
>
>>>> It's for this reason that I've never been completely sure how useful
>>>> Pavel's proposed treatise about file system expectations really is
>>>> --- because all storage subsystems *usually* provide these guarantees,
>>>> but it is the very rare storage system that *always* provides these
>>>> guarantees.
>>>>
>>> Well... there's very big difference between harddrives and flash
>>> memory. Harddrives usually work, and flash memory never does.
>>>
>> It is hard for anyone to see the real data without looking in detail at
>> large numbers of parts. Back at EMC, we looked at failures for lots of
>> parts so we got a clear grasp on trends. I do agree that flash/SSD
>> parts are still very young so we will have interesting and unexpected
>> failure modes to learn to deal with....
>>
>
> _Maybe_ SSDs, being HDD replacements are better. I don't know.
>
> _All_ flash cards (MMC, USB, SD) had the problems. You don't need to
> get clear grasp on trends. Those cards just don't meet ext3
> expectations, and if you pull them, you get data loss.
>
>
Pull them even after an unmount, or pull them hot?
>>>> We could just as easily have several kilobytes of explanation in
>>>> Documentation/* explaining how we assume that DRAM always returns the
>>>> same value that was stored in it previously --- and yet most PC class
>>>> hardware still does not use ECC memory, and cosmic rays are a reality.
>>>> That means that most Linux systems run on systems that are vulnerable
>>>> to this kind of failure --- and the world hasn't ended.
>>>>
>
>
>>> There's a difference. In case of cosmic rays, hardware is clearly
>>> buggy. I have one machine with bad DRAM (about 1 error in 2 days),
>>> and I still use it. I will not complain if ext3 trashes that.
>>>
>>> In case of degraded raid-5, even with perfect hardware, and with
>>> ext3 on top of that, you'll get silent data corruption. Nice, eh?
>>>
>>> Clearly, Linux is buggy there. It could be argued it is raid-5's
>>> fault, or maybe it is ext3's fault, but... linux is still buggy.
>>>
>> Nothing is perfect. It is still a trade off between storage utilization
>> (how much storage we give users for say 5 2TB drives), performance and
>> costs (throw away any disks over 2 years old?).
>>
>
> "Nothing is perfect"?! That's design decision/problem in raid5/ext3. I
> believe that should be at least documented. (And understand why ZFS is
> interesting thing).
>
>
Your statement is overly broad - ext3 on a commercial RAID array that
does RAID5 or RAID6, etc has no issues that I know of.

Do you know first hand that ZFS works on flash cards?
>>> Ext3 is unsuitable for flash cards and RAID arrays, plain and
>>> simple. It is not documented anywhere :-(. [ext2 should work better --
>>> at least you'll not get silent data corruption.]
>>>
>> ext3 is used on lots of raid arrays without any issue.
>>
>
> And I still use my zaurus with crappy DRAM.
>
> I would not trust raid5 array with my data, for multiple
> reasons. The fact that degraded raid5 breaks ext3 assumptions should
> really be documented.
>

Again, you say RAID5 without enough specifics. Are you pointing just at
MD RAID5 on S-ATA? Hardware RAID cards? A specific commercial RAID5 vendor?
>
>>> I hold ext2/ext3 to higher standards than other filesystem in
>>> tree. I'd not use XFS/VFAT etc.
>>>
>>> I would not want people to migrate towards XFS/VFAT, and yes I believe
>>> XFSs/VFATs/... requirements should be documented, too. (But I know too
>>> little about those filesystems).
>>>
>>> If you can suggest better wording, please help me. But... those
>>> requirements are non-trivial, commonly not met and the result is data
>>> loss. It has to be documented somehow. Make it as innocent-looking as
>>> you can...
>>>
>
>
>> I think that you really need to step back and look harder at real
>> failures - not just your personal experience - but a larger set of real
>> world failures. Many papers have been published recently about that (the
>> google paper, the Bianca paper from FAST, Netapp, etc).
>>
>
> The papers show failures in "once a year" range. I have "twice a
> minute" failure scenario with flashdisks.
>
> Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
> but I bet it would be on "once a day" scale.
>
> We should document those.
> Pavel
>

Documentation is fine with sufficient, hard data....

ric



2009-08-24 21:11:55

by Greg Freemyer

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

> The papers show failures in "once a year" range. I have "twice a
> minute" failure scenario with flashdisks.
>
> Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
> but I bet it would be on "once a day" scale.
>

I agree it should be documented, but the ext3 atomicity issue is only
an issue on unexpected shutdown while the array is degraded. I surely
hope most people running raid5 are not seeing that level of unexpected
shutdown, let alone in a degraded array.

If they are, the atomicity issue pretty strongly says they should not
be using raid5 in that environment. At least not for any filesystem I
know. Having writes to LBA n corrupt LBA n+128, as an example, is
pretty hard to design around from a fs perspective.

Greg

2009-08-24 21:25:19

by Pavel Machek

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

Hi!

>> I can reproduce data loss with ext3 on flashcard in about 40
>> seconds. I'd not call that "odd event". It would be nice to handle
>> that, but that is hard. So ... can we at least get that documented
>> please?
>>
>
> Part of documenting best practices is to put down very specific things
> that do/don't work. What I worry about is producing too much detail to
> be of use for real end users.

Well, I was trying to write for kernel audience. Someone can turn that
into nice end-user manual.

> I have to admit that I have not paid enough attention to the specifics
> of your ext3 + flash card issue - is it the ftl stuff doing out of order
> IO's?

The problem is that flash cards destroy whole erase block on unplug,
and ext3 can't cope with that.

>> _All_ flash cards (MMC, USB, SD) had the problems. You don't need to
>> get clear grasp on trends. Those cards just don't meet ext3
>> expectations, and if you pull them, you get data loss.
>>
> Pull them even after an unmount, or pull them hot?

Pull them hot.

[Some people try -osync to avoid data loss on flash cards... that will
not do the trick. Flashcard will still kill the eraseblock.]
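
To put a number on the blast radius (the 128 KiB erase block size below
is an assumption; real cards vary, and the FTL mapping is invisible to
the host anyway): an interrupted write to a single 512-byte sector can
take out every sector that shares its erase block, including data that
was committed long ago.

    /* Sketch: which LBAs share an erase block with the sector being
     * written when the card is unplugged.  The 128 KiB erase block size
     * is an assumption; real cards vary. */
    #include <stdio.h>

    #define SECTOR_SIZE    512
    #define ERASE_BLOCK    (128 * 1024)
    #define SECTORS_PER_EB (ERASE_BLOCK / SECTOR_SIZE)

    int main(void)
    {
        unsigned long long lba = 1000;  /* sector being written at unplug */
        unsigned long long first = (lba / SECTORS_PER_EB) * SECTORS_PER_EB;

        printf("interrupted write to LBA %llu can corrupt LBAs %llu..%llu\n",
               lba, first, first + SECTORS_PER_EB - 1);
        return 0;
    }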

>>> Nothing is perfect. It is still a trade off between storage
>>> utilization (how much storage we give users for say 5 2TB drives),
>>> performance and costs (throw away any disks over 2 years old?).
>>>
>>
>> "Nothing is perfect"?! That's design decision/problem in raid5/ext3. I
>> believe that should be at least documented. (And understand why ZFS is
>> interesting thing).
>>
> Your statement is overly broad - ext3 on a commercial RAID array that
> does RAID5 or RAID6, etc has no issues that I know of.

If your commercial RAID array is battery backed, maybe. But I was
talking Linux MD here.

>> And I still use my zaurus with crappy DRAM.
>>
>> I would not trust raid5 array with my data, for multiple
>> reasons. The fact that degraded raid5 breaks ext3 assumptions should
>> really be documented.
>
> Again, you say RAID5 without enough specifics. Are you pointing just at
> MD RAID5 on S-ATA? Hardware RAID cards? A specific commercial RAID5
> vendor?

Degraded MD RAID5 on anything, including SATA, and including
hypothetical "perfect disk".

>> The papers show failures in "once a year" range. I have "twice a
>> minute" failure scenario with flashdisks.
>>
>> Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
>> but I bet it would be on "once a day" scale.
>>
>> We should document those.
>
> Documentation is fine with sufficient, hard data....

Degraded MD RAID5 does not work by design; whole stripe will be
damaged on powerfail or reset or kernel bug, and ext3 can not cope
with that kind of damage. [I don't see why statistics should be
necessary for that; the same way we don't need statistics to see that
ext2 needs fsck after powerfail.]
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-24 22:05:45

by Ric Wheeler

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

Pavel Machek wrote:
> Hi!
>
>
>>> I can reproduce data loss with ext3 on flashcard in about 40
>>> seconds. I'd not call that "odd event". It would be nice to handle
>>> that, but that is hard. So ... can we at least get that documented
>>> please?
>>>
>>>
>> Part of documenting best practices is to put down very specific things
>> that do/don't work. What I worry about is producing too much detail to
>> be of use for real end users.
>>
>
> Well, I was trying to write for kernel audience. Someone can turn that
> into nice end-user manual.
>

Kernel people who don't do storage or file systems will still need a
summary - making very specific proposals based on real data and analysis
is useful.
>
>> I have to admit that I have not paid enough attention to this specifics
>> of your ext3 + flash card issue - is it the ftl stuff doing out of order
>> IO's?
>>
>
> The problem is that flash cards destroy whole erase block on unplug,
> and ext3 can't cope with that.
>
>

Even if you unmount the file system? Why isn't this an issue with ext2?

Sounds like you want to suggest very specifically that journalled file
systems are not appropriate for low end flash cards (which seems quite
reasonable).
>>> _All_ flash cards (MMC, USB, SD) had the problems. You don't need to
>>> get clear grasp on trends. Those cards just don't meet ext3
>>> expectations, and if you pull them, you get data loss.
>>>
>>>
>> Pull them even after an unmount, or pull them hot?
>>
>
> Pull them hot.
>
> [Some people try -osync to avoid data loss on flash cards... that will
> not do the trick. Flashcard will still kill the eraseblock.]
>

Pulling any device hot will cause loss of recently written data; even
with ext2 you will have data in the page cache, right?
>
>>>> Nothing is perfect. It is still a trade off between storage
>>>> utilization (how much storage we give users for say 5 2TB drives),
>>>> performance and costs (throw away any disks over 2 years old?).
>>>>
>>>>
>>> "Nothing is perfect"?! That's design decision/problem in raid5/ext3. I
>>> believe that should be at least documented. (And understand why ZFS is
>>> interesting thing).
>>>
>>>
>> Your statement is overly broad - ext3 on a commercial RAID array that
>> does RAID5 or RAID6, etc has no issues that I know of.
>>
>
> If your commercial RAID array is battery backed, maybe. But I was
> talking Linux MD here.
>

Many people in the real world who use RAID5 (for better or worse) use
external raid cards or raid arrays, so you need to be very specific.
>
>>> And I still use my zaurus with crappy DRAM.
>>>
>>> I would not trust raid5 array with my data, for multiple
>>> reasons. The fact that degraded raid5 breaks ext3 assumptions should
>>> really be documented.
>>>
>> Again, you say RAID5 without enough specifics. Are you pointing just at
>> MD RAID5 on S-ATA? Hardware RAID cards? A specific commercial RAID5
>> vendor?
>>
>
> Degraded MD RAID5 on anything, including SATA, and including
> hypothetical "perfect disk".
>

Degraded is one faulted drive while MD is doing a rebuild? And then you
hot unplug it or power cycle? I think that would certainly cause failure
for ext2 as well (again, you would lose any data in the page cache).
>
>>> The papers show failures in "once a year" range. I have "twice a
>>> minute" failure scenario with flashdisks.
>>>
>>> Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
>>> but I bet it would be on "once a day" scale.
>>>
>>> We should document those.
>>>
>> Documentation is fine with sufficient, hard data....
>>
>
> Degraded MD RAID5 does not work by design; whole stripe will be
> damaged on powerfail or reset or kernel bug, and ext3 can not cope
> with that kind of damage. [I don't see why statistics should be
> necessary for that; the same way we don't need statistics to see that
> ext2 needs fsck after powerfail.]
> Pavel
>
What you are describing is a double failure and RAID5 is not double
failure tolerant regardless of the file system type....

I don't want to be overly negative since getting good documentation is
certainly very useful. We just need to document things correctly
based on real data.

Ric

2009-08-24 22:22:22

by Zan Lynx

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

Ric Wheeler wrote:
> Pavel Machek wrote:
>> Degraded MD RAID5 does not work by design; whole stripe will be
>> damaged on powerfail or reset or kernel bug, and ext3 can not cope
>> with that kind of damage. [I don't see why statistics should be
>> necessary for that; the same way we don't need statistics to see that
>> ext2 needs fsck after powerfail.]
>> Pavel
>>
> What you are describing is a double failure and RAID5 is not double
> failure tolerant regardless of the file system type....

Are you sure he isn't talking about how RAID must write all the data
chunks to make a complete stripe, and if there is a power loss, some of
the chunks may be written and some may not?

As I read Pavel's point he is saying that the incomplete write can be
detected by the incorrect parity chunk, but degraded RAID-5 has no
working parity chunk so the incomplete write would go undetected.

I know this is a RAID failure mode. However, I actually thought this was
a problem even for an intact RAID-5. AFAIK, RAID-5 does not generally
read the complete stripe and perform verification unless that is
requested, because doing so would hurt performance and lose the entire
point of the RAID-5 rotating parity blocks.
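
A toy illustration of that degraded case (values made up): the stale
parity left behind by an interrupted stripe update is merely detectable
as wrong on a healthy array, but on a degraded array it gets silently
*used* to reconstruct the chunk on the missing disk, so a block that was
never even written comes back as garbage.

    /* Toy model of the RAID-5 "write hole": an interrupted stripe update
     * leaves data chunk 0 new but the parity stale; reconstructing the
     * chunk on the missing disk from that stale parity yields garbage. */
    #include <stdio.h>

    #define NDATA 3   /* data chunks per stripe (toy value) */

    static unsigned char parity(const unsigned char d[NDATA])
    {
        unsigned char p = 0;
        for (int i = 0; i < NDATA; i++)
            p ^= d[i];
        return p;
    }

    int main(void)
    {
        unsigned char d[NDATA] = { 0x11, 0x22, 0x33 };
        unsigned char p = parity(d);    /* consistent stripe */

        d[0] = 0x44;   /* chunk 0 rewritten; power fails before parity update */

        /* Degraded array: the disk holding chunk 2 is missing, so chunk 2
         * is reconstructed as p ^ d[0] ^ d[1] using the stale parity. */
        unsigned char rebuilt = p ^ d[0] ^ d[1];

        printf("real chunk 2: 0x%02x, reconstructed: 0x%02x%s\n",
               d[2], rebuilt,
               rebuilt == d[2] ? "" : "  <-- silent corruption");
        return 0;
    }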

--
Zan Lynx
[email protected]

"Knowledge is Power. Power Corrupts. Study Hard. Be Evil."

2009-08-24 22:30:24

by Rob Landley

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Monday 24 August 2009 09:55:53 Artem Bityutskiy wrote:
> Probably, Pavel did too good job in generalizing things, and it could be
> better to make a doc about HDD vs SSD or HDD vs Flash-based-storage.
> Not sure. But the idea to document subtle FS assumption is good, IMO.

The standard procedure for this seems to be to cc: Jonathan Corbet on the
discussion, make puppy eyes at him, and subscribe to Linux Weekly News.

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

2009-08-24 22:39:15

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote:
> > I have to admit that I have not paid enough attention to this specifics
> > of your ext3 + flash card issue - is it the ftl stuff doing out of order
> > IO's?
>
> The problem is that flash cards destroy whole erase block on unplug,
> and ext3 can't cope with that.

Sure --- but name **any** filesystem that can deal with the fact that
128k or 256k worth of data might disappear when you pull out the flash
card while it is writing a single sector?

> > Your statement is overly broad - ext3 on a commercial RAID array that
> > does RAID5 or RAID6, etc has no issues that I know of.
>
> If your commercial RAID array is battery backed, maybe. But I was
> talking Linux MD here.

It's not just high end RAID arrays that have battery backups; I happen
to use a mid-range hardware RAID card that comes with a battery
backup. It's just a matter of choosing your hardware carefully.

If your concern is that with Linux MD, you could potentially lose an
entire stripe in RAID 5 mode, then you should say that explicitly; but
again, this isn't a filesystem-specific claim; it's true for all
filesystems. I don't know of any file system that can survive having
a RAID stripe-shaped-hole blown into the middle of it due to a power
failure.

I'll note, BTW, that AIX uses a journal to protect against these sorts
of problems with software raid; this also means that with AIX, you
also don't have to rebuild a RAID 1 device after an unclean shutdown,
like you have to do with Linux MD. This was on the EVMS team's
development list to implement for Linux, but it got canned after LVM
won out, lo those many years ago. C'est la vie; but it's a problem which
is solvable at the RAID layer, and which is traditionally and
historically solved in competent RAID implementations.

- Ted

2009-08-24 22:42:01

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

>>> I have to admit that I have not paid enough attention to this
>>> specifics of your ext3 + flash card issue - is it the ftl stuff
>>> doing out of order IO's?
>>
>> The problem is that flash cards destroy whole erase block on unplug,
>> and ext3 can't cope with that.
>
> Even if you unmount the file system? Why isn't this an issue with
> ext2?

No, I'm talking hot unplug here. It is an issue with ext2, too, but ext2
will run fsck on the next mount, making it less severe.


>>> Pull them even after an unmount, or pull them hot?
>>>
>>
>> Pull them hot.
>>
>> [Some people try -osync to avoid data loss on flash cards... that will
>> not do the trick. Flashcard will still kill the eraseblock.]
>
> Pulling hot any device will cause data loss for recent data loss, even
> with ext2 you will have data in the page cache, right?

Right. But in the ext3 case you basically lose the whole filesystem, because
the fs is inconsistent and you did not run fsck.

>>> Again, you say RAID5 without enough specifics. Are you pointing just
>>> at MD RAID5 on S-ATA? Hardware RAID cards? A specific commercial
>>> RAID5 vendor?
>>>
>>
>> Degraded MD RAID5 on anything, including SATA, and including
>> hypothetical "perfect disk".
>
> Degraded is one faulted drive while MD is doing a rebuild? And then you
> hot unplug it or power cycle? I think that would certainly cause failure
> for ext2 as well (again, you would lose any data in the page cache).

Losing data in page cache is expected. Losing fs consistency is not.

>> Degraded MD RAID5 does not work by design; whole stripe will be
>> damaged on powerfail or reset or kernel bug, and ext3 can not cope
>> with that kind of damage. [I don't see why statistics should be
>> neccessary for that; the same way we don't need statistics to see that
>> ext2 needs fsck after powerfail.]

> What you are describing is a double failure and RAID5 is not double
> failure tolerant regardless of the file system type....

You get a single disk failure, then a powerfail (or reset or kernel
panic). I would not call that a double failure. I agree that it will
mean problems for most filesystems.

Anyway, even if that can be called a double failure, this limitation
should be clearly documented somewhere.

...and that's exactly what I'm trying to fix.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-24 22:44:32

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Mon 2009-08-24 16:22:22, Zan Lynx wrote:
> Ric Wheeler wrote:
>> Pavel Machek wrote:
>>> Degraded MD RAID5 does not work by design; whole stripe will be
>>> damaged on powerfail or reset or kernel bug, and ext3 can not cope
>>> with that kind of damage. [I don't see why statistics should be
>>> neccessary for that; the same way we don't need statistics to see that
>>> ext2 needs fsck after powerfail.]
>>> Pavel
>>>
>> What you are describing is a double failure and RAID5 is not double
>> failure tolerant regardless of the file system type....
>
> Are you sure he isn't talking about how RAID must write all the data
> chunks to make a complete stripe and if there is a power-loss, some of
> the chunks may be written and some may not?
>
> As I read Pavel's point he is saying that the incomplete write can be
> detected by the incorrect parity chunk, but degraded RAID-5 has no
> working parity chunk so the incomplete write would go undetected.

Yep.

> I know this is a RAID failure mode. However, I actually thought this was
> a problem even for a intact RAID-5. AFAIK, RAID-5 does not generally
> read the complete stripe and perform verification unless that is
> requested, because doing so would hurt performance and lose the entire
> point of the RAID-5 rotating parity blocks.

Not sure; isn't RAID expected to verify the array after an unclean
shutdown?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-24 23:00:45

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Mon 2009-08-24 18:39:15, Theodore Tso wrote:
> On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote:
> > > I have to admit that I have not paid enough attention to this specifics
> > > of your ext3 + flash card issue - is it the ftl stuff doing out of order
> > > IO's?
> >
> > The problem is that flash cards destroy whole erase block on unplug,
> > and ext3 can't cope with that.
>
> Sure --- but name **any** filesystem that can deal with the fact that
> 128k or 256k worth of data might disappear when you pull out the flash
> card while it is writing a single sector?

First... I consider myself quite competent at the OS level, yet I did
not realize what flash does and what that means for data
integrity. That means we need some documentation, or maybe we should
refuse to mount those devices r/w or something.

Then to answer your question... ext2. You expect to run fsck after an
unclean shutdown, and you expect to have to solve some problems with
it. So the way ext2 deals with the flash media actually matches what
the user expects. (*)

OTOH in ext3 case you expect consistent filesystem after unplug; and
you don't get that.

> > > Your statement is overly broad - ext3 on a commercial RAID array that
> > > does RAID5 or RAID6, etc has no issues that I know of.
> >
> > If your commercial RAID array is battery backed, maybe. But I was
> > talking Linux MD here.
...
> If your concern is that with Linux MD, you could potentially lose an
> entire stripe in RAID 5 mode, then you should say that explicitly; but
> again, this isn't a filesystem specific cliam; it's true for all
> filesystems. I don't know of any file system that can survive having
> a RAID stripe-shaped-hole blown into the middle of it due to a power
> failure.

Again, ext2 handles that in the way the user expects.

At least I was taught "ext2 needs fsck after powerfail; ext3 can
handle powerfails just ok".

> I'll note, BTW, that AIX uses a journal to protect against these sorts
> of problems with software raid; this also means that with AIX, you
> also don't have to rebuild a RAID 1 device after an unclean shutdown,
> like you have do with Linux MD. This was on the EVMS's team
> development list to implement for Linux, but it got canned after LVM
> won out, lo those many years ago. Ce la vie; but it's a problem which
> is solvable at the RAID layer, and which is traditionally and
> historically solved in competent RAID implementations.

Yep, we should add a journal to RAID; or at least write "Linux MD
*needs* a UPS" in big and bold letters. I'm trying to do the second
part.

(Attached is current version of the patch).

[If you'd prefer a patch saying that MMC/USB flash/Linux MD arrays are
generally unsafe to use without UPS/reliable connection/no kernel
bugs... then I may try to push that. I was not sure... maybe some
filesystem _can_ handle this kind of issue?]

Pavel

(*) Ok, now... the user expects to run fsck, but very advanced users may
not expect old data to be damaged. Certainly I was not an advanced enough
user a few months ago.

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..d1ef4d0
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,57 @@
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so. Not all filesystems require all of these
+to be satisfied for safe operation.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if the disk returns an error condition
+during a write, filesystems can't handle that correctly.
+
+ Fortunately writes failing are very uncommon on traditional
+ spinning disks, as they have spare sectors they use when write
+ fails.
+
+Don't cause collateral damage on a failed write (NO-COLLATERALS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On some storage systems, a failed write (for example due to power
+failure) kills data in adjacent (or maybe unrelated) sectors.
+
+Unfortunately, cheap USB/SD flash cards I've seen do have this bug,
+and are thus unsuitable for all filesystems I know.
+
+ An inherent problem with using flash as a normal block device
+ is that the flash erase size is bigger than most filesystem
+ sector sizes. So when you request a write, it may erase and
+ rewrite some 64k, 128k, or even a couple megabytes on the
+ really _big_ ones.
+
+ If you lose power in the middle of that, the filesystem won't
+ notice that data in the "sectors" _around_ the one you were
+ trying to write to got trashed.
+
+ MD RAID-4/5/6 in degraded mode has a similar problem; stripes
+ behave similarly to eraseblocks.
+
+
+Don't damage the old data on a powerfail (ATOMIC-WRITES-ON-POWERFAIL)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either the whole sector is correctly written or nothing is written during
+a powerfail.
+
+ Because RAM tends to fail faster than the rest of the system during
+ powerfail, special hw killing DMA transfers may be necessary;
+ otherwise, disks may write garbage during powerfail.
+ This may be quite common on generic PC machines.
+
+ Note that an atomic write is very hard to guarantee for MD RAID-4/5/6,
+ because it needs to write both the changed data and the parity to
+ different disks. (But this will only really show up in degraded mode.)
+ UPS for RAID array should help.
+
+
+
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 67639f9..ef9ff0f 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they
have to be 8 character filenames, even then we are fairly close to
running out of unique filenames.

+Requirements
+============
+
+Ext2 expects the disk/storage subsystem to behave sanely. On a sanely
+behaving disk subsystem, data that has been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed (NO-WRITE-ERRORS)
+
+* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL)
+
+and obviously:
+
+* don't cause collateral damage to adjacent sectors on a failed write
+ (NO-COLLATERALS)
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* write caching is disabled. ext2 does not know how to issue barriers
+ as of 2.6.28. hdparm -W0 disables it on SATA disks.
+
Journaling
-----------
-
-A journaling extension to the ext2 code has been developed by Stephen
-Tweedie. It avoids the risks of metadata corruption and the need to
-wait for e2fsck to complete after a crash, without requiring a change
-to the on-disk ext2 layout. In a nutshell, the journal is a regular
-file which stores whole metadata (and optionally data) blocks that have
-been modified, prior to writing them into the filesystem. This means
-it is possible to add a journal to an existing ext2 filesystem without
-the need for data conversion.
-
-When changes to the filesystem (e.g. a file is renamed) they are stored in
-a transaction in the journal and can either be complete or incomplete at
-the time of a crash. If a transaction is complete at the time of a crash
-(or in the normal case where the system does not crash), then any blocks
-in that transaction are guaranteed to represent a valid filesystem state,
-and are copied into the filesystem. If a transaction is incomplete at
-the time of the crash, then there is no guarantee of consistency for
-the blocks in that transaction so they are discarded (which means any
-filesystem changes they represent are also lost).
+==========
Check Documentation/filesystems/ext3.txt if you want to read more about
ext3 and journaling.

diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 570f9bd..752f4b4 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -199,6 +202,43 @@ debugfs: ext2 and ext3 file system debugger.
ext2online: online (mounted) ext2 and ext3 filesystem resizer


+Requirements
+============
+
+Ext3 expects the disk/storage subsystem to behave sanely. On a sanely
+behaving disk subsystem, data that has been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed (NO-WRITE-ERRORS)
+
+* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL)
+
+ Ext3 handles trash getting written into sectors during powerfail
+ surprisingly well. It's not foolproof, but it is resilient.
+ Incomplete journal entries are ignored, and journal replay of
+ complete entries will often "repair" garbage written into the inode
+ table. The data=journal option extends this behavior to file and
+ directory data blocks as well.
+
+
+and obviously:
+
+* don't cause collateral damage to adjacent sectors on a failed write
+ (NO-COLLATERALS)
+
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+ (Note that barriers are disabled by default, use "barrier=1"
+ mount option after making sure hw can support them).
+
+ hdparm -I reports disk features; "Native
+ Command Queueing" is the feature you are looking for.
+
+
References
==========


--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-24 23:43:32

by David Lang

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Mon, 24 Aug 2009, Zan Lynx wrote:

> Ric Wheeler wrote:
>> Pavel Machek wrote:
>>> Degraded MD RAID5 does not work by design; whole stripe will be
>>> damaged on powerfail or reset or kernel bug, and ext3 can not cope
>>> with that kind of damage. [I don't see why statistics should be
>>> neccessary for that; the same way we don't need statistics to see that
>>> ext2 needs fsck after powerfail.]
>>> Pavel
>>>
>> What you are describing is a double failure and RAID5 is not double failure
>> tolerant regardless of the file system type....
>
> Are you sure he isn't talking about how RAID must write all the data chunks
> to make a complete stripe and if there is a power-loss, some of the chunks
> may be written and some may not?

a write to raid 5 doesn't need to write to all drives, but it does need to
write to two drives (the drive you are modifying and the parity drive)

if you are not degraded and only succeed on one write you will detect the
corruption later when you try to verify the data.

if you are degraded and only succeed on one write, then the entire stripe
gets corrupted.

but this is a double failure (one drive + unclean shutdown)

if you have battery-backed cache you will finish the writes when you
reboot.

if you don't have battery-backed cache (or are using software raid and
crashed in the middle of sending the writes to the drive) you lose, but
unless you disable write buffers and do sync writes (which nobody is going
to do because of the performance problems) you will lose data in an
unclean shutdown anyway.
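
(to spell out the "two drives" rule: the usual small-write path reads the
old data and old parity, then writes the new data and the new parity. a toy
sketch with made-up byte values, not md code:)

def bxor(a, b):
    # byte-wise XOR of two equal-length chunks
    return bytes(x ^ y for x, y in zip(a, b))

old_data   = bytes.fromhex("10203040")
old_parity = bytes.fromhex("99887766")
new_data   = bytes.fromhex("11223344")

# read-modify-write parity update: P_new = P_old xor D_old xor D_new
new_parity = bxor(bxor(old_parity, old_data), new_data)

# two independent writes now have to hit two different drives:
#   data drive   <- new_data
#   parity drive <- new_parity
# if the box dies between the two, the stripe no longer xors to zero; with
# all drives present a later check can detect that, but in degraded mode the
# stale parity silently reconstructs garbage for the missing chunk.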

David Lang

> As I read Pavel's point he is saying that the incomplete write can be
> detected by the incorrect parity chunk, but degraded RAID-5 has no working
> parity chunk so the incomplete write would go undetected.
>
> I know this is a RAID failure mode. However, I actually thought this was a
> problem even for a intact RAID-5. AFAIK, RAID-5 does not generally read the
> complete stripe and perform verification unless that is requested, because
> doing so would hurt performance and lose the entire point of the RAID-5
> rotating parity blocks.
>
>

2009-08-25 00:34:50

by Ric Wheeler

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

Pavel Machek wrote:
> On Mon 2009-08-24 16:22:22, Zan Lynx wrote:
>
>> Ric Wheeler wrote:
>>
>>> Pavel Machek wrote:
>>>
>>>> Degraded MD RAID5 does not work by design; whole stripe will be
>>>> damaged on powerfail or reset or kernel bug, and ext3 can not cope
>>>> with that kind of damage. [I don't see why statistics should be
>>>> neccessary for that; the same way we don't need statistics to see that
>>>> ext2 needs fsck after powerfail.]
>>>> Pavel
>>>>
>>>>
>>> What you are describing is a double failure and RAID5 is not double
>>> failure tolerant regardless of the file system type....
>>>
>> Are you sure he isn't talking about how RAID must write all the data
>> chunks to make a complete stripe and if there is a power-loss, some of
>> the chunks may be written and some may not?
>>
>> As I read Pavel's point he is saying that the incomplete write can be
>> detected by the incorrect parity chunk, but degraded RAID-5 has no
>> working parity chunk so the incomplete write would go undetected.
>>
>
> Yep.
>
>
>> I know this is a RAID failure mode. However, I actually thought this was
>> a problem even for a intact RAID-5. AFAIK, RAID-5 does not generally
>> read the complete stripe and perform verification unless that is
>> requested, because doing so would hurt performance and lose the entire
>> point of the RAID-5 rotating parity blocks.
>>
>
> Not sure; is not RAID expected to verify the array after unclean
> shutdown?
>
> Pavel
>
Not usually - that would take multiple hours of verification, roughly
equivalent to doing a RAID rebuild since you have to read each sector of
every drive (although you would do this at full speed if the array was
offline, not throttled like we do with rebuilds).

That is part of the thing that scrubbing can do.
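
With Linux MD that scrub is driven through sysfs; here is a minimal sketch
(assuming the array is /dev/md0 and you run it as root) of kicking off a
check and reading back the mismatch count:

MD = "/sys/block/md0/md/"   # assumption: the array is /dev/md0

def start_check():
    # writing "check" makes md read every stripe and compare data with parity
    with open(MD + "sync_action", "w") as f:
        f.write("check")

def mismatch_count():
    # stripes found inconsistent by the last check/repair pass
    with open(MD + "mismatch_cnt") as f:
        return int(f.read())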

Note that once you find a bad bit of data, it is really useful to be
able to map that back into a humanly understandable object/repair
action. For example, map the bad data range back to metadata which would
translate into a fsck run or a list of impacted files or directories....

Ric


2009-08-25 14:43:42

by Florian Weimer

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

* Theodore Tso:

> The only one that falls into that category is the one about not being
> able to handle failed writes, and the way most failures take place,

Hmm. What does "not being able to handle failed writes" actually
mean? AFAICS, there are two possible answers: "all bets are off", or
"we'll tell you about the problem, and all bets are off".

>> Isn't this by design? In other words, if the metadata doesn't survive
>> non-atomic writes, wouldn't it be an ext3 bug?
>
> Part of the problem here is that "atomic-writes" is confusing; it
> doesn't mean what many people think it means. The assumption which
> many naive filesystem designers make is that writes succeed or they
> don't. If they don't succeed, they don't change the previously
> existing data in any way.

Right. And a lot of database systems make the same assumption.
Oracle Berkeley DB cannot deal with partial page writes at all, and
PostgreSQL assumes that it's safe to flip a few bits in a sector
without proper WAL (it doesn't care if the changes actually hit the
disk, but the write shouldn't make the sector unreadable or put random
bytes there).

> Is that a file system "bug"? Well, it's better to call that a
> mismatch between the assumptions made of physical devices, and of the
> file system code. On Irix, SGI hardware had a powerfail interrupt,
> and the power supply and extra-big capacitors, so that when a power
> fail interrupt came in, the Irix would run around frantically shutting
> down pending DMA transfers to prevent this failure mode from causing
> problems. PC class hardware (according to Ted's law), is cr*p, and
> doesn't have a powerfail interrupt, so it's not something that we
> have.

The DMA transaction should fail due to ECC errors, though.

> Ext3, ext4, and ocfs2 does physical block journalling, so as long as
> journal truncate hasn't taken place right before the failure, the
> replay of the physical block journal tends to repair this most (but
> not necessarily all) cases of "garbage is written right before power
> failure". People who care about this should really use a UPS, and
> wire up the USB and/or serial cable from the UPS to the system, so
> that the OS can do a controlled shutdown if the UPS is close to
> shutting down due to an extended power failure.

I think the general idea is to protect valuable data with WAL. You
overwrite pages on disk only after you've made a backup copy into WAL.
After a power loss event, you replay the log and overwrite all garbage
that might be there. For the WAL, you rely on checksum and sequence
numbers. This still doesn't help against write failures where the
system continues running (because the fsync() during checkpointing
isn't guaranteed to report errors), but it should deal with the power
failure case. But this assumes that the file system protects its own
data structure in a similar way. Is this really too much to demand?
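
A minimal sketch of that scheme (full-page images logged with a checksum and
sequence number before any in-place overwrite). The file handling and record
format here are invented for illustration; this is not PostgreSQL's actual
WAL layout:

import os, struct, zlib

PAGE = 8192

def wal_append(wal, seq, page_no, page):
    # durable backup copy first: header (sequence, page number, CRC) + page image
    rec = struct.pack("<QQI", seq, page_no, zlib.crc32(page)) + page
    wal.write(rec)
    wal.flush()
    os.fsync(wal.fileno())

def checkpoint(datafile, page_no, page):
    # only now is the in-place overwrite allowed; if a crash tears it,
    # the logged copy still exists
    datafile.seek(page_no * PAGE)
    datafile.write(page)

def recover(wal_path, datafile):
    # replay every logged page whose checksum verifies, in order, overwriting
    # whatever garbage the crash may have left in the data file
    with open(wal_path, "rb") as wal:
        while True:
            hdr = wal.read(20)              # 8 + 8 + 4 bytes
            if len(hdr) < 20:
                break
            seq, page_no, crc = struct.unpack("<QQI", hdr)
            page = wal.read(PAGE)
            if len(page) < PAGE or zlib.crc32(page) != crc:
                break                       # torn tail of the log: stop here
            datafile.seek(page_no * PAGE)
            datafile.write(page)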

Partial failures are extremely difficult to deal with because of their
asynchronous nature. I've come to accept that, but it's still
disappointing.

--
Florian Weimer <[email protected]>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstra?e 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

2009-08-25 18:52:09

by Rob Landley

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Monday 24 August 2009 15:24:28 Ric Wheeler wrote:
> Pavel Machek wrote:

> > Actually, ext2 should be able to survive that, no? Error writing ->
> > remount ro -> fsck on next boot -> drive relocates the sectors.
>
> I think that the example and the response are both off base. If your
> head ever touches the platter, you won't be reading from a huge part of
> your drive ever again

It's not quite that simple anymore.

These days, most modern drives add an "overcoat", which is a vapor deposition
layer of carbon (i.e. diamond) on top of the magnetic media, and then add a
nanolayer of some kind of nonmagnetic lubricant on top of that. That protects
the magnetic layer from physical contact with the head; it takes a pretty
solid whack to chip through diamond and actually gouge your disk:

http://www.datarecoverylink.com/understanding_magnetic_media.html

You can also do fun things with various nitrides (carbon nitride, silicon
nitride, titanium nitride) which are pretty darn tough too, although I dunno
about their suitability for hard drives:

http://www.physical-vapor-deposition.com/

So while it _is_ possible to whack your drive and scratch the platter, merely
"touching" won't do it. (Laptops wouldn't be feasible if they couldn't cope
with a little jostling while running.) In the case of repeated small whacks,
your heads may actually go first. (I vaguely recall the little aerofoil wing
thingy holding up the disk touches first, and can get ground down by repeated
contact with the diamond layer (despite the lubricant, that just buys time) so
it gets shorter and shorter and can't reliably keep the head above the disk
rather than in contact with it. But I'm kind of stale myself here, not sure
that's still current.)

Here's a nice youtube video of a 2007 defcon talk from a hard drive recovery
professional, "What's that Clicking Noise", series starts here:
http://www.youtube.com/watch?v=vCapEFNZAJ0

And here's that guy's web page:
http://www.myharddrivedied.com/presentations/index.html

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

2009-08-25 20:56:05

by Rob Landley

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Monday 24 August 2009 16:11:56 Greg Freemyer wrote:
> > The papers show failures in "once a year" range. I have "twice a
> > minute" failure scenario with flashdisks.
> >
> > Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
> > but I bet it would be on "once a day" scale.
>
> I agree it should be documented, but the ext3 atomicity issue is only
> an issue on unexpected shutdown while the array is degraded. I surely
> hope most people running raid5 are not seeing that level of unexpected
> shutdown, let along in a degraded array,
>
> If they are, the atomicity issue pretty strongly says they should not
> be using raid5 in that environment. At least not for any filesystem I
> know. Having writes to LBA n corrupt LBA n+128 as an example is
> pretty hard to design around from a fs perspective.

Right now, people think that a degraded raid 5 is equivalent to raid 0. As
this thread demonstrates, in the power failure case it's _worse_, due to write
granularity being larger than the filesystem sector size. (Just like flash.)

Knowing that, some people might choose to suspend writes to their raid until
it's finished recovery. Perhaps they'll set up a system where a degraded raid
5 gets remounted read only until recovery completes, and then writes go to a
new blank hot spare disk using all that volume snapshotting or unionfs stuff
people have been working on. (The big boys already have hot spare disks
standing by on a lot of these systems, ready to power up and go without human
intervention. Needing two for actual reliability isn't that big a deal.)

Or maybe the raid guys might want to tweak the recovery logic so it's not
entirely linear, but instead prioritizes dirty pages over clean ones. So if
somebody dirties a page halfway through a degraded raid 5, skip ahead to
recover that chunk to the new disk first (yes, leaving holes, it's not that
hard to track), and _then_ let the write go through.

But unless people know the issue exists, they won't even start thinking about
ways to address it.

> Greg

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

2009-08-25 21:08:10

by David Lang

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Tue, 25 Aug 2009, Rob Landley wrote:

> On Monday 24 August 2009 16:11:56 Greg Freemyer wrote:
>>> The papers show failures in "once a year" range. I have "twice a
>>> minute" failure scenario with flashdisks.
>>>
>>> Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
>>> but I bet it would be on "once a day" scale.
>>
>> I agree it should be documented, but the ext3 atomicity issue is only
>> an issue on unexpected shutdown while the array is degraded. I surely
>> hope most people running raid5 are not seeing that level of unexpected
>> shutdown, let along in a degraded array,
>>
>> If they are, the atomicity issue pretty strongly says they should not
>> be using raid5 in that environment. At least not for any filesystem I
>> know. Having writes to LBA n corrupt LBA n+128 as an example is
>> pretty hard to design around from a fs perspective.
>
> Right now, people think that a degraded raid 5 is equivalent to raid 0. As
> this thread demonstrates, in the power failure case it's _worse_, due to write
> granularity being larger than the filesystem sector size. (Just like flash.)
>
> Knowing that, some people might choose to suspend writes to their raid until
> it's finished recovery. Perhaps they'll set up a system where a degraded raid
> 5 gets remounted read only until recovery completes, and then writes go to a
> new blank hot spare disk using all that volume snapshoting or unionfs stuff
> people have been working on. (The big boys already have hot spare disks
> standing by on a lot of these systems, ready to power up and go without human
> intervention. Needing two for actual reliability isn't that big a deal.)
>
> Or maybe the raid guys might want to tweak the recovery logic so it's not
> entirely linear, but instead prioritizes dirty pages over clean ones. So if
> somebody dirties a page halfway through a degraded raid 5, skip ahead to
> recover that chunk first to the new disk first (yes leaving holes, it's not that
> hard to track), and _then_ let the write go through.
>
> But unless people know the issue exists, they won't even start thinking about
> ways to address it.

if you've got the drives available you should be running raid 6 not raid 5
so that you have to lose two drives before you lose your error checking.

in my opinion that's a far better use of a drive than a hot spare.

David Lang