It seems that you are really hung up on whether or not the filesystem
metadata is consistent after a power failure, when I'd argue that
users of storage devices that don't have good powerfail
properties have much bigger problems (such as the potential for silent
data corruption, or even if fsck will fix a trashed inode table with
ext2, massive data loss). So instead of your suggested patch, it
might be better simply to have a file in Documentation/filesystems
that states something along the lines of:
"There are storage devices that high highly undesirable properties
when they are disconnected or suffer power failures while writes are
in progress; such devices include flash devices and software RAID 5/6
arrays without journals, as well as hardware RAID 5/6 devices without
battery backups. These devices have the property of potentially
corrupting blocks being written at the time of the power failure, and
worse yet, amplifying the region where blocks are corrupted such that
adjacent sectors are also damaged during the power failure.
Users who use such storage devices are well advised to take
countermeasures, such as the use of Uninterruptible Power Supplies,
and making sure the flash device is not hot-unplugged while the device
is being used. Regular backups when using these devices are also a
Very Good Idea.
Otherwise, file systems placed on these devices can suffer silent data
and file system corruption. A forced use of fsck may detect metadata
corruption resulting in file system corruption, but will not suffice
to detect data corruption."
My big complaint is that you seem to think that ext3 somehow let you
down, but I'd argue that the real issue is that the storage device let
you down. Any journaling filesystem will have the properties that you
seem to be complaining about, so the fact that your patch only
documents this as assumptions made by ext2 and ext3 is unfair; it also
applies to xfs, jfs, reiserfs, reiser4, etc. Furthermore, most users
are even more concerned about the possibility of massive data loss and/or
silent data corruption. So if your complaint is that we don't have
documentation warning users about the potential pitfalls of using
storage devices with undesirable power fail properties, let's document
that as a shortcoming in those storage devices.
- Ted
Hi!
> It seems that you are really hung up on whether or not the filesystem
> metadata is consistent after a power failure, when I'd argue that
> users of storage devices that don't have good powerfail
> properties have much bigger problems (such as the potential for silent
> data corruption, or even if fsck will fix a trashed inode table with
> ext2, massive data loss). So instead of your suggested patch, it
> might be better simply to have a file in Documentation/filesystems
> that states something along the lines of:
>
> "There are storage devices that high highly undesirable properties
> when they are disconnected or suffer power failures while writes are
> in progress; such devices include flash devices and software RAID 5/6
> arrays without journals, as well as hardware RAID 5/6 devices without
> battery backups. These devices have the property of potentially
> corrupting blocks being written at the time of the power failure, and
> worse yet, amplifying the region where blocks are corrupted such that
> adjacent sectors are also damaged during the power failure.
In the FTL case, damaged sectors are not necessarily adjacent. Otherwise
this looks okay and fair to me.
> Users who use such storage devices are well advised to take
> countermeasures, such as the use of Uninterruptible Power Supplies,
> and making sure the flash device is not hot-unplugged while the device
> is being used. Regular backups when using these devices are also a
> Very Good Idea.
>
> Otherwise, file systems placed on these devices can suffer silent data
> and file system corruption. A forced use of fsck may detect metadata
> corruption resulting in file system corruption, but will not suffice
> to detect data corruption."
Ok, would you be against adding:
"Running non-journalled filesystem on these may be desirable, as
journalling can not provide meaningful protection, anyway."
> My big complaint is that you seem to think that ext3 somehow let you
> down, but I'd argue that the real issue is that the storage device let
> you down. Any journaling filesystem will have the properties that you
> seem to be complaining about, so the fact that your patch only
> documents this as assumptions made by ext2 and ext3 is unfair; it also
> applies to xfs, jfs, reiserfs, reiser4, etc. Furthermore, most
> users
Yes, it applies to all journalling filesystems; it is just that I was
clever/paranoid enough to avoid anything non-ext3.
ext3 docs still say:
# The journal supports the transactions start and stop, and in case of a
# crash, the journal can replay the transactions to quickly put the
# partition back into a consistent state.
> are even more concerned about the possibility of massive data loss and/or
> silent data corruption. So if your complaint is that we don't have
> documentation warning users about the potential pitfalls of using
> storage devices with undesirable power fail properties, let's document
> that as a shortcoming in those storage devices.
Ok, works for me.
---
From: Theodore Tso <[email protected]>
Document that many devices are too broken for filesystems to protect
data in case of powerfail.
Signed-off-by: Pavel Machek <[email protected]>
diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt
new file mode 100644
index 0000000..e1a46dd
--- /dev/null
+++ b/Documentation/filesystems/dangers.txt
@@ -0,0 +1,19 @@
+There are storage devices that have highly undesirable properties
+when they are disconnected or suffer power failures while writes are
+in progress; such devices include flash devices and software RAID 5/6
+arrays without journals, as well as hardware RAID 5/6 devices without
+battery backups. These devices have the property of potentially
+corrupting blocks being written at the time of the power failure, and
+worse yet, amplifying the region where blocks are corrupted such that
+additional sectors are also damaged during the power failure.
+
+Users who use such storage devices are well advised to take
+countermeasures, such as the use of Uninterruptible Power Supplies,
+and making sure the flash device is not hot-unplugged while the device
+is being used. Regular backups when using these devices are also a
+Very Good Idea.
+
+Otherwise, file systems placed on these devices can suffer silent data
+and file system corruption. A forced use of fsck may detect metadata
+corruption resulting in file system corruption, but will not suffice
+to detect data corruption.
\ No newline at end of file
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Document the things ext2 expects from the storage subsystem, and the
fact that it cannot handle barriers. Also remove the journaling
description, as that's really ext3 material.
Signed-off-by: Pavel Machek <[email protected]>
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 67639f9..e300ca8 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,17 @@ enough 4-character names to make up unique directory entries, so they
have to be 8 character filenames, even then we are fairly close to
running out of unique filenames.
+Requirements
+============
+
+Ext2 expects the disk/storage subsystem not to return write errors.
+
+It also needs write caching to be disabled for reliable fsync
+operation; ext2 does not know how to issue barriers as of
+2.6.31. hdparm -W0 disables the write cache on SATA disks.
+
Journaling
-----------
-
-A journaling extension to the ext2 code has been developed by Stephen
-Tweedie. It avoids the risks of metadata corruption and the need to
-wait for e2fsck to complete after a crash, without requiring a change
-to the on-disk ext2 layout. In a nutshell, the journal is a regular
-file which stores whole metadata (and optionally data) blocks that have
-been modified, prior to writing them into the filesystem. This means
-it is possible to add a journal to an existing ext2 filesystem without
-the need for data conversion.
On Wed, 26 Aug 2009, Pavel Machek wrote:
>> It seems that you are really hung up on whether or not the filesystem
>> metadata is consistent after a power failure, when I'd argue that
>> users of storage devices that don't have good powerfail
>> properties have much bigger problems (such as the potential for silent
>> data corruption, or even if fsck will fix a trashed inode table with
>> ext2, massive data loss). So instead of your suggested patch, it
>> might be better simply to have a file in Documentation/filesystems
>> that states something along the lines of:
>>
>> "There are storage devices that high highly undesirable properties
>> when they are disconnected or suffer power failures while writes are
>> in progress; such devices include flash devices and software RAID 5/6
>> arrays without journals,
is it under all conditions, or only when you have already lost redundancy?
prior discussions make me think this was only if the redundancy is already
lost.
also, the talk about software RAID 5/6 arrays without journals will be
confusing (after all, if you are using ext3/XFS/etc you are using a
journal, aren't you?)
you then go on to talk about hardware raid 5/6 without battery backup. I
think that you are being too specific here. any array without battery
backup can lead to 'interesting' situations when you lose power.
in addition, even with a single drive you will lose some data on power
loss (unless you do sync mounts with disabled write caches); full data
journaling can help protect you from this, but the default journaling just
protects the metadata.
David Lang
On Tue 2009-08-25 15:33:08, [email protected] wrote:
> On Wed, 26 Aug 2009, Pavel Machek wrote:
>
>>> It seems that you are really hung up on whether or not the filesystem
>>> metadata is consistent after a power failure, when I'd argue that
>>> users of storage devices that don't have good powerfail
>>> properties have much bigger problems (such as the potential for silent
>>> data corruption, or even if fsck will fix a trashed inode table with
>>> ext2, massive data loss). So instead of your suggested patch, it
>>> might be better simply to have a file in Documentation/filesystems
>>> that states something along the lines of:
>>>
>>> "There are storage devices that high highly undesirable properties
>>> when they are disconnected or suffer power failures while writes are
>>> in progress; such devices include flash devices and software RAID 5/6
>>> arrays without journals,
>
> is it under all conditions, or only when you have already lost redundancy?
I'd prefer not to specify.
> prior discussions make me think this was only if the redundancy is
> already lost.
I'm not so sure now.
Let's say you are writing to the (healthy) RAID5 and have a powerfail.
So now data blocks do not correspond to the parity block. You don't
yet have the corruption, but you already have a problem.
If you get a disk failing at this point, you'll get corruption.
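To make that concrete, a toy sketch (plain C, a hypothetical 4+1
stripe; an illustration only, nothing to do with the real MD code) of
how rebuilding from stale parity corrupts a block that was not even
being written at the time:

#include <stdio.h>
#include <stdint.h>

/* Toy 4+1 RAID5 stripe: parity is the XOR of the four data blocks. */
static uint8_t parity(const uint8_t d[4])
{
        return d[0] ^ d[1] ^ d[2] ^ d[3];
}

int main(void)
{
        uint8_t data[4] = { 0x11, 0x22, 0x33, 0x44 };
        uint8_t p = parity(data);       /* consistent stripe: p == 0x44 */

        /* Power fails mid-update: data[1] reaches the platter, the
         * matching parity update does not, so p is now stale. */
        data[1] = 0x99;

        /* Later the disk holding data[2] dies.  Reconstruction XORs
         * the surviving blocks with the stale parity... */
        uint8_t rebuilt = data[0] ^ data[1] ^ data[3] ^ p;

        /* ...and yields 0x88 instead of 0x33: a block that was never
         * being written has been silently corrupted. */
        printf("expected 0x33, reconstructed 0x%02x\n", rebuilt);
        return 0;
}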
> also, the talk about software RAID 5/6 arrays without journals will be
> confusing (after all, if you are using ext3/XFS/etc you are using a
> journal, aren't you?)
Slightly confusing, yes. Should I just say "MD RAID 5" and avoid
talking about hardware RAID arrays, where that's really
manufacturer-specific?
> in addition, even with a single drive you will lose some data on power
> loss (unless you do sync mounts with disabled write caches), full data
> journaling can help protect you from this, but the default journaling
> just protects the metadata.
"Data loss" here means "damaging data that were already fsynced". That
will not happen on a single disk (with barriers on, etc.), but will happen
on RAID5 and flash.
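To be concrete about the contract: only data for which fsync() has
returned success count as durable. A minimal sketch (the file name is
made up for illustration):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("important.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char buf[] = "committed record\n";
        if (write(fd, buf, sizeof(buf) - 1) != (ssize_t)(sizeof(buf) - 1)) {
                perror("write");        /* write() alone promises nothing */
                return 1;
        }
        if (fsync(fd) != 0) {           /* only data fsynced successfully may
                                           be assumed to survive power loss */
                perror("fsync");
                return 1;
        }
        close(fd);
        return 0;
}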
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Wed, 26 Aug 2009, Pavel Machek wrote:
> On Tue 2009-08-25 15:33:08, [email protected] wrote:
>> On Wed, 26 Aug 2009, Pavel Machek wrote:
>>
>>>> It seems that you are really hung up on whether or not the filesystem
>>>> metadata is consistent after a power failure, when I'd argue that
>>>> users of storage devices that don't have good powerfail
>>>> properties have much bigger problems (such as the potential for silent
>>>> data corruption, or even if fsck will fix a trashed inode table with
>>>> ext2, massive data loss). So instead of your suggested patch, it
>>>> might be better simply to have a file in Documentation/filesystems
>>>> that states something along the lines of:
>>>>
>>>> "There are storage devices that high highly undesirable properties
>>>> when they are disconnected or suffer power failures while writes are
>>>> in progress; such devices include flash devices and software RAID 5/6
>>>> arrays without journals,
>>
>> is it under all conditions, or only when you have already lost redundancy?
>
> I'd prefer not to specify.
you need to, otherwise you are claiming that all linux software raid
implementations will lose data on powerfail, which I don't think is the
case.
>> prior discussions make me think this was only if the redundancy is
>> already lost.
>
> I'm not so sure now.
>
> Let's say you are writing to the (healthy) RAID5 and have a powerfail.
>
> So now data blocks do not correspond to the parity block. You don't
> yet have the corruption, but you already have a problem.
>
> If you get a disk failing at this point, you'll get corruption.
it's the same combination of problems (non-redundant array and write lost
to powerfail/reboot), just in a different order.
recommending a scrub of the raid after an unclean shutdown would make
sense, along with a warning that if you lose all redundancy before the
scrub is completed and there was a write failure in the unscrubbed portion
it could corrupt things.
>> also, the talk about software RAID 5/6 arrays without journals will be
>> confusing (after all, if you are using ext3/XFS/etc you are using a
>> journal, aren't you?)
>
> Slightly confusing, yes. Should I just say "MD RAID 5" and avoid
> talking about hardware RAID arrays, where that's really
> manufacturer-specific?
what about dm raid?
I don't think you should talk about hardware raid cards.
>> in addition, even with a single drive you will lose some data on power
>> loss (unless you do sync mounts with disabled write caches), full data
>> journaling can help protect you from this, but the default journaling
>> just protects the metadata.
>
> "Data loss" here means "damaging data that were already fsynced". That
> will not happen on single disk (with barriers on etc), but will happen
> on RAID5 and flash.
this definition of data loss wasn't clear prior to this. you need to
define this, and state that the reason that flash and raid arrays can
suffer from this is that both of them deal with blocks of storage larger
than the data block (eraseblock or raid stripe) and there are conditions
that can cause the loss of the entire eraseblock or raid stripe which can
affect data that was previously safe on disk (and if power had been lost
before the latest write, the prior data would still be safe)
note that this doesn't necessarily affect all flash disks. if the disk
doesn't replace the old block in the FTL until the data has all been
successfully copied to the new eraseblock you don't have this problem.
some (possibly all) cheap thumb drives don't do this, but I would expect
the expensive SATA SSDs to do things in the right order.
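to illustrate the two orderings, a toy model (illustrative only, not
any real vendor's FTL; one eraseblock holds four logical sectors
here):

#include <stdio.h>
#include <string.h>

#define SECTORS 4

struct eraseblock {
        char sector[SECTORS][16];
};

/* Unsafe order: erase in place, then rewrite.  Power loss between the
 * erase and the rewrite takes the *other* three sectors with it -- the
 * "amplified" damage being discussed. */
static void unsafe_update(struct eraseblock *eb, int idx, const char *data,
                          int power_fails)
{
        memset(eb, 0xff, sizeof(*eb));          /* erase whole block */
        if (power_fails)
                return;                         /* neighbours are gone too */
        snprintf(eb->sector[idx], 16, "%s", data);
}

/* Safe order: build a fresh copy first, switch to it only once the copy
 * is complete.  Power loss leaves the old eraseblock intact. */
static void safe_update(struct eraseblock **live, struct eraseblock *spare,
                        int idx, const char *data, int power_fails)
{
        *spare = **live;                        /* copy live sectors */
        snprintf(spare->sector[idx], 16, "%s", data);
        if (power_fails)
                return;                         /* old block still valid */
        *live = spare;                          /* atomic remap */
}

int main(void)
{
        struct eraseblock a = { { "one", "two", "three", "four" } }, b;
        struct eraseblock *live = &a;

        safe_update(&live, &b, 1, "TWO", 1);    /* power fails: no harm */
        printf("safe:   sector 3 intact? %s\n",
               memcmp(live->sector[2], "three", 6) == 0 ? "yes" : "no");

        unsafe_update(&a, 1, "TWO", 1);         /* power fails: all lost */
        printf("unsafe: sector 3 intact? %s\n",
               memcmp(a.sector[2], "three", 6) == 0 ? "yes" : "no");
        return 0;
}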
do this right and you are properly documenting a failure mode that most
people don't understand, but go too far and you are crying wolf.
David Lang
Hi!
>>> is it under all conditions, or only when you have already lost redundancy?
>>
>> I'd prefer not to specify.
>
> you need to, otherwise you are claiming that all linux software raid
> implementations will lose data on powerfail, which I don't think is the
> case.
Well, I'm not saying it loses data on _every_ powerfail ;-).
>>> also, the talk about software RAID 5/6 arrays without journals will be
>>> confusing (after all, if you are using ext3/XFS/etc you are using a
>>> journal, aren't you?)
>>
>> Slightly confusing, yes. Should I just say "MD RAID 5" and avoid
>> talking about hardware RAID arrays, where that's really
>> manufacturer-specific?
>
> what about dm raid?
>
> I don't think you should talk about hardware raid cards.
Ok, fixed.
>>> in addition, even with a single drive you will lose some data on power
>>> loss (unless you do sync mounts with disabled write caches), full data
>>> journaling can help protect you from this, but the default journaling
>>> just protects the metadata.
>>
>> "Data loss" here means "damaging data that were already fsynced". That
>> will not happen on single disk (with barriers on etc), but will happen
>> on RAID5 and flash.
>
> this definition of data loss wasn't clear prior to this. you need to
I actually think it was. The write() syscall does not guarantee anything;
fsync() does.
> define this, and state that the reason that flash and raid arrays can
> suffer from this is that both of them deal with blocks of storage larger
> than the data block (eraseblock or raid stripe) and there are conditions
> that can cause the loss of the entire eraseblock or raid stripe which can
> affect data that was previously safe on disk (and if power had been lost
> before the latest write, the prior data would still be safe)
I actually believe Ted's writeup is good.
> note that this doesn't necessarily affect all flash disks. if the disk
> doesn't replace the old block in the FTL until the data has all been
> successfully copied to the new eraseblock you don't have this problem.
>
> some (possibly all) cheap thumb drives don't do this, but I would expect
> the expensive SATA SSDs to do things in the right order.
I'd expect SATA SSDs to have that solved, yes. Again, Ted does not say
it affects _all_ such devices, and it certainly did affect all that I've seen.
> do this right and you are properly documenting a failure mode that most
> people don't understand, but go too far and you are crying wolf.
Ok, the latest version is below; can you suggest improvements? (And yes,
details of when exactly RAID-5 misbehaves should be noted somewhere. I
don't know enough about RAID arrays; can someone help?)
Pavel
---
There are storage devices that have highly undesirable properties
when they are disconnected or suffer power failures while writes are
in progress; such devices include flash devices and MD RAID 4/5/6
arrays. These devices have the property of potentially
corrupting blocks being written at the time of the power failure, and
worse yet, amplifying the region where blocks are corrupted such that
additional sectors are also damaged during the power failure.
Users who use such storage devices are well advised to take
countermeasures, such as the use of Uninterruptible Power Supplies,
and making sure the flash device is not hot-unplugged while the device
is being used. Regular backups when using these devices are also a
Very Good Idea.
Otherwise, file systems placed on these devices can suffer silent data
and file system corruption. A forced use of fsck may detect metadata
corruption resulting in file system corruption, but will not suffice
to detect data corruption.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
> ---
> There are storage devices that have highly undesirable properties
> when they are disconnected or suffer power failures while writes are
> in progress; such devices include flash devices and MD RAID 4/5/6
> arrays. These devices have the property of potentially
> corrupting blocks being written at the time of the power failure, and
> worse yet, amplifying the region where blocks are corrupted such that
> additional sectors are also damaged during the power failure.
I would strike the entire mention of MD devices since it is your assertion, not
a proven fact. You will cause more data loss from common events (single sector
errors, complete drive failure) by steering people away from more reliable
storage configurations because of a really rare edge case (power failure during
split write to two raid members while doing a RAID rebuild).
>
> Users who use such storage devices are well advised to take
> countermeasures, such as the use of Uninterruptible Power Supplies,
> and making sure the flash device is not hot-unplugged while the device
> is being used. Regular backups when using these devices are also a
> Very Good Idea.
All users who care about data integrity - including those who do not use MD RAID5
but just regular single S-ATA disks - will get better reliability from a UPS.
>
> Otherwise, file systems placed on these devices can suffer silent data
> and file system corruption. A forced use of fsck may detect metadata
> corruption resulting in file system corruption, but will not suffice
> to detect data corruption.
>
This is very misleading. All storage "can" have silent data loss; you are making
a statement without specifics about frequency.
FSCK can repair the file system metadata, but will not detect any data loss or
corruption in the data blocks allocated to user files. To detect data loss
properly, you need to checksum (or digitally sign) all objects stored in a file
system and verify them on a regular basis.
It also helps to keep a separate list of those objects on another device so that
when the metadata does take a hit, you can enumerate your objects and verify
that you have not lost anything.
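As a sketch of the kind of object list described here (a toy CRC for
brevity; a real setup would use a cryptographic digest and keep the
output on a separate device):

#include <stdio.h>
#include <stdint.h>

/* Bitwise CRC-32, good enough to illustrate the idea. */
static uint32_t crc32_byte(uint32_t crc, uint8_t b)
{
        crc ^= b;
        for (int i = 0; i < 8; i++)
                crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
        return crc;
}

int main(int argc, char **argv)
{
        /* Emit "<checksum>  <path>" lines for each file named on the
         * command line; diffing two runs reveals silent corruption. */
        for (int i = 1; i < argc; i++) {
                FILE *f = fopen(argv[i], "rb");
                if (!f) { perror(argv[i]); continue; }

                uint32_t crc = 0xFFFFFFFFu;
                int c;
                while ((c = fgetc(f)) != EOF)
                        crc = crc32_byte(crc, (uint8_t)c);
                fclose(f);

                printf("%08x  %s\n", crc ^ 0xFFFFFFFFu, argv[i]);
        }
        return 0;
}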
ric
On Wed, 26 Aug 2009, Pavel Machek wrote:
> There are storage devices that have highly undesirable properties
> when they are disconnected or suffer power failures while writes are
> in progress; such devices include flash devices and MD RAID 4/5/6
> arrays.
change this to say 'degraded MD RAID 4/5/6 arrays'
also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly
suspect that they do)
then you need to add a note that if the array becomes degraded before a
scrub cycle happens previously hidden damage (that would have been
repaired by the scrub) can surface.
> These devices have the property of potentially corrupting blocks being
> written at the time of the power failure,
this is true of all devices
> and worse yet, amplifying the region where blocks are corrupted such
> that additional sectors are also damaged during the power failure.
re-word this to something like:
In addition to the standard risk of corrupting the blocks being written at
the time of the power failure, additional blocks (in the same flash
eraseblock or raid stripe) may also be corrupted.
> Users who use such storage devices are well advised to take
> countermeasures, such as the use of Uninterruptible Power Supplies,
> and making sure the flash device is not hot-unplugged while the device
> is being used. Regular backups when using these devices are also a
> Very Good Idea.
>
> Otherwise, file systems placed on these devices can suffer silent data
> and file system corruption. A forced use of fsck may detect metadata
> corruption resulting in file system corruption, but will not suffice
> to detect data corruption.
David Lang
On Tue 2009-08-25 19:48:09, Ric Wheeler wrote:
>
>> ---
>> There are storage devices that have highly undesirable properties
>> when they are disconnected or suffer power failures while writes are
>> in progress; such devices include flash devices and MD RAID 4/5/6
>> arrays. These devices have the property of potentially
>> corrupting blocks being written at the time of the power failure, and
>> worse yet, amplifying the region where blocks are corrupted such that
>> additional sectors are also damaged during the power failure.
>
> I would strike the entire mention of MD devices since it is your
> assertion, not a proven fact. You will cause more data loss from common
That actually is a fact. That's how MD RAID 5 is designed. And btw
those are originally Ted's words.
> events (single sector errors, complete drive failure) by steering people
> away from more reliable storage configurations because of a really rare
> edge case (power failure during split write to two raid members while
> doing a RAID rebuild).
I'm not sure what's rare about power failures. Unlike single sector
errors, my machine actually has a button that produces exactly that
event. Running degraded raid5 arrays for extended periods may be a
slightly unusual configuration, but I suspect people should just do
that for testing. (And from the discussion, people seem to think that
degraded raid5 is equivalent to raid0).
>> Otherwise, file systems placed on these devices can suffer silent data
>> and file system corruption. A forced use of fsck may detect metadata
>> corruption resulting in file system corruption, but will not suffice
>> to detect data corruption.
>>
>
> This is very misleading. All storage "can" have silent data loss, you are
> making a statement without specifics about frequency.
substitute with "can (by design)"?
Now, can you suggest a useful version of that document meeting your
criteria?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Tue 2009-08-25 16:56:40, [email protected] wrote:
> On Wed, 26 Aug 2009, Pavel Machek wrote:
>
>> There are storage devices that have highly undesirable properties
>> when they are disconnected or suffer power failures while writes are
>> in progress; such devices include flash devices and MD RAID 4/5/6
>> arrays.
>
> change this to say 'degraded MD RAID 4/5/6 arrays'
>
> also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly
> suspect that they do)
I changed it to say MD/DM.
> then you need to add a note that if the array becomes degraded before a
> scrub cycle happens previously hidden damage (that would have been
> repaired by the scrub) can surface.
I'd prefer not to talk about scrubbing and such details here. Better to
leave a warning here and point to the MD documentation.
>> THESE devices have the property of potentially corrupting blocks being
>> written at the time of the power failure,
>
> this is true of all devices
Actually I don't think so. I believe SATA disks do not corrupt even
the sector they are writing to -- they just have big enough
capacitors. And yes I believe ext3 depends on that.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On 08/25/2009 08:06 PM, Pavel Machek wrote:
> On Tue 2009-08-25 19:48:09, Ric Wheeler wrote:
>>
>>> ---
>>> There are storage devices that have highly undesirable properties
>>> when they are disconnected or suffer power failures while writes are
>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>> arrays. These devices have the property of potentially
>>> corrupting blocks being written at the time of the power failure, and
>>> worse yet, amplifying the region where blocks are corrupted such that
>>> additional sectors are also damaged during the power failure.
>>
>> I would strike the entire mention of MD devices since it is your
>> assertion, not a proven fact. You will cause more data loss from common
>
> That actually is a fact. That's how MD RAID 5 is designed. And btw
> those are originally Ted's words.
>
Ted did not design MD RAID5.
>> events (single sector errors, complete drive failure) by steering people
>> away from more reliable storage configurations because of a really rare
>> edge case (power failure during split write to two raid members while
>> doing a RAID rebuild).
>
> I'm not sure what's rare about power failures. Unlike single sector
> errors, my machine actually has a button that produces exactly that
> event. Running degraded raid5 arrays for extended periods may be
> slightly unusual configuration, but I suspect people should just do
> that for testing. (And from the discussion, people seem to think that
> degraded raid5 is equivalent to raid0).
Power failures after a full drive failure with a split write during a rebuild?
>
>>> Otherwise, file systems placed on these devices can suffer silent data
>>> and file system corruption. A forced use of fsck may detect metadata
>>> corruption resulting in file system corruption, but will not suffice
>>> to detect data corruption.
>>>
>>
>> This is very misleading. All storage "can" have silent data loss, you are
>> making a statement without specifics about frequency.
>
> substitute with "can (by design)"?
By Pavel's unproven casual observation?
>
> Now, if you can suggest useful version of that document meeting your
> criteria?
>
> Pavel
>>>> ---
>>>> There are storage devices that have highly undesirable properties
>>>> when they are disconnected or suffer power failures while writes are
>>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>>> arrays. These devices have the property of potentially
>>>> corrupting blocks being written at the time of the power failure, and
>>>> worse yet, amplifying the region where blocks are corrupted such that
>>>> additional sectors are also damaged during the power failure.
>>>
>>> I would strike the entire mention of MD devices since it is your
>>> assertion, not a proven fact. You will cause more data loss from common
>>
>> That actually is a fact. That's how MD RAID 5 is designed. And btw
>> those are originally Ted's words.
>
> Ted did not design MD RAID5.
So what? He clearly knows how it works.
Instead of arguing he's wrong, will you simply label everything as
unproven?
>>> events (single sector errors, complete drive failure) by steering people
>>> away from more reliable storage configurations because of a really rare
>>> edge case (power failure during split write to two raid members while
>>> doing a RAID rebuild).
>>
>> I'm not sure what's rare about power failures. Unlike single sector
>> errors, my machine actually has a button that produces exactly that
>> event. Running degraded raid5 arrays for extended periods may be
>> slightly unusual configuration, but I suspect people should just do
>> that for testing. (And from the discussion, people seem to think that
>> degraded raid5 is equivalent to raid0).
>
> Power failures after a full drive failure with a split write during a rebuild?
Look, I don't need full drive failure for this to happen. I can just
remove one disk from array. I don't need power failure, I can just
press the power button. I don't even need to rebuild anything, I can
just write to degraded array.
Given that all events are under my control, statistics make little
sense here.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Wed, 26 Aug 2009, Pavel Machek wrote:
> On Tue 2009-08-25 16:56:40, [email protected] wrote:
>> On Wed, 26 Aug 2009, Pavel Machek wrote:
>>
>>> There are storage devices that have highly undesirable properties
>>> when they are disconnected or suffer power failures while writes are
>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>> arrays.
>>
>> change this to say 'degraded MD RAID 4/5/6 arrays'
>>
>> also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly
>> suspect that they do)
>
> I changed it to say MD/DM.
>
>> then you need to add a note that if the array becomes degraded before a
>> scrub cycle happens previously hidden damage (that would have been
>> repaired by the scrub) can surface.
>
> I'd prefer not to talk about scrubbing and such details here. Better
> leave warning here and point to MD documentation.
I disagree with that; the way you are wording this makes it sound as if
raid isn't worth it. if you are going to say that raid is risky, you need
to properly specify when it is risky.
>>> THESE devices have the property of potentially corrupting blocks being
>>> written at the time of the power failure,
>>
>> this is true of all devices
>
> Actually I don't think so. I believe SATA disks do not corrupt even
> the sector they are writing to -- they just have big enough
> capacitors. And yes I believe ext3 depends on that.
you are incorrect on this.
ext3 (like every other filesystem) just accepts the risk (zfs makes some
attempt to detect such corruption)
David Lang
On 08/25/2009 08:12 PM, Pavel Machek wrote:
> On Tue 2009-08-25 16:56:40, [email protected] wrote:
>> On Wed, 26 Aug 2009, Pavel Machek wrote:
>>
>>> There are storage devices that have highly undesirable properties
>>> when they are disconnected or suffer power failures while writes are
>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>> arrays.
>>
>> change this to say 'degraded MD RAID 4/5/6 arrays'
>>
>> also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly
>> suspect that they do)
>
> I changed it to say MD/DM.
>
>> then you need to add a note that if the array becomes degraded before a
>> scrub cycle happens previously hidden damage (that would have been
>> repaired by the scrub) can surface.
>
> I'd prefer not to talk about scrubbing and such details here. Better
> leave warning here and point to MD documentation.
Then you should punt the MD discussion to the MD documentation entirely.
I would suggest:
"Users of any file system that have a single media (SSD, flash or normal disk)
can suffer from catastrophic and complete data loss if that single media fails.
To reduce your exposure to data loss after a single point of failure, consider
using either hardware or properly configured software RAID. See the
documentation on MD RAID for how to configure it.
To insure proper fsync() semantics, you will need to have a storage device that
supports write barriers or have a non-volatile write cache. If not, best
practices dictate disabling the write cache on the storage device."
>
>>> THESE devices have the property of potentially corrupting blocks being
>>> written at the time of the power failure,
>>
>> this is true of all devices
>
> Actually I don't think so. I believe SATA disks do not corrupt even
> the sector they are writing to -- they just have big enough
> capacitors. And yes I believe ext3 depends on that.
> Pavel
Pavel, no S-ATA drive has capacitors to hold up during a power failure (or even
enough power to destage their write cache). I know this from direct, personal
knowledge having built RAID boxes at EMC for years. In fact, almost all RAID
boxes require that the write cache be hardwired to off when used in their arrays.
Drives fail partially on a very common basis - look at your remapped sector
count with smartctl.
RAID (including MD RAID5) will protect you from this most common error as it
will protect you from complete drive failure which is also an extremely common
event.
Your scenario is really, really rare - doing a full rebuild after a complete
drive failure (takes a matter of hours, depends on the size of the disk) and
having a power failure during that rebuild.
Of course adding a UPS to any storage system (including MD RAID system) helps
make it more reliable, specifically in your scenario.
The more important point is that having any RAID (MD RAID1, RAID5 or RAID6) will
greatly reduce your chance of data loss if configured correctly, with ext3, ext2
or zfs.
Ric
On Wed, 26 Aug 2009, Pavel Machek wrote:
>>>>> ---
>>>>> There are storage devices that have highly undesirable properties
>>>>> when they are disconnected or suffer power failures while writes are
>>>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>>>> arrays. These devices have the property of potentially
>>>>> corrupting blocks being written at the time of the power failure, and
>>>>> worse yet, amplifying the region where blocks are corrupted such that
>>>>> additional sectors are also damaged during the power failure.
>>>>
>>>> I would strike the entire mention of MD devices since it is your
>>>> assertion, not a proven fact. You will cause more data loss from common
>>>
>>> That actually is a fact. That's how MD RAID 5 is designed. And btw
>>> those are originally Ted's words.
>>
>> Ted did not design MD RAID5.
>
> So what? He clearly knows how it works.
>
> Instead of arguing he's wrong, will you simply label everything as
> unproven?
>
>>>> events (single sector errors, complete drive failure) by steering people
>>>> away from more reliable storage configurations because of a really rare
>>>> edge case (power failure during split write to two raid members while
>>>> doing a RAID rebuild).
>>>
>>> I'm not sure what's rare about power failures. Unlike single sector
>>> errors, my machine actually has a button that produces exactly that
>>> event. Running degraded raid5 arrays for extended periods may be
>>> slightly unusual configuration, but I suspect people should just do
>>> that for testing. (And from the discussion, people seem to think that
>>> degraded raid5 is equivalent to raid0).
>>
>> Power failures after a full drive failure with a split write during a rebuild?
>
> Look, I don't need full drive failure for this to happen. I can just
> remove one disk from array. I don't need power failure, I can just
> press the power button. I don't even need to rebuild anything, I can
> just write to degraded array.
>
> Given that all events are under my control, statistics make little
> sense here.
if you are intentionally causing several low-probability things to happen
at once you increase the risk of corruption
note that you also need a write to take place, and be interrupted in just
the right way.
David Lang
On 08/25/2009 08:20 PM, Pavel Machek wrote:
>>>>> ---
>>>>> There are storage devices that have highly undesirable properties
>>>>> when they are disconnected or suffer power failures while writes are
>>>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>>>> arrays. These devices have the property of potentially
>>>>> corrupting blocks being written at the time of the power failure, and
>>>>> worse yet, amplifying the region where blocks are corrupted such that
>>>>> additional sectors are also damaged during the power failure.
>>>>
>>>> I would strike the entire mention of MD devices since it is your
>>>> assertion, not a proven fact. You will cause more data loss from common
>>>
>>> That actually is a fact. That's how MD RAID 5 is designed. And btw
>>> those are originally Ted's words.
>>
>> Ted did not design MD RAID5.
>
> So what? He clearly knows how it works.
>
> Instead of arguing he's wrong, will you simply label everything as
> unproven?
>
>>>> events (single sector errors, complete drive failure) by steering people
>>>> away from more reliable storage configurations because of a really rare
>>>> edge case (power failure during split write to two raid members while
>>>> doing a RAID rebuild).
>>>
>>> I'm not sure what's rare about power failures. Unlike single sector
>>> errors, my machine actually has a button that produces exactly that
>>> event. Running degraded raid5 arrays for extended periods may be
>>> slightly unusual configuration, but I suspect people should just do
>>> that for testing. (And from the discussion, people seem to think that
>>> degraded raid5 is equivalent to raid0).
>>
>> Power failures after a full drive failure with a split write during a rebuild?
>
> Look, I don't need full drive failure for this to happen. I can just
> remove one disk from array. I don't need power failure, I can just
> press the power button. I don't even need to rebuild anything, I can
> just write to degraded array.
>
> Given that all events are under my control, statistics make little
> sense here.
> Pavel
>
You are deliberately causing a double failure - pressing the power button after
pulling a drive is exactly that scenario.
Pull your single (non-MD RAID5) disk out while writing (hot unplug from the S-ATA
side, leaving power on) and run some tests to verify your assertions...
ric
>>>> I'm not sure what's rare about power failures. Unlike single sector
>>>> errors, my machine actually has a button that produces exactly that
>>>> event. Running degraded raid5 arrays for extended periods may be
>>>> slightly unusual configuration, but I suspect people should just do
>>>> that for testing. (And from the discussion, people seem to think that
>>>> degraded raid5 is equivalent to raid0).
>>>
>>> Power failures after a full drive failure with a split write during a rebuild?
>>
>> Look, I don't need full drive failure for this to happen. I can just
>> remove one disk from array. I don't need power failure, I can just
>> press the power button. I don't even need to rebuild anything, I can
>> just write to degraded array.
>>
>> Given that all events are under my control, statistics make little
>> sense here.
>
> You are deliberately causing a double failure - pressing the power button
> after pulling a drive is exactly that scenario.
Exactly. And now I'm trying to get that documented, so that people
don't do it and still expect their fs to be consistent.
> Pull your single (non-MD5) disk out while writing (hot unplug from the
> S-ATA side, leaving power on) and run some tests to verify your
> assertions...
I actually did that some time ago by pulling a SATA disk (I actually
pulled both SATA *and* power -- that was the way the hotplug envelope
worked; that's a harsher test than what you suggest, so that should
be ok). The write test was fsync heavy, with logging to a separate drive,
checking that all the data for which fsync succeeded were indeed
accessible. I uncovered a few bugs in ext* that jack fixed, I uncovered
some libata weirdness that is not yet fixed AFAIK, but with all the
patches applied I could not break that single SATA disk.
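The shape of that test, as a hedged sketch (paths and the record
format are made up; this is not the actual test program):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* Device under test vs. a reference log on a *different* drive.
         * After the crash, every record present in the reference log
         * must also be readable from the device under test. */
        int data = open("/mnt/testdisk/data.log",
                        O_WRONLY | O_CREAT | O_APPEND, 0644);
        int log  = open("/mnt/otherdisk/committed.log",
                        O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (data < 0 || log < 0) { perror("open"); return 1; }

        for (unsigned long seq = 0; ; seq++) {
                char rec[64];
                int len = snprintf(rec, sizeof(rec), "record %lu\n", seq);

                if (write(data, rec, len) != len || fsync(data) != 0)
                        break;  /* device under test went away */

                /* Only records whose fsync succeeded are claimed durable,
                 * so only those enter the reference log. */
                if (write(log, rec, len) != len || fsync(log) != 0)
                        break;
        }
        return 0;
}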
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Tue 2009-08-25 17:20:13, [email protected] wrote:
> On Wed, 26 Aug 2009, Pavel Machek wrote:
>
>> On Tue 2009-08-25 16:56:40, [email protected] wrote:
>>> On Wed, 26 Aug 2009, Pavel Machek wrote:
>>>
>>>> There are storage devices that have highly undesirable properties
>>>> when they are disconnected or suffer power failures while writes are
>>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>>> arrays.
>>>
>>> change this to say 'degraded MD RAID 4/5/6 arrays'
>>>
>>> also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly
>>> suspect that they do)
>>
>> I changed it to say MD/DM.
>>
>>> then you need to add a note that if the array becomes degraded before a
>>> scrub cycle happens previously hidden damage (that would have been
>>> repaired by the scrub) can surface.
>>
>> I'd prefer not to talk about scrubbing and such details here. Better
>> leave warning here and point to MD documentation.
>
> I disagree with that, the way you are wording this makes it sound as if
> raid isn't worth it. if you are going to say that raid is risky you need
> to properly specify when it is risky
Ok, would this help? I don't really want to go into scrubbing details.
(*) A degraded array or single disk failure "near" the powerfail is
necessary for this property of RAID arrays to bite.
>>>> THESE devices have the property of potentially corrupting blocks being
>>>> written at the time of the power failure,
>>>
>>> this is true of all devices
>>
>> Actually I don't think so. I believe SATA disks do not corrupt even
>> the sector they are writing to -- they just have big enough
>> capacitors. And yes I believe ext3 depends on that.
>
> you are incorrect on this.
>
> ext3 (like every other filesystem) just accepts the risk (zfs makes some
> attempt to detect such corruption)
I'd like Ted to comment on this. He wrote the original document, and
I'd prefer not to introduce mistakes.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
>>>> THESE devices have the property of potentially corrupting blocks being
>>>> written at the time of the power failure,
>>>
>>> this is true of all devices
>>
>> Actually I don't think so. I believe SATA disks do not corrupt even
>> the sector they are writing to -- they just have big enough
>> capacitors. And yes I believe ext3 depends on that.
>
> Pavel, no S-ATA drive has capacitors to hold up during a power failure
> (or even enough power to destage their write cache). I know this from
> direct, personal knowledge having built RAID boxes at EMC for years. In
> fact, almost all RAID boxes require that the write cache be hardwired to
> off when used in their arrays.
I never claimed they have enough power to flush the entire cache -- read
the paragraph again. I do believe the disks have enough capacitors to
finish writing a single sector, and I do believe ext3 depends on that.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On 08/25/2009 08:38 PM, Pavel Machek wrote:
>>>>> I'm not sure what's rare about power failures. Unlike single sector
>>>>> errors, my machine actually has a button that produces exactly that
>>>>> event. Running degraded raid5 arrays for extended periods may be
>>>>> slightly unusual configuration, but I suspect people should just do
>>>>> that for testing. (And from the discussion, people seem to think that
>>>>> degraded raid5 is equivalent to raid0).
>>>>
>>>> Power failures after a full drive failure with a split write during a rebuild?
>>>
>>> Look, I don't need full drive failure for this to happen. I can just
>>> remove one disk from array. I don't need power failure, I can just
>>> press the power button. I don't even need to rebuild anything, I can
>>> just write to degraded array.
>>>
>>> Given that all events are under my control, statistics make little
>>> sense here.
>>
>> You are deliberately causing a double failure - pressing the power button
>> after pulling a drive is exactly that scenario.
>
> Exactly. And now I'm trying to get that documented, so that people
> don't do it and still expect their fs to be consistent.
The problem I have is that the way you word it steers people away from RAID5 and
better data integrity. Your intentions are good, but your text is going to do
considerable harm.
Most people don't intentionally drop power (or have a power failure) during RAID
rebuilds....
>
>> Pull your single (non-MD5) disk out while writing (hot unplug from the
>> S-ATA side, leaving power on) and run some tests to verify your
>> assertions...
>
> I actually did that some time ago with pulling SATA disk (I actually
> pulled both SATA *and* power -- that was the way hotplug envelope
> worked; that's more harsh test than what you suggest, so that should
> be ok). Write test was fsync heavy, with logging to separate drive,
> checking that all the data where fsync succeeded are indeed
> accessible. I uncovered few bugs in ext* that jack fixed, I uncovered
> some libata weirdness that is not yet fixed AFAIK, but with all the
> patches applied I could not break that single SATA disk.
> Pavel
Fsync heavy workloads with working barriers will tend to keep the write cache
pretty empty (two barrier flushes per fsync) so this is not too surprising.
Drive behaviour depends on a lot of things though - how the firmware prioritizes
writes over reads, etc.
ric
On 08/25/2009 08:44 PM, Pavel Machek wrote:
>
>>>>> THESE devices have the property of potentially corrupting blocks being
>>>>> written at the time of the power failure,
>>>>
>>>> this is true of all devices
>>>
>>> Actually I don't think so. I believe SATA disks do not corrupt even
>>> the sector they are writing to -- they just have big enough
>>> capacitors. And yes I believe ext3 depends on that.
>>
>> Pavel, no S-ATA drive has capacitors to hold up during a power failure
>> (or even enough power to destage their write cache). I know this from
>> direct, personal knowledge having built RAID boxes at EMC for years. In
>> fact, almost all RAID boxes require that the write cache be hardwired to
>> off when used in their arrays.
>
> I never claimed they have enough power to flush entire cache -- read
> the paragraph again. I do believe the disks have enough capacitors to
> finish writing single sector, and I do believe ext3 depends on that.
>
> Pavel
Some scary terms that drive people mention (and measure):
"high fly writes"
"over powered seeks"
"adjacent tack erasure"
If you do get a partial track written, the data integrity bits that the data is
embedded in will flag it as invalid and give you and IO error on the next read.
Note that the damage is not persistent, it will get repaired (in place) on the
next write to that sector.
Also it is worth noting that ext2/3/4 write file system "blocks" not single
sectors. Each ext3 IO is 8 distinct disk sector writes and those can span tracks
on a drive which require a seek which all consume power.
On power loss, a disk will immediately park the heads...
ric
On Wed, 26 Aug 2009, Pavel Machek wrote:
> On Tue 2009-08-25 17:20:13, [email protected] wrote:
>> On Wed, 26 Aug 2009, Pavel Machek wrote:
>>
>>> On Tue 2009-08-25 16:56:40, [email protected] wrote:
>>>> On Wed, 26 Aug 2009, Pavel Machek wrote:
>>>>
>>>>> There are storage devices that have highly undesirable properties
>>>>> when they are disconnected or suffer power failures while writes are
>>>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>>>> arrays.
>>>>
>>>> change this to say 'degraded MD RAID 4/5/6 arrays'
>>>>
>>>> also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly
>>>> suspect that they do)
>>>
>>> I changed it to say MD/DM.
>>>
>>>> then you need to add a note that if the array becomes degraded before a
>>>> scrub cycle happens previously hidden damage (that would have been
>>>> repaired by the scrub) can surface.
>>>
>>> I'd prefer not to talk about scrubbing and such details here. Better
>>> leave warning here and point to MD documentation.
>>
>> I disagree with that, the way you are wording this makes it sound as if
>> raid isn't worth it. if you are going to say that raid is risky you need
>> to properly specify when it is risky
>
> Ok, would this help? I don't really want to go to scrubbing details.
>
> (*) Degraded array or single disk failure "near" the powerfail is
> necessary for this property of RAID arrays to bite.
that sounds reasonable
David Lang
On Wed, 26 Aug 2009, Pavel Machek wrote:
>>>>> THESE devices have the property of potentially corrupting blocks being
>>>>> written at the time of the power failure,
>>>>
>>>> this is true of all devices
>>>
>>> Actually I don't think so. I believe SATA disks do not corrupt even
>>> the sector they are writing to -- they just have big enough
>>> capacitors. And yes I believe ext3 depends on that.
>>
>> Pavel, no S-ATA drive has capacitors to hold up during a power failure
>> (or even enough power to destage their write cache). I know this from
>> direct, personal knowledge having built RAID boxes at EMC for years. In
>> fact, almost all RAID boxes require that the write cache be hardwired to
>> off when used in their arrays.
>
> I never claimed they have enough power to flush entire cache -- read
> the paragraph again. I do believe the disks have enough capacitors to
> finish writing single sector, and I do believe ext3 depends on that.
keep in mind that in a powerfail situation the data being sent to the
drive may be corrupt (the ram gets flaky while a DMA copies the bad data
to the drive, which writes it before the power loss gets bad enough for
the drive to decide there is a problem and shut down)
you just plain cannot count on writes that are in flight when a powerfail
happens to do predictable things, let alone what you consider sane or
proper.
David Lang
Pavel Machek wrote:
> Let's say you are writing to the (healthy) RAID5 and have a powerfail.
>
> So now data blocks do not correspond to the parity block. You don't
> yet have the corruption, but you already have a problem.
>
> If you get a disk failing at this point, you'll get corruption.
Not necessarily. Say you wrote out the entire stripe
in a 5 disk RAID 5 array, but only 3 data blocks and
the parity block got written out before power failure.
If the disk with the 4th (unwritten) data block were
to fail and get taken out of the RAID 5 array, the
degradation of the array could actually undo your data
corruption.
With RAID 5 and incomplete writes, you just don't know.
This kind of thing could go wrong at any level in the
system, with any kind of RAID 5 setup.
Of course, on a single disk system without RAID you can
still get incomplete writes, for the exact same reasons.
RAID 5 does not make things worse. It will protect your
data against certain failure modes, but not against others.
With or without RAID, you still need to make backups.
--
All rights reversed.
Pavel Machek wrote:
> Look, I don't need full drive failure for this to happen. I can just
> remove one disk from array. I don't need power failure, I can just
> press the power button. I don't even need to rebuild anything, I can
> just write to degraded array.
>
> Given that all events are under my control, statistics make little
> sense here.
I recommend a sledgehammer.
If you want to lose your data, you might as well have some fun.
No need to bore yourself to tears by simulating events that are
unlikely to happen simultaneously to careful system administrators.
--
All rights reversed.
On Tue 2009-08-25 20:45:26, Ric Wheeler wrote:
> On 08/25/2009 08:38 PM, Pavel Machek wrote:
>>>>>> I'm not sure what's rare about power failures. Unlike single sector
>>>>>> errors, my machine actually has a button that produces exactly that
>>>>>> event. Running degraded raid5 arrays for extended periods may be
>>>>>> slightly unusual configuration, but I suspect people should just do
>>>>>> that for testing. (And from the discussion, people seem to think that
>>>>>> degraded raid5 is equivalent to raid0).
>>>>>
>>>>> Power failures after a full drive failure with a split write during a rebuild?
>>>>
>>>> Look, I don't need full drive failure for this to happen. I can just
>>>> remove one disk from array. I don't need power failure, I can just
>>>> press the power button. I don't even need to rebuild anything, I can
>>>> just write to degraded array.
>>>>
>>>> Given that all events are under my control, statistics make little
>>>> sense here.
>>>
>>> You are deliberately causing a double failure - pressing the power button
>>> after pulling a drive is exactly that scenario.
>>
>> Exactly. And now I'm trying to get that documented, so that people
>> don't do it and still expect their fs to be consistent.
>
> The problem I have is that the way you word it steers people away from
> RAID5 and better data integrity. Your intentions are good, but your text
> is going to do considerable harm.
>
> Most people don't intentionally drop power (or have a power failure)
> during RAID rebuilds....
An example I saw went like this:
A drive in a raid 5 array failed; a hot spare was available (no idea about a
UPS). The system apparently locked up trying to talk to the failed drive,
or maybe the admin just was not patient enough, so he just powercycled the
array. He lost the array.
So while most people will not aggressively powercycle the RAID array,
a drive failure still provokes little-tested error paths, and getting an
unclean shutdown is quite easy in such a case.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Wed 2009-08-26 00:24:30, Rik van Riel wrote:
> Pavel Machek wrote:
>
>> Look, I don't need full drive failure for this to happen. I can just
>> remove one disk from array. I don't need power failure, I can just
>> press the power button. I don't even need to rebuild anything, I can
>> just write to degraded array.
>>
>> Given that all events are under my control, statistics make little
>> sense here.
>
> I recommend a sledgehammer.
>
> If you want to lose your data, you might as well have some fun.
>
> No need to bore yourself to tears by simulating events that are
> unlikely to happen simultaneously to careful system administrators.
A sledgehammer is a hardware problem, and I'm demonstrating a
software/documentation problem we have here.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Tue 2009-08-25 18:19:40, [email protected] wrote:
> On Wed, 26 Aug 2009, Pavel Machek wrote:
>
>>>>>> THESE devices have the property of potentially corrupting blocks being
>>>>>> written at the time of the power failure,
>>>>>
>>>>> this is true of all devices
>>>>
>>>> Actually I don't think so. I believe SATA disks do not corrupt even
>>>> the sector they are writing to -- they just have big enough
>>>> capacitors. And yes I believe ext3 depends on that.
>>>
>>> Pavel, no S-ATA drive has capacitors to hold up during a power failure
>>> (or even enough power to destage their write cache). I know this from
>>> direct, personal knowledge having built RAID boxes at EMC for years. In
>>> fact, almost all RAID boxes require that the write cache be hardwired to
>>> off when used in their arrays.
>>
>> I never claimed they have enough power to flush entire cache -- read
>> the paragraph again. I do believe the disks have enough capacitors to
>> finish writing single sector, and I do believe ext3 depends on that.
>
> keep in mind that in a powerfail situation the data being sent to the
> drive may be corrupt (the ram gets flaky while a DMA to the drive copies
> the bad data to the drive, which writes it before the power loss gets bad
> enough for the drive to decide there is a problem and shutdown)
>
> you just plain cannot count on writes that are in flight when a powerfail
> happens to do predictable things, let alone what you consider sane or
> proper.
From what I see, this kind of failure is rather harder to reproduce
than the software problems. And at least SGI machines were designed to
avoid this...
Anyway, I'd like to hear from the ext3 people... what happens on read
errors in the journal? That's what you'd expect to see in the situation above.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On 08/26/2009 07:21 AM, Pavel Machek wrote:
> On Tue 2009-08-25 20:45:26, Ric Wheeler wrote:
>
>> On 08/25/2009 08:38 PM, Pavel Machek wrote:
>>
>>>>>>> I'm not sure what's rare about power failures. Unlike single sector
>>>>>>> errors, my machine actually has a button that produces exactly that
>>>>>>> event. Running degraded raid5 arrays for extended periods may be
>>>>>>> slightly unusual configuration, but I suspect people should just do
>>>>>>> that for testing. (And from the discussion, people seem to think that
>>>>>>> degraded raid5 is equivalent to raid0).
>>>>>>>
>>>>>> Power failures after a full drive failure with a split write during a rebuild?
>>>>>>
>>>>> Look, I don't need full drive failure for this to happen. I can just
>>>>> remove one disk from array. I don't need power failure, I can just
>>>>> press the power button. I don't even need to rebuild anything, I can
>>>>> just write to degraded array.
>>>>>
>>>>> Given that all events are under my control, statistics make little
>>>>> sense here.
>>>>>
>>>> You are deliberately causing a double failure - pressing the power button
>>>> after pulling a drive is exactly that scenario.
>>>>
>>> Exactly. And now I'm trying to get that documented, so that people
>>> don't do it and still expect their fs to be consistent.
>>>
>> The problem I have is that the way you word it steers people away from
>> RAID5 and better data integrity. Your intentions are good, but your text
>> is going to do considerable harm.
>>
>> Most people don't intentionally drop power (or have a power failure)
>> during RAID rebuilds....
>>
> An example I saw went like this:
>
> A drive in a RAID 5 array failed; a hot spare was available (no idea
> whether there was a UPS). The system apparently locked up trying to
> talk to the failed drive, or maybe the admin just was not patient
> enough, so he power-cycled the array. He lost the array.
>
> So while most people will not aggressively power-cycle a RAID array,
> a drive failure still provokes little-tested error paths, and getting
> an unclean shutdown is quite easy in such a case.
> Pavel
>
Then what we need to document is do not power cycle an array during a
rebuild, right?
If it wasn't the admin that timed out and the box really was hung (no
drive activity lights, etc), you will need to power cycle/reboot but
then you should not have this active rebuild issuing writes either...
In the end, there are cascading failures that will defeat any data
protection scheme, but that does not mean that the value of that scheme
is zero. We need to get more people to use RAID (including MD RAID5) and
try to enhance it as we go. Just using a single disk is not a good thing...
ric
On Wed, Aug 26, 2009 at 01:25:36PM +0200, Pavel Machek wrote:
> > you just plain cannot count on writes that are in flight when a powerfail
> > happens to do predictable things, let alone what you consider sane or
> > proper.
>
> From what I see, this kind of failure is rather harder to reproduce
> than the software problems. And at least SGI machines were designed to
> avoid this...
>
> Anyway, I'd like to hear from ext3 people... what happens on read
> errors in journal? That's what you'd expect to see in situation above.
On a power failure, what normally happens is that the random garbage
gets written into the disk drive's last dying gasp, since the memory
starts going insane and sends garbage to the disk. So the disk
successfully completes the write, but the sector contains garbage.
Since HDDs tend to be the last thing to die, being less sensitive to
voltage drops than the memory or DMA controller, my experience is that
you don't get a read error after the system comes up, you just get
garbage written into the journal.
The ext3 journalling code waits until all of the journal blocks are
written, and only then writes the commit block. On restart, we look
for the last valid commit block. So if the power failure is before we
write the commit block, we replay the journal up until the previous
commit block. If the power failure is while we are writing the commit
block, garbage will be written out instead of the commit block, and so
it falls back to the previous case.
We do not allow any updates to the filesystem metadata to take place
until the commit block has been written; therefore the filesystem
stays consistent.
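A minimal sketch of that replay rule (this is not the actual ext3/jbd
code; the block layout, the COMMIT_MAGIC value and the transaction
numbering are invented purely for illustration -- the real on-disk
journal format is richer, this only captures the stop-at-bad-commit
idea): replay walks the journal in order and treats a garbled commit
block exactly like a missing one, so a torn commit write simply falls
back to the previous transaction.

#include <stdio.h>

#define NBLOCKS      8
#define COMMIT_MAGIC 0xC0337EA1u   /* invented commit marker value */

struct jblock {
    unsigned magic;         /* COMMIT_MAGIC only in commit blocks   */
    unsigned sequence;      /* transaction number of a commit block */
    char payload[56];       /* journalled metadata (ignored here)   */
};

/*
 * Walk the journal in order and return the last transaction whose
 * commit block is intact.  A commit block overwritten with garbage
 * at power-off fails the magic/sequence test, so replay stops at
 * the previous commit.
 */
static unsigned last_committed(const struct jblock *j, size_t n)
{
    unsigned last = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (j[i].magic != COMMIT_MAGIC)
            continue;              /* ordinary journal data block  */
        if (j[i].sequence != last + 1)
            break;                 /* stale or out-of-order commit */
        last = j[i].sequence;      /* transaction fully committed  */
    }
    return last;
}

int main(void)
{
    struct jblock journal[NBLOCKS] = { { 0 } };

    /* transaction 1: data in blocks 0-1, valid commit block at 2 */
    journal[2].magic = COMMIT_MAGIC;
    journal[2].sequence = 1;

    /* transaction 2: data blocks written, but power failed while the
     * commit block was being written -- garbage instead of the marker */
    journal[5].magic = 0xDEADBEEFu;
    journal[5].sequence = 2;

    printf("replay up to transaction %u\n",
           last_committed(journal, NBLOCKS));
    return 0;
}

Compiled and run, this prints "replay up to transaction 1": the
garbage that landed where transaction 2's commit block should have
been is never mistaken for a commit.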
If the journal *does* develop read errors, then the filesystem will
require a manual fsck, and so the boot operation will get stopped so a
system administrator can provide manual intervention. The best bet
for the sysadmin is to replay as much of the journal as she can, and then
let fsck fix any resulting filesystem inconsistencies. In practice,
though, I've not experienced or seen any reports of this happening
from a power failure; usually it happens if the laptop gets dropped or
the hard drive suffers some other kind of hardware failure.
- Ted
On Wed, Aug 26, 2009 at 07:58:40AM -0400, Ric Wheeler wrote:
>> A drive in a RAID 5 array failed; a hot spare was available (no idea
>> whether there was a UPS). The system apparently locked up trying to
>> talk to the failed drive, or maybe the admin just was not patient
>> enough, so he power-cycled the array. He lost the array.
>>
>> So while most people will not aggressively power-cycle a RAID array,
>> a drive failure still provokes little-tested error paths, and getting
>> an unclean shutdown is quite easy in such a case.
>
> Then what we need to document is do not power cycle an array during a
> rebuild, right?
Well, the software RAID layer could be improved so that it implements
scrubbing by default (i.e., have the md package install a cron job to
run a periodic scrub pass automatically). The MD code could
also regularly check to make sure the hot spare is OK; the other
possibility is that the hot spare, which hadn't been used in a long time,
had silently failed.
> In the end, there are cascading failures that will defeat any data
> protection scheme, but that does not mean that the value of that scheme
> is zero. We need to get more people to use RAID (including MD RAID5) and
> try to enhance it as we go. Just using a single disk is not a good
> thing...
Yep; the solution is to improve the storage devices. It is *not* to
encourage people to think RAID is not worth it, or that somehow ext2
is better than ext3 because it runs fsck's all the time at boot up.
That's just crazy talk.
- Ted
On 08/26/2009 08:40 AM, Theodore Tso wrote:
> On Wed, Aug 26, 2009 at 07:58:40AM -0400, Ric Wheeler wrote:
>>> A drive in a RAID 5 array failed; a hot spare was available (no idea
>>> whether there was a UPS). The system apparently locked up trying to
>>> talk to the failed drive, or maybe the admin just was not patient
>>> enough, so he power-cycled the array. He lost the array.
>>>
>>> So while most people will not aggressively power-cycle a RAID array,
>>> a drive failure still provokes little-tested error paths, and getting
>>> an unclean shutdown is quite easy in such a case.
>>
>> Then what we need to document is do not power cycle an array during a
>> rebuild, right?
>
> Well, the software RAID layer could be improved so that it implements
> scrubbing by default (i.e., have the md package install a cron job to
> run a periodic scrub pass automatically). The MD code could
> also regularly check to make sure the hot spare is OK; the other
> possibility is that the hot spare, which hadn't been used in a long time,
> had silently failed.
Actually, MD does this scan already (not automatically, but you can set up a
simple cron job to kick off a periodic "check"). It is a delicate balance to get
the frequency of the scrubbing correct.
On one hand, you want to make sure that you detect errors in a timely fashion,
certainly catching single-sector errors before a second sector-level error
develops on another drive.
On the other hand, running scans/scrubs continually impacts the performance of
your real workload and can potentially impact your components' life span by
subjecting them to a heavy workload.
The rule of thumb from my experience is that most people settle on a scan
once every week or two (done at a throttled rate).
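For reference, such a cron job boils down to poking the md sysfs
interface. A minimal sketch, assuming the array is md0 (the device
name is an assumption, and it has to run as root; Debian's mdadm
package ships a checkarray script that does this properly):

#include <stdio.h>

int main(void)
{
    /* md0 is assumed here; substitute your array name */
    const char *path = "/sys/block/md0/md/sync_action";
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return 1;
    }
    /* "check" starts a background scrub; progress is visible in
     * /proc/mdstat, and mismatches show up in .../md/mismatch_cnt */
    if (fputs("check\n", f) == EOF || fclose(f) != 0) {
        perror(path);
        return 1;
    }
    return 0;
}

Writing "repair" instead of "check" would also correct any mismatches
it finds.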
>
>> In the end, there are cascading failures that will defeat any data
>> protection scheme, but that does not mean that the value of that scheme
>> is zero. We need to get more people to use RAID (including MD RAID5) and
>> try to enhance it as we go. Just using a single disk is not a good
>> thing...
>
> Yep; the solution is to improve the storage devices. It is *not* to
> encourage people to think RAID is not worth it, or that somehow ext2
> is better than ext3 because it runs fsck's all the time at boot up.
> That's just crazy talk.
>
> - Ted
Agreed....
ric
On Wed, 26 Aug 2009, Ric Wheeler wrote:
> On 08/26/2009 08:40 AM, Theodore Tso wrote:
>> On Wed, Aug 26, 2009 at 07:58:40AM -0400, Ric Wheeler wrote:
>>>> A drive in a RAID 5 array failed; a hot spare was available (no idea
>>>> whether there was a UPS). The system apparently locked up trying to
>>>> talk to the failed drive, or maybe the admin just was not patient
>>>> enough, so he power-cycled the array. He lost the array.
>>>>
>>>> So while most people will not aggressively power-cycle a RAID array,
>>>> a drive failure still provokes little-tested error paths, and getting
>>>> an unclean shutdown is quite easy in such a case.
>>>
>>> Then what we need to document is do not power cycle an array during a
>>> rebuild, right?
>>
>> Well, the software RAID layer could be improved so that it implements
>> scrubbing by default (i.e., have the md package install a cron job to
>> run a periodic scrub pass automatically). The MD code could
>> also regularly check to make sure the hot spare is OK; the other
>> possibility is that the hot spare, which hadn't been used in a long time,
>> had silently failed.
>
> Actually, MD does this scan already (not automatically, but you can set up a
> simple cron job to kick off a periodic "check"). It is a delicate balance to
> get the frequency of the scrubbing correct.
Debian defaults to doing this once a month (the first Sunday of each month);
on some of my systems this scrub takes almost a week to complete.
David Lang
Pavel Machek wrote:
> A sledgehammer is a hardware problem; I'm demonstrating a
> software/documentation problem we have here.
So your argument is that a sledgehammer is a hardware
problem, while a broken hard disk and a power failure
are software/documentation issues?
I'd argue that the broken hard disk and power failure
are hardware issues, too.
--
All rights reversed.
>> An example I saw went like this:
>>
>> A drive in a RAID 5 array failed; a hot spare was available (no idea
>> whether there was a UPS). The system apparently locked up trying to
>> talk to the failed drive, or maybe the admin just was not patient
>> enough, so he power-cycled the array. He lost the array.
>>
>> So while most people will not aggressively power-cycle a RAID array,
>> a drive failure still provokes little-tested error paths, and getting
>> an unclean shutdown is quite easy in such a case.
>
> Then what we need to document is do not power cycle an array during a
> rebuild, right?
Yep, that and the fact that you should fsck if you do.
> If it wasn't the admin that timed out and the box really was hung (no
> drive activity lights, etc), you will need to power cycle/reboot but
> then you should not have this active rebuild issuing writes either...
Ok, I guess you are right here.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Wed 2009-08-26 10:45:44, Rik van Riel wrote:
> Pavel Machek wrote:
>
>> A sledgehammer is a hardware problem; I'm demonstrating a
>> software/documentation problem we have here.
>
> So your argument is that a sledgehammer is a hardware
> problem, while a broken hard disk and a power failure
> are software/documentation issues?
>
> I'd argue that the broken hard disk and power failure
> are hardware issues, too.
No one told me that a degraded md raid5 is dangerous. That's documentation
issue #1. Maybe I just pulled the disk for fun.
The ext3 docs told me that the journal protects me against fs corruption
during power failures. It does not in this particular case. That seems like
docs issue #2. Maybe I just hit the reset button because it was there.
Randomly hitting the power button may be stupid, but it should not result in
filesystem corruption on a reasonably working filesystem/storage stack.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On 2009-08-29 04:39, Pavel Machek wrote:
> On Wed 2009-08-26 10:45:44, Rik van Riel wrote:
>> Pavel Machek wrote:
>>
>>> A sledgehammer is a hardware problem; I'm demonstrating a
>>> software/documentation problem we have here.
>> So your argument is that a sledgehammer is a hardware
>> problem, while a broken hard disk and a power failure
>> are software/documentation issues?
>>
>> I'd argue that the broken hard disk and power failure
>> are hardware issues, too.
>
> No one told me that a degraded md raid5 is dangerous. That's documentation
> issue #1. Maybe I just pulled the disk for fun.
You're kidding, right?
Or are you being too effectively sarcastic?
--
Obsession with "preserving cultural heritage" is a racist impediment
to moral, physical and intellectual progress.
Ron Johnson wrote:
> On 2009-08-29 04:39, Pavel Machek wrote:
>> No one told me that a degraded md raid5 is dangerous. That's documentation
>> issue #1. Maybe I just pulled the disk for fun.
>
> You're kidding, right?
No, he is not... and that is exactly why Ted and Ric have been
fighting so hard against his scare-the-children documentation.
In 20 years, I have not found a way to educate those who think
"I know computers, so it must work the way I want and expect."
Tremendous amounts of information and recommendations are out
there on the web, in books, classes, etc. But people don't do the
research before using something, or seek to understand it before
they have a problem.
Pavel Machek wrote:
> It is not only for system administrators; I was trying to find
> out if kernel is buggy, and that should be in kernel tree.
Pavel, *THE KERNEL IS NOT BUGGY*, end of story!
Everyone experienced in storage understands that "in the
edge case that Pavel hit, you will lose your data", and we
take our responsibility to tell people what works and does
not work very seriously. And we try very hard to reduce the
number of edge-case data losses.
But as Ric and Ted and many others keep trying to explain:
- There is no such thing as "never fails" data storage.
- The goal of journaling file systems is not what you think.
- The goal of RAID is not what you think.
- We do not want the vast majority of computer users who
are not kernel engineers to stop using the technology
that in 99.99 percent of the use cases keeps their data
as safe as we can reasonably make it, just because they
read Pavel's 0.01 percent scary and inaccurate case.
And the worst part is that this 0.01 percent case
is really "I did not know what I was doing".
jim
On Wed 2009-08-26 08:37:09, Theodore Tso wrote:
> On Wed, Aug 26, 2009 at 01:25:36PM +0200, Pavel Machek wrote:
> > > you just plain cannot count on writes that are in flight when a powerfail
> > > happens to do predictable things, let alone what you consider sane or
> > > proper.
> >
> > From what I see, this kind of failure is rather harder to reproduce
> > than the software problems. And at least SGI machines were designed to
> > avoid this...
> >
> > Anyway, I'd like to hear from ext3 people... what happens on read
> > errors in journal? That's what you'd expect to see in situation above.
>
> On a power failure, what normally happens is that the random garbage
> gets written into the disk drive's last dying gasp, since the memory
> starts going insane and sends garbage to the disk. So the disk
> successfully completes the write, but the sector contains garbage.
> Since HDDs tend to be the last thing to die, being less sensitive to
> voltage drops than the memory or DMA controller, my experience is that
> you don't get a read error after the system comes up, you just get
> garbage written into the journal.
>
> The ext3 journalling code waits until all of the journal blocks are
> written, and only then writes the commit block. On restart, we look
> for the last valid commit block. So if the power failure is before we
> write the commit block, we replay the journal up until the previous
> commit block. If the power failure is while we are writing the commit
> block, garbage will be written out instead of the commit block, and so
> it falls back to the previous case.
>
> We do not allow any updates to the filesystem metadata to take place
> until the commit block has been written; therefore the filesystem
> stays consistent.
Ok, cool.
> If the journal *does* develop read errors, then the filesystem will
> require a manual fsck, and so the boot operation will get stopped so a
> system administrator can provide manual intervention. The best bet
> for the sysadmin is to replay as much of the journal as she can, and then
> let fsck fix any resulting filesystem inconsistencies. In practice,
...and that should result in a consistent fs with no data loss, because
a read error is essentially the same as garbage being given back, right?
...plus, this is a significant difference from logical-logging
filesystems, no?
Should this go into Documentation/ somewhere?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html