2009-08-25 13:38:34

by Ric Wheeler

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On 08/25/2009 05:42 AM, Pavel Machek wrote:
> On Mon 2009-08-24 20:08:42, Theodore Tso wrote:
>> On Tue, Aug 25, 2009 at 01:00:36AM +0200, Pavel Machek wrote:
>>> Then to answer your question... ext2. You expect to run fsck after
>>> unclean shutdown, and you expect to have to solve some problems with
>>> it. So the way ext2 deals with the flash media actually matches what
>>> the user expects. (*)
>>
>> But if the 256k hole is in data blocks, fsck won't find a problem,
>> even with ext2.
>
> True.
>
>> And if the 256k hole is the inode table, you will *still* suffer
>> massive data loss. Fsck will tell you how badly screwed you are, but
>> it doesn't "fix" the disk; most users don't consider questions of the
>> form "directory entry<precious-thesis-data> points to trashed inode,
>> may I delete directory entry?" as being terribly helpful. :-/
>
> Well it will fix the disk in the end. And no, "directory entry
> <precious-thesis-data> points to trashed inode, may I delete directory
> entry?" is not _terribly_ helpful, but it is slightly helpful and
> people actually expect that from ext2.
>
>> Maybe this came as a surprise to you, but anyone who has used a
>> compact flash in a digital camera knows that you ***have*** to wait
>> until the led has gone out before trying to eject the flash card. I
>> remember seeing all sorts of horror stories from professional
>> photographers about how they lost an important wedding's day worth of
>> pictures with the attendant commercial loss, on various digital
>> photography forums. It tends to be the sort of mistake that digital
>> photographers only make once.
>
> It actually comes as surprise to me. Actually yes and no. I know that
> digital cameras use VFAT, so pulling CF card out of it may do bad
> thing, unless I run fsck.vfat afterwards. If digital camera was using
> ext3, I'd expect it to be safely pullable at any time.
>
> Will IBM microdrive do any difference there?
>
> Anyway, it was not known to me. Rather than claiming "everyone knows"
> (when clearly very few people really understand all the details), can
> we simply document that?
> Pavel

I really think that all OSes (Windows, Mac, even your iPod) teach you not to
hot unplug a device with any file system. Users have an "eject" or "safe
unload" option in Windows, your iPod tells you not to power off or disconnect,
etc.

I don't object to making that general statement - "Don't hot unplug a device
with an active file system or actively used raw device" - but would object to
the overly general statement about ext3 not working on flash, RAID5 not working,
etc...

ric





2009-08-25 13:42:10

by Alan Cox

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Tue, 25 Aug 2009 09:37:12 -0400
Ric Wheeler wrote:

> I really think that the expectation that all OS's (windows, mac, even your ipod)
> all teach you not to hot unplug a device with any file system. Users have an
> "eject" or "safe unload" in windows, your iPod tells you not to power off or
> disconnect, etc.

Agreed

> I don't object to making that general statement - "Don't hot unplug a device
> with an active file system or actively used raw device" - but would object to
> the overly general statement about ext3 not working on flash, RAID5 not working,
> etc...

The overall general statement for all media and all OS's should be

"Do you have a backup, have you tested it recently"


2009-08-25 21:15:27

by Pavel Machek

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible


>>> Maybe this came as a surprise to you, but anyone who has used a
>>> compact flash in a digital camera knows that you ***have*** to wait
>>> until the led has gone out before trying to eject the flash card. I
>>> remember seeing all sorts of horror stories from professional
>>> photographers about how they lost an important wedding's day worth of
>>> pictures with the attendant commercial loss, on various digital
>>> photography forums. It tends to be the sort of mistake that digital
>>> photographers only make once.
>>
>> It actually comes as surprise to me. Actually yes and no. I know that
>> digital cameras use VFAT, so pulling CF card out of it may do bad
>> thing, unless I run fsck.vfat afterwards. If digital camera was using
>> ext3, I'd expect it to be safely pullable at any time.
>>
>> Will IBM microdrive do any difference there?
>>
>> Anyway, it was not known to me. Rather than claiming "everyone knows"
>> (when clearly very few people really understand all the details), can
>> we simply document that?
>
> I really think that the expectation that all OS's (windows, mac, even
> your ipod) all teach you not to hot unplug a device with any file system.
> Users have an "eject" or "safe unload" in windows, your iPod tells you
> not to power off or disconnect, etc.

That was before journaling filesystems...

> I don't object to making that general statement - "Don't hot unplug a
> device with an active file system or actively used raw device" - but
> would object to the overly general statement about ext3 not working on
> flash, RAID5 not working, etc...

You can object any way you want, but running ext3 on flash or MD RAID5
is stupid:

* ext2 would be faster

* ext2 would provide better protection against powerfail.

"ext3 works on flash and MD RAID5, as long as you do not have
powerfail" seems to be the accurate statement, and if you don't need
to protect against powerfails, you can just use ext2.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-25 22:42:48

by Ric Wheeler

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On 08/25/2009 05:15 PM, Pavel Machek wrote:
>
>>>> Maybe this came as a surprise to you, but anyone who has used a
>>>> compact flash in a digital camera knows that you ***have*** to wait
>>>> until the led has gone out before trying to eject the flash card. I
>>>> remember seeing all sorts of horror stories from professional
>>>> photographers about how they lost an important wedding's day worth of
>>>> pictures with the attendant commercial loss, on various digital
>>>> photography forums. It tends to be the sort of mistake that digital
>>>> photographers only make once.
>>>
>>> It actually comes as surprise to me. Actually yes and no. I know that
>>> digital cameras use VFAT, so pulling CF card out of it may do bad
>>> thing, unless I run fsck.vfat afterwards. If digital camera was using
>>> ext3, I'd expect it to be safely pullable at any time.
>>>
>>> Will IBM microdrive do any difference there?
>>>
>>> Anyway, it was not known to me. Rather than claiming "everyone knows"
>>> (when clearly very few people really understand all the details), can
>>> we simply document that?
>>
>> I really think that the expectation that all OS's (windows, mac, even
>> your ipod) all teach you not to hot unplug a device with any file system.
>> Users have an "eject" or "safe unload" in windows, your iPod tells you
>> not to power off or disconnect, etc.
>
> That was before journaling filesystems...

Not true - that is true today with or without journals as we have discussed in
great detail. Including specifically ext2.

Basically, any file system (Linux, windows, OSX, etc) that writes into the page
cache will lose data when you hot unplug its storage. End of story, don't do it!


>
>> I don't object to making that general statement - "Don't hot unplug a
>> device with an active file system or actively used raw device" - but
>> would object to the overly general statement about ext3 not working on
>> flash, RAID5 not working, etc...
>
> You can object any way you want, but running ext3 on flash or MD RAID5
> is stupid:
>
> * ext2 would be faster
>
> * ext2 would provide better protection against powerfail.

Not true in the slightest, you continue to ignore the ext2/3/4 developers
telling you that it will lose data.

>
> "ext3 works on flash and MD RAID5, as long as you do not have
> powerfail" seems to be the accurate statement, and if you don't need
> to protect against powerfails, you can just use ext2.
> Pavel

Strange how your personal preference is totally out of sync with the entire
enterprise class user base.

ric



2009-08-25 22:51:22

by Pavel Machek

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible



>>> I really think that the expectation that all OS's (windows, mac, even
>>> your ipod) all teach you not to hot unplug a device with any file system.
>>> Users have an "eject" or "safe unload" in windows, your iPod tells you
>>> not to power off or disconnect, etc.
>>
>> That was before journaling filesystems...
>
> Not true - that is true today with or without journals as we have
> discussed in great detail. Including specifically ext2.
>
> Basically, any file system (Linux, windows, OSX, etc) that writes into
> the page cache will lose data when you hot unplug its storage. End of
> story, don't do it!

No, not ext3 on SATA disk with barriers on and proper use of
fsync(). I actually tested that.

Yes, I should be able to hotunplug SATA drives and expect the data
that was fsync-ed to be there.
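
For reference, a minimal sketch of what "proper use of fsync()" is usually taken to mean, written in Python for brevity. The function name `durable_write` and the `.tmp` suffix are hypothetical, and the guarantee only holds under the assumption made above: the filesystem and the device honor cache flushes (for example ext3 with barriers on).

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Replace `path` with `data`, returning only after it was flushed.

    Assumption: the filesystem and device honor flush requests.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical temporary name

    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:  # os.write may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # flush the file contents out of the page cache
    finally:
        os.close(fd)

    os.replace(tmp_path, path)  # atomically swap in the new file

    # Flush the directory too, so the rename itself is on stable storage.
    dir_fd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    target = os.path.join(workdir, "thesis.txt")
    durable_write(target, b"chapter 1")
    print(open(target, "rb").read())
```

Data that was only passed to write() and never fsync-ed is still sitting in the page cache, and nothing here protects it.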

>>> I don't object to making that general statement - "Don't hot unplug a
>>> device with an active file system or actively used raw device" - but
>>> would object to the overly general statement about ext3 not working on
>>> flash, RAID5 not working, etc...
>>
>> You can object any way you want, but running ext3 on flash or MD RAID5
>> is stupid:
>>
>> * ext2 would be faster
>>
>> * ext2 would provide better protection against powerfail.
>
> Not true in the slightest, you continue to ignore the ext2/3/4 developers
> telling you that it will lose data.

I know I will lose data. Both ext2 and ext3 will lose data on
flashdisk. (That's what I'm trying to document). But... what is the
benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
protects you against kernel panic. MD RAID5 is in software, so... that
additional protection is just not there).

>> "ext3 works on flash and MD RAID5, as long as you do not have
>> powerfail" seems to be the accurate statement, and if you don't need
>> to protect against powerfails, you can just use ext2.
>
> Strange how your personal preference is totally out of sync with the
> entire enterprise class user base.

Perhaps no one told them MD RAID5 is dangerous? You see, that's exactly
what I'm trying to document here.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-25 23:04:14

by David Lang

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Wed, 26 Aug 2009, Pavel Machek wrote:

>>>> I don't object to making that general statement - "Don't hot unplug a
>>>> device with an active file system or actively used raw device" - but
>>>> would object to the overly general statement about ext3 not working on
>>>> flash, RAID5 not working, etc...
>>>
>>> You can object any way you want, but running ext3 on flash or MD RAID5
>>> is stupid:
>>>
>>> * ext2 would be faster
>>>
>>> * ext2 would provide better protection against powerfail.
>>
>> Not true in the slightest, you continue to ignore the ext2/3/4 developers
>> telling you that it will lose data.
>
> I know I will lose data. Both ext2 and ext3 will lose data on
> flashdisk. (That's what I'm trying to document). But... what is the
> benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
> protects you against kernel panic. MD RAID5 is in software, so... that
> additional protection is just not there).

the block device can lose data, it has absolutely nothing to do with the
filesystem

>>> "ext3 works on flash and MD RAID5, as long as you do not have
>>> powerfail" seems to be the accurate statement, and if you don't need
>>> to protect against powerfails, you can just use ext2.
>>
>> Strange how your personal preference is totally out of sync with the
>> entire enterprise class user base.
>
> Perhaps noone told them MD RAID5 is dangerous? You see, that's exactly
> what I'm trying to document here.

a MD raid array that's degraded to the point where there is no redundancy
is dangerous, but I don't think that any of the enterprise users would be
surprised.

I think they will be surprised that it's possible that a prior failed
write that hasn't been scrubbed can cause data loss when the array later
degrades.
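
a toy illustration of that scenario, assuming a single stripe with two data blocks and one XOR parity block (the block contents and the helper name xor_blocks are made up for the example, this is not how MD lays out data):

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# A consistent stripe: two data blocks plus their parity.
d0_old = b"AAAA"
d1 = b"BBBB"
parity = xor_blocks(d0_old, d1)

# Power fails mid-update: the new d0 reaches its disk, the matching
# parity update does not. All disks are present, so nothing looks wrong.
d0_new = b"CCCC"
stale_parity = parity

# Later the disk holding d1 dies and d1 is rebuilt from d0 and parity.
rebuilt_d1 = xor_blocks(d0_new, stale_parity)

# Had the parity been resynced (scrubbed) first, the rebuild would be right.
fresh_parity = xor_blocks(d0_new, d1)
correct_d1 = xor_blocks(d0_new, fresh_parity)

print(rebuilt_d1 == d1, correct_d1 == d1)
```

note that d1 was never being written when the power failed, yet it is the block that comes back wrong.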

David Lang

2009-08-25 23:03:45

by Ric Wheeler

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On 08/25/2009 06:51 PM, Pavel Machek wrote:
>
>
>>>> I really think that the expectation that all OS's (windows, mac, even
>>>> your ipod) all teach you not to hot unplug a device with any file system.
>>>> Users have an "eject" or "safe unload" in windows, your iPod tells you
>>>> not to power off or disconnect, etc.
>>>
>>> That was before journaling filesystems...
>>
>> Not true - that is true today with or without journals as we have
>> discussed in great detail. Including specifically ext2.
>>
>> Basically, any file system (Linux, windows, OSX, etc) that writes into
>> the page cache will lose data when you hot unplug its storage. End of
>> story, don't do it!
>
> No, not ext3 on SATA disk with barriers on and proper use of
> fsync(). I actually tested that.
>
> Yes, I should be able to hotunplug SATA drives and expect the data
> that was fsync-ed to be there.

You can and will lose data (even after fsync) with any type of storage at some
rate. What you are missing here is that data loss needs to be measured in hard
numbers - say percentage of installed boxes that have config X that lose data.

Strangely enough, this is what high end storage companies do for a living,
configure, deploy and then measure results.

A long winded way of saying that just because you can induce data failure by
recreating an event that happens almost never (power loss while rebuilding a
RAID5 group specifically) does not mean that this makes RAID5 with ext3 unreliable.

What does happen all of the time is single bad sector IO's and (less often, but
more than your scenario) complete drive failures. In both cases, MD RAID5 will
repair that damage before a second failure (including a power failure) happens
99.99% of the time.

I can promise you that hot unplugging and replugging a S-ATA drive will also
lose you data if you are actively writing to it (ext2, 3, whatever).

Your micro data loss benchmark is not a valid reflection of the wider
experience and I fear that you will cause people to lose more data, not less,
by moving them away from ext3 and MD RAID5.

>
>>>> I don't object to making that general statement - "Don't hot unplug a
>>>> device with an active file system or actively used raw device" - but
>>>> would object to the overly general statement about ext3 not working on
>>>> flash, RAID5 not working, etc...
>>>
>>> You can object any way you want, but running ext3 on flash or MD RAID5
>>> is stupid:
>>>
>>> * ext2 would be faster
>>>
>>> * ext2 would provide better protection against powerfail.
>>
>> Not true in the slightest, you continue to ignore the ext2/3/4 developers
>> telling you that it will lose data.
>
> I know I will lose data. Both ext2 and ext3 will lose data on
> flashdisk. (That's what I'm trying to document). But... what is the
> benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
> protects you against kernel panic. MD RAID5 is in software, so... that
> additional protection is just not there).

Faster recovery time on any normal kernel crash or power outage. Data loss
would be equivalent with or without the journal.
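
A toy model of why recovery is faster with a journal, as a sketch only: the `Transaction` class and `replay` function are hypothetical and do not reflect the real ext3/JBD on-disk format. Recovery walks the journal and stops at the first transaction without a commit record, so the work is bounded by the journal size rather than by the size of the file system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Transaction:
    writes: Dict[int, bytes]   # block number -> new contents
    committed: bool = False    # True once the commit record hit the disk


def replay(disk: Dict[int, bytes], journal: List[Transaction]) -> int:
    """Apply committed transactions in order; return blocks written."""
    touched = 0
    for txn in journal:
        if not txn.committed:
            break  # incomplete transaction: discard it and everything after
        for block_no, contents in txn.writes.items():
            disk[block_no] = contents
            touched += 1
    return touched


# A "file system" of 1000 blocks and a crash with one transaction half done.
disk = {n: b"old" for n in range(1000)}
journal = [
    Transaction({1: b"inode", 2: b"bitmap"}, committed=True),
    Transaction({3: b"half-written"}, committed=False),
]
touched = replay(disk, journal)
print(touched, "blocks replayed;", len(disk), "blocks for a full scan")
```

The uncommitted update is gone in either case, which is the sense in which the data loss is equivalent.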

>
>>> "ext3 works on flash and MD RAID5, as long as you do not have
>>> powerfail" seems to be the accurate statement, and if you don't need
>>> to protect against powerfails, you can just use ext2.
>>
>> Strange how your personal preference is totally out of sync with the
>> entire enterprise class user base.
>
> Perhaps noone told them MD RAID5 is dangerous? You see, that's exactly
> what I'm trying to document here.
> Pavel

Using MD RAID5 will save more people from commonly occurring errors (sector and
disk failures) than it will cause to lose data through your worry, a rebuild
interrupted by a power failure.

What you are trying to do is to document a belief you have that is not borne out
by real data across actual user boxes running real work loads.

Unfortunately, getting that data is hard work and one of the things that we as a
community do especially poorly. All of the data (secret data from my past and
published data by NetApp, Google, etc) that I have seen would directly
contradict your assertions and you will cause harm to our users with this.

Ric



2009-08-25 23:26:10

by Pavel Machek

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible


>>> Basically, any file system (Linux, windows, OSX, etc) that writes into
>>> the page cache will lose data when you hot unplug its storage. End of
>>> story, don't do it!
>>
>> No, not ext3 on SATA disk with barriers on and proper use of
>> fsync(). I actually tested that.
>>
>> Yes, I should be able to hotunplug SATA drives and expect the data
>> that was fsync-ed to be there.
>
> You can and will lose data (even after fsync) with any type of storage at
> some rate. What you are missing here is that data loss needs to be
> measured in hard numbers - say percentage of installed boxes that have
> config X that lose data.

I'm talking "by design" here.

I will lose data even on SATA drive that is properly powered on if I
wait 5 years.

> I can promise you that hot unplugging and replugging a S-ATA drive will
> also lose you data if you are actively writing to it (ext2, 3, whatever).

I can promise you that running S-ATA drive will also lose you data,
even if you are not actively writing to it. Just wait 10 years; so
what is your point?

But ext3 is _designed_ to preserve fsynced data on SATA drive, while
it is _not_ designed to preserve fsynced data on MD RAID5.

Do you really think that's not a difference?

>>>>> I don't object to making that general statement - "Don't hot unplug a
>>>>> device with an active file system or actively used raw device" - but
>>>>> would object to the overly general statement about ext3 not working on
>>>>> flash, RAID5 not working, etc...
>>>>
>>>> You can object any way you want, but running ext3 on flash or MD RAID5
>>>> is stupid:
>>>>
>>>> * ext2 would be faster
>>>>
>>>> * ext2 would provide better protection against powerfail.
>>>
>>> Not true in the slightest, you continue to ignore the ext2/3/4 developers
>>> telling you that it will lose data.
>>
>> I know I will lose data. Both ext2 and ext3 will lose data on
>> flashdisk. (That's what I'm trying to document). But... what is the
>> benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
>> protects you against kernel panic. MD RAID5 is in software, so... that
>> additional protection is just not there).
>
> Faster recovery time on any normal kernel crash or power outage. Data
> loss would be equivalent with or without the journal.

No, because you'll actually repair the ext2 with fsck after the kernel
crash or power outage. Data loss will not be equivalent; in particular
you'll not lose data written _after_ power outage to ext2.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-25 23:29:42

by Pavel Machek

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible


>>>> "ext3 works on flash and MD RAID5, as long as you do not have
>>>> powerfail" seems to be the accurate statement, and if you don't need
>>>> to protect against powerfails, you can just use ext2.
>>>
>>> Strange how your personal preference is totally out of sync with the
>>> entire enterprise class user base.
>>
>> Perhaps noone told them MD RAID5 is dangerous? You see, that's exactly
>> what I'm trying to document here.
>
> a MD raid array that's degraded to the point where there is no redundancy
> is dangerous, but I don't think that any of the enterprise users would be
> surprised.
>
> I think they will be surprised that it's possible that a prior failed
> write that hasn't been scrubbed can cause data loss when the array later
> degrades.

Cool, so Ted's "raid5 has highly undesirable properties" is actually
pretty accurate. Some raid person should write more detailed README,
I'd say...
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-25 23:40:50

by Ric Wheeler

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On 08/25/2009 07:26 PM, Pavel Machek wrote:
>
>>>> Basically, any file system (Linux, windows, OSX, etc) that writes into
>>>> the page cache will lose data when you hot unplug its storage. End of
>>>> story, don't do it!
>>>
>>> No, not ext3 on SATA disk with barriers on and proper use of
>>> fsync(). I actually tested that.
>>>
>>> Yes, I should be able to hotunplug SATA drives and expect the data
>>> that was fsync-ed to be there.
>>
>> You can and will lose data (even after fsync) with any type of storage at
>> some rate. What you are missing here is that data loss needs to be
>> measured in hard numbers - say percentage of installed boxes that have
>> config X that lose data.
>
> I'm talking "by design" here.
>
> I will lose data even on SATA drive that is properly powered on if I
> wait 5 years.
>

You are dead wrong.

For RAID5 arrays, you assume that you have a hard failure and a power outage
before you can rebuild the RAID (order of hours at full tilt).

The failure rate of S-ATA drives is a few percent of the installed base per
year. Some drives will fail faster than that (bad parts, bad environmental
conditions, etc).

Why don't you hold all of your most precious data on that single S-ATA drive for
five years on one box and put a second copy on a small RAID5 with ext3 for the
same period?

Repeat experiment until you get up to something like google scale or the other
papers on failures in national labs in the US and then we can have an informed
discussion.


>> I can promise you that hot unplugging and replugging a S-ATA drive will
>> also lose you data if you are actively writing to it (ext2, 3, whatever).
>
> I can promise you that running S-ATA drive will also lose you data,
> even if you are not actively writing to it. Just wait 10 years; so
> what is your point?

I lost a S-ATA drive 24 hours after installing it in a new box. If I had had MD
RAID5, I would not have lost any data.

My point is that you fail to take into account the rate of failures of a given
configuration and the probability of data loss given those rates.
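
A back-of-the-envelope sketch of that comparison. Every number in it is an assumption chosen for illustration, not measured data: a 3% annual failure rate ("a few percent"), a 3-disk array, a 6 hour rebuild ("order of hours") and 2 power outages per year. It also pessimistically counts every outage during a rebuild as data loss.

```python
import math

HOURS_PER_YEAR = 24 * 365


def p_event_within(rate_per_year: float, hours: float) -> float:
    """Probability of at least one event in a window (Poisson arrivals)."""
    return 1.0 - math.exp(-rate_per_year * hours / HOURS_PER_YEAR)


def single_disk_loss(afr: float) -> float:
    """Yearly chance that a lone disk fails and takes the data with it."""
    return afr


def raid5_loss(afr: float, disks: int, rebuild_hours: float,
               outages_per_year: float) -> float:
    """Rough yearly chance of loss: a first disk failure, followed by a
    second disk failure or a power outage before the rebuild completes."""
    p_first = 1.0 - (1.0 - afr) ** disks
    p_second = p_event_within(afr * (disks - 1), rebuild_hours)
    p_outage = p_event_within(outages_per_year, rebuild_hours)
    p_bad_rebuild = 1.0 - (1.0 - p_second) * (1.0 - p_outage)
    return p_first * p_bad_rebuild


if __name__ == "__main__":
    single = single_disk_loss(0.03)
    raid = raid5_loss(afr=0.03, disks=3, rebuild_hours=6, outages_per_year=2)
    print(f"single disk: {single:.4%} per year, RAID5: {raid:.4%} per year")
```

Plugging in different rates changes the absolute figures, but the RAID5 case needs two independent events inside a window of hours, while the single disk needs only one event at any point in the year.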

>
> But ext3 is _designed_ to preserve fsynced data on SATA drive, while
> it is _not_ designed to preserve fsynced data on MD RAID5.

Of course it will when you properly configure your MD RAID5.

>
> Do you really think that's not a difference?

I think that you are simply wrong.

>
>>>>>> I don't object to making that general statement - "Don't hot unplug a
>>>>>> device with an active file system or actively used raw device" - but
>>>>>> would object to the overly general statement about ext3 not working on
>>>>>> flash, RAID5 not working, etc...
>>>>>
>>>>> You can object any way you want, but running ext3 on flash or MD RAID5
>>>>> is stupid:
>>>>>
>>>>> * ext2 would be faster
>>>>>
>>>>> * ext2 would provide better protection against powerfail.
>>>>
>>>> Not true in the slightest, you continue to ignore the ext2/3/4 developers
>>>> telling you that it will lose data.
>>>
>>> I know I will lose data. Both ext2 and ext3 will lose data on
>>> flashdisk. (That's what I'm trying to document). But... what is the
>>> benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
>>> protects you against kernel panic. MD RAID5 is in software, so... that
>>> additional protection is just not there).
>>
>> Faster recovery time on any normal kernel crash or power outage. Data
>> loss would be equivalent with or without the journal.
>
> No, because you'll actually repair the ext2 with fsck after the kernel
> crash or power outage. Data loss will not be equivalent; in particular
> you'll not lose data writen _after_ power outage to ext2.
> Pavel


As Ted (who wrote fsck for ext*) said, you will lose data in both. Your
argument is not based on fact.

You need to actually prove your point, not just state it as fact.

ric

2009-08-25 23:46:15

by David Lang

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Wed, 26 Aug 2009, Pavel Machek wrote:

>>>> Basically, any file system (Linux, windows, OSX, etc) that writes into
>>>> the page cache will lose data when you hot unplug its storage. End of
>>>> story, don't do it!
>>>
>>> No, not ext3 on SATA disk with barriers on and proper use of
>>> fsync(). I actually tested that.
>>>
>>> Yes, I should be able to hotunplug SATA drives and expect the data
>>> that was fsync-ed to be there.
>>
>> You can and will lose data (even after fsync) with any type of storage at
>> some rate. What you are missing here is that data loss needs to be
>> measured in hard numbers - say percentage of installed boxes that have
>> config X that lose data.
>
> I'm talking "by design" here.
>
> I will lose data even on SATA drive that is properly powered on if I
> wait 5 years.
>
>> I can promise you that hot unplugging and replugging a S-ATA drive will
>> also lose you data if you are actively writing to it (ext2, 3, whatever).
>
> I can promise you that running S-ATA drive will also lose you data,
> even if you are not actively writing to it. Just wait 10 years; so
> what is your point?
>
> But ext3 is _designed_ to preserve fsynced data on SATA drive, while
> it is _not_ designed to preserve fsynced data on MD RAID5.

substitute 'degraded MD RAID 5' for 'MD RAID 5' and you have a point here.
although the language you are using is pretty harsh. you make it sound
like this is a problem with ext3 when the filesystem has nothing to do
with it. the problem is that a degraded raid 5 array can be corrupted by
an additional failure.

> Do you really think that's not a difference?
>
>>>>>> I don't object to making that general statement - "Don't hot unplug a
>>>>>> device with an active file system or actively used raw device" - but
>>>>>> would object to the overly general statement about ext3 not working on
>>>>>> flash, RAID5 not working, etc...
>>>>>
>>>>> You can object any way you want, but running ext3 on flash or MD RAID5
>>>>> is stupid:
>>>>>
>>>>> * ext2 would be faster
>>>>>
>>>>> * ext2 would provide better protection against powerfail.
>>>>
>>>> Not true in the slightest, you continue to ignore the ext2/3/4 developers
>>>> telling you that it will lose data.
>>>
>>> I know I will lose data. Both ext2 and ext3 will lose data on
>>> flashdisk. (That's what I'm trying to document). But... what is the
>>> benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
>>> protects you against kernel panic. MD RAID5 is in software, so... that
>>> additional protection is just not there).
>>
>> Faster recovery time on any normal kernel crash or power outage. Data
>> loss would be equivalent with or without the journal.
>
> No, because you'll actually repair the ext2 with fsck after the kernel
> crash or power outage. Data loss will not be equivalent; in particular
> you'll not lose data writen _after_ power outage to ext2.

by the way, while you are thinking about failures that can happen from a
failed write corrupting additional blocks, think about the nightmare that
can happen if those blocks are in the journal.

the 'repair' of ext2 by a fsck is actually much less than you are thinking
that it is.

David Lang

2009-08-25 23:48:45

by David Lang

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Tue, 25 Aug 2009, Ric Wheeler wrote:

> On 08/25/2009 07:26 PM, Pavel Machek wrote:
>>
>>>>> Basically, any file system (Linux, windows, OSX, etc) that writes into
>>>>> the page cache will lose data when you hot unplug its storage. End of
>>>>> story, don't do it!
>>>>
>>>> No, not ext3 on SATA disk with barriers on and proper use of
>>>> fsync(). I actually tested that.
>>>>
>>>> Yes, I should be able to hotunplug SATA drives and expect the data
>>>> that was fsync-ed to be there.
>>>
>>> You can and will lose data (even after fsync) with any type of storage at
>>> some rate. What you are missing here is that data loss needs to be
>>> measured in hard numbers - say percentage of installed boxes that have
>>> config X that lose data.
>>
>> I'm talking "by design" here.
>>
>> I will lose data even on SATA drive that is properly powered on if I
>> wait 5 years.
>>
>
> You are dead wrong.
>
> For RAID5 arrays, you assume that you have a hard failure and a power outage
> before you can rebuild the RAID (order of hours at full tilt).

and that the power outage causes a corrupted write.

>>> I can promise you that hot unplugging and replugging a S-ATA drive will
>>> also lose you data if you are actively writing to it (ext2, 3, whatever).
>>
>> I can promise you that running S-ATA drive will also lose you data,
>> even if you are not actively writing to it. Just wait 10 years; so
>> what is your point?
>
> I lost a s-ata drive 24 hours after installing it in a new box. If I had MD5
> RAID5, I would not have lost any.

me too, in fact just after I copied data from a raid array to it so that I
could rebuild the raid array differently :-(

David Lang

2009-08-25 23:53:59

by Pavel Machek

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

> Why don't you hold all of your most precious data on that single S-ATA
> drive for five year on one box and put a second copy on a small RAID5
> with ext3 for the same period?
>
> Repeat experiment until you get up to something like google scale or the
> other papers on failures in national labs in the US and then we can have
> an informed discussion.

I'm not interested in discussing statistics with you. I'd rather discuss
fsync() and storage design issues.

ext3 is designed to work on single SATA disks, and it is not designed
to work on flash cards/degraded MD RAID5s, as Ted acknowledged.

Because that fact is non obvious to the users, I'd like to see it
documented, and now have nice short writeup from Ted.

If you want to argue that ext3/MD RAID5/no UPS combination is still
less likely to fail than single SATA disk given part fail
probabilities, go ahead and present nice statistics. It's just that I'm
not interested in them.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-26 00:11:21

by Ric Wheeler

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On 08/25/2009 07:53 PM, Pavel Machek wrote:
>> Why don't you hold all of your most precious data on that single S-ATA
>> drive for five year on one box and put a second copy on a small RAID5
>> with ext3 for the same period?
>>
>> Repeat experiment until you get up to something like google scale or the
>> other papers on failures in national labs in the US and then we can have
>> an informed discussion.
>
> I'm not interested in discussing statistics with you. I'd rather discuss
> fsync() and storage design issues.
>
> ext3 is designed to work on single SATA disks, and it is not designed
> to work on flash cards/degraded MD RAID5s, as Ted acknowledged.

You are simply incorrect. Ted did not say that ext3 does not work with MD raid5.

>
> Because that fact is non obvious to the users, I'd like to see it
> documented, and now have nice short writeup from Ted.
>
> If you want to argue that ext3/MD RAID5/no UPS combination is still
> less likely to fail than single SATA disk given part fail
> probabilities, go ahead and present nice statistics. Its just that I'm
> not interested in them.
> Pavel
>

That is a proven fact and a well published one. If you choose to ignore
published work (and common sense) that RAID makes you lose data less than
non-RAID, why should anyone care what you write?

Ric


2009-08-26 00:16:46

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Tue 2009-08-25 20:11:21, Ric Wheeler wrote:
> On 08/25/2009 07:53 PM, Pavel Machek wrote:
>>> Why don't you hold all of your most precious data on that single S-ATA
>>> drive for five year on one box and put a second copy on a small RAID5
>>> with ext3 for the same period?
>>>
>>> Repeat experiment until you get up to something like google scale or the
>>> other papers on failures in national labs in the US and then we can have
>>> an informed discussion.
>>
>> I'm not interested in discussing statistics with you. I'd rather discuss
>> fsync() and storage design issues.
>>
>> ext3 is designed to work on single SATA disks, and it is not designed
>> to work on flash cards/degraded MD RAID5s, as Ted acknowledged.
>
> You are simply incorrect, Ted did not say that ext3 does not work
> with MD raid5.

http://lkml.org/lkml/2009/8/25/312
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-26 00:31:21

by Ric Wheeler

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On 08/25/2009 08:16 PM, Pavel Machek wrote:
> On Tue 2009-08-25 20:11:21, Ric Wheeler wrote:
>> On 08/25/2009 07:53 PM, Pavel Machek wrote:
>>>> Why don't you hold all of your most precious data on that single S-ATA
>>>> drive for five year on one box and put a second copy on a small RAID5
>>>> with ext3 for the same period?
>>>>
>>>> Repeat experiment until you get up to something like google scale or the
>>>> other papers on failures in national labs in the US and then we can have
>>>> an informed discussion.
>>>
>>> I'm not interested in discussing statistics with you. I'd rather discuss
>>> fsync() and storage design issues.
>>>
>>> ext3 is designed to work on single SATA disks, and it is not designed
>>> to work on flash cards/degraded MD RAID5s, as Ted acknowledged.
>>
>> You are simply incorrect, Ted did not say that ext3 does not work
>> with MD raid5.
>
> http://lkml.org/lkml/2009/8/25/312
> Pavel

I will let Ted clarify his text on his own, but the quoted text says "... have
potential...".

Why not ask Neil if he designed MD to not work properly with ext3?

Ric


2009-08-26 01:00:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Tue, Aug 25, 2009 at 08:31:21PM -0400, Ric Wheeler wrote:
>>> You are simply incorrect, Ted did not say that ext3 does not work
>>> with MD raid5.
>>
>> http://lkml.org/lkml/2009/8/25/312
>> Pavel
>
> I will let Ted clarify his text on his own, but the quoted text says "...
> have potential...".
>
> Why not ask Neil if he designed MD to not work properly with ext3?

So let me clarify by saying the following things.

1) Filesystems are designed to expect that storage devices have
certain properties. These include returning the same data that you
wrote, and that an error when writing a sector, or a power failure
when writing a sector, should not be amplified to cause collateral
damage to previously successfully written sectors.

2) Degraded RAID 5/6 filesystems do not meet these properties.
Neither do cheap flash drives. This increases the chances you can
lose, bigtime.
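
The collateral-damage case in point 2 can be sketched in a few lines.
This is a toy model, not MD's code: the three-disk layout, the
four-byte blocks, and the names xor_blocks and degraded_torn_write are
all assumptions made for illustration. It shows how, with one disk
missing, an interrupted update to one block changes what gets
reconstructed for a different block that was written successfully long
before.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def degraded_torn_write() -> bytes:
    """Return what a degraded array reconstructs for d0 after a torn write to d1."""
    # One stripe on a hypothetical 3-disk RAID5: two data blocks plus parity.
    d0 = b"AAAA"                 # written and synced long ago
    d1 = b"BBBB"
    parity = xor_blocks(d0, d1)

    # The disk holding d0 fails. The array is degraded, so d0 now exists
    # only implicitly, as d1 XOR parity.
    assert xor_blocks(d1, parity) == d0

    # The filesystem overwrites d1. Data and parity both need updating, but
    # power is lost after the data write and before the parity write.
    d1 = b"CCCC"
    # parity is left stale on purpose.

    # After reboot, d0 is reconstructed from the new d1 and the old parity.
    return xor_blocks(d1, parity)


if __name__ == "__main__":
    print(degraded_torn_write())
```

The block returned for d0 no longer matches what was originally
written, even though nothing ever wrote to d0 after the first sync.
With all disks present the same torn write leaves parity stale but
every data block readable, which is why the degraded case is the one
that matters here.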

3) Does that mean that you shouldn't use ext3 on RAID drives? Of
course not! First of all, Ext3 still saves you against kernel panics
and hangs caused by device driver bugs or other kernel hangs. You
will lose less data, and avoid needing to run a long and painful fsck
after a forced reboot, compared to if you used ext2. You are making
an assumption that the only time running the journal takes place is
after a power failure. But if the system hangs, and you need to hit
the Big Red Switch, or if you are using the system in a Linux High
Availability setup and the ethernet card fails, so the STONITH ("shoot
the other node in the head") system forces a hard reset of the system,
or you get a kernel panic which forces a reboot, in all of these cases
ext3 will save you from a long fsck, and it will do so safely.

Secondly, what's the probability of a failure that causes the RAID array
to become degraded, followed by a power failure, versus a power failure
while the RAID array is not running in degraded mode? Hopefully you
are running with the RAID array in full, proper running order a much
larger percentage of the time than running with the RAID array in
degraded mode. If not, the bug is with the system administrator!
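
As a rough worked example of that comparison, the sketch below splits
an assumed yearly count of power failures between degraded and healthy
time. The function name and the sample figures (2 power failures a
year, 3 degraded days a year) are invented for illustration, and the
model assumes power failures arrive independently of the array's
state.

```python
def expected_power_failures(failures_per_year: float,
                            degraded_days_per_year: float):
    """Split expected yearly power failures into (while degraded, while healthy).

    Assumes failures are spread evenly over the year and do not depend on
    whether the array is degraded.
    """
    degraded_fraction = degraded_days_per_year / 365.0
    while_degraded = failures_per_year * degraded_fraction
    while_healthy = failures_per_year * (1.0 - degraded_fraction)
    return while_degraded, while_healthy


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    degraded, healthy = expected_power_failures(2.0, 3.0)
    print(f"while degraded: {degraded:.4f} per year")
    print(f"while healthy:  {healthy:.4f} per year")
```

Under these assumptions the exposure scales directly with the time
spent degraded, so shortening that time (hot spares, prompt
replacement) shrinks the risky window in proportion.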

If you are someone who tends to run for long periods of time in
degraded mode --- then better get a UPS. And certainly if you want to
avoid the chances of failure, periodically scrubbing the disks so you
detect hard drive failures early, instead of waiting until a disk
fails before letting the rebuild find the dreaded "second failure"
which causes data loss, is a d*mned good idea.

Maybe a random OS engineer doesn't know these things --- but trust me
when I say a competent system administrator had better be familiar
with these concepts. And someone who wants their data to be reliably
stored needs to do some basic storage engineering if they want to have
long-term data reliability. (That, or maybe they should outsource
their long-term reliable storage to some service such as Amazon S3 ---
see Jeremy Zawodny's analysis about how it can be cheaper, here:
http://jeremy.zawodny.com/blog/archives/007624.html)

But we *do* need to be careful that we don't write documentation which
ends up giving users the wrong impression. The bottom line is that
you're better off using ext3 over ext2, even on a RAID array, for the
reasons listed above.

Are you better off using ext3 over ext2 on a crappy flash drive?
Maybe --- if you are also using crappy proprietary video drivers, such
as Ubuntu ships, where every single time you exit a 3d game the system
crashes (and Ubuntu users accept this as normal?!?), then ext3 might
be a better choice since you'll reduce the chance of data loss when
the system locks up or crashes thanks to the aforementioned crappy
proprietary video drivers from Nvidia. On the other hand, crappy
flash drives *do* have really bad write amplification effects, where a
4K write can cause 128k or more worth of flash to be rewritten, such
that using ext3 could seriously degrade the lifetime of said crappy
flash drive; furthermore, the crappy flash drives have such terrible
write performance that using ext3 can be a performance nightmare.
This, of course, doesn't apply to well-implemented SSDs, such as
Intel's X25-M and X18-M. So here your mileage may vary. Still, if
you are using crappy proprietary drivers which cause system hangs and
crashes at a far greater rate than power fail-induced unclean
shutdowns, ext3 *still* might be the better choice, even with crappy
flash drives.

The best thing to do, of course, is to improve your storage stack; use
competently implemented SSD's instead of crap flash cards. If your
hardware RAID card supports a battery option, *get* the battery. Add
a UPS to your system. Provision your RAID array with hot spares, and
regularly scrub (read-test) your array so that failed drives can be
detected early. Make sure you configure your MD setup so that you get
e-mail when a hard drive fails and the array starts running in
degraded mode, so you can replace the failed drive ASAP.
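
For the "notice when the array goes degraded" part, mdadm's monitor
mode with a MAILADDR line in mdadm.conf is the usual route. Purely as
an illustration of what such a check looks at, here is a minimal sketch
that scans /proc/mdstat-style text for a missing member. The function
name and the sample text are assumptions, and the parsing covers only
the common "[n/m] [UU_]" status form.

```python
import re
from typing import List

# Sample text in the shape of /proc/mdstat; the device names are made up.
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid5]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks [2/2] [UU]

md1 : active raid5 sdd1[1] sdc1[0]
      1953260544 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text: str) -> List[str]:
    """Return the names of MD arrays whose status line shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = re.search(r"\[\d+/\d+\]\s+\[([U_]+)\]", line)
        # An underscore in the member map marks a slot with no working disk.
        if status and current and "_" in status.group(1):
            degraded.append(current)
    return degraded


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

On a real machine the text would come from reading /proc/mdstat, and
a non-empty result is the cue to replace the failed drive.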

At the end of the day, filesystems are not magic. They can't
compensate for crap hardware, or incompetently administered machines.

- Ted

2009-08-26 01:15:59

by Ric Wheeler

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On 08/25/2009 09:00 PM, Theodore Tso wrote:
> On Tue, Aug 25, 2009 at 08:31:21PM -0400, Ric Wheeler wrote:
>
>>>> You are simply incorrect, Ted did not say that ext3 does not work
>>>> with MD raid5.
>>>>
>>> http://lkml.org/lkml/2009/8/25/312
>>> Pavel
>>>
>> I will let Ted clarify his text on his own, but the quoted text says "...
>> have potential...".
>>
>> Why not ask Neil if he designed MD to not work properly with ext3?
>>
> So let me clarify by saying the following things.
>
> 1) Filesystems are designed to expect that storage devices have
> certain properties. These include returning the same data that you
> wrote, and that an error when writing a sector, or a power failure
> when writing sector, should not be amplified to cause collateral
> damage with previously succfessfully written sectors.
>
> 2) Degraded RAID 5/6 filesystems do not meet these properties.
> Neither to cheap flash drives. This increases the chances you can
> lose, bigtime.
>
>

I agree with the whole write-up apart from the above: degraded RAID
does meet this requirement unless you have a second (or third, counting
the split write) failure during the rebuild.

Note that the window of exposure during a RAID rebuild is linear with
the size of your disk and how much you detune the rebuild...
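
A back-of-the-envelope sketch of that linear relationship follows. The
function name and the sample disk size and rebuild rates are
assumptions chosen for illustration; a real rebuild also competes with
normal I/O.

```python
def rebuild_hours(disk_size_gb: float, rebuild_rate_mb_s: float) -> float:
    """Estimate rebuild time in hours for one disk at a steady rebuild rate.

    Uses decimal units (1 GB = 1000 MB) and ignores competing I/O.
    """
    seconds = disk_size_gb * 1000.0 / rebuild_rate_mb_s
    return seconds / 3600.0


if __name__ == "__main__":
    # Illustrative figures: a 1 TB disk at full speed and at a detuned rate.
    for rate in (50.0, 10.0):
        print(f"{rate:5.1f} MB/s -> {rebuild_hours(1000.0, rate):.1f} hours")
```

Doubling the disk size or halving the rebuild rate doubles the window
during which a second failure can cause data loss.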

ric

> 3) Does that mean that you shouldn't use ext3 on RAID drives? Of
> course not! First of all, Ext3 still saves you against kernel panics
> and hangs caused by device driver bugs or other kernel hangs. You
> will lose less data, and avoid needing to run a long and painful fsck
> after a forced reboot, compared to if you used ext2. You are making
> an assumption that the only time running the journal takes place is
> after a power failure. But if the system hangs, and you need to hit
> the Big Red Switch, or if you using the system in a Linux High
> Availability setup and the ethernet card fails, so the STONITH ("shoot
> the other node in the head") system forces a hard reset of the system,
> or you get a kernel panic which forces a reboot, in all of these cases
> ext3 will save you from a long fsck, and it will do so safely.
>
> Secondly, what's the probability of a failure causes the RAID array to
> become degraded, followed by a power failure, versus a power failure
> while the RAID array is not running in degraded mode? Hopefully you
> are running with the RAID array in full, proper running order a much
> larger percentage of the time than running with the RAID array in
> degraded mode. If not, the bug is with the system administrator!
>
> If you are someone who tends to run for long periods of time in
> degraded mode --- then better get a UPS. And certainly if you want to
> avoid the chances of failure, periodically scrubbing the disks so you
> detect hard drive failures early, instead of waiting until a disk
> fails before letting the rebuild find the dreaded "second failure"
> which causes data loss, is a d*mned good idea.
>
> Maybe a random OS engineer doesn't know these things --- but trust me
> when I say a competent system administrator had better be familiar
> with these concepts. And someone who wants their data to be reliably
> stored needs to do some basic storage engineering if they want to have
> long-term data reliability. (That, or maybe they should outsource
> their long-term reliable storage some service such as Amazon S3 ---
> see Jeremy Zawodny's analysis about how it can be cheaper, here:
> http://jeremy.zawodny.com/blog/archives/007624.html)
>
> But we *do* need to be careful that we don't write documentation which
> is ends up giving users the wrong impression. The bottom line is that
> you're better off using ext3 over ext2, even on a RAID array, for the
> reasons listed above.
>
> Are you better off using ext3 over ext2 on a crappy flash drive?
> Maybe --- if you are also using crappy proprietary video drivers, such
> as Ubuntu ships, where every single time you exit a 3d game the system
> crashes (and Ubuntu users accept this as normal?!?), then ext3 might
> be a better choice since you'll reduce the chance of data loss when
> the system locks up or crashes thanks to the aforemention crappy
> proprietary video drivers from Nvidia. On the other hand, crappy
> flash drives *do* have really bad write amplification effects, where a
> 4K write can cause 128k or more worth of flash to be rewritten, such
> that using ext3 could seriously degrade the lifetime of said crappy
> flash drive; furthermore, the crappy flash drives have such terribly
> write performance that using ext3 can be a performance nightmare.
> This of course, doesn't apply to well-implemented SSD's, such as the
> Intel's X25-M and X18-M. So here your mileage may vary. Still, if
> you are using crappy proprietary drivers which cause system hangs and
> crashes at a far greater rate than power fail-induced unclean
> shutdowns, ext3 *still* might be the better choice, even with crappy
> flash drives.
>
> The best thing to do, of course, is to improve your storage stack; use
> competently implemented SSD's instead of crap flash cards. If your
> hardware RAID card supports a battery option, *get* the battery. Add
> a UPS to your system. Provision your RAID array with hot spares, and
> regularly scrub (read-test) your array so that failed drives can be
> detected early. Make sure you configure your MD setup so that you get
> e-mail when a hard drive fails and the array starts running in
> degraded mode, so you can replace the failed drive ASAP.
>
> At the end of the day, filesystems are not magic. They can't
> compensate for crap hardware, or incompetently administered machines.
>
> - Ted


2009-08-26 01:16:13

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

Hi!

> 3) Does that mean that you shouldn't use ext3 on RAID drives? Of
> course not! First of all, Ext3 still saves you against kernel panics
> and hangs caused by device driver bugs or other kernel hangs. You
> will lose less data, and avoid needing to run a long and painful fsck
> after a forced reboot, compared to if you used ext2. You are making

Actually... ext3 + MD RAID5 will still have a problem on a kernel
panic. MD RAID5 is implemented in software, so if the kernel panics,
you can still get inconsistent data in your array.

I mostly agree with the rest.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Tue, 25 Aug 2009, Theodore Tso wrote:
> a UPS to your system. Provision your RAID array with hot spares, and
> regularly scrub (read-test) your array so that failed drives can be

Can we get a proper scrub function (full rewrite of all component
disks), please? Not every disk out there will stop a streaming read to
rewrite weak sectors it happens to come across.

> detected early. Make sure you configure your MD setup so that you get
> e-mail when a hard drive fails and the array starts running in
> degraded mode, so you can replace the failed drive ASAP.

Debian got this right :-)

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2009-08-26 03:50:56

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

Pavel Machek wrote:

> If you want to argue that ext3/MD RAID5/no UPS combination is still
> less likely to fail than single SATA disk given part fail
> probabilities, go ahead and present nice statistics. Its just that I'm
> not interested in them.

The reality in your document does not match up with the reality
out there in the world. That sounds like a good reason not to
have your (incorrect) document out there, confusing people.

--
All rights reversed.

2009-08-27 03:16:40

by Rob Landley

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Tuesday 25 August 2009 08:42:10 Alan Cox wrote:
> On Tue, 25 Aug 2009 09:37:12 -0400
>
> Ric Wheeler wrote:
> > I really think that the expectation that all OS's (windows, mac, even
> > your ipod) all teach you not to hot unplug a device with any file system.
> > Users have an "eject" or "safe unload" in windows, your iPod tells you
> > not to power off or disconnect, etc.
>
> Agreed

Ok, I'll bite: What are journaling filesystems _for_?

> > I don't object to making that general statement - "Don't hot unplug a
> > device with an active file system or actively used raw device" - but
> > would object to the overly general statement about ext3 not working on
> > flash, RAID5 not working, etc...
>
> The overall general statement for all media and all OS's should be
>
> "Do you have a backup, have you tested it recently"

It might be nice to know when you _needed_ said backup, and when you shouldn't
re-backup bad data over it, because your data corruption actually got detected
before then.

And maybe a pony.

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

2009-08-27 03:53:16

by Rob Landley

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Tuesday 25 August 2009 18:40:50 Ric Wheeler wrote:
> Repeat experiment until you get up to something like google scale or the
> other papers on failures in national labs in the US and then we can have an
> informed discussion.

On google scale anvil lightning can fry your machine out of a clear sky.

However, there are still a few non-enterprise users out there, and knowing
that specific usage patterns don't behave like they expect might be useful to
them.

> >> I can promise you that hot unplugging and replugging a S-ATA drive will
> >> also lose you data if you are actively writing to it (ext2, 3,
> >> whatever).
> >
> > I can promise you that running S-ATA drive will also lose you data,
> > even if you are not actively writing to it. Just wait 10 years; so
> > what is your point?
>
> I lost a s-ata drive 24 hours after installing it in a new box. If I had
> MD5 RAID5, I would not have lost any.
>
> My point is that you fail to take into account the rate of failures of a
> given configuration and the probability of data loss given those rates.

Actually, that's _exactly_ what he's talking about.

When writing to a degraded raid or a flash disk, journaling is essentially
useless. If you get a power failure, kernel panic, somebody tripping over a
USB cable, and so on, your filesystem will not be protected by journaling.
Your data won't be trashed _every_ time, but the likelihood is much greater
than experience with journaling in other contexts would suggest.

Worse, the journaling may be counterproductive by _hiding_ many errors that
fsck would promptly detect, so when the error is detected it may not be
associated with the event that caused it. It also may not be noticed until
good backups of the data have been overwritten or otherwise cycled out.

You seem to be arguing that Linux is no longer used anywhere but the
enterprise, so issues affecting USB flash keys or cheap software-only RAID
aren't worth documenting?

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

2009-08-27 11:43:49

by Ric Wheeler

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On 08/26/2009 11:53 PM, Rob Landley wrote:
> On Tuesday 25 August 2009 18:40:50 Ric Wheeler wrote:
>
>> Repeat experiment until you get up to something like google scale or the
>> other papers on failures in national labs in the US and then we can have an
>> informed discussion.
>>
> On google scale anvil lightning can fry your machine out of a clear sky.
>
> However, there are still a few non-enterprise users out there, and knowing
> that specific usage patterns don't behave like they expect might be useful to
> them.
>
>

You are missing the broader point of both papers. They (and people like
me, back when I was at EMC) look at large numbers of machines and try to
fix what actually breaks when run in the real world and causes data loss.
The motherboards, S-ATA controllers, disk types are the same class of
parts that I have in my desktop box today.

The advantage of Google, national labs, etc. is that they have large
numbers of systems and can draw conclusions that are meaningful to our
broad user base.

Specifically, in using S-ATA drives (just like ours, maybe slightly more
reliable) they see up to 7% of those drives fail each year. All users
have "soft" drive failures like single remapped sectors.

These errors happen extremely commonly and are what RAID deals with well.

What does not happen commonly is that during the RAID rebuild (kicked
off only after a drive is kicked out), you push the power button or have
a second failure (power outage).

We will have more users lose data if they decide to use ext2 instead of
ext3 and use only single disk storage.

We have real numbers that show that is true. Injecting double faults
into a system that handles single faults is frankly not that interesting.

You can get better protection from these double faults if you move to
"cloud" like storage configs where each box is fault tolerant, but you
also spread your data over multiple boxes in multiple locations.

Regards,

Ric

>>>> I can promise you that hot unplugging and replugging a S-ATA drive will
>>>> also lose you data if you are actively writing to it (ext2, 3,
>>>> whatever).
>>>>
>>> I can promise you that running S-ATA drive will also lose you data,
>>> even if you are not actively writing to it. Just wait 10 years; so
>>> what is your point?
>>>
>> I lost a s-ata drive 24 hours after installing it in a new box. If I had
>> MD5 RAID5, I would not have lost any.
>>
>> My point is that you fail to take into account the rate of failures of a
>> given configuration and the probability of data loss given those rates.
>>
> Actually, that's _exactly_ what he's talking about.
>
> When writing to a degraded raid or a flash disk, journaling is essentially
> useless. If you get a power failure, kernel panic, somebody tripping over a
> USB cable, and so on, your filesystem will not be protected by journaling.
> Your data won't be trashed _every_ time, but the likelihood is much greater
> than experience with journaling in other contexts would suggest.
>
> Worse, the journaling may be counterproductive by _hiding_ many errors that
> fsck would promptly detect, so when the error is detected it may not be
> associated with the event that caused it. It also may not be noticed until
> good backups of the data have been overwritten or otherwise cycled out.
>
> You seem to be arguing that Linux is no longer used anywhere but the
> enterprise, so issues affecting USB flash keys or cheap software-only RAID
> aren't worth documenting?
>
> Rob
>

2009-08-27 20:51:42

by Rob Landley

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Thursday 27 August 2009 06:43:49 Ric Wheeler wrote:
> On 08/26/2009 11:53 PM, Rob Landley wrote:
> > On Tuesday 25 August 2009 18:40:50 Ric Wheeler wrote:
> >> Repeat experiment until you get up to something like google scale or the
> >> other papers on failures in national labs in the US and then we can have
> >> an informed discussion.
> >
> > On google scale anvil lightning can fry your machine out of a clear sky.
> >
> > However, there are still a few non-enterprise users out there, and
> > knowing that specific usage patterns don't behave like they expect might
> > be useful to them.
>
> You are missing the broader point of both papers.

No, I'm dismissing the papers (some of which I read when they first came out
and got slashdotted) as irrelevant to the topic at hand.

Pavel has two failure modes which he can trivially reproduce. The USB stick
one is reproducible on a laptop by jostling said stick. I myself used to have
a literal USB keychain, and the weight of keys dangling from it pulled it out
of the USB socket fairly easily if I wasn't careful. At the time nobody had
told me a journaling filesystem was not a reasonable safeguard here.

Presumably the degraded raid one can be reproduced under an emulator, with no
hardware directly involved at all, so talking about hardware failure rates
ignores the fact that he's actually discussing a _software_ problem. It may
happen in _response_ to hardware failures, but the damage he's attempting to
document happens entirely in software.

These failure modes can cause data loss which journaling can't help, but which
journaling might (or might not) conceivably hide so you don't immediately
notice it. They share a common underlying assumption that the storage
device's update granularity is less than or equal to the filesystem's block
size, which is not actually true of all modern storage devices. The fact he's
only _found_ two instances where this assumption bites doesn't mean there
aren't more waiting to be found, especially as more new storage media types
get introduced.
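
The granularity assumption can be made concrete with a toy model. This
is a sketch of a deliberately naive device, not any real controller's
firmware: the 128k erase block and 4k filesystem block sizes, the
update-in-place strategy, and the names write_block and
power_fails_after are all assumptions for illustration.

```python
ERASE_BLOCK = 128 * 1024                      # assumed device update granularity
FS_BLOCK = 4 * 1024                           # assumed filesystem block size
BLOCKS_PER_ERASE = ERASE_BLOCK // FS_BLOCK    # 32 filesystem blocks per erase block


def write_block(flash, index, data, power_fails_after=None):
    """Naively update one filesystem block by erasing and reprogramming
    the whole erase block that contains it.

    power_fails_after=N cuts power once N blocks have been reprogrammed.
    """
    start = (index // BLOCKS_PER_ERASE) * BLOCKS_PER_ERASE
    staged = flash[start:start + BLOCKS_PER_ERASE]   # copy held in device RAM
    staged[index - start] = data

    # Erase the whole erase block on the medium.
    for i in range(start, start + BLOCKS_PER_ERASE):
        flash[i] = None

    # Reprogram it block by block; power may vanish partway through.
    for n, block in enumerate(staged):
        if power_fails_after is not None and n >= power_fails_after:
            return
        flash[start + n] = block


if __name__ == "__main__":
    flash = [f"old{i}" for i in range(BLOCKS_PER_ERASE)]
    write_block(flash, 5, "new5", power_fails_after=3)
    lost = [i for i, block in enumerate(flash) if block is None]
    print(f"{len(lost)} of {BLOCKS_PER_ERASE} blocks lost, including untouched ones")
```

The filesystem asked to change one 4k block, yet blocks it never
touched are gone as well. A journal replay only redoes the blocks the
journal knows about, so it has no way to notice or repair the
neighbours.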

Pavel's response was to attempt to document this. Not that journaling is
_bad_, but that it doesn't protect against this class of problem.

Your response is to talk about google clusters, cloud storage, and cite
academic papers of statistical hardware failure rates. As I understand the
discussion, that's not actually the issue Pavel's talking about, merely one
potential trigger for it.

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

2009-08-27 22:00:58

by Ric Wheeler

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On 08/27/2009 04:51 PM, Rob Landley wrote:
> On Thursday 27 August 2009 06:43:49 Ric Wheeler wrote:
>
>> On 08/26/2009 11:53 PM, Rob Landley wrote:
>>
>>> On Tuesday 25 August 2009 18:40:50 Ric Wheeler wrote:
>>>
>>>> Repeat experiment until you get up to something like google scale or the
>>>> other papers on failures in national labs in the US and then we can have
>>>> an informed discussion.
>>>>
>>> On google scale anvil lightning can fry your machine out of a clear sky.
>>>
>>> However, there are still a few non-enterprise users out there, and
>>> knowing that specific usage patterns don't behave like they expect might
>>> be useful to them.
>>>
>> You are missing the broader point of both papers.
>>
> No, I'm dismissing the papers (some of which I read when they first came out
> and got slashdotted) as irrelevant to the topic at hand.
>

I guess I have to dismiss your dismissing then.

> Pavel has two failure modes which he can trivially reproduce. The USB stick
> one is reproducible on a laptop by jostling said stick. I myself used to have
> a literal USB keychain, and the weight of keys dangling from it pulled it out
> of the USB socket fairly easily if I wasn't careful. At the time nobody had
> told me a journaling filesystem was not a reasonable safeguard here.
>
> Presumably the degraded raid one can be reproduced under an emulator, with no
> hardware directly involved at all, so talking about hardware failure rates
> ignores the fact that he's actually discussing a _software_ problem. It may
> happen in _response_ to hardware failures, but the damage he's attempting to
> document happens entirely in software.
>
> These failure modes can cause data loss which journaling can't help, but which
> journaling might (or might not) conceivably hide so you don't immediately
> notice it. They share a common underlying assumption that the storage
> device's update granularity is less than or equal to the filesystem's block
> size, which is not actually true of all modern storage devices. The fact he's
> only _found_ two instances where this assumption bites doesn't mean there
> aren't more waiting to be found, especially as more new storage media types
> get introduced.
>
> Pavel's response was to attempt to document this. Not that journaling is
> _bad_, but that it doesn't protect against this class of problem.
>
> Your response is to talk about google clusters, cloud storage, and cite
> academic papers of statistical hardware failure rates. As I understand the
> discussion, that's not actually the issue Pavel's talking about, merely one
> potential trigger for it.
>
> Rob
>



2009-08-27 22:13:30

by Pavel Machek

[permalink] [raw]
Subject: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)


>>> Repeat experiment until you get up to something like google scale or the
>>> other papers on failures in national labs in the US and then we can have an
>>> informed discussion.
>>>
>> On google scale anvil lightning can fry your machine out of a clear sky.
>>
>> However, there are still a few non-enterprise users out there, and knowing
>> that specific usage patterns don't behave like they expect might be useful to
>> them.
>
> You are missing the broader point of both papers. They (and people like
> me when back at EMC) look at large numbers of machines and try to fix
> what actually breaks when run in the real world and causes data loss.
> The motherboards, S-ATA controllers, disk types are the same class of
> parts that I have in my desktop box today.
...
> These errors happen extremely commonly and are what RAID deals with well.
>
> What does not happen commonly is that during the RAID rebuild (kicked
> off only after a drive is kicked out), you push the power button or have
> a second failure (power outage).
>
> We will have more users loose data if they decide to use ext2 instead of
> ext3 and use only single disk storage.

So your argument basically is

'our ABS brakes are broken, but let's not tell anyone; our car is still
safer than a horse'.

and

'while we know our ABS brakes are broken, they are not a major factor in
accidents, so let's not tell anyone'.

Sorry, but I'd expect slightly higher moral standards. If we can
document it in a way that's non-scary, and does not push people to
single disks (horses), please go ahead; but you have to mention that
MD RAID breaks journalling assumptions (our ABS brakes really are
broken).
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-28 01:32:49

by Ric Wheeler

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 08/27/2009 06:13 PM, Pavel Machek wrote:
>
>>>> Repeat experiment until you get up to something like google scale or the
>>>> other papers on failures in national labs in the US and then we can have an
>>>> informed discussion.
>>>>
>>> On google scale anvil lightning can fry your machine out of a clear sky.
>>>
>>> However, there are still a few non-enterprise users out there, and knowing
>>> that specific usage patterns don't behave like they expect might be useful to
>>> them.
>>
>> You are missing the broader point of both papers. They (and people like
>> me when back at EMC) look at large numbers of machines and try to fix
>> what actually breaks when run in the real world and causes data loss.
>> The motherboards, S-ATA controllers, disk types are the same class of
>> parts that I have in my desktop box today.
> ...
>> These errors happen extremely commonly and are what RAID deals with well.
>>
>> What does not happen commonly is that during the RAID rebuild (kicked
>> off only after a drive is kicked out), you push the power button or have
>> a second failure (power outage).
>>
>> We will have more users lose data if they decide to use ext2 instead of
>> ext3 and use only single disk storage.
>
> So your argument basically is
>
> 'our abs brakes are broken, but lets not tell anyone; our car is still
> safer than a horse'.
>
> and
>
> 'while we know our abs brakes are broken, they are not major factor in
> accidents, so lets not tell anyone'.
>
> Sorry, but I'd expect slightly higher moral standards. If we can
> document it in a way that's non-scary, and does not push people to
> single disks (horses), please go ahead; but you have to mention that
> md raid breaks journalling assumptions (our abs brakes really are
> broken).
> Pavel
>


You continue to ignore the technical facts that everyone (both the MD and
ext3 people) has put in front of you.

If you have a specific bug in MD code, please propose a patch.

Ric


2009-08-28 06:44:49

by Pavel Machek

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Thu 2009-08-27 21:32:49, Ric Wheeler wrote:
> On 08/27/2009 06:13 PM, Pavel Machek wrote:
>>
>>>>> Repeat experiment until you get up to something like google scale or the
>>>>> other papers on failures in national labs in the US and then we can have an
>>>>> informed discussion.
>>>>>
>>>> On google scale anvil lightning can fry your machine out of a clear sky.
>>>>
>>>> However, there are still a few non-enterprise users out there, and knowing
>>>> that specific usage patterns don't behave like they expect might be useful to
>>>> them.
>>>
>>> You are missing the broader point of both papers. They (and people like
>>> me when back at EMC) look at large numbers of machines and try to fix
>>> what actually breaks when run in the real world and causes data loss.
>>> The motherboards, S-ATA controllers, disk types are the same class of
>>> parts that I have in my desktop box today.
>> ...
>>> These errors happen extremely commonly and are what RAID deals with well.
>>>
>>> What does not happen commonly is that during the RAID rebuild (kicked
>>> off only after a drive is kicked out), you push the power button or have
>>> a second failure (power outage).
>>>
>>> We will have more users lose data if they decide to use ext2 instead of
>>> ext3 and use only single disk storage.
>>
>> So your argument basically is
>>
>> 'our abs brakes are broken, but lets not tell anyone; our car is still
>> safer than a horse'.
>>
>> and
>>
>> 'while we know our abs brakes are broken, they are not major factor in
>> accidents, so lets not tell anyone'.
>>
>> Sorry, but I'd expect slightly higher moral standards. If we can
>> document it in a way that's non-scary, and does not push people to
>> single disks (horses), please go ahead; but you have to mention that
>> md raid breaks journalling assumptions (our abs brakes really are
>> broken).
>
> You continue to ignore the technical facts that everyone (both MD and
> ext3) people put in front of you.
>
> If you have a specific bug in MD code, please propose a patch.

Interesting. So, what's technically wrong with the patch below?

Pavel
---

From: Theodore Tso <[email protected]>

Document that many devices are too broken for filesystems to protect
data in case of powerfail.

Signed-of-by: Pavel Machek <[email protected]>

diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt
new file mode 100644
index 0000000..2f3eec1
--- /dev/null
+++ b/Documentation/filesystems/dangers.txt
@@ -0,0 +1,21 @@
+There are storage devices that high highly undesirable properties when
+they are disconnected or suffer power failures while writes are in
+progress; such devices include flash devices and DM/MD RAID 4/5/6 (*)
+arrays. These devices have the property of potentially corrupting
+blocks being written at the time of the power failure, and worse yet,
+amplifying the region where blocks are corrupted such that additional
+sectors are also damaged during the power failure.
+
+Users who use such storage devices are well advised take
+countermeasures, such as the use of Uninterruptible Power Supplies,
+and making sure the flash device is not hot-unplugged while the device
+is being used. Regular backups when using these devices is also a
+Very Good Idea.
+
+Otherwise, file systems placed on these devices can suffer silent data
+and file system corruption. An forced use of fsck may detect metadata
+corruption resulting in file system corruption, but will not suffice
+to detect data corruption.
+
+(*) Degraded array or single disk failure "near" the powerfail is
+neccessary for this property of RAID arrays to bite.


--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-28 07:11:45

by Florian Weimer

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret

* Ric Wheeler:

> You continue to ignore the technical facts that everyone (both MD and
> ext3) people put in front of you.
>
> If you have a specific bug in MD code, please propose a patch.

In RAID 1 mode, it should read both copies and error out on
mismatch. 8-)

--
Florian Weimer <[email protected]>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

2009-08-28 07:23:56

by NeilBrown

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret

On Fri, August 28, 2009 5:11 pm, Florian Weimer wrote:
> * Ric Wheeler:
>
>> You continue to ignore the technical facts that everyone (both MD and
>> ext3) people put in front of you.
>>
>> If you have a specific bug in MD code, please propose a patch.
>
> In RAID 1 mode, it should read both copies and error out on
> mismatch. 8-)

Despite your smiley:

no it shouldn't, and no one is making any claims about raid1 being
unsafe, only raid4/5/6.

NeilBrown


2009-08-28 07:31:36

by NeilBrown

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Fri, August 28, 2009 4:44 pm, Pavel Machek wrote:
> On Thu 2009-08-27 21:32:49, Ric Wheeler wrote:
>>>
>> If you have a specific bug in MD code, please propose a patch.
>
> Interesting. So, what's technically wrong with the patch below?
>

You mean apart from ".... that high highly undesirable ...." ??
                               ^^^^^^^^^^^

And the phrase "Regular backups when using these devices ...." should
be "Regular backups when using any devices .....".
                               ^^^
If you have a device failure near a power fail on a raid5 you might
lose some blocks of data. If you have a device failure near (or not
near) a power failure on raid0 or jbod etc you will certainly lose lots
of blocks of data.

I think it would be better to say:

".... and degraded DM/MD RAID 4/5/6(*) arrays..."
          ^^^^^^^^
with
(*) If device failure causes the array to become degraded during or
immediately after the power failure, the same problem can result.
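
To make the failure mode behind this footnote concrete, here is a minimal
sketch in Python of a single stripe with XOR parity. It is a toy model, not
MD code: the three-disk layout, the four-byte blocks and every name in it are
assumptions made up for illustration. It shows that when a power failure
leaves new data on disk next to stale parity, and the array then runs
degraded, reconstructing a block that was never being written returns garbage.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together (RAID 5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 3-disk RAID 5: two data blocks plus parity.
d0 = b"AAAA"        # block that is NOT being written
d1_old = b"BBBB"    # block about to be overwritten
parity_old = xor_blocks([d0, d1_old])

# Power fails mid-update: the new d1 reached its disk, the new parity did not.
d1_new = b"CCCC"
parity_on_disk = parity_old  # stale parity survives the power failure

# The disk holding d0 is then lost, so d0 must be rebuilt from what remains.
d0_rebuilt = xor_blocks([d1_new, parity_on_disk])

# For comparison: the same rebuild when the parity update did complete.
parity_new = xor_blocks([d0, d1_new])
d0_ok = xor_blocks([d1_new, parity_new])

print("rebuilt with stale parity:", d0_rebuilt)
print("rebuilt with fresh parity:", d0_ok)
```

With all devices present, the stale parity can simply be recomputed from the
data blocks after the crash, which is why the problem needs a missing device
as well as the power failure.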

And "necessary" only has the one 'c' :-)

NeilBrown

> Pavel
> ---
>
> From: Theodore Tso <[email protected]>
>
> Document that many devices are too broken for filesystems to protect
> data in case of powerfail.
>
> Signed-of-by: Pavel Machek <[email protected]>
>
> diff --git a/Documentation/filesystems/dangers.txt
> b/Documentation/filesystems/dangers.txt
> new file mode 100644
> index 0000000..2f3eec1
> --- /dev/null
> +++ b/Documentation/filesystems/dangers.txt
> @@ -0,0 +1,21 @@
> +There are storage devices that high highly undesirable properties when
> +they are disconnected or suffer power failures while writes are in
> +progress; such devices include flash devices and DM/MD RAID 4/5/6 (*)
> +arrays. These devices have the property of potentially corrupting
> +blocks being written at the time of the power failure, and worse yet,
> +amplifying the region where blocks are corrupted such that additional
> +sectors are also damaged during the power failure.
> +
> +Users who use such storage devices are well advised take
> +countermeasures, such as the use of Uninterruptible Power Supplies,
> +and making sure the flash device is not hot-unplugged while the device
> +is being used. Regular backups when using these devices is also a
> +Very Good Idea.
> +
> +Otherwise, file systems placed on these devices can suffer silent data
> +and file system corruption. An forced use of fsck may detect metadata
> +corruption resulting in file system corruption, but will not suffice
> +to detect data corruption.
> +
> +(*) Degraded array or single disk failure "near" the powerfail is
> +neccessary for this property of RAID arrays to bite.
>
>

2009-08-28 11:16:46

by Ric Wheeler

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 08/28/2009 02:44 AM, Pavel Machek wrote:
> On Thu 2009-08-27 21:32:49, Ric Wheeler wrote:
>> On 08/27/2009 06:13 PM, Pavel Machek wrote:
>>>
>>>>>> Repeat experiment until you get up to something like google scale or the
>>>>>> other papers on failures in national labs in the US and then we can have an
>>>>>> informed discussion.
>>>>>>
>>>>> On google scale anvil lightning can fry your machine out of a clear sky.
>>>>>
>>>>> However, there are still a few non-enterprise users out there, and knowing
>>>>> that specific usage patterns don't behave like they expect might be useful to
>>>>> them.
>>>>
>>>> You are missing the broader point of both papers. They (and people like
>>>> me when back at EMC) look at large numbers of machines and try to fix
>>>> what actually breaks when run in the real world and causes data loss.
>>>> The motherboards, S-ATA controllers, disk types are the same class of
>>>> parts that I have in my desktop box today.
>>> ...
>>>> These errors happen extremely commonly and are what RAID deals with well.
>>>>
>>>> What does not happen commonly is that during the RAID rebuild (kicked
>>>> off only after a drive is kicked out), you push the power button or have
>>>> a second failure (power outage).
>>>>
>>>> We will have more users lose data if they decide to use ext2 instead of
>>>> ext3 and use only single disk storage.
>>>
>>> So your argument basically is
>>>
>>> 'our abs brakes are broken, but lets not tell anyone; our car is still
>>> safer than a horse'.
>>>
>>> and
>>>
>>> 'while we know our abs brakes are broken, they are not major factor in
>>> accidents, so lets not tell anyone'.
>>>
>>> Sorry, but I'd expect slightly higher moral standards. If we can
>>> document it in a way that's non-scary, and does not push people to
>>> single disks (horses), please go ahead; but you have to mention that
>>> md raid breaks journalling assumptions (our abs brakes really are
>>> broken).
>>
>> You continue to ignore the technical facts that everyone (both MD and
>> ext3) people put in front of you.
>>
>> If you have a specific bug in MD code, please propose a patch.
>
> Interesting. So, what's technically wrong with the patch below?
>
> Pavel


My suggestion was that you stop trying to document your assertion of an issue
and actually suggest fixes in code or implementation. I really don't think that
you have properly diagnosed your specific failure or done sufficient analysis. However,
if you put a full analysis and suggested code out to the MD devel lists, we can
debate technical implementation as we normally do.

As Ted quite clearly stated, documentation on how RAID works, how to configure
it, etc, is best put in RAID documentation. What you claim as a key issue is an
issue for all file systems (including ext2).

The only note that I would put in ext3/4 etc documentation would be:

"Reliable storage is important for any file system. Single disks (or FLASH or
SSD) do fail on a regular basis.

To reduce your risk of data loss, it is advisable to use RAID which can overcome
these common issues. If using MD software RAID, see the RAID documentation on
how best to configure your storage.

With or without RAID, it is always important to back up your data to an external
device and keep copies of that backup off site."

ric



> ---
>
> From: Theodore Tso<[email protected]>
>
> Document that many devices are too broken for filesystems to protect
> data in case of powerfail.
>
> Signed-of-by: Pavel Machek<[email protected]>
>
> diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt
> new file mode 100644
> index 0000000..2f3eec1
> --- /dev/null
> +++ b/Documentation/filesystems/dangers.txt
> @@ -0,0 +1,21 @@
> +There are storage devices that high highly undesirable properties when
> +they are disconnected or suffer power failures while writes are in
> +progress; such devices include flash devices and DM/MD RAID 4/5/6 (*)
> +arrays. These devices have the property of potentially corrupting
> +blocks being written at the time of the power failure, and worse yet,
> +amplifying the region where blocks are corrupted such that additional
> +sectors are also damaged during the power failure.
> +
> +Users who use such storage devices are well advised take
> +countermeasures, such as the use of Uninterruptible Power Supplies,
> +and making sure the flash device is not hot-unplugged while the device
> +is being used. Regular backups when using these devices is also a
> +Very Good Idea.
> +
> +Otherwise, file systems placed on these devices can suffer silent data
> +and file system corruption. An forced use of fsck may detect metadata
> +corruption resulting in file system corruption, but will not suffice
> +to detect data corruption.
> +
> +(*) Degraded array or single disk failure "near" the powerfail is
> +neccessary for this property of RAID arrays to bite.
>
>

2009-08-28 12:08:54

by Theodore Ts'o

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Fri, Aug 28, 2009 at 08:44:49AM +0200, Pavel Machek wrote:
> From: Theodore Tso <[email protected]>
>
> Document that many devices are too broken for filesystems to protect
> data in case of powerfail.
>
> Signed-of-by: Pavel Machek <[email protected]>

NACK. I didn't write this patch, and it's disingenuous for you to try
to claim that I authored it.

You took text I wrote from the *middle* of an e-mail discussion and
you ignored multiple corrections to typos that I made --- typos that
I would have corrected if I had ultimately decided to post this as a
patch, which I did NOT.

While Neil Brown's corrections are minimally necessary so the text is
at least technically *correct*, it's still not the right advice to
give system administrators. It's better than the fear-mongering
patches you had proposed earlier, but what would be better *still* is
telling people why running with degraded RAID arrays is bad, and to
give them further tips about how to use RAID arrays safely.

To use your ABS brakes analogy, just because it's not safe to rely on
ABS brakes if the "check brakes" light is on, that doesn't justify
writing something alarmist which claims that ABS brakes don't work
100% of the time, don't use ABS brakes, they're broken!!!!

The first part of it is true, since ABS brakes can suffer mechanical
failure. But what we should be telling drivers is, "if the 'check
brakes' light comes on, don't keep driving with it, go to a garage and
get it fixed!!!". Similarly, if you get a notice that your RAID is
running in degraded mode, you've already suffered one failure; you
won't survive another failure, so fix that issue ASAP!

If you're really paranoid, you could decide to "pull over to the side
of the road"; that is, you could stop writing to the RAID array as
soon as possible, and then get the RAID array rebuilt before
proceeding. That can reduce the chances of a second failure. But in
the real world, there are costs associated with taking a production
server off-line, and the prudent system administrator has to do a
risk-reward tradeoff. A better approach might be to have the array
configured with a hot spare, and to regularly scrub the array, and
configure the RAID array with either a battery backup or a UPS. And
hot-swap drives might not be a bad idea, too.

But in any case, just because ABS brakes and RAID arrays can suffer
failures, that doesn't mean you should run around telling people not
to use RAID arrays or that RAID arrays are broken. People are better off
using RAID than using single disk storage solutions, just as
people are better off using ABS brakes than not.

Your argument basically boils down to, "if you drive like a maniac
when the roads are wet and slippery, ABS brakes might not save your
life. Since ABS brakes might cause you to have a false sense of
security, it's better to tell users that ABS brakes are broken."

That's just silly. What we should be telling people instead is (a)
pay attention to the check brakes light (just as you should pay
attention to the "RAID array is degraded" warning), and (b) while ABS
brakes will get you out of some situations with life and limb intact,
they do not repeal the laws of physics (do regular full and
incremental backups; practice disk scrubbing; use UPS's or battery
backups).

- Ted

2009-08-28 14:49:38

by David Lang

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Thu, 27 Aug 2009, Rob Landley wrote:

> Pavel's response was to attempt to document this. Not that journaling is
> _bad_, but that it doesn't protect against this class of problem.

I don't think anyone is disagreeing with the statement that journaling
doesn't protect against this class of problems, but Pavel's statements
didn't say that. he stated that ext3 is more dangerous than ext2.

David Lang

2009-08-29 10:05:58

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Fri 2009-08-28 07:49:38, [email protected] wrote:
> On Thu, 27 Aug 2009, Rob Landley wrote:
>
>> Pavel's response was to attempt to document this. Not that journaling is
>> _bad_, but that it doesn't protect against this class of problem.
>
> I don't think anyone is disagreeing with the statement that journaling
> doesn't protect against this class of problems, but Pavel's statements
> didn't say that. he stated that ext3 is more dangerous than ext2.

Well, if you use 'common' fsck policy, ext3 _is_ more dangerous.

But I'm not pushing that to documentation, I'm trying to push info
everyone agrees with. (check the patches).
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-29 20:22:09

by Rob Landley

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Saturday 29 August 2009 05:05:58 Pavel Machek wrote:
> On Fri 2009-08-28 07:49:38, [email protected] wrote:
> > On Thu, 27 Aug 2009, Rob Landley wrote:
> >> Pavel's response was to attempt to document this. Not that journaling
> >> is _bad_, but that it doesn't protect against this class of problem.
> >
> > I don't think anyone is disagreeing with the statement that journaling
> > doesn't protect against this class of problems, but Pavel's statements
> > didn't say that. he stated that ext3 is more dangerous than ext2.
>
> Well, if you use 'common' fsck policy, ext3 _is_ more dangerous.

The filesystem itself isn't more dangerous, but it may provide a false sense of
security when used on storage devices it wasn't designed for.

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

2009-08-29 21:34:32

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Sat 2009-08-29 15:22:06, Rob Landley wrote:
> On Saturday 29 August 2009 05:05:58 Pavel Machek wrote:
> > On Fri 2009-08-28 07:49:38, [email protected] wrote:
> > > On Thu, 27 Aug 2009, Rob Landley wrote:
> > >> Pavel's response was to attempt to document this. Not that journaling
> > >> is _bad_, but that it doesn't protect against this class of problem.
> > >
> > > I don't think anyone is disagreeing with the statement that journaling
> > > doesn't protect against this class of problems, but Pavel's statements
> > > didn't say that. he stated that ext3 is more dangerous than ext2.
> >
> > Well, if you use 'common' fsck policy, ext3 _is_ more dangerous.
>
> The filesystem itself isn't more dangerous, but it may provide a false sense of
> security when used on storage devices it wasn't designed for.

Agreed.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-30 07:51:45

by Pavel Machek

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

Hi!

> > From: Theodore Tso <[email protected]>
> >
> > Document that many devices are too broken for filesystems to protect
> > data in case of powerfail.
> >
> > Signed-of-by: Pavel Machek <[email protected]>
>
> NACK. I didn't write this patch, and it's disingenuous for you to try
> to claim that I authored it.

Well, you did write original text, so I wanted to give you
credit. Sorry.

> While Neil Brown's corrections are minimally necessary so the text is
> at least technically *correct*, it's still not the right advice to
> give system administrators. It's better than the fear-mongering
> patches you had proposed earlier, but what would be better *still* is
> telling people why running with degraded RAID arrays is bad, and to
> give them further tips about how to use RAID arrays safely.

Maybe this belongs to Doc*/filesystems, and more detailed RAID
description should go to md description?

> To use your ABS brakes analogy, just because it's not safe to rely on
> ABS brakes if the "check brakes" light is on, that doesn't justify
> writing something alarmist which claims that ABS brakes don't work
> 100% of the time, don't use ABS brakes, they're broken!!!!

If it only was this simple. We don't have 'check brakes' (aka
'journalling ineffective') warning light. If we had that, I would not
have problem.

It is rather that your ABS brakes are ineffective if 'check engine'
(RAID degraded) is lit. And yes, running with 'check engine' for
extended periods may be bad idea, but I know people that do
that... and I still hope their brakes work (and believe they should
have won suit for damages should their ABS brakes fail).

> That's just silly. What we should be telling people instead is (a)
> pay attention to the check brakes light (just as you should pay
> attention to the RAID array is degraded warning), and (b) while ABS

'your RAID array is degraded' is a very counterintuitive way to say
'...and btw your journalling is no longer effective, either'.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-30 09:01:28

by Christian Kujau

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Sun, 30 Aug 2009 at 09:51, Pavel Machek wrote:
> > give system administrators. It's better than the fear-mongering
> > patches you had proposed earlier, but what would be better *still* is
> > telling people why running with degraded RAID arrays is bad, and to
> > give them further tips about how to use RAID arrays safely.
>
> Maybe this belongs to Doc*/filesystems, and more detailed RAID
> description should go to md description?

Why should this be placed in *kernel* documentation anyway? The "dangers
of RAID", the hints that "backups are a good idea" - isn't that something
for howtos for sysadmins? No end-user will ever look into Documentation/
anyway. The sysadmins should know what they're doing and see the upsides
and downsides of RAID and journalling filesystems. And they'll turn to
howtos and tutorials to find out. And maybe seek *reference* documentation
in Documentation/ - but I don't think Storage-101 should be covered in
a mostly hidden place like Documentation/.

Christian.
--
BOFH excuse #212:

Of course it doesn't work. We've performed a software upgrade.

2009-08-30 12:55:01

by David Lang

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Sun, 30 Aug 2009, Pavel Machek wrote:

>>> From: Theodore Tso <[email protected]>
>>>
>> To use your ABS brakes analogy, just because it's not safe to rely on
>> ABS brakes if the "check brakes" light is on, that doesn't justify
>> writing something alarmist which claims that ABS brakes don't work
>> 100% of the time, don't use ABS brakes, they're broken!!!!
>
> If it only was this simple. We don't have 'check brakes' (aka
> 'journalling ineffective') warning light. If we had that, I would not
> have problem.
>
> It is rather that your ABS brakes are ineffective if 'check engine'
> (RAID degraded) is lit. And yes, running with 'check engine' for
> extended periods may be bad idea, but I know people that do
> that... and I still hope their brakes work (and believe they should
> have won suit for damages should their ABS brakes fail).

the 'RAID degraded' warning says that _anything_ you put on that block
device is at risk. it doesn't matter if you are using a filesystem with a
journal, one without, or using the raw device directly.

David Lang

2009-08-30 14:12:06

by Ric Wheeler

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 08/30/2009 08:55 AM, [email protected] wrote:
> On Sun, 30 Aug 2009, Pavel Machek wrote:
>
>>>> From: Theodore Tso <[email protected]>
>>>>
>>> To use your ABS brakes analogy, just because it's not safe to rely on
>>> ABS brakes if the "check brakes" light is on, that doesn't justify
>>> writing something alarmist which claims that ABS brakes don't work
>>> 100% of the time, don't use ABS brakes, they're broken!!!!
>>
>> If it only was this simple. We don't have 'check brakes' (aka
>> 'journalling ineffective') warning light. If we had that, I would not
>> have problem.
>>
>> It is rather that your ABS brakes are ineffective if 'check engine'
>> (RAID degraded) is lit. And yes, running with 'check engine' for
>> extended periods may be bad idea, but I know people that do
>> that... and I still hope their brakes work (and believe they should
>> have won suit for damages should their ABS brakes fail).
>
> the 'RAID degraded' warning says that _anything_ you put on that block
> device is at risk. it doesn't matter if you are using a filesystem
> with a journal, one without, or using the raw device directly.
>
> David Lang

The easiest way to lose your data in Linux - with RAID, without RAID,
S-ATA or SAS - is to run with the write cache enabled.

If you compare the size of even a large RAID stripe it will be measured
in KB and as this thread has mentioned already, you stand to have damage
to just one stripe (or even just a disk sector or two).

If you lose power with the write caches enabled on that same 5 drive
RAID set, you could lose as much as 5 * 32MB of freshly written data on
a power loss (16-32MB write caches are common on s-ata disks these days).
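
As a rough check of the arithmetic above, the following Python sketch compares
the two amounts of data at risk. Only the 5 drives and the 32MB cache come
from the numbers above; the 64 KiB chunk size and the 4 data + 1 parity
layout are assumptions picked for illustration, and the function names are
hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def cache_exposure_bytes(drives, cache_mib_per_drive):
    """Upper bound on unflushed data sitting in volatile drive caches."""
    return drives * cache_mib_per_drive * MIB


def stripe_bytes(data_drives, chunk_kib):
    """Data held by one full stripe (parity chunk not counted)."""
    return data_drives * chunk_kib * KIB


cache_at_risk = cache_exposure_bytes(5, 32)   # 5 drives, 32 MiB cache each
stripe_at_risk = stripe_bytes(4, 64)          # assumed: 4 data chunks of 64 KiB
ratio = cache_at_risk // stripe_at_risk

print(f"write caches: {cache_at_risk // MIB} MiB")
print(f"one stripe:   {stripe_at_risk // KIB} KiB")
print(f"ratio:        {ratio}x")
```

Under these assumptions the caches hold 160 MiB against 256 KiB for one
stripe, a factor of 640. Real caches are rarely full of dirty data, so this
is a worst case rather than an expected loss.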

For MD RAID5 (and RAID6), you really must run with the write cache disabled
until we get barriers to work for those configurations.

It would be interesting for Pavel to retest with the write cache
enabled/disabled on his power loss scenarios with multi-drive RAID.

Regards,

Ric




2009-08-30 14:44:04

by Michael Tokarev

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

Ric Wheeler wrote:
[]
> The easiest way to lose your data in Linux - with RAID, without RAID,
> S-ATA or SAS - is to run with the write cache enabled.
>
> If you compare the size of even a large RAID stripe it will be measured
> in KB and as this thread has mentioned already, you stand to have damage
> to just one stripe (or even just a disk sector or two).
>
> If you lose power with the write caches enabled on that same 5 drive
> RAID set, you could lose as much as 5 * 32MB of freshly written data on
> a power loss (16-32MB write caches are common on s-ata disks these days).

This is fundamentally wrong. Many filesystems today use either barriers
or flushes (if barriers are not supported), and the times when disk drives
were lying to the OS that the cache got flushed are long gone.

> For MD RAID5 (and RAID6), you really must run with the write cache disabled
> until we get barriers to work for those configurations.

I highly doubt barriers will ever be supported on anything but simple
raid1, because it's impossible to guarantee ordering across multiple
drives. Well, it *is* possible to have write barriers with journalled
(and/or with battery-backed-cache) raid[456].

Note that even if raid[456] does not support barriers, write cache
flushes still work.
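
Seen from an application, the flush discussed here is what an fsync() call is
meant to trigger. The Python sketch below shows the usual pattern, assuming a
Linux system; durable_write and the file name are hypothetical. Whether the
request actually empties the drive's write cache depends on the filesystem,
its mount options and the block layer underneath (MD included), which is the
point under debate in this thread.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data to path and ask the kernel to push it to stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # drain Python's userspace buffer
        os.fsync(f.fileno())   # ask the kernel to write the file out
    # fsync the containing directory so the new directory entry is written too
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        durable_write(os.path.join(tmp, "record.bin"), b"commit record")
```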

/mjt

2009-08-30 15:05:36

by Pavel Machek

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Sun 2009-08-30 05:55:01, [email protected] wrote:
> On Sun, 30 Aug 2009, Pavel Machek wrote:
>
>>>> From: Theodore Tso <[email protected]>
>>>>
>>> To use your ABS brakes analogy, just because it's not safe to rely on
>>> ABS brakes if the "check brakes" light is on, that doesn't justify
>>> writing something alarmist which claims that ABS brakes don't work
>>> 100% of the time, don't use ABS brakes, they're broken!!!!
>>
>> If it only was this simple. We don't have 'check brakes' (aka
>> 'journalling ineffective') warning light. If we had that, I would not
>> have problem.
>>
>> It is rather that your ABS brakes are ineffective if 'check engine'
>> (RAID degraded) is lit. And yes, running with 'check engine' for
>> extended periods may be bad idea, but I know people that do
>> that... and I still hope their brakes work (and believe they should
>> have won suit for damages should their ABS brakes fail).
>
> the 'RAID degraded' warning says that _anything_ you put on that block
> device is at risk. it doesn't matter if you are using a filesystem with a
> journal, one without, or using the raw device directly.

If you are using one with journal, you'll still need to run fsck at
boot time, to make sure metadata is still consistent... Protection
provided by journaling is not effective in this configuration.

(You have the point that pretty much all users of the blockdevice will
be affected by powerfail degraded mode.)
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-30 15:20:23

by Theodore Ts'o

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Sun, Aug 30, 2009 at 09:51:35AM +0200, Pavel Machek wrote:
>
> If it only was this simple. We don't have 'check brakes' (aka
> 'journalling ineffective') warning light. If we had that, I would not
> have problem.

But we do; competently designed (and in the case of software RAID,
competently packaged) RAID subsystems send notifications to the system
administrator when there is a hard drive failure. Some hardware RAID
systems will send a page to the system administrator. A mid-range
Areca card has a separate ethernet port so it can send e-mail to the
administrator, even if the OS is hosed for some reason.

And it's not a matter of journalling ineffective; the much bigger deal
is, "your data is at risk"; perhaps because the file system metadata
may become subject to corruption, but more critically, because the
file data may become subject to corruption. Metadata becoming subject
to corruption is important primarily because it leads to data becoming
corrupted; metadata is the tail; the user's data is the dog.

So we *do* have the warning light; the problem is that just as some
people may not realize that "check brakes" means, "YOU COULD DIE",
some people may not realize that "hard drive failure; RAID array
degraded" could mean, "YOU COULD LOSE DATA".

Fortunately, for software RAID, this is easily solved; if you are so
concerned, why don't you submit a patch to mdadm adjusting the e-mail
sent to the system administrator when the array is in a degraded
state, such that it states, "YOU COULD LOSE DATA". I would gently
suggest to you this would be ***far*** more effective than a patch to
kernel documentation.
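
To make the mdadm suggestion concrete: `mdadm --monitor` can run a program named on a PROGRAM line in mdadm.conf, handing it the event name, the md device and, where relevant, a component device. What follows is only a sketch of such a hook, not a patch to mdadm; the function name `build_alert` and the message wording are made up for illustration.

```python
# Hypothetical alert hook for `mdadm --monitor` (illustration only).

# Events that mean the array is running without full redundancy.
RISKY_EVENTS = {"Fail", "DegradedArray"}


def build_alert(event, md_device, component=None):
    """Return a warning string for risky events, or None for the rest."""
    if event not in RISKY_EVENTS:
        return None
    detail = f" (component {component})" if component else ""
    return (
        f"{event} on {md_device}{detail}: RAID array degraded. "
        "YOU COULD LOSE DATA. Replace the failed drive and verify backups."
    )
```

A wrapper script would call `build_alert(*sys.argv[1:4])` and mail or log the result whenever it is not None.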

- Ted

2009-08-30 16:10:52

by Ric Wheeler

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 08/30/2009 10:44 AM, Michael Tokarev wrote:
> Ric Wheeler wrote:
> []
>> The easiest way to lose your data in Linux - with RAID, without RAID,
>> S-ATA or SAS - is to run with the write cache enabled.
>>
>> If you compare the size of even a large RAID stripe it will be
>> measured in KB and as this thread has mentioned already, you stand to
>> have damage to just one stripe (or even just a disk sector or two).
>>
>> If you lose power with the write caches enabled on that same 5 drive
>> RAID set, you could lose as much as 5 * 32MB of freshly written data
>> on a power loss (16-32MB write caches are common on s-ata disks
>> these days).
>
> This is fundamentally wrong. Many filesystems today use either barriers
> or flushes (if barriers are not supported), and the times when disk drives
> were lying to the OS that the cache got flushed are long gone.

Unfortunately not - if you mount a file system with write cache enabled
and see "barriers disabled" messages in /var/log/messages, this is
exactly what happens.
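
A minimal sketch of that check, assuming only that the relevant kernel messages mention barriers being disabled. The exact wording differs between filesystems and kernel versions, so the pattern below is a guess to be adjusted, and the helper names are hypothetical.

```python
import re

# Assumed pattern: any log line that mentions barriers being disabled.
# The real wording varies by filesystem and kernel version.
BARRIER_OFF = re.compile(r"barrier.*disabl|disabl.*barrier", re.IGNORECASE)


def barrier_warnings(lines):
    """Return the log lines that report barriers being turned off."""
    return [line.rstrip("\n") for line in lines if BARRIER_OFF.search(line)]


def scan_log(path="/var/log/messages"):
    """Scan a log file; the default path is the one named above."""
    with open(path, errors="replace") as log:
        return barrier_warnings(log)
```

An empty result proves nothing on its own, since the log may have rotated since the file system was mounted.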

File systems issue write barrier operations, which in turn issue cache
flush commands (ATA_FLUSH_EXT or its SCSI equivalent).

MD5 and MD6 do not pass these operations on currently and there is no
other file system level mechanism that somehow bypasses the IO stack to
invalidate or flush the cache.

Note that some devices have non-volatile write caches (specifically
arrays or battery backed RAID cards) where this is not an issue.
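
The exposure quoted above is simple arithmetic, sketched here as an upper bound. The function name is hypothetical, and the bound assumes every drive cache is completely full of unwritten data at the moment of power loss.

```python
def worst_case_unflushed_mb(drives, cache_mb_per_drive, nonvolatile=False):
    """Upper bound, in MB, on fresh data held only in drive write caches."""
    if nonvolatile:
        # Battery-backed or otherwise non-volatile caches survive power loss.
        return 0
    return drives * cache_mb_per_drive
```

For the 5 drive RAID set with 32MB caches, `worst_case_unflushed_mb(5, 32)` gives 160, against a single stripe measured in KB.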


>
>> For MD5 (and MD6), you really must run with the write cache disabled
>> until we get barriers to work for those configurations.
>
> I highly doubt barriers will ever be supported on anything but simple
> raid1, because it's impossible to guarantee ordering across multiple
> drives. Well, it *is* possible to have write barriers with journalled
> (and/or with battery-backed-cache) raid[456].
>
> Note that even if raid[456] does not support barriers, write cache
> flushes still works.
>
> /mjt

I think that you are confused - barriers are implemented using cache
flushes.
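
A toy model of why that matters, assuming a disk with a volatile write cache and a journal commit that is only valid once the journal blocks are on stable storage. None of this is kernel code; `Disk` and `commit` are illustrative names.

```python
class Disk:
    """Toy disk with a volatile write cache (not a model of real firmware)."""

    def __init__(self):
        self.cache = []   # written, but still volatile
        self.media = []   # safely on stable storage

    def write(self, block):
        self.cache.append(block)

    def flush(self):
        # Cache flush: everything cached reaches stable storage.
        self.media.extend(self.cache)
        self.cache.clear()

    def power_loss(self):
        # Whatever was only in the cache is gone.
        self.cache.clear()


def commit(disk, journal_blocks, use_barrier):
    """Write journal blocks, then a commit record, optionally with flushes."""
    for block in journal_blocks:
        disk.write(block)
    if use_barrier:
        disk.flush()      # journal blocks are stable before the commit
    disk.write("commit")
    if use_barrier:
        disk.flush()      # the commit record itself is stable
```

If the flushes are dropped somewhere in the stack, the `use_barrier=False` case is what the disk actually sees, and a power loss right after the commit leaves nothing on the media.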

Ric



2009-08-30 16:35:13

by Christoph Hellwig

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Sun, Aug 30, 2009 at 06:44:04PM +0400, Michael Tokarev wrote:
>> If you lose power with the write caches enabled on that same 5 drive
>> RAID set, you could lose as much as 5 * 32MB of freshly written data on
>> a power loss (16-32MB write caches are common on s-ata disks these
>> days).
>
> This is fundamentally wrong. Many filesystems today use either barriers
> or flushes (if barriers are not supported), and the times when disk drives
> were lying to the OS that the cache got flushed are long gone.

While most common filesystems do have barrier support:

- it is not actually enabled for the two most common filesystems
- the support for write barriers and cache flushing tends to be buggy
  all over our software stack.

>> For MD5 (and MD6), you really must run with the write cache disabled
>> until we get barriers to work for those configurations.
>
> I highly doubt barriers will ever be supported on anything but simple
> raid1, because it's impossible to guarantee ordering across multiple
> drives. Well, it *is* possible to have write barriers with journalled
> (and/or with battery-backed-cache) raid[456].
>
> Note that even if raid[456] does not support barriers, write cache
> flushes still works.

All currently working barrier implementations on Linux are built upon
queue drains and cache flushes, plus sometimes setting the FUA bit.


2009-08-31 13:14:32

by Ric Wheeler

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 08/30/2009 12:35 PM, Christoph Hellwig wrote:
> On Sun, Aug 30, 2009 at 06:44:04PM +0400, Michael Tokarev wrote:
>>> If you lose power with the write caches enabled on that same 5 drive
>>> RAID set, you could lose as much as 5 * 32MB of freshly written data on
>>> a power loss (16-32MB write caches are common on s-ata disks these
>>> days).
>>
>> This is fundamentally wrong. Many filesystems today use either barriers
>> or flushes (if barriers are not supported), and the times when disk drives
>> were lying to the OS that the cache got flushed are long gone.
>
> While most common filesystem do have barrier support it is:
>
> - not actually enabled for the two most common filesystems
> - the support for write barriers an cache flushing tends to be buggy
> all over our software stack,
>

Or just missing - I think that MD5/6 simply drop the requests at present.

I wonder if it would be worth having MD probe for write cache enabled & warn if
barriers are not supported?

>>> For MD5 (and MD6), you really must run with the write cache disabled
>>> until we get barriers to work for those configurations.
>>
>> I highly doubt barriers will ever be supported on anything but simple
>> raid1, because it's impossible to guarantee ordering across multiple
>> drives. Well, it *is* possible to have write barriers with journalled
>> (and/or with battery-backed-cache) raid[456].
>>
>> Note that even if raid[456] does not support barriers, write cache
>> flushes still works.
>
> All currently working barrier implementations on Linux are built upon
> queue drains and cache flushes, plus sometimes setting the FUA bit.
>


2009-08-31 13:16:43

by Christoph Hellwig

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Mon, Aug 31, 2009 at 09:15:27AM -0400, Ric Wheeler wrote:
>> While most common filesystem do have barrier support it is:
>>
>> - not actually enabled for the two most common filesystems
>> - the support for write barriers an cache flushing tends to be buggy
>> all over our software stack,
>>
>
> Or just missing - I think that MD5/6 simply drop the requests at present.
>
> I wonder if it would be worth having MD probe for write cache enabled &
> warn if barriers are not supported?

In my opinion even that is too weak. We know how to control the cache
settings on all common disks (that is, SCSI and ATA), so we should always
disable the write cache unless we know that the whole stack (filesystem,
raid, volume managers) supports barriers. And even then we should make
sure the filesystems actually use barriers everywhere they are needed,
which we have failed at for years.


2009-08-31 13:19:58

by Mark Lord

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

Christoph Hellwig wrote:
> On Mon, Aug 31, 2009 at 09:15:27AM -0400, Ric Wheeler wrote:
>>> While most common filesystem do have barrier support it is:
>>>
>>> - not actually enabled for the two most common filesystems
>>> - the support for write barriers an cache flushing tends to be buggy
>>> all over our software stack,
>>>
>> Or just missing - I think that MD5/6 simply drop the requests at present.
>>
>> I wonder if it would be worth having MD probe for write cache enabled &
>> warn if barriers are not supported?
>
> In my opinion even that is too weak. We know how to control the cache
> settings on all common disks (that is scsi and ata), so we should always
> disable the write cache unless we know that the whole stack (filesystem,
> raid, volume managers) supports barriers. And even then we should make
> sure the filesystems does actually use barriers everywhere that's needed
> which failed at for years.
..

That stack does not know that my MD device has full battery backup,
so it bloody well better NOT prevent me from enabling the write caches.

In fact, MD should have nothing to do with that. I do like/prefer the
way that XFS currently does it: disables barriers and logs the event,
but otherwise doesn't try to enforce policy upon me from kernel space.

Cheers

2009-08-31 13:21:39

by Christoph Hellwig

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Mon, Aug 31, 2009 at 09:19:58AM -0400, Mark Lord wrote:
>> In my opinion even that is too weak. We know how to control the cache
>> settings on all common disks (that is scsi and ata), so we should always
>> disable the write cache unless we know that the whole stack (filesystem,
>> raid, volume managers) supports barriers. And even then we should make
>> sure the filesystems does actually use barriers everywhere that's needed
>> which failed at for years.
> ..
>
> That stack does not know that my MD device has full battery backup,
> so it bloody well better NOT prevent me from enabling the write caches.

No one is going to prevent you from doing it. The question is one of
sane defaults. And "always safe, but slower if you have advanced
equipment" is a much better default than "unsafe by default" on most of
the install base.


2009-08-31 13:22:48

by Ric Wheeler

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 08/31/2009 09:16 AM, Christoph Hellwig wrote:
> On Mon, Aug 31, 2009 at 09:15:27AM -0400, Ric Wheeler wrote:
>>> While most common filesystem do have barrier support it is:
>>>
>>> - not actually enabled for the two most common filesystems
>>> - the support for write barriers an cache flushing tends to be buggy
>>> all over our software stack,
>>>
>>
>> Or just missing - I think that MD5/6 simply drop the requests at present.
>>
>> I wonder if it would be worth having MD probe for write cache enabled&
>> warn if barriers are not supported?
>
> In my opinion even that is too weak. We know how to control the cache
> settings on all common disks (that is scsi and ata), so we should always
> disable the write cache unless we know that the whole stack (filesystem,
> raid, volume managers) supports barriers. And even then we should make
> sure the filesystems does actually use barriers everywhere that's needed
> which failed at for years.
>

I was thinking about that as well. Having us disable the write cache when we
know barriers are not supported (like in the MD5 case) would certainly be *much*
safer for almost everyone.

We would need to have a way to override the write cache disabling for people who
know that they have a non-volatile write cache (unlikely as it would
probably be to put MD5 on top of a hardware RAID/external array, but some of the
new SSDs claim to have non-volatile write cache).

It would also be very useful to have all of our top tier file systems enable
barriers by default, provide consistent barrier on/off mount options and log a
nice warning when not enabled....
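
The default being discussed, with the override, could be sketched roughly as below. This is a hypothetical policy function, not an existing kernel or mdadm interface; the layer names and the returned reason strings are invented for illustration.

```python
def keep_write_cache_enabled(layer_supports_barriers, nonvolatile_override=False):
    """Decide whether the drive write cache may stay on.

    layer_supports_barriers maps a layer name (filesystem, raid, volume
    manager, ...) to True or False. nonvolatile_override is the admin
    stating that the cache is battery backed or otherwise non-volatile.
    Returns (decision, reason) so the reason can be logged.
    """
    if nonvolatile_override:
        return True, "override: cache declared non-volatile"
    broken = sorted(n for n, ok in layer_supports_barriers.items() if not ok)
    if broken:
        return False, "no barrier support in: " + ", ".join(broken)
    return True, "all layers support barriers"
```

Returning the reason alongside the decision leaves room for the "log a nice warning" part without forcing the policy on anyone who sets the override.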

ric


2009-08-31 15:14:06

by jim owens

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

Christoph Hellwig wrote:
> On Mon, Aug 31, 2009 at 09:19:58AM -0400, Mark Lord wrote:
>>> In my opinion even that is too weak. We know how to control the cache
>>> settings on all common disks (that is scsi and ata), so we should always
>>> disable the write cache unless we know that the whole stack (filesystem,
>>> raid, volume managers) supports barriers. And even then we should make
>>> sure the filesystems does actually use barriers everywhere that's needed
>>> which failed at for years.
>> ..
>>
>> That stack does not know that my MD device has full battery backup,
>> so it bloody well better NOT prevent me from enabling the write caches.
>
> No one is going to prevent you from doing it. That question is one of
> sane defaults. And always safe, but slower if you have advanced
> equipment is a much better default than usafe by default on most of
> the install base.

I've always agreed with "be safe first" and have worked where
we always shut write cache off unless we knew it had battery.

But before we make disabling cache the default, this is the impact:

- users will see it as a performance regression

- trashy OS vendors who never disable cache will benchmark
better than "out of the box" linux.

Because as we all know, users don't read release notes.

Been there, done that, felt the pain.

jim

2009-08-31 15:52:15

by David Lang

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Mon, 31 Aug 2009, Ric Wheeler wrote:

> On 08/31/2009 09:16 AM, Christoph Hellwig wrote:
>> On Mon, Aug 31, 2009 at 09:15:27AM -0400, Ric Wheeler wrote:
>>>> While most common filesystem do have barrier support it is:
>>>>
>>>> - not actually enabled for the two most common filesystems
>>>> - the support for write barriers an cache flushing tends to be buggy
>>>> all over our software stack,
>>>>
>>>
>>> Or just missing - I think that MD5/6 simply drop the requests at present.
>>>
>>> I wonder if it would be worth having MD probe for write cache enabled&
>>> warn if barriers are not supported?
>>
>> In my opinion even that is too weak. We know how to control the cache
>> settings on all common disks (that is scsi and ata), so we should always
>> disable the write cache unless we know that the whole stack (filesystem,
>> raid, volume managers) supports barriers. And even then we should make
>> sure the filesystems does actually use barriers everywhere that's needed
>> which failed at for years.
>>
>
> I was thinking about that as well. Having us disable the write cache when we
> know it is not supported (like in the MD5 case) would certainly be *much*
> safer for almost everyone.
>
> We would need to have a way to override the write cache disabling for people
> who either know that they have a non-volatile write cache (unlikely as it
> would probably be to put MD5 on top of a hardware RAID/external array, but
> some of the new SSD's claim to have non-volatile write cache).

I've done this when the hardware raid only supported raid 5 but I wanted
raid 6. I've also done it when I had enough disks to need more than one
hardware raid card to talk to them all, but wanted one logical drive for
the system.

> It would also be very useful to have all of our top tier file systems enable
> barriers by default, provide consistent barrier on/off mount options and log
> a nice warning when not enabled....

most people are not willing to live with unbuffered write performance.
they care about their data, but they also care about performance, and
since performance is what they see on an ongoing basis, they tend to care
more about performance.

given that we don't even have barriers enabled by default on ext3 due to
the performance hit, what makes you think that disabling buffers entirely
is going to be acceptable to people?

David Lang

2009-08-31 16:21:17

by Ric Wheeler

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 08/31/2009 11:50 AM, [email protected] wrote:
> On Mon, 31 Aug 2009, Ric Wheeler wrote:
>
>> On 08/31/2009 09:16 AM, Christoph Hellwig wrote:
>>> On Mon, Aug 31, 2009 at 09:15:27AM -0400, Ric Wheeler wrote:
>>>>> While most common filesystem do have barrier support it is:
>>>>>
>>>>> - not actually enabled for the two most common filesystems
>>>>> - the support for write barriers an cache flushing tends to be buggy
>>>>> all over our software stack,
>>>>>
>>>>
>>>> Or just missing - I think that MD5/6 simply drop the requests at
>>>> present.
>>>>
>>>> I wonder if it would be worth having MD probe for write cache enabled&
>>>> warn if barriers are not supported?
>>>
>>> In my opinion even that is too weak. We know how to control the cache
>>> settings on all common disks (that is scsi and ata), so we should always
>>> disable the write cache unless we know that the whole stack (filesystem,
>>> raid, volume managers) supports barriers. And even then we should make
>>> sure the filesystems does actually use barriers everywhere that's needed
>>> which failed at for years.
>>>
>>
>> I was thinking about that as well. Having us disable the write cache
>> when we know it is not supported (like in the MD5 case) would
>> certainly be *much* safer for almost everyone.
>>
>> We would need to have a way to override the write cache disabling for
>> people who either know that they have a non-volatile write cache
>> (unlikely as it would probably be to put MD5 on top of a hardware
>> RAID/external array, but some of the new SSD's claim to have
>> non-volatile write cache).
>
> I've done this when the hardware raid only suppored raid 5 but I wanted
> raid 6. I've also done it when I had enough disks to need more than one
> hardware raid card to talk to them all, but wanted one logical drive for
> the system.
>
>> It would also be very useful to have all of our top tier file systems
>> enable barriers by default, provide consistent barrier on/off mount
>> options and log a nice warning when not enabled....
>
> most people are not willing to live with unbuffered write performance.
> they care about their data, but they also care about performance, and
> since performance is what they see on an ongong basis, they tend to care
> more about performance.
>
> given that we don't even have barriers enabled by default on ext3 due to
> the performance hit, what makes you think that disabling buffers
> entirely is going to be acceptable to people?
>
> David Lang

We do (and have for a number of years) enable barriers by default for XFS and
reiserfs. In SLES, ext3 has default barriers as well.

Ric


2009-08-31 17:56:53

by Jesse Brandeburg

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Sun, Aug 30, 2009 at 8:20 AM, Theodore Tso<[email protected]> wrote:
> So we *do* have the warning light; the problem is that just as some
> people may not realize that "check brakes" means, "YOU COULD DIE",
> some people may not realize that "hard drive failure; RAID array
> degraded" could mean, "YOU COULD LOSE DATA".
>
> Fortunately, for software RAID, this is easily solved; if you are so
> concerned, why don't you submit a patch to mdadm adjusting the e-mail
> sent to the system administrator when the array is in a degraded
> state, such that it states, "YOU COULD LOSE DATA". I would gently
> suggest to you this would be ***far*** more effective that a patch to
> kernel documentation.

In the case of a degraded array, could the kernel be more proactive
(or maybe even mdadm) and have the filesystem remount itself withOUT
journalling enabled? This seems on the surface to be possible, but I
don't know the internal particulars that might prevent/allow it.

2009-08-31 18:01:06

by Ric Wheeler

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 08/31/2009 01:49 PM, Jesse Brandeburg wrote:
> On Sun, Aug 30, 2009 at 8:20 AM, Theodore Tso<[email protected]> wrote:
>> So we *do* have the warning light; the problem is that just as some
>> people may not realize that "check brakes" means, "YOU COULD DIE",
>> some people may not realize that "hard drive failure; RAID array
>> degraded" could mean, "YOU COULD LOSE DATA".
>>
>> Fortunately, for software RAID, this is easily solved; if you are so
>> concerned, why don't you submit a patch to mdadm adjusting the e-mail
>> sent to the system administrator when the array is in a degraded
>> state, such that it states, "YOU COULD LOSE DATA". I would gently
>> suggest to you this would be ***far*** more effective that a patch to
>> kernel documentation.
>
> In the case of a degraded array, could the kernel be more proactive
> (or maybe even mdadm) and have the filesystem remount itself withOUT
> journalling enabled? This seems on the surface to be possible, but I
> don't know the internal particulars that might prevent/allow it.

This is a misconception - with or without journalling, you are open to a second
failure during a RAID rebuild.

Also note that by default, ext3 does not mount with barriers turned on.

Even if you mount with barriers, MD5 does not handle barriers, so you stand to
lose a lot of data if you have a power outage.

Ric

2009-08-31 18:16:36

by martin f krafft

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

also sprach Jesse Brandeburg <[email protected]> [2009.08.31.1949 +0200]:
> In the case of a degraded array, could the kernel be more
> proactive (or maybe even mdadm) and have the filesystem remount
> itself withOUT journalling enabled? This seems on the surface to
> be possible, but I don't know the internal particulars that might
> prevent/allow it.

Why would I want to disable the filesystem journal in that case?

--
.''`. martin f. krafft <[email protected]> Related projects:
: :' : proud Debian developer http://debiansystem.info
`. `'` http://people.debian.org/~madduck http://vcs-pkg.org
`- Debian - when you have better things to do than fixing systems

"i can stand brute force, but brute reason is quite unbearable. there
is something unfair about its use. it is hitting below the
intellect."
-- oscar wilde



2009-08-31 18:32:04

by Christoph Hellwig

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Mon, Aug 31, 2009 at 08:50:53AM -0700, [email protected] wrote:
>> It would also be very useful to have all of our top tier file systems
>> enable barriers by default, provide consistent barrier on/off mount
>> options and log a nice warning when not enabled....
>
> most people are not willing to live with unbuffered write performance.

I'm not sure what you mean by unbuffered write support; the only
common use of that term is for userspace I/O using the read/write
system calls directly, in comparison to buffered I/O, which uses
the stdio library.

But rest assured that the use of barriers and cache flushes in fsync does
not completely disable caching (or "buffering"); it just flushes
the disk write cache when we either commit a log buffer that needs to
be on disk, or perform an fsync where we really do want to have data
on disk instead of lying to the application about the status of the
I/O completion. Which, btw, could be interpreted as a violation of the
Posix rules.
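
From the application side, the two layers of buffering look roughly like this. A minimal sketch using the standard Python calls; `durable_write` is a hypothetical helper name, and whether the final step really reaches the platters depends on the barrier and flush support in the stack underneath.

```python
import os


def durable_write(path, data):
    """Write bytes and ask the kernel to push them to stable storage."""
    with open(path, "wb") as f:
        f.write(data)          # lands in the userspace (stdio-style) buffer
        f.flush()              # userspace buffer -> kernel page cache
        os.fsync(f.fileno())   # page cache -> device, may flush the disk cache
```

Everything written without the `os.fsync()` call stays cached as usual, which is the point: only the fsync and log commit paths pay for the flush.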


2009-08-31 19:12:38

by David Lang

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Mon, 31 Aug 2009, Christoph Hellwig wrote:

> On Mon, Aug 31, 2009 at 08:50:53AM -0700, [email protected] wrote:
>>> It would also be very useful to have all of our top tier file systems
>>> enable barriers by default, provide consistent barrier on/off mount
>>> options and log a nice warning when not enabled....
>>
>> most people are not willing to live with unbuffered write performance.
>
> I'm not sure what you mean with unbuffered write support, the only
> common use of that term is for userspace I/O using the read/write
> sysctem calls directly in comparism to buffered I/O which uses
> the stdio library.
>
> But be ensure that the use of barriers and cache flushes in fsync does not
> completely disable caching (or "buffering"), it just does flush flushes
> the disk write cache in case we either commit a log buffer than need to
> be on disk, or performan an fsync where we really do want to have data
> on disk instead of lying to the application about the status of the
> I/O completion. Which btw could be interpreted as a violation of the
> Posix rules.

as I understood it, the proposal that I responded to was to change the
kernel to detect if barriers are enabled for the entire stack or not, and
if not disable the write caches on the drives.

there are definitely times when that is the correct thing to do, but I
am not sure that it is the correct thing to do by default.

David Lang

2009-08-31 21:22:52

by Ron Johnson

[permalink] [raw]
Subject: MD5/6? (was Re: raid is dangerous but that's secret ...)

On 2009-08-31 13:01, Ric Wheeler wrote:
[snip]
>
> Even if you mount with barriers, MD5 does not handle barriers, so you
> stand to lose a lot of data if you have a power outage.

Pardon me for asking such a seemingly obvious question, but what
(besides "Message-Digest algorithm 5") is MD5?

(I've always seen "multiple drive" written in the lower case "md".)

--
Brawndo's got what plants crave. It's got electrolytes!

2009-08-31 22:26:41

by Jesse Brandeburg

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Mon, Aug 31, 2009 at 11:07 AM, martin f krafft<[email protected]> wrote:
> also sprach Jesse Brandeburg <[email protected]> [2009.08.31.1949 +0200]:
>> In the case of a degraded array, could the kernel be more
>> proactive (or maybe even mdadm) and have the filesystem remount
>> itself withOUT journalling enabled? This seems on the surface to
>> be possible, but I don't know the internal particulars that might
>> prevent/allow it.
>
> Why would I want to disable the filesystem journal in that case?

I misspoke w.r.t. journalling; the idea I was trying to get across was
to remount with -o sync while running on a degraded array, but given
some of the other comments in this thread I'm not even sure that would
help. The idea was to make writes as safe as possible (at the cost of
speed) when running on a degraded array, and to have the transition be
as hands-free as possible, just have the kernel (or mdadm) by default
remount.

2009-08-31 23:19:24

by Ron Johnson

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 2009-08-31 17:26, Jesse Brandeburg wrote:
> On Mon, Aug 31, 2009 at 11:07 AM, martin f krafft<[email protected]> wrote:
>> also sprach Jesse Brandeburg <[email protected]> [2009.08.31.1949 +0200]:
>>> In the case of a degraded array, could the kernel be more
>>> proactive (or maybe even mdadm) and have the filesystem remount
>>> itself withOUT journalling enabled? This seems on the surface to
>>> be possible, but I don't know the internal particulars that might
>>> prevent/allow it.
>> Why would I want to disable the filesystem journal in that case?
>
> I misspoke w.r.t journalling, the idea I was trying to get across was
> to remount with -o sync while running on a degraded array, but given
> some of the other comments in this thread I'm not even sure that would
> help. the idea was to make writes as safe as possible (at the cost of
> speed) when running on a degraded array, and to have the transition be
> as hands-free as possible, just have the kernel (or mdadm) by default
> remount.

Much better, I'd think, to "just" have it scream out DANGER!! WILL
ROBINSON!! DANGER!! to syslog and to an email hook.
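
A rough sketch of the detection half, assuming the usual /proc/mdstat layout in which each array's status line carries a `[configured/working]` pair such as `[3/2]`. The helper names are hypothetical, and in practice `mdadm --monitor` already does this job; the sketch only shows how little is needed to spot the degraded state.

```python
import re

ARRAY_LINE = re.compile(r"^(md\S*)\s*:")
STATUS = re.compile(r"\[(\d+)/(\d+)\]")   # [configured/working], e.g. [3/2]


def degraded_arrays(mdstat_text):
    """Return names of arrays with fewer working devices than configured."""
    degraded, current = [], None
    for line in mdstat_text.splitlines():
        name = ARRAY_LINE.match(line)
        if name:
            current = name.group(1)
            continue
        status = STATUS.search(line)
        if status and current:
            total, working = map(int, status.groups())
            if working < total:
                degraded.append(current)
            current = None
    return degraded


def danger_message(array):
    """Text for syslog or an email hook."""
    return f"DANGER: {array} is degraded. YOU COULD LOSE DATA."
```

The strings from `danger_message()` can then go to syslog or a mail command, whichever the administrator actually reads.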

--
Brawndo's got what plants crave. It's got electrolytes!

2009-09-01 05:45:07

by martin f krafft

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

also sprach Jesse Brandeburg <[email protected]> [2009.09.01.0026 +0200]:
> I misspoke w.r.t journalling, the idea I was trying to get across
> was to remount with -o sync while running on a degraded array, but
> given some of the other comments in this thread I'm not even sure
> that would help. the idea was to make writes as safe as possible
> (at the cost of speed) when running on a degraded array, and to
> have the transition be as hands-free as possible, just have the
> kernel (or mdadm) by default remount.

I don't see how that is any more necessary with a degraded array
than it is when you have a fully working array. Sync just ensures
that the data are written and not cached, but that has absolutely
nothing to do with the underlying storage. Or am I failing to see
the link?

--
.''`. martin f. krafft <[email protected]> Related projects:
: :' : proud Debian developer http://debiansystem.info
`. `'` http://people.debian.org/~madduck http://vcs-pkg.org
`- Debian - when you have better things to do than fixing systems

"how do you feel about women's rights?"
"i like either side of them."
-- groucho marx



2009-09-01 13:58:47

by Pavel Machek

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)


>> Interesting. So, what's technically wrong with the patch below?
>
> My suggestion was that you stop trying to document your assertion of an
> issue and actually suggest fixes in code or implementation. I really
> don't think that you have properly diagnosed your specific failure or
> done sufficient. However, if you put a full analysis and suggested code
> out to the MD devel lists, we can debate technical implementation as we
> normally do.

I don't think I should be required to rewrite linux md layer in order
to fix documentation.

> The only note that I would put in ext3/4 etc documentation would be:
>
> "Reliable storage is important for any file system. Single disks (or
> FLASH or SSD) do fail on a regular basis.

Uh, how clever, instead of documenting that our md raid code does not
always work as expected, you document that components fail. Newspeak
101?

You even failed to mention the little design problem with flash and
eraseblock size... and the fact that you don't need flash to fail to
get data loss.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-02 20:55:58

by Pavel Machek

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Sun 2009-08-30 02:01:10, Christian Kujau wrote:
> On Sun, 30 Aug 2009 at 09:51, Pavel Machek wrote:
> > > give system administrators. It's better than the fear-mongering
> > > patches you had proposed earlier, but what would be better *still* is
> > > telling people why running with degraded RAID arrays is bad, and to
> > > give them further tips about how to use RAID arrays safely.
> >
> > Maybe this belongs to Doc*/filesystems, and more detailed RAID
> > description should go to md description?
>
> Why should this be placed in *kernel* documentation anyway? The "dangers
> of RAID", the hints that "backups are a good idea" - isn't that something
> for howtos for sysadmins? No end-user will ever look into

The fact that two kernel subsystems (MD RAID, journaling filesystems)
do not work well together is surprising and should be documented near
the source.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-03 01:59:43

by Ric Wheeler

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 08/31/2009 09:21 AM, Christoph Hellwig wrote:
> On Mon, Aug 31, 2009 at 09:19:58AM -0400, Mark Lord wrote:
>>> In my opinion even that is too weak. We know how to control the cache
>>> settings on all common disks (that is scsi and ata), so we should always
>>> disable the write cache unless we know that the whole stack (filesystem,
>>> raid, volume managers) supports barriers. And even then we should make
>>> sure the filesystem actually uses barriers everywhere they are needed,
>>> which we failed at for years.
>> ..
>>
>> That stack does not know that my MD device has full battery backup,
>> so it bloody well better NOT prevent me from enabling the write caches.
>
> No one is going to prevent you from doing it. That question is one of
> sane defaults. And always safe, but slower if you have advanced
> equipment is a much better default than unsafe by default on most of
> the install base.
>

Just to add some support to this, all of the external RAID arrays that I know of
normally run with write cache disabled on the component drives. In addition,
many of them will disable their internal write cache if/when they detect that
they have lost their UPS.

I think that if we had done this kind of sane default earlier for MD levels that
do not handle barriers, we would not have left some people worried about our
software RAID.

To be clear, if a sophisticated user wants to override this default, that should
be supported. It is not (in my opinion) a safe default behaviour.

Ric


2009-09-03 09:47:29

by Pavel Machek

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Tue 2009-08-25 21:00:18, Theodore Tso wrote:
> On Tue, Aug 25, 2009 at 08:31:21PM -0400, Ric Wheeler wrote:
> >>> You are simply incorrect, Ted did not say that ext3 does not work
> >>> with MD raid5.
> >>
> >> http://lkml.org/lkml/2009/8/25/312
> >
> > I will let Ted clarify his text on his own, but the quoted text says "...
> > have potential...".
> >
> > Why not ask Neil if he designed MD to not work properly with ext3?
>
> So let me clarify by saying the following things.
>
> 1) Filesystems are designed to expect that storage devices have
> certain properties. These include returning the same data that you
> wrote, and that an error when writing a sector, or a power failure
> when writing sector, should not be amplified to cause collateral
> damage with previously successfully written sectors.

Yes. Unfortunately, different filesystems expect different properties
from block devices. ext3 will work with write cache enabled/barriers
enabled, while ext2 needs write cache disabled.

The requirements are also quite surprising; AFAICT ext3 can handle a
disk writing garbage to a single sector during a power failure, while
xfs cannot handle that.

Now, how do you expect users to know these subtle details when it is
not documented anywhere? And why are you fighting against documenting
these subtleties?

> Secondly, what's the probability of a failure causing the RAID array to
> become degraded, followed by a power failure, versus a power failure
> while the RAID array is not running in degraded mode? Hopefully you
> are running with the RAID array in full, proper running order a much
> larger percentage of the time than running with the RAID array in
> degraded mode. If not, the bug is with the system administrator!

As was uncovered, MD RAID does not properly support barriers,
so... you don't actually need drive failure.

> Maybe a random OS engineer doesn't know these things --- but trust me
> when I say a competent system administrator had better be familiar
> with these concepts. And someone who wants their data to be
> reliably

Trust me, 99% of sysadmins are not competent by your definition. So
this should be documented.

> At the end of the day, filesystems are not magic. They can't
> compensate for crap hardware, or incompetently administered machines.

ext3 greatly contributes to administrator incompetency:

# The journal supports the transactions start and stop, and in case of a
# crash, the journal can replay the transactions to quickly put the
# partition back into a consistent state.

...it does not mention that (non-default!) barrier=1 is needed to make
this reliable, nor does it mention that there are certain requirements
for this to work. It just says that the journal will magically help you.

And you wonder why people expect magic from your filesystem?

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-03 11:12:23

by Krzysztof Halasa

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

Ric Wheeler <[email protected]> writes:

> Just to add some support to this, all of the external RAID arrays that
> I know of normally run with write cache disabled on the component
> drives.

Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones?
--
Krzysztof Halasa

2009-09-03 11:18:10

by Ric Wheeler

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 09/03/2009 07:12 AM, Krzysztof Halasa wrote:
> Ric Wheeler<[email protected]> writes:
>
>> Just to add some support to this, all of the external RAID arrays that
>> I know of normally run with write cache disabled on the component
>> drives.
>
> Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones?

Which drives the various vendors ship changes with specific products. Usually, they
ship drives that have carefully vetted firmware, etc. but they are close to the
same drives you buy on the open market.

Seagate has a huge slice of the market,

ric


2009-09-03 13:34:42

by Krzysztof Halasa

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

Ric Wheeler <[email protected]> writes:

>>> Just to add some support to this, all of the external RAID arrays that
>>> I know of normally run with write cache disabled on the component
>>> drives.
>>
>> Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones?
>
> Which drives the various vendors ship changes with specific products.
> Usually, they ship drives that have carefully vetted firmware, etc.
> but they are close to the same drives you buy on the open market.

But they aren't the same, are they? If they are not, the fact they can
run well with the write-through cache doesn't mean the off-the-shelf
ones can do as well.

Are they SATA (or PATA) at all? SCSI etc. are usually different
animals, though there are SCSI and SATA models which differ only in
electronics.

Do you have battery-backed write-back RAID cache (which acknowledges
flushes before the data is written out to disks)? PC can't do that.
--
Krzysztof Halasa

2009-09-03 13:50:43

by Ric Wheeler

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 09/03/2009 09:34 AM, Krzysztof Halasa wrote:
> Ric Wheeler<[email protected]> writes:
>
>>>> Just to add some support to this, all of the external RAID arrays that
>>>> I know of normally run with write cache disabled on the component
>>>> drives.
>>>
>>> Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones?
>>
>> Which drives the various vendors ship changes with specific products.
>> Usually, they ship drives that have carefully vetted firmware, etc.
>> but they are close to the same drives you buy on the open market.
>
> But they aren't the same, are they? If they are not, the fact they can
> run well with the write-through cache doesn't mean the off-the-shelf
> ones can do as well.

Storage vendors have a wide range of options, but what you get today is a
collection of S-ATA (not much any more), SAS or FC.

Sometimes they will have different firmware, other times it is the same.


>
> Are they SATA (or PATA) at all? SCSI etc. are usually different
> animals, though there are SCSI and SATA models which differ only in
> electronics.
>
> Do you have battery-backed write-back RAID cache (which acknowledges
> flushes before the data is written out to disks)? PC can't do that.

We (red hat) have all kinds of different raid boxes...

ric



2009-09-03 13:59:13

by Krzysztof Halasa

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

Ric Wheeler <[email protected]> writes:

> We (red hat) have all kinds of different raid boxes...

I have no doubt about it, but are those you know equipped with
battery-backed write-back cache? Are they using SATA disks?

We can _at_best_ compare non-battery-backed RAID using SATA disks with
what we typically have in a PC.
--
Krzysztof Halasa

2009-09-03 14:15:58

by Ric Wheeler

Subject: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

On 09/03/2009 09:59 AM, Krzysztof Halasa wrote:
> Ric Wheeler<[email protected]> writes:
>
>> We (red hat) have all kinds of different raid boxes...
>
> I have no doubt about it, but are those you know equipped with
> battery-backed write-back cache? Are they using SATA disks?
>
> We can _at_best_ compare non-battery-backed RAID using SATA disks with
> what we typically have in a PC.

The whole thread above is about software MD using commodity drives (S-ATA or
SAS) without battery backed write cache.

We have that (and I have it personally) and do test it.

You must disable the write cache on these commodity drives *if* the MD RAID
level does not support barriers properly.

This will greatly reduce errors after a power loss (both in degraded state and
non-degraded state), but it will not eliminate data loss entirely. You simply
cannot do that with any storage device!

Note that even without MD raid, the file system issues IOs in file system block
size (4096 bytes normally) and most commodity storage devices use a 512 byte
sector size, which means that we have to update eight 512-byte sectors.

Drives can (and do) have multiple platters and surfaces and it is perfectly
normal to have contiguous logical ranges of sectors map to non-contiguous
sectors physically. Imagine a 4KB write stripe that straddles two adjacent
tracks on one platter (requiring a seek) or mapped across two surfaces
(requiring a head switch). Also, a remapped sector can require more or less a
full surface seek from wherever you are to the remapped sector area of the drive.

These are all cases in which a power loss can leave even a local (non-MD)
device with a partial update of that 4KB write range of sectors. Note that
unlike RAID/MD, local storage has no parity on the server to detect this
partial write.

This is why new file systems like btrfs and zfs do checksumming of data and
metadata. This won't prevent partial updates during a write, but can at least
detect them and try to do some kind of recovery.
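
To make the torn-write case concrete, here is a minimal Python sketch. It is not
taken from btrfs or zfs: the names `torn_write`, `old_block` and `new_block` are
made up for illustration, and CRC32 merely stands in for whatever checksum a real
file system uses. It models a 4096-byte block as eight 512-byte sectors, lets
only some of them reach the media, and checks whether a checksum kept separately
from the block notices the partial update.

```python
import zlib

SECTOR = 512   # commodity drive sector size, in bytes
BLOCK = 4096   # typical file system block size, in bytes
SECTORS_PER_BLOCK = BLOCK // SECTOR


def torn_write(old, new, sectors_written):
    """Return the block as left on disk when power fails after only
    `sectors_written` sectors of the new block reached the media."""
    cut = sectors_written * SECTOR
    return new[:cut] + old[cut:]


old_block = bytes([0xAA]) * BLOCK
new_block = bytes([0x55]) * BLOCK

# Assumption: the checksum lives outside the block it covers, so it
# cannot be torn together with the data.
expected_crc = zlib.crc32(new_block)

# Power fails after 3 of the 8 sectors were written.
on_disk = torn_write(old_block, new_block, sectors_written=3)
torn_detected = zlib.crc32(on_disk) != expected_crc

# A write that completed all 8 sectors verifies cleanly.
complete = torn_write(old_block, new_block, SECTORS_PER_BLOCK)
complete_ok = zlib.crc32(complete) == expected_crc
```

The checksum only detects the damage; recovering the block still needs a second
copy or a log somewhere else.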

In other words, this is not just an MD issue, it is entirely possible even with
non-MD devices.

Also, when you enable the write cache (MD or not) you are buffering multiple
MBs of data that can go away on power loss. That is far greater (10x) exposure
than the partial RAID rewrite case worries about.

ric

2009-09-03 14:26:55

by Florian Weimer

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

* Ric Wheeler:

> Note that even without MD raid, the file system issues IO's in file
> system block size (4096 bytes normally) and most commodity storage
> devices use a 512 byte sector size which means that we have to update
> 8 512b sectors.

Database software often attempts to deal with this phenomenon
(sometimes called "torn page writes"). For example, you can make sure
that the first time you write to a database page, you keep a full copy
in your transaction log. If the machine crashes, the log is replayed,
first completely overwriting the partially-written page. Only after
that, you can perform logical/incremental logging.

The log itself has to be protected with a different mechanism, so that
you don't try to replay bad data. But you haven't committed to this
data yet, so it is fine to skip bad records.
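
The scheme described above can be sketched in a few lines of Python. This is an
illustrative model only, not how any particular database implements it:
`log_record` and `replay` are hypothetical helpers, pages live in a dict, and
CRC32 stands in for the "different mechanism" that protects the log.

```python
import zlib

PAGE = 4096


def log_record(page_no, image):
    """Full copy of the page plus a CRC, so replay can reject bad records."""
    return {"page": page_no, "image": image, "crc": zlib.crc32(image)}


def replay(log, pages):
    """Overwrite pages with their logged images; skip damaged records."""
    applied = 0
    for rec in log:
        if zlib.crc32(rec["image"]) != rec["crc"]:
            continue  # bad record: we never committed to it, so skip it
        pages[rec["page"]] = rec["image"]
        applied += 1
    return applied


old = b"o" * PAGE
new = b"n" * PAGE
pages = {0: old}

# First write to page 0: keep a full copy in the log before touching the page.
log = [log_record(0, new)]

# A second record that was itself damaged while being written.
bad = log_record(1, b"x" * PAGE)
bad["image"] = b"y" + bad["image"][1:]
log.append(bad)

# Crash in the middle of updating page 0: only the first 1024 bytes landed.
pages[0] = new[:1024] + old[1024:]

applied = replay(log, pages)
```

After replay, page 0 holds the complete new image again and the damaged record
for page 1 has been ignored.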

Therefore, sub-page corruption is a fundamentally different issue from
super-page corruption.

BTW, older textbooks will tell you that mirroring requires that you
read from two copies of the data and compare it (and have some sort of
tie breaker if you need availability). And you also have to re-read
data you've just written to disk, to make sure it's actually there and
hit the expected sectors. We can't even do this anymore, thanks to
disk caches. And it doesn't seem to be necessary in most cases.

--
Florian Weimer <[email protected]>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

2009-09-03 14:36:14

by David Lang

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Thu, 3 Sep 2009, Krzysztof Halasa wrote:

> Ric Wheeler <[email protected]> writes:
>
>>>> Just to add some support to this, all of the external RAID arrays that
>>>> I know of normally run with write cache disabled on the component
>>>> drives.
>>>
>>> Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones?
>>
>> Which drives the various vendors ship changes with specific products.
>> Usually, they ship drives that have carefully vetted firmware, etc.
>> but they are close to the same drives you buy on the open market.
>
> But they aren't the same, are they? If they are not, the fact they can
> run well with the write-through cache doesn't mean the off-the-shelf
> ones can do as well.

frequently they are exactly the same drives, with exactly the same
firmware.

you disable the write caches on the drives themselves, but you add a large
write cache (with battery backup) in the raid card/chassis

> Are they SATA (or PATA) at all? SCSI etc. are usually different
> animals, though there are SCSI and SATA models which differ only in
> electronics.

it depends on what raid array you use, some use SATA, some use SAS/SCSI

David Lang

2009-09-03 15:08:54

by Ric Wheeler

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

On 09/03/2009 10:26 AM, Florian Weimer wrote:
> * Ric Wheeler:
>
>> Note that even without MD raid, the file system issues IO's in file
>> system block size (4096 bytes normally) and most commodity storage
>> devices use a 512 byte sector size which means that we have to update
>> 8 512b sectors.
>
> Database software often attempts to deal with this phenomenon
> (sometimes called "torn page writes"). For example, you can make sure
> that the first time you write to a database page, you keep a full copy
> in your transaction log. If the machine crashes, the log is replayed,
> first completely overwriting the partially-written page. Only after
> that, you can perform logical/incremental logging.
>
> The log itself has to be protected with a different mechanism, so that
> you don't try to replay bad data. But you haven't committed to this
> data yet, so it is fine to skip bad records.

Yes - databases worry a lot about this. Another technique that they tend to use
is to have state bits at the beginning and end of their logical pages. For
example, the first byte and last byte toggle together from 1 to 0 to 1 to 0 as
you update.

If the bits don't match, that is a quick first-level indication of a torn write.
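
A rough Python sketch of that state-bit idea, under stated assumptions: the page
is 4096 bytes, the first and last byte carry the same one-bit generation flag,
and `stamp` and `looks_torn` are made-up names rather than any database's API.

```python
PAGE = 4096
SECTOR = 512


def stamp(payload, generation):
    """Wrap the payload with the same generation flag at both ends."""
    if len(payload) != PAGE - 2:
        raise ValueError("payload must leave room for the two flag bytes")
    flag = bytes([generation & 1])
    return flag + payload + flag


def looks_torn(page):
    """Quick check: the first and last byte should always agree."""
    return page[0] != page[-1]


v0 = stamp(b"a" * (PAGE - 2), generation=0)
v1 = stamp(b"b" * (PAGE - 2), generation=1)

# Power loss after the first 3 sectors of v1 landed on top of v0.
torn_tail = v1[:3 * SECTOR] + v0[3 * SECTOR:]

# First and last sectors landed, the middle did not (writes were reordered).
torn_middle = v1[:SECTOR] + v0[SECTOR:-SECTOR] + v1[-SECTOR:]
```

In this model `looks_torn(torn_tail)` is true, while `looks_torn(torn_middle)`
is false: the check is cheap, but it only catches tears that separate the two
ends of the page.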

Even with the above scheme, you can still have data loss of course - you just
need an IO error in the log and in your db table that was recently updated. Not
entirely unlikely, especially if you use write cache enabled storage and don't
flush that cache :-)

>
> Therefore, sub-page corruption is a fundamentally different issue from
> super-page corruption.

We have to be careful to keep our terms clear since the DB pages are (usually)
larger than the FS block size which in turn is larger than non-RAID storage
sector size. At the FS level, we send down multiples of fs blocks (not
blocked/aligned at RAID stripe levels, etc).

In any case, we can get sub-FS block level "torn writes" even with a local S-ATA
drive in edge conditions.


>
> BTW, older textbooks will tell you that mirroring requires that you
> read from two copies of the data and compare it (and have some sort of
> tie breaker if you need availability). And you also have to re-read
> data you've just written to disk, to make sure it's actually there and
> hit the expected sectors. We can't even do this anymore, thanks to
> disk caches. And it doesn't seem to be necessary in most cases.
>

We can do something like this with the built in RAID in btrfs. If you detect an
IO error (or bad checksum) on a read, btrfs knows how to request/grab another copy.

Also note that the SCSI T10 DIF/DIX has baked in support for applications to
layer on extra data integrity (look for MKP's slide decks). This is really neat
since you can intercept bad IO's on the way down and prevent overwriting good data.

ric


2009-09-03 16:57:55

by David Lang

Subject: what fsck can (and can't) do was Re: [patch] ext2/3: document conditions when reliable operation is possible

On Sat, 29 Aug 2009, Rob Landley wrote:

> On Saturday 29 August 2009 05:05:58 Pavel Machek wrote:
>> On Fri 2009-08-28 07:49:38, [email protected] wrote:
>>> On Thu, 27 Aug 2009, Rob Landley wrote:
>>>> Pavel's response was to attempt to document this. Not that journaling
>>>> is _bad_, but that it doesn't protect against this class of problem.
>>>
>>> I don't think anyone is disagreeing with the statement that journaling
>>> doesn't protect against this class of problems, but Pavel's statements
>>> didn't say that. he stated that ext3 is more dangerous than ext2.
>>
>> Well, if you use 'common' fsck policy, ext3 _is_ more dangerous.
>
> The filesystem itself isn't more dangerous, but it may provide a false sense of
> security when used on storage devices it wasn't designed for.

from this discussion (and the similar discussion on lwn.net) there appears
to be confusion/disagreement over what fsck does and what the results of
not running it are.

it has been stated here that fsck cannot fix broken data, all it tries to
do is to clean up metadata, but it would probably help to get a clear
statement of what exactly that means.

I know that it:

finds entries that don't actually have data and deletes them

finds entries where multiple files share data blocks and duplicates the
(bad for one file) data to separate them

finds blocks that have been orphaned (allocated, but no directory pointer
to them) and creates entries in lost+found

but if a fsck does not get run on a filesystem that has been damaged, what
additional damage can be done?

can it overwrite data that could have been saved?

can it cause new files that are created (or new data written to existing,
but uncorrupted files) to be lost?

or is it just a matter of not knowing about existing corruption?

David Lang


2009-09-03 19:27:25

by Theodore Ts'o

Subject: Re: what fsck can (and can't) do was Re: [patch] ext2/3: document conditions when reliable operation is possible

On Thu, Sep 03, 2009 at 09:56:48AM -0700, [email protected] wrote:
> from this discussion (and the similar discussion on lwn.net) there appears
> to be confusion/disagreement over what fsck does and what the results of
> not running it are.
>
> it has been stated here that fsck cannot fix broken data, all it tries to
> do is to clean up metadata, but it would probably help to get a clear
> statement of what exactly that means.

Let me give you my formulation of fsck which may be helpful. Fsck can
not fix broken data; and (particularly in fsck -y mode) may not even
recover the maximal amount of lost data caused by metadata corruption.
(This is why sometimes an expert using debugfs can recover more data
than fsck -y, and if you have some really precious data, like ten
years' worth of Ph.D. research that you've never bothered to back
up[1], the first thing you should do is buy a new hard drive and make a
sector-by-sector copy of the disk and *then* run fsck. A new
terabyte hard drive costs $100; how much is your data worth to you?)

[1] This isn't hypothetical; while I was at MIT this sort of thing
actually happened more than once --- which brings up the philosophical
question of whether someone who is that stupid about not doing backups
on critical data *deserves* to get a Ph.D. degree. :-)

Fsck's primary job is to make sure that further writes to the
filesystem, whether you are creating new files or removing directory
hierarchies, etc., will not cause *additional* data loss due to meta
data corruption in the file system. Its secondary goals are to
preserve as much data as possible, and to make sure that file system
metadata is valid (i.e., so that a block pointer contains a valid
block address, so that an attempt to read a file won't cause an I/O
error when the filesystem attempts to seek to a non-existent sector
on disk).

For some filesystems, invalid, corrupt metadata can actually cause a
system panic or oops message, so it's not necessarily safe to mount a
filesystem with corrupt metadata read-only without risking the need to
reboot the machine in question. More recently, there are folks who
have been filing security bugs when they detect such cases, so there
are fewer examples of such cases, but historically it was a good idea
to run fsck because otherwise it's possible the kernel might oops or
panic when it tripped over some particularly nasty metadata corruption.

> but if a fsck does not get run on a filesystem that has been damaged,
> what additional damage can be done?

Consider the case where there are data blocks in use by inodes,
containing precious data, but which are marked free in the filesystem's
allocation data structures (e.g., ext3's block bitmaps, but this
applies to pretty much any filesystem, whether it's xfs, reiserfs,
btrfs, etc.). When you create a new file on that filesystem, there's
a chance that blocks that really contain data belonging to other
inodes (perhaps the aforementioned ten years' of unbacked-up
Ph.D. thesis research) will get overwritten by the newly created file.
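
A toy Python model of that first case, purely for illustration (a four-block
"disk", a list as the block bitmap, and a hypothetical first-fit `allocate`;
real allocators are far more involved):

```python
def allocate(bitmap, count):
    """Hand out the first `count` blocks that the bitmap claims are free."""
    chosen = [blk for blk, in_use in enumerate(bitmap) if not in_use][:count]
    for blk in chosen:
        bitmap[blk] = True
    return chosen


disk = ["superblock", "thesis ch.1", "thesis ch.2", ""]
thesis_blocks = [1, 2]                 # blocks the thesis inode points to

# Corrupt bitmap: block 2 is in use by the thesis but is marked free.
bitmap = [True, True, False, False]

# Creating a new file trusts the bitmap and lands on block 2.
new_file_blocks = allocate(bitmap, 1)
for blk in new_file_blocks:
    disk[blk] = "new file data"

clobbered = sorted(set(new_file_blocks) & set(thesis_blocks))
```

Here `clobbered` comes out as `[2]`: nothing failed and no error was reported,
yet part of the old file is gone.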

Another example is an inode which has multiple hard links, but the
hard link count is wrong by being too low. Now when you delete one of
the hard links, the inode will be released, and the inode and its data
blocks returned to the free pool, despite the fact that it is still
accessible via another directory entry in the filesystem, and despite
the fact that the file contents should be saved.
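
The hard link case can be modelled the same way. Again this is only a sketch with
invented names (`unlink`, dict-based inodes and directory), not kernel code:

```python
def unlink(directory, name, inodes, free_blocks):
    """Remove one name; release the inode's blocks when nlink hits zero."""
    ino = directory.pop(name)
    inode = inodes[ino]
    inode["nlink"] -= 1
    if inode["nlink"] == 0:
        free_blocks.update(inode["blocks"])
        inode["blocks"] = []


# Two directory entries point at inode 12, but the stored count says 1.
inodes = {12: {"nlink": 1, "blocks": [40, 41]}}
directory = {"thesis.tex": 12, "thesis-link.tex": 12}
free_blocks = set()

unlink(directory, "thesis-link.tex", inodes, free_blocks)

# Names that still exist but now point at a released inode.
dangling = [name for name, ino in directory.items()
            if inodes[ino]["nlink"] == 0]
```

After the single `unlink`, blocks 40 and 41 are back in the free pool while
`thesis.tex` still names the inode.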

In the case where you have a block which is claimed by more than one
file, if that file is rewritten in place, it's possible that the newly
written file could have its data corrupted, so it's not just a matter
of potential corruption to existing files; the newly created files are
at risk as well.

> can it overwrite data that could have been saved?
>
> can it cause new files that are created (or new data written to existing,
> but uncorrupted files) to be lost?
>
> or is it just a matter of not knowing about existing corruption?

So it's yes to all of the above; yes, you can overwrite existing data
files; yes it can cause data blocks belonging to newly created files
to be lost; and no, you won't know about data loss caused by metadata
corruption. (Again, you won't know about data loss caused by
corruption to the data blocks.)

- Ted


2009-09-03 23:50:42

by Krzysztof Halasa

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

Ric Wheeler <[email protected]> writes:

> The whole thread above is about software MD using commodity drives
> (S-ATA or SAS) without battery backed write cache.

Yes. However, you mentioned external RAID arrays disable disk caches.
That's why I asked if they are using SATA or SCSI/etc. disks, and if
they have battery-backed cache.

> Also, when you enable the write cache (MD or not) you are buffering
> multiple MB's of data that can go away on power loss. Far greater
> (10x) the exposure that the partial RAID rewrite case worries about.

The cache is flushed with working barriers. I guess it should be
superior to disabled WB cache, in both performance and expected disk
lifetime.
--
Krzysztof Halasa

2009-09-04 00:39:36

by Ric Wheeler

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

On 09/03/2009 07:50 PM, Krzysztof Halasa wrote:
> Ric Wheeler<[email protected]> writes:
>
>
>> The whole thread above is about software MD using commodity drives
>> (S-ATA or SAS) without battery backed write cache.
>>
> Yes. However, you mentioned external RAID arrays disable disk caches.
> That's why I asked if they are using SATA or SCSI/etc. disks, and if
> they have battery-backed cache.
>
>

Sorry for the confusion - they disable the write caches on the component
drives normally, but have their own write cache which is not disabled in
most cases.

>> Also, when you enable the write cache (MD or not) you are buffering
>> multiple MB's of data that can go away on power loss. Far greater
>> (10x) the exposure that the partial RAID rewrite case worries about.
>>
> The cache is flushed with working barriers. I guess it should be
> superior to disabled WB cache, in both performance and expected disk
> lifetime.
>

True - barriers (especially on big, slow s-ata drives) usually give you
an overall win. On SAS drives it seems to make less of an impact, but then
you always need to benchmark your workload on anything to get the only
numbers that really matter :-)

ric


2009-09-04 21:21:27

by Mark Lord

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

Ric Wheeler wrote:
..
> You must disable the write cache on these commodity drives *if* the MD
> RAID level does not support barriers properly.
..

Rather than further trying to cripple Linux on the notebook,
(it's bad enough already)..

How about instead, *fixing* the MD layer to properly support barriers?
That would be far more useful, productive, and better for end-users.

Cheers

2009-09-04 21:29:20

by Ric Wheeler

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

On 09/04/2009 05:21 PM, Mark Lord wrote:
> Ric Wheeler wrote:
> ..
>> You must disable the write cache on these commodity drives *if* the
>> MD RAID level does not support barriers properly.
> ..
>
> Rather than further trying to cripple Linux on the notebook,
> (it's bad enough already)..

People using MD on notebooks (not sure there are that many using RAID5
MD) could leave their write cache enabled.

>
> How about instead, *fixing* the MD layer to properly support barriers?
> That would be far more useful, productive, and better for end-users.
>
> Cheers

Fixing MD would be great - not sure that it would end up still faster
(look at md1 devices with working barriers compared to md1 with
write cache disabled).

In the meantime, if you are using MD to make your data more reliable, I
would still strongly urge you to disable the write cache when you see
"barriers disabled" messages spit out in /var/log/messages :-)

ric


2009-09-05 10:34:46

by Pavel Machek

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

Hi!

> > If it only was this simple. We don't have 'check brakes' (aka
> > 'journalling ineffective') warning light. If we had that, I would not
> > have problem.
>
> But we do; competently designed (and in the case of software RAID,
> competently packaged) RAID subsystems send notifications to the system
> administrator when there is a hard drive failure. Some hardware RAID
> systems will send a page to the system administrator. A mid-range
> Areca card has a separate ethernet port so it can send e-mail to the
> administrator, even if the OS is hosed for some reason.

Well, my MMC/uSD cards do not have ethernet ports to remind me that
they are unreliable :-(.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-05 12:57:57

by Mark Lord

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

Ric Wheeler wrote:
> On 09/04/2009 05:21 PM, Mark Lord wrote:
..
>> How about instead, *fixing* the MD layer to properly support barriers?
>> That would be far more useful, productive, and better for end-users.
..
> Fixing MD would be great - not sure that it would end up still faster
> (look at md1 devices with working barriers compared to md1 with
> write cache disabled).
..

There's no inherent reason for it to be slower, except possibly
drives with b0rked FUA support.

So the first step is to fix MD to pass barriers to the LLDs
for most/all RAID types.

Then, if it has performance issues, those can be addressed
by more application of little grey cells. :)

Cheers

2009-09-05 13:40:29

by Ric Wheeler

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

On 09/05/2009 08:57 AM, Mark Lord wrote:
> Ric Wheeler wrote:
>> On 09/04/2009 05:21 PM, Mark Lord wrote:
> ..
>>> How about instead, *fixing* the MD layer to properly support barriers?
>>> That would be far more useful, productive, and better for end-users.
> ..
>> Fixing MD would be great - not sure that it would end up still faster
>> (look at md1 devices with working barriers compared to md1 with
>> write cache disabled).
> ..
>
> There's no inherent reason for it to be slower, except possibly
> drives with b0rked FUA support.
>
> So the first step is to fix MD to pass barriers to the LLDs
> for most/all RAID types.
> Then, if it has performance issues, those can be addressed
> by more application of little grey cells. :)
>
> Cheers

The performance issue with MD is that the "simple" answer is to not only
pass on those downstream barrier ops, but also to block and wait until
all of those dependent barrier ops complete before ack'ing the IO.

When you do that implementation at least, you will see a very large
performance impact and I am not sure that you would see any degradation
vs just turning off the write caches.

Sounds like we should actually do some testing and measure. I do think
that it will vary with the class of device quite a lot, just like we see
with single disk barriers vs write cache disabled on SAS vs S-ATA, etc...

ric


2009-09-05 21:43:41

by NeilBrown

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

On Sat, September 5, 2009 10:57 pm, Mark Lord wrote:
> Ric Wheeler wrote:
>> On 09/04/2009 05:21 PM, Mark Lord wrote:
> ..
>>> How about instead, *fixing* the MD layer to properly support barriers?
>>> That would be far more useful, productive, and better for end-users.
> ..
>> Fixing MD would be great - not sure that it would end up still faster
>> (look at md1 devices with working barriers compared to md1 with
>> write cache disabled).
> ..
>
> There's no inherent reason for it to be slower, except possibly
> drives with b0rked FUA support.
>
> So the first step is to fix MD to pass barriers to the LLDs
> for most/all RAID types.

Having MD "pass barriers" to LLDs isn't really very useful.
The barrier needs to act with respect to all addresses of the MD device,
and once you pass it down, it can only act with respect to addresses
on that member device.
What any striping RAID level needs to do when it sees a barrier
is:
   suspend all future writes
   drain and flush all queues
   submit the barrier write
   drain and flush all queues
   unsuspend writes

I guess "drain and flush all queues" can be done with an empty barrier
so maybe that is exactly what you meant.

The double flush which (I think) is required by the barrier semantic
is unfortunate. I wonder if it would actually make things slower than
necessary.
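
As a rough illustration of the five steps listed above, here is a toy
Python model. It is only a sketch, not MD code: MemberDisk, StripedArray
and barrier_write are names made up for this example, and "drain and
flush" is modelled as simply moving everything from a volatile cache
list to a stable list.

```python
from dataclasses import dataclass, field


@dataclass
class MemberDisk:
    """Hypothetical stand-in for one RAID member with a volatile write cache."""
    name: str
    cache: list = field(default_factory=list)   # queued, not yet durable
    stable: list = field(default_factory=list)  # durable on the medium

    def write(self, block):
        self.cache.append(block)

    def drain_and_flush(self):
        # Everything queued so far becomes durable, in submission order.
        self.stable.extend(self.cache)
        self.cache.clear()


class StripedArray:
    """Toy striping array that applies the barrier sequence to all members."""

    def __init__(self, disks):
        self.disks = disks
        self.suspended = False
        self.log = []  # order of the steps taken, for inspection

    def write(self, disk_index, block):
        if self.suspended:
            raise RuntimeError("writes are suspended during a barrier")
        self.disks[disk_index].write(block)

    def _flush_all(self):
        for disk in self.disks:
            disk.drain_and_flush()
        self.log.append("flush-all")

    def barrier_write(self, disk_index, block):
        self.suspended = True                 # suspend all future writes
        self.log.append("suspend")
        self._flush_all()                     # drain and flush all queues
        self.disks[disk_index].write(block)   # submit the barrier write
        self.log.append("barrier-write")
        self._flush_all()                     # drain and flush all queues
        self.suspended = False                # unsuspend writes
        self.log.append("unsuspend")
```

The two calls to _flush_all() are the "double flush" mentioned above:
one so that everything before the barrier is durable on every member,
one so that the barrier write itself is durable before anything later
is allowed through.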

NeilBrown

>
> Then, if it has performance issues, those can be addressed
> by more application of little grey cells. :)
>
> Cheers
>


2009-09-07 11:45:39

by Pavel Machek

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

Hi!

> Note that even without MD raid, the file system issues IO's in file
> system block size (4096 bytes normally) and most commodity storage
> devices use a 512 byte sector size which means that we have to update 8
> 512b sectors.
>
> Drives can (and do) have multiple platters and surfaces and it is
> perfectly normal to have contiguous logical ranges of sectors map to
> non-contiguous sectors physically. Imagine a 4KB write stripe that
> straddles two adjacent tracks on one platter (requiring a seek) or mapped
> across two surfaces (requiring a head switch). Also, a remapped sector
> can require more or less a full surface seek from where ever you are to
> the remapped sector area of the drive.

Yes, but ext3 was designed to handle the partial write (according to
tytso).

> These are all examples where a power loss can, even on a local
> (non-MD) device, leave a partial update of that 4KB write range of
> sectors.

Yes, but ext3 journal protects metadata integrity in that case.
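
To make the partial update concrete, here is a small Python sketch of a
4096-byte block made of 8 sectors of 512 bytes where power fails partway
through. The function name torn_write is made up for this example, and
it assumes the sectors reach the medium in order, which a real drive
does not promise.

```python
SECTOR_SIZE = 512
BLOCK_SIZE = 4096
SECTORS_PER_BLOCK = BLOCK_SIZE // SECTOR_SIZE  # 8 sectors per fs block


def torn_write(old_block: bytes, new_block: bytes, sectors_written: int) -> bytes:
    """Return the on-disk block if power fails after `sectors_written` sectors.

    Simplifying assumption: sectors are written in ascending order.
    """
    if len(old_block) != BLOCK_SIZE or len(new_block) != BLOCK_SIZE:
        raise ValueError("both blocks must be exactly BLOCK_SIZE bytes")
    if not 0 <= sectors_written <= SECTORS_PER_BLOCK:
        raise ValueError("sectors_written must be between 0 and 8")
    cut = sectors_written * SECTOR_SIZE
    # The first part is new data, the remainder is still the old contents.
    return new_block[:cut] + old_block[cut:]
```

Any value of sectors_written other than 0 or 8 leaves a block that is
neither the old nor the new version.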

> In other words, this is not just an MD issue, it is entirely possible
> even with non-MD devices.
>
> Also, when you enable the write cache (MD or not) you are buffering
> multiple MB's of data that can go away on power loss. Far greater (10x)
> the exposure that the partial RAID rewrite case worries about.

Yes, that's what barriers are for. Except that they are not there on
MD0/MD5/MD6. They actually work on local sata drives...

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-07 13:11:56

by Theodore Ts'o

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

On Mon, Sep 07, 2009 at 01:45:34PM +0200, Pavel Machek wrote:
>
> Yes, but ext3 was designed to handle the partial write (according to
> tytso).

I'm not sure what made you think that I said that. In practice things
usually work out, as a consequence of the fact that ext3 uses physical
block journaling, but it's not perfect, because...

> > Also, when you enable the write cache (MD or not) you are buffering
> > multiple MB's of data that can go away on power loss. Far greater (10x)
> > the exposure that the partial RAID rewrite case worries about.
>
> Yes, that's what barriers are for. Except that they are not there on
> MD0/MD5/MD6. They actually work on local sata drives...

Yes, but ext3 does not enable barriers by default (the patch has been
submitted but akpm has balked because he doesn't like the performance
degradation and doesn't believe that Chris Mason's "workload of doom"
is a common case). Note though that it is possible for dirty blocks
to remain in the track buffer for *minutes* without being written to
spinning rust platters without a barrier.

See Chris Mason's report of this phenomenon here:

http://lkml.org/lkml/2009/3/30/297

Here's Chris Mason's "barrier test" which will corrupt ext3 filesystems
50% of the time after a power drop if the filesystem is mounted with
barriers disabled (which is the default; use the mount option
barrier=1 to enable barriers):

http://lkml.indiana.edu/hypermail/linux/kernel/0805.2/1518.html

(Yes, ext4 has barriers enabled by default.)
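
For anyone who wants to see which ext3 mounts are running without
barriers, a minimal Python sketch follows. It assumes, as described
above, that ext3 leaves barriers off unless barrier=1 is given, and it
only understands the barrier=0 / barrier=1 spellings. The function names
are made up for this example; the input is text in /proc/mounts format.

```python
def barrier_setting(mount_options: str):
    """Return True/False for an explicit barrier=1/barrier=0, else None."""
    for opt in mount_options.split(","):
        opt = opt.strip()
        if opt == "barrier=1":
            return True
        if opt == "barrier=0":
            return False
    return None


def ext3_mounts_without_barriers(mounts_text: str):
    """List ext3 mount points that do not explicitly enable barriers.

    Each line is expected to look like a /proc/mounts entry:
    device mountpoint fstype options dump pass
    """
    risky = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mountpoint, fstype, options = fields[1], fields[2], fields[3]
        # Assumption from this thread: no explicit barrier=1 means "off".
        if fstype == "ext3" and barrier_setting(options) is not True:
            risky.append(mountpoint)
    return risky
```

On a live system the text could come from reading /proc/mounts; whether
the kernel lists the option there when it is at its default is not
something this sketch checks.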

- Ted

2009-11-09 10:50:54

by Pavel Machek

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

Hi!

> >> If you have a specific bug in MD code, please propose a patch.
> >
> > Interesting. So, what's technically wrong with the patch below?
> >
>
> You mean apart from ".... that high highly undesirable ...." ??
>                                ^^^^^^^^^^^
>

Ok, I still believe kernel documentation should be ... well... in
kernel, not in LWN article, so I fixed the patch according to your
comments.

Signed-off-by: Pavel Machek <[email protected]>

diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt
new file mode 100644
index 0000000..14d0324
--- /dev/null
+++ b/Documentation/filesystems/dangers.txt
@@ -0,0 +1,21 @@
+There are storage devices that have highly undesirable properties when
+they are disconnected or suffer power failures while writes are in
+progress; such devices include flash devices and degraded DM/MD RAID
+4/5/6 (*) arrays. These devices have the property of potentially
+corrupting blocks being written at the time of the power failure, and
+worse yet, amplifying the region where blocks are corrupted such that
+additional sectors are also damaged during the power failure.
+
+Users who use such storage devices are well advised to take
+countermeasures, such as the use of Uninterruptible Power Supplies,
+and making sure the flash device is not hot-unplugged while the device
+is being used. Regular backups when using any devices, and these
+devices in particular is also a Very Good Idea.
+
+Otherwise, file systems placed on these devices can suffer silent data
+and file system corruption. A forced use of fsck may detect metadata
+corruption resulting in file system corruption, but will not suffice
+to detect data corruption.
+
+(*) If device failure causes the array to become degraded during or
+immediately after the power failure, the same problem can result.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2010-04-04 13:47:41

by Pavel Machek

Subject: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)

Hi!

> > Yes, but ext3 was designed to handle the partial write (according to
> > tytso).
>
> I'm not sure what made you think that I said that. In practice things
> usually work out, as a consequence of the fact that ext3 uses physical
> block journaling, but it's not perfect, because...

Ok; so the journalling actually is not reliable on many machines --
not even disk drive manufacturers guarantee full block writes AFAICT.

Maybe there's time to revive the patch to increase mount count by >1
when journal is replayed, to do fsck more often when powerfails are
present?
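
A sketch of what that could look like, in Python for brevity rather
than as kernel code. The counter names follow the ext2/3 superblock
fields s_mnt_count and s_max_mnt_count; replay_weight and the default
values of 20 and 5 are purely hypothetical and only there to show the
effect.

```python
from dataclasses import dataclass


@dataclass
class CheckCounter:
    """Toy model of a mount counter that weighs unclean mounts more heavily."""
    mnt_count: int = 0
    max_mnt_count: int = 20   # hypothetical threshold
    replay_weight: int = 5    # hypothetical cost of one journal replay

    def record_mount(self, journal_replayed: bool) -> bool:
        """Bump the counter; return True once a full fsck is due."""
        self.mnt_count += self.replay_weight if journal_replayed else 1
        return self.mnt_count >= self.max_mnt_count

    def record_fsck(self) -> None:
        """A completed fsck resets the counter."""
        self.mnt_count = 0
```

With these made-up numbers, a machine that only ever shuts down cleanly
is checked every 20 mounts, while one that loses power before every
mount is checked every 4.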


> > > Also, when you enable the write cache (MD or not) you are buffering
> > > multiple MB's of data that can go away on power loss. Far greater (10x)
> > > the exposure that the partial RAID rewrite case worries about.
> >
> > Yes, that's what barriers are for. Except that they are not there on
> > MD0/MD5/MD6. They actually work on local sata drives...
>
> Yes, but ext3 does not enable barriers by default (the patch has been
> submitted but akpm has balked because he doesn't like the performance
> degradation and doesn't believe that Chris Mason's "workload of doom"
> is a common case). Note though that it is possible for dirty blocks
> to remain in the track buffer for *minutes* without being written to
> spinning rust platters without a barrier.

So we do wrong thing by default. Another reason to do fsck more often
when powerfails are present?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2010-04-04 17:40:43

by Theodore Ts'o

Subject: Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)

On Sun, Apr 04, 2010 at 03:47:29PM +0200, Pavel Machek wrote:
> > Yes, but ext3 does not enable barriers by default (the patch has been
> > submitted but akpm has balked because he doesn't like the performance
> > degradation and doesn't believe that Chris Mason's "workload of doom"
> > is a common case). Note though that it is possible for dirty blocks
> > to remain in the track buffer for *minutes* without being written to
> > spinning rust platters without a barrier.
>
> So we do wrong thing by default. Another reason to do fsck more often
> when powerfails are present?

Or migrate to ext4, which does use barriers by default, as well as
journal-level checksumming. :-)

As far as changing the default to enable barriers for ext3, you'll
need to talk to akpm about that; he's the one who has been against it
in the past.

- Ted

2010-04-04 17:59:16

by Rob Landley

Subject: Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)

On Sunday 04 April 2010 08:47:29 Pavel Machek wrote:
> Maybe there's time to revive the patch to increase mount count by >1
> when journal is replayed, to do fsck more often when powerfails are
> present?

Wow, you mean there are Linux users left who _don't_ rip that out?

The auto-fsck stuff is an instance of "we the developers know what you the
users need far more than you ever could, so let me ram this down your throat".
I don't know of a server anywhere that can afford an unscheduled extra four
hours of downtime due to the system deciding to fsck itself, and I don't know
a Linux laptop user anywhere who would be happy to fire up their laptop and
suddenly be told "oh, you can't do anything with it for two hours, and you
can't power it down either".

I keep my laptop backed up to an external terabyte USB drive and the volatile
subset of it to a network drive (rsync is great for both), and when it dies,
it dies. But I've never lost data due to an issue fsck would have fixed. I've
lost data to disks overheating, disks wearing out, disks being run undervolt
because the cat chewed on the power supply cord... I've copied floppy images to
/dev/hda instead of /dev/fd0... I even ran over my laptop with my car once.
(Amazingly enough, that hard drive survived.)

But fsck has never once protected any data of mine, that I am aware of, since
journaling was introduced.

I'm all for btrfs coming along and being able to fsck itself behind my back
where I don't have to care about it. (Although I want to tell it _not_ to do
that when on battery power.) But the "fsck lottery" at powerup is just
stupid.

> > > > Also, when you enable the write cache (MD or not) you are buffering
> > > > multiple MB's of data that can go away on power loss. Far greater
> > > > (10x) the exposure that the partial RAID rewrite case worries about.
> > >
> > > Yes, that's what barriers are for. Except that they are not there on
> > > MD0/MD5/MD6. They actually work on local sata drives...
> >
> > Yes, but ext3 does not enable barriers by default (the patch has been
> > submitted but akpm has balked because he doesn't like the performance
> > > degradation and doesn't believe that Chris Mason's "workload of doom"
> > is a common case). Note though that it is possible for dirty blocks
> > to remain in the track buffer for *minutes* without being written to
> > spinning rust platters without a barrier.
>
> So we do wrong thing by default. Another reason to do fsck more often
> when powerfails are present?

My laptop power fails all the time, due to battery exhaustion. Back under KDE
it was decent about suspending when it ran low on power, but ever since
KDE 4 came out and I had to switch to XFCE, it's using the gnome
infrastructure, which collects funky statistics and heuristics but can never
quite save them to disk because suddenly running out of power when it thinks
it's got 20 minutes left doesn't give it the opportunity to save its database.
So it'll never auto-suspend, just suddenly die if I don't hit the button.

As a result of one of these, two large media files in my "anime" subdirectory
are not only crosslinked, but the common sector they share is bad. (It ran
out of power in the act of writing that sector. I left it copying large files
to the drive and forgot to plug it in, and it did the loud click emergency
park and power down thing when the hardware voltage regulator tripped.)

This corruption has been there for a year now. Presumably if it overwrote
that sector it might recover (perhaps by allocating one of the spares), but
the drive firmware has proven unwilling to do so in response to _reading_ the
bad sector, and I'm largely ignoring it because it's by no means the worst
thing wrong with this laptop's hardware, and some glorious day I'll probably
break down and buy a macintosh. The stuff I have on it's backed up, and in the
year since it hasn't developed a second bad sector and I haven't deleted those
files. (Yes, I could replace the hard drive _again_ but this laptop's on its
third hard drive already and it's just not worth the effort.)

I'm much more comfortable living with this until I can get a new laptop than
with the idea of running fsck on the system and letting it do who knows what
in response to something that is not actually a problem.

> Pavel

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

2010-04-04 18:45:46

by Pavel Machek

Subject: Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)

On Sun 2010-04-04 12:59:16, Rob Landley wrote:
> On Sunday 04 April 2010 08:47:29 Pavel Machek wrote:
> > Maybe there's time to revive the patch to increase mount count by >1
> > when journal is replayed, to do fsck more often when powerfails are
> > present?
>
> Wow, you mean there are Linux users left who _don't_ rip that out?

Yes, there are. It actually helped pinpoint corruption here; 4 times it
was major corruption.

And yes, I'd like fsck more often when there are power failures and
less often when the shutdowns are orderly...

I'm not sure what the right intervals between checks are for you, but
I'd say that fsck once a year or every 100 mounts or every 10 power
failures is probably a good idea for everybody...
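
Written down as a rule, that suggestion is just three thresholds joined
by "or". A minimal Python sketch, where fsck_due and its parameter names
are invented for this example and the defaults are the numbers above:

```python
def fsck_due(days_since_check, mounts_since_check, power_failures_since_check,
             max_days=365, max_mounts=100, max_power_failures=10):
    """Return True when any one of the three limits has been reached."""
    return (days_since_check >= max_days
            or mounts_since_check >= max_mounts
            or power_failures_since_check >= max_power_failures)
```

So a machine checked a month ago with only 12 mounts would still be due
if 10 of those mounts followed a power failure.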

> The auto-fsck stuff is an instance of "we the developers know what you the
> users need far more than you ever could, so let me ram this down your throat".
> I don't know of a server anywhere that can afford an unscheduled extra four
> hours of downtime due to the system deciding to fsck itself, and I don't know
> a Linux laptop user anywhere who would be happy to fire up their laptop and
> suddenly be told "oh, you can't do anything with it for two hours, and you
> can't power it down either".

On a laptop the situation is easy. Pull the plug, hit reset, wait for fsck,
plug AC back in. Done that, too :-).

Yep, it would be nice if fsck had "escape" button.

> I'm all for btrfs coming along and being able to fsck itself behind my back
> where I don't have to care about it. (Although I want to tell it _not_ to do
> that when on battery power.) But the "fsck lottery" at powerup is just
> stupid.

fsck lottery. :-).
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2010-04-04 19:29:12

by Theodore Ts'o

Subject: Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)

On Sun, Apr 04, 2010 at 12:59:16PM -0500, Rob Landley wrote:
> I don't know of a server anywhere that can afford an unscheduled
> extra four hours of downtime due to the system deciding to fsck
> itself, and I don't know a Linux laptop user anywhere who would be
> happy to fire up their laptop and suddenly be told "oh, you can't do
> anything with it for two hours, and you can't power it down either".

So what I recommend for server class machines is to either turn off
the automatic fsck's (it's the default, but it's documented and there
are supported ways of turning it off --- that's hardly developers
"ramming" it down user's throats), or more preferably, to use LVM, and
then take a snapshot and run fsck on the snapshot.

> I'm all for btrfs coming along and being able to fsck itself behind
> my back where I don't have to care about it. (Although I want to
> tell it _not_ to do that when on battery power.)

You can do this with ext3/ext4 today, now. Just take a look at
e2croncheck in the contrib directory of e2fsprogs. Changing it to not
do this when on battery power is a trivial exercise.
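
For readers who have not looked at e2croncheck, the idea is roughly:
create an LVM snapshot, fsck the snapshot, remove it. The Python sketch
below is not e2croncheck itself and does not run anything; it only
builds the command lines and adds the battery test discussed above. The
function names, the "-snap" suffix handling and the 100m snapshot size
are choices made for this example, and the sysfs layout
(/sys/class/power_supply/*/type and online) is assumed.

```python
from pathlib import Path


def on_ac_power(power_supply_dir="/sys/class/power_supply") -> bool:
    """Return True if a mains supply is online, or if none can be found."""
    root = Path(power_supply_dir)
    if not root.is_dir():
        return True  # no information available: assume mains power
    found_mains = False
    for supply in root.iterdir():
        type_file = supply / "type"
        online_file = supply / "online"
        if not (type_file.is_file() and online_file.is_file()):
            continue
        if type_file.read_text().strip() != "Mains":
            continue
        found_mains = True
        if online_file.read_text().strip() == "1":
            return True
    # A mains supply exists but is offline: we are on battery.
    return not found_mains


def snapshot_check_commands(vg: str, volume: str, snap_size: str = "100m"):
    """Build the commands for checking a snapshot of vg/volume (not executed)."""
    snap = f"{volume}-snap"
    return [
        ["lvcreate", "-s", "-L", snap_size, "-n", snap, f"{vg}/{volume}"],
        ["e2fsck", "-f", "-n", f"/dev/{vg}/{snap}"],  # forced, read-only check
        ["lvremove", "-f", f"{vg}/{snap}"],
    ]
```

A cron job could call on_ac_power() first and skip the run when it
returns False, then pass each command list to subprocess.run().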

> My laptop power fails all the time, due to battery exhaustion. Back
> under KDE it was decent about suspending when it ran low on
> power, but ever since KDE 4 came out and I had to switch to XFCE,
> it's using the gnome infrastructure, which collects funky statistics
> and heuristics but can never quite save them to disk because
> suddenly running out of power when it thinks it's got 20 minutes
> left doesn't give it the opportunity to save its database. So it'll
> never auto-suspend, just suddenly die if I don't hit the button.

Hmm, why are you running on battery so often? I make a point of
running connected to the AC mains whenever possible, because a LiOn
battery only has about 200 full-cycle charge/discharges in it, and
given the cost of LiOn batteries, basically each charge/discharge
cycle costs a dollar each. So I only run on batteries when I
absolutely have to, and in practice it's rare that I dip below 30% or
so.

> As a result of one of these, two large media files in my "anime"
> subdirectory are not only crosslinked, but the common sector they
> share is bad. (It ran out of power in the act of writing that
> sector. I left it copying large files to the drive and forgot to
> plug it in, and it did the loud click emergency park and power down
> thing when the hardware voltage regulator tripped.)

So e2fsck would fix the cross-linking. We do need to have some better
tools to do forced rewrite of sectors that have gone bad in a HDD. It
can be done by using badblocks -n, but that requires translating the
sector number emitted by the device driver (which for some drivers is
relative to the beginning of the partition, and for others is relative
to the beginning of the disk). It is possible to run badblocks -w on the
whole disk, of course, but it's better to just run it on the specific
block in question.
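
The translation itself is simple arithmetic once you know which
convention the driver uses and where the partition starts. A minimal
Python sketch, assuming 512-byte sectors; sector_to_fs_block and its
parameters are names invented for this example:

```python
SECTOR_SIZE = 512  # bytes per sector, assumed


def sector_to_fs_block(sector, partition_start_sector=0,
                       relative_to_disk=True, block_size=4096):
    """Convert a reported sector number to a block number within the partition.

    relative_to_disk says whether the driver counted from the start of
    the disk (True) or from the start of the partition (False).
    """
    if relative_to_disk:
        sector -= partition_start_sector
    if sector < 0:
        raise ValueError("sector lies before the start of the partition")
    # Integer division: 8 sectors of 512 bytes share one 4096-byte block.
    return (sector * SECTOR_SIZE) // block_size
```

For example, sector 2128 on a disk whose partition starts at sector 2048
lands in block 10 of that partition. That number could then be given to
badblocks, together with a matching -b value, as both the last and the
first block to test.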

> I'm much more comfortable living with this until I can get a new laptop than
> with the idea of running fsck on the system and letting it do who knows what
> in response to something that is not actually a problem.

Well, it actually is a problem. And there may be other problems
hiding that you're not aware of. Running "badblocks -b 4096 -n" may
discover other blocks that have failed, and you can then decide
whether you want to let fsck fix things up. If you don't, though,
it's probably not fair to blame ext3 or e2fsck for any future
failures (not that it's likely to stop you :-).

- Ted

2010-04-04 19:35:02

by Theodore Ts'o

Subject: Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)

On Sun, Apr 04, 2010 at 08:45:46PM +0200, Pavel Machek wrote:
>
> I'm not sure what the right intervals between checks are for you, but
> I'd say that fsck once a year or every 100 mounts or every 10 power
> failures is probably a good idea for everybody...

For people using e2croncheck, where you can check it when the system
is idle and without needing to do a power cycle, I'd recommend once a
week, actually.

> > hours of downtime due to the system deciding to fsck itself, and I
> > don't know a Linux laptop user anywhere who would be happy to fire
> > up their laptop and suddenly be told "oh, you can't do anything
> > with it for two hours, and you can't power it down either".
>
> On a laptop the situation is easy. Pull the plug, hit reset, wait for fsck,
> plug AC back in. Done that, too :-).

Some distributions will allow you to cancel an fsck; either by using
^C, or hitting escape. That's a matter for the boot scripts, which
are distribution specific. Ubuntu has a way of doing this, for
example, if I recall correctly --- although since I've started using
e2croncheck, I've never had an issue with an e2fsck taking place on
bootup. Also, with ext4, fscks are so much faster that even before I
upgraded to using an SSD, it was never an issue for me. It's
certainly not hours any more....

> Yep, it would be nice if fsck had "escape" button.

Complain to your distribution. :-)

Or this is Linux and open source; fix it yourself, and submit the
patches back to your distribution. If all you want to do is whine,
then maybe Rob's choice is the best way, go switch to the velvet-lined
closed system/jail which is the Macintosh. :-)

(I created e2croncheck to solve my problem; if that isn't good enough
for you, I encourage you to find/create your own fixes.)

- Ted

2010-04-04 23:58:42

by Rob Landley

Subject: Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)

On Sunday 04 April 2010 14:29:12 [email protected] wrote:
> On Sun, Apr 04, 2010 at 12:59:16PM -0500, Rob Landley wrote:
> > I don't know of a server anywhere that can afford an unscheduled
> > extra four hours of downtime due to the system deciding to fsck
> > itself, and I don't know a Linux laptop user anywhere who would be
> > happy to fire up their laptop and suddenly be told "oh, you can't do
> > anything with it for two hours, and you can't power it down either".
>
> So what I recommend for server class machines is to either turn off
> the automatic fsck's (it's the default, but it's documented and there
> are supported ways of turning it off --- that's hardly developers
> "ramming" it down user's throats), or more preferably, to use LVM, and
> then take a snapshot and run fsck on the snapshot.

Turning off the automatic fsck is what I see people do, yes.

My point is that if you don't force the thing to run memtest86 overnight every
20 boots, forcing it to run fsck seems a bit silly.

> > I'm all for btrfs coming along and being able to fsck itself behind
> > my back where I don't have to care about it. (Although I want to
> > tell it _not_ to do that when on battery power.)
>
> You can do this with ext3/ext4 today, now. Just take a look at
> e2croncheck in the contrib directory of e2fsprogs. Changing it to not
> do this when on battery power is a trivial exercise.
>
> > My laptop power fails all the time, due to battery exhaustion. Back
> > under KDE it was decent about suspending when it ran low on
> > power, but ever since KDE 4 came out and I had to switch to XFCE,
> > it's using the gnome infrastructure, which collects funky statistics
> > and heuristics but can never quite save them to disk because
> > suddenly running out of power when it thinks it's got 20 minutes
> > left doesn't give it the opportunity to save its database. So it'll
> > never auto-suspend, just suddenly die if I don't hit the button.
>
> Hmm, why are you running on battery so often?

Personal working style?

When I was in Pittsburgh, I used the laptop on the bus to and from work every
day. Here in Austin, my laundromat has free wifi. It also gets usable free
wifi from the coffee shop to the right, the japanese restaurant to the left, and
the ice cream shop across the street. (And when I'm not in a wifi area, my
cell phone can bluetooth associate to give me net access too.)

I like coffee shops. (Of course the fact that if I try to work from home I
have to fight off the affections of four cats might have something to do with it
too...)

> I make a point of
> running connected to the AC mains whenever possible, because a LiOn
> battery only has about 200 full-cycle charge/discharges in it, and
> given the cost of LiOn batteries, basically each charge/discharge
> cycle costs a dollar each.

Actually the battery's about $50, so that would be 25 cents each.

My laptop is on its third battery. It's also on its third hard drive.

> So I only run on batteries when I
> absolutely have to, and in practice it's rare that I dip below 30% or
> so.

Actually I find the suckers die just as quickly from simply being plugged in
and kept hot by the electronics, and never used so they're pegged at 100% with
slight trickle current beyond that constantly overcharging them.

> > As a result of one of these, two large media files in my "anime"
> > subdirectory are not only crosslinked, but the common sector they
> > share is bad. (It ran out of power in the act of writing that
> > sector. I left it copying large files to the drive and forgot to
> > plug it in, and it did the loud click emergency park and power down
> > thing when the hardware voltage regulator tripped.)
>
> So e2fsck would fix the cross-linking. We do need to have some better
> tools to do forced rewrite of sectors that have gone bad in a HDD. It
> can be done by using badblocks -n, but that requires translating the
> sector number emitted by the device driver (which for some drivers is
> relative to the beginning of the partition, and for others is relative
> to the beginning of the disk). It is possible to run badblocks -w on the
> whole disk, of course, but it's better to just run it on the specific
> block in question.

The point I was trying to make is that running "preemptive" fsck is imposing a
significant burden on users in an attempt to find purely theoretical problems,
with the expectation that a given run will _not_ find them. I've had systems
taken out by actual hardware issues often enough that keeping good backups and
being prepared to lose the entire laptop at any time is just common sense.

I knocked my laptop into the bathtub last month. Luckily there wasn't any
water in the thing at the time, but it made a very loud bang when it hit, and
it was on at the time. (Checked dmesg several times over the next few days
and it didn't start spitting errors at me, so that's something...)

> > I'm much more comfortable living with this until I can get a new laptop
> > than with the idea of running fsck on the system and letting it do who
> > knows what in response to something that is not actually a problem.
>
> Well, it actually is a problem. And there may be other problems
> hiding that you're not aware of. Running "badblocks -b 4096 -n" may
> discover other blocks that have failed, and you can then decide
> whether you want to let fsck fix things up. If you don't, though,
> it's probably not fair to blame ext3 or e2fsck for any future
> failures (not that it's likely to stop you :-).

I'm not blaming ext2. I'm saying I've spilled sodas into my working machines
on so many occasions over the years I've lost _track_. (The vast majority of
'em survived, actually.)

Random example of current cascading badness: The latch sensor on my laptop is
no longer debounced. That happened when I upgraded to Ubuntu 9.04 but I'm not
sure how that _can_ screw that up, you'd think the bios would be in charge of
that. So anyway, it now has a nasty habit of waking itself up in the nice
insulated pocket in my backpack and then shutting itself down hard five minutes
later when the thermal sensors trip (at the bios level I think, not in the
OS). So I now regularly suspend to disk instead of to ram because that way it
can't spuriously wake itself back up just because it got jostled slightly.
Except that when it resumes from disk, the console it suspended in is totally
misprogrammed (vertical lines on what it _thinks_ is text mode), and sometimes
the chip is so horked I can hear the sucker making a screeching noise. The
easy workaround is to ctrl-alt-F1 and suspend from a text console, then
Ctrl-alt-f7 gets me back to the desktop. But going back to that text console
remembers the misprogramming, and I get vertical lines and an audible whine
coming from something that isn't a speaker. (Luckily cursor-up and enter works
to re-suspend, so I can just sacrifice one console to the suspend bug.)

The _fun_ part is that the last system I had where X11 regularly misprogrammed
it so badly I could _hear_ the video chip, said video chip eventually
overheated and melted bits of the motherboard. (That was a toshiba laptop.
It took out the keyboard controller first, and I used it for a few months with
an external keyboard until the whole thing just went one day. The display you
get when your video chip finally goes can be pretty impressive. Way prettier
than the time I was caught in a thunderstorm and my laptop got soaked and two
vertical sections of the display were flickering white while the rest was
displaying normally -- that system actually started working again when it dried
out...)

It just wouldn't be a Linux box to me if I didn't have workarounds for the
side effects of my workarounds.

Anyway, this is the perspective from which I say that the fsck to look for
purely theoretical badness on my otherwise perfect system is not worth 2 hours
to never find anything wrong.

If Ubuntu's little upgrade icon had a "recommend fsck" thing that lights up
every 3 months which I could hit some weekend when I was going out anyway,
that would be one thing. But "Ah, Ubuntu 9.04 moved DRM from X11 into the
kernel and the Intel 945 3D driver is now psychotic and it froze your machine
for the second time this week. Since you're rebooting anyway, you won't mind
if I add an extra 3 hours to the process"...? That stopped really being a
viable assumption some time before hard drives were regularly measured in
terabytes.

> - Ted

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds