2009-08-26 11:12:08

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Wed 2009-08-26 06:39:14, Ric Wheeler wrote:
> On 08/25/2009 10:58 PM, Theodore Tso wrote:
>> On Tue, Aug 25, 2009 at 09:15:00PM -0400, Ric Wheeler wrote:
>>
>>> I agree with the whole write up outside of the above - degraded RAID
>>> does meet this requirement unless you have a second (or third, counting
>>> the split write) failure during the rebuild.
>>>
>> The argument is that if the degraded RAID array is running in this
>> state for a long time, and the power fails while the software RAID is
>> in the middle of writing out a stripe, such that the stripe isn't
>> completely written out, we could lose all of the data in that stripe.
>>
>> In other words, a power failure in the middle of writing out a stripe
>> in a degraded RAID array counts as a second failure.
>> To me, this isn't a particularly interesting or newsworthy point,
>> since a competent system administrator who cares about his data and/or
>> his hardware will (a) have a UPS, and (b) be running with a hot spare
>> and/or will imediately replace a failed drive in a RAID array.
>
> I agree that this is not an interesting (or likely) scenario, certainly
> when compared to the much more frequent failures that RAID will protect
> against which is why I object to the document as Pavel suggested. It
> will steer people away from using RAID and directly increase their
> chances of losing their data if they use just a single disk.

So instead of fixing or at least documenting a known software deficiency
in the Linux MD stack, you'll try to suppress that information so that
people use more RAID5 setups?

Perhaps better documentation will push them to RAID1, or maybe
make them buy a UPS?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


2009-08-26 11:28:00

by David Lang

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Wed, 26 Aug 2009, Pavel Machek wrote:

> On Wed 2009-08-26 06:39:14, Ric Wheeler wrote:
>> On 08/25/2009 10:58 PM, Theodore Tso wrote:
>>> On Tue, Aug 25, 2009 at 09:15:00PM -0400, Ric Wheeler wrote:
>>>
>>>> I agree with the whole write up outside of the above - degraded RAID
>>>> does meet this requirement unless you have a second (or third, counting
>>>> the split write) failure during the rebuild.
>>>>
>>> The argument is that if the degraded RAID array is running in this
>>> state for a long time, and the power fails while the software RAID is
>>> in the middle of writing out a stripe, such that the stripe isn't
>>> completely written out, we could lose all of the data in that stripe.
>>>
>>> In other words, a power failure in the middle of writing out a stripe
>>> in a degraded RAID array counts as a second failure.
>>> To me, this isn't a particularly interesting or newsworthy point,
>>> since a competent system administrator who cares about his data and/or
>>> his hardware will (a) have a UPS, and (b) be running with a hot spare
>>> and/or will imediately replace a failed drive in a RAID array.
>>
>> I agree that this is not an interesting (or likely) scenario, certainly
>> when compared to the much more frequent failures that RAID will protect
>> against which is why I object to the document as Pavel suggested. It
>> will steer people away from using RAID and directly increase their
>> chances of losing their data if they use just a single disk.
>
> So instead of fixing or at least documenting known software deficiency
> in Linux MD stack, you'll try to surpress that information so that
> people use more of raid5 setups?
>
> Perhaps the better documentation will push them to RAID1, or maybe
> make them buy an UPS?

people aren't objecting to better documentation, they are objecting to
misleading documentation.

for flash drives the danger is very straightforward (although even then
you have to note that it depends heavily on the firmware of the device,
some will lose lots of data, some won't lose any)

a good thing to do here would be for someone to devise a test to show this
problem, and then gather the results of lots of people performing this
test to see what the commonalities are.

you are generalizing that since you have lost data on flash drives, all
flash drives are dangerous.

what if it turns out that only one manufacturer is doing things wrong? you
will have discouraged people from using flash drives for no reason.
(potentially causing them to lose data because they are scared away from
using flash drives and don't implement anything better)

to be safe, all that a flash drive needs to do is to not change the FTL
pointers until the data has fully been recorded in its new location. this
is probably a trivial firmware change.


for raid arrays, we are still learning the nuances of what actually can
happen. the comment that Rik made a few hours ago pointed out that with
raid 5 you won't trash the entire stripe (which is what I thought
happened from prior comments), but instead run the risk of losing two
relatively definable chunks of data:

1. the block you are writing (which you can lose anyway)

2. the block that would live on the disk that is missing.

that drastically lessens the impact of the problem
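
(To make those two losses concrete, here is a minimal sketch of a degraded
RAID5 stripe, written for this write-up rather than taken from the thread;
the four-disk layout, tiny block size and byte-wise XOR parity are
simplifying assumptions, not how MD lays anything out.)

def xor(*blocks):
    # byte-wise XOR of equally sized blocks
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"   # data blocks of one stripe
parity = xor(d0, d1, d2)                 # parity block on the fourth disk

# the disk holding d2 has failed; its content is rebuilt on demand
assert xor(d0, d1, parity) == d2

# power fails while d0 is being rewritten: the new data block lands,
# but the matching parity update does not (or the other way around)
d0_new = b"XXXX"
stale_parity = parity                    # still describes the old d0

# loss 1: the block being written (d0) may be old, new, or torn
# loss 2: the block on the missing disk (d2) is now rebuilt from
#         mismatched data and parity, silently returning garbage
print(xor(d0_new, d1, stale_parity) == d2)   # False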

I would like to see someone explain what would happen on raid 6, and I
think the approach Neil talked about, where he said that it was possible
to try the various combinations and see which ones agree with each other,
would be a good thing to implement if he can do so.

but the super-simplified statement you keep trying to make significantly
overstates and oversimplifies the problem.

David Lang

2009-08-26 12:01:50

by Ric Wheeler

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On 08/26/2009 07:12 AM, Pavel Machek wrote:
> On Wed 2009-08-26 06:39:14, Ric Wheeler wrote:
>
>> On 08/25/2009 10:58 PM, Theodore Tso wrote:
>>
>>> On Tue, Aug 25, 2009 at 09:15:00PM -0400, Ric Wheeler wrote:
>>>
>>>
>>>> I agree with the whole write up outside of the above - degraded RAID
>>>> does meet this requirement unless you have a second (or third, counting
>>>> the split write) failure during the rebuild.
>>>>
>>>>
>>> The argument is that if the degraded RAID array is running in this
>>> state for a long time, and the power fails while the software RAID is
>>> in the middle of writing out a stripe, such that the stripe isn't
>>> completely written out, we could lose all of the data in that stripe.
>>>
>>> In other words, a power failure in the middle of writing out a stripe
>>> in a degraded RAID array counts as a second failure.
>>> To me, this isn't a particularly interesting or newsworthy point,
>>> since a competent system administrator who cares about his data and/or
>>> his hardware will (a) have a UPS, and (b) be running with a hot spare
>>> and/or will imediately replace a failed drive in a RAID array.
>>>
>> I agree that this is not an interesting (or likely) scenario, certainly
>> when compared to the much more frequent failures that RAID will protect
>> against which is why I object to the document as Pavel suggested. It
>> will steer people away from using RAID and directly increase their
>> chances of losing their data if they use just a single disk.
>>
> So instead of fixing or at least documenting known software deficiency
> in Linux MD stack, you'll try to surpress that information so that
> people use more of raid5 setups?
>
> Perhaps the better documentation will push them to RAID1, or maybe
> make them buy an UPS?
> Pavel
>

I am against documenting unlikely scenarios out of context that will
lead people to do the wrong thing.

ric



2009-08-26 12:23:11

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Wed, Aug 26, 2009 at 01:12:08PM +0200, Pavel Machek wrote:
> > I agree that this is not an interesting (or likely) scenario, certainly
> > when compared to the much more frequent failures that RAID will protect
> > against which is why I object to the document as Pavel suggested. It
> > will steer people away from using RAID and directly increase their
> > chances of losing their data if they use just a single disk.
>
> So instead of fixing or at least documenting known software deficiency
> in Linux MD stack, you'll try to surpress that information so that
> people use more of raid5 setups?

First of all, it's not a "known software deficiency"; you can't do
anything about a degraded RAID array, other than to replace the failed
disk. Secondly, what we should document is things like "don't use
crappy flash devices", "don't let the RAID array run in degraded mode
for a long time" and "if you must (which is a bad idea), better have a
UPS or a battery-backed hardware RAID". What we should *not* document
is

"ext3 is worthless for RAID 5 arrays" (simply wrong)

and

"ext2 is better than ext3 because it forces you to run a long, slow
fsck after each boot, and that helps you to catch filesystem
corruptions when the storage devices goes bad" (Second part of the
statement is true, but it's still bad general advice, and it's
horribly misleading)

and

"ext2 and ext3 have this surprising dependency that disks act like
disks". (alarmist)

- Ted

2009-08-29 09:49:20

by Pavel Machek

[permalink] [raw]
Subject: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)


>> So instead of fixing or at least documenting known software deficiency
>> in Linux MD stack, you'll try to surpress that information so that
>> people use more of raid5 setups?
>>
>> Perhaps the better documentation will push them to RAID1, or maybe
>> make them buy an UPS?
>
> people aren't objecting to better documentation, they are objecting to
> misleading documentation.

Actually Ric is. He's trying hard to make RAID5 look better than it
really is.

> for flash drives the danger is very straightforward (although even then
> you have to note that it depends heavily on the firmware of the device,
> some will loose lots of data, some won't loose any)

I have not seen one that works :-(.

> you are generalizing that since you have lost data on flash drives, all
> flash drives are dangerous.

Do the flash manufacturers claim they do not cause collateral damage
during powerfail? If not, they probably are dangerous.

Anyway, you wanted a test, and one is attached. It normally takes like
4 unplugs to uncover problems.

> but the super simplified statement you keep trying to make is
> significantly overstating and oversimplifying the problem.

Offer better docs? You are right that it does not lose the whole stripe;
it merely loses a random block on the same stripe, but the result for a
journaling filesystem is similar.
Pavel


--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Attachments:
fstest (923.00 B)
fstest.work (409.00 B)

2009-08-29 11:28:21

by Ric Wheeler

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 08/29/2009 05:49 AM, Pavel Machek wrote:
>
>>> So instead of fixing or at least documenting known software deficiency
>>> in Linux MD stack, you'll try to surpress that information so that
>>> people use more of raid5 setups?
>>>
>>> Perhaps the better documentation will push them to RAID1, or maybe
>>> make them buy an UPS?
>>>
>> people aren't objecting to better documentation, they are objecting to
>> misleading documentation.
>>
> Actually Ric is. He's trying hard to make RAID5 look better than it
> really is.
>
>
>

I object to the misleading and dangerous documentation that you have
proposed. I spend a lot of time working in data integrity, talking and
writing about it, so I care deeply that we don't misinform people.

In this thread, I have put out an accurate draft several times and you
have failed to respond to it.

The big picture that you don't agree with is:

(1) RAID (specifically MD RAID) will dramatically improve data integrity
for real users. This is not a statement of opinion, this is a statement
of fact that has been shown to be true in large scale deployments with
commodity hardware.

(2) RAID5 protects you against a single failure and your test case
purposely injects a double failure.

(3) How to configure MD reliably should be documented in the MD
documentation, not in each possible FS or raw device application.

(4) Data loss occurs in non-journalling file systems and journalling
file systems when you suffer double failures or hot-unplug storage,
especially inexpensive FLASH parts.

ric




2009-08-29 16:35:51

by David Lang

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Sat, 29 Aug 2009, Pavel Machek wrote:

>> for flash drives the danger is very straightforward (although even then
>> you have to note that it depends heavily on the firmware of the device,
>> some will loose lots of data, some won't loose any)
>
> I have not seen one that works :-(.

so let's get broader testing (including testing the SSDs as well as the
thumb drives)

>> you are generalizing that since you have lost data on flash drives, all
>> flash drives are dangerous.
>
> Do the flash manufacturers claim they do not cause collateral damage
> during powerfail? If not, they probably are dangerous.

I think that every single one of them will tell you to not unplug the
drive while writing to it. in fact, I'll bet they all tell you to not
unplug the drive without unmounting ('ejecting') it at the OS level.

> Anyway, you wanted a test, and one is attached. It normally takes like
> 4 unplugs to uncover problems.

Ok, help me understand this.

I copy these two files to a system, change them to point at the correct
device, run them and unplug the drive while it's running.

when I plug the device back in, how do I tell if it lost something
unexpected? since you are writing from urandom I have no idea what data
_should_ be on the drive, so how can I detect that a data block has been
corrupted?

David Lang


Attachments:
fstest (976.00 B)
fstest.work (425.00 B)

2009-08-30 07:01:15

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Wed 2009-08-26 08:23:11, Theodore Tso wrote:
> On Wed, Aug 26, 2009 at 01:12:08PM +0200, Pavel Machek wrote:
> > > I agree that this is not an interesting (or likely) scenario, certainly
> > > when compared to the much more frequent failures that RAID will protect
> > > against which is why I object to the document as Pavel suggested. It
> > > will steer people away from using RAID and directly increase their
> > > chances of losing their data if they use just a single disk.
> >
> > So instead of fixing or at least documenting known software deficiency
> > in Linux MD stack, you'll try to surpress that information so that
> > people use more of raid5 setups?
>
> First of all, it's not a "known software deficiency"; you can't do
> anything about a degraded RAID array, other than to replace the failed
> disk.

You could add a journal to raid5.

> "ext2 and ext3 have this surprising dependency that disks act like
> disks". (alarmist)

AFAICT, you mount a block device, not a disk. Many block devices fail the
test. And since users (and block device developers) do not know in
detail how disks behave, it is hard to blame them... ("you may corrupt
the sector you are writing to and ext3 handles that ok" was a surprise
to me, for example).


Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-30 07:07:49

by Pavel Machek

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

Hi!

>>> for flash drives the danger is very straightforward (although even then
>>> you have to note that it depends heavily on the firmware of the device,
>>> some will loose lots of data, some won't loose any)
>>
>> I have not seen one that works :-(.
>
> so let's get broader testing (including testing the SSDs as well as the
> thumb drives)

If someone can do an SSD test -- yes, that would be interesting.

>> Anyway, you wanted a test, and one is attached. It normally takes like
>> 4 unplugs to uncover problems.
>
> Ok, help me understand this.
>
> I copy these two files to a system, change them to point at the correct
> device, run them and unplug the drive while it's running.

Yep.

> when I plug the device back in, how do I tell if it lost something
> unexpected? since you are writing from urandom I have no idea what data
> _should_ be on the drive, so how can I detect that a data block has been
> corrupted?

I have a mirror on a disk you are not unplugging. See the cmp || exit lines.

The test continues until it detects corruption.
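
(A rough sketch of that write-and-compare loop, in Python rather than the
attached shell script, for readers without the attachment; the device and
mirror paths are placeholders, and a real run has to discount whatever was
still in flight when the plug was pulled.)

import os, sys

DEVICE = "/dev/sdX"             # device under test; gets unplugged mid-write
MIRROR = "/tmp/fstest.mirror"   # reference copy on a disk that stays plugged in
CHUNK, COUNT = 64 * 1024, 1024

def write_phase():
    with open(DEVICE, "r+b") as dev, open(MIRROR, "w+b") as mir:
        for _ in range(COUNT):
            data = os.urandom(CHUNK)
            dev.write(data)
            mir.write(data)
            dev.flush()
            os.fsync(dev.fileno())   # pull the plug somewhere in this loop

def check_phase():
    # after replugging, compare the device against the mirror; a real test
    # must ignore the tail that was still in flight at unplug time
    written = os.path.getsize(MIRROR)
    with open(DEVICE, "rb") as dev, open(MIRROR, "rb") as mir:
        if dev.read(written) != mir.read(written):
            sys.exit("corruption detected")   # the cmp || exit equivalent

if __name__ == "__main__":
    write_phase() if sys.argv[1:] == ["write"] else check_phase()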
Pavel


--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-02 20:12:10

by Pavel Machek

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)


>>> people aren't objecting to better documentation, they are objecting to
>>> misleading documentation.
>>>
>> Actually Ric is. He's trying hard to make RAID5 look better than it
>> really is.
>
> I object to misleading and dangerous documentation that you have
> proposed. I spend a lot of time working in data integrity, talking and
> writing about it so I care deeply that we don't misinform people.

Yes, truth is dangerous. To vendors selling crap products.

> In this thread, I put out a draft that is accurate several times and you
> have failed to respond to it.

Accurate as in 'has 0 information content' :-(.

> The big picture that you don't agree with is:
>
> (1) RAID (specifically MD RAID) will dramatically improve data integrity
> for real users. This is not a statement of opinion, this is a statement
> of fact that has been shown to be true in large scale deployments with
> commodity hardware.

It is also completely irrelevant.

> (2) RAID5 protects you against a single failure and your test case
> purposely injects a double failure.

Most people would be surprised that a press of the reset button counts as a
'failure' in this context.

> (4) Data loss occurs in non-journalling file systems and journalling
> file systems when you suffer double failures or hot unplug storage,
> especially inexpensive FLASH parts.

It does not happen on inexpensive DISK parts, so people do not expect
it, and it is worth pointing out.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-02 20:42:19

by Ric Wheeler

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 09/02/2009 04:12 PM, Pavel Machek wrote:
>
>>>> people aren't objecting to better documentation, they are objecting to
>>>> misleading documentation.
>>>>
>>> Actually Ric is. He's trying hard to make RAID5 look better than it
>>> really is.
>>
>> I object to misleading and dangerous documentation that you have
>> proposed. I spend a lot of time working in data integrity, talking and
>> writing about it so I care deeply that we don't misinform people.
>
> Yes, truth is dangerous. To vendors selling crap products.

Pavel, you have no information and an attitude of not wanting to listen to
anyone who has real experience or facts. Not just me, but also Ted and others.

Totally pointless to reply to you further.

Ric


2009-09-02 22:45:34

by Rob Landley

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Wednesday 02 September 2009 15:12:10 Pavel Machek wrote:
> > (2) RAID5 protects you against a single failure and your test case
> > purposely injects a double failure.
>
> Most people would be surprised that press of reset button is 'failure'
> in this context.

Apparently because most people haven't read Documentation/md.txt:

Boot time assembly of degraded/dirty arrays
-------------------------------------------

If a raid5 or raid6 array is both dirty and degraded, it could have
undetectable data corruption. This is because the fact that it is
'dirty' means that the parity cannot be trusted, and the fact that it
is degraded means that some datablocks are missing and cannot reliably
be reconstructed (due to no parity).

And so on for several more paragraphs. Perhaps the documentation needs to be
extended to note that "journaling will not help here, because the lost data
blocks render entire stripes unreconstructable"...

Hmmm, I'll take a stab at it. (I'm not addressing the raid 0 issues brought
up elsewhere in this thread because I don't comfortably understand the current
state of play...)

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

2009-09-02 22:49:48

by Rob Landley

[permalink] [raw]
Subject: [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case.

From: Rob Landley <[email protected]>

Add more warnings to the "Boot time assembly of degraded/dirty arrays" section,
explaining that using a journaling filesystem can't overcome this problem.

Signed-off-by: Rob Landley <[email protected]>
---

Documentation/md.txt | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/Documentation/md.txt b/Documentation/md.txt
index 4edd39e..52b8450 100644
--- a/Documentation/md.txt
+++ b/Documentation/md.txt
@@ -75,6 +75,23 @@ So, to boot with a root filesystem of a dirty degraded raid[56], use

md-mod.start_dirty_degraded=1

+Note that journaling filesystems do not effectively protect data in this
+case, because the update granularity of the RAID is larger than the journal
+was designed to expect. Reconstructing data via parity information involves
+matching together corresponding stripes, and updating only some of these
+stripes renders the corresponding data in all the unmatched stripes
+meaningless. Thus seemingly unrelated data in other parts of the filesystem
+(stored in the unmatched stripes) can become unreadable after a partial
+update, but the journal is only aware of the parts it modified, not the
+"collateral damage" elsewhere in the filesystem which was affected by those
+changes.
+
+Thus successful journal replay proves nothing in this context, and even a
+full fsck only shows whether or not the filesystem's metadata was affected.
+(A proper solution to this problem would involve adding journaling to the RAID
+itself, at least during degraded writes. In the meantime, try not to allow
+a system to shut down uncleanly with its RAID both dirty and degraded; it
+can handle one but not both.)

Superblock formats
------------------


--
Latency is more important than throughput. It's that simple. - Linus Torvalds

2009-09-02 23:00:48

by Rob Landley

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Wednesday 02 September 2009 15:42:19 Ric Wheeler wrote:
> On 09/02/2009 04:12 PM, Pavel Machek wrote:
> >>>> people aren't objecting to better documentation, they are objecting to
> >>>> misleading documentation.
> >>>
> >>> Actually Ric is. He's trying hard to make RAID5 look better than it
> >>> really is.
> >>
> >> I object to misleading and dangerous documentation that you have
> >> proposed. I spend a lot of time working in data integrity, talking and
> >> writing about it so I care deeply that we don't misinform people.
> >
> > Yes, truth is dangerous. To vendors selling crap products.
>
> Pavel, you have no information and an attitude of not wanting to listen to
> anyone who has real experience or facts. Not just me, but also Ted and
> others.
>
> Totally pointless to reply to you further.

For the record, I've been able to follow Pavel's arguments, and I've been able
to follow Ted's arguments. But as far as I can tell, you're arguing about a
different topic than the rest of us.

There's a difference between:

A) This filesystem was corrupted because the underlying hardware is permanently
damaged, no longer functioning as it did when it was new, and never will
again.

B) We had a transient glitch that ate the filesystem. The underlying hardware
is as good as new, but our data is gone.

You can argue about whether or not "new" was ever any good, but Linux has run
on PC-class hardware from day 1. Sure PC-class hardware remains crap in many
different ways, but this is not a _new_ problem. Refusing to work around what
people actually _have_ and insisting we get a better class of user instead
_is_ a new problem, kind of a disturbing one.

USB keys are the modern successor to floppy drives, and even now
Documentation/blockdev/floppy.txt is still full of some of the torturous
workarounds implemented for that over the past 2 decades. The hardware
existed, and instead of turning up their nose at it they made it work as best
they could.

Perhaps what's needed for the flash thing is a userspace package, the way
mdutils made floppies a lot more usable than the kernel managed at the time.
For the flash problem perhaps some FUSE thing a bit like mtdblock might be
nice, a translation layer remapping an arbitrary underlying block device into
larger granularity chunks and being sure to do the "write the new one before
you erase the old one" trick that so many hardware-only flash devices _don't_,
and then maybe even use Pavel's crash tool to figure out the write granularity
of various sticks and ship it with a whitelist people can email updates to so
we don't have to guess large. (Pressure on the USB vendors to give us a "raw
view" extension bypassing the "pretend to be a hard drive, with remapping"
hardware in future devices would be nice too, but won't help any of the
hardware out in the field. I'm not sure that block remapping wouldn't screw up
_this_ approach either, but it's an example of something that could be
_tried_.)
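
(As a sketch only, the copy-before-retire ordering could look roughly like
the toy below; the chunk size, the in-memory map and the backing file are
invented for illustration, it is nowhere near a real FUSE driver, and
persisting the mapping itself atomically is the hard part left out.)

import os

CHUNK = 128 * 1024                      # assumed translation granularity

class CopyBeforeRetireLayer:
    def __init__(self, backing, chunks):
        self.backing = backing              # file object over the block device
        self.free = list(range(chunks))     # unused physical chunk numbers
        self.map = {}                       # logical chunk -> physical chunk

    def write_chunk(self, logical, data):
        assert len(data) == CHUNK
        new_phys = self.free.pop()          # 1. pick a fresh location
        self.backing.seek(new_phys * CHUNK)
        self.backing.write(data)
        self.backing.flush()
        os.fsync(self.backing.fileno())     # 2. new copy fully on media first
        old_phys = self.map.get(logical)
        self.map[logical] = new_phys        # 3. only then repoint the mapping
        if old_phys is not None:
            self.free.append(old_phys)      # 4. old copy reclaimed last

    def read_chunk(self, logical):
        self.backing.seek(self.map[logical] * CHUNK)
        return self.backing.read(CHUNK)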

However, thinking about how to _fix_ a problem is predicated on acknowledging
that there actually _is_ a problem. "The hardware is not physically damaged
but your data was lost" sounds to me like a software problem, and thus
something software could at least _attempt_ to address. "There's millions of
'em, Linux can't cope" doesn't seem like a useful approach.

I already addressed the software raid thing last post.

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

2009-09-02 23:09:24

by David Lang

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Wed, 2 Sep 2009, Rob Landley wrote:

> USB keys are the modern successor to floppy drives, and even now
> Documentation/blockdev/floppy.txt is still full of some of the torturous
> workarounds implemented for that over the past 2 decades. The hardware
> existed, and instead of turning up their nose at it they made it work as best
> they could.
>
> Perhaps what's needed for the flash thing is a userspace package, the way
> mdutils made floppies a lot more usable than the kernel managed at the time.
> For the flash problem perhaps some FUSE thing a bit like mtdblock might be
> nice, a translation layer remapping an arbitrary underlying block device into
> larger granularity chunks and being sure to do the "write the new one before
> you erase the old one" trick that so many hardware-only flash devices _don't_,
> and then maybe even use Pavel's crash tool to figure out the write granularity
> of various sticks and ship it with a whitelist people can email updates to so
> we don't have to guess large. (Pressure on the USB vendors to give us a "raw
> view" extension bypassing the "pretend to be a hard drive, with remapping"
> hardware in future devices would be nice too, but won't help any of the
> hardware out in the field. I'm not sure that block remapping wouldn't screw up
> _this_ approach either, but it's an example of something that culd be
> _tried_.)
>
> However, thinking about how to _fix_ a problem is predicated on acknowledging
> that there actually _is_ a problem. "The hardware is not physically damaged
> but your data was lost" sounds to me like a software problem, and thus
> something software could at least _attempt_ to address. "There's millions of
> 'em, Linux can't cope" doesn't seem like a useful approach.

no other OS avoids this problem either.

I actually don't see how you can do this from userspace, because when you
write to the device you have _no_ idea where on the device your data will
actually land.

writing in larger chunks may or may not help (if you do a 128K write
and the device is emulating 512b blocks on top of 128K eraseblocks,
then depending on the current state of the flash translation layer you
could end up writing to many different eraseblocks, up to the theoretical
max of 256)
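
(The worst-case arithmetic behind that 256, spelled out with the sizes
from the example above.)

WRITE_SIZE  = 128 * 1024    # one "large" write
LOGICAL     = 512           # emulated block size
ERASE_BLOCK = 128 * 1024    # physical eraseblock size

logical_blocks = WRITE_SIZE // LOGICAL       # 256 logical blocks
worst_case = logical_blocks                  # each may sit in its own eraseblock
best_case = WRITE_SIZE // ERASE_BLOCK        # 1, if the FTL kept them contiguous
print(logical_blocks, worst_case, best_case) # 256 256 1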

David Lang

2009-09-03 00:36:12

by jim owens

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

Rob Landley wrote:
> On Wednesday 02 September 2009 15:42:19 Ric Wheeler wrote:
>>
>> Totally pointless to reply to you further.
>
> For the record, I've been able to follow Pavel's arguments, and I've been able
> to follow Ted's arguments. But as far as I can tell, you're arguing about a
> different topic than the rest of us.

I had no trouble following what Ric was arguing about.

Ric never said "use only the best devices and you won't have problems".

Ric was arguing the exact opposite - ALL devices are crap if you define
crap as "can lose data". What he is saying is you need to UNDERSTAND
your devices and their behavior and you must act accordingly.

PAVEL DID NOT ACT ACCORDING TO HIS DEVICE LIMITATIONS.

We understand he was clueless, but user error is still user error!

And Ric said do not stigmatize whole classes of A) devices, B) raid,
and C) filesystems with "Pavel says...".

> However, thinking about how to _fix_ a problem is predicated on acknowledging
> that there actually _is_ a problem. "The hardware is not physically damaged
> but your data was lost" sounds to me like a software problem, and thus
> something software could at least _attempt_ to address. "There's millions of
> 'em, Linux can't cope" doesn't seem like a useful approach.

We have been trying forever to deal with device problems and, as
Ric kept trying to explain, we do understand them. The problem is
not "can we be better", it is "at what cost". As they keep saying,
"fast", "cheap", "safe"... pick any 2. Adding software solutions
to solve it will always turn "fast" into "slow".

Most people will choose some risk they can manage (such as
don't pull the flash card you idiot), instead of snail slow.

> I already addressed the software raid thing last post.

Saw it. I am not an MD guy so I will not say anything bad about it
except all the "journal" crud. It really is only pandering to Pavel,
because ALL filesystems can be screwed and that is what they really
need to know. The journal stuff distracts those who are not running
a journaling filesystem, even if your description is correct. And as
we fs people keep saying, fsck is meaningless and again will only
give you a false sense of security that your data is OK.

jim

2009-09-03 02:41:46

by Rob Landley

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Wednesday 02 September 2009 19:36:10 jim owens wrote:
> Rob Landley wrote:
> > On Wednesday 02 September 2009 15:42:19 Ric Wheeler wrote:
> >> Totally pointless to reply to you further.
> >
> > For the record, I've been able to follow Pavel's arguments, and I've been
> > able to follow Ted's arguments. But as far as I can tell, you're arguing
> > about a different topic than the rest of us.
>
> I had no trouble following what Ric was arguing about.
>
> Ric never said "use only the best devices and you won't have problems".
>
> Ric was arguing the exact opposite - ALL devices are crap if you define
> crap as "can loose data".

And if you include meteor strike and flooding in your operating criteria you
can come up with quite a straw man argument. It still doesn't mean "X is
highly likely to cause data loss" can never come as news to people.

> What he is saying is you need to UNDERSTAND
> your devices and their behavior and you must act accordingly.
>
> PAVEL DID NOT ACT ACCORDING TO HIS DEVICE LIMITATIONS.

Where was this limitation documented? (Before he documented it, I mean?)

> We understand he was clueless, but user error is still user error!

I think he understands he was clueless too, that's why he investigated the
failure and wrote it up for posterity.

> And Ric said do not stigmatize whole classes of A) devices, B) raid,
> and C) filesystems with "Pavel says...".

I don't care what "Pavel says", so you can leave the ad hominem at the door,
thanks.

The kernel presents abstractions, such as block device nodes. Sometimes
implementation details bubble through those abstractions. Presumably, we
agree on that so far.

I was once asked to write what became Documentation/rbtree.txt, which got
merged. I've also read maybe half of Documentation/RCU. Neither technique is
specific to Linux, but this doesn't seem to have been an objection at the time.

The technique, "journaling", is widely perceived as eliminating the need for
fsck (and thus the potential for filesystem corruption) in the case of unclean
shutdowns. But there are easily reproducible cases where the technique,
"journaling", does not do this. Thus journaling, as a concept, has
limitations which are _not_ widely understood by the majority of people who
purchase and use USB flash keys.

The kernel doesn't currently have any documentation on journaling theory where
mention of journaling's limitations could go. It does have a section on its
internal Journaling API in Documentation/DocBook/filesystems.tmpl which links
to two papers (both about ext3, even though reiserfs was merged first and IBM's
JFS was implemented before either) from 1998 and 2000 respectively. The 2000
paper brushes against disk granularity answering a question starting at 72m,
21s, and brushes against software raid and write ordering starting at the 72m
32s mark. But it never directly addresses either issue...

Sigh, I'm well into tl;dr territory here, aren't I?

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

2009-09-03 08:55:10

by Pavel Machek

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

Hi!

>> However, thinking about how to _fix_ a problem is predicated on acknowledging
>> that there actually _is_ a problem. "The hardware is not physically damaged
>> but your data was lost" sounds to me like a software problem, and thus
>> something software could at least _attempt_ to address. "There's millions of
>> 'em, Linux can't cope" doesn't seem like a useful approach.
>
> no other OS avoids this problem either.
>
> I actually don't see how you can do this from userspace, because when you
> write to the device you have _no_ idea where on the device your data will
> actually land.

It certainly is not easy. Self-correcting codes could probably be
used, but that would be very special, very slow, and very
non-standard. (Basically... we could design a filesystem so that it
would survive damage of an arbitrary 512K on disk -- using
self-correcting codes in a CD-like manner). I'm not sure if it would be
practical.
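
(A toy version of the idea, with plain XOR standing in for the CD-style
Reed-Solomon codes being alluded to: spread the data over chunks placed
far apart on the media plus one parity chunk, and any single damaged 512K
chunk can be rebuilt from the survivors.)

CHUNK = 512 * 1024

def xor_chunks(chunks):
    acc = 0
    for c in chunks:
        acc ^= int.from_bytes(c, "big")
    return acc.to_bytes(CHUNK, "big")

def encode(data, n_chunks=4):
    data = data.ljust(n_chunks * CHUNK, b"\0")
    chunks = [data[i * CHUNK:(i + 1) * CHUNK] for i in range(n_chunks)]
    return chunks + [xor_chunks(chunks)]     # last chunk is parity

def rebuild(chunks, damaged):
    survivors = [c for i, c in enumerate(chunks) if i != damaged]
    return xor_chunks(survivors)             # works for data and parity alike

chunks = encode(b"some file contents")
assert rebuild(chunks, 2) == chunks[2]       # chunk 2 "damaged", rebuilt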

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-03 09:08:39

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case.

On Wed 2009-09-02 17:49:46, Rob Landley wrote:
> From: Rob Landley <[email protected]>
>
> Add more warnings to the "Boot time assembly of degraded/dirty arrays" section,
> explaining that using a journaling filesystem can't overcome this problem.
>
> Signed-off-by: Rob Landley <[email protected]>

I like it! Not sure if I know enough about MD to add ack, but...

Acked-by: Pavel Machek <[email protected]>

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-03 12:05:53

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case.

On 09/02/2009 06:49 PM, Rob Landley wrote:
> From: Rob Landley<[email protected]>
>
> Add more warnings to the "Boot time assembly of degraded/dirty arrays" section,
> explaining that using a journaling filesystem can't overcome this problem.
>
> Signed-off-by: Rob Landley<[email protected]>
> ---
>
> Documentation/md.txt | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/Documentation/md.txt b/Documentation/md.txt
> index 4edd39e..52b8450 100644
> --- a/Documentation/md.txt
> +++ b/Documentation/md.txt
> @@ -75,6 +75,23 @@ So, to boot with a root filesystem of a dirty degraded raid[56], use
>
> md-mod.start_dirty_degraded=1
>
> +Note that Journaling filesystems do not effectively protect data in this
> +case, because the update granularity of the RAID is larger than the journal
> +was designed to expect. Reconstructing data via partity information involes
> +matching together corresponding stripes, and updating only some of these
> +stripes renders the corresponding data in all the unmatched stripes
> +meaningless. Thus seemingly unrelated data in other parts of the filesystem
> +(stored in the unmatched stripes) can become unreadable after a partial
> +update, but the journal is only aware of the parts it modified, not the
> +"collateral damage" elsewhere in the filesystem which was affected by those
> +changes.
> +
> +Thus successful journal replay proves nothing in this context, and even a
> +full fsck only shows whether or not the filesystem's metadata was affected.
> +(A proper solution to this problem would involve adding journaling to the RAID
> +itself, at least during degraded writes. In the meantime, try not to allow
> +a system to shut down uncleanly with its RAID both dirty and degraded, it
> +can handle one but not both.)
>
> Superblock formats
> ------------------
>
>

NACK.

Now you have moved the inaccurate documentation about journalling file systems
into the MD documentation.

Repeat after me:

(1) partial writes to a RAID stripe (with or without file systems, with or
without journals) create an invalid stripe

(2) partial writes can be prevented in most cases by running with write cache
disabled or working barriers

(3) fsck can (for journalling fs or non journalling fs) detect and fix your file
system. It won't give you back the data in that stripe, but you will get the
rest of your metadata and data back and usable.

You don't need MD in the picture to test this - take fsfuzzer or just dd and
zero out a RAID stripe width of data from a file system. If you hit data blocks,
your fsck (for ext2) or mount (for any journalling fs) will not see an error. If
you hit metadata, fsck in both cases will try to fix it as best it can when run.
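
(The same experiment against a loopback filesystem image instead of a live
array, as a sketch; the image path, the offset and the 256K "stripe width"
are made-up parameters.)

import os

IMAGE  = "/tmp/fs.img"         # e.g. a loopback ext2/ext3 image made for the test
STRIPE = 256 * 1024            # pretend stripe width
OFFSET = 10 * 1024 * 1024      # somewhere inside the filesystem

with open(IMAGE, "r+b") as img:
    img.seek(OFFSET)
    img.write(b"\0" * STRIPE)  # simulate one invalid stripe
    os.fsync(img.fileno())
# then fsck the image (ext2) or mount it (journalling fs) and see whether
# and how the damage is noticed, per the points above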

Also note that partial writes (similar to torn writes) can happen for multiple
reasons on non-RAID systems and leave the same kind of damage.

Side note: proposing a half-sketched-out "fix" for partial stripe writes in
documentation is not productive. Much better to submit a fully thought out
proposal or actual patches to demonstrate the issue.

Rob, you should really try to take a few disks, build a working MD RAID5 group
and test your ideas. Try it with and without the write cache enabled.

Measure and report, say after 20 power losses, how file integrity and fsck
repairs were impacted.

Try the same with ext2 and ext3.

Regards,

Ric


2009-09-03 12:31:11

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case.

On Thu 2009-09-03 08:05:31, Ric Wheeler wrote:
> On 09/02/2009 06:49 PM, Rob Landley wrote:
>> From: Rob Landley<[email protected]>
>>
>> Add more warnings to the "Boot time assembly of degraded/dirty arrays" section,
>> explaining that using a journaling filesystem can't overcome this problem.
>>
>> Signed-off-by: Rob Landley<[email protected]>
>> ---
>>
>> Documentation/md.txt | 17 +++++++++++++++++
>> 1 file changed, 17 insertions(+)
>>
>> diff --git a/Documentation/md.txt b/Documentation/md.txt
>> index 4edd39e..52b8450 100644
>> --- a/Documentation/md.txt
>> +++ b/Documentation/md.txt
>> @@ -75,6 +75,23 @@ So, to boot with a root filesystem of a dirty degraded raid[56], use
>>
>> md-mod.start_dirty_degraded=1
>>
>> +Note that Journaling filesystems do not effectively protect data in this
>> +case, because the update granularity of the RAID is larger than the journal
>> +was designed to expect. Reconstructing data via partity information involes
>> +matching together corresponding stripes, and updating only some of these
>> +stripes renders the corresponding data in all the unmatched stripes
>> +meaningless. Thus seemingly unrelated data in other parts of the filesystem
>> +(stored in the unmatched stripes) can become unreadable after a partial
>> +update, but the journal is only aware of the parts it modified, not the
>> +"collateral damage" elsewhere in the filesystem which was affected by those
>> +changes.
>> +
>> +Thus successful journal replay proves nothing in this context, and even a
>> +full fsck only shows whether or not the filesystem's metadata was affected.
>> +(A proper solution to this problem would involve adding journaling to the RAID
>> +itself, at least during degraded writes. In the meantime, try not to allow
>> +a system to shut down uncleanly with its RAID both dirty and degraded, it
>> +can handle one but not both.)
>>
>> Superblock formats
>> ------------------
>>
>>
>
> NACK.
>
> Now you have moved the inaccurate documentation about journalling file
> systems into the MD documentation.

What is inaccurate about it?

> Repeat after me:

> (1) partial writes to a RAID stripe (with or without file systems, with
> or without journals) create an invalid stripe

That's what he's documenting.

> (2) partial writes can be prevented in most cases by running with write
> cache disabled or working barriers

Given how much storage experience you claim, you should know by now that
MD RAID5 does not support barriers...


> Rob, you should really try to take a few disks, build a working MD RAID5
> group and test your ideas. Try it with and without the write cache
> enabled.

....and understand by now that statistics are irrelevant for design
problems.

Ouch, and trying to silence people by telling them to fix the problem
instead of documenting it is not nice either.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-03 14:14:43

by jim owens

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

Rob Landley wrote:
> I think he understands he was clueless too, that's why he investigated the
> failure and wrote it up for posterity.
>
>> And Ric said do not stigmatize whole classes of A) devices, B) raid,
>> and C) filesystems with "Pavel says...".
>
> I don't care what "Pavel says", so you can leave the ad hominem at the door,
> thanks.

See, this is exactly the problem we have with all the proposed
documentation. The reader (you) did not get what the writer (me)
was trying to say. That does not mean either of us was wrong in
what we thought was meant, simply that we did not communicate.

What I meant was we did not want to accept Pavel's incorrect
documentation and post it in kernel docs.

> The kernel presents abstractions, such as block device nodes. Sometimes
> implementation details bubble through those abstractions. Presumably, we
> agree on that so far.

We don't have any problem with documenting abstractions. But they
must be written as accurate abstractions, not as IMO blogs.

It is not "he means well, so we will just accept it". The rule
for kernel docs should be the same as for code. If it is not
correct in all cases or causes problems, we don't accept it.

jim

2009-09-04 07:44:55

by Rob Landley

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Thursday 03 September 2009 09:14:43 jim owens wrote:
> Rob Landley wrote:
> > I think he understands he was clueless too, that's why he investigated
> > the failure and wrote it up for posterity.
> >
> >> And Ric said do not stigmatize whole classes of A) devices, B) raid,
> >> and C) filesystems with "Pavel says...".
> >
> > I don't care what "Pavel says", so you can leave the ad hominem at the
> > door, thanks.
>
> See, this is exactly the problem we have with all the proposed
> documentation. The reader (you) did not get what the writer (me)
> was trying to say. That does not say either of us was wrong in
> what we thought was meant, simply that we did not communicate.

That's why I've mostly stopped bothering with this thread. I could respond to
Ric Wheeler's latest (what do write barriers have to do with whether or not
a multi-sector stripe is guaranteed to be atomically updated during a panic or
power failure?) but there's just no point.

The LWN article on the topic is out, and, incomplete as it is, I expect it's the
best documentation anybody will actually _read_.

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

2009-09-04 11:49:52

by Ric Wheeler

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 09/04/2009 03:44 AM, Rob Landley wrote:
> On Thursday 03 September 2009 09:14:43 jim owens wrote:
>
>> Rob Landley wrote:
>>
>>> I think he understands he was clueless too, that's why he investigated
>>> the failure and wrote it up for posterity.
>>>
>>>
>>>> And Ric said do not stigmatize whole classes of A) devices, B) raid,
>>>> and C) filesystems with "Pavel says...".
>>>>
>>> I don't care what "Pavel says", so you can leave the ad hominem at the
>>> door, thanks.
>>>
>> See, this is exactly the problem we have with all the proposed
>> documentation. The reader (you) did not get what the writer (me)
>> was trying to say. That does not say either of us was wrong in
>> what we thought was meant, simply that we did not communicate.
>>
> That's why I've mostly stopped bothering with this thread. I could respond to
> Ric Wheeler's latest (what does write barriers have to do with whether or not
> a multi-sector stripe is guaranteed to be atomically updated during a panic or
> power failure?) but there's just no point.
>

The point of that post was that the failure that you and Pavel both
attribute to RAID and journalled fs happens whenever the storage cannot
promise to do atomic writes of a logical FS block (prevent torn
pages/split writes/etc). I gave a specific example of why this happens
even with simple, single disk systems.

Further, if you have the write cache enabled on your local S-ATA/SAS
drives and do not have working barriers (as is the case with MD
RAID5/6), you have a hard promise of data loss on power outage and these
split writes are not going to be the cause of your issues.

You can verify this by testing. Or try to find people who do storage
and file systems, whom you would listen to, and ask them.
> The LWN article on the topic is out, and incomplete as it is I expect it's the
> best documentation anybody will actually _read_.
>
> Rob
>


2009-09-05 10:28:10

by Pavel Machek

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Fri 2009-09-04 07:49:34, Ric Wheeler wrote:
> On 09/04/2009 03:44 AM, Rob Landley wrote:
>> On Thursday 03 September 2009 09:14:43 jim owens wrote:
>>
>>> Rob Landley wrote:
>>>
>>>> I think he understands he was clueless too, that's why he investigated
>>>> the failure and wrote it up for posterity.
>>>>
>>>>
>>>>> And Ric said do not stigmatize whole classes of A) devices, B) raid,
>>>>> and C) filesystems with "Pavel says...".
>>>>>
>>>> I don't care what "Pavel says", so you can leave the ad hominem at the
>>>> door, thanks.
>>>>
>>> See, this is exactly the problem we have with all the proposed
>>> documentation. The reader (you) did not get what the writer (me)
>>> was trying to say. That does not say either of us was wrong in
>>> what we thought was meant, simply that we did not communicate.
>>>
>> That's why I've mostly stopped bothering with this thread. I could respond to
>> Ric Wheeler's latest (what does write barriers have to do with whether or not
>> a multi-sector stripe is guaranteed to be atomically updated during a panic or
>> power failure?) but there's just no point.
>>
>
> The point of that post was that the failure that you and Pavel both
> attribute to RAID and journalled fs happens whenever the storage cannot
> promise to do atomic writes of a logical FS block (prevent torn
> pages/split writes/etc). I gave a specific example of why this happens
> even with simple, single disk systems.

ext3 does not expect an atomic write of a 4K block, according to Ted. So
no, it is not broken on a single disk.

>> The LWN article on the topic is out, and incomplete as it is I expect it's the
>> best documentation anybody will actually _read_.

Would anyone (probably privately?) share the lwn link?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-05 12:20:47

by Ric Wheeler

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 09/05/2009 06:28 AM, Pavel Machek wrote:
> On Fri 2009-09-04 07:49:34, Ric Wheeler wrote:
>
>> On 09/04/2009 03:44 AM, Rob Landley wrote:
>>
>>> On Thursday 03 September 2009 09:14:43 jim owens wrote:
>>>
>>>
>>>> Rob Landley wrote:
>>>>
>>>>
>>>>> I think he understands he was clueless too, that's why he investigated
>>>>> the failure and wrote it up for posterity.
>>>>>
>>>>>
>>>>>
>>>>>> And Ric said do not stigmatize whole classes of A) devices, B) raid,
>>>>>> and C) filesystems with "Pavel says...".
>>>>>>
>>>>>>
>>>>> I don't care what "Pavel says", so you can leave the ad hominem at the
>>>>> door, thanks.
>>>>>
>>>>>
>>>> See, this is exactly the problem we have with all the proposed
>>>> documentation. The reader (you) did not get what the writer (me)
>>>> was trying to say. That does not say either of us was wrong in
>>>> what we thought was meant, simply that we did not communicate.
>>>>
>>>>
>>> That's why I've mostly stopped bothering with this thread. I could respond to
>>> Ric Wheeler's latest (what does write barriers have to do with whether or not
>>> a multi-sector stripe is guaranteed to be atomically updated during a panic or
>>> power failure?) but there's just no point.
>>>
>>>
>> The point of that post was that the failure that you and Pavel both
>> attribute to RAID and journalled fs happens whenever the storage cannot
>> promise to do atomic writes of a logical FS block (prevent torn
>> pages/split writes/etc). I gave a specific example of why this happens
>> even with simple, single disk systems.
>>
> ext3 does not expect atomic write of 4K block, according to Ted. So
> no, it is not broken on single disk.
>

I am not sure what you mean by "expect."

ext3 (and other file systems) certainly expect that acknowledged writes
will still be there after a crash.

With your disk write cache on (and no working barriers or non-volatile
write cache), this will always require a repair via fsck or leave you
with corrupted data or metadata.

ext4, btrfs and zfs all do checksumming of writes, but this is a
detection mechanism.

Repair of the partial write is done on detection (if you have another
copy, as in btrfs or xfs) or by a later repair pass (ext4's fsck).

For what it's worth, this is the same story with databases (DB2, Oracle,
etc). They spend a lot of energy trying to detect partial writes from
the application level's point of view and their granularity is often
multiple fs blocks....
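
(A minimal sketch of that detection idea: keep a checksum with each logical
block so a torn or partial write is noticed at read time. The 4K block and
trailing-CRC layout are assumptions for illustration, not how ext4, btrfs
or zfs actually store their checksums.)

import struct, zlib

BLOCK = 4096
PAYLOAD = BLOCK - 4                       # last 4 bytes hold a CRC32

def seal(payload):
    assert len(payload) == PAYLOAD
    return payload + struct.pack("<I", zlib.crc32(payload))

def verify(block):
    payload, stored = block[:PAYLOAD], struct.unpack("<I", block[PAYLOAD:])[0]
    return zlib.crc32(payload) == stored

good = seal(b"x" * PAYLOAD)
torn = good[:2048] + b"\0" * 2048         # only half of the write made it
print(verify(good), verify(torn))         # True False: detected, not repaired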

>
>
>>> The LWN article on the topic is out, and incomplete as it is I expect it's the
>>> best documentation anybody will actually _read_.
>>>
> Would anyone (probably privately?) share the lwn link?
> Pavel
>


2009-09-05 13:54:24

by Jonathan Corbet

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Sat, 5 Sep 2009 12:28:10 +0200
Pavel Machek <[email protected]> wrote:

> >> The LWN article on the topic is out, and incomplete as it is I expect it's the
> >> best documentation anybody will actually _read_.
>
> Would anyone (probably privately?) share the lwn link?

http://lwn.net/SubscriberLink/349970/9875eff987190551/

assuming you've not already gotten one from elsewhere.

jon

2009-09-05 21:27:32

by Pavel Machek

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Sat 2009-09-05 07:54:24, Jonathan Corbet wrote:
> On Sat, 5 Sep 2009 12:28:10 +0200
> Pavel Machek <[email protected]> wrote:
>
> > >> The LWN article on the topic is out, and incomplete as it is I expect it's the
> > >> best documentation anybody will actually _read_.
> >
> > Would anyone (probably privately?) share the lwn link?
>
> http://lwn.net/SubscriberLink/349970/9875eff987190551/
>
> assuming you've not already gotten one from elsewhere.

Thanks, and thanks for nice article!
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-05 21:56:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Sat, Sep 05, 2009 at 11:27:32PM +0200, Pavel Machek wrote:
>
> Thanks, and thanks for nice article!

I agree; it's very nicely written, balanced, and doesn't scare users
unduly.

- Ted