2009-08-24 21:11:08

by Rob Landley

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Monday 24 August 2009 04:31:43 Pavel Machek wrote:
> Running a journaling filesystem such as ext3 over a flash disk or a
> degraded RAID array is a bad idea: journaling guarantees no longer
> apply and you will get data corruption on powerfail.
>
> We can't solve it easily, but we should certainly warn the users. I
> actually lost data because I did not understand these limitations...
>
> Signed-off-by: Pavel Machek <[email protected]>
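The flash-disk case in the quoted description can be illustrated with a toy model. This sketch assumes one commonly cited mechanism, which is not taken from the patch: a flash device rewrites a sector by erasing and reprogramming the whole erase block that contains it. Under that assumption, a power failure in the middle of a rewrite can destroy neighbouring sectors that the filesystem never asked to write. SECTORS_PER_BLOCK, ERASED and write_sector() are hypothetical names, not any real FTL or kernel interface.

```python
# Toy model of an erase-block rewrite. ASSUMPTION: the device rewrites a
# sector by erasing and reprogramming the whole erase block around it.
# This is an illustration, not how any particular flash controller works.

SECTORS_PER_BLOCK = 4          # hypothetical erase-block geometry
ERASED = b"\xff\xff\xff\xff"   # erased flash reads back as all ones


def write_sector(block, idx, data, power_fails_after_erase=False):
    """Rewrite sector `idx` of `block` (a list of bytes) in place."""
    new_contents = list(block)       # read the whole erase block into RAM
    new_contents[idx] = data         # modify only the requested sector
    for i in range(len(block)):      # erase the entire block
        block[i] = ERASED
    if power_fails_after_erase:
        return                       # power lost before reprogramming
    block[:] = new_contents          # program the block back


block = [b"old%d" % i for i in range(SECTORS_PER_BLOCK)]
write_sector(block, 2, b"new2", power_fails_after_erase=True)
# Every sector now reads as ERASED, including 0, 1 and 3,
# which the caller never asked to touch.
print(block)
```

In this model a journal cannot undo the damage. It only records the blocks a transaction modifies, so the old contents of sectors 0, 1 and 3, committed long before, have no copy left to replay.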

Acked-by: Rob Landley <[email protected]>

With a couple comments:

> +* write caching is disabled. ext2 does not know how to issue barriers
> + as of 2.6.28. hdparm -W0 disables it on SATA disks.

It's coming up on 2.6.31; has it learned anything since, or should that
version number be bumped?

> + (Trash may get written into sectors during powerfail. And
> + ext3 handles this surprisingly well at least in the
> + catastrophic case of garbage getting written into the inode
> + table, since the journal replay often will "repair" the
> + garbage that was written into the filesystem metadata blocks.
> + It won't do a bit of good for the data blocks, of course
> + (unless you are using data=journal mode). But this means that
> + in fact, ext3 is more resistant to surviving failures to the
> + first problem (powerfail while writing can damage old data on
> + a failed write) but fortunately, hard drives generally don't
> + cause collateral damage on a failed write.

Possible rewording of this paragraph:

Ext3 handles trash getting written into sectors during powerfail
surprisingly well. It's not foolproof, but it is resilient. Incomplete
journal entries are ignored, and journal replay of complete entries will
often "repair" garbage written into the inode table. The data=journal
option extends this behavior to file and directory data blocks as well
(without which your dentries can still be badly corrupted by a power fail
during a write).

(I'm not entirely sure about that last bit, but clarifying it one way or the
other would be nice because I can't tell from reading it which it is. My
_guess_ is that directories are just treated as files with an attitude and an
extra caching layer...?)
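The replay behaviour in the proposed rewording can be sketched as follows: incomplete journal entries are ignored, and complete ones are copied over whatever is on disk. This is a minimal Python model with an assumed, simplified record format. replay(), the ("write", ...) / ("commit",) tuples and the block numbers are hypothetical and do not correspond to jbd's on-disk structures.

```python
# Toy journal replay. ASSUMPTION: a simplified record format, not the real
# jbd on-disk layout (which also has descriptor blocks, sequence numbers and
# revoke records). A transaction counts only once its commit record is seen.


def replay(journal, disk):
    """Copy committed journal blocks over `disk` (a dict block_nr -> bytes)."""
    pending = {}
    for record in journal:
        if record[0] == "write":
            _, block_nr, data = record
            pending[block_nr] = data      # hold until the commit record
        elif record[0] == "commit":
            disk.update(pending)          # complete transaction: apply it
            pending = {}
    # Whatever is left in `pending` had no commit record (power failed
    # before the transaction finished), so it is ignored.
    return disk


# Block 7 (say, part of the inode table) got garbage during a powerfail,
# but a committed copy of it is still in the journal.
disk = {7: b"GARBAGE!", 8: b"inode-8 "}
journal = [
    ("write", 7, b"inode-7 "),
    ("commit",),
    ("write", 8, b"half-don"),   # transaction never committed
]
replay(journal, disk)
print(disk)
```

Block 7 is "repaired" only because a committed copy happened to be in the journal. A block that was never logged would keep its garbage, for instance file data under data=ordered or data=writeback, where only metadata is journaled.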

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds


2009-08-24 21:33:12

by Pavel Machek

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Mon 2009-08-24 16:11:08, Rob Landley wrote:
> On Monday 24 August 2009 04:31:43 Pavel Machek wrote:
> > Running a journaling filesystem such as ext3 over a flash disk or a
> > degraded RAID array is a bad idea: journaling guarantees no longer
> > apply and you will get data corruption on powerfail.
> >
> > We can't solve it easily, but we should certainly warn the users. I
> > actually lost data because I did not understand these limitations...
> >
> > Signed-off-by: Pavel Machek <[email protected]>
>
> Acked-by: Rob Landley <[email protected]>
>
> With a couple comments:
>
> > +* write caching is disabled. ext2 does not know how to issue barriers
> > + as of 2.6.28. hdparm -W0 disables it on SATA disks.
>
> It's coming up on 2.6.31; has it learned anything since, or should that
> version number be bumped?

Jan, did those "barrier for ext2" patches get merged?

> > + (Trash may get written into sectors during powerfail. And
> > + ext3 handles this surprisingly well at least in the
> > + catastrophic case of garbage getting written into the inode
> > + table, since the journal replay often will "repair" the
> > + garbage that was written into the filesystem metadata blocks.
> > + It won't do a bit of good for the data blocks, of course
> > + (unless you are using data=journal mode). But this means that
> > + in fact, ext3 is more resistant to surviving failures to the
> > + first problem (powerfail while writing can damage old data on
> > + a failed write) but fortunately, hard drives generally don't
> > + cause collateral damage on a failed write.
>
> Possible rewording of this paragraph:
>
> Ext3 handles trash getting written into sectors during powerfail
> surprisingly well. It's not foolproof, but it is resilient. Incomplete
> journal entries are ignored, and journal replay of complete entries will
> often "repair" garbage written into the inode table. The data=journal
> option extends this behavior to file and directory data blocks as well
> (without which your dentries can still be badly corrupted by a power fail
> during a write).
>
> (I'm not entirely sure about that last bit, but clarifying it one way or the
> other would be nice because I can't tell from reading it which it is. My
> _guess_ is that directories are just treated as files with an attitude and an
> extra caching layer...?)

Thanks, applied; it looks better than what I wrote. I removed the
parenthesized part, as I'm not sure about it...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-25 18:45:20

by Jan Kara

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Mon 24-08-09 23:33:12, Pavel Machek wrote:
> On Mon 2009-08-24 16:11:08, Rob Landley wrote:
> > On Monday 24 August 2009 04:31:43 Pavel Machek wrote:
> > > Running a journaling filesystem such as ext3 over a flash disk or a
> > > degraded RAID array is a bad idea: journaling guarantees no longer
> > > apply and you will get data corruption on powerfail.
> > >
> > > We can't solve it easily, but we should certainly warn the users. I
> > > actually lost data because I did not understand these limitations...
> > >
> > > Signed-off-by: Pavel Machek <[email protected]>
> >
> > Acked-by: Rob Landley <[email protected]>
> >
> > With a couple comments:
> >
> > > +* write caching is disabled. ext2 does not know how to issue barriers
> > > + as of 2.6.28. hdparm -W0 disables it on SATA disks.
> >
> > It's coming up on 2.6.31; has it learned anything since, or should that
> > version number be bumped?
>
> Jan, did those "barrier for ext2" patches get merged?
No, they did not. We were discussing how to make it possible to enable /
disable sending barriers; someone said he'd implement it, but it somehow
never got beyond an initial attempt.
Actually, after the recent sync cleanups (and once my O_SYNC cleanups get
merged) it should be pretty easy, because every filesystem now has ->fsync()
and ->sync_fs() callbacks, so we just have to add barrier sending to these
two functions and implement a way to specify via sysfs that barriers on the
block device should be ignored.
I've put it on my todo list, but if someone else has time for this, I
certainly would not mind :). It would be a nice beginner project...
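Jan's proposal can be modelled roughly as below: send a barrier from ->fsync() and ->sync_fs(), plus a per-device sysfs switch to ignore barriers. This is a hypothetical user-space Python sketch, not kernel code. BlockDevice, Ext2Like, issue_flush(), ignore_barriers and write_cache are invented names. A barrier is modelled simply as a flush of the drive's volatile write cache.

```python
# Hypothetical model of the proposal; none of these names are kernel APIs.
# A "barrier" is simplified to a flush of the drive's volatile write cache.


class BlockDevice:
    def __init__(self, write_cache=True):
        self.write_cache = write_cache  # False models "hdparm -W0"
        self.ignore_barriers = False    # stands in for the proposed sysfs knob
        self.volatile_cache = []        # acknowledged but not yet durable
        self.media = {}                 # what survives a power failure
        self.flushes = 0

    def write(self, block_nr, data):
        if self.write_cache:
            self.volatile_cache.append((block_nr, data))
        else:
            self.media[block_nr] = data  # write-through: durable at once

    def issue_flush(self):
        if self.ignore_barriers:
            return                       # administrator opted out
        self.flushes += 1
        for block_nr, data in self.volatile_cache:
            self.media[block_nr] = data
        self.volatile_cache.clear()

    def power_fail(self):
        self.volatile_cache.clear()      # cached writes are simply lost


class Ext2Like:
    """Filesystem whose sync callbacks end with a barrier."""

    def __init__(self, bdev):
        self.bdev = bdev

    def write_block(self, block_nr, data):
        self.bdev.write(block_nr, data)

    def fsync(self):
        # A real ->fsync() would first write out the file's dirty pages and
        # inode; here write_block() has already submitted them.
        self.bdev.issue_flush()

    def sync_fs(self):
        self.bdev.issue_flush()


dev = BlockDevice()
fs = Ext2Like(dev)
fs.write_block(1, b"a")
fs.fsync()                 # block 1 is now on the media
fs.write_block(2, b"b")    # block 2 only reaches the volatile cache
dev.power_fail()
print(dev.media)
```

In this model, the write_cache flag shows why the patch's advice to disable write caching with hdparm -W0 works even without barriers. It also shows that the ignore-barriers switch is only safe on a device whose cache is off or non-volatile.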

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR