2009-08-24 13:21:11

by Greg Freemyer

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Mon, Aug 24, 2009 at 5:31 AM, Pavel Machek<[email protected]> wrote:
>
> Running journaling filesystem such as ext3 over flashdisk or degraded
> RAID array is a bad idea: journaling guarantees no longer apply and
> you will get data corruption on powerfail.
>
> We can't solve it easily, but we should certainly warn the users. I
> actually lost data because I did not understand these limitations...
>
> Signed-off-by: Pavel Machek <[email protected]>
>
> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
> new file mode 100644
> index 0000000..80fa886
> --- /dev/null
> +++ b/Documentation/filesystems/expectations.txt
> @@ -0,0 +1,52 @@
> +Linux block-backed filesystems can only work correctly when several
> +conditions are met in the block layer and below (disks, flash
> +cards). Some of them are obvious ("data on media should not change
> +randomly"), some are less so.
> +
> +Write errors not allowed (NO-WRITE-ERRORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Writes to media never fail. Even if disk returns error condition
> +during write, filesystems can't handle that correctly.
> +
> +	Fortunately writes failing are very uncommon on traditional
> +	spinning disks, as they have spare sectors they use when write
> +	fails.
> +
> +Don't cause collateral damage to adjacent sectors on a failed write (NO-COLLATERALS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Unfortunately, cheap USB/SD flash cards I've seen do have this bug,
> +and are thus unsuitable for all filesystems I know.
> +
> +	An inherent problem with using flash as a normal block device
> +	is that the flash erase size is bigger than most filesystem
> +	sector sizes.  So when you request a write, it may erase and
> +	rewrite some 64k, 128k, or even a couple megabytes on the
> +	really _big_ ones.
> +
> +	If you lose power in the middle of that, filesystem won't
> +	notice that data in the "sectors" _around_ the one you were
> +	trying to write to got trashed.
> +
> +	RAID-4/5/6 in degraded mode has same problem.
> +
> +
> +Don't damage the old data on a failed write (ATOMIC-WRITES)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> +	Because RAM tends to fail faster than rest of system during
> +	powerfail, special hw killing DMA transfers may be necessary;
> +	otherwise, disks may write garbage during powerfail.
> +	This may be quite common on generic PC machines.
> +
> +	Note that atomic write is very hard to guarantee for RAID-4/5/6,
> +	because it needs to write both changed data, and parity, to
> +	different disks. (But it will only really show up in degraded mode).
> +	UPS for RAID array should help.

Can someone clarify whether this is true in raid-6 with just a single
disk failure? I don't see why it would be.

And if not, can the above text be changed to reflect that raid 4/5 with
a single disk failure and raid 6 with a double disk failure are the
modes that have atomicity problems?

Greg


2009-08-24 18:44:21

by Pavel Machek

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible


> > +Either whole sector is correctly written or nothing is written during
> > +powerfail.
> > +
> > +	Because RAM tends to fail faster than rest of system during
> > +	powerfail, special hw killing DMA transfers may be necessary;
> > +	otherwise, disks may write garbage during powerfail.
> > +	This may be quite common on generic PC machines.
> > +
> > +	Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > +	because it needs to write both changed data, and parity, to
> > +	different disks. (But it will only really show up in degraded mode).
> > +	UPS for RAID array should help.
>
> Can someone clarify if this is true in raid-6 with just a single disk
> failure? I don't see why it would be.
>
> And if not can the above text be changed to reflect raid 4/5 with a
> single disk failure and raid 6 with a double disk failure are the
> modes that have atomicity problems.

I don't know enough about raid-6, but... I said "degraded mode" above,
and you can read it as double failure in the raid-6 case ;-). I'd
prefer to avoid too many details here.
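
For illustration only, here is a minimal Python sketch of the
single-parity (RAID-4/5) case from the quoted text. The 3-disk layout,
the 4-byte blocks and all names are made up for the example, and it
does not model RAID-6's second parity, so it says nothing about the
raid-6 single-failure question. It only shows why a data write and a
parity write that are torn apart by a power failure can corrupt a block
that was never written, once the array is degraded.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks (single-parity arithmetic)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


# One stripe of a hypothetical 3-disk array: two data blocks plus parity.
d0 = b"AAAA"
d1 = b"BBBB"
parity = xor_blocks(d0, d1)

# Degraded mode: the disk holding d1 has failed, so d1 now exists only
# implicitly, as d0 XOR parity.
assert xor_blocks(d0, parity) == d1

# Update d0. New data and new parity go to different disks, so the two
# writes are not atomic with respect to power loss.
new_d0 = b"CCCC"
new_parity = xor_blocks(new_d0, d1)

# Case 1: both writes reach the disks; d1 is still recoverable.
recovered_ok = xor_blocks(new_d0, new_parity)

# Case 2: power fails after the data write but before the parity write.
# Reconstructing d1 from new data and stale parity yields garbage, even
# though d1 itself was never the target of a write.
recovered_torn = xor_blocks(new_d0, parity)

print("both writes completed:", recovered_ok == d1)
print("torn write:           ", recovered_torn == d1)
```

With all disks present the stale parity can be recomputed from the data
blocks after the crash, which is why the problem only really shows up
in degraded mode.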

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html