2009-08-25 00:06:48

by Ric Wheeler

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

Pavel Machek wrote:
> On Mon 2009-08-24 18:39:15, Theodore Tso wrote:
>
>> On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote:
>>
>>>> I have to admit that I have not paid enough attention to this specifics
>>>> of your ext3 + flash card issue - is it the ftl stuff doing out of order
>>>> IO's?
>>>>
>>> The problem is that flash cards destroy whole erase block on unplug,
>>> and ext3 can't cope with that.
>>>
>> Sure --- but name **any** filesystem that can deal with the fact that
>> 128k or 256k worth of data might disappear when you pull out the flash
>> card while it is writing a single sector?
>>
>
> First... I consider myself quite competent at the OS level, yet I did
> not realize what flash does and what that means for data
> integrity. That means we need some documentation, or maybe we should
> refuse to mount those devices r/w or something.
>
> Then to answer your question... ext2. You expect to run fsck after
> unclean shutdown, and you expect to have to solve some problems with
> it. So the way ext2 deals with the flash media actually matches what
> the user expects. (*)
>
> OTOH in ext3 case you expect consistent filesystem after unplug; and
> you don't get that.
>
>
>>>> Your statement is overly broad - ext3 on a commercial RAID array that
>>>> does RAID5 or RAID6, etc has no issues that I know of.
>>>>
>>> If your commercial RAID array is battery backed, maybe. But I was
>>> talking Linux MD here.
>>>
> ...
>
>> If your concern is that with Linux MD, you could potentially lose an
>> entire stripe in RAID 5 mode, then you should say that explicitly; but
>> again, this isn't a filesystem-specific claim; it's true for all
>> filesystems. I don't know of any file system that can survive having
>> a RAID stripe-shaped-hole blown into the middle of it due to a power
>> failure.
>>
>
> Again, ext2 handles that in the way the user expects.
>
> At least I was taught "ext2 needs fsck after powerfail; ext3 can
> handle powerfails just ok".
>
>

So, would you be happy if ext3 fsck was always run on reboot (at least
for flash devices)?

ric

>> I'll note, BTW, that AIX uses a journal to protect against these sorts
>> of problems with software raid; this also means that with AIX, you
>> also don't have to rebuild a RAID 1 device after an unclean shutdown,
>> like you have to do with Linux MD. This was on the EVMS team's
>> development list to implement for Linux, but it got canned after LVM
>> won out, lo those many years ago. C'est la vie; but it's a problem which
>> is solvable at the RAID layer, and which is traditionally and
>> historically solved in competent RAID implementations.
>>
>
> Yep, we should add journal to RAID; or at least write "Linux MD
> *needs* an UPS" in big and bold letters. I'm trying to do the second
> part.
>
> (Attached is current version of the patch).
>
> [If you'd prefer patch saying that MMC/USB flash/Linux MD arrays are
> generally unsafe to use without UPS/reliable connection/no kernel
> bugs... then I may try to push that. I was not sure... maybe some
> filesystem _can_ handle this kind of issues?]
>
> Pavel
>
> (*) Ok, now... user expects to run fsck, but very advanced users may
> not expect old data to be damaged. Certainly I was not an advanced
> enough user a few months ago.
>
> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
> new file mode 100644
> index 0000000..d1ef4d0
> --- /dev/null
> +++ b/Documentation/filesystems/expectations.txt
> @@ -0,0 +1,57 @@
> +Linux block-backed filesystems can only work correctly when several
> +conditions are met in the block layer and below (disks, flash
> +cards). Some of them are obvious ("data on media should not change
> +randomly"), some are less so. Not all filesystems require all of these
> +to be satisfied for safe operation.
> +
> +Write errors not allowed (NO-WRITE-ERRORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Writes to media never fail. Even if the disk returns an error
> +condition during a write, filesystems can't handle that correctly.
> +
> + Fortunately writes failing are very uncommon on traditional
> + spinning disks, as they have spare sectors they use when write
> + fails.
> +
> +Don't cause collateral damage on a failed write (NO-COLLATERALS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +On some storage systems, failed write (for example due to power
> +failure) kills data in adjacent (or maybe unrelated) sectors.
> +
> +Unfortunately, cheap USB/SD flash cards I've seen do have this bug,
> +and are thus unsuitable for all filesystems I know.
> +
> + An inherent problem with using flash as a normal block device
> + is that the flash erase size is bigger than most filesystem
> + sector sizes. So when you request a write, it may erase and
> + rewrite some 64k, 128k, or even a couple megabytes on the
> + really _big_ ones.
> +
> + If you lose power in the middle of that, the filesystem won't
> + notice that data in the "sectors" _around_ the one you were
> + trying to write to got trashed.
> +
> + MD RAID-4/5/6 in degraded mode has a similar problem; stripes
> + behave similarly to eraseblocks.
> +
> +
> +Don't damage the old data on a powerfail (ATOMIC-WRITES-ON-POWERFAIL)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either the whole sector is correctly written or nothing is written
> +during a powerfail.
> +
> + Because RAM tends to fail faster than the rest of the system
> + during a powerfail, special hw that kills DMA transfers may be
> + necessary; otherwise, disks may write garbage during a powerfail.
> + This may be quite common on generic PC machines.
> +
> + Note that an atomic write is very hard to guarantee for MD
> + RAID-4/5/6, because it needs to write both the changed data and the
> + parity to different disks. (But this will only really show up in
> + degraded mode.) A UPS for the RAID array should help.
> +
> +
> +
> diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
> index 67639f9..ef9ff0f 100644
> --- a/Documentation/filesystems/ext2.txt
> +++ b/Documentation/filesystems/ext2.txt
> @@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they
> have to be 8 character filenames, even then we are fairly close to
> running out of unique filenames.
>
> +Requirements
> +============
> +
> +Ext2 expects the disk/storage subsystem to behave sanely. On a sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* write errors not allowed (NO-WRITE-ERRORS)
> +
> +* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL)
> +
> +and obviously:
> +
> +* don't cause collateral damage to adjacent sectors on a failed write
> + (NO-COLLATERALS)
> +
> +(see expectations.txt; note that most/all linux block-based
> +filesystems have similar expectations)
> +
> +* write caching is disabled. ext2 does not know how to issue barriers
> + as of 2.6.28. hdparm -W0 disables it on SATA disks.
> +
> Journaling
> -----------
> -
> -A journaling extension to the ext2 code has been developed by Stephen
> -Tweedie. It avoids the risks of metadata corruption and the need to
> -wait for e2fsck to complete after a crash, without requiring a change
> -to the on-disk ext2 layout. In a nutshell, the journal is a regular
> -file which stores whole metadata (and optionally data) blocks that have
> -been modified, prior to writing them into the filesystem. This means
> -it is possible to add a journal to an existing ext2 filesystem without
> -the need for data conversion.
> -
> -When changes to the filesystem (e.g. a file is renamed) they are stored in
> -a transaction in the journal and can either be complete or incomplete at
> -the time of a crash. If a transaction is complete at the time of a crash
> -(or in the normal case where the system does not crash), then any blocks
> -in that transaction are guaranteed to represent a valid filesystem state,
> -and are copied into the filesystem. If a transaction is incomplete at
> -the time of the crash, then there is no guarantee of consistency for
> -the blocks in that transaction so they are discarded (which means any
> -filesystem changes they represent are also lost).
> +==========
> Check Documentation/filesystems/ext3.txt if you want to read more about
> ext3 and journaling.
>
> diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
> index 570f9bd..752f4b4 100644
> --- a/Documentation/filesystems/ext3.txt
> +++ b/Documentation/filesystems/ext3.txt
> @@ -199,6 +202,43 @@ debugfs: ext2 and ext3 file system debugger.
> ext2online: online (mounted) ext2 and ext3 filesystem resizer
>
>
> +Requirements
> +============
> +
> +Ext3 expects the disk/storage subsystem to behave sanely. On a sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* write errors not allowed (NO-WRITE-ERRORS)
> +
> +* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL)
> +
> + Ext3 handles trash getting written into sectors during powerfail
> + surprisingly well. It's not foolproof, but it is resilient.
> + Incomplete journal entries are ignored, and journal replay of
> + complete entries will often "repair" garbage written into the inode
> + table. The data=journal option extends this behavior to file and
> + directory data blocks as well.
> +
> +
> +and obviously:
> +
> +* don't cause collateral damage to adjacent sectors on a failed write
> + (NO-COLLATERALS)
> +
> +
> +(see expectations.txt; note that most/all linux block-based
> +filesystems have similar expectations)
> +
> +* either write caching is disabled, or hw can do barriers and they are enabled.
> +
> + (Note that barriers are disabled by default; use the "barrier=1"
> + mount option after making sure the hw can support them.)
> +
> + hdparm -I reports disk features. "Native Command Queueing"
> + is the feature you are looking for.
> +
> +
> References
> ==========
>
>
>



2009-08-25 09:34:15

by Pavel Machek

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

Hi!

>>> If your concern is that with Linux MD, you could potentially lose an
>>> entire stripe in RAID 5 mode, then you should say that explicitly; but
>>> again, this isn't a filesystem specific cliam; it's true for all
>>> filesystems. I don't know of any file system that can survive having
>>> a RAID stripe-shaped-hole blown into the middle of it due to a power
>>> failure.
>>>
>>
>> Again, ext2 handles that in a way user expects it.
>>
>> At least I was teached "ext2 needs fsck after powerfail; ext3 can
>> handle powerfails just ok".
>
> So, would you be happy if ext3 fsck was always run on reboot (at least
> for flash devices)?

For flash devices, MD Raid 5 and anything else that needs it; yes that
would make me happy ;-).

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-25 15:34:46

by David Lang

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Tue, 25 Aug 2009, Pavel Machek wrote:

> Hi!
>
>>>> If your concern is that with Linux MD, you could potentially lose an
>>>> entire stripe in RAID 5 mode, then you should say that explicitly; but
>>>> again, this isn't a filesystem specific cliam; it's true for all
>>>> filesystems. I don't know of any file system that can survive having
>>>> a RAID stripe-shaped-hole blown into the middle of it due to a power
>>>> failure.
>>>>
>>>
>>> Again, ext2 handles that in a way user expects it.
>>>
>>> At least I was teached "ext2 needs fsck after powerfail; ext3 can
>>> handle powerfails just ok".
>>
>> So, would you be happy if ext3 fsck was always run on reboot (at least
>> for flash devices)?
>
> For flash devices, MD Raid 5 and anything else that needs it; yes that
> would make me happy ;-).

the thing is that fsck would not fix the problem.

it may (if the data lost was metadata) detect the problem and tell you how
many files you have lost, but if the data lost was all in a data file you
would not detect it with a fsck

the only way you would detect the missing data is to read all the files on
the filesystem and detect that the data you are reading is wrong.

but how can you tell if the data you are reading is wrong?

on a flash drive, your read can return garbage, but how do you know that
garbage isn't the contents of the file?

on a degraded raid5 array you have no way to test data integrity, so when
the missing drive is replaced, the rebuild algorithm will calculate the
appropriate data to make the parity calculations work out and write
garbage to that drive.
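
as a toy illustration (a made-up two-data-disk layout in plain C, not
anything from the md code), the rebuild can only make the parity
consistent again; it has no way to notice the garbage:

#include <stdio.h>

#define CHUNK 4                         /* bytes per chunk in this toy model */

int main(void)
{
    unsigned char d0[CHUNK] = { 'F', 'I', 'L', 'E' };  /* data disk 0 */
    unsigned char d1[CHUNK] = { 'D', 'A', 'T', 'A' };  /* data disk 1 */
    unsigned char p[CHUNK], rebuilt[CHUNK];
    int i;

    /* RAID-5 parity is the XOR of the data chunks */
    for (i = 0; i < CHUNK; i++)
        p[i] = d0[i] ^ d1[i];

    /* disk 1 dies; an interrupted write then trashes part of disk 0 */
    d0[0] = 0x00;
    d0[1] = 0xff;

    /* the rebuild recomputes d1 so the parity works out again; it cannot
     * tell that d0 now holds garbage, so the "recovered" d1 is garbage too */
    for (i = 0; i < CHUNK; i++)
        rebuilt[i] = d0[i] ^ p[i];

    for (i = 0; i < CHUNK; i++)
        printf("rebuilt d1[%d] = 0x%02x (was '%c')\n", i, rebuilt[i], "DATA"[i]);
    return 0;
}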

David Lang

2009-08-26 03:32:47

by Rik van Riel

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

Pavel Machek wrote:

>> So, would you be happy if ext3 fsck was always run on reboot (at least
>> for flash devices)?
>
> For flash devices, MD Raid 5 and anything else that needs it; yes that
> would make me happy ;-).

Sorry, but that just shows your naivete.

Metadata takes up such a small part of the disk that fscking
it and finding it to be OK is absolutely no guarantee that
the data on the filesystem has not been horribly mangled.

Personally, what I care about is my data.

The metadata is just a way to get to my data, while the data
is actually important.

--
All rights reversed.

2009-08-26 11:17:52

by Pavel Machek

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Tue 2009-08-25 23:32:47, Rik van Riel wrote:
> Pavel Machek wrote:
>
>>> So, would you be happy if ext3 fsck was always run on reboot (at
>>> least for flash devices)?
>>
>> For flash devices, MD Raid 5 and anything else that needs it; yes that
>> would make me happy ;-).
>
> Sorry, but that just shows your naivete.
>
> Metadata takes up such a small part of the disk that fscking
> it and finding it to be OK is absolutely no guarantee that
> the data on the filesystem has not been horribly mangled.
>
> Personally, what I care about is my data.
>
> The metadata is just a way to get to my data, while the data
> is actually important.

Personally, I care about metadata consistency, and ext3 documentation
suggests that the journal protects its integrity. Except that it does not
on broken storage devices, and you still need to run fsck there.

How you protect your data is another question, but the ext3
documentation does not claim the journal protects it, so that's up to
the user I guess.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-26 11:29:42

by David Lang

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Wed, 26 Aug 2009, Pavel Machek wrote:

> On Tue 2009-08-25 23:32:47, Rik van Riel wrote:
>> Pavel Machek wrote:
>>
>>>> So, would you be happy if ext3 fsck was always run on reboot (at
>>>> least for flash devices)?
>>>
>>> For flash devices, MD Raid 5 and anything else that needs it; yes that
>>> would make me happy ;-).
>>
>> Sorry, but that just shows your naivete.
>>
>> Metadata takes up such a small part of the disk that fscking
>> it and finding it to be OK is absolutely no guarantee that
>> the data on the filesystem has not been horribly mangled.
>>
>> Personally, what I care about is my data.
>>
>> The metadata is just a way to get to my data, while the data
>> is actually important.
>
> Personally, I care about metadata consistency, and ext3 documentation
> suggests that journal protects its integrity. Except that it does not
> on broken storage devices, and you still need to run fsck there.

as the ext3 authors have stated many times over the years, you still need
to run fsck periodically anyway.

what the journal gives you is a reasonable chance of skipping it when the
system crashes and you want to get it back up ASAP.

David Lang

> How do you protect your data is another question, but ext3
> documentation does not claim journal to protect them, so that's up to
> the user I guess.
> Pavel
>

2009-08-26 12:28:13

by Theodore Ts'o

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Wed, Aug 26, 2009 at 01:17:52PM +0200, Pavel Machek wrote:
> > Metadata takes up such a small part of the disk that fscking
> > it and finding it to be OK is absolutely no guarantee that
> > the data on the filesystem has not been horribly mangled.
> >
> > Personally, what I care about is my data.
> >
> > The metadata is just a way to get to my data, while the data
> > is actually important.
>
> Personally, I care about metadata consistency, and ext3 documentation
> suggests that journal protects its integrity. Except that it does not
> on broken storage devices, and you still need to run fsck there.

Caring about metadata consistency and not data is just weird, I'm
sorry. I can't imagine anyone who actually *cares* about what they
have stored, whether it's digital photographs of a child taking a first
step, or their thesis research, caring more about the metadata
than the data. Giving advice that pretends that most users have that
priority is Just Wrong.

That's why what we should document is that people should avoid broken
storage devices, and give advice on how to use RAID properly. At the end
of the day, getting people to switch from ext2 to ext3 on some
misguided notion that this way, they'll know when their metadata is
safe (at least in the power failure case; but not the system hangs and
you have to reboot case), and getting them to ignore the question of
why they are using a broken storage device in the first place, is
Documentation malpractice.

- Ted

2009-08-26 13:10:28

by Pavel Machek

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible


>>> The metadata is just a way to get to my data, while the data
>>> is actually important.
>>
>> Personally, I care about metadata consistency, and ext3 documentation
>> suggests that journal protects its integrity. Except that it does not
>> on broken storage devices, and you still need to run fsck there.
>
> as the ext3 authors have stated many times over the years, you still need
> to run fsck periodicly anyway.

Where is that documented? I very much agree with that, but when suse10
switched periodic fsck off, I could not find any docs to show that it
is a bad idea.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-26 13:44:31

by David Lang

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Wed, 26 Aug 2009, Pavel Machek wrote:

>>>> The metadata is just a way to get to my data, while the data
>>>> is actually important.
>>>
>>> Personally, I care about metadata consistency, and ext3 documentation
>>> suggests that journal protects its integrity. Except that it does not
>>> on broken storage devices, and you still need to run fsck there.
>>
>> as the ext3 authors have stated many times over the years, you still need
>> to run fsck periodicly anyway.
>
> Where is that documented?

linux-kernel mailing list archives.

David Lang

> I very much agree with that, but when suse10
> switched periodic fsck off, I could not find any docs to show that it
> is bad idea.
> Pavel
>

2009-08-26 18:02:48

by Theodore Ts'o

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Wed, Aug 26, 2009 at 06:43:24AM -0700, [email protected] wrote:
>>>
>>> as the ext3 authors have stated many times over the years, you still need
>>> to run fsck periodicly anyway.
>>
>> Where is that documented?
>
> linux-kernel mailing list archives.

Probably from some 6-8 years ago, in e-mail postings that I made. My
argument has always been that PC-class hardware is crap, and it's a
Really Good Idea to periodically check the metadata because corruption
there can end up causing massive data loss. The main problem is that
doing it at reboot time really hurt system availability, and "after 20
reboots (plus or minus)" resulted in fsck checks at wildly varying
intervals depending on how often people reboot.

What I've been recommending for some time is that people use LVM, and
run fsck on a snapshot every week or two, at some convenient time when
the system load is at a minimum. There is an e2croncheck script in
the e2fsprogs sources, in the contrib directory; it's short enough
that I'll attach it here.

Is it *necessary*? In a world where hardware is perfect, no. In a
world where people don't bother buying ECC memory because it's 10%
more expensive, and PC builders use the cheapest possible parts --- I
think it's a really good idea.

- Ted

P.S. Patches so that this shell script takes a config file, and/or
parses /etc/fstab to automatically figure out which filesystems should
be checked, are greatly appreciated. Getting distros to start
including this in their e2fsprogs packaging scripts would also be
greatly appreciated.

#!/bin/sh
#
# e2croncheck -- run e2fsck automatically out of /etc/cron.weekly
#
# This script is intended to be run by the system administrator
# periodically from the command line, or to be run once a week
# or so by the cron daemon to check a mounted filesystem (normally
# the root filesystem, but it could be used to check other filesystems
# that are always mounted when the system is booted).
#
# Make sure you customize "VG" so it is your LVM volume group name,
# "VOLUME" so it is the name of the filesystem's logical volume,
# and "EMAIL" to be your e-mail address
#
# Written by Theodore Ts'o, Copyright 2007, 2008, 2009.
#
# This file may be redistributed under the terms of the
# GNU Public License, version 2.
#

VG=ssd
VOLUME=root
SNAPSIZE=100m
[email protected]

TMPFILE=`mktemp -t e2fsck.log.XXXXXXXXXX`

OPTS="-Fttv -C0"
#OPTS="-Fttv -E fragcheck"

set -e
START="$(date +'%Y%m%d%H%M%S')"
lvcreate -s -L ${SNAPSIZE} -n "${VOLUME}-snap" "${VG}/${VOLUME}"
if nice logsave -as $TMPFILE e2fsck -p $OPTS "/dev/${VG}/${VOLUME}-snap" && \
nice logsave -as $TMPFILE e2fsck -fy $OPTS "/dev/${VG}/${VOLUME}-snap" ; then
echo 'Background scrubbing succeeded!'
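# the check passed: zero the mount count and stamp the last-checked time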
tune2fs -C 0 -T "${START}" "/dev/${VG}/${VOLUME}"
else
echo 'Background scrubbing failed! Reboot to fsck soon!'
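# the check failed: fake a huge mount count and an ancient last-check date
# so that a full fsck is forced at the next reboot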
tune2fs -C 16000 -T "19000101" "/dev/${VG}/${VOLUME}"
if test -n "$EMAIL"; then
mail -s "E2fsck of /dev/${VG}/${VOLUME} failed!" $EMAIL < $TMPFILE
fi
fi
lvremove -f "${VG}/${VOLUME}-snap"
rm $TMPFILE


2009-08-27 05:27:37

by Rob Landley

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Tuesday 25 August 2009 22:32:47 Rik van Riel wrote:
> Pavel Machek wrote:
> >> So, would you be happy if ext3 fsck was always run on reboot (at least
> >> for flash devices)?
> >
> > For flash devices, MD Raid 5 and anything else that needs it; yes that
> > would make me happy ;-).
>
> Sorry, but that just shows your naivete.

Hence wanting documentation properly explaining the situation, yes.

Often the people writing the documentation aren't the people who know the most
about the situation, but the people who found out they NEED said
documentation, and post errors until they get sufficient corrections.

In which case "you're wrong, it's actually _this_" is helpful, and "you're
wrong, go away and stop bothering us grown-ups" isn't.

> Metadata takes up such a small part of the disk that fscking
> it and finding it to be OK is absolutely no guarantee that
> the data on the filesystem has not been horribly mangled.
>
> Personally, what I care about is my data.
>
> The metadata is just a way to get to my data, while the data
> is actually important.

Are you saying ext3 should default to data=journal then?

It seems that the default journaling only handles the metadata, and people
seem to think that journaled filesystems exist for a reason.

There seems to be a lot of "the guarantees you think a journal provides aren't
worth anything, so the fact there are circumstances under which it doesn't
provide them isn't worth telling anybody about" in this thread. So we
shouldn't bother with journaled filesystems? I'm not sure what the intended
argument is here...

I have no clue what the finished documentation on this issue should look like
either. But I want to read it.

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

2009-08-27 06:06:13

by Rob Landley

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Wednesday 26 August 2009 07:28:13 Theodore Tso wrote:
> On Wed, Aug 26, 2009 at 01:17:52PM +0200, Pavel Machek wrote:
> > > Metadata takes up such a small part of the disk that fscking
> > > it and finding it to be OK is absolutely no guarantee that
> > > the data on the filesystem has not been horribly mangled.
> > >
> > > Personally, what I care about is my data.
> > >
> > > The metadata is just a way to get to my data, while the data
> > > is actually important.
> >
> > Personally, I care about metadata consistency, and ext3 documentation
> > suggests that journal protects its integrity. Except that it does not
> > on broken storage devices, and you still need to run fsck there.
>
> Caring about metadata consistency and not data is just weird, I'm
> sorry. I can't imagine anyone who actually *cares* about what they
> have stored, whether it's digital photographs of child taking a first
> step, or their thesis research, caring about more about the metadata
> than the data. Giving advice that pretends that most users have that
> priority is Just Wrong.

I thought the reason for that was that if your metadata is horked, further
writes to the disk can trash unrelated existing data because it's lost track
of what's allocated and what isn't. So back when the assumption was "what's
written stays written", then keeping the metadata sane was still darn
important to prevent normal operation from overwriting unrelated existing
data.

Then Pavel notified us of a situation where interrupted writes to the disk can
trash unrelated existing data _anyway_, because the flash block size on the 16
gig flash key I bought retail at Fry's is 2 megabytes, and the filesystem thinks
it's 4k or smaller. It seems like what _broke_ was the assumption that the
filesystem block size >= the disk block size, and nobody noticed for a while.
(Except the people making jffs2 and friends, anyway.)

Today we have cheap plentiful USB keys that act like hard drives, except that
their write block size isn't remotely the same as hard drives', but they
pretend it is, and then the block wear levelling algorithms fuzz things
further. (Gee, a drive controller lying about drive geometry, the scsi crowd
should feel right at home.)

Now Pavel's coming back with a second situation where RAID stripes (under
certain circumstances) seem to have similar granularity issues, again breaking
what seems to be the same assumption. Big media use big chunks for data, and
media is getting bigger. It doesn't seem like this problem is going to
diminish in future.

I agree that it seems like a good idea to have BIG RED WARNING SIGNS about
those kind of media and how _any_ journaling filesystem doesn't really help
here. So specifically documenting "These kinds of media lose unrelated random
data if writes to them are interrupted, journaling filesystems can't help with
this and may actually hide the problem, and even an fsck will only find
corrupted metadata not lost file contents" seems kind of useful.

That said, ext3's assumption that filesystem block size always >= disk update
block size _is_ a fundamental part of this problem, and one that isn't shared
by things like jffs2, and which things like btrfs might be able to address if
they try, by adding awareness of the real media update granularity to their
node layout algorithms. (Heck, ext2 has a stripe size parameter already.
Does setting that appropriately for your raid make this suck less? I haven't
heard anybody comment on that one yet...)

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

2009-08-27 06:28:07

by Eric Sandeen

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

Theodore Tso wrote:
> On Wed, Aug 26, 2009 at 06:43:24AM -0700, [email protected] wrote:
>>>> as the ext3 authors have stated many times over the years, you still need
>>>> to run fsck periodicly anyway.
>>> Where is that documented?
>> linux-kernel mailing list archives.
>
> Probably from some 6-8 years ago, in e-mail postings that I made. My
> argument has always been that PC-class hardware is crap, and it's a
> Really Good Idea to periodically check the metadata because corruption
> there can end up causing massive data loss. The main problem is that
> doing it at reboot time really hurt system availability, and "after 20
> reboots (plus or minus)" resulted in fsck checks at wildly varying
> intervals depending on how often people reboot.

Aside ... can we default mkfs.ext3 to not set a mandatory fsck interval
then? :)

-Eric

> What I've been recommending for some time is that people use LVM, and
> run fsck on a snapshot every week or two, at some convenient time when
> the system load is at a minimum. There is an e2croncheck script in
> the e2fsprogs sources, in the contrib directory; it's short enough
> that I'll attach here here.
>
> Is it *necessary*? In a world where hardware is perfect, no. In a
> world where people don't bother buying ECC memory because it's 10%
> more expensive, and PC builders use the cheapest possible parts --- I
> think it's a really good idea.
>
> - Ted

2009-08-27 06:54:30

by David Lang

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Thu, 27 Aug 2009, Rob Landley wrote:

> On Wednesday 26 August 2009 07:28:13 Theodore Tso wrote:
>> On Wed, Aug 26, 2009 at 01:17:52PM +0200, Pavel Machek wrote:
>>>> Metadata takes up such a small part of the disk that fscking
>>>> it and finding it to be OK is absolutely no guarantee that
>>>> the data on the filesystem has not been horribly mangled.
>>>>
>>>> Personally, what I care about is my data.
>>>>
>>>> The metadata is just a way to get to my data, while the data
>>>> is actually important.
>>>
>>> Personally, I care about metadata consistency, and ext3 documentation
>>> suggests that journal protects its integrity. Except that it does not
>>> on broken storage devices, and you still need to run fsck there.
>>
>> Caring about metadata consistency and not data is just weird, I'm
>> sorry. I can't imagine anyone who actually *cares* about what they
>> have stored, whether it's digital photographs of child taking a first
>> step, or their thesis research, caring about more about the metadata
>> than the data. Giving advice that pretends that most users have that
>> priority is Just Wrong.
>
> I thought the reason for that was that if your metadata is horked, further
> writes to the disk can trash unrelated existing data because it's lost track
> of what's allocated and what isn't. So back when the assumption was "what's
> written stays written", then keeping the metadata sane was still darn
> important to prevent normal operation from overwriting unrelated existing
> data.
>
> Then Pavel notified us of a situation where interrupted writes to the disk can
> trash unrelated existing data _anyway_, because the flash block size on the 16
> gig flash key I bought retail at Fry's is 2 megabytes, and the filesystem thinks
> it's 4k or smaller. It seems like what _broke_ was the assumption that the
> filesystem block size >= the disk block size, and nobody noticed for a while.
> (Except the people making jffs2 and friends, anyway.)
>
> Today we have cheap plentiful USB keys that act like hard drives, except that
> their write block size isn't remotely the same as hard drives', but they
> pretend it is, and then the block wear levelling algorithms fuzz things
> further. (Gee, a drive controller lying about drive geometry, the scsi crowd
> should feel right at home.)

actually, you don't know if your USB key works that way or not. Pavel has
some that do; that doesn't mean that all flash drives do

when you do a write to a flash drive you have to do the following items

1. allocate an empty eraseblock to put the data on

2. read the old eraseblock

3. merge the incoming write to the eraseblock

4. write the updated data to the flash

5. update the flash translation layer to point reads at the new location
instead of the old location.

now if the flash drive does things in this order you will not lose any
previously written data.

if the flash drive does step 5 before it does step 4, then you have a
window where a crash can lose data (and no, btrfs won't survive having a
large chunk of data just disappear any better)

it's possible that some super-cheap flash drives skip having a flash
translation layer entirely, on those the process would be

1. read the old data into ram

2. merge the new write into the data in ram

3. erase the old data

4. write the new data

this obviously has a significant data loss window.
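
to make the difference concrete, here is a toy C model of the two
orderings (a pretend two-eraseblock device, nothing like real firmware),
with a simulated power cut in the middle of the write:

#include <stdio.h>
#include <string.h>

#define EB_SIZE 8                       /* bytes per "eraseblock" in this toy */

struct flash {
    char blk[2][EB_SIZE + 1];           /* two physical eraseblocks           */
    int  mapped;                        /* FTL: which block holds the data    */
};

/* erase = wipe the block (real flash fills it with 0xff; '.' prints nicer) */
static void erase(struct flash *f, int pb) { memset(f->blk[pb], '.', EB_SIZE); }

/* safe ordering (the five steps listed earlier): write the merged data to a
 * spare eraseblock first, then flip the translation-layer pointer; a power
 * cut at any point leaves either the old or the new data fully readable */
static void write_safe(struct flash *f, const char *data, int power_cut)
{
    int spare = !f->mapped;
    erase(f, spare);
    memcpy(f->blk[spare], data, EB_SIZE);       /* step 4 */
    if (power_cut) return;                      /* old block is still mapped  */
    f->mapped = spare;                          /* step 5 */
}

/* cheap ordering: erase the mapped block in place, then rewrite it; a
 * power cut between the erase and the rewrite loses the whole eraseblock,
 * including "sectors" the caller never touched */
static void write_cheap(struct flash *f, const char *data, int power_cut)
{
    erase(f, f->mapped);
    if (power_cut) return;                      /* everything is gone         */
    memcpy(f->blk[f->mapped], data, EB_SIZE);
}

int main(void)
{
    struct flash a = { { "old-data", "........" }, 0 };
    struct flash b = { { "old-data", "........" }, 0 };

    write_safe(&a, "new-data", 1);              /* 1 = power cut mid-write */
    write_cheap(&b, "new-data", 1);

    printf("safe  ordering after power cut: %s\n", a.blk[a.mapped]);
    printf("cheap ordering after power cut: %s\n", b.blk[b.mapped]);
    return 0;
}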

but if the device doesn't have a flash translation layer, then repeated
writes to any one sector will kill the drive fairly quickly. (for example,
updates to the FAT would kill the sectors that the FAT, journal, root
directory, or superblock live in, because every change to the disk
requires an update to the FAT)

> Now Pavel's coming back with a second situation where RAID stripes (under
> certain circumstances) seem to have similar granularity issues, again breaking
> what seems to be the same assumption. Big media use big chunks for data, and
> media is getting bigger. It doesn't seem like this problem is going to
> diminish in future.
>
> I agree that it seems like a good idea to have BIG RED WARNING SIGNS about
> those kind of media and how _any_ journaling filesystem doesn't really help
> here. So specifically documenting "These kinds of media lose unrelated random
> data if writes to them are interrupted, journaling filesystems can't help with
> this and may actually hide the problem, and even an fsck will only find
> corrupted metadata not lost file contents" seems kind of useful.

I think an update to the documentation is a good thing (especially after
learning that a raid 6 array that has lost a single disk can still be
corrupted during a powerfail situation), but I also agree that Pavel's
wording is not detailed enough

> That said, ext3's assumption that filesystem block size always >= disk update
> block size _is_ a fundamental part of this problem, and one that isn't shared
> by things like jffs2, and which things like btrfs might be able to address if
> they try, by adding awareness of the real media update granularity to their
> node layout algorithms. (Heck, ext2 has a stripe size parameter already.
> Does setting that appropriately for your raid make this suck less? I haven't
> heard anybody comment on that one yet...)

I thought that that assumption was in the VFS layer, not in any particular
filesystem


David Lang

2009-08-27 07:34:42

by Rob Landley

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Thursday 27 August 2009 01:54:30 [email protected] wrote:
> On Thu, 27 Aug 2009, Rob Landley wrote:
> > On Wednesday 26 August 2009 07:28:13 Theodore Tso wrote:
> >> On Wed, Aug 26, 2009 at 01:17:52PM +0200, Pavel Machek wrote:
> >>>> Metadata takes up such a small part of the disk that fscking
> >>>> it and finding it to be OK is absolutely no guarantee that
> >>>> the data on the filesystem has not been horribly mangled.
> >>>>
> >>>> Personally, what I care about is my data.
> >>>>
> >>>> The metadata is just a way to get to my data, while the data
> >>>> is actually important.
> >>>
> >>> Personally, I care about metadata consistency, and ext3 documentation
> >>> suggests that journal protects its integrity. Except that it does not
> >>> on broken storage devices, and you still need to run fsck there.
> >>
> >> Caring about metadata consistency and not data is just weird, I'm
> >> sorry. I can't imagine anyone who actually *cares* about what they
> >> have stored, whether it's digital photographs of child taking a first
> >> step, or their thesis research, caring about more about the metadata
> >> than the data. Giving advice that pretends that most users have that
> >> priority is Just Wrong.
> >
> > I thought the reason for that was that if your metadata is horked,
> > further writes to the disk can trash unrelated existing data because it's
> > lost track of what's allocated and what isn't. So back when the
> > assumption was "what's written stays written", then keeping the metadata
> > sane was still darn important to prevent normal operation from
> > overwriting unrelated existing data.
> >
> > Then Pavel notified us of a situation where interrupted writes to the
> > disk can trash unrelated existing data _anyway_, because the flash block
> > size on the 16 gig flash key I bought retail at Fry's is 2 megabytes, and
> > the filesystem thinks it's 4k or smaller. It seems like what _broke_ was
> > the assumption that the filesystem block size >= the disk block size, and
> > nobody noticed for a while. (Except the people making jffs2 and friends,
> > anyway.)
> >
> > Today we have cheap plentiful USB keys that act like hard drives, except
> > that their write block size isn't remotely the same as hard drives', but
> > they pretend it is, and then the block wear levelling algorithms fuzz
> > things further. (Gee, a drive controller lying about drive geometry, the
> > scsi crowd should feel right at home.)
>
> actually, you don't know if your USB key works that way or not.

Um, yes, I think I do.

> Pavel has ssome that do, that doesn't mean that all flash drives do

Pretty much all the ones that present a USB disk interface to the outside
world and thus have to do hardware levelling. Here's Valerie Aurora on
the topic:

http://valhenson.livejournal.com/25228.html

>Let's start with hardware wear-leveling. Basically, nearly all practical
> implementations of it suck. You'd imagine that it would spread out writes
> over all the blocks in the drive, only rewriting any particular block after
> every other block has been written. But I've heard from experts several
> times that hardware wear-leveling can be as dumb as a ring buffer of 12
> blocks; each time you write a block, it pulls something out of the queue
> and sticks the old block in. If you only write one block over and over,
> this means that writes will be spread out over a staggering 12 blocks! My
> direct experience working with corrupted flash with built-in wear-leveling
> is that corruption was centered around frequently written blocks (with
> interesting patterns resulting from the interleaving of blocks from
> different erase blocks). As a file systems person, I know what it takes to
> do high-quality wear-leveling: it's called a log-structured file system and
> they are non-trivial pieces of software. Your average consumer SSD is not
> going to have sufficient hardware to implement even a half-assed
> log-structured file system, so clearly it's going to be a lot stupider than
> that.

Back to you:

> when you do a write to a flash drive you have to do the following items
>
> 1. allocate an empty eraseblock to put the data on
>
> 2. read the old eraseblock
>
> 3. merge the incoming write to the eraseblock
>
> 4. write the updated data to the flash
>
> 5. update the flash trnslation layer to point reads at the new location
> instead of the old location.
>
> now if the flash drive does things in this order you will not loose any
> previously written data.

That's what something like jffs2 will do, sure. (And note that mounting those
suckers is slow while it reads the whole disk to figure out what order to put
the chunks in.)

However, your average consumer level device A) isn't very smart, B) is judged
almost entirely by price/capacity ratio and thus usually won't even hide
capacity for bad block remapping. You expect them to have significant hidden
capacity to do safer updates with when customers aren't demanding it yet?

> if the flash drive does step 5 before it does step 4, then you have a
> window where a crash can loose data (and no btrfs won't survive any better
> to have a large chunk of data just disappear)
>
> it's possible that some super-cheap flash drives

I've never seen one that presented a USB disk interface that _didn't_ do this.
(Not that this observation means much.) Neither the windows nor the Macintosh
world is calling for this yet. Even the Linux guys barely know about it. And
these are the same kinds of manufacturers that NOPed out the flush commands to
make their benchmarks look better...

> but if the device doesn't have a flash translation layer, then repeated
> writes to any one sector will kill the drive fairly quickly. (updates to
> the FAT would kill the sectors the FAT, journal, root directory, or
> superblock lives in due to the fact that every change to the disk requires
> an update to this file for example)

Yup. It's got enough of one to get past the warranty, but beyond that they're
intended for archiving and sneakernet, not for running compiles on.

> > That said, ext3's assumption that filesystem block size always >= disk
> > update block size _is_ a fundamental part of this problem, and one that
> > isn't shared by things like jffs2, and which things like btrfs might be
> > able to address if they try, by adding awareness of the real media update
> > granularity to their node layout algorithms. (Heck, ext2 has a stripe
> > size parameter already. Does setting that appropriately for your raid
> > make this suck less? I haven't heard anybody comment on that one yet...)
>
> I thought that that assumption was in the VFS layer, not in any particular
> filesystem

The VFS layer cares about how to talk to the backing store? I thought that
was the filesystem driver's job...

I wonder how jffs2 gets around it, then? (Or for that matter, squashfs...)

> David Lang

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

2009-08-28 14:38:24

by David Lang

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Thu, 27 Aug 2009, Rob Landley wrote:

> On Thursday 27 August 2009 01:54:30 [email protected] wrote:
>> On Thu, 27 Aug 2009, Rob Landley wrote:
>>>
>>> Today we have cheap plentiful USB keys that act like hard drives, except
>>> that their write block size isn't remotely the same as hard drives', but
>>> they pretend it is, and then the block wear levelling algorithms fuzz
>>> things further. (Gee, a drive controller lying about drive geometry, the
>>> scsi crowd should feel right at home.)
>>
>> actually, you don't know if your USB key works that way or not.
>
> Um, yes, I think I do.
>
>> Pavel has ssome that do, that doesn't mean that all flash drives do
>
> Pretty much all the ones that present a USB disk interface to the outside
> world and then thus have to do hardware levelling. Here's Valerie Aurora on
> the topic:
>
> http://valhenson.livejournal.com/25228.html
>
>> Let's start with hardware wear-leveling. Basically, nearly all practical
>> implementations of it suck. You'd imagine that it would spread out writes
>> over all the blocks in the drive, only rewriting any particular block after
>> every other block has been written. But I've heard from experts several
>> times that hardware wear-leveling can be as dumb as a ring buffer of 12
>> blocks; each time you write a block, it pulls something out of the queue
>> and sticks the old block in. If you only write one block over and over,
>> this means that writes will be spread out over a staggering 12 blocks! My
>> direct experience working with corrupted flash with built-in wear-leveling
>> is that corruption was centered around frequently written blocks (with
>> interesting patterns resulting from the interleaving of blocks from
>> different erase blocks). As a file systems person, I know what it takes to
>> do high-quality wear-leveling: it's called a log-structured file system and
>> they are non-trivial pieces of software. Your average consumer SSD is not
>> going to have sufficient hardware to implement even a half-assed
>> log-structured file system, so clearly it's going to be a lot stupider than
>> that.
>
> Back to you:

I am not saying that all devices get this right (not by any means), but I
_am_ saying that devices with wear-leveling _can_ avoid this problem
entirely

you do not need to do a log-structured filesystem. all you need to do is
to always write to a new block rather than re-writing a block in place.

even if the disk only does a 12-block rotation for its wear leveling,
that is enough for it to not lose other data when you write. to lose
data you have to be updating a block in place by erasing the old one
first. _anything_ that writes the data to a new location before it erases
the old location will prevent you from losing other data.

I'm all for documenting that this problem can and does exist, but I'm not
in agreement with documentation that states that _all_ flash drives have
this problem because (with wear-leveling in a flash translation layer on
the device) it's not inherent to the technology. so even if all existing
flash devices had this problem, there could be one released tomorrow that
didn't.

this is like the problem that flash SSDs had last year that could cause
them to stall for up to a second on write-heavy workloads. it went from a
problem that almost every drive for sale had (and something that was
generally accepted as being a characteristic of SSDs), to being extinct in
about one product cycle after the problem was identified.

I think this problem will also disappear rapidly once it's publicised.

so what's needed is for someone to come up with a way to test this, let
people test the various devices, find out how broad the problem is, and
publicise the results.

personally, I expect that the better disk-replacements will not have a
problem with this.

I would also be surprised if the larger thumb drives had this problem.

if a flash eraseblock can be used 100k times, then if you use FAT on a 16G
drive and write 1M files and update the FAT after each file (like you
would with a camera), the block the FAT is on will die after filling the
device _6_ times. if it does a 12-block rotation it would die after 72
times, but if it can move the blocks around the entire device it would
take 50k times of filling the device.

for a 2G device the numbers would be 50 times with no wear-leveling and
600 times with 12-block rotation.
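
for reference, a little C sketch that reproduces those back-of-the-envelope
numbers (assuming 100k erase cycles per eraseblock, 1M files, one FAT
update per file, and the same rounding as above: fills for a single block,
times the size of the rotation pool):

#include <stdio.h>

#define ERASE_CYCLES 100000L    /* assumed endurance of a single eraseblock */

/* fills of the whole device before the eraseblock holding the FAT (or the
 * small pool it rotates through) wears out */
static long fills_until_fat_dies(long capacity_gb, long file_mb, long pool)
{
    long fat_updates_per_fill = capacity_gb * 1000 / file_mb;
    return ERASE_CYCLES / fat_updates_per_fill * pool;
}

int main(void)
{
    printf("16G, no wear leveling:  %ld fills\n", fills_until_fat_dies(16, 1, 1));
    printf("16G, 12-block rotation: %ld fills\n", fills_until_fat_dies(16, 1, 12));
    printf(" 2G, no wear leveling:  %ld fills\n", fills_until_fat_dies(2, 1, 1));
    printf(" 2G, 12-block rotation: %ld fills\n", fills_until_fat_dies(2, 1, 12));
    return 0;
}

(the full-device wear-leveling case also depends on how many eraseblocks
the device has, so it isn't reproduced here.)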

so I could see them getting away with this sort of thing for the smaller
devices, but as the thumb drives get larger, I expect that they will start
to gain the wear-leveling capabilities that the SSDs have.

>> when you do a write to a flash drive you have to do the following items
>>
>> 1. allocate an empty eraseblock to put the data on
>>
>> 2. read the old eraseblock
>>
>> 3. merge the incoming write to the eraseblock
>>
>> 4. write the updated data to the flash
>>
>> 5. update the flash trnslation layer to point reads at the new location
>> instead of the old location.
>>
>> now if the flash drive does things in this order you will not loose any
>> previously written data.
>
> That's what something like jffs2 will do, sure. (And note that mounting those
> suckers is slow while it reads the whole disk to figure out what order to put
> the chunks in.)
>
> However, your average consumer level device A) isn't very smart, B) is judged
> almost entirely by price/capacity ratio and thus usually won't even hide
> capacity for bad block remapping. You expect them to have significant hidden
> capacity to do safer updates with when customers aren't demanding it yet?

this doesn't require filesystem smarts, but it does require a device with
enough smarts to do bad-block remapping. (if it does wear leveling, all
that bad-block remapping amounts to is not writing to a bad eraseblock,
which doesn't even require maintaining a map of such blocks: all it would
have to do is check whether what is on the flash is what it intended to
write; if it is, use it, if it isn't, try again.)

>> if the flash drive does step 5 before it does step 4, then you have a
>> window where a crash can loose data (and no btrfs won't survive any better
>> to have a large chunk of data just disappear)
>>
>> it's possible that some super-cheap flash drives
>
> I've never seen one that presented a USB disk interface that _didn't_ do this.
> (Not that this observation means much.) Neither the windows nor the Macintosh
> world is calling for this yet. Even the Linux guys barely know about it. And
> these are the same kinds of manufacturers that NOPed out the flush commands to
> make their benchmarks look better...

the nature of the FAT filesystem calls for it. I've heard people talk
about devices that try to be smart enough to take extra care of the blocks
that the FAT is on

>> but if the device doesn't have a flash translation layer, then repeated
>> writes to any one sector will kill the drive fairly quickly. (updates to
>> the FAT would kill the sectors the FAT, journal, root directory, or
>> superblock lives in due to the fact that every change to the disk requires
>> an update to this file for example)
>
> Yup. It's got enough of one to get past the warantee, but beyond that they're
> intended for archiving and sneakernet, not for running compiles on.

it doesn't take them being used for compiles; using them in a camera,
media player, or phone with a FAT filesystem will exercise the FAT blocks
enough to cause problems

>>> That said, ext3's assumption that filesystem block size always >= disk
>>> update block size _is_ a fundamental part of this problem, and one that
>>> isn't shared by things like jffs2, and which things like btrfs might be
>>> able to address if they try, by adding awareness of the real media update
>>> granularity to their node layout algorithms. (Heck, ext2 has a stripe
>>> size parameter already. Does setting that appropriately for your raid
>>> make this suck less? I haven't heard anybody comment on that one yet...)
>>
>> I thought that that assumption was in the VFS layer, not in any particular
>> filesystem
>
> The VFS layer cares about how to talk to the backing store? I thought that
> was the filesystem driver's job...

I could be mistaken, but I have run into cases with filesystems where the
filesystem was designed to be able to use large blocks, but they could
only be used on specific architectures because the disk block size had to
be smaller than the page size.

> I wonder how jffs2 gets around it, then? (Or for that matter, squashfs...)

if you know where the eraseblock boundaries are, all you need to do is
submit your writes in groups of blocks corresponding to those boundaries.
there is no need to make the blocks themselves the size of the
eraseblocks.
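
as a trivial sketch of that (the 128k eraseblock size here is just an
assumption), grouping only means rounding the submitted range out to
eraseblock boundaries; the filesystem blocks themselves can stay small:

#include <stdio.h>

#define ERASEBLOCK (128 * 1024UL)       /* assumed eraseblock size */

int main(void)
{
    unsigned long off = 1000000, len = 9000;    /* an arbitrary small write */
    unsigned long start = off / ERASEBLOCK * ERASEBLOCK;
    unsigned long end   = (off + len + ERASEBLOCK - 1) / ERASEBLOCK * ERASEBLOCK;

    printf("write [%lu, %lu) -> submit as eraseblocks [%lu, %lu)\n",
           off, off + len, start, end);
    return 0;
}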

any filesystem that is doing compressed storage is going to end up dealing
with logical changes that span many different disk blocks.

I thought that squashfs was read-only (you create a filesystem image, burn
it to flash, then use it)

as I say I could be completely misunderstanding this interaction.

David Lang

2009-08-30 07:03:41

by Pavel Machek

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Wed 2009-08-26 06:43:24, [email protected] wrote:
> On Wed, 26 Aug 2009, Pavel Machek wrote:
>
>>>>> The metadata is just a way to get to my data, while the data
>>>>> is actually important.
>>>>
>>>> Personally, I care about metadata consistency, and ext3 documentation
>>>> suggests that journal protects its integrity. Except that it does not
>>>> on broken storage devices, and you still need to run fsck there.
>>>
>>> as the ext3 authors have stated many times over the years, you still need
>>> to run fsck periodicly anyway.
>>
>> Where is that documented?
>
> linux-kernel mailing list archives.

That's not where fs documentation belongs :-(.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-30 07:19:57

by Pavel Machek

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

Hi!

>> I thought the reason for that was that if your metadata is horked, further
>> writes to the disk can trash unrelated existing data because it's lost track
>> of what's allocated and what isn't. So back when the assumption was "what's
>> written stays written", then keeping the metadata sane was still darn
>> important to prevent normal operation from overwriting unrelated existing
>> data.
>>
>> Then Pavel notified us of a situation where interrupted writes to the disk can
>> trash unrelated existing data _anyway_, because the flash block size on the 16
>> gig flash key I bought retail at Fry's is 2 megabytes, and the filesystem thinks
>> it's 4k or smaller. It seems like what _broke_ was the assumption that the
>> filesystem block size >= the disk block size, and nobody noticed for a while.
>> (Except the people making jffs2 and friends, anyway.)
>>
>> Today we have cheap plentiful USB keys that act like hard drives, except that
>> their write block size isn't remotely the same as hard drives', but they
>> pretend it is, and then the block wear levelling algorithms fuzz things
>> further. (Gee, a drive controller lying about drive geometry, the scsi crowd
>> should feel right at home.)
>
> actually, you don't know if your USB key works that way or not. Pavel has
> ssome that do, that doesn't mean that all flash drives do
>
> when you do a write to a flash drive you have to do the following items
>
> 1. allocate an empty eraseblock to put the data on
>
> 2. read the old eraseblock
>
> 3. merge the incoming write to the eraseblock
>
> 4. write the updated data to the flash
>
> 5. update the flash trnslation layer to point reads at the new location
> instead of the old location.


That would need two erases per single sector written, no? Erase is in
the millisecond range, so the performance would be just way too bad :-(.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-30 12:48:52

by David Lang

Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Sun, 30 Aug 2009, Pavel Machek wrote:

>>> I thought the reason for that was that if your metadata is horked, further
>>> writes to the disk can trash unrelated existing data because it's lost track
>>> of what's allocated and what isn't. So back when the assumption was "what's
>>> written stays written", then keeping the metadata sane was still darn
>>> important to prevent normal operation from overwriting unrelated existing
>>> data.
>>>
>>> Then Pavel notified us of a situation where interrupted writes to the disk can
>>> trash unrelated existing data _anyway_, because the flash block size on the 16
>>> gig flash key I bought retail at Fry's is 2 megabytes, and the filesystem thinks
>>> it's 4k or smaller. It seems like what _broke_ was the assumption that the
>>> filesystem block size >= the disk block size, and nobody noticed for a while.
>>> (Except the people making jffs2 and friends, anyway.)
>>>
>>> Today we have cheap plentiful USB keys that act like hard drives, except that
>>> their write block size isn't remotely the same as hard drives', but they
>>> pretend it is, and then the block wear levelling algorithms fuzz things
>>> further. (Gee, a drive controller lying about drive geometry, the scsi crowd
>>> should feel right at home.)
>>
>> actually, you don't know if your USB key works that way or not. Pavel has
>> ssome that do, that doesn't mean that all flash drives do
>>
>> when you do a write to a flash drive you have to do the following items
>>
>> 1. allocate an empty eraseblock to put the data on
>>
>> 2. read the old eraseblock
>>
>> 3. merge the incoming write to the eraseblock
>>
>> 4. write the updated data to the flash
>>
>> 5. update the flash trnslation layer to point reads at the new location
>> instead of the old location.
>
>
> That would need two erases per single sector writen, no? Erase is in
> milisecond range, so the performance would be just way too bad :-(.

no, it only needs one erase

if you don't have a pool of pre-erased blocks, then you need to do an
erase of the new block you are allocating (before step 4)

if you do have a pool of pre-erased blocks, then you don't have to do any
erase of the data blocks until after step 5 and you do the erase when you
add the old data block to the pool of pre-erased blocks later.

in either case the requirements of wear leveling require that the flash
translation layer update its records to show that an additional write
took place.

what appears to be happening on some cheap devices is that they do the
following instead

1. allocate an empty eraseblock to put the data on

2. read the old eraseblock

3. merge the incoming write to the eraseblock

4. erase the old eraseblock

5. write the updated data to the flash

I don't know where in (or after) this process they update the
wear-leveling/flash translation layer info.

with this algorithm, if the device loses power between step 4 and step 5
you lose all the data on the eraseblock.

with deferred erasing of blocks, the safer algorithm is actually the
faster one (up until you run out of your pool of available eraseblocks, at
which time it slows down to the same speed as the unreliable one).

most flash drives are fairly slow to write to in any case.

even the Intel X25-M drives are in the same ballpark as rotating media for
writes. as far as I know only the X25-E SSD drives are faster to write to
than rotating media, and most of them are _far_ slower.

David Lang

2009-11-09 08:53:22

by Pavel Machek

Subject: periodic fsck was Re: [patch] ext2/3: document conditions when reliable operation is possible

On Wed 2009-08-26 14:02:48, Theodore Tso wrote:
> On Wed, Aug 26, 2009 at 06:43:24AM -0700, [email protected] wrote:
> >>>
> >>> as the ext3 authors have stated many times over the years, you still need
> >>> to run fsck periodicly anyway.
> >>
> >> Where is that documented?
> >
> > linux-kernel mailing list archives.
>
> Probably from some 6-8 years ago, in e-mail postings that I made. My
> argument has always been that PC-class hardware is crap, and it's a

Well, in SUSE11-or-so, the distro stopped periodic fscks, silently :-(. I
believed that it was a really bad idea at that point, but because I
could not find a piece of documentation recommending them, I lost the
argument.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-11-09 14:05:55

by Theodore Ts'o

Subject: Re: periodic fsck was Re: [patch] ext2/3: document conditions when reliable operation is possible

On Mon, Nov 09, 2009 at 09:53:18AM +0100, Pavel Machek wrote:
>
> Well, in SUSE11-or-so, distro stopped period fscks, silently :-(. I
> believed that it was really bad idea at that point, but because I
> could not find piece of documentation recommending them, I lost the
> argument.

It's an engineering trade-off. If you have perfect memory that
never has cosmic-ray hiccups, and hard drives that never write data to
the wrong place, etc., then you don't need periodic fscks.

If you do have imperfect hardware, the question then is how imperfect
your hardware is, and how frequently it introduces errors. If you
check too frequently, though, users get upset, especially when it
happens at the most inconvenient time (when you're trying to recover
from unscheduled downtime by rebooting); if you check too infrequently
then it doesn't help you too much since too much data gets damaged
before fsck notices.

So these days, what I strongly recommend is that people use LVM
snapshots, and schedule weekly checks during some low usage period
(i.e., 3am on Saturdays), using something like the e2croncheck shell
script.

- Ted

2009-11-09 15:58:49

by Andreas Dilger

Subject: Re: periodic fsck was Re: [patch] ext2/3: document conditions when reliable operation is possible

On 2009-11-09, at 07:05, Theodore Tso wrote:
> So these days, what I strongly recommend is that people use LVM
> snapshots, and schedule weekly checks during some low usage period
> (i.e., 3am on Saturdays), using something like the e2croncheck shell
> script.


There was another script written to do this that handled the e2fsck,
reiserfsck and xfs_check, detecting all volume groups automatically,
along with e.g. validating that the snapshot volume doesn't exist
before starting the check (which may indicate that the previous e2fsck
is still running), and not running while on AC power.

The last version was in the thread "forced fsck (again?)" dated
2008-01-28. Would it be better to use that one? In that thread we
discussed not clobbering the last checked time as e2croncheck does, so
the admin can see how long it was since the filesystem was last
checked.

Maybe it makes more sense to get the lvcheck script included into
util-linux-ng or lvm2 packages, and have it added automatically to the
cron.weekly directory? Then the distros could disable the at-boot
checking safely, while still being able to detect corruption caused by
cables/RAM/drives/software.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.