Using ext3 is only safe if storage subsystem meets certain
criteria. Document those.
Errors=remount-ro is documented as default, but superblock setting
overrides that and mkfs defaults to errors=continue... so the default
is errors=continue in practice.
readonly mount does actually write to the media in some cases. Document that.
Signed-off-by: Pavel Machek <[email protected]>
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..74a73b0 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -14,6 +14,9 @@ Options
When mounting an ext3 filesystem, the following options are accepted:
(*) == default
+ro Note that ext3 will replay the journal (and thus write
+ to the partition) even when mounted "read only".
+
journal=update Update the ext3 file system's journal to the current
format.
@@ -95,6 +98,8 @@ debug Extra debugging information is sent to syslog.
errors=remount-ro(*) Remount the filesystem read-only on an error.
errors=continue Keep going on a filesystem error.
errors=panic Panic and halt the machine if an error occurs.
+			(Note that the default is overridden by the
+			superblock setting on most systems.)
data_err=ignore(*) Just print an error message if an error occurs
in a file data buffer in ordered mode.
@@ -188,6 +193,34 @@ mke2fs: create a ext3 partition with the -j flag.
debugfs: ext2 and ext3 file system debugger.
ext2online: online (mounted) ext2 and ext3 filesystem resizer
+Requirements
+============
+
+Ext3 expects the disk/storage subsystem to behave sanely. On a sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* writes to media never fail. Even if the disk returns an error condition
+ during a write, ext3 can't handle it correctly, because success was already
+ returned from fsync when the data hit the journal.
+
+ (Fortunately, failing writes are very uncommon on disks, as they
+ have spare sectors they use when a write fails.)
+
+* either the whole sector is correctly written or nothing is written during
+ powerfail.
+
+ (Unfortunately, none of the cheap USB/SD flash cards I have seen
+ behave like this, so they are unsuitable for ext3. Because RAM tends
+ to fail faster than the rest of the system during powerfail, special
+ hw killing DMA transfers may be necessary. Not sure how common that
+ problem is on generic PC machines.)
+
+* either write caching is disabled, or the hw can do barriers and they are enabled.
+
+ (Note that barriers are disabled by default; use the "barrier=1"
+ mount option after making sure the hw can support them.)
+
References
==========
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
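In concrete terms, the write-cache/barrier advice in the patch above amounts to something like the following; the device names are illustrative, and hdparm reporting varies between drives:

    # Check whether the drive's write cache is enabled:
    hdparm -W /dev/sda
    # Either disable the write cache...
    hdparm -W 0 /dev/sda
    # ...or leave it on and mount with barriers enabled instead:
    mount -o barrier=1 /dev/sda1 /mnt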
Can one avoid replay of the journal then if it would be unclean?
Just curious.
M.
Pavel Machek wrote:
> Using ext3 is only safe if storage subsystem meets certain
> criteria. Document those.
>
> Errors=remount-ro is documented as default, but superblock setting
> overrides that and mkfs defaults to errors=continue... so the default
> is errors=continue in practice.
>
> readonly mount does actually write to the media in some cases. Document that.
>
> Signed-off-by: Pavel Machek <[email protected]>
>
> diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
On Sat 2009-01-03 22:17:11, Martin MOKREJŠ wrote:
> Can one avoid replay of the journal then if it would be unclean?
> Just curious.
Well, mounting unclean filesystem is dangerous but depending on
circumstances, it may be better than writing to the filesystems.
(You may not be able to read some data and may provoke kernel bugs,
but at least you don't damage what is on disk. If you are collecting
evidence -- not writing is very important. If you suspect something is
very wrong with the drive, not writing is good idea).
Pavel
>
> Pavel Machek wrote:
> > Using ext3 is only safe if storage subsystem meets certain
> > criteria. Document those.
> >
> > Errors=remount-ro is documented as default, but superblock setting
> > overrides that and mkfs defaults to errors=continue... so the default
> > is errors=continue in practice.
> >
> > readonly mount does actually write to the media in some cases. Document that.
> >
> > Signed-off-by: Pavel Machek <[email protected]>
> >
> > diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
[Fixed top-posting]
2009/1/3 Martin MOKREJŠ <[email protected]>:
> Pavel Machek wrote:
>> readonly mount does actually write to the media in some cases. Document that.
>>
> Can one avoid replay of the journal then if it would be unclean?
> Just curious.
Nope. If the underlying block device is read-only then mounting the
filesystem will fail. I tried to fix this some time ago, and have a
set of patches that almost always work, but "almost always" isn't good
enough. Unfortunately I never managed to figure out a way to finish it
off without disgusting hacks or major surgery.
> M.
Cheers,
Duane.
--
"I never could learn to drink that blood and call it wine" - Bob Dylan
On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
> [Fixed top-posting]
>
> 2009/1/3 Martin MOKREJŠ <[email protected]>:
> > Pavel Machek wrote:
> >> readonly mount does actually write to the media in some cases. Document that.
> >>
> > Can one avoid replay of the journal then if it would be unclean?
> > Just curious.
>
> Nope. If the underlying block device is read-only then mounting the
> filesystem will fail. I tried to fix this some time ago, and have a
> set of patches that almost always work, but "almost always" isn't good
> enough. Unfortunately I never managed to figure out a way to finish it
> off without disgusting hacks or major surgery.
Uhuh, can you just ignore the journal and mount it anyway?
...basically treating it like an ext2?
...ok, that will present "old" version of the filesystem to the
user... violating fsync() semantics.
Still handy for recovering badly broken filesystems, I'd say.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Pavel Machek wrote:
> On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
>> [Fixed top-posting]
>>
>> 2009/1/3 Martin MOKREJŠ <[email protected]>:
>>> Pavel Machek wrote:
>>>> readonly mount does actually write to the media in some cases. Document that.
>>>>
>>> Can one avoid replay of the journal then if it would be unclean?
>>> Just curious.
>> Nope. If the underlying block device is read-only then mounting the
>> filesystem will fail. I tried to fix this some time ago, and have a
>> set of patches that almost always work, but "almost always" isn't good
>> enough. Unfortunately I never managed to figure out a way to finish it
>> off without disgusting hacks or major surgery.
>
> Uhuh, can you just ignore the journal and mount it anyway?
> ...basically treating it like an ext2?
>
> ...ok, that will present "old" version of the filesystem to the
> user... violating fsync() semantics.
Hmm, so if my dual-boot machine does not shutdown correctly and I boot
accidentally in M$ Win where I use ext2 IFS driver and modify some
stuff on the ext3 drive, after a while reboot to linux and the journal
get re-played ... Mmm ...
>
> Still handy for recovering badly broken filesystems, I'd say.
Me as well. How about improving your doc patch with some summary of
this thread (although it is probably not over yet)? ;-) Definitely,
a note that one can mount it as ext2 while read-only would be helpful
when doing some forensics on the disk.
2009/1/3 Pavel Machek <[email protected]>:
> On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
>> [Fixed top-posting]
>>
>> 2009/1/3 Martin MOKREJŠ <[email protected]>:
>> > Pavel Machek wrote:
>> >> readonly mount does actually write to the media in some cases. Document that.
>> >>
>> > Can one avoid replay of the journal then if it would be unclean?
>> > Just curious.
>>
>> Nope. If the underlying block device is read-only then mounting the
>> filesystem will fail. I tried to fix this some time ago, and have a
>> set of patches that almost always work, but "almost always" isn't good
>> enough. Unfortunately I never managed to figure out a way to finish it
>> off without disgusting hacks or major surgery.
>
> Uhuh, can you just ignore the journal and mount it anyway?
> ...basically treating it like an ext2?
I'm afraid not, ext2 won't mount an FS with EXT3_FEATURE_INCOMPAT_RECOVER set.
> ...ok, that will present "old" version of the filesystem to the
> user... violating fsync() semantics.
>
> Still handy for recovering badly broken filesystems, I'd say.
>
> Pavel
Cheers,
Duane.
--
"I never could learn to drink that blood and call it wine" - Bob Dylan
2009/1/3 Martin MOKREJŠ <[email protected]>:
> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
> accidentally in M$ Win where I use ext2 IFS driver and modify some
> stuff on the ext3 drive, after a while reboot to linux and the journal
> get re-played ... Mmm ...
You *really* wouldn't want to be doing that.
The other scenario that people have reported trouble with is
suspending the system, booting a live CD which "read-only" mounts the
filesystem (and replays the journal), then resuming.
Cheers,
Duane.
--
"I never could learn to drink that blood and call it wine" - Bob Dylan
Duane Griffin wrote:
> 2009/1/3 Martin MOKREJŠ <[email protected]>:
>> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
>> accidentally in M$ Win where I use ext2 IFS driver and modify some
>> stuff on the ext3 drive, after a while reboot to linux and the journal
>> get re-played ... Mmm ...
>
> You *really* wouldn't want to be doing that.
>
> The other scenario that people have reported trouble with is
> suspending the system, booting a live CD which "read-only" mounts the
> filesystem (and replays the journal), then resuming.
Why does not "mount -ro" die when it would have to replay the journal
with a message that user must run fsck.ext3 in order to be able to mount
it albeit read-only? Still I would prefer having an extra switch to
force mount RO while not touching the journal for disk forensics.
I think that would also prevent the cases when a LiveCD/rescue distribution
would not mount+replay it automagically but user would really have to
provide the switch to the command. I am really not using the recovery
boot cd to touch my partitions in some cases unwillingly.
Sure that does not prevent my case when I let ext2 IFS writing onto
my ext3 partition. Actually, couldn't the driver at least warn me
the journal log is non-empty (am just a user, sorry, cannot check
myself the code at http://www.fs-driver.org if it could do at least this
although it does not understand ext3). ;-)
Martin
Martin MOKREJŠ wrote:
> Duane Griffin wrote:
>> 2009/1/3 Martin MOKREJŠ <[email protected]>:
>>> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
>>> accidentally in M$ Win where I use ext2 IFS driver and modify some
>>> stuff on the ext3 drive, after a while reboot to linux and the journal
>>> get re-played ... Mmm ...
>> You *really* wouldn't want to be doing that.
>>
>> The other scenario that people have reported trouble with is
>> suspending the system, booting a live CD which "read-only" mounts the
>> filesystem (and replays the journal), then resuming.
>
> Why does not "mount -ro" die when it would have to replay the journal
> with a message that user must run fsck.ext3 in order to be able to mount
> it albeit read-only? Still I would prefer having an extra switch to
That would break typical system bootup in the unclean journal case:
normally the root FS is mounted read-only to start with (which replays
the journal) and remounted read-write later on - and usually the fsck
utilities are located on the root filesystem.
> force mount RO while not touching the journal for disk forensics.
> I think that would also prevent the cases when a LiveCD/rescue distribution
> would not mount+replay it automagically but user would really have to
> provide the switch to the command. I am really not using the recovery
> boot cd to touch my partitions in some cases unwillingly.
I agree, there should be a way to force it to mount "really read only"
so it doesn't try to replay the journal. That might require just
ignoring the journal content, which may result in the FS appearing
corrupt, but for recovery/forensics purposes that seems better than
nothing..
2009/1/3 Martin MOKREJŠ <[email protected]>:
> Why does not "mount -ro" die when it would have to replay the journal
> with a message that user must run fsck.ext3 in order to be able to mount
> it albeit read-only? Still I would prefer having an extra switch to
> force mount RO while not touching the journal for disk forensics.
> I think that would also prevent the cases when a LiveCD/rescue distribution
> would not mount+replay it automagically but user would really have to
> provide the switch to the command. I am really not using the recovery
> boot cd to touch my partitions in some cases unwillingly.
Well, that would make things rather tricky. As in, shutting down
uncleanly would render your system unbootable.
> Sure that does not prevent my case when I let ext2 IFS writing onto
> my ext3 partition. Actually, couldn't the driver at least warn me
> the journal log is non-empty (am just a user, sorry, cannot check
> myself the code at http://www.fs-driver.org if it could do at least this
> although it does not understand ext3). ;-)
The driver certainly should warn you in that case. I have no idea
whether it does, as I don't use it, sorry.
Cheers,
Duane.
--
"I never could learn to drink that blood and call it wine" - Bob Dylan
Robert Hancock wrote:
> Martin MOKREJŠ wrote:
>> Duane Griffin wrote:
>>> 2009/1/3 Martin MOKREJŠ <[email protected]>:
>>>> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
>>>> accidentally in M$ Win where I use ext2 IFS driver and modify some
>>>> stuff on the ext3 drive, after a while reboot to linux and the journal
>>>> get re-played ... Mmm ...
>>> You *really* wouldn't want to be doing that.
>>>
>>> The other scenario that people have reported trouble with is
>>> suspending the system, booting a live CD which "read-only" mounts the
>>> filesystem (and replays the journal), then resuming.
>>
>> Why does not "mount -ro" die when it would have to replay the journal
>> with a message that user must run fsck.ext3 in order to be able to mount
>> it albeit read-only? Still I would prefer having an extra switch to
>
> That would break typical system bootup in the unclean journal case:
> normally the root FS is mounted read-only to start with (which replays
> the journal) and remounted read-write later on - and usually the fsck
> utilities are located on the root filesystem.
Couldn't that be handled by e.g. openRC during boot, having it
provide, say, --force-journal-replay during a "normal" boot?
Yes, that would mean e2fsprogs would become incompatible with older
versions, but why not "fix" the logic?
>
>> force mount RO while not touching the journal for disk forensics. I
>> think that would also prevent the cases when a LiveCD/rescue
>> distribution would not mount+replay it automagically but user would
>> really have to provide the switch to the command. I am really not
>> using the recovery boot cd to touch my partitions in some cases
>> unwillingly.
>
> I agree, there should be a way to force it to mount "really read only"
> so it doesn't try to replay the journal. That might require just
> ignoring the journal content, which may result in the FS appearing
> corrupt, but for recovery/forensics purposes that seems better than
> nothing..
Fully agree.
M.
Duane Griffin wrote:
> 2009/1/3 Martin MOKREJŠ <[email protected]>:
>> Why does not "mount -ro" die when it would have to replay the journal
>> with a message that user must run fsck.ext3 in order to be able to mount
>> it albeit read-only? Still I would prefer having an extra switch to
>> force mount RO while not touching the journal for disk forensics.
>> I think that would also prevent the cases when a LiveCD/rescue distribution
>> would not mount+replay it automagically but user would really have to
>> provide the switch to the command. I am really not using the recovery
>> boot cd to touch my partitions in some cases unwillingly.
>
> Well, that would make things rather tricky. As in, shutting down
> uncleanly would render your system unbootable.
??? If I am booted off a CD/DVD drive I just do not want my system
to be touched. I am fine if the dist mounts my drives automagically
in read-only mode but if that currently forces journal replay then no,
thanks. ;)
M.
On Sun 2009-01-04 00:01:58, Martin MOKREJŠ wrote:
> Pavel Machek wrote:
> > On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
> >> [Fixed top-posting]
> >>
> >> 2009/1/3 Martin MOKREJŠ <[email protected]>:
> >>> Pavel Machek wrote:
> >>>> readonly mount does actually write to the media in some cases. Document that.
> >>>>
> >>> Can one avoid replay of the journal then if it would be unclean?
> >>> Just curious.
> >> Nope. If the underlying block device is read-only then mounting the
> >> filesystem will fail. I tried to fix this some time ago, and have a
> >> set of patches that almost always work, but "almost always" isn't good
> >> enough. Unfortunately I never managed to figure out a way to finish it
> >> off without disgusting hacks or major surgery.
> >
> > Uhuh, can you just ignore the journal and mount it anyway?
> > ...basically treating it like an ext2?
> >
> > ...ok, that will present "old" version of the filesystem to the
> > user... violating fsync() semantics.
>
> Hmm, so if my dual-boot machine does not shutdown correctly and I boot
> accidentally in M$ Win where I use ext2 IFS driver and modify some
> stuff on the ext3 drive, after a while reboot to linux and the journal
> get re-played ... Mmm ...
ext2 driver should refuse to mount dirty ext3 filesystem. (Linux ext2
driver does that).
> > Still handy for recovering badly broken filesystems, I'd say.
>
> Me as well. How about improving your doc patch with some summary of
> this thread (although it is probably not over yet)? ;-) Definitely,
> a note that one can mount it as ext2 while read-only would be helpful
> when doing some forensics on the disk.
No, you can't mount unclean ext3 as an ext2; patch to do that would be
possible but...
I believe the patch is correct & useful.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
2009/1/4 Martin MOKREJŠ <[email protected]>:
> Duane Griffin wrote:
>> 2009/1/3 Martin MOKREJŠ <[email protected]>:
>>> Why does not "mount -ro" die when it would have to replay the journal
>>> with a message that user must run fsck.ext3 in order to be able to mount
>>> it albeit read-only? Still I would prefer having an extra switch to
>>> force mount RO while not touching the journal for disk forensics.
>>> I think that would also prevent the cases when a LiveCD/rescue distribution
>>> would not mount+replay it automagically but user would really have to
>>> provide the switch to the command. I am really not using the recovery
>>> boot cd to touch my partitions in some cases unwillingly.
>>
>> Well, that would make things rather tricky. As in, shutting down
>> uncleanly would render your system unbootable.
>
> ??? If I am booted off a CD/DVD drive I just do not want my system
> to be touched. I am fine if the dist mounts my drives automagically
> in read-only mode but if that currently forces journal replay then no,
> thanks. ;)
I agree, it isn't a great situation. Nonetheless, it has always been
thus for ext3, and so far we've muddled along. Unless and until we can
replay the journal in-memory without touching the on-disk data, we are
stuck with it.
We can't refuse to mount an unclean FS, as that would break booting.
We also can't ignore the journal by default, if/when we get a patch to
do so at all, as that effectively corrupts random chunks of the FS.
Fine for forensics and recovery; not so much for booting from.
> M.
Cheers,
Duane.
--
"I never could learn to drink that blood and call it wine" - Bob Dylan
On Sat, Jan 03, 2009 at 01:38:15PM +0100, Pavel Machek wrote:
> +Requirements
> +============
> +
> +Ext3 expects the disk/storage subsystem to behave sanely. On a sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* writes to media never fail. Even if the disk returns an error condition
> + during a write, ext3 can't handle it correctly, because success was already
> + returned from fsync when the data hit the journal.
> +
> + (Fortunately, failing writes are very uncommon on disks, as they
> + have spare sectors they use when a write fails.)
This is not unique to ext3; per the discussion two weeks ago, this is
largely because the fsync() interface cannot possibly return errors
caused by failures when creating or modifying parent directories.
Given this, it's a bit misleading to place this in
Documentation/filesystems/ext3.txt. At the minimum it should include
a discussion of what the issues might be, and given that pretty much
no Unix/Linux filesystem has a way of reflecting these errors to
application programs, it probably should be in a
filesystem-independent documentation file.
> +* either the whole sector is correctly written or nothing is written during
> + powerfail.
> +
> + (Unfortunately, none of the cheap USB/SD flash cards I have seen
> + behave like this, so they are unsuitable for ext3. Because RAM tends
> + to fail faster than the rest of the system during powerfail, special
> + hw killing DMA transfers may be necessary. Not sure how common that
> + problem is on generic PC machines.)
Again, this is true for other filesystems (it was first discovered on
SGI "pizza boxes" machines running XFS, and special hardware changes
added to allow DMA aborts) --- in fact, because of ext3's use of
physical block journaling, it's much more likely that it will recover
from these sorts of errors. So it's very misleading to have this sort
of discussion in Documentation/filesystems/ext3.txt.
> +* either write caching is disabled, or the hw can do barriers and they are enabled.
> +
> + (Note that barriers are disabled by default; use the "barrier=1"
> + mount option after making sure the hw can support them.)
We really should get akpm to agree to accept the patch to enable
barriers by default instead. :-)
- Ted
On Sun, 04 Jan 2009 00:41:51 GMT, Duane Griffin said:
> I agree, it isn't a great situation. Nonetheless, it has always been
> thus for ext3, and so far we've muddled along. Unless and until we can
> replay the journal in-memory without touching the on-disk data, we are
> stuck with it.
Is there a way using md/dm/lvm etc to make the source partition R/O and
replay the journal onto a CoW snapshot? Admittedly, not easy to do inside
the 'mount' command itself, but at least it might be workable for LiveCD R/O
mounts and forensics work, where you can *tell* beforehand that's what you
want and can jump through setup games before doing the mount...
Pavel Machek wrote:
[CC: Alan Cox because of his reply in the "XFS internal error" thread]
> Using ext3 is only safe if storage subsystem meets certain
> criteria. Document those.
Thanks for this patch. However, after reading this, I have a stupid
question: which file system should I use if I had to reinstall my
computers from scratch now?
Ext3 means either hardware that supports barriers (not sure how to
check, and anyway I have to use encryption on the work laptop due to the
corporate policy) or disabling write cache (but, as Alan Cox said, this
shortens the lifespan of the disk). Does this requirement apply to other
journaling filesystems? Do I need journaling at all, given that I have
a UPS on my desktop and a battery in the laptop?
--
Alexander E. Patrakov
On Sun, 04 Jan 2009 18:35:41 +0500, "Alexander E. Patrakov" said:
> Ext3 means either hardware that supports barriers (not sure how to
> check, and anyway I have to use encryption on the work laptop due to the
> corporate policy) or disabling write cache (but, as Alan Cox said, this
> shortens the lifespan of the disk).
False dichotomy. This isn't an "either/or", as there's a *third* case:
"understand the issues and risks involved if you have a write cache and
no barrier support, and learn to deal with it".
As you point out, if it's a laptop with a battery, the risk may be *very* low.
Let's say there's a 1 in 10,000 chance that you'll trash a file system and
need to restore from backups.
That may be totally acceptable if you've already estimated a 1 in 500 chance
of the whole damned laptop going walkies while you're not looking, and then
you *still* need to be able to restore from backups onto a replacement machine.
Yes, for some systems, the whole barriers/write cache thing is in fact very
important. But for others, data loss due to spilled coffee is a bigger worry...
2009/1/4 <[email protected]>:
> On Sun, 04 Jan 2009 00:41:51 GMT, Duane Griffin said:
>
>> I agree, it isn't a great situation. Nonetheless, it has always been
>> thus for ext3, and so far we've muddled along. Unless and until we can
>> replay the journal in-memory without touching the on-disk data, we are
>> stuck with it.
>
> Is there a way using md/dm/lvm etc to make the source partition R/O and
> replay the journal onto a CoW snapshot? Admittedly, not easy to do inside
> the 'mount' command itself, but at least it might be workable for LiveCD R/O
> mounts and forensics work, where you can *tell* beforehand that's what you
> want and can jump through setup games before doing the mount...
Yes, something like that is best practice, as I understand it. The
LiveCD init scripts could check whether they are about to R/O mount an
ext[34] filesystem needing recovery and either refuse with a useful
message to the user, or even automatically create and mount a COW
snapshot, as you described. They'd still need to warn the user though,
since things like remounting R/W wouldn't work as expected.
Cheers,
Duane.
--
"I never could learn to drink that blood and call it wine" - Bob Dylan
Alexander E. Patrakov wrote:
[]
> Ext3 means either hardware that supports barriers (not sure how to
> check, and anyway I have to use encryption on the work laptop due to the
> corporate policy) or disabling write cache (but, as Alan Cox said, this
> shortens the lifespan of the disk). Does this requirement apply to other
> journaling filesystems? Do I need journaling at all, given that I have
> a UPS on my desktop and a battery in the laptop?
There's another possibility too, somewhat more risky. Namely, run with
write cache ON by default, and switch it off when running off battery
(either UPS or notebook). Should give the best of both worlds, PROVIDED
the battery/UPS actually works :)
/mjt
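A sketch of that policy as a power-management hook; the script path, the device name, and the availability of the on_ac_power helper are all assumptions:

    #!/bin/sh
    # Hypothetical pm/ACPI hook: write cache on when on AC power,
    # off when running from battery.
    if on_ac_power; then
        hdparm -W 1 /dev/sda
    else
        hdparm -W 0 /dev/sda
    fi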
On Sun, Jan 04, 2009 at 06:35:41PM +0500, Alexander E. Patrakov wrote:
>
> Ext3 means either hardware that supports barriers (not sure how to
> check
Pretty much all modern disk drives support barriers. And note that
w/o barriers ext3 has worked pretty well. *If* you have a workload that
pushes your system into a mode where it is very low on memory, so it
is constantly paging/thrashing, and you have a workload which is
metadata intensive, and you crash the machine while it is thrashing,
it is possible to end up in a situation where your filesystem is
corrupted and you have to use e2fsck to correct the filesystem. In
practice this is often not the case, which is why the default for ext3
has been with barriers disabled, and most people have not noted major
problems. This is why Andrew Morton has refused to accept the patch
that enables barriers by default for ext3; he's not convinced the
performance hit is worth the improvement in reliability.
Ext4 does enable barriers by default, mainly because filesystem
developers tend to believe that reliability is more important than
performance. (On the other hand, Google runs with ext2 w/o
journalling, because everything is replicated three times and it's
easier to just blow away the filesystem and resync from one of the
duplicate copies; so in the right circumstances, maybe worrying only
about performance and ignoring reliability makes perfect sense.)
> and anyway I have to use encryption on the work laptop due to the
> corporate policy
If dm supported barriers, this wouldn't be an issue. Personally, I
find the convenience of LVM is so useful that I use ext4 with LVM,
even though the barrier requests get dropped on the ground. And I'm a
kernel developer, and I use a laptop with suspend/resume, which means
I often crash uncleanly --- and I've not lost data yet, despite the
lack of barriers. (On the other hand, my laptop has 4 gigs of memory,
so I'm rarely thrashing due to memory pressure.)
> or disabling write cache (but, as Alan Cox said, this
> shortens the lifespan of the disk).
Huh? I've never heard an assertion that disabling the write cache (I
assume you mean using write-through caching as opposed to write-back
caching) shortens the lifespan of disk drives. Aggressive battery
saving mode is far more likely to shorten disk drive life, due to
spinning the platters up and down a lot.
> Does this requirement apply to other
> journaling filesystems? Do I need journaling at all, given that I have
> a UPS on my desktop and a battery in the laptop?
Which requirement? Barriers? Most journaling filesystems simply
enable barriers by default.
And journalling is useful so that if your system crashes, say due to
suspend and resume not working out, or the battery runs dry without
your noticing it, you can avoid running fsck at boot time. It's
really about shortening the boot time after a crash more than
anything else.
- Ted
On Sun, Jan 04, 2009 at 02:24:43PM +0000, Duane Griffin wrote:
> > Is there a way using md/dm/lvm etc to make the source partition R/O and
> > replay the journal onto a CoW snapshot? Admittedly, not easy to do inside
> > the 'mount' command itself, but at least it might be workable for LiveCD R/O
> > mounts and forensics work, where you can *tell* beforehand that's what you
> > want and can jump through setup games before doing the mount...
>
> Yes, something like that is best practice, as I understand it. The
> LiveCD init scripts could check whether they are about to R/O mount an
> ext[34] filesystem needing recovery and either refuse with a useful
> message to the user, or even automatically create and mount a COW
> snapshot, as you described. They'd still need to warn the user though,
> since things like remounting R/W wouldn't work as expected.
So what's the use case where people want to be able to mount a
filesystem needing recovery read/only without running the journal?
- Ted
Theodore Tso wrote:
> So what's the use case where people want to be able to mount a
> filesystem needing recovery read/only without running the journal?
Corrupted SD card[1] that's been locked to read only for recovery
purposes without having the FS tear itself apart further?
Others seem to be saying that it is useful for forensics...
[1] http://pavelmachek.livejournal.com/68701.html
On Sun, 4 Jan 2009, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 02:24:43PM +0000, Duane Griffin wrote:
> > > Is there a way using md/dm/lvm etc to make the source partition R/O and
> > > replay the journal onto a CoW snapshot? Admittedly, not easy to do inside
> > > the 'mount' command itself, but at least it might be workable for LiveCD R/O
> > > mounts and forensics work, where you can *tell* beforehand that's what you
> > > want and can jump through setup games before doing the mount...
> >
> > Yes, something like that is best practice, as I understand it. The
> > LiveCD init scripts could check whether they are about to R/O mount an
> > ext[34] filesystem needing recovery and either refuse with a useful
> > message to the user, or even automatically create and mount a COW
> > snapshot, as you described. They'd still need to warn the user though,
> > since things like remounting R/W wouldn't work as expected.
>
> So what's the use case where people want to be able to mount a
> filesystem needing recovery read/only without running the journal?
As mentioned before, suspending a laptop (running from hdd), running a live CD,
and expecting everything to work fine when resuming from hdd?
I think most people get shocked when they discover that mounting something
read-only may actually write to the media. This is a bit unexpected (hey, if I
mount `read-only', I expect that no writes will happen), as it behaved
differently before the introduction of journalling.
As for mounting the root file system read-only during early boot up, and
remounting it read-write later, I guess it's quite complicated to replay the
journal (in RAM) on read-only mount and defer the replay writeback until
remounting read-write?
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
On Sun, Jan 04, 2009 at 07:08:01PM +0000, Sitsofe Wheeler wrote:
> Theodore Tso wrote:
>> So what's the use case where people want to be able to mount a
>> filesystem needing recovery read/only without running the journal?
>
> Corrupted SD card[1] that's been locked to read only for recovery
> purposes without having the FS tear itself apart further?
In that case, the right answer is to copy the 32 GB SD card to a hard
drive, and then operate on the hard drive..... In general, if the
media has started going bad, the *first* thing you want to do is an
immediate copy of the media to some place stable.
> Others seem to be saying that it is useful for forensics...
Again, the best thing to do is a full image copy of the drive before
you do anything else.....
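A hedged sketch of that workflow, with illustrative device and file names:

    # Image the failing media first, then work only on the copy:
    dd if=/dev/sdb of=/srv/sdcard.img bs=1M conv=noerror,sync
    # (GNU ddrescue copes better with bad sectors, if it is available:
    #  ddrescue -d /dev/sdb /srv/sdcard.img /srv/sdcard.map)
    losetup /dev/loop0 /srv/sdcard.img
    # On the copy, letting the journal replay is harmless:
    mount -o ro /dev/loop0 /mnt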
If someone wants to implement code that scans the journal and creates a
redirection map so that whenever the filesystem needs to read from block
N, it reads from block M instead, they should feel free to do so. But
so far, each of the use cases people are talking about is a pretty rare
case, which is probably why we don't have it at the moment.
In fact, it's probably possible to create this as a pure userspace
solution using devicemapper.
- Ted
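One plausible shape for such a userspace solution, built on the existing dm snapshot target; every device name and the chunk size below are illustrative:

    blockdev --setro /dev/sdb1                   # keep the original pristine
    dd if=/dev/zero of=/tmp/cow.img bs=1M count=256
    losetup /dev/loop0 /tmp/cow.img              # throwaway COW store
    # Table format: start length snapshot <origin> <COW dev> <N=non-persistent> <chunksize>
    echo "0 $(blockdev --getsz /dev/sdb1) snapshot /dev/sdb1 /dev/loop0 N 8" |
        dmsetup create ext3snap
    mount /dev/mapper/ext3snap /mnt              # journal replay goes to the COW store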
On Sun, Jan 04, 2009 at 08:21:06PM +0100, Geert Uytterhoeven wrote:
> As mentioned before, suspending a laptop (running from hdd), running
> a live CD, and expecting everything to work fine when resuming from
> hdd?
>
> I think most people get shocked when they discover that mounting
> something read-only may actually write to the media. This is a bit
> unexpected (hey, if I mount `read-only', I expect that no writes
> will happen), as it behaved differently before the introduction of
> journalling.
It's been this way for about a decade.... that being said, if you
really want to do this, you can today via "mount -o ro,noload /dev/XXX
/mntpt". However, the system could crash or fail because the
filesystem without having run the journal could be quite inconsistent.
> As for mounting the root file system read-only during early boot up, and
> remounting it read-write later, I guess it's quite complicated to replay the
> journal (in RAM) on read-only mount and defer the replay writeback until
> remounting read-write?
It's not *that* hard; if someone would like to cons up a patch, please
feel free.... but it's certainly not a high priority for me or most
of the other ext3 filesystem developers.
- Ted
2009/1/4 Theodore Tso <[email protected]>:
> On Sun, Jan 04, 2009 at 08:21:06PM +0100, Geert Uytterhoeven wrote:
>> As for mounting the root file system read-only during early boot up, and
>> remounting it read-write later, I guess it's quite complicated to replay the
>> journal (in RAM) on read-only mount and defer the replay writeback until
>> remounting read-write?
>
> It's not *that* hard; if someone would like to cons up a patch, please
> feel free.... but it's certainly not a high priority for me or most
> of the other ext3 filesystem developers.
If anyone is interested I'd be happy to dust off and send them my old
patches to implement this. There are a couple of issues with it.
First, I never got around to implementing remount R/W support. Second,
I had to introduce a rather nasty hack in order to handle un-escaping
JFS magic numbers.
> - Ted
Cheers,
Duane.
--
"I never could learn to drink that blood and call it wine" - Bob Dylan
On Sun 2009-01-04 18:35:41, Alexander E. Patrakov wrote:
> Pavel Machek wrote:
> [CC: Alan Cox because of his reply in the "XFS internal error" thread]
>
>> Using ext3 is only safe if storage subsystem meets certain
>> criteria. Document those.
>
> Thanks for this patch. However, after reading this, I have a stupid
> question: which file system should I use if I had to reinstall my
> computers from scratch now?
ext2 is still the safest default... if you can live with fsck.
ext3 is the safest from the journalling ones, AFAICT.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Saturday 03 January 2009 17:01:58 Martin MOKREJŠ wrote:
> > Still handy for recovering badly broken filesystems, I'd say.
>
> Me as well. How about improving your doc patch with some summary of
> this thread (although it is probably not over yet)? ;-) Definitely,
> a note that one can mount it as ext2 while read-only would be helpful
> when doing some forensics on the disk.
Although make sure you _do_ mount it read-only, because if you mount an ext3
filesystem read/write as ext2 I've had it zap the journal entirely, and then
you have to tune2fs -j the sucker to turn it back into ext3.
Ext3 is... touchy.
Rob
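If that has already happened, recovery is something like the following (device name illustrative):

    # Make the filesystem consistent first, then re-add a journal to
    # turn the de-journalled ext2 back into ext3:
    e2fsck -f /dev/sdb1
    tune2fs -j /dev/sdb1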
On Saturday 03 January 2009 06:38:15 Pavel Machek wrote:
> +Ext3 expects the disk/storage subsystem to behave sanely. On a sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* writes to media never fail. Even if the disk returns an error condition
> + during a write, ext3 can't handle it correctly, because success was
> + already returned from fsync when the data hit the journal.
> +
> + (Fortunately, failing writes are very uncommon on disks, as they
> + have spare sectors they use when a write fails.)
> +
> +* either the whole sector is correctly written or nothing is written during
> + powerfail.
> +
> + (Unfortunately, none of the cheap USB/SD flash cards I have seen
> + behave like this, so they are unsuitable for ext3.
Want to document the granularity issues with flash, while you're at it?
An inherent problem with using flash as a normal block device is that the
flash erase size is bigger than most filesystem sector sizes. So when you
request a write, it may erase and rewrite the next 64k, 128k, or even a couple
megabytes on the really _big_ ones.
If you lose power in the middle of that, ext3 won't notice that data in the
"sectors" _after_ the one your were trying to write to got trashed.
The flash filesystems take this into account as part of their wear levelling
stuff (they normally copy the entire chunk into a new chunk, leaving the old
one in place until it's no longer needed), but they need to query the device
to get the erase granularity in order to do that, which is why they don't work
on non-flash block devices.
Rob
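For raw MTD flash, the erase granularity Rob describes is queryable, e.g. via /proc/mtd; USB/SD cards hide the flash behind their controller, so there is nothing equivalent to ask (the output line shown is illustrative):

    # size and erasesize (in hex bytes) for each raw flash device:
    cat /proc/mtd
    # dev:    size   erasesize  name
    # mtd0: 00800000 00020000 "spi-flash"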
On Sunday 04 January 2009, Robert Hancock wrote:
> I agree, there should be a way to force it to mount "really read only"
> so it doesn't try to replay the journal. That might require just
> ignoring the journal content, which may result in the FS appearing
> corrupt, but for recovery/forensics purposes that seems better than
> nothing..
For forensics you ALWAYS get a copy of the full disk first,
which you set read only with blockdev --setro /dev/$MYDISK.
You then restore from this copy.
Best Regard
Ingo Oeser, been there, done that
On Sun, Jan 04, 2009 at 07:51:27PM +0000, Duane Griffin wrote:
>
> If anyone is interested I'd be happy to dust off and send them my old
> patches to implement this. There are a couple of issues with it.
> First, I never got around to implementing remount R/W support. Second,
> I had to introduce a rather nasty hack in order to handle un-escaping
> JFS magic numbers.
Can you dust off the patches and send a copy to
[email protected] so we have them archived someplace where
hopefully someone might have time to look at it?
- Ted
2009/1/4 Theodore Tso <[email protected]>:
> On Sun, Jan 04, 2009 at 07:51:27PM +0000, Duane Griffin wrote:
>>
>> If anyone is interested I'd be happy to dust off and send them my old
>> patches to implement this. There are a couple of issues with it.
>> First, I never got around to implementing remount R/W support. Second,
>> I had to introduce a rather nasty hack in order to handle un-escaping
>> JFS magic numbers.
>
> Can you dust off the patches and send a copy to
> [email protected] so we have them archived someplace where
> hopefully someone might have time to look at it?
OK, will do. I've posted them there before, but not the latest version
that properly handles un-escaping JFS magic numbers (albeit in an ugly
way). I'll rebase them on top of the latest ext4 patch queue and
repost.
> - Ted
Cheers,
Duane.
--
"I never could learn to drink that blood and call it wine" - Bob Dylan
On Sun, Jan 04, 2009 at 01:49:49PM -0600, Rob Landley wrote:
>
> Want to document the granularity issues with flash, while you're at it?
>
> An inherent problem with using flash as a normal block device is that the
> flash erase size is bigger than most filesystem sector sizes. So when you
> request a write, it may erase and rewrite the next 64k, 128k, or even a couple
> megabytes on the really _big_ ones.
>
> If you lose power in the middle of that, ext3 won't notice that data in the
> "sectors" _after_ the one your were trying to write to got trashed.
True enough, although the newer SSD's will have this problem addressed
(although at least initially, they are **far** more costly than the
el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
alongside battery-powered shavers and trashy ipod speakers).
I will stress again, that most of this doesn't belong in
Documentation/filesystems/ext3.txt, as most of this is *not*
ext3-specific.
- Ted
On Sun 2009-01-04 17:06:34, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 01:49:49PM -0600, Rob Landley wrote:
> >
> > Want to document the granularity issues with flash, while you're at it?
> >
> > An inherent problem with using flash as a normal block device is that the
> > flash erase size is bigger than most filesystem sector sizes. So when you
> > request a write, it may erase and rewrite the next 64k, 128k, or even a couple
> > megabytes on the really _big_ ones.
> >
> > If you lose power in the middle of that, ext3 won't notice that data in the
> > "sectors" _after_ the one your were trying to write to got trashed.
>
> True enough, although the newer SSD's will have this problem addressed
> (although at least initially, they are **far** more costly than the
> el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
> alongside battery-powered shavers and trashy ipod speakers).
>
> I will stress again, that most of this doesn't belong in
> Documentation/filesystems/ext3.txt, as most of this is *not*
> ext3-specific.
I initially did the patch for ext3 because that's what I'm using
and because I felt responsible for documenting it after a huge thread.
At least barrier=1 seems to be ext3-specific, and perhaps logfs or
something can survive full eraseblocks disappearing. Anyway, I guess
we all agree that this needs to be documented _somewhere_, and that's
what I'm trying to do.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Hi!
On Sat 2009-01-03 21:32:11, Theodore Tso wrote:
> On Sat, Jan 03, 2009 at 01:38:15PM +0100, Pavel Machek wrote:
> > +Requirements
> > +============
> > +
> > +Ext3 expects the disk/storage subsystem to behave sanely. On a sanely
> > +behaving disk subsystem, data that have been successfully synced will
> > +stay on the disk. Sane means:
> > +
> > +* writes to media never fail. Even if the disk returns an error condition
> > + during a write, ext3 can't handle it correctly, because success was already
> > + returned from fsync when the data hit the journal.
> > +
> > + (Fortunately, failing writes are very uncommon on disks, as they
> > + have spare sectors they use when a write fails.)
>
> This is not unique to ext3; per the discussion two weeks ago, this is
> largely because the fsync() interface cannot possibly return
Ok, so I guess I should split the patch into a truly ext3-specific part
and a part that is common to all filesystems. I guess I'll need
some help with everything but ext2 and ext3...
> errors caused by failures when creating or modifying parent
> directories. Given this, it's a bit misleading to place this in
> Documentation/filesystems/ext3.txt. At the minimum it should include
> a discussion of what the issues might be, and given that pretty
> much no Unix/Linux filesystem has a way of reflecting these
> errors to application programs, it probably should be in a
> filesystem-independent documentation file.
Ok. I'll have to think about a good name for that file.
> > +* either write caching is disabled, or the hw can do barriers and they are enabled.
> > +
> > + (Note that barriers are disabled by default; use the "barrier=1"
> > + mount option after making sure the hw can support them.)
>
> We really should get akpm to agree to accept the patch to enable
> barriers by default instead. :-)
:-). Yes, that would help a bit.
(No, it is not a complete solution; barrier=0 with the write cache on
should still be documented as unsafe.)
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
ext3 has quite unexpected semantics of "ro", and the defaults are usually
not what they are documented to be, due to the mkfs override.
Signed-off-by: Pavel Machek <[email protected]>
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..113db1f 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -14,6 +14,11 @@ Options
When mounting an ext3 filesystem, the following options are accepted:
(*) == default
+ro			Mount the filesystem read only. Note that ext3 will
+			replay the journal (and thus write to the partition)
+			even when mounted "read only". "ro,noload" can be
+			used to prevent writes to the filesystem.
+
journal=update Update the ext3 file system's journal to the current
format.
@@ -27,7 +32,9 @@ journal_dev=devnum When the external jou
identified through its new major/minor numbers encoded
in devnum.
-noload Don't load the journal on mounting.
+noload			Don't load the journal on mounting. Note that this forces
+			mounting an inconsistent filesystem, which can lead to
+			various problems.
data=journal All data are committed into the journal prior to being
written into the main file system.
@@ -95,6 +102,8 @@ debug Extra debugging information is s
errors=remount-ro(*) Remount the filesystem read-only on an error.
errors=continue Keep going on a filesystem error.
errors=panic Panic and halt the machine if an error occurs.
+			(Note that the default is overridden by the
+			superblock setting on most systems.)
data_err=ignore(*) Just print an error message if an error occurs
in a file data buffer in ordered mode.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Sun 2009-01-04 13:38:34, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 06:35:41PM +0500, Alexander E. Patrakov wrote:
> >
> > Ext3 means either hardware that supports barriers (not sure how to
> > check
>
> Pretty much all modern disk drives support barriers. And note that
> w/o barriers ext3 has worked pretty well. *If* you have a workload that
> pushes your system into a mode where it is very low on memory,
> so it is constantly paging/thrashing, and you have a workload which is
> metadata intensive, and you crash the machine while it is thrashing,
> it is possible to end up in a situation where your filesystem is
> corrupted and you have to use e2fsck to correct the filesystem. In
Are you sure you need to have thrashing? AFAICT a metadata- and
fsync-heavy workload should be enough... and there were scripts to
reproduce that easily.
> > Does this requirement apply to other
> > journaling filesystems? Do I need journaling at all, given that I have
> > a UPS on my desktop and a battery in the laptop?
>
> Which requirement? Barriers? Most journaling filesystems simply
> enable barriers by default.
>
> And journalling is useful so that if your system crashes, say due to
> suspend and resume not working out, or the battery runs dry without
> your noticing it, you can avoid running fsck at boot time. It's
> really about shortening the boot time after a crash more than
> anything else.
Actually, journalling with barriers=0 should still be "safe" in case of
kernel crashes (*), right? Because if just the kernel is dead, the disk
firmware will still write the cache back, AFAICT.
Pavel
(*) kernel crashes that do not involve writing random garbage to disk.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Sun 2009-01-04 14:31:41, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 07:08:01PM +0000, Sitsofe Wheeler wrote:
> > Theodore Tso wrote:
> >> So what's the use case where people want to be able to mount a
> >> filesystem needing recovery read/only without running the journal?
> >
> > Corrupted SD card[1] that's been locked to read only for recovery
> > purposes without having the FS tear itself apart further?
>
> In that case, the right answer is to copy the 32 GB SD card to a hard
> drive, and then operate on the hard drive..... In general, if the
> media has started going bad, the *first* thing you want to do is an
> immediate copy of the media to some place stable.
Not necessarily.
If I have a bit of precious data and a lot of junk on the card, I want
to copy out the precious data before the card dies. Reading the whole
media may just take too long.
That's probably very true for rotating hard drives after a head crash...
"ro,noload" seems like a very acceptable solution in that case.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Sun 2009-01-04 13:49:49, Rob Landley wrote:
> On Saturday 03 January 2009 06:38:15 Pavel Machek wrote:
> > +Ext3 expects the disk/storage subsystem to behave sanely. On a sanely
> > +behaving disk subsystem, data that have been successfully synced will
> > +stay on the disk. Sane means:
> > +
> > +* writes to media never fail. Even if the disk returns an error condition
> > + during a write, ext3 can't handle it correctly, because success was
> > + already returned from fsync when the data hit the journal.
> > +
> > + (Fortunately, failing writes are very uncommon on disks, as they
> > + have spare sectors they use when a write fails.)
> > +
> > +* either the whole sector is correctly written or nothing is written during
> > + powerfail.
> > +
> > + (Unfortunately, none of the cheap USB/SD flash cards I have seen
> > + behave like this, so they are unsuitable for ext3.
>
> Want to document the granularity issues with flash, while you're at it?
>
> An inherent problem with using flash as a normal block device is that the
> flash erase size is bigger than most filesystem sector sizes. So when you
> request a write, it may erase and rewrite the next 64k, 128k, or even a couple
> megabytes on the really _big_ ones.
>
> If you lose power in the middle of that, ext3 won't notice that data in the
> "sectors" _after_ the one your were trying to write to got trashed.
>
> The flash filesystems take this into account as part of their wear levelling
> stuff (they normally copy the entire chunk into a new chunk, leaving the old
> one in place until it's no longer needed), but they need to query the device
> to get the erase granularity in order to do that, which is why they don't work
> on non-flash block devices.
Is there a Linux filesystem that can handle that? I know jffs2, but
that's unsuitable for stuff like USB thumb drives, right?
Does this sound like a fair summary?
Sector writes are atomic (ATOMIC-SECTORS)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Either the whole sector is correctly written or nothing is written during
powerfail.
Unfortunately, none of the cheap USB/SD flash cards I have seen
behave like this; they are unsuitable for all Linux filesystems
I know.
An inherent problem with using flash as a normal block
device is that the flash erase size is bigger than
most filesystem sector sizes. So when you request a
write, it may erase and rewrite the next 64k, 128k, or
even a couple megabytes on the really _big_ ones.
If you lose power in the middle of that, the filesystem
won't notice that data in the "sectors" _after_ the
one you were trying to write to got trashed.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Sun, Jan 04, 2009 at 08:21:06PM +0100, Geert Uytterhoeven wrote:
> On Sun, 4 Jan 2009, Theodore Tso wrote:
> > On Sun, Jan 04, 2009 at 02:24:43PM +0000, Duane Griffin wrote:
> > > > Is there a way using md/dm/lvm etc to make the source partition R/O and
> > > > replay the journal onto a CoW snapshot? Admittedly, not easy to do inside
> > > > the 'mount' command itself, but at least it might be workable for LiveCD R/O
> > > > mounts and forensics work, where you can *tell* beforehand that's what you
> > > > want and can jump through setup games before doing the mount...
> > >
> > > Yes, something like that is best practice, as I understand it. The
> > > LiveCD init scripts could check whether they are about to R/O mount an
> > > ext[34] filesystem needing recovery and either refuse with a useful
> > > message to the user, or even automatically create and mount a COW
> > > snapshot, as you described. They'd still need to warn the user though,
> > > since things like remounting R/W wouldn't work as expected.
> >
> > So what's the use case where people want to be able to mount a
> > filesystem needing recovery read/only without running the journal?
>
> As mentioned before, suspending a laptop (running from hdd), running a live CD,
> and expecting everything to work fine when resuming from hdd?
Any particular reason why suspend doesn't run the journal during
shutdown and leave a clean filesystem? It shouldn't take that
long surely.
I know it doesn't solve the "it really just crashed" problem, but
you don't tend to unsuspend from a crash anyway.
Bron ( just curious )
On Sun 2009-01-04 17:06:34, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 01:49:49PM -0600, Rob Landley wrote:
> >
> > Want to document the granularity issues with flash, while you're at it?
> >
> > An inherent problem with using flash as a normal block device is that the
> > flash erase size is bigger than most filesystem sector sizes. So when you
> > request a write, it may erase and rewrite the next 64k, 128k, or even a couple
> > megabytes on the really _big_ ones.
> >
> > If you lose power in the middle of that, ext3 won't notice that data in the
> > "sectors" _after_ the one your were trying to write to got trashed.
>
> True enough, although the newer SSD's will have this problem addressed
> (although at least initially, they are **far** more costly than the
> el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
> alongside battery-powered shavers and trashy ipod speakers).
>
> I will stress again, that most of this doesn't belong in
> Documentation/filesystems/ext3.txt, as most of this is *not*
> ext3-specific.
Agreed... So what about this one?
---
Document linux filesystem expectations. Ext3 can't handle write errors
of any kind, and can't handle non-atomic sector writes. Other
filesystems are probably even worse...
Signed-off-by: Pavel Machek <[email protected]>
diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..7817a9c
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,44 @@
+Linux filesystems can only work correctly when several conditions are
+met in the block layer and below (disks, flash cards). Some of them
+are obvious ("data on media should not change randomly"), some are
+less so.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if the disk returns an error condition
+during write, filesystems can't handle that correctly, because success
+on fsync was already returned when the data hit the journal.
+
+	Fortunately, failing writes are very uncommon on traditional
+	spinning disks, as they have spare sectors they use when a
+	write fails.
+
+Sector writes are atomic (ATOMIC-SECTORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either the whole sector is correctly written or nothing is written during
+powerfail.
+
+	Unfortunately, none of the cheap USB/SD flash cards I have seen
+	behave like this, and they are unsuitable for all Linux filesystems
+	I know of.
+
+ An inherent problem with using flash as a normal block
+ device is that the flash erase size is bigger than
+ most filesystem sector sizes. So when you request a
+ write, it may erase and rewrite the next 64k, 128k, or
+ even a couple megabytes on the really _big_ ones.
+
+	If you lose power in the middle of that, the filesystem
+	won't notice that data in the "sectors" _after_ the
+	one you were trying to write to got trashed.
+
+	Because RAM tends to fail faster than the rest of the system
+	during powerfail, special hw killing DMA transfers may be
+	necessary; otherwise, disks may write garbage during powerfail.
+	Not sure how common that problem is on generic PC machines.
+
+
+
+
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..8cb64b0 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -188,6 +197,25 @@ mke2fs: create a ext3 partition with th
debugfs: ext2 and ext3 file system debugger.
ext2online: online (mounted) ext2 and ext3 filesystem resizer
+Requirements
+============
+
+Ext3 expects the disk/storage subsystem to behave sanely. On a sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed
+
+* sector writes are atomic
+
+(see expectations.txt; note that most/all linux filesystems have similar
+expectations)
+
+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+	(Note that barriers are disabled by default; use the "barrier=1"
+	mount option after making sure the hw can support them.)
+
References
==========
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Sun 2009-01-04 17:06:34, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 01:49:49PM -0600, Rob Landley wrote:
> >
> > Want to document the granularity issues with flash, while you're at it?
> >
> > An inherent problem with using flash as a normal block device is that the
> > flash erase size is bigger than most filesystem sector sizes. So when you
> > request a write, it may erase and rewrite the next 64k, 128k, or even a couple
> > megabytes on the really _big_ ones.
> >
> > If you lose power in the middle of that, ext3 won't notice that data in the
> > "sectors" _after_ the one you were trying to write to got trashed.
>
> True enough, although the newer SSD's will have this problem addressed
> (although at least initially, they are **far** more costly than the
> el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
> alongside battery-powered shavers and trashy ipod speakers).
Hey, I got one of those el-cheapo 32GB SD cards. I fully expected it
to be slow, but eating my data 3 times per month was unexpected even
for me.
I'm not even sure where the blame is. I certainly blame the Linux
documentation: there should be a "DON'T USE CRAPPY SD CARDS" warning in
big bold letters somewhere. I guess mkfs.ext3 should just refuse to
make a filesystem on them. (Of course, the manufacturer should have told
me that the card is crap; I bet it cannot even work reliably with
VFAT/Windows.)
Plus I'd hope some filesystem materializes that can handle 128KB
"block size"... because the el-cheapo card I have here is actually
pretty sane. It seems to store data I put on it, and should be safe to
use with huge block size...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Pavel Machek wrote:
> Is there a linux filesystem that can handle that? I know jffs2, but
> that's unsuitable for stuff like USB thumb drives, right?
This raises the question: if nothing can handle it, which FS is the
least bad? The last I heard, people were saying that with cheap SSDs the
recommendation was FAT [1] but in the future btrfs, nilfs and logfs
would be better.
[1] http://lkml.org/lkml/2008/10/14/129
On Sun, 4 Jan 2009, Pavel Machek wrote:
> On Sun 2009-01-04 13:49:49, Rob Landley wrote:
>> On Saturday 03 January 2009 06:38:15 Pavel Machek wrote:
>>> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
>>> +behaving disk subsystem, data that have been successfully synced will
>>> +stay on the disk. Sane means:
>>> +
>>> +* writes to media never fail. Even if disk returns error condition during
>>> + write, ext3 can't handle that correctly, because success on fsync was
>>> already + returned when data hit the journal.
>>> +
>>> + (Fortunately writes failing are very uncommon on disks, as they
>>> + have spare sectors they use when write fails.)
>>> +
>>> +* either whole sector is correctly written or nothing is written during
>>> + powerfail.
>>> +
>>> + (Unfortunately, none of the cheap USB/SD flash cards I have seen behave
>>> + like this, and are unsuitable for ext3.
>>
>> Want to document the granularity issues with flash, while you're at it?
>>
>> An inherent problem with using flash as a normal block device is that the
>> flash erase size is bigger than most filesystem sector sizes. So when you
>> request a write, it may erase and rewrite the next 64k, 128k, or even a couple
>> megabytes on the really _big_ ones.
>>
>> If you lose power in the middle of that, ext3 won't notice that data in the
>> "sectors" _after_ the one you were trying to write to got trashed.
>>
>> The flash filesystems take this into account as part of their wear levelling
>> stuff (they normally copy the entire chunk into a new chunk, leaving the old
>> one in place until it's no longer needed), but they need to query the device
>> to get the erase granularity in order to do that, which is why they don't work
>> on non-flash block devices.
>
> Is there a linux filesystem that can handle that? I know jffs2, but
> that's unsuitable for stuff like USB thumb drives, right?
>
> Does this sound like a fair summary?
>
> Sector writes are atomic (ATOMIC-SECTORS)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Either whole sector is correctly written or nothing is written during
> powerfail.
>
> 	Unfortunately, none of the cheap USB/SD flash cards I have seen
> 	behave like this, and they are unsuitable for all Linux filesystems
> 	I know of.
>
> An inherent problem with using flash as a normal block
> device is that the flash erase size is bigger than
> most filesystem sector sizes. So when you request a
> write, it may erase and rewrite the next 64k, 128k, or
> even a couple megabytes on the really _big_ ones.
>
> 	If you lose power in the middle of that, the filesystem
> 	won't notice that data in the "sectors" _after_ the
> 	one you were trying to write to got trashed.
Around, not after. The block you are writing could be in the middle or at
the end of an eraseblock.
David Lang
On Sun, Jan 04, 2009 at 11:40:52PM +0100, Pavel Machek wrote:
> Not necessarily.
>
> If I have a bit of precious data and lot of junk on the card, I want
> to copy out the precious data before the card dies. Reading the whole
> media may just take too long.
>
> That's probably very true for rotating harddrives after headcrash...
For a small amount of data, maybe; but the number of seeks is often far
more destructive than the amount of time the disk is spinning. And in
practice, what generally happens is the user starts looking around to
make sure there wasn't anything else on the disk worth saving, and now
data is getting copied off based on human reaction time. So that's
why I normally advise users that doing a full image copy of the disk
is much better than, say, "cp -r /home/luser /backup", or cd'ing
around a filesystem hierarchy and trying to save files one by one.
Note also that with SD cards, reading is generally non-destructive and
the time it takes to copy off say, 32 GB really isn't that long.
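A minimal sketch of such a full image copy (device and destination names
are illustrative; dd_rescue or GNU ddrescue handle bad sectors more
gracefully):

  # copy the whole device, continuing past read errors and
  # zero-padding unreadable blocks so offsets stay aligned
  dd if=/dev/sdb of=/backup/card.img bs=1M conv=noerror,sync

The image can then be fsck'ed and loop-mounted read-only without touching
the original media again.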
- Ted
On Sun, Jan 04, 2009 at 11:37:56PM +0100, Pavel Machek wrote:
>
> Are you sure you need to have thrashing? AFAICT metadata + fsync heavy
> workload should be enough... and there were scripts to easily repeat
> that.
The memory pressure is needed to force disk buffers out to disk sooner
than fsync() would normally force buffers out. The scripts which I've
seen induced memory pressure. If the disk is *super* aggressive at
reordering writes, I suppose a heavy fsync workload might be enough on
its own, but in practice, it's generally not enough.
- Ted
On Sunday 04 January 2009 16:06:34 Theodore Tso wrote:
> True enough, although the newer SSD's will have this problem addressed
> (although at least initially, they are **far** more costly than the
> el-cheapo 32GB SD cards you can find at the checkout counter at Fry's
> alongside battery-powered shavers and trashy ipod speakers).
I have great faith in the ability of PC hardware to continue to be crap for
the foreseeable future.
> I will stress again, that most of this doesn't belong in
> Documentation/filesystems/ext3.txt, as most of this is *not*
> ext3-specific.
Yes and no. Ext3 is enough of a "default" filesystem for Linux that some
documentation on when _not_ to use it sounds like a good idea.
That said, some kind of a "choosing a filesystem" file would be good, perhaps
under the filesystems directory. (Then the ext3 doc would just need a brief
comment and a pointer to the other file.)
Rob
On Sunday 04 January 2009 16:55:45 Pavel Machek wrote:
> On Sun 2009-01-04 13:49:49, Rob Landley wrote:
> > On Saturday 03 January 2009 06:38:15 Pavel Machek wrote:
> > > +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> > > +behaving disk subsystem, data that have been successfully synced will
> > > +stay on the disk. Sane means:
> > > +
> > > +* writes to media never fail. Even if disk returns error condition
> > > during + write, ext3 can't handle that correctly, because success on
> > > fsync was already + returned when data hit the journal.
> > > +
> > > + (Fortunately writes failing are very uncommon on disks, as they
> > > + have spare sectors they use when write fails.)
> > > +
> > > +* either whole sector is correctly written or nothing is written
> > > during + powerfail.
> > > +
> > > + (Unfortunately, none of the cheap USB/SD flash cards I have seen
> > > behave + like this, and are unsuitable for ext3.
> >
> > Want to document the granularity issues with flash, while you're at it?
> >
> > An inherent problem with using flash as a normal block device is that the
> > flash erase size is bigger than most filesystem sector sizes. So when
> > you request a write, it may erase and rewrite the next 64k, 128k, or even
> > a couple megabytes on the really _big_ ones.
> >
> > If you lose power in the middle of that, ext3 won't notice that data in
> > the "sectors" _after_ the one you were trying to write to got trashed.
> >
> > The flash filesystems take this into account as part of their wear
> > levelling stuff (they normally copy the entire chunk into a new chunk,
> > leaving the old one in place until it's no longer needed), but they need
> > to query the device to get the erase granularity in order to do that,
> > which is why they don't work on non-flash block devices.
>
> Is there a linux filesystem that can handle that? I know jffs2, but
> that's unsuitable for stuff like USB thumb drives, right?
Any of the flash filesystems should handle that. The main problem with jffs2
is it doesn't scale well to large device sizes. UBIFS is supposed to scale
much better, but I haven't played with it yet.
And the thing about USB thumb drives is they present as a normal block device,
_not_ as flash, so you can't _query_ their erase granularity. (It's like
those hardware raid cards that wouldn't tell you they were striping and such,
so you had to figure out a well-performing layout all by yourself.) They do
it magically behind the scenes, and if the power goes out (or you yank the
device out unexpectedly) and they haven't got a built-in capacitor or battery
with enough power to complete their pending transaction, you're screwed.
Plus they do horrible wear levelling, the lot of 'em. Read Val Henson's
livejournal entry about it: http://valhenson.livejournal.com/25228.html
There was also a marvelous thread Linus participated in on some hardware
industry web message board, but I have no idea where it's gone...
> Does this sound like a fair summary?
See Ted's comment. The summary's fine, the question is where to put this sort
of thing...
> 	If you lose power in the middle of that, the filesystem
> 	won't notice that data in the "sectors" _after_ the
> 	one you were trying to write to got trashed.
Well, the journal won't notice. An e2fsck will notice huge swaths of missing
metadata, but won't be able to do anything about it. (And if what got zapped
was file _contents_ rather than metadata, you're on your own finding it. Fun,
isn't it?)
> Pavel
Rob
On Sunday 04 January 2009 17:00:53 Pavel Machek wrote:
> Document linux filesystem expectations. Ext3 can't handle write errors
> of any kind, and can't handle non-atomic sector writes. Other
> filesystems are probably even worse...
These concerns look like they're specifically for block backed filesystems,
which is one of four different types. I wrote a longish incoherent rant to
the busybox list about the different types of filesystems a couple months
back, in the context of a thread about implementing the "mount" command.
Dunno how relevant it is:
http://lists.busybox.net/pipermail/busybox/2008-November/067970.html
There are a couple of fun relevant corner cases, such as the fact that nfs is
the only filesystem I'm aware of where the return value of close() can actually
mean something. (Due to the caching, you tend to get errors reported
_there_. I don't remember why, if I ever knew.)
> Signed-off-by: Pavel Machek <[email protected]>
>
> diff --git a/Documentation/filesystems/expectations.txt
> b/Documentation/filesystems/expectations.txt new file mode 100644
> index 0000000..7817a9c
> --- /dev/null
> +++ b/Documentation/filesystems/expectations.txt
> @@ -0,0 +1,44 @@
> +Linux filesystems can only work correctly when several conditions are
> +met in the block layer and below (disks, flash cards). Some of them
> +are obvious ("data on media should not change randomly"), some are
> +less so.
> +
> +Write errors not allowed (NO-WRITE-ERRORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Writes to media never fail. Even if the disk returns an error condition
> +during write, filesystems can't handle that correctly, because success
> +on fsync was already returned when the data hit the journal.
> +
> +	Fortunately, failing writes are very uncommon on traditional
> +	spinning disks, as they have spare sectors they use when a
> +	write fails.
The failures show up in dmesg(), and some filesystems will remount themselves
read-only if the physical media driver manages to propagate an error back
to the filesystem. (Note that the scsi subsystem has historically had so many
glue layers that it couldn't manage to do this; that's been improved over the
years, but whether or not it actually _works_ now, I couldn't tell you.)
Some kind of system monitor could notice the dmesg entries, but the actual
write goes into the cache and the physical media error normally happens long
after the program that did the write returned from its write call, often after
it closed the file, and sometimes after it exited.
Even sync() and fsync() won't help you there because if multiple processes do
that, only the _first_ one will get the physical media error. (The filesystem
doesn't associate physical media errors with processes; there's too many
layers in between and it's not necessarily a 1:1 relationship anyway.)
> +Sector writes are atomic (ATOMIC-SECTORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> +	Unfortunately, none of the cheap USB/SD flash cards I have seen
> +	behave like this, and they are unsuitable for all Linux filesystems
> +	I know of.
My impression is you might as well leave the suckers vfat. It's a stupid
little filesystem but its very stupidity makes it as resistant to damage as
anything else (which admittedly isn't much), and it's had such a history of
_taking_ damage that the tools to cope with damage to it are actually pretty
good.
That said, constant updates to the first few sectors will burn out your USB
flash disk if you use it as something other than a backup media. That's true
with a lot of filesystems. (Hardware wear levelling isn't very good, it
cycles between the same dozen or so physical sectors for each logical sector.)
In general, those game consoles that say "please don't power off the thing
while we're writing to flash" have a reason for the message. :)
> + An inherent problem with using flash as a normal block
> + device is that the flash erase size is bigger than
> + most filesystem sector sizes. So when you request a
> + write, it may erase and rewrite the next 64k, 128k, or
> + even a couple megabytes on the really _big_ ones.
> +
> +	If you lose power in the middle of that, the filesystem
> +	won't notice that data in the "sectors" _after_ the
> +	one you were trying to write to got trashed.
> +
> +	Because RAM tends to fail faster than the rest of the system
> +	during powerfail, special hw killing DMA transfers may be
> +	necessary; otherwise, disks may write garbage during powerfail.
> +	Not sure how common that problem is on generic PC machines.
> +
> +
> +
> +
> diff --git a/Documentation/filesystems/ext3.txt
> b/Documentation/filesystems/ext3.txt index 9dd2a3b..8cb64b0 100644
> --- a/Documentation/filesystems/ext3.txt
> +++ b/Documentation/filesystems/ext3.txt
> @@ -188,6 +197,25 @@ mke2fs: create a ext3 partition with th
> debugfs: ext2 and ext3 file system debugger.
> ext2online: online (mounted) ext2 and ext3 filesystem resizer
>
> +Requirements
> +============
> +
> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* write errors not allowed
> +
> +* sector writes are atomic
> +
> +(see expectations.txt; note that most/all linux filesystems have similar
> +expectations)
nfs, cifs, procfs, sysfs, usbfs, tmpfs, ramfs, fuse...
> +* either write caching is disabled, or hw can do barriers and they are
> enabled. +
> + (Note that barriers are disabled by default, use "barrier=1"
> + mount option after making sure hw can support them).
So how does one make sure hw can support them?
Rob
On Sunday 04 January 2009 17:13:08 Sitsofe Wheeler wrote:
> Pavel Machek wrote:
> > Is there a linux filesystem that can handle that? I know jffs2, but
> > that's unsuitable for stuff like USB thumb drives, right?
>
> This raises the question that if nothing can handle it which FS is the
> least bad? The last I heard people were saying that with cheap SSDs the
> recommendation was FAT [1] but in the future btrfs, nilfs and logfs
> would be better.
>
> [1] http://lkml.org/lkml/2008/10/14/129
I wonder if the flash filesystems could be told via mount options that they're
to use a normal block device as if it was a flash with granularity X?
They can't explicitly control erase, but writing to any block in a block group
will erase and rewrite the whole group so they can just do large write
transactions close to each other and the device should aggregate enough for an
erase block. (Plus don't touch anything _outside_ where you guess an erase
block to be until you've finished writing the whole block, which they
presumably already do.)
The other question is whether there's any way to guess an erase granularity
that's "good enough" for a device of size X, maybe larger than the device
actually does but not smaller than any remotely sane manufacturer would
implement. (And just _don't_ partition these suckers, so you don't have to
worry about partitions aligning with erase block sizes.)
Rob
On Saturday 03 January 2009 18:19:00 Pavel Machek wrote:
> No, you can't mount unclean ext3 as an ext2; patch to do that would be
> possible but...
tune2fs -O ^has_journal /dev/blah
fsck.ext2 -f /dev/blah
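and, assuming nothing else went wrong, a fresh journal can be added back
afterwards to turn the filesystem into ext3 again (tune2fs -j only adds a
journal; existing data is left alone):

  tune2fs -j /dev/blah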
Rob
On Sun, 4 Jan 2009, Rob Landley wrote:
> On Sunday 04 January 2009 17:13:08 Sitsofe Wheeler wrote:
>> Pavel Machek wrote:
>>> Is there a linux filesystem that can handle that? I know jffs2, but
>>> that's unsuitable for stuff like USB thumb drives, right?
>>
>> This raises the question that if nothing can handle it which FS is the
>> least bad? The last I heard people were saying that with cheap SSDs the
>> recommendation was FAT [1] but in the future btrfs, nilfs and logfs
>> would be better.
>>
>> [1] http://lkml.org/lkml/2008/10/14/129
>
> I wonder if the flash filesystems could be told via mount options that they're
> to use a normal block device as if it was a flash with granularity X?
>
> They can't explicitly control erase, but writing to any block in a block group
> will erase and rewrite the whole group so they can just do large write
> transactions close to each other and the device should aggregate enough for an
> erase block. (Plus don't touch anything _outside_ where you guess an erase
> block to be until you've finished writing the whole block, which they
> presumably already do.)
this capability would help for raid arrays as well.
if you have a raid5/6 array, writing one sector to a stripe results in you
reading the parity block for that stripe, reading the rest of the sectors
for the block on that disk, recalculating the parity information and
writing the changed sectors out to both disks.
if you are writing the entire stripe, you can calculate the parity and
just write everything out (no reads necessary).
this would make sequential writes to raid5/6 arrays almost as fast as if
they were raid0 stripes.
if you could define 'erase block size' to be the raid stripe size the same
approach would work for both systems.
when I asked about this on the md list a couple of years ago, the response
that I got was that it was a good idea, but there was no way to get the
information about the low-level topology to the higher levels that would
need to act on the information. now that there is a second case where this
is needed, any mechanism that gets created should be made so that it's
usable for both.
David Lang
>>>>> "Pavel" == Pavel Machek <[email protected]> writes:
Pavel> Does this sound like a fair summary?
Pavel> Sector writes are atomic (ATOMIC-SECTORS)
Pavel> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I'd just like to point out that the all-or-nothing hardware sector
atomicity thing is -- to a large extent -- a myth.
It is mostly true on SCSI class devices because various UNIX, RAID array
and database vendors have spent many years leaning very hard on the
drive manufacturers to make it so.
But it's not a hard guarantee, you can't get it in writing, and it's not
in any of the standards. Hybrid drives with flash had potential to
close that particular loophole but those appear to be dead in the water.
--
Martin K. Petersen Oracle Linux Engineering
On Sunday 04 January 2009 13:21:06 Geert Uytterhoeven wrote:
> I think most people get shocked when they discover that mounting something
> read-only may actually write to the media. This is a bit unexpected (hey, if
> I mount `read-only', I expect that no writes will happen), as it behaved
> differently before the introduction of journalling.
Is this an unreasonable use case:
kill -STOP $(pidof qemu)
mount -o loop,ro hdb.img blah
cp blah/thingy thingy
umount blah
kill -CONT $(pidof qemu)
Currently, if your loopback mount is -t ext3 it'll write to the block device,
and if your mount is -t ext2 it'll refuse to work on an unclean ext3
filesystem, even if it's read only. (But it _will_ work on an unclean ext2
filesystem.)
My theory when I first found out about this was "the filesystem developers
hate me personally".
Rob
>>>>> "Rob" == Rob Landley <[email protected]> writes:
Rob> I wonder if the flash filesystems could be told via mount options
Rob> that they're to use a normal block device as if it was a flash with
Rob> granularity X?
I posted some patches a few months ago that allowed us to do this. In
particular they expose the underlying I/O topology to the filesystems.
That includes minimum, preferred and maximum I/O size for both read and
write as well as alignment. The patches also allow stacking so we get
alignment right on say LVM on top of MD on top of a partitioned disk.
At Kernel Summit/Plumbers Linus absolutely hated this idea in the
context of SSDs. And I don't necessarily disagree with his point that
intel (claim to have) solved this problem.
However, there's still lots of crappy devices out there that we need to
support. And we absolutely need this for RAID (both software and
hardware) as well. I've been meaning to post a new round of these
patches. I'll take a look at them again this week.
The intent was to use the alignment and block sizes to honor erase block
boundaries when merging requests.
SCSI already has knobs that expose the appropriate sizes although not
many vendors implement them yet. I've been talking to a few SSD vendors
about exposing similar parameters with SATA. Most of them are willing
and will happily share this information. Other vendors stop responding
when you ask them too many questions.
--
Martin K. Petersen Oracle Linux Engineering
On Sunday 04 January 2009 17:30:52 Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 11:40:52PM +0100, Pavel Machek wrote:
> > Not necessarily.
> >
> > If I have a bit of precious data and lot of junk on the card, I want
> > to copy out the precious data before the card dies. Reading the whole
> > media may just take too long.
> >
> > That's probably very true for rotating harddrives after headcrash...
>
> For a small amount data, maybe; but the number of seeks is often far
> more destructive than the amount of time the disk is spinning. And in
> practice, what generally happens is the user starts looking around to
> make sure there wasn't anything else on the disk worth saving, and now
> data is getting copied off based on human reaction time. So that's
> why I normally advise users that doing a full image copy of the disk
> is much better than, say, "cp -r /home/luser /backup", or cd'ing
> around a filesystem hierarchy and trying to save files one by one.
That would be true if the disk hardware wasn't doing a gazillion retries to
read a bad sector internally (taking 5 seconds to come back and report
failure), and then the darn scsi layer added another gazillion retries on top
of that, and the two multiply together to make it so slow that when you
leave the thing copying the disk overnight it's STILL not done 24 hours later.
Going in and cherry picking individual files looks kind of appealing in that
situation.
Rob
P.S. Yeah, I had a laptop hard drive crash a month or so back. I remember
when it was still possible to buy storage devices that didn't get arbitrarily
routed through the SCSI layer. I miss those days. I found the patch to route
ramdisks through the scsi layer amusing, though.
On Sunday 04 January 2009 22:02:05 [email protected] wrote:
> when I asked about the on the md list a couple of years ago, the response
> that I got was that it was a good idea, but there was no way to get the
> information about the low-level topology to the higher levels that would
> need to act on the information.
Mount option or an argument to mkfs that stores it in the superblock both work
for me.
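Something along the lines of the existing RAID hints in mke2fs, perhaps
(the numbers are purely illustrative and are in units of filesystem
blocks):

  # hint the allocator about the underlying stripe geometry
  mke2fs -q -t ext3 -E stride=16,stripe-width=48 /dev/md0

though that only steers allocation; it doesn't buy any write atomicity.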
> now that there is a second case where this
> is needed, any mechanism that gets created should be made so that it's
> useable for both.
The embedded and supercomputing worlds have more in common with each other
than either does with the desktop...
> David Lang
Rob
On Sun, 4 Jan 2009, Rob Landley wrote:
> On Sunday 04 January 2009 17:30:52 Theodore Tso wrote:
>> On Sun, Jan 04, 2009 at 11:40:52PM +0100, Pavel Machek wrote:
>>> Not necessarily.
>>>
>>> If I have a bit of precious data and lot of junk on the card, I want
>>> to copy out the precious data before the card dies. Reading the whole
>>> media may just take too long.
>>>
>>> That's probably very true for rotating harddrives after headcrash...
>>
>> For a small amount data, maybe; but the number of seeks is often far
>> more destructive than the amount of time the disk is spinning. And in
>> practice, what generally happens is the user starts looking around to
>> make sure there wasn't anything else on the disk worth saving, and now
>> data is getting copied off based on human reaction time. So that's
>> why I normally advise users that doing a full image copy of the disk
>> is much better than, say, "cp -r /home/luser /backup", or cd'ing
>> around a filesystem hierarchy and trying to save files one by one.
>
> That would be true if the disk hardware wasn't doing a gazillion retries to
> read a bad sector internally (taking 5 seconds to come back and report
> failure), and then the darn scsi layer added another gazillion retries on top
> of that, and the two multiply together to make it so slow that when you
> leave the thing copying the disk overnight it's STILL not done 24 hours later.
> Going in and cherry picking individual files looks kind of appealing in that
> situation.
I've also had cases where one particular spot on the drive is bad. Any
attempt to read that sector fails and causes the drive to error out until
a reboot. Grabbing individual files, I could skip the file(s) in the
affected portion and retrieve everything else on the drive (or in some
cases a raid array with multiple failures).
David Lang
Rob Landley wrote:
> On Sunday 04 January 2009 17:30:52 Theodore Tso wrote:
>> On Sun, Jan 04, 2009 at 11:40:52PM +0100, Pavel Machek wrote:
>>> Not necessarily.
>>>
>>> If I have a bit of precious data and lot of junk on the card, I want
>>> to copy out the precious data before the card dies. Reading the whole
>>> media may just take too long.
>>>
>>> That's probably very true for rotating harddrives after headcrash...
>> For a small amount data, maybe; but the number of seeks is often far
>> more destructive than the amount of time the disk is spinning. And in
>> practice, what generally happens is the user starts looking around to
>> make sure there wasn't anything else on the disk worth saving, and now
>> data is getting copied off based on human reaction time. So that's
>> why I normally advise users that doing a full image copy of the disk
>> is much better than, say, "cp -r /home/luser /backup", or cd'ing
>> around a filesystem hierarchy and trying to save files one by one.
>
> That would be true if the disk hardware wasn't doing a gazillion retries to
> read a bad sector internally (taking 5 seconds to come back and report
> failure), and then the darn scsi layer added another gazillion retries on top
> of that, and the two multiply together to make it so slow that when you
> leave the thing copying the disk overnight it's STILL not done 24 hours later.
> Going in and cherry picking individual files looks kind of appealing in that
> situation.
>
> Rob
>
> P.S. Yeah, I had a laptop hard drive crash a month or so back. I remember
> when it was still possible to buy storage devices that didn't get arbitrarily
> routed through the SCSI layer. I miss those days. I found the patch to route
> ramdisks through the scsi layer amusing, though.
SCSI layer doesn't do any retries itself. Block layer does.
Even with zero software retries, however, if there are a ton of bad
sectors it can still take ages for them all to fail, reading one at a
time, just from the disk's retries.
> >Sector writes are atomic (ATOMIC-SECTORS)
> >~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >Either whole sector is correctly written or nothing is written during
> >powerfail.
> >
> > 	Unfortunately, none of the cheap USB/SD flash cards I have seen
> > 	behave like this, and they are unsuitable for all Linux filesystems
> > 	I know of.
> >
> > An inherent problem with using flash as a normal block
> > device is that the flash erase size is bigger than
> > most filesystem sector sizes. So when you request a
> > write, it may erase and rewrite the next 64k, 128k, or
> > even a couple megabytes on the really _big_ ones.
> >
> > 	If you lose power in the middle of that, the filesystem
> > 	won't notice that data in the "sectors" _after_ the
> > 	one you were trying to write to got trashed.
>
> Around, not after. The block you are writing could be in the middle or at
> the end of an eraseblock.
Applied, thanks.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
> >>>>> "Pavel" == Pavel Machek <[email protected]> writes:
>
> Pavel> Does this sound like a fair summary?
>
> Pavel> Sector writes are atomic (ATOMIC-SECTORS)
> Pavel> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> I'd just like to point out that the all-or-nothing hardware sector
> atomity thing is -- to a large extent -- a myth.
It is a myth that linux filesystems depend on for safe operation :-(.
> It is mostly true on SCSI class devices because various UNIX, RAID array
> and database vendors have spent many years leaning very hard on the
> drive manufacturers to make it so.
>
> But it's not a hard guarantee, you can't get it in writing, and it's not
> in any of the standards. Hybrid drives with flash had potential to
> close that particular loophole but those appear to be dead in the water.
So "in practice it works but vendors will not guarantee that"?
How true is it for normal SATA drives? Are there some tests I can
just run on a machine, powercycle it a few times, and have it tell me if
my disk is non-ATOMIC-SECTORS?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
> On Sunday 04 January 2009 17:00:53 Pavel Machek wrote:
> > Document linux filesystem expectations. Ext3 can't handle write errors
> > of any kind, and can't handle non-atomic sector writes. Other
> > filesystems are probably even worse...
>
> These concerns look like they're specifically for block backed filesystems,
> which is one of four different types. I wrote a longish incoherent
> rant to
I updated the docs. It now states "block-backed filesystems" in the
first sentence.
> > +Write errors not allowed (NO-WRITE-ERRORS)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Writes to media never fail. Even if the disk returns an error condition
> > +during write, filesystems can't handle that correctly, because success
> > +on fsync was already returned when the data hit the journal.
> > +
> > +	Fortunately, failing writes are very uncommon on traditional
> > +	spinning disks, as they have spare sectors they use when a
> > +	write fails.
>
> The failures show up in dmesg(), and some filesystems will remount themselves
> read only if the physical media driver manages to propogate an error back to
> to the filesystem. (Note that the scsi subsystem has historically
Well, you may get an error in dmesg(), but your data are already gone
at that point (and apps don't read dmesg, anyway :-).
> Even sync() and fsync() won't help you there because if multiple processes do
> that, only the _first_ one will get the physical media error. (The filesystem
> doesn't associate physical media errors with processes; there's too many
> layers in between and it's not necessarily a 1:1 relationship
> anyway.)
sync() does not even have a return value.
Yep. I'm trying to get fsync manpage updated.
> > +* either write caching is disabled, or hw can do barriers and they are
> > enabled. +
> > + (Note that barriers are disabled by default, use "barrier=1"
> > + mount option after making sure hw can support them).
>
> So how does one make sure hw can support them?
hdparm -I reports them. If you don't see "Native Command Queueing",
you have a problem.
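For example (device and mount point are illustrative; needs root):

  # see what the drive advertises
  hdparm -I /dev/sda | grep -i -e queue -e 'write cache'
  # if cache flushes look trustworthy, enable barriers on ext3
  mount -o remount,barrier=1 /mountpoint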
Interestingly, neither my x60 notebook nor a pretty recent AMD
workstation has NCQ... An AMD notebook seems to be OK.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
> That would be true if the disk hardware wasn't doing a gazillion retries to
> read a bad sector internally (taking 5 seconds to come back and report
> failure), and then the darn scsi layer added another gazillion retries on top
> of that, and the two multiply together to make it so slow that when you
> leave the thing copying the disk overnight it's STILL not done 24 hours later.
> Going in and cherry picking individual files looks kind of appealing in that
> situation.
You could of course just learn to use the functions the kernel provides.
If you want to recover disk blocks without retrying you can do that via
SG_IO. If you want to adjust the timeout and retry levels you can do that
too via sysfs.
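For example, the per-device timeout lives under the SCSI device's sysfs
node (path and value are illustrative; the value is in seconds):

  # fail bad-sector reads after 7 seconds instead of the default 30
  echo 7 > /sys/block/sda/device/timeout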
Alan
> How true is it for normal SATA drives? Are there some tests I can
> just run on a machine, powercycle it a few times, and have it tell me if
> my disk is non-ATOMIC-SECTORS?
No.
And even if it did, writes to one sector can damage another. The
mathematical certainty stuff lives only in the world of maths. In the
real world everything is probabilities.
Alan
> If dm supported barriers, this wouldn't be an issue. Personally, I
"If the dm people applied the patches to support barriers" I believe is
the correct description - Andi ?
dm and md want fixing and even in the md case it isn't hard to do right.
> > or disabling write cache (but, as Alan Cox said, this
> > shortens the lifespan of the disk).
>
> Huh? I've never heard an assertion that disabling the write cache (I
> assume you mean using write-through caching as opposed to write-back
> caching), shortens the lifespan of disk drives. Aggressive battery
That's what I was told by a disk vendor - simply because the drive makes a
lot more mechanical movements and writes.
> your noticing it, you can avoid running fsck at boot time. It's
> really more about shortening the boot time after a crash more than
> anything else.
That depends enormously on your environment. In a secure environment full
data journalling is practically essential to avoid the tiny risk of bits
of important data turning up in another users file.
Alan
On Sun, Jan 04, 2009 at 11:34:33PM +0100, Pavel Machek wrote:
> @@ -14,6 +14,11 @@ Options
> When mounting an ext3 filesystem, the following option are accepted:
> (*) == default
>
> +ro Mount filesystem read only. Note that ext3 will replay
> + the journal (and thus write to the partition) even when
> + mounted "read only". "ro, noload" can be used to prevent
> + writes to the filesystem.
I'd suggest "ro,noload" since the spaces screw up the mount options
parsing both on the command-line and in /etc/fstab. So how about:
Using the mount options "ro,noload" can be used....
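For example (device and mount point are illustrative):

  # no journal replay, and no writes to the device at all
  mount -t ext3 -o ro,noload /dev/sdb1 /mnt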
> @@ -95,6 +102,8 @@ debug Extra debugging information is s
> errors=remount-ro(*) Remount the filesystem read-only on an error.
> errors=continue Keep going on a filesystem error.
> errors=panic Panic and halt the machine if an error occurs.
> + (Note that default is overriden by superblock
> + setting on most systems).
The default is always specified by the superblock setting. So users
will probably find it easier to understand if we remove the "(*)" and
to add the explanatory comment:
(These mount options override the errors behavior
specified in the superblock, which can be configured
using tune2fs)
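For example (device name is illustrative):

  # show the current superblock default
  tune2fs -l /dev/sda1 | grep -i 'errors behavior'
  # change it, e.g. to match the documented default
  tune2fs -e remount-ro /dev/sda1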
Pavel, thanks for working on improving the documentation; with these
fixes,
Signed-off-by: "Theodore Ts'o" <[email protected]>
- Ted
On Monday 05 January 2009 05:19:13 Alan Cox wrote:
> You could of course just learn to use the functions the kernel provides.
> If you want to recover disk blocks without retrying you can do that via
> SG_IO. If you want to adjust the timeout and retry levels you can do that
> too via sysfs.
Good to know, but "my laptop hard drive just died" is not the optimal time to
learn these sorts of things.
> Alan
Rob
>>>>> "Pavel" == Pavel Machek <[email protected]> writes:
>> It is mostly true on SCSI class devices because various UNIX, RAID
>> array and database vendors have spent many years leaning very hard on
>> the drive manufacturers to make it so.
>>
>> But it's not a hard guarantee, you can't get it in writing, and it's
>> not in any of the standards. Hybrid drives with flash had potential
>> to close that particular loophole but those appear to be dead in the
>> water.
Pavel> So "in practice it works but vendors will not guarantee that"?
It works some of the time. But in reality if you yank power halfway
during a write operation the end result is undefined.
The saving grace for normal users is that the potential corruption is
limited to a couple of sectors.
The current suck of flash SSDs is that the erase block size amplifies
this problem by at least one order of magnitude, often two. I have a
couple of SSDs here that will leave my filesystem in shambles every time
the machine crashes. I quickly got tired of reinstalling Fedora several
times per week so now my main machine is back to spinning media.
The people that truly and deeply care about this type of write atomicity
(i.e. enterprises) deploy disk arrays that will do the right thing in
face of an error. This involves NVRAM, mirrored caches, uninterruptible
power supplies, etc. Brute force if you will.
High-end arrays even give you atomicity at a bigger granularity such as
filesystem or database blocks. On some storage you can say "this LUN is
used for an Oracle database that always writes in multiples of 8KB" and
the array will guarantee that each 8KB block of the I/O is written in
its entirety or not at all. Some arrays even allow you to verify Oracle
logical block checksums to ensure that the I/O is intact and internally
consistent.
I have been bugging storage vendors about a per-I/O write atomicity
setting for a while. But it really messes up their pipelining so they
aren't keen on the idea. We may be able to get some of it fixed as a
side-effect of the DIF bits vs. the impending switch to 4KB sectors,
though.
--
Martin K. Petersen Oracle Linux Engineering
On Sun, Jan 04, 2009 at 01:56:32PM -0600, Rob Landley wrote:
> On Saturday 03 January 2009 17:01:58 Martin MOKREJŠ wrote:
> > > Still handy for recovering badly broken filesystems, I'd say.
> >
> > Me as well. How about improving you doc patch with some summary of
> > this thread (although it is probably not over yet)? ;-) Definitely,
> > a note that one can mount it as ext2 while read-only would be helpful
> > when doing some forensics on the disk.
>
> Although make sure you _do_ mount it as read only because if you mount an ext3
> filesystem read/write as ext2 I've had it zap the journal entirely and then
> you have to tune2fs -j the sucker to turn it back into ext3.
>
> Ext3 is... touchy.
Um.... horse pucky:
# mke2fs -q -t ext3 /dev/thunk/footest
# debugfs -R features /dev/thunk/footest
debugfs 1.41.3 (12-Oct-2008)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype sparse_super large_file
# mount -t ext2 /dev/thunk/footest /mnt
# touch /mnt/foo
# umount /mnt
# debugfs -R features /dev/thunk/footest
debugfs 1.41.3 (12-Oct-2008)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype sparse_super large_file
- Ted
>>>>> "Rob" == Rob Landley <[email protected]> writes:
Rob> On Monday 05 January 2009 05:19:13 Alan Cox wrote:
>> You could of course just learn to use the functions the kernel
>> provides. If you want to recover disk blocks without retrying you
>> can do that via SG_IO. If you want to adjust the timeout and retry
>> levels you can do that too via sysfs.
Rob> Good to know, but "my laptop hard drive just died" is not the
Rob> optimal time to learn these sorts of things.
http://www.garloff.de/kurt/linux/ddrescue/
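The basic invocation is simply (names illustrative; unlike plain dd it
continues past read errors, and -r copies backwards from the end, which
helps around bad spots):

  dd_rescue /dev/sdb /backup/card.img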
--
Martin K. Petersen Oracle Linux Engineering
On Mon, Jan 05, 2009 at 02:15:44PM -0500, Martin K. Petersen wrote:
>
> It works some of the time. But in reality if you yank power halfway
> during a write operation the end result is undefined.
>
> The saving grace for normal users is that the potential corruption is
> limited to a couple of sectors.
A few years ago it was asserted to me that the internal block size for
spinning magnetic media was around 32k. So if the hard drive doesn't
have enough of a capacitor or other energy reserve to complete its
internal read-modify-write cycle, attempts to read the 32k chunk of
disk could result in hard ECC failures that would cause the blocks in
question to all return uncorrectable read errors when they are
accessed.
Of course, if the memory goes south first, and you're in the middle of
streaming a 128k update to the inode table of the filesystem, and the power
fails, and the memory starts returning garbage during the DMA
operation, you may have much bigger problems. :-)
So it's probably more than "a couple of sectors"....
> The current suck of flash SSDs is that the erase block size amplifies
> this problem by at least one order of magnitude, often two. I have a
> couple of SSDs here that will leave my filesystem in shambles every time
> the machine crashes. I quickly got tired of reinstalling Fedora several
> times per week so now my main machine is back to spinning media.
The erase block size is typically 1 to 4 megabytes, from my
understanding. So yeah, that's easily 1-2 orders of magnitude. Worse
yet, flash's sequential streaming write speeds are much slower than
hard drive's (anywhere from a factor of 3 to 12 depending on how
cheap/trashy the flash drive happens to be), so that opens the time
window even further, by possibly as much as another order of magnitude.
I also suspect that HDD manufacturers have learned various tricks (due
to enterprise storage/database vendors leaning on them) to make the
drives appear more atomic in the face of hard drive errors, and also,
in Pavel's case, as I recall he was using the card in a laptop where
the SD card protruded slightly from the laptop case, and it was very
easy for it to get dislodged, meaning that power failures during
writes were even more likely than you would expect with a fixed HDD or
SDD which is secured into place using screws or other more reliable
mounting hardware.
Put all of this together, given that Pavel's Really Trashy 32GB SD was
probably the full 3 orders of magnitude worse than traditional HDD,
and he was having many more failures due to physical mounting issues,
it's not surprising that most people haven't seen problems with
traditional HDDs, even though none of this is guaranteed by the hard drive
vendors.
> The people that truly and deeply care about this type of write atomicity
> (i.e. enterprises) deploy disk arrays that will do the right thing in
> face of an error. This involves NVRAM, mirrored caches, uninterruptible
> power supplies, etc. Brute force if you will.
Don't forget non-cheesy mounting options so an accidental brush
against the side of the unit doesn't cause the hard drive to become
disconnected from system and suffer a power drop. I guess that gets
filed under "Brute force" as well. :-)
- Ted
P.S. I feel obliged to point out that in my Lenovo X61s, the SD card
is flush with the laptop case when inserted, and I've never had a
problem with the SD card prematurely ejected during operation. :-)
Theodore Tso wrote:
> Don't forget non-cheasy mounting options so an accidental brush
> against the side of the unit doesn't cause the hard drive to become
> disconnected from system and suffer a power drop. I guess that gets
> filed under "Brute force" as well. :-)
Are you thinking of sync? If so, I have experience of this not helping
with ext3 on an 8Gbyte SD card in an EeePC 900. Sooner or later a bunch
of zeros overwrites the early part of the partition and an fsck tears
the FS apart. This seems to happen quickly if you are booting your root
from the SD card (no swap though). A FAT32 partition seems to be
unperturbed so far (but it's not being used the same way as the ext3
partition).
> P.S. I feel obliged to point out that in my Lenovo X61s, the SD card
> is flush with the laptop case when inserted, and I've never had a
> problem with the SD card prematurely ejected during operation. :-)
I have never had premature ejection of SD with my Eee...
Rob Landley wrote:
> There was also a marvelous thread Linus participated in on some hardware
> industry web message board, but I have no idea where it's gone...
http://www.realworldtech.com/forums/index.cfm?action=detail&id=92702&threadid=92678&roomid=2
? Can you remember a bit more about this thread?
On Mon, Jan 05, 2009 at 08:36:28PM +0000, Sitsofe Wheeler wrote:
> Theodore Tso wrote:
>> Don't forget non-cheesy mounting options so an accidental brush
>> against the side of the unit doesn't cause the hard drive to become
>> disconnected from system and suffer a power drop. I guess that gets
>> filed under "Brute force" as well. :-)
>
> Are you thinking of sync?
No, I was talking about physical mounting issues; as I said earlier in
the e-mail message you replied to (you must not have read my e-mail
carefully), Pavel ran into a problem where the SD card protruded
slightly from the laptop case, and it would easily (via physical
contact) get loosened from its connector so that it would become
disconnected from the laptop, causing it to lose power, sometimes
while it was writing, leading to filesystem corruptions.
> If so I have experience of this not helping
> with ext3 on an 8Gbyte SD card in an EeePC 900. Sooner or later a bunch
> of zeros overwrites the early part of the partition and an fsck tears
> the FS apart. This seems to happen quickly if you are booting your root
> from the SD card (no swap though). A FAT32 partition seems to be
> unperturbed so far (but it's not being used the same way as the ext3
> partition).
A quick google search found some interesting posts on the subject:
http://forum.eeeuser.com/viewtopic.php?id=37174
http://lists.alioth.debian.org/pipermail/debian-eeepc-devel/2008-August/000837.html
http://www.spinics.net/lists/linux-scsi/msg28197.html
http://kerneltrap.org/mailarchive/git-commits-head/2008/8/6/2832894
- Ted
On Monday 05 January 2009 16:25:52 Sitsofe Wheeler wrote:
> Rob Landley wrote:
> > There was also a marvelous thread Linus participated in on some hardware
> > industry web message board, but I have no idea where it's gone...
>
> http://www.realworldtech.com/forums/index.cfm?action=detail&id=92702&threadid=92678&roomid=2
> ? Can you remember a bit more about this thread?
Yeah, that looks like it.
It's a big thread, but I found it educational. There's a lot more interesting
info was scattered further down. (Linus himself replied a dozen times.)
Rob
On Mon 2009-01-05 09:57:13, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 11:34:33PM +0100, Pavel Machek wrote:
> > @@ -14,6 +14,11 @@ Options
> > When mounting an ext3 filesystem, the following option are accepted:
> > (*) == default
> >
> > +ro Mount filesystem read only. Note that ext3 will replay
> > + the journal (and thus write to the partition) even when
> > + mounted "read only". "ro, noload" can be used to prevent
> > + writes to the filesystem.
>
> I'd sugest "ro,noload" since the spaces screw up the mount options
> parsing both on the command-line and in /etc/fstab. So how about:
>
> Using the mount options "ro,noload" can be used....
Too many "using", but yes, fixed, thanks.
> > @@ -95,6 +102,8 @@ debug Extra debugging information is s
> > errors=remount-ro(*) Remount the filesystem read-only on an error.
> > errors=continue Keep going on a filesystem error.
> > errors=panic Panic and halt the machine if an error occurs.
> > + (Note that default is overriden by superblock
> > + setting on most systems).
>
> The default is always specified by the superblock setting. So users
> will probably find it easier to understand if we remove the "(*)" and
> to add the explanatory comment:
>
> (These mount options override the errors behavior
> specified in the superblock, which can be configured
> using tune2fs)
>
> Pavel, thanks for working on improving the documentation; with these
> fixes,
Thanks!
---
ext3 has quite unexpected semantics of "ro", and the defaults are
not what they are documented to be, due to the mkfs override.
Signed-off-by: Pavel Machek <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..49c08bf 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -14,6 +14,11 @@ Options
When mounting an ext3 filesystem, the following option are accepted:
(*) == default
+ro Mount filesystem read only. Note that ext3 will replay
+ the journal (and thus write to the partition) even when
+ mounted "read only". Mount options "ro,noload" can be
+ used to prevent writes to the filesystem.
+
journal=update Update the ext3 file system's journal to the current
format.
@@ -27,7 +32,9 @@ journal_dev=devnum When the external jou
identified through its new major/minor numbers encoded
in devnum.
-noload Don't load the journal on mounting.
+noload	Don't load the journal on mounting. Note that this forces
+	the mount of an inconsistent filesystem, which can lead to
+	various problems.
data=journal All data are committed into the journal prior to being
written into the main file system.
@@ -92,9 +99,12 @@ nocheck
debug Extra debugging information is sent to syslog.
-errors=remount-ro(*) Remount the filesystem read-only on an error.
+errors=remount-ro Remount the filesystem read-only on an error.
errors=continue Keep going on a filesystem error.
errors=panic Panic and halt the machine if an error occurs.
+ (These mount options override the errors behavior
+ specified in the superblock, which can be
+ configured using tune2fs.)
data_err=ignore(*) Just print an error message if an error occurs
in a file data buffer in ordered mode.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Sat, 03 Jan 2009, Pavel Machek wrote:
> On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
> > [Fixed top-posting]
> >
> > 2009/1/3 Martin MOKREJŠ <[email protected]>:
> > > Pavel Machek wrote:
> > >> readonly mount does actually write to the media in some cases. Document that.
> > >>
> > > Can one avoid replay of the journal then if it would be unclean?
> > > Just curious.
> >
> > Nope. If the underlying block device is read-only then mounting the
> > filesystem will fail. I tried to fix this some time ago, and have a
> > set of patches that almost always work, but "almost always" isn't good
> > enough. Unfortunately I never managed to figure out a way to finish it
> > off without disgusting hacks or major surgery.
>
> Uhuh, can you just ignore the journal and mount it anyway?
An ext3 file system that needs journal recovery sets one of the ext2
incompatible flags to prevent just that.
> ...basically treating it like an ext2?
>
> ...ok, that will present "old" version of the filesystem to the
> user... violating fsync() semantics.
>
> Still handy for recovering badly broken filesystems, I'd say.
While you cannot have that, you'll need to dump the file system
(possibly with dd_rescue) to another medium and work on the copy.
That's what you should do anyway. ;-)
I think if you really want to mount the file system without journal
replay, you need to clear the needs-recovery "incompat" flag (on the
copy, obviously).
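A sketch of that, on the copy only (the debugfs "feature" command is
assumed here; tune2fs won't clear this flag):

  # mark the copy as not needing recovery, then check it
  debugfs -w -R "feature -needs_recovery" /dev/copy
  e2fsck -f /dev/copy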
--
Matthias Andree
On Sun, 04 Jan 2009, Martin MOKREJŠ wrote:
> Pavel Machek wrote:
> > On Sat 2009-01-03 22:17:15, Duane Griffin wrote:
> >> [Fixed top-posting]
> >>
> >> 2009/1/3 Martin MOKREJŠ <[email protected]>:
> >>> Pavel Machek wrote:
> >>>> readonly mount does actually write to the media in some cases. Document that.
> >>>>
> >>> Can one avoid replay of the journal then if it would be unclean?
> >>> Just curious.
> >> Nope. If the underlying block device is read-only then mounting the
> >> filesystem will fail. I tried to fix this some time ago, and have a
> >> set of patches that almost always work, but "almost always" isn't good
> >> enough. Unfortunately I never managed to figure out a way to finish it
> >> off without disgusting hacks or major surgery.
> >
> > Uhuh, can you just ignore the journal and mount it anyway?
> > ...basically treating it like an ext2?
> >
> > ...ok, that will present an "old" version of the filesystem to the
> > user... violating fsync() semantics.
>
> Hmm, so if my dual-boot machine does not shut down correctly and I
> accidentally boot into M$ Win where I use the ext2 IFS driver and modify
> some stuff on the ext3 drive, then after a while reboot to Linux and the
> journal gets re-played ... Mmm ...
If the ext2 IFS driver mounts an ext3 file system that needs journal
replay, the IFS driver is broken (unless it can replay the journal, of
course - I stopped using that driver long ago, being unhappy with it).
--
Matthias Andree
On Sun, 04 Jan 2009, Rob Landley wrote:
> On Sunday 04 January 2009 17:30:52 Theodore Tso wrote:
> > On Sun, Jan 04, 2009 at 11:40:52PM +0100, Pavel Machek wrote:
> > > Not necessarily.
> > >
> > > If I have a bit of precious data and lot of junk on the card, I want
> > > to copy out the precious data before the card dies. Reading the whole
> > > media may just take too long.
> > >
> > > That's probably very true for rotating hard drives after a head crash...
> >
> > For a small amount of data, maybe; but the number of seeks is often far
> > more destructive than the amount of time the disk is spinning. And in
> > practice, what generally happens is the user starts looking around to
> > make sure there wasn't anything else on the disk worth saving, and now
> > data is getting copied off based on human reaction time. So that's
> > why I normally advise users that doing a full image copy of the disk
> > is much better than, say, "cp -r /home/luser /backup", or cd'ing
> > around a filesystem hierarchy and trying to save files one by one.
>
> That would be true if the disk hardware wasn't doing a gazillion retries to
> read a bad sector internally (taking 5 seconds to come back and report
> failure), and then the darn scsi layer added another gazillion retries on top
> of that, and the two multiply together to make it so slow that when you
> leave the thing copying the disk overnight it's STILL not done 24 hours later.
> Going in and cherry picking individual files looks kind of appealing in that
> situation.
Well, I recently (Dec 1st or so) had a venerable HDD fail with a couple
of bad sectors, with only oldish backups (a couple of days old; a
Samsung SP2004C plugged into a VIA VT6420 in the VT8237 south bridge).
I couldn't use dd_rescue since the OS detached the disk drive as soon as
it hit the first error.
Whether it was a software or hardware fault I cannot say; FreeBSD
7.1-PRERELEASE showed the same behaviour as the openSUSE 11.0 i686
kernels, but then again it might be either OS losing patience with the
drive doing excessive reads, or the drive actually violating the bus
protocols or hanging. It didn't need power cycling though; detaching and
reattaching the ATA bus was sufficient.
For me, recovery was possible with rsync (or cp -Rp): run rsync -avH
until the drive froze, figure out which file was affected, note the
name, remount the drive, rm the affected file, and continue.
Is there a "don't retry reads" setting in the kernel that I've missed?
(I still have the drive, so I can try some error handling patches if
desired.)
--
Matthias Andree
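That manual rsync cycle could be scripted roughly as follows (a sketch
only: the paths are hypothetical, and the bus-reattachment step is
hardware-specific, so a human still has to drive the loop):

  # Retry rsync, excluding files that have already caused a freeze
  # instead of rm'ing them from the sick disk.
  touch /tmp/bad-files
  until rsync -avH --exclude-from=/tmp/bad-files /mnt/sick/ /backup/; do
      echo "rsync aborted; enter the path of the file that hung it:"
      read -r badfile
      printf '%s\n' "$badfile" >> /tmp/bad-files
      # ...reattach the bus and remount the drive here before retrying...
  done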
On Mon, 05 Jan 2009, Martin K. Petersen wrote:
> >>>>> "Rob" == Rob Landley <[email protected]> writes:
>
> Rob> On Monday 05 January 2009 05:19:13 Alan Cox wrote:
> >> You could of course just learn to use the functions the kernel
> >> provides. If you want to recover disk blocks without retrying you
> >> can do that via SG_IO. If you want to adjust the timeout and retry
> >> levels you can do that too via sysfs.
>
> Rob> Good to know, but "my laptop hard drive just died" is not the
> Rob> optimal time to learn these sorts of things.
>
> http://www.garloff.de/kurt/linux/ddrescue/
While nice, it does not reconfigure the block layer to reduce retries;
at least not in a manner I see at a glance; no sysctl or SG_IO or ioctl
or fcntl anywhere.
--
Matthias Andree
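For reference, the sysfs knob Alan mentioned looks roughly like this for
a SCSI/SATA disk (the path is typical but may differ per system, and
per-command retry counts are not uniformly exposed -- that is where
SG_IO-based tools such as sg_dd come in):

  # Shorten the SCSI command timeout (in seconds) so a failing read
  # gives up sooner; the default is commonly 30 or 60.
  cat /sys/block/sda/device/timeout
  echo 7 > /sys/block/sda/device/timeout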
On Tue, Jan 06, 2009 at 11:08:10AM +0100, Matthias Andree wrote:
> On Sun, 04 Jan 2009, Martin MOKREJŠ wrote:
> > Hmm, so if my dual-boot machine does not shut down correctly and I
> > accidentally boot into M$ Win where I use the ext2 IFS driver and modify
> > some stuff on the ext3 drive, then after a while reboot to Linux and the
> > journal gets re-played ... Mmm ...
>
> If the ext2 IFS driver mounts an ext3 file system that needs journal
> replay, the IFS driver is broken (unless it can replay the journal, of
> course - I stopped using that driver long ago, being unhappy with it).
Indeed; that's why there is an INCOMPAT NEEDS_RECOVERY feature flag to
prevent compliant ext2 implementations from mounting an ext3
filesystem that needs recovery. We've thought about most of these
issues, almost a decade ago...
- Ted
On Tue, Jan 06, 2009 at 11:41:46AM +0100, Matthias Andree wrote:
> > Rob> Good to know, but "my laptop hard drive just died" is not the
> > Rob> optimal time to learn these sorts of things.
> >
> > http://www.garloff.de/kurt/linux/ddrescue/
>
> While nice, it does not reconfigure the block layer to reduce retries;
> at least not in a manner I see at a glance; no sysctl or SG_IO or ioctl
> or fcntl anywhere.
Well, Kurt Garloff wrote that program years and years ago. I'm sure
if someone created patches he'd probably accept them, though. It's
still the best program I've found for doing image backups in
catastrophic situations.
- Ted
Theodore Tso <[email protected]> writes:
> On Tue, Jan 06, 2009 at 11:41:46AM +0100, Matthias Andree wrote:
>> > Rob> Good to know, but "my laptop hard drive just died" is not the
>> > Rob> optimal time to learn these sorts of things.
>> >
>> > http://www.garloff.de/kurt/linux/ddrescue/
>>
>> While nice, it does not reconfigure the block layer to reduce retries;
>> at least not in a manner I see at a glance; no sysctl or SG_IO or ioctl
>> or fcntl anywhere.
>
> Well, Kurt Garloff wrote that program years and years ago. I'm sure
> if someone created patches he'd probably accept them, though. It's
> still the best program I've found for doing image backups in
> catastrophic situations.
Better would be just to incorporate the functionality as an option
into standard GNU dd. Then everyone would easily have access to it.
-Andi
--
[email protected]
On Tue, Jan 06, 2009 at 04:40:33PM +0100, Andi Kleen wrote:
> > Well, Kurt Garloff wrote that program years and years ago. I'm sure
> > if someone created patches he'd probably accept them, though. It's
> > still the best program I've found for doing image backups in
> > catastrophic situations.
>
> Better would be just to incorporate the functionality as an option
> into standard GNU dd. Then everyone would easily have access to it.
I'm not sure whether the GNU coreutils maintainer would be willing to
accept a series of Linux-specific interfaces, but dd_rescue also has
the advantage that it uses a large blocksize for speed, but when an
> error is returned, it backs off to a small block size to recover the
maximum amount of data, and then later returns to the large block
size. (Ideally, it should be able to query the disk drive to
determine its internal block size, and use that for the smaller block
size, but I'm not sure if there's a standardized way that value is
> exposed by HDDs or SSDs.)
The dd_rescue program also has a progress bar, which as we all know
makes things go faster :-), and is useful because it means the user
knows how much of the disk has been copied, and whether he/she should
go to sleep for the night, or grab a cup of coffee or beer. Its user
interface is also much simpler, and it's much easier to interrupt it
and start it up again where it left off. (You can do this with dd,
but the average inexperienced user will be horribly confused by the dd
man page, and might easily screw up or skip one of the seek or skip
options.)
Of course, the right answer is to pursue both paths, although my
experience getting changes into the core/file/shell-utils has been
frustrating and unpleasant; granted, that was over ten years ago, and
hopefully the maintainer has been replaced since then by one who is
more responsive. OTOH, Kurt's a good guy, and would probably be
willing to accept patches to improve dd_rescue.
- Ted
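The large/small block behaviour Ted describes maps onto dd_rescue's soft
and hard block size options, roughly like this (option letters from
memory -- check dd_rescue -h on your version first; device and paths
hypothetical):

  # Read in 64k chunks for speed, dropping to 512-byte reads around
  # errors; the log file lets an interrupted run be resumed later.
  dd_rescue -b 65536 -B 512 -l /backup/rescue.log /dev/sdb /backup/sdb.img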
> Of course, the right answer is to pursue both paths, although my
> experience getting changes into the core/file/shell-utils has been
> frustrating and unpleasant; granted, that was over ten years
I submitted a change to coreutils some time ago and the maintainer
(Jim Meyering) was easy to work with in my experience.
-Andi
--
[email protected]
On Monday 05 January 2009 13:16:58 Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 01:56:32PM -0600, Rob Landley wrote:
> > On Saturday 03 January 2009 17:01:58 Martin MOKREJŠ wrote:
> > > > Still handy for recovering badly broken filesystems, I'd say.
> > >
> > > Me as well. How about improving your doc patch with some summary of
> > > this thread (although it is probably not over yet)? ;-) Definitely,
> > > a note that one can mount it as ext2 while read-only would be helpful
> > > when doing some forensics on the disk.
> >
> > Although make sure you _do_ mount it read-only: when I've mounted an
> > ext3 filesystem read/write as ext2, I've had it zap the journal entirely,
> > and then you have to tune2fs -j the sucker to turn it back into ext3.
> >
> > Ext3 is... touchy.
>
> Um.... horse pucky:
Well I managed to kill it more than once, but I could easily have the
reproduction sequence wrong. (I wasn't _trying_ to do it again...)
> # mke2fs -q -t ext3 /dev/thunk/footest
> # debugfs -R features /dev/thunk/footest
> debugfs 1.41.3 (12-Oct-2008)
> Filesystem features: has_journal ext_attr resize_inode dir_index filetype
> sparse_super large_file
> # mount -t ext2 /dev/thunk/footest /mnt
> # touch /mnt/foo
> # umount /mnt
> # debugfs -R features /dev/thunk/footest
> debugfs 1.41.3 (12-Oct-2008)
> Filesystem features: has_journal ext_attr resize_inode dir_index filetype
> sparse_super large_file
If I can figure out what I did, I'll get back to you.
> - Ted
Rob
On Tuesday 06 January 2009 09:57:29 Theodore Tso wrote:
> On Tue, Jan 06, 2009 at 04:40:33PM +0100, Andi Kleen wrote:
> > > Well, Kurt Garloff wrote that program years and years ago. I'm sure
> > > if someone created patches he'd probably accept them, though. It's
> > > still the best program I've found for doing image backups in
> > > catastrophic situations.
> >
> > Better would be just to incorporate the functionality as an option
> > into standard GNU dd. Then everyone would easily have access to it.
>
> I'm not sure whether the GNU coreutils maintainer would be willing to
> accept a series of Linux-specific interfaces, but dd_rescue also has
> the advantage that it uses a large blocksize for speed, but when an
> > error is returned, it backs off to a small block size to recover the
> maximum amount of data, and then later returns to the large block
> size. (Ideally, it should be able to query the disk drive to
> determine its internal block size, and use that for the smaller block
> size, but I'm not sure if there's a standardized way that value is
> > exposed by HDDs or SSDs.)
I don't suppose there's a Documentation file to put data recovery
information in?
(Maybe the new filesystems expectations file, which doesn't seem like
the best name for it...?)
Rob
On Monday 05 January 2009 05:43:29 Alan Cox wrote:
> > Huh? I've never heard an assertion that disabling the write cache (I
> > assume you mean using write-through caching as opposed to write-back
> > caching) shortens the lifespan of disk drives. Aggressive battery
>
> That's what I was told by a disk vendor - simply because the drive makes a
> lot more mechanical movements and writes.
It certainly sounds like less write caching would shorten the lifespan of
flash devices...
Rob
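For what it's worth, the write cache under discussion is the per-drive
one toggled with hdparm on ATA disks (a sketch; the device name is
hypothetical):

  # Query the current write-cache setting, then turn it off.
  hdparm -W /dev/sda
  hdparm -W0 /dev/sda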
On Tue, 6 Jan 2009, Pavel Machek wrote:
> On Mon 2009-01-05 09:57:13, Theodore Tso wrote:
> > On Sun, Jan 04, 2009 at 11:34:33PM +0100, Pavel Machek wrote:
> > > @@ -14,6 +14,11 @@ Options
> > > When mounting an ext3 filesystem, the following option are accepted:
> > > (*) == default
> > >
> > > +ro Mount filesystem read only. Note that ext3 will replay
> > > + the journal (and thus write to the partition) even when
> > > + mounted "read only". "ro, noload" can be used to prevent
> > > + writes to the filesystem.
> >
> > I'd suggest "ro,noload" since the spaces screw up the mount options
> > parsing both on the command-line and in /etc/fstab. So how about:
> >
> > Using the mount options "ro,noload" can be used....
>
> Too many "using", but yes, fixed, thanks.
>
> > > @@ -95,6 +102,8 @@ debug Extra debugging information is s
> > > errors=remount-ro(*) Remount the filesystem read-only on an error.
> > > errors=continue Keep going on a filesystem error.
> > > errors=panic Panic and halt the machine if an error occurs.
> > > + (Note that the default is overridden by superblock
> > > + setting on most systems).
> >
> > The default is always specified by the superblock setting. So users
> > will probably find it easier to understand if we remove the "(*)" and
> > add the explanatory comment:
> >
> > (These mount options override the errors behavior
> > specified in the superblock, which can be configured
> > using tune2fs)
> >
> > Pavel, thanks for working on improving the documentation; with these
> > fixes,
>
> Thanks!
>
> ---
>
> ext3 has quite unexpected semantics of "ro" and defaults are
> not what they are documented to be, due to mkfs override.
>
> Signed-off-by: Pavel Machek <[email protected]>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
>
> diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
> index 9dd2a3b..49c08bf 100644
> --- a/Documentation/filesystems/ext3.txt
> +++ b/Documentation/filesystems/ext3.txt
> @@ -14,6 +14,11 @@ Options
> When mounting an ext3 filesystem, the following option are accepted:
> (*) == default
>
> +ro Mount filesystem read only. Note that ext3 will replay
> + the journal (and thus write to the partition) even when
> + mounted "read only". Mount options "ro,noload" can be
> + used to prevent writes to the filesystem.
> +
> journal=update Update the ext3 file system's journal to the current
> format.
>
> @@ -27,7 +32,9 @@ journal_dev=devnum When the external jou
> identified through its new major/minor numbers encoded
> in devnum.
>
> -noload Don't load the journal on mounting.
> +noload Don't load the journal on mounting. Note that this forces
> + the mount of an inconsistent filesystem, which can lead
> + to various problems.
>
> data=journal All data are committed into the journal prior to being
> written into the main file system.
> @@ -92,9 +99,12 @@ nocheck
>
> debug Extra debugging information is sent to syslog.
>
> -errors=remount-ro(*) Remount the filesystem read-only on an error.
> +errors=remount-ro Remount the filesystem read-only on an error.
> errors=continue Keep going on a filesystem error.
> errors=panic Panic and halt the machine if an error occurs.
> + (These mount options override the errors behavior
> + specified in the superblock, which can be
> + configured using tune2fs.)
>
> data_err=ignore(*) Just print an error message if an error occurs
> in a file data buffer in ordered mode.
>
So, documentation guys, are you going to take this patch through the
Documentation tree (tytso already Signed off on that), or should I take it
through the trivial tree?
Thanks,
--
Jiri Kosina
SUSE Labs
Jiri Kosina wrote:
> On Tue, 6 Jan 2009, Pavel Machek wrote:
>
>> On Mon 2009-01-05 09:57:13, Theodore Tso wrote:
>>> On Sun, Jan 04, 2009 at 11:34:33PM +0100, Pavel Machek wrote:
>>>> @@ -14,6 +14,11 @@ Options
>>>> When mounting an ext3 filesystem, the following option are accepted:
>>>> (*) == default
>>>>
>>>> +ro Mount filesystem read only. Note that ext3 will replay
>>>> + the journal (and thus write to the partition) even when
>>>> + mounted "read only". "ro, noload" can be used to prevent
>>>> + writes to the filesystem.
>>> I'd suggest "ro,noload" since the spaces screw up the mount options
>>> parsing both on the command-line and in /etc/fstab. So how about:
>>>
>>> Using the mount options "ro,noload" can be used....
>> Too many "using", but yes, fixed, thanks.
>>
>>>> @@ -95,6 +102,8 @@ debug Extra debugging information is s
>>>> errors=remount-ro(*) Remount the filesystem read-only on an error.
>>>> errors=continue Keep going on a filesystem error.
>>>> errors=panic Panic and halt the machine if an error occurs.
>>>> + (Note that the default is overridden by superblock
>>>> + setting on most systems).
>>> The default is always specified by the superblock setting. So users
>>> will probably find it easier to understand if we remove the "(*)" and
>>> add the explanatory comment:
>>>
>>> (These mount options override the errors behavior
>>> specified in the superblock, which can be configured
>>> using tune2fs)
>>>
>>> Pavel, thanks for working on improving the documentation; with these
>>> fixes,
>> Thanks!
>>
>> ---
>>
>> ext3 has quite unexpected semantics of "ro" and defaults are
>> not what they are documented to be, due to mkfs override.
>>
>> Signed-off-by: Pavel Machek <[email protected]>
>> Signed-off-by: "Theodore Ts'o" <[email protected]>
>>
>> diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
>> index 9dd2a3b..49c08bf 100644
>> --- a/Documentation/filesystems/ext3.txt
>> +++ b/Documentation/filesystems/ext3.txt
>> @@ -14,6 +14,11 @@ Options
>> When mounting an ext3 filesystem, the following option are accepted:
>> (*) == default
>>
>> +ro Mount filesystem read only. Note that ext3 will replay
>> + the journal (and thus write to the partition) even when
>> + mounted "read only". Mount options "ro,noload" can be
>> + used to prevent writes to the filesystem.
>> +
>> journal=update Update the ext3 file system's journal to the current
>> format.
>>
>> @@ -27,7 +32,9 @@ journal_dev=devnum When the external jou
>> identified through its new major/minor numbers encoded
>> in devnum.
>>
>> -noload Don't load the journal on mounting.
>> +noload Don't load the journal on mounting. Note that this forces
>> + the mount of an inconsistent filesystem, which can lead
>> + to various problems.
>>
>> data=journal All data are committed into the journal prior to being
>> written into the main file system.
>> @@ -92,9 +99,12 @@ nocheck
>>
>> debug Extra debugging information is sent to syslog.
>>
>> -errors=remount-ro(*) Remount the filesystem read-only on an error.
>> +errors=remount-ro Remount the filesystem read-only on an error.
>> errors=continue Keep going on a filesystem error.
>> errors=panic Panic and halt the machine if an error occurs.
>> + (These mount options override the errors behavior
>> + specified in the superblock, which can be
>> + configured using tune2fs.)
>>
>> data_err=ignore(*) Just print an error message if an error occurs
>> in a file data buffer in ordered mode.
>>
>
> So, documentation guys, are you going to take this patch through the
> Documentation tree (tytso already Signed off on that), or should I take it
(probably should be Acked-by or Reviewed-by
if he isn't merging it)
> through the trivial tree?
I'm so far behind on doc patches that I haven't read any of this thread
yet, so you can merge it IMO.
Thanks,
--
~Randy
On Fri, 9 Jan 2009, Randy Dunlap wrote:
> I'm so far behind on doc patches that I haven't read any of this thread
> yet, so you can merge it IMO.
OK, I have applied it, thanks Pavel.
--
Jiri Kosina
SUSE Labs
Alan Cox <[email protected]> writes:
> > That would be true if the disk hardware wasn't doing a gazillion retries to
> > read a bad sector internally (taking 5 seconds to come back and report
> > failure), and then the darn scsi layer added another gazillion retries on top
> > of that, and the two multiply together to make it so slow that when you
> > leave the thing copying the disk overnight it's STILL not done 24 hours later.
> > Going in and cherry picking individual files looks kind of appealing in that
> > situation.
>
> You could of course just learn to use the functions the kernel provides.
> If you want to recover disk blocks without retrying you can do that via
> SG_IO. If you want to adjust the timeout and retry levels you can do that
> too via sysfs.
Sure, but maybe the default values should be altered. I think the
current tradeoff sets the cursor way too far towards retries.
I remember I/O errors on CDs resulting in zillions of retries, and
errors on USB disks resulting in the USB port being reset again & again
for hours... (the CD case is years ago, but I still see the USB layer
retrying resets for quite a long time at least once per month)
> I remember I/O errors on CDs resulting in zillions of retries, and
> errors on USB disks resulting in the USB port being reset again & again
> for hours... (the CD case is years ago, but I still see the USB layer
> retrying resets for quite a long time at least once per month)
Most of the CD ones are caused by tools like hal continuing to probe the
device regularly and causing new avalanches of errors.