2006-01-22 06:43:44

by John Richard Moser

Subject: soft update vs journaling?

So I've been researching, because I thought this "Soft Update" thing
that BSD uses was some weird freak-ass way to totally corrupt a file
system if the power drops. Seems I was wrong; it's actually just the
opposite, an alternate solution to journaling. So let's compare notes.

I'm not quite clear on the benefits versus costs of Soft Update
versus journaling, so I'll run down what I've got; anyone who wants
to give input can run down what they've got, and we can compare. Maybe
someone will write a Soft Update system into Linux one day, far, far
in the future; but I doubt it. It might, however, be interesting to
compare ext2 + SU to ext3; and getting the chance to solve problems such
as delayed deletes (i.e. the file system fills up while a soft-update
delete is still pending; the obvious reaction is to look for a pending
delete and execute it immediately) might also be cool.


Soft Update appears to buffer and order meta-data writes in a dependency
scheme that guarantees inconsistencies can't happen. Apparently this
means writing out directory entries before inodes, or something to that
effect. I can't see how this would help in the middle of a buffer flush
(half a dentry written? Partially deleted inode? Inode "deleted" but not
freed on disk?), so maybe someone can fill me in.

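If I had to guess at the mechanics, it's a dependency graph over dirty
buffers: a buffer may not be written until everything it depends on is
on disk. A toy sketch in C, with every name made up by me:

    #include <stdbool.h>
    #include <stdio.h>

    struct buf {
            const char *name;
            bool dirty;
            bool on_disk;
            struct buf *depends_on;   /* must reach the disk before us */
    };

    /* Flush 'b', first flushing anything it depends on. */
    static void flush(struct buf *b)
    {
            if (b->depends_on && !b->depends_on->on_disk)
                    flush(b->depends_on);
            if (b->dirty) {
                    printf("writing %s\n", b->name);  /* the disk write */
                    b->dirty = false;
                    b->on_disk = true;
            }
    }

    int main(void)
    {
            /* A file create: the inode must be initialized on disk
             * before the directory entry pointing at it is visible. */
            struct buf inode  = { "inode block", true, false, NULL };
            struct buf dentry = { "directory block", true, false, &inode };

            flush(&dentry);  /* writes inode block, then directory block */
            return 0;
    }
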
Journaling apparently means writing out meta-data to a log before
transferring it to the file system. No matter what happens, a proper
journal (for fun I've designed a transaction log format for low-level
filesystems; it's entirely possible to make an interruption at any bit
recoverable) can always be checked over and either rolled back or rolled
forward. This is easy to design.

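For what it's worth, the log format I toyed with boils down to a record
like this (my own invented layout, not any real filesystem's on-disk
format):

    #include <stddef.h>
    #include <stdint.h>

    struct journal_record {
            uint32_t magic;          /* identifies a valid record */
            uint32_t sequence;       /* monotonically increasing txn id */
            uint64_t target_block;   /* where the payload finally belongs */
            uint32_t length;         /* bytes of payload that follow */
            uint32_t checksum;       /* over header + payload */
    };

    /* Toy checksum; a real journal would use CRC32 or better. A record
     * torn at any bit fails the check and is simply never replayed. */
    static uint32_t jrn_checksum(const uint8_t *p, size_t n)
    {
            uint32_t sum = 0;
            while (n--)
                    sum = sum * 31 + *p++;
            return sum;
    }

    /* Recovery rule: scan records in sequence order, replay every record
     * whose checksum verifies, stop at the first one that doesn't. */
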
Soft Update appears to have the advantage of not needing multiple
writes. There's no need for journal flushing and then disk flushing;
you just flush the meta-data. Also, soft update systems mount
instantly, because there's no journal to play back, and the file system
is always consistent. It may be technically feasible to implement soft
update on any old file system; I'm unclear as to how exactly to make any
soft-update scheme work, so I can't say if this is absolutely possible
(think of vfat, consistent at all times and still Win32 compatible;
great for flash drives).

Unfortunately, soft update can leave silly situations where areas of
the disk are still allocated after a system failure during an inode
delete. This won't cause inconsistencies in the on-disk structure,
however; you can freely use the disk without causing even more damage.
The system just has to sanity-check things while running and clean up
such damage as it sees it.

Journaling appears to have the advantage that the data gets to disk
faster. It also seems an easier concept to grasp (i.e. I understand it
fully). It's old, tried, trusted, and durable. You also don't have to
worry about having odd meta-data writes that leave deleted files around
in certain circumstances, eating up space.

Unfortunately, journaling uses a chunk of space. Imagine a journal on a
USB flash stick of 128M; a typical ReiserFS journal is 32 megabytes!
Sure it could be done in 8 or 4 or so; or (in one of my file system
designs) a static 16KiB block could reference dynamically allocated
journal space, allowing the system to sacrifice performance and shrink
the journal when more space is needed. Either way, slow media like
floppies will suffer, HARD; and flash devices will see a lot of
write/erase all over the journal area, causing wear on that spot.

So, that's my understanding. Any comments? Enlighten me.
--
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

Creative brains are a valuable, limited resource. They shouldn't be
wasted on re-inventing the wheel when there are so many fascinating
new problems waiting out there.
-- Eric Steven Raymond


2006-01-22 08:51:15

by Jan Engelhardt

Subject: Re: soft update vs journaling?

>Unfortunately, journaling uses a chunk of space. Imagine a journal on a
>USB flash stick of 128M; a typical ReiserFS journal is 32 megabytes!
>Sure it could be done in 8 or 4 or so; or (in one of my file system
>designs) a static 16KiB block could reference dynamicly allocated
>journal space, allowing the system to sacrifice performance and shrink
>the journal when more space is needed. Either way, slow media like
>floppies will suffer, HARD; and flash devices will see a lot of
>write/erase all over the journal area, causing wear on that spot.

- Smallest reiserfs3 journal size is 513 blocks - some 2 megabytes,
which would be ok with me for a 128meg drive.
Most of the time you need vfat anyway for your flashstick to make
good use of it on Windows.

- reiser4's journal is even smaller than reiser3's with a fresh
filesystem - same goes for jfs and xfs (below 1 megabyte IIRC)

- I would not use a journalling filesystem at all on media that degrades
  faster than harddisks (flash drives, CD-RWs/DVD-RWs/RAMs).
There are specially-crafted filesystems for that, mostly jffs and udf.

- You really need a hell of a power fluctuation to get a disk crippled.
Just powering off (and potentially on after a few milliseconds) did
  (in my cases) just stop a disk write wherever it happened to be,
and that seemed easily correctable.


Jan Engelhardt
--

2006-01-22 09:31:55

by Theodore Ts'o

Subject: Re: soft update vs journaling?

On Sun, Jan 22, 2006 at 01:42:38AM -0500, John Richard Moser wrote:
> Soft Update appears to have the advantage of not needing multiple
> writes. There's no need for journal flushing and then disk flushing;
> you just flush the meta-data.

Not quite true; there are cases where Soft Update will have to do
multiple writes, when a particular block containing meta-data has
multiple changes in it that have to be committed to the filesystem at
different times in order to maintain consistency; this is particularly
true when a block is part of the inode table, for example. When this
happens, the soft update machinery has to allocate memory for a block
and then undo changes to that block which come from transactions that
are not yet ready to be written to disk.

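(To make that concrete, here is a toy sketch of the undo step; the names
are invented for illustration, and this is not the real BSD softdep code:)

    #include <stdlib.h>
    #include <string.h>

    struct undo_dep {
            size_t offset, len;   /* bytes belonging to a transaction
                                   * not yet safe to let reach the disk */
            const void *old;      /* their previous contents */
            struct undo_dep *next;
    };

    /* Build a private copy of 'block' with the unsafe changes rolled
     * back; the caller writes the copy and frees it, while the real
     * in-memory block keeps the new data. */
    static void *block_for_write(const void *block, size_t size,
                                 const struct undo_dep *deps)
    {
            char *copy = malloc(size);
            if (!copy)
                    return NULL;
            memcpy(copy, block, size);
            for (; deps; deps = deps->next)
                    memcpy(copy + deps->offset, deps->old, deps->len);
            return copy;
    }
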
In general, though, it is true that Soft Updates can result in fewer
disk writes compared to filesystems that utilize traditional
journaling approaches, and this might even be noticeable if your
workload is heavily skewed towards metadata updates. (This is mainly
true in benchmarks that are horrendously disconnected from the real
world, such as dbench.)

One major downside with Soft Updates that you haven't mentioned in
your note, is that the amount of complexity it adds to the filesystem
is tremendous; the filesystem has to keep track of a very complex
state machinery, with knowledge of the ordering constraints of
each change to the filesystem and how to "back out" parts of the
change when that becomes necessary.

Whenever you want to extend a filesystem to add some new feature, such
as online resizing, for example, it's not enough to just add that
feature; you also have to modify the black magic which is the Soft
Updates machinery. This significantly increases the difficulty of adding
new features to a filesystem, and can act as a roadblock to people
wanting to add new features. I can't say for sure that this is why
BSD UFS doesn't have online resizing yet; and while I can't
conclusively blame the lack of this feature on Soft Updates, it is
clear that adding this and other features is much more difficult when
you are dealing with soft update code.

> Also, soft update systems mount instantly, because there's no
> journal to play back, and the file system is always consistent.

This is only true if you don't care about recovering lost data blocks.
Fixing this requires that you run the equivalent of fsck on the
filesystem. If you do, then there is a major difference in performance.
Even if you can do the fsck scan on-line, it will greatly slow down
normal operations while recovering from a system crash, and the
slowdown associated with doing a journal replay is far smaller in
comparison.

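(Conceptually the reclaim scan is just a mark-and-compare pass over every
inode's block list -- a hypothetical sketch:)

    #include <stdint.h>

    /* One bit per block: set means "some inode references this block". */
    static void mark_block(uint8_t *in_use, uint64_t blk)
    {
            in_use[blk / 8] |= (uint8_t)(1u << (blk % 8));
    }

    /* After every inode's block list has been walked into 'in_use', a
     * block the allocator considers allocated but no inode marked is
     * leaked and can be freed. */
    static int block_is_leaked(const uint8_t *alloc_bitmap,
                               const uint8_t *in_use, uint64_t blk)
    {
            uint8_t mask = (uint8_t)(1u << (blk % 8));
            return (alloc_bitmap[blk / 8] & mask) &&
                   !(in_use[blk / 8] & mask);
    }

The expensive part is everything feeding that scan, not the comparison
itself.
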
> Unfortunately, journaling uses a chunk of space. Imagine a journal on a
> USB flash stick of 128M; a typical ReiserFS journal is 32 megabytes!
> Sure it could be done in 8 or 4 or so; or (in one of my file system
> designs) a static 16KiB block could reference dynamicly allocated
> journal space, allowing the system to sacrifice performance and shrink
> the journal when more space is needed. Either way, slow media like
> floppies will suffer, HARD; and flash devices will see a lot of
> write/erase all over the journal area, causing wear on that spot.

If you are using flash, use a filesystem which is optimized for flash,
such as JFFS2. Otherwise, note that in most cases disk space is
nearly free, so allocating even 128 megs for the journal is chump
change when you're talking about a 200GB or larger hard drive.

Also note that if you have to use slow media, one of the things which
you can do is use a separate (fast) device for your journal; there is
no rule which says the journal has to be on the slow device.

- Ted

2006-01-22 18:41:26

by John Richard Moser

Subject: Re: soft update vs journaling?

Jan Engelhardt wrote:
>>Unfortunately, journaling uses a chunk of space. Imagine a journal on a
>>USB flash stick of 128M; a typical ReiserFS journal is 32 megabytes!
>>Sure it could be done in 8 or 4 or so; or (in one of my file system
>>designs) a static 16KiB block could reference dynamically allocated
>>journal space, allowing the system to sacrifice performance and shrink
>>the journal when more space is needed. Either way, slow media like
>>floppies will suffer, HARD; and flash devices will see a lot of
>>write/erase all over the journal area, causing wear on that spot.
>
>
> - Smallest reiserfs3 journal size is 513 blocks - some 2 megabytes,
> which would be ok with me for a 128meg drive.
> Most of the time you need vfat anyway for your flashstick to make
> good use of it on Windows.
>
> - reiser4's journal is even smaller than reiser3's with a fresh
> filesystem - same goes for jfs and xfs (below 1 megabyte IIRC)
>

Nice, but that doesn't solve . . .

> - I would not use a journalling filesystem at all on media that degrades
> faster than harddisks (flash drives, CD-RWs/DVD-RWs/RAMs).
> There are specially-crafted filesystems for that, mostly jffs and udf.
>

Yes. They'll degrade very, very fast. This is where Soft Update would
have an advantage. Another issue here is that we can't just slap a journal
onto vfat, for all those flash devices that we want to share with Windows.

> - You really need a hell of a power fluctuation to get a disk crippled.
> Just powering off (and potentially on after a few milliseconds) did
> (in my cases) just stop a disk write wherever it happened to be,
> and that seemed easily correctable.

Yeah, I never said you could cripple a disk with power problems. You
COULD destroy a NAND in a flash device by nuking the thing with
10000000000000 writes to the same area.

>
>
> Jan Engelhardt


2006-01-22 18:55:33

by John Richard Moser

Subject: Re: soft update vs journaling?

Theodore Ts'o wrote:
> On Sun, Jan 22, 2006 at 01:42:38AM -0500, John Richard Moser wrote:
>
>>Soft Update appears to have the advantage of not needing multiple
>>writes. There's no need for journal flushing and then disk flushing;
>>you just flush the meta-data.
>
>
> Not quite true; there are cases where Soft Update will have to do
> multiple writes, when a particular block containing meta-data has
> multiple changes in it that have to be committed to the filesystem at
> different times in order to maintain consistency; this is particularly

Yes, that makes sense.

> true when a block is part of the inode table, for example. When this
> happens, the soft update machinery has to allocate memory for a block
> and then undo changes to that block which come from transactions that
> are not yet ready to be written to disk.
>
> In general, though, it is true that Soft Updates can result in fewer
> disk writes compared to filesystems that utilize traditional
> journaling approaches, and this might even be noticeable if your
> workload is heavily skewed towards metadata updates. (This is mainly
> true in benchmarks that are horrendously disconnected from the real
> world, such as dbench.)

Yeah, microbenchmarks are like "THIS WILL NEVER HAPPEN MORE THAN ONCE
EVERY BILLION ZILLION YEARS BUT LOOK WE'RE FASTER BY LIKE 1
MICROSECOND" stuff.

>
> One major downside with Soft Updates that you haven't mentioned in
> your note, is that the amount of complexity it adds to the filesystem
> is tremendous; the filesystem has to keep track of a very complex
> state machinery, with knowledge of the ordering constraints of
> each change to the filesystem and how to "back out" parts of the
> change when that becomes necessary.

Yes, I had figured soft update would be a lot more complex than
journaling. Though, could this be largely implemented in a
filesystem-independent way? I could see a "Soft Update API" that lets
file systems sketch out the dependencies each meta-data operation has
and describe their order; it would, of course, be a total pain in the
ass to do.

>
> Whenever you want to extend a filesystem to add some new feature, such
> as online resizing, for example, it's not enough to just add that

Is online resizing ever safe? I mean, with on-disk filesystem layout
support I could somewhat believe it for growing; for shrinking you'd
need a way to move files around without damaging them (possible). I
guess it would be.

So how does this work? Move files -> alter file system superblocks?

> feature; you also have to modify the black magic which is the Soft
> Updates machinery. This significantly increases the difficulty of adding
> new features to a filesystem, and can act as a roadblock to people
> wanting to add new features. I can't say for sure that this is why
> BSD UFS doesn't have online resizing yet; and while I can't
> conclusively blame the lack of this feature on Soft Updates, it is
> clear that adding this and other features is much more difficult when
> you are dealing with soft update code.
>

Nod.

>
>>Also, soft update systems mount instantly, because there's no
>>journal to play back, and the file system is always consistent.
>
>
> This is only true if you don't care about recovering lost data blocks.
> Fixing this requires that you run the equivalent of fsck on the
> filesystem. If you do, then there is a major difference in performance.
> Even if you can do the fsck scan on-line, it will greatly slow down
> normal operations while recovering from a system crash, and the
> slowdown associated with doing a journal replay is far smaller in
> comparison.

A passive-active approach could passively generate a list of inodes from
dentries as they're accessed; and actively walk the directory tree when
the disk is idle. Then a quick allocation check between inodes and
whatever allocation lists or trees there are could be done.

This has the disadvantage that if the system is under heavy load, the
recovery won't be done. There's also a period where the disk may be
rather full, causing fragmentation or out of space errors along the way.
The only way to counter this would be to force a mandatory minimum
amount of recovery activity per time interval, which again causes your
problem.

>
>
>>Unfortunately, journaling uses a chunk of space. Imagine a journal on a
>>USB flash stick of 128M; a typical ReiserFS journal is 32 megabytes!
>>Sure it could be done in 8 or 4 or so; or (in one of my file system
>>designs) a static 16KiB block could reference dynamically allocated
>>journal space, allowing the system to sacrifice performance and shrink
>>the journal when more space is needed. Either way, slow media like
>>floppies will suffer, HARD; and flash devices will see a lot of
>>write/erase all over the journal area, causing wear on that spot.
>
>
> If you are using flash, use a filesystem which is optimized for flash,
> such as JFFS2. Otherwise, note that in most cases disk space is

What about a NAND flash chip on a USB drive like a SanDisk Cruzer Mini?
Or hell, a CompactFlash card for use in a digital camera.

> nearly free, so allocating even 128 megs for the journal is chump
> change when you're talking about a 200GB or larger hard drive.
>
> Also note that if you have to use slow media, one of the things which
> you can do is use a separate (fast) device for your journal; there is
> no rule which says the journal has to be on the slow device.

Unless it's portable and you don't want to reconfigure every system.

>
> - Ted
>


2006-01-22 19:05:34

by Adrian Bunk

Subject: Re: soft update vs journaling?

On Sun, Jan 22, 2006 at 09:51:10AM +0100, Jan Engelhardt wrote:
>...
> - I would not use a journalling filesystem at all on media that degrades
> faster than harddisks (flash drives, CD-RWs/DVD-RWs/RAMs).
> There are specially-crafted filesystems for that, mostly jffs and udf.
>...

[ ] you know what the "j" in "jffs" stands for

> Jan Engelhardt

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2006-01-22 19:08:27

by Arjan van de Ven

Subject: Re: soft update vs journaling?

On Sun, 2006-01-22 at 20:05 +0100, Adrian Bunk wrote:
> On Sun, Jan 22, 2006 at 09:51:10AM +0100, Jan Engelhardt wrote:
> >...
> > - I would not use a journalling filesystem at all on media that degrades
> > faster than harddisks (flash drives, CD-RWs/DVD-RWs/RAMs).
> > There are specially-crafted filesystems for that, mostly jffs and udf.
> >...
>
> [ ] you know what the "j" in "jffs" stands for

it stands for "logging" since jffs2 at least is NOT a journalling
filesystem.... but a logging one. I assume jffs is too.


2006-01-22 19:25:07

by Adrian Bunk

Subject: Re: soft update vs journaling?

On Sun, Jan 22, 2006 at 08:08:17PM +0100, Arjan van de Ven wrote:
> On Sun, 2006-01-22 at 20:05 +0100, Adrian Bunk wrote:
> > On Sun, Jan 22, 2006 at 09:51:10AM +0100, Jan Engelhardt wrote:
> > >...
> > > - I would not use a journalling filesystem at all on media that degrades
> > > faster than harddisks (flash drives, CD-RWs/DVD-RWs/RAMs).
> > > There are specially-crafted filesystems for that, mostly jffs and udf.
> > >...
> >
> > [ ] you know what the "j" in "jffs" stands for
>
> it stands for "logging" since jffs2 at least is NOT a journalling
> filesystem.... but a logging one. I assume jffs is too.

Ah, sorry.

It seems I confused this with Reiser4 and its wandering logs.

cu
Adrian


2006-01-22 19:26:49

by James Courtier-Dutton

Subject: Re: soft update vs journaling?

John Richard Moser wrote:
>
> Unfortunately, journaling uses a chunk of space. Imagine a journal on a
> USB flash stick of 128M; a typical ReiserFS journal is 32 megabytes!
> Sure it could be done in 8 or 4 or so; or (in one of my file system
> designs) a static 16KiB block could reference dynamicly allocated
> journal space, allowing the system to sacrifice performance and shrink
> the journal when more space is needed. Either way, slow media like
> floppies will suffer, HARD; and flash devices will see a lot of
> write/erase all over the journal area, causing wear on that spot.
>

My understanding is that if one designed a power supply with enough
headroom, one could remove the power and still have time to write dirty
sectors to the USB flash stick. Would this not remove the need for a
journaling fs on a flash stick? This would remove the "wear on that
spot" problem. Actually USB flash sticks are a bit clever, in that they
add an extra layer of translation to the write. I.e. if you write to the
same sector again and again, the USB flash stick will actually write it
to a different area of the memory each time. This is specifically done
to solve the "wear on that spot" problem.

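A crude model of that translation layer (purely illustrative; real
wear-leveling firmware is proprietary and much smarter):

    #include <stdint.h>

    #define SECTORS 1024u
    #define PAGES   4096u

    static uint32_t map[SECTORS];        /* logical sector -> phys page */
    static uint32_t erase_count[PAGES];  /* wear per physical page */
    static uint32_t next_page;

    static uint32_t ftl_write(uint32_t logical)
    {
            uint32_t phys = next_page++ % PAGES;  /* naive: just rotate */

            erase_count[phys]++;                  /* model the erase cost */
            map[logical] = phys;
            return phys;  /* even a journal hammering one logical sector
                           * ends up spread over many physical pages */
    }
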
This "flush on power fail" approach is not so easy with a HD because it
uses more power and takes longer to flush.

James

2006-01-22 19:51:06

by Diego Calleja

Subject: Re: soft update vs journaling?

On Sun, 22 Jan 2006 04:31:44 -0500,
Theodore Ts'o <[email protected]> wrote:


> One major downside with Soft Updates that you haven't mentioned in
> your note, is that the amount of complexity it adds to the filesystem
> is tremendous; the filesystem has to keep track of a very complex
> state machinery, with knowledge of the ordering constraints of
> each change to the filesystem and how to "back out" parts of the
> change when that becomes necessary.


And FreeBSD is implementing journaling for UFS and getting rid of
softupdates [1]. While this doesn't prove that softupdates are "a bad
idea", I think it shows that the added softupdates complexity doesn't seem
to pay off in the real world.

[1]: http://lists.freebsd.org/pipermail/freebsd-hackers/2004-December/009261.html

"4. Journaled filesystem. While we can debate the merits of speed and
data integrety of journalling vs. softupdates, the simple fact remains
that softupdates still requires a fsck run on recovery, and the
multi-terabyte filesystems that are possible these days make fsck a very
long and unpleasant experience, even with bg-fsck. There was work at
some point at RPI to add journaling to UFS, but there hasn't been much
status on that in a long time. There have also been proposals and
works-in-progress to port JFS, ReiserFS, and XFS. Some of these efforts
are still alive, but they need to be seen through to completion"

2006-01-22 20:39:55

by Suleiman Souhlal

Subject: Re: soft update vs journaling?

Diego Calleja wrote:
> And FreeBSD is implementing journaling for UFS and getting rid of
> softupdates [1]. While this doesn't prove that softupdates are "a bad
> idea", I think it shows that the added softupdates complexity doesn't seem
> to pay off in the real world.

You read the message wrong: We're not getting rid of softupdates.
-- Suleiman

2006-01-22 20:51:43

by Diego Calleja

Subject: Re: soft update vs journaling?

On Sun, 22 Jan 2006 12:39:38 -0800,
Suleiman Souhlal <[email protected]> wrote:

> Diego Calleja wrote:
> > And FreeBSD is implementing journaling for UFS and getting rid of
> > softupdates [1]. While this doesn't prove that softupdates are "a bad
> > idea", I think it shows that the added softupdates complexity doesn't seem
> > to pay off in the real world.
>
> You read the message wrong: We're not getting rid of softupdates.
> -- Suleiman


Oh, both systems will be available at the same time? That will
certainly be a good place to compare both approaches.

2006-01-22 21:03:04

by Theodore Ts'o

Subject: Re: soft update vs journaling?

On Sun, Jan 22, 2006 at 01:54:23PM -0500, John Richard Moser wrote:
> > Whenever you want to extend a filesystem to add some new feature, such
> > as online resizing, for example, it's not enough to just add that
>
> Is online resizing ever safe? I mean, with on-disk filesystem layout
> support I could somewhat believe it for growing; for shrinking you'd
> need a way to move files around without damaging them (possible). I
> guess it would be.
>
> So how does this work? Move files -> alter file system superblocks?

The online resizing support in ext3 only grows the filesystem; it
doesn't shrink it. What is currently supported in 2.6 requires you to
reserve space in advance. There is also a slight modification to the
ext2/3 filesystem format, only supported by Linux 2.6, which allows you
to grow the filesystem without needing to move filesystem data
structures around; the kernel patches for actually doing this new style
of online resizing aren't in mainline yet, although they have been
posted to ext2-devel for evaluation.

> A passive-active approach could passively generate a list of inodes from
> dentries as they're accessed; and actively walk the directory tree when
> the disk is idle. Then a quick allocation check between inodes and
> whatever allocation lists or trees there are could be done.

That doesn't really help, because in order to release the unused disk
blocks, you have to walk every single inode and keep track of the
block allocation bitmaps for the entire filesystem. If you have a
really big filesystem, it may require hundreds of megabytes of
non-swappable kernel memory. And if you try to do this in userspace,
it becomes an unholy mess trying to keep the userspace and in-kernel
mounted filesystem data structures in sync.

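(For scale: a 1 TiB filesystem with 4 KiB blocks has 2^28 blocks, so even
the bare one-bit-per-block allocation bitmap is 2^25 bytes = 32 MiB; any
per-block bookkeeping richer than one bit multiplies that figure
accordingly.)
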
- Ted

2006-01-22 22:44:17

by Kyle Moffett

Subject: Re: soft update vs journaling?

On Jan 22, 2006, at 16:02, Theodore Ts'o wrote:
>> Is online resizing ever safe? I mean, with on-disk filesystem
>> layout support I could somewhat believe it for growing; for
>> shrinking you'd need a way to move files around without damaging
>> them (possible). I guess it would be.
>>
>> So how does this work? Move files -> alter file system superblocks?
>
> The online resizing support in ext3 only grows the filesystem; it
> doesn't shrink it. What is currently supported in 2.6 requires you
> to reserve space in advance. There is also a slight modification
> to the ext2/3 filesystem format, only supported by Linux 2.6, which
> allows you to grow the filesystem without needing to move
> filesystem data structures around; the kernel patches for actually
> doing this new style of online resizing aren't in mainline yet,
> although they have been posted to ext2-devel for evaluation.

From my understanding of HFS+/HFSX, this is actually one of the
nicer bits of that filesystem architecture. It stores the data
structures on disk using extents in such a way that you probably
could hot-resize the disk without significant RAM overhead (both grow
and shrink) as long as there's enough free space. Essentially, every
block on the disk is represented by an allocation block, and all data
structures refer to allocation block offsets. The allocation file
bitmap itself is composed of allocation blocks and mapped by a set
of extent descriptors. The result is that it is possible to fragment
the allocation file, catalog file, and any other on-disk structures
(with the sole exception of the 1K boot block and the 512-byte volume
headers at the very start and end of the volume).

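(For reference, the on-disk encoding is tiny. This is my paraphrase of
Apple's TN1150 into C99 types, with the field names as in the spec:)

    #include <stdint.h>

    struct HFSPlusExtentDescriptor {
            uint32_t startBlock;   /* allocation-block offset in volume */
            uint32_t blockCount;   /* extent length in alloc blocks */
    };

    /* A fork carries its first eight extents inline; anything beyond
     * that spills into the extents-overflow B-tree. */
    struct HFSPlusForkData {
            uint64_t logicalSize;
            uint32_t clumpSize;
            uint32_t totalBlocks;
            struct HFSPlusExtentDescriptor extents[8];
    };

Since even the allocation file is just another fork described by
extents, almost everything on the volume is free to move.
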
At the moment I'm educating myself on the operation of
MFS/HFS/HFS+/HFSX and the Linux kernel VFS by writing a completely new
combined hfsx driver, which I eventually plan to add online-resizing
support and a variety of other features to.

One question though: Does anyone have any good recent references to
"How to write a blockdev-based Linux Filesystem" docs? I've searched
the various crufty corners of the web, Documentation/, etc, and found
enough to get started, but (for example), I had a hard time
determining from the various sources what a struct file_system_type
was supposed to have in it, and what the available default
address_space/superblock ops are.

Cheers,
Kyle Moffett

--
They _will_ find opposing experts to say it isn't, if you push hard
enough the wrong way. Idiots with a PhD aren't hard to buy.
-- Rob Landley



2006-01-23 00:07:28

by John Richard Moser

Subject: Re: soft update vs journaling?

James Courtier-Dutton wrote:
> John Richard Moser wrote:
>
>>
>> Unfortunately, journaling uses a chunk of space. Imagine a journal on a
>> USB flash stick of 128M; a typical ReiserFS journal is 32 megabytes!
>> Sure it could be done in 8 or 4 or so; or (in one of my file system
>> designs) a static 16KiB block could reference dynamically allocated
>> journal space, allowing the system to sacrifice performance and shrink
>> the journal when more space is needed. Either way, slow media like
>> floppies will suffer, HARD; and flash devices will see a lot of
>> write/erase all over the journal area, causing wear on that spot.
>>
>
> My understanding is that if one designed a power supply with enough
> headroom, one could remove the power and still have time to write dirty
> sectors to the USB flash stick. Would this not remove the need for a
> journaling fs on a flash stick?

Depends on how much meta-data you have to write out. What if you just
altered 6000 files? Now you have a ton of dentries to destroy and
inodes to invalidate, some FAT entries to free up, etc. What if the
user just pulled the drive out of the USB port? Or the USB port is
faulty and lost connection (I've seen it!).

> This would remove the "wear on that
> spot" problem.

Wha? You mean remove the trigger, not the underlying problem.

> Actually USB flash sticks are a bit clever, in that they
> add an extra layer of translation to the write. I.e. if you write to the
> same sector again and again, the USB flash stick will actually write it
> to a different area of the memory each time. This is specifically done
> to solve the "wear on that spot" problem.

Yeah, built-in write balancing is nice.

>
> This "flush on power fail" approach is not so easy with a HD because it
> uses more power and takes longer to flush.

The "flush on power fail" is retarded because it takes extra hardware
and doesn't work if the USB port itself loses connection or if the user
is just dumb enough to pull/knock the drive out. It won't work with
mini hard disks either, as you say.

"Flush on power fail" is pretty much getting a 10 minute UPS and issuing
'shutdown -h now' when the UPS signals init, which there's already
contingencies for (can also suspend to disk). It won't help if the PSU
burns out, if the system crashes, if the power cord is pulled, or if the
dog walks around your chair and you turn and bump your foot into the
"power" button on the UPS itself.

>
> James
>
>


2006-01-23 01:01:29

by John Richard Moser

Subject: Re: soft update vs journaling?

Diego Calleja wrote:
> On Sun, 22 Jan 2006 04:31:44 -0500,
> Theodore Ts'o <[email protected]> wrote:
>
>
>
>>One major downside with Soft Updates that you haven't mentioned in
>>your note, is that the amount of complexity it adds to the filesystem
>>is tremendous; the filesystem has to keep track of a very complex
>>state machinery, with knowledge of the ordering constraints of
>>each change to the filesystem and how to "back out" parts of the
>>change when that becomes necessary.
>
>
>
> And FreeBSD is implementing journaling for UFS and getting rid of
> softupdates [1]. While this doesn't prove that softupdates are "a bad
> idea", I think it shows that the added softupdates complexity doesn't seem
> to pay off in the real world.
>

Yeah, the huge TB fsck thing became a problem. I wonder still if it'd
be useful for small vfat file systems (floppies, usb drives); nobody has
led me to believe it's definitely feasible to not corrupt meta-data in
this way.

I guess journaling is looking a lot better. :)

> [1]: http://lists.freebsd.org/pipermail/freebsd-hackers/2004-December/009261.html
>
> "4. Journaled filesystem. While we can debate the merits of speed and
> data integrety of journalling vs. softupdates, the simple fact remains
> that softupdates still requires a fsck run on recovery, and the
> multi-terabyte filesystems that are possible these days make fsck a very
> long and unpleasant experience, even with bg-fsck. There was work at
> some point at RPI to add journaling to UFS, but there hasn't been much
> status on that in a long time. There have also been proposals and
> works-in-progress to port JFS, ReiserFS, and XFS. Some of these efforts
> are still alive, but they need to be seen through to completion"
>


2006-01-23 01:03:39

by John Richard Moser

Subject: Re: soft update vs journaling?

Theodore Ts'o wrote:
> On Sun, Jan 22, 2006 at 01:54:23PM -0500, John Richard Moser wrote:
>
>>>Whenever you want to extend a filesystem to add some new feature, such
>>>as online resizing, for example, it's not enough to just add that
>>
>>Is online resizing ever safe? I mean, with on-disk filesystem layout
>>support I could somewhat believe it for growing; for shrinking you'd
>>need a way to move files around without damaging them (possible). I
>>guess it would be.
>>
>>So how does this work? Move files -> alter file system superblocks?
>
>
> The online resizing support in ext3 only grows the filesystem; it
> doesn't shrink it. What is currently supported in 2.6 requires you to
> reserve space in advance. There is also a slight modification to the
> ext2/3 filesystem format, only supported by Linux 2.6, which allows you
> to grow the filesystem without needing to move filesystem data
> structures around; the kernel patches for actually doing this new style
> of online resizing aren't in mainline yet, although they have been
> posted to ext2-devel for evaluation.
>
>
>>A passive-active approach could passively generate a list of inodes from
>>dentries as they're accessed; and actively walk the directory tree when
>>the disk is idle. Then a quick allocation check between inodes and
>>whatever allocation lists or trees there are could be done.
>
>
> That doesn't really help, because in order to release the unused disk
> blocks, you have to walk every single inode and keep track of the
> block allocation bitmaps for the entire filesystem. If you have a
> really big filesystem, it may require hundreds of megabytes of
> non-swappable kernel memory. And if you try to do this in userspace,
> it becomes an unholy mess trying to keep the userspace and in-kernel
> mounted filesystem data structures in sync.
>

Yeah, I figured that you couldn't take action until everything was seen;
I can see how you could have problems with all that kernel memory ;)
FUSE driver, anyone? :>

(I've actually looked into FUSE for the rootfs, via loading a FUSE
driver from an init.d and then replacing bash with init on the rootfs;
haven't found an ext2 or xfs FUSE driver to test with)
> - Ted
>


2006-01-23 01:09:50

by Suleiman Souhlal

Subject: Re: soft update vs journaling?

John Richard Moser wrote:
> Yeah, the huge TB fsck thing became a problem. I wonder still if it'd
> be useful for small vfat file systems (floppies, usb drives); nobody has
> led me to believe it's definitely feasible to not corrupt meta-data in
> this way.

Please note that you don't *HAVE* to run fsck at every reboot. All
background fsck does is reclaim unused blocks.

-- Suleiman

2006-01-23 02:11:04

by John Richard Moser

Subject: Re: soft update vs journaling?

Suleiman Souhlal wrote:
> John Richard Moser wrote:
>
>> Yeah, the huge TB fsck thing became a problem. I wonder still if it'd
>> be useful for small vfat file systems (floppies, usb drives); nobody has
>> led me to believe it's definitely feasible to not corrupt meta-data in
>> this way.
>
>
> Please note that you don't *HAVE* to run fsck at every reboot. All
> background fsck does is reclaim unused blocks.
>

Duly noted, now can you answer my question?

> -- Suleiman
>


2006-01-23 05:32:31

by Michael Loftis

Subject: Re: soft update vs journaling?



--On January 22, 2006 1:42:38 AM -0500 John Richard Moser
<[email protected]> wrote:

> So I've been researching, because I thought this "Soft Update" thing
> that BSD uses was some weird freak-ass way to totally corrupt a file
> system if the power drops. Seems I was wrong; it's actually just the
> opposite, an alternate solution to journaling. So let's compare notes.

I hate to say it...but in my experience, this has been exactly the case
with soft updates and FreeBSD 4 up to 4.11 pre-releases.

Whenever something untoward would happen, the filesystem almost always lost
files and/or data, usually just files though. In practice it's never
really worked too well for me. It also still requires a full fsck on boot,
which means long boot times for recovery on large filesystems.

2006-01-23 07:24:58

by Theodore Ts'o

Subject: Re: soft update vs journaling?

On Sun, Jan 22, 2006 at 05:44:08PM -0500, Kyle Moffett wrote:
> From my understanding of HFS+/HFSX, this is actually one of the
> nicer bits of that filesystem architecture. It stores the data
> structures on disk using extents in such a way that you probably
> could hot-resize the disk without significant RAM overhead (both grow
> and shrink) as long as there's enough free space.

Hot-shrinking a filesystem is certainly possible for any filesystem,
but the problem is how many filesystem data structures you have to
walk in order to find the owners of all of the blocks that you have
to relocate. That generally isn't a RAM overhead problem, but the
fact that in general, most filesystems don't have an efficient way to
answer the question, "who owns this arbitrary disk block?" Having
extents means you have a slightly more efficient encoding system, but
it still is the case that you have to check potentially every file in
the filesystem to see if it is the owner of one of the disk blocks
that needs to be moved when you are shrinking the filesystem.

You could of course design a filesystem which maintained a reverse map
data structure, but it would slow the filesystem down since it would
be a separate data structure that would have to be updated each time
you allocated or freed a disk block. And the only use for such a data
structure would be to make shrinking a filesystem more efficient.
Given that this is generally not a common operation, it seems unlikely
that a filesystem designer would choose to make this particular
tradeoff.

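(A sketch of why it taxes the allocator; the structures here are
hypothetical, not from any real filesystem:)

    #include <stdint.h>

    struct rmap_entry {
            uint64_t owner_inode;  /* which inode owns this block */
            uint64_t file_offset;  /* where in the file it sits */
    };

    static struct rmap_entry rmap[1 << 16];  /* one entry per block */

    static void alloc_block(uint64_t blk, uint64_t ino, uint64_t off)
    {
            /* ...the normal bitmap/extent update happens here... */
            rmap[blk] = (struct rmap_entry){ ino, off };  /* extra write */
    }

    static void free_block(uint64_t blk)
    {
            /* ...the normal bitmap/extent update happens here... */
            rmap[blk] = (struct rmap_entry){ 0, 0 };      /* and another */
    }
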
- Ted

2006-01-23 13:31:59

by Mitchell Blank Jr

Subject: Re: soft update vs journaling?

Theodore Ts'o wrote:
> in general, most filesystems don't have an efficient way to
> answer the question, "who owns this arbitrary disk block?"
[...]
> Given that this is generally not a common operation, it seems unlikely
> that a filesystem designer would choose to make this particular
> tradeoff.

True -- a much more rational approach would be to provide a translation
table for "old block #" to "new block #" -- then when the filesystem sees
a reference to an invalid blocknumber (>= the filesystem size) it can just
translate it to its new location.

You have to be careful if the filesystem is regrown since some of those
block numbers may now be valid again. It can easily be handled by just
moving the data back to its original block # and removing the mapping.

This doesn't completely remove the extra cost on the block allocator
fastpath: if a block is freed, the allocator must make sure to remove
any entry pointing to it from the translation table or you can't handle
regrowth properly (the block could have been reused by a file pointing
to the real block # -- you won't know whether to move it back or not).
However, this is probably a lot cheaper than maintaining a full
reverse-map, plus you only pay it after a shrink has actually happened.

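In sketch form (all names invented), the read-path lookup would be:

    #include <stddef.h>
    #include <stdint.h>

    struct remap {
            uint64_t old_blk;  /* reference >= the new filesystem size */
            uint64_t new_blk;  /* where the data now lives */
    };

    static struct remap table[256];
    static size_t nr_remaps;
    static uint64_t fs_blocks;  /* current filesystem size in blocks */

    static uint64_t resolve(uint64_t blk)
    {
            size_t i;

            if (blk < fs_blocks)
                    return blk;  /* ordinary block, no lookup at all */
            for (i = 0; i < nr_remaps; i++)
                    if (table[i].old_blk == blk)
                            return table[i].new_blk;
            return blk;  /* stale reference; real code would error out */
    }

Ordinary blocks never touch the table, which is why the fastpath cost
stays confined to post-shrink bookkeeping.
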
-Mitch

2006-01-23 13:33:34

by Kyle Moffett

Subject: Re: soft update vs journaling?

On Jan 23, 2006, at 02:24, Theodore Ts'o wrote:
> Hot-shrinking a filesystem is certainly possible for any
> filesystem, but the problem is how many filesystem data structures
> you have to walk in order to find the owners of all of the
> blocks that you have to relocate. That generally isn't a RAM
> overhead problem, but the fact that in general, most filesystems
> don't have an efficient way to answer the question, "who owns this
> arbitrary disk block?" Having extents means you have a slightly
> more efficient encoding system, but it still is the case that you
> have to check potentially every file in the filesystem to see if it
> is the owner of one of the disk blocks that needs to be moved when
> you are shrinking the filesystem.

The way that I'm considering implementing this is by intentionally
fragmenting the allocation bitmap, catalog file, etc, such that each
1/8 or so of the disk contains its own allocation bitmap describing
its contents, its own set of files or directories, etc. The
allocator would largely try to keep individual btree fragments
cohesive, such that one of the 1/8th divisions of the disk would only
have pertinent data for itself. The idea would be that when trying
to look up an allocation block, in the common case you need only
parse a much smaller subsection of the disk structures.

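In sketch form, the lookup becomes simple arithmetic (hypothetical names
and sizes):

    #include <stdint.h>

    #define BLOCKS_PER_GROUP (1u << 15)  /* roughly 1/8 of a small volume */

    static inline uint32_t block_to_group(uint32_t alloc_block)
    {
            return alloc_block / BLOCKS_PER_GROUP;
    }

    static inline uint32_t block_in_group(uint32_t alloc_block)
    {
            return alloc_block % BLOCKS_PER_GROUP;
    }

    /* Only that one group's bitmap and btree fragment need to be
     * consulted for a lookup in the common case. */
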
> And the only use for such a [reverse-mapping] data structure would
> be to make shrinking a filesystem more efficient.

Not entirely true. I _believe_ you could use such data structures to
make the allocation algorithm much more robust against fragmentation
if you record the right kind of information.

Cheers,
Kyle Moffett

--
If you don't believe that a case based on [nothing] could potentially
drag on in court for _years_, then you have no business playing with
the legal system at all.
-- Rob Landley



2006-01-23 13:52:53

by Antonio Vargas

Subject: Re: soft update vs journaling?

On 1/23/06, Kyle Moffett <[email protected]> wrote:
> On Jan 23, 2006, at 02:24, Theodore Ts'o wrote:
> > Hot-shrinking a filesystem is certainly possible for any
> > filesystem, but the problem is how many filesystem data structures
> > you have to walk in order to find the owners of all of the
> > blocks that you have to relocate. That generally isn't a RAM
> > overhead problem, but the fact that in general, most filesystems
> > don't have an efficient way to answer the question, "who owns this
> > arbitrary disk block?" Having extents means you have a slightly
> > more efficient encoding system, but it still is the case that you
> > have to check potentially every file in the filesystem to see if it
> > is the owner of one of the disk blocks that needs to be moved when
> > you are shrinking the filesystem.
>
> The way that I'm considering implementing this is by intentionally
> fragmenting the allocation bitmap, catalog file, etc, such that each
> 1/8 or so of the disk contains its own allocation bitmap describing
> its contents, its own set of files or directories, etc. The
> allocator would largely try to keep individual btree fragments
> cohesive, such that one of the 1/8th divisions of the disk would only
> have pertinent data for itself. The idea would be that when trying
> to look up an allocation block, in the common case you need only
> parse a much smaller subsection of the disk structures.

this sounds exactly the same as ext2/ext3 allocation groups :)

> > And the only use for such a [reverse-mapping] data structure would
> > be to make shrinking a filesystem more efficient.
>
> Not entirely true. I _believe_ you could use such data structures to
> make the allocation algorithm much more robust against fragmentation
> if you record the right kind of information.
>
> Cheers,
> Kyle Moffett
>
> --
> If you don't believe that a case based on [nothing] could potentially
> drag on in court for _years_, then you have no business playing with
> the legal system at all.
> -- Rob Landley
>

--
Greetz, Antonio Vargas aka winden of network

http://wind.codepixel.com/
[email protected]
[email protected]

Every day, every year
you have to work
you have to study
you have to scene.

2006-01-23 16:48:26

by Kyle Moffett

Subject: Linux VFS architecture questions

On Jan 23, 2006, at 08:52:51, Antonio Vargas wrote:
> On 1/23/06, Kyle Moffett <[email protected]> wrote:
>> The way that I'm considering implementing this is by intentionally
>> fragmenting the allocation bitmap, catalog file, etc, such that
>> each 1/8 or so of the disk contains its own allocation bitmap
>> describing its contents, its own set of files or directories,
>> etc. The allocator would largely try to keep individual btree
>> fragments cohesive, such that one of the 1/8th divisions of the
>> disk would only have pertinent data for itself. The idea would be
>> that when trying to look up an allocation block, in the common
>> case you need only parse a much smaller subsection of the disk
>> structures.
>
> this sounds exactly the same as ext2/ext3 allocation groups :)

Great! I'm trying to learn about filesystem design and
implementation, which is why I started writing my own hfsplus
filesystem (otherwise I would have just used the in-kernel one). Do
you have any recommended reading (either online or otherwise) for
someone trying to understand the kernel's VFS and blockdev
interfaces? I _think_ I understand the basics of buffer_head,
super_block, and have some idea of how to use aops, but it's tough
going trying to find out what functions to call to manage cached disk
blocks, or under what conditions the various VFS functions are
called. I'm trying to write up a "Linux Disk-Based Filesystem
Developers Guide" based on what I learn, but it's remarkably sparse
so far.

One big question I have: HFS/HFS+ have an "extents overflow" btree
that contains extents beyond the first 3 (for HFS) or 8 (for HFS+).
I would like to speculatively cache parts of that btree when the
files are accessed, but not if memory is short, and I would like to
allow the filesystem to free up parts of the btree under the same
circumstances. I have a preliminary understanding of how to trigger
the filesystem to read various blocks of metadata (using
buffer_heads) or file data for programs (by returning a block number
from the appropriate aops function), but how do I allocate data
structures as "easily reclaimable" and indicate to the kernel that it
can ask me to reclaim that memory?

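(From grepping around mm/vmscan.c it looks like set_shrinker() might be
the hook for this -- my guess at its use below, modeled on how the dcache
registers itself in 2.6; the signatures may well be wrong, so treat this
as part of the question rather than a statement:)

    /* Unverified guess at the 2.6-era reclaim callback interface. */
    #include <linux/mm.h>

    static int hfsx_shrink_extent_cache(int nr_to_scan, gfp_t gfp_mask)
    {
            /* drop up to nr_to_scan cached extents-overflow nodes here */
            return 0;  /* should return objects remaining in the cache */
    }

    static struct shrinker *hfsx_shrinker;

    /* at mount:
     *      hfsx_shrinker = set_shrinker(DEFAULT_SEEKS,
     *                                   hfsx_shrink_extent_cache);
     * at unmount:
     *      remove_shrinker(hfsx_shrinker);
     */
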
Thanks for the help!

Cheers,
Kyle Moffett

2006-01-23 17:00:14

by Pekka Enberg

Subject: Re: Linux VFS architecture questions

Hi Kyle,

On 1/23/06, Kyle Moffett <[email protected]> wrote:
> Great! I'm trying to learn about filesystem design and
> implementation, which is why I started writing my own hfsplus
> filesystem (otherwise I would have just used the in-kernel one). Do
> you have any recommended reading (either online or otherwise) for
> someone trying to understand the kernel's VFS and blockdev
> interfaces? I _think_ I understand the basics of buffer_head,
> super_block, and have some idea of how to use aops, but it's tough
> going trying to find out what functions to call to manage cached disk
> blocks, or under what conditions the various VFS functions are
> called. I'm trying to write up a "Linux Disk-Based Filesystem
> Developers Guide" based on what I learn, but it's remarkably sparse
> so far.

Did you read Documentation/filesystems/vfs.txt? Also, books Linux
Kernel Development and Understanding the Linux Kernel have fairly good
information on VFS (and related) stuff.

Pekka

2006-01-23 17:51:12

by Kyle Moffett

Subject: Re: Linux VFS architecture questions

On Jan 23, 2006, at 12:00, Pekka Enberg wrote:
> Hi Kyle,
>
> On 1/23/06, Kyle Moffett <[email protected]> wrote:
>> Great! I'm trying to learn about filesystem design and
>> implementation, which is why I started writing my own hfsplus
>> filesystem (otherwise I would have just used the in-kernel one).
>> Do you have any recommended reading (either online or otherwise)
>> for someone trying to understand the kernel's VFS and blockdev
>> interfaces? I _think_ I understand the basics of buffer_head,
>> super_block, and have some idea of how to use aops, but it's tough
>> going trying to find out what functions to call to manage cached
>> disk blocks, or under what conditions the various VFS functions
>> are called. I'm trying to write up a "Linux Disk-Based Filesystem
>> Developers Guide" based on what I learn, but it's remarkably
>> sparse so far.
>
> Did you read Documentation/filesystems/vfs.txt?

Yeah, that was the first thing I looked at. Once I've got things
figured out, I'll probably submit a fairly hefty patch to that file
to add additional documentation.

> Also, books Linux Kernel Development and Understanding the Linux
> Kernel have fairly good information on VFS (and related) stuff.

Ah, thanks again! It looks like both of those are available through
my university's Safari/ProQuest subscription (http://
safari.oreilly.com/), so I'll take a look right away!

Cheers,
Kyle Moffett

--
I lost interest in "blade servers" when I found they didn't throw
knives at people who weren't supposed to be in your machine room.
-- Anthony de Boer


2006-01-23 17:54:20

by Randy Dunlap

Subject: Re: Linux VFS architecture questions

On Mon, 23 Jan 2006, Kyle Moffett wrote:

> On Jan 23, 2006, at 12:00, Pekka Enberg wrote:
> > Hi Kyle,
> >
> > On 1/23/06, Kyle Moffett <[email protected]> wrote:
> >> Great! I'm trying to learn about filesystem design and
> >> implementation, which is why I started writing my own hfsplus
> >> filesystem (otherwise I would have just used the in-kernel one).
> >> Do you have any recommended reading (either online or otherwise)
> >> for someone trying to understand the kernel's VFS and blockdev
> >> interfaces? I _think_ I understand the basics of buffer_head,
> >> super_block, and have some idea of how to use aops, but it's tough
> >> going trying to find out what functions to call to manage cached
> >> disk blocks, or under what conditions the various VFS functions
> >> are called. I'm trying to write up a "Linux Disk-Based Filesystem
> >> Developers Guide" based on what I learn, but it's remarkably
> >> sparse so far.
> >
> > Did you read Documentation/filesystems/vfs.txt?
>
> Yeah, that was the first thing I looked at. Once I've got things
> figured out, I'll probably submit a fairly hefty patch to that file
> to add additional documentation.
>
> > Also, books Linux Kernel Development and Understanding the Linux
> > Kernel have fairly good information on VFS (and related) stuff.
>
> Ah, thanks again! It looks like both of those are available through
> my university's Safari/ProQuest subscription (http://
> safari.oreilly.com/), so I'll take a look right away!

This web page is terribly out of date, but you might find
a few helpful links on it (near the bottom):
http://www.xenotime.net/linux/linux-fs.html

--
~Randy

2006-01-23 18:52:14

by John Richard Moser

Subject: Re: soft update vs journaling?

Michael Loftis wrote:
>
>
> --On January 22, 2006 1:42:38 AM -0500 John Richard Moser
> <[email protected]> wrote:
>
>> So I've been researching, because I thought this "Soft Update" thing
>> that BSD uses was some weird freak-ass way to totally corrupt a file
>> system if the power drops. Seems I was wrong; it's actually just the
>> opposite, an alternate solution to journaling. So let's compare notes.
>
>
> I hate to say it...but in my experience, this has been exactly the case
> with soft updates and FreeBSD 4 up to 4.11 pre-releases.
>
> Whenever something untoward would happen, the filesystem almost always
> lost files and/or data, usually just files though. In practice it's
> never really worked too well for me. It also still requires a full fsck
> on boot, which means long boot times for recovery on large filesystems.

You lost files in use, or random files?

Soft Update was designed to assure file system consistency. In typical
usage, when you drop power on something like FAT, you create a 'hole' in
the filesystem. This hole could be something like files pointing to
allocated blocks belonging to other files; or crossed dentries; etc. As
you use the file system, it simply accepts the information it gets,
because it doesn't look bad until you look at EVERYTHING. The effect is
akin to repeatedly sodomizing the file system in this newly created
hole; you just cause more and more damage until the system gives out.
The system makes allocations and decisions based on faulty data and
really, really screws things up.

The idea of Soft Update was to make sure that while you may lose
something, when you come back up the FS is in a safely usable state.
The fsck only colors in a view of the FS and frees up blocks that don't
seem to be allocated by any particular file, an annoying but mostly
harmless side effect of losing power in this scheme.


2006-01-23 19:32:24

by Matthias Andree

Subject: Re: soft update vs journaling?

On Mon, 23 Jan 2006, John Richard Moser wrote:

> The idea of Soft Update was to make sure that while you may lose
> something, when you come back up the FS is in a safely usable state.

Soft Updates are *extremely* sensitive to reordered writes, and their
writes are more likely to be reordered than a stream to a linear
journal is. Don't even THINK of using softupdates without enforcing
write order. ext3fs, particularly with data=ordered or data=journal, is
much more forgiving in my experience. Not that I'd endorse dangerous use
of file systems, but the average user just doesn't know.

FreeBSD (stable@ Cc:d) seems to have no notion of write barriers as of
yet; wedging the SCSI bus in the middle of a write sequence caused major
devastation with WCE=1, took me two runs of fsck to repair
(unfortunately I needed the (test) machine back up at once, so no time
to snapshot the b0rked partition for later scrutiny), and left me with
two hundred files relocated to the lost+found office^Wdirectory.

Of course, it's in the "Doctor, doctor, it always hurts my right eye if
I'm drinking coffee" -- "well, remove the spoon from your mug before
drinking then" (don't do that) category of "bug", but it has practical
relevance...

--
Matthias Andree

2006-01-23 20:48:47

by folkert

Subject: Re: soft update vs journaling?

> You could of course design a filesystem which maintained a reverse map
> data structure, but it would slow the filesystem down since it would
> be a separate data structure that would have to be updated each time
> you allocated or freed a disk block. And the only use for such a data
> structure would be to make shrinking a filesystem more efficient.
> Given that this is generally not a common operation, it seems unlikely
> that a filesystem designer would choose to make this particular
> tradeoff.

Or you could ship it switched off by default. E.g. reserve the space for
it and activate it as soon as some magic switch is set in the kernel.
Then some background process could build it up while also keeping track
of current changes. When everything is finished, update some flag
to let the resizer know it can do its job.


Folkert van Heusden

--
http://www.vanheusden.com/recoverdm/ - got an unreadable cd with scratches?
recoverdm might help you recovering data
--------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, http://www.vanheusden.com

2006-01-24 02:37:24

by Jörn Engel

Subject: Re: soft update vs journaling?

On Sun, 22 January 2006 20:08:17 +0100, Arjan van de Ven wrote:
>
> it stands for "logging" since jffs2 at least is NOT a journalling
> filesystem.... but a logging one. I assume jffs is too.

s/logging/log-structured/

People could (and did) argue that jffs[|2] is a journalling
filesystem consisting of a journal and _no_ regular storage. Which is
quite sane. Having a live-fast, die-young journal confined to a small
portion of the device would kill it quickly, no doubt.

Jörn

--
The exciting thing about writing is creating an order where none
existed before.
-- Doris Lessing