On Sat, 10 Feb 2007 22:06:37 -0500, Sorin Faibish <[email protected]> wrote:
> Introducing DualFS
>
> File system developers have played with the idea of separating
> meta-data from data for a while. The idea was recently revived by a
> small group of file system enthusiasts from Spain (from the
> little-known University of Murcia) and it is called DualFS. We believe
> that this separation will bring great value to Linux file systems.
>
> We see DualFS as a next-generation journaling file system which
> has the same consistency guarantees as traditional journaling
> file systems but better performance characteristics. The new file
> system puts data and meta-data on different devices (usually, two
> partitions on the same disk or different disks), and manages them
> in very different ways:
>
> 1. DualFS has only one copy of every meta-data block. This copy is
> in the meta-data device,
>
> 2. The meta-data device is a log which is used by DualFS both to
> read and to write meta-data blocks.
>
> 3. DualFS avoids an extra copy of meta-data blocks, which allows
> DualFS to achieve higher performance than other journaling file
> systems.
>
> 4. DualFS implements performance enhancements: meta-data prefetch,
> on-line meta-data relocation and faster fsck and mkfs operations.
>
> 5. The DualFS file system is suitable for use with terabytes and petabytes of storage.
>
> We have carried out several experiments comparing DualFS with other
> popular Linux file systems, namely Ext2, Ext3, XFS, JFS, and ReiserFS.
> The results, for both performance and manageability, show the value of
> the new file system design based on the separation of data and
> meta-data, which increases performance dramatically, by up to 97%,
> simply by using an additional partition on the same disk.
>
> We have performed extensive tests using micro-benchmarks as well
> as macro-benchmarks, including Postmark v1.5, SpecWeb99, and TPCC-uva.
> We also measured the performance of maintenance tasks like mkfs and
> fsck. All of these show that DualFS performance is superior to that of
> the other file systems tested, with an advantage in the range of
> 50-300% depending on the benchmark and the configuration. And all of
> this performance advantage is a direct result of the separation of
> meta-data and data.
>
> The project was started in 2000 by Juan Piernas Canovas as the primary
> and almost sole contributor, with some small contributions by Toni
> Cortes and Jose M. Garcia. The project was stopped for some time. We
> restarted it last year, and after several months of updates and tests
> we created a SourceForge project with the intent of sharing the value
> of this old and yet new concept.
>
> The DualFS code, tools and performance papers are available at:
>
> <http://sourceforge.net/projects/dualfs>
>
> The code requires kernel patches to 2.4.19 (oldies but goodies) and
> a separate fsck code. The latest kernel we used it for is 2.6.11
> and we hope with your help to port it to the latest Linux kernel.
>
> We will present the architecture, principles and performance
> characterization at LSF07 next week.
>
> We are very interested in your feedback and criticism.
>
> Sorin Faibish and Juan Piernas Canovas
> --------------------------------------
On Feb 14 2007 16:10, sfaibish wrote:
>>
>> 1. DualFS has only one copy of every meta-data block. This copy is
>> in the meta-data device,
Where does this differ from typical filesystems like xfs?
At least ext3 and xfs have an option to store the log/journal
on another device too.
>> The DualFS code, tools and performance papers are available at:
>>
>> <http://sourceforge.net/projects/dualfs>
>>
>> The code requires kernel patches to 2.4.19 (oldies but goodies) and
>> a separate fsck code. The latest kernel we used it for is 2.6.11
>> and we hope with your help to port it to the latest Linux kernel.
Where is the patch for 2.6.11? Sorry, 2.4.19 is just too old (even if
only considering the 2.4 branch).
[ And perhaps "goldies" flew better with "oldies" ;-) ]
Jan
--
ft: http://freshmeat.net/p/chaostables/
Hi all,
On Wed, 14 Feb 2007, Jan Engelhardt wrote:
>
> On Feb 14 2007 16:10, sfaibish wrote:
>>>
>>> 1. DualFS has only one copy of every meta-data block. This copy is
>>> in the meta-data device,
>
> Where does this differ from typical filesystems like xfs?
> At least ext3 and xfs have an option to store the log/journal
> on another device too.
No, it is not the same. DualFS uses two 'logical' devices, one for data
and one for meta-data, but these devices are usually partitions on the
same disk; they are not two different disks. And DualFS uses the meta-data
device both to read and to write meta-data blocks, whereas the other
journaling file systems only use the log to write meta-data.
It's true that XFS, Ext3 and other journaling file systems can use a
separate disk for the log, but, even then, they have to write two copies
of every meta-data element. However, in this case, DualFS is even better.
If the data and meta-data devices of DualFS can be on different disks,
DualFS is able to READ and WRITE data and meta-data blocks in PARALLEL.
The other journaling file systems, however, can only write one of the two
copies of every meta-data block in parallel with other file system
operations; they cannot write the second copy, or read and write
data and meta-data blocks, in parallel.
>
>>> The DualFS code, tools and performance papers are available at:
>>>
>>> <http://sourceforge.net/projects/dualfs>
>>>
>>> The code requires kernel patches to 2.4.19 (oldies but goodies) and
>>> a separate fsck code. The latest kernel we used it for is 2.6.11
>>> and we hope with your help to port it to the latest Linux kernel.
>
> Where is the patch for 2.6.11? Sorry, 2.4.19 is just too old (even if
> only considering the 2.4 branch).
> [ And perhaps "goldies" flew better with "oldies" ;-) ]
>
The patch for 2.6.11 is still not stable enough to be released. Be patient
;-)
Juan.
>
> Jan
>
--
D. Juan Piernas Cánovas
Departamento de Ingeniería y Tecnología de Computadores
Facultad de Informática. Universidad de Murcia
Campus de Espinardo - 30080 Murcia (SPAIN)
Tel.: +34968367657 Fax: +34968364151
email: [email protected]
PGP public key:
http://pgp.rediris.es:11371/pks/lookup?search=piernas%40ditec.um.es&op=index
*** Please send me your documents in plain text, HTML, PDF or PostScript format :-) ***
Juan Piernas Canovas <[email protected]> writes:
[playing devil's advocate here]
> If the data and meta-data devices of DualFS can be on different disks,
> DualFS is able to READ and WRITE data and meta-data blocks in
> PARALLEL.
XFS can do this too using its real time volumes (which don't contain
any metadata). It can also have a separate log.
Also many storage subsystems have some internal parallelism
in writing (e.g. a RAID can write on different disks in parallel for
a single partition), so I'm not sure your distinction is that useful.
If you stripe two disks with a standard fs, versus using one of them
as a metadata volume and the other as a data volume with DualFS, I would
expect the striped variant usually to be faster, because it will give
parallelism not only to data versus metadata, but also to all data
versus other data.
Also I would expect your design to be slow for metadata read intensive
workloads. E.g. have you tried to boot a root partition with dual fs?
That's a very important IO benchmark for desktop Linux systems.
-Andi
On Feb 15 2007 21:38, Andi Kleen wrote:
>
>Also I would expect your design to be slow for metadata read intensive
>workloads. E.g. have you tried to boot a root partition with dual fs?
>That's a very important IO benchmark for desktop Linux systems.
Did someone say metadata intensive? Try kernel tarballs, they're a
perfect workload. Tons of directories, and even more files here and
there. Works wonders.
Jan
--
On Thu, 15 February 2007 19:38:14 +0100, Juan Piernas Canovas wrote:
>
> The patch for 2.6.11 is still not stable enough to be released. Be patient
> ;-)
While I don't want to discourage you, this is about the point in
development where most log structured filesystems stopped. Doing a
little web research, you will notice those todo-lists with "cleaner"
being the top item for...years!
Getting that one to work robustly is _very_ hard work and just today
I've noticed that mine was not as robust as I would have liked to think.
Also, you may note that by updating to newer kernels, the VM writeout
policies can change and impact your cleaner, to the extent that you
had a rock-solid filesystem with 2.6.18 and things crumble between your
fingers in 2.6.19 or later.
If the latter happens, most likely the VM is not to blame, it just
proved that your cleaner is still getting some corner-cases wrong and
needs more work. There goes another week of debugging. :(
Jörn
Hi Andi,
On Thu, 15 Feb 2007, Andi Kleen wrote:
> Juan Piernas Canovas <[email protected]> writes:
>
> [playing devil's advocate here]
>
>> If the data and meta-data devices of DualFS can be on different disks,
>> DualFS is able to READ and WRITE data and meta-data blocks in
>> PARALLEL.
>
> XFS can do this too using its real time volumes (which don't contain
> any metadata). It can also have a separate log.
But you still need several 'real' devices to separate data and meta-data
blocks. DualFS does the same with just one real device. Probably 'data
device' and 'meta-data device' names are a bit confusing. Think about
them as partitions, not as real devices.
>
> Also many storage subsystems have some internal parallelism
> in writing (e.g. a RAID can write on different disks in parallel for
> a single partition), so I'm not sure your distinction is that useful.
>
But we are talking about a different case. What I have said is that if you
use two devices, one for the 'regular' file system and another one for the
log, DualFS is better in that case because it can use the log for reads.
Other journaling file systems can not do that.
> If you stripe two disks with a standard fs, versus using one of them
> as a metadata volume and the other as a data volume with DualFS, I would
> expect the striped variant usually to be faster, because it will give
> parallelism not only to data versus metadata, but also to all data
> versus other data.
>
If you have a RAID system, both the data and meta-data devices of DualFS
can be striped, and you get the same result. No problem for DualFS :)
> Also I would expect your design to be slow for metadata read intensive
> workloads. E.g. have you tried to boot a root partition with dual fs?
> That's a very important IO benchmark for desktop Linux systems.
>
I do not think so. The performance of DualFS is superb in meta-data read
intensive workloads. And it is also better than the performance of other
file systems when reading a directory tree with several copies of the Linux
kernel source code (I showed those results on Tuesday at the LSF07
workshop).
> -Andi
>
Regards,
Juan.
Hi Jörn,
On Thu, 15 Feb 2007, Jörn Engel wrote:
> On Thu, 15 February 2007 19:38:14 +0100, Juan Piernas Canovas wrote:
>>
>> The patch for 2.6.11 is still not stable enough to be released. Be patient
>> ;-)
>
> While I don't want to discourage you, this is about the point in
> development where most log structured filesystems stopped. Doing a
> little web research, you will notice those todo-lists with "cleaner"
> being the top item for...years!
>
> Getting that one to work robustly is _very_ hard work and just today
> I've noticed that mine was not as robust as I would have liked to think.
> Also, you may note that by updating to newer kernels, the VM writeout
> policies can change and impact your cleaner, to the extent that you
> had a rock-solid filesystem with 2.6.18 and things crumble between your
> fingers in 2.6.19 or later.
>
> If the latter happens, most likely the VM is not to blame, it just
> proved that your cleaner is still getting some corner-cases wrong and
> needs more work. There goes another week of debugging. :(
>
> Jörn
>
Actually, the version of DualFS for Linux 2.4.19 implements a cleaner. In
our case, the cleaner is not really a problem because there is not too
much to clean (the meta-data device only contains meta-data blocks which
are 5-6% of the file system blocks; you do not have to move data blocks).
Anyway, thank you for warning me ;-)
Regards,
Juan.
> >Also many storage subsystems have some internal parallelism
> >in writing (e.g. a RAID can write on different disks in parallel for
> >a single partition), so I'm not sure your distinction is that useful.
> >
> But we are talking about a different case. What I have said is that if you
> use two devices, one for the 'regular' file system and another one for the
> log, DualFS is better in that case because it can use the log for reads.
> Other journaling file systems can not do that.
Shadow-paging-based systems typically can, but we have no widely used
one on Linux (reiser4 would probably be the closest).
> >If you stripe two disks with a standard fs, versus using one of them
> >as a metadata volume and the other as a data volume with DualFS, I would
> >expect the striped variant usually to be faster, because it will give
> >parallelism not only to data versus metadata, but also to all data
> >versus other data.
> >
> If you have a RAID system, both the data and meta-data devices of DualFS
> can be striped, and you get the same result. No problem for DualFS :)
Sure, but then you need four disks. And if your workload happens
to be much more data-intensive than metadata-intensive, the
striped spindles assigned to metadata only will be more idle
than the ones doing data.
Striping everything from the same pool has the potential
to adapt itself to any workload mix better.
I can see that you win for some specific workloads, but it is
hard to see how you can win over a wide range of workloads
because of that.
>
> >Also I would expect your design to be slow for metadata read intensive
> >workloads. E.g. have you tried to boot a root partition with dual fs?
> >That's a very important IO benchmark for desktop Linux systems.
> >
> I do not think so. The performance of DualFS is superb in meta-data read
> intensive workloads. And it is also better than the performance of other
> file systems when reading a directory tree with several copies of the Linux
> kernel source code (I showed those results on Tuesday at the LSF07
> workshop)
PDFs available?
Is that with running an LFS-style cleaner in between, or without?
I would be interested in an "install distro with installer; boot afterwards
from it" type benchmark. Do you have something like this?
-Andi
On Thu, 15 Feb 2007 14:46:34 -0500, Jan Engelhardt
<[email protected]> wrote:
>
> On Feb 15 2007 21:38, Andi Kleen wrote:
>>
>> Also I would expect your design to be slow for metadata read intensive
>> workloads. E.g. have you tried to boot a root partition with dual fs?
>> That's a very important IO benchmark for desktop Linux systems.
>
> Did someone say metadata intensive? Try kernel tarballs, they're a
> perfect workload. Tons of directories, and even more files here and
> there. Works wonders.
I just did that, per your request. To make things more relevant, I
created a file structure from the 2.4.19 kernel sources and repeated it
recursively into the deepest dir level (10) four times, ending up with
7280 directories, 40 levels of directory depth, and a 1 GB data set.
I ran both tar and untar operations on the tree for ext3, reiserfs,
JFS and DualFS, remounting the FS before each test. Both the tar file
and the directory tree were on the FS under test.
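For reference, here is a rough sketch of how the tar/untar timing itself
could be scripted. It is illustrative only: the paths are placeholders and
it assumes the replicated tree is already in place on the FS under test;
this is not the exact procedure behind the numbers below.

# Rough sketch of a tar/untar timing run (illustrative; paths are assumptions).
import os, subprocess, time

WORK = "/mnt/test"                      # assumed mount point of the FS under test
TARBALL = os.path.join(WORK, "tree.tar")
UNTAR_DIR = os.path.join(WORK, "untarred")

def timed(cmd):
    """Run a command and return its elapsed wall-clock time in seconds."""
    t0 = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - t0

os.makedirs(UNTAR_DIR, exist_ok=True)
print("tar   %3.0f s" % timed(["tar", "-cf", TARBALL, "-C", WORK, "tree"]))
print("untar %3.0f s" % timed(["tar", "-xf", TARBALL, "-C", UNTAR_DIR]))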
Here are the results (elapsed time in seconds):

              tar    untar
  ext3:       144     143
  reiserfs:   100     100
  JFS:        196     140
  DualFS:      63      54
Hope this helps.
>
>
> Jan
/Sorin
Hi,
On Fri, 16 Feb 2007, Andi Kleen wrote:
>>> If you stripe two disks with a standard fs, versus using one of them
>>> as a metadata volume and the other as a data volume with DualFS, I would
>>> expect the striped variant usually to be faster, because it will give
>>> parallelism not only to data versus metadata, but also to all data
>>> versus other data.
>>>
>> If you have a RAID system, both the data and meta-data devices of DualFS
>> can be striped, and you get the same result. No problem for DualFS :)
>
> Sure, but then you need four disks. And if your workload happens
> to be much more data-intensive than metadata-intensive, the
> striped spindles assigned to metadata only will be more idle
> than the ones doing data.
>
> Striping everything from the same pool has the potential
> to adapt itself to any workload mix better.
>
Why do you need four disks? Data and meta-data devices of DualFS can be on
different disks, can be two partitions of the same disk, or can be two
areas of the same partition. The important thing is that data and
meta-data blocks are separated and that they are managed in different
ways. Please, take a look at the presentation (see below).
> I can see that you win for some specific workloads, but it is
> hard to see how you can win over a wide range of workloads
> because of that.
>
No, we win for a wide range of common workloads. See the results in the
PDF (see below).
>>
>>> Also I would expect your design to be slow for metadata read intensive
>>> workloads. E.g. have you tried to boot a root partition with dual fs?
>>> That's a very important IO benchmark for desktop Linux systems.
>>>
>> I do not think so. The performance of DualFS is superb in meta-data read
>> intensive workloads. And it is also better than the performance of other
>> file systems when reading a directory tree with several copies of the Linux
>> kernel source code (I showed those results on Tuesday at the LSF07
>> workshop)
>
> PDFs available?
>
Sure:
http://www.ditec.um.es/~piernas/dualfs/presentation-lsf07-final.pdf
> Is that with running an LFS-style cleaner in between, or without?
>
'With' a cleaner.
> I would be interested in a "install distro with installer ; boot afterwards
> from it" type benchmark. Do you have something like this?
>
> -Andi
I think that the results sent by Sorin answer your question :-)
Regards,
Juan.
On Thu, 15 February 2007 23:59:14 +0100, Juan Piernas Canovas wrote:
> >
> Actually, the version of DualFS for Linux 2.4.19 implements a cleaner. In
> our case, the cleaner is not really a problem because there is not too
> much to clean (the meta-data device only contains meta-data blocks which
> are 5-6% of the file system blocks; you do not have to move data blocks).
That sounds as if you have not hit the "interesting" cases yet. Fun
starts when your device is near-full and you have a write-intensive
workload. In your case, that would be metadata-write-intensive. For
one, this is where write performance of log-structured filesystems
usually goes down the drain. And worse, it is where the cleaner can
run into a deadlock.
Being good where log-structured filesystems usually are horrible is a
challenge. And I'm sure many people are more interested in those
performance numbers than in the ones you shine at. :)
Jörn
--
Joern's library part 14:
http://www.sandpile.org/
Jörn Engel wrote:
> On Thu, 15 February 2007 23:59:14 +0100, Juan Piernas Canovas wrote:
>> Actually, the version of DualFS for Linux 2.4.19 implements a cleaner. In
>> our case, the cleaner is not really a problem because there is not too
>> much to clean (the meta-data device only contains meta-data blocks which
>> are 5-6% of the file system blocks; you do not have to move data blocks).
>
> That sounds as if you have not hit the "interesting" cases yet. Fun
> starts when your device is near-full and you have a write-intensive
> workload. In your case, that would be metadata-write-intensive. For
> one, this is where write performance of log-structured filesystems
> usually goes down the drain. And worse, it is where the cleaner can
> run into a deadlock.
>
> Being good where log-structured filesystems usually are horrible is a
> challenge. And I'm sure many people are more interested in those
> performance numbers than in the ones you shine at. :)
>
Actually I am interested in the common case, where the machine is not
out of space, or memory, or CPU, but when it is appropriately sized to
the workload. Not that I lack interest in corner cases, but the "running
flat out" case doesn't reflect case where there's enough hardware, now
the o/s needs to use it well.
The one high load benchmark I would love to see is a web server, running
tux, with a load over a large (number of files) distributed data set.
The much faster tar create times posted make me think that a server with
a lot of files would benefit, when CPU and memory requirements are not a
bottleneck.
--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
On Fri, 16 February 2007 18:47:48 -0500, Bill Davidsen wrote:
> >
> Actually I am interested in the common case, where the machine is not
> out of space, or memory, or CPU, but when it is appropriately sized to
> the workload. Not that I lack interest in corner cases, but the "running
> flat out" case doesn't reflect case where there's enough hardware, now
> the o/s needs to use it well.
There is one detail about this specific corner case you may be missing.
Most log-structured filesystems don't just drop in performance - they
can run into a deadlock and the only recovery from this is the lovely
backup-mkfs-restore procedure.
If it was just performance, I would agree with you.
Jörn
--
He that composes himself is wiser than he that composes a book.
-- B. Franklin
Jörn Engel wrote:
> On Fri, 16 February 2007 18:47:48 -0500, Bill Davidsen wrote:
>
>> Actually I am interested in the common case, where the machine is not
>> out of space, or memory, or CPU, but when it is appropriately sized to
>> the workload. Not that I lack interest in corner cases, but the "running
>> flat out" case doesn't reflect case where there's enough hardware, now
>> the o/s needs to use it well.
>>
>
> There is one detail about this specific corner case you may be missing.
> Most log-structured filesystems don't just drop in performance - they
> can run into a deadlock and the only recovery from this is the lovely
> backup-mkfs-restore procedure.
>
I missed that. Which corner case did you find triggers this in DualFS?
> If it was just performance, I would agree with you.
--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
On Sat, 17 February 2007 13:10:23 -0500, Bill Davidsen wrote:
> >
> I missed that. Which corner case did you find triggers this in DualFS?
This is not specific to DualFS, it applies to any log-structured
filesystem.
Garbage collection always needs at least one spare segment to collect
valid data into. Regular writes may require additional free segments,
so GC has to kick in and free those when space is getting tight. (1)
GC frees a segment by writing all the valid data in it into the spare
segment. If there is remaining space in the spare segment, GC can move
more data from a further segment. Nice and simple.
The requirement is that GC *always* frees more segments than it uses up
doing so. If that requirement is not fulfilled, GC will simply use up
its last spare segment without freeing a new one. We have a deadlock.
Now imagine your filesystem is 90% full and all data is spread perfectly
across all segments. The best segment you could pick for GC is 90%
full. One would imagine that GC would only need to copy those 90% into
a spare segment and have freed 100%, making overall progress.
But most log-structured filesystems maintain a tree of some sort on the
medium. If you move data elsewhere, you also need to update the
indirect block pointing to it. So that has to get written as well. If
you have doubly or triply indirect blocks, those need to get written.
So you can end up writing 180% or more to free 100%. Deadlock.
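To put rough numbers on that (purely illustrative; the segment size and the
one-indirect-block-per-move assumption are made up to match the 90% example
above):

# Purely illustrative: worst-case write cost of one naive cleaner pass.
SEGMENT_BLOCKS = 100     # assumed segment size, in blocks
valid = 90               # victim segment is 90% full
extra_per_block = 1      # each moved block drags one indirect block with it

written = valid * (1 + extra_per_block)  # 180 blocks written ...
freed = SEGMENT_BLOCKS                   # ... to free one 100-block segment
print("wrote %d blocks to free %d -> %d%% amplification"
      % (written, freed, 100 * written // freed))
# With doubly or triply indirect blocks the factor only gets worse.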
And if you read the documentation of the original Sprite LFS or any
other of the newer log-structured filesystems, you usually won't see a
solution to this problem, or even an acknowledgement that the problem
exists in the first place. But there is no shortage of log-structured
filesystem projects that were abandoned years ago and have "cleaner" or
"garbage collector" as their top item on the todo-list. Coincidence?
(1) GC may also kick in earlier, but that is just an optimization and
doesn't change the worst case, so that bit is irrelevant here.
Btw, the deadlock problem is solvable and I definitely don't want to
discourage further work in this area. DualFS does look interesting.
But my solution for this problem will likely eat up all the performance
DualFS has gained and more, as it isn't aimed at hard disks. So someone
has to come up with a different idea.
Jörn
--
To recognize individual spam features you have to try to get into the
mind of the spammer, and frankly I want to spend as little time inside
the minds of spammers as possible.
-- Paul Graham
On Sat, 17 Feb 2007 13:36:46 -0500, Jörn Engel <[email protected]>
wrote:
> On Sat, 17 February 2007 13:10:23 -0500, Bill Davidsen wrote:
>> >
>> I missed that. Which corner case did you find triggers this in DualFS?
>
> This is not specific to DualFS, it applies to any log-structured
> filesystem.
>
> Garbage collection always needs at least one spare segment to collect
> valid data into. Regular writes may require additional free segments,
> so GC has to kick in and free those when space is getting tight. (1)
>
> GC frees a segment by writing all the valid data in it into the spare
> segment. If there is remaining space in the spare segment, GC can move
> more data from a further segment. Nice and simple.
>
> The requirement is that GC *always* frees more segments than it uses up
> doing so. If that requirement is not fulfilled, GC will simply use up
> its last spare segment without freeing a new one. We have a deadlock.
>
> Now imagine your filesystem is 90% full and all data is spread perfectly
> across all segments. The best segment you could pick for GC is 90%
> full. One would imagine that GC would only need to copy those 90% into
> a spare segment and have freed 100%, making overall progress.
>
> But most log-structured filesystems maintain a tree of some sort on the
> medium. If you move data elsewhere, you also need to update the
> indirect block pointing to it. So that has to get written as well. If
> you have doubly or triply indirect blocks, those need to get written.
> So you can end up writing 180% or more to free 100%. Deadlock.
>
> And if you read the documentation of the original Sprite LFS or any
> other of the newer log-structured filesystems, you usually won't see a
> solution to this problem, or even an acknowledgement that the problem
> exists in the first place. But there is no shortage of log-structured
> filesystem projects that were abandoned years ago and have "cleaner" or
> "garbage collector" as their top item on the todo-list. Coincidence?
>
>
> (1) GC may also kick in earlier, but that is just an optimization and
> doesn't change the worst case, so that bit is irrelevant here.
>
>
> Btw, the deadlock problem is solvable and I definitely don't want to
> discourage further work in this area. DualFS does look interesting.
> But my solution for this problem will likely eat up all the performance
> DualFS has gained and more, as it isn't aimed at hard disks. So someone
> has to come up with a different idea.
DualFS can probably get around this corner case, as it is up to the user
to select the size of the MD device. If you want to prevent this corner
case, you can always use a device bigger than 10% of the data device,
which is exaggerated for any FS, even assuming that the directory files
are that large (this is when you have billions of files with long names).
In general, the problem you mention is mainly due to the data blocks
filling the file system. In the DualFS case you have the choice of
selecting different sizes for the MD and Data volumes. When the Data
volume gets full, the GC will have a problem, but the MD device will not.
It is my understanding that most of the GC problem you mention is due to
the filling of the FS with data, the result being an MD operation
disrupted by the filling of the FS with data blocks. As for the
performance impact of solving this problem, as you mentioned all
journal FSs will have this problem, and I am sure that the DualFS
performance impact will be less than that of others, at least due to
using only one MD write instead of two.
>
> Jörn
>
--
Best Regards
Sorin Faibish
Senior Technologist
Senior Consulting Software Engineer Network Storage Group
EMC² where information lives
Phone: 508-435-1000 x 48545
Cellphone: 617-510-0422
Email : [email protected]
On Sat, 17 February 2007 15:47:01 -0500, Sorin Faibish wrote:
>
> DualFS can probably get around this corner case, as it is up to the user
> to select the size of the MD device. If you want to prevent this corner
> case, you can always use a device bigger than 10% of the data device,
> which is exaggerated for any FS, even assuming that the directory files
> are that large (this is when you have billions of files with long names).
> In general, the problem you mention is mainly due to the data blocks
> filling the file system. In the DualFS case you have the choice of
> selecting different sizes for the MD and Data volumes. When the Data
> volume gets full, the GC will have a problem, but the MD device will not.
> It is my understanding that most of the GC problem you mention is due to
> the filling of the FS with data, the result being an MD operation
> disrupted by the filling of the FS with data blocks. As for the
> performance impact of solving this problem, as you mentioned all
> journal FSs will have this problem, and I am sure that the DualFS
> performance impact will be less than that of others, at least due to
> using only one MD write instead of two.
You seem to be making the usual mistakes people make when they start to
think about this problem. But I could be misinterpreting you, so let me
paraphrase your mail as questions and answer what I believe you said.
Q: Are journaling filesystems identical to log-structured filesystems?
Not quite. Journaling filesystems usually have a very small journal (or
log, same thing) and only store the information necessary for atomic
transactions in the journal. Not sure what a "journal FS" is, but the
name seems closer to a journaling filesystem.
Q: DualFS separates Data and Metadata. Does that make a difference?
Not really. What I called "data" in my previous mail is a
log-structured filesystem's view of data. DualFS stores file content
separately, so from an lfs view, that doesn't even exist. But directory
content exists and behaves just like file content wrt. the deadlock
problem. Any data or metadata that cannot be GC'd by simply copying but
requires writing further information like indirect blocks, B-Tree nodes,
etc. will cause the problem.
Q: If the user simply reserves some extra space, does the problem go
away?
Definitely not. It will be harder to hit, but a rare deadlock is still
a deadlock. Again, this is only concerned with the log-structured part
of DualFS, so we can ignore the Data volume.
When data is spread perfectly across all segments, the best segment one
can pick for GC is just as bad as the worst. So let us take some
examples. If 50% of the lfs is free, you can pick a 50% segment for GC.
Writing every single block in it may require writing one additional
indirect block, so GC is required to write out a 100% segment. It
doesn't make any progress at 50% (in a worst case scenario) and could
deadlock if less than 50% were free.
If, however, GC has to write out a singly and a doubly indirect block,
67% of the lfs needs to be free. In general, if the maximum height of
your tree is N, you need (N-1)/N * 100% free space. Most people refer
to that as "too much".
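Spelled out for a few tree heights (a throwaway illustration, not code from
any of the filesystems discussed here):

# Free-space fraction a naive cleaner needs: moving one block also moves
# its height-1 ancestor blocks, so progress requires free >= (N-1)/N.
for height in range(2, 6):
    required_free = (height - 1) / height
    print("tree height %d: keep at least %2.0f%% free"
          % (height, 100 * required_free))
# height 2 -> 50%, height 3 -> 67%, height 4 -> 75%, height 5 -> 80%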
If you have less free space, the filesystem will work just fine "most of
the time". That is nice and cool, but it won't help your rare user that
happens to hit the rare deadlock. Any lfs needs a strategy to prevent
this deadlock for good, not just make it mildly unlikely.
Jörn
--
"Error protection by error detection and correction."
-- from a university class
Maybe this is a decent approach to deal with the problem. First some
definitions. T is the target segment to be cleaned, S is the spare
segment that valid data is written to, O are other segments that contain
indirect blocks I for valid data D in T.
Have two different GC mechanisms to choose between:
1. Regular GC that copies D and I into S. On average D+I should require
less space than S can offer.
2. Slow GC only copies D into S. Indirect blocks get modified in-place
in O. This variant requires more seeks due to writing in various O,
but it guarantees that D always requires less space than S can offer.
Whenever you are running out of spare segments and are in danger of the
deadlock, switch to mechanism 2. Now your correctness problem is
reduced to a performance problem.
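In code, the switch could look something like this (a minimal sketch; the
function name and the threshold are invented for illustration and not taken
from any existing cleaner):

def pick_gc_mode(free_spare_segments, reserve=2):
    """Choose between the two cleaning mechanisms described above."""
    if free_spare_segments > reserve:
        # 1. Regular GC: copy valid data D and its indirect blocks I into S.
        #    Fewer seeks, but D+I may need more space than S offers.
        return "copy D and I into S"
    # 2. Slow GC: copy only D into S and update I in place in O.
    #    More seeks, but D alone can never overflow S, so no deadlock.
    return "copy D into S, update I in place"

print(pick_gc_mode(free_spare_segments=10))   # regular mode
print(pick_gc_mode(free_spare_segments=1))    # slow mode near the danger point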
Jörn
Hi Jörn,
I understand the problem that you describe with respect to the GC, but
let me explain why I think that it has a small impact on DualFS.
Actually, the GC may become a problem when the number of free segments is
50% or less. If your LFS always guarantees, at least, 50% of free
"segments" (note that I am talking about segments, not free space), the
deadlock problem disappears, right? This is a quite naive solution, but it
works.
In a traditional LFS, with data and meta-data blocks, 50% of free segments
represents a huge amount of wasted disk space. But, in DualFS, 50% of free
segments in the meta-data device is not too much. In a typical Ext2,
or Ext3 file system, there are 20 data blocks for every meta-data block
(that is, meta-data blocks are 5% of the disk blocks used by files).
Since files are implemented in DualFS in the same way, we can suppose the
same ratio for DualFS (1).
Now, let us assume that the data device takes 90% of the disk space, and
the meta-data device the other 10%. When the data device gets full, the
meta-data blocks will be using half of the meta-data device, and the
other half (5% of the entire disk) will be free. Frankly, 5% is not too
much.
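Spelling the arithmetic out (figures from the paragraph above, illustrative
only):

# Illustrative arithmetic for the naive 50%-free-segments policy in DualFS.
disk = 1.0                    # whole disk
md_device = 0.10 * disk       # meta-data device: 10% of the disk
md_used = 0.05 * disk         # meta-data is ~5% of the blocks used by files
reserve = md_device - md_used # kept free so the cleaner always has room
print("space reserved for the cleaner: %.0f%% of the disk" % (100 * reserve))
# -> 5% of the disk, versus ~50% for a traditional LFS that also holds data.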
Remember, I am supposing a naive implementation of the cleaner. With a
cleverer one, the meta-data device can be smaller, and the amount of
disk space finally wasted can be smaller too. The following paper proposes
some improvements:
- Jeanna Neefe Matthews, Drew Roselli, Adam Costello, Randy Wang, and
Thomas Anderson. "Improving the Performance of Log-structured File
Systems with Adaptive Methods". Proc. Sixteenth ACM Symposium on
Operating Systems Principles (SOSP), October 1997, pages 238 - 251.
BTW, I think that what they propose is very similar to the two-strategies
GC that you propose in a separate e-mail.
The point of all the above is that you must improve the common case, and
manage the worst case correctly. And that is the idea behind DualFS :)
Regards,
Juan.
(1) DualFS can also use extents to implement regular files, so the ratio
of data blocks with respect to meta-data blocks can be greater.
On Tue, Feb 20, 2007 at 12:57:50AM +0100, Juan Piernas Canovas wrote:
> Now, let us assume that the data device takes 90% of the disk space, and
> the meta-data device the other 10%. When the data device gets full, the
> meta-data blocks will be using the half of the meta-data device, and the
> other half (5% of the entire disk) will be free. Frankly, 5% is not too
> much.
I'd like to very strongly second this. We have a target maximum of 80%
usage on all partitions, with emailed warnings at 85% and paged warnings
at 90%. I consider any partition over 90% to be dangerously over-full.
Given that disk space is approximately US$1/GB in RAID6 SATA these
days, it doesn't cost much to make sure there's some free.
What matters a lot more is the speed at which you can get data on and
off these huge devices, because drive speed isn't keeping up with
capacity. It takes about a week to fill one of the external storage
monsters we're buying these days, and that's streaming writes. Makes
restoring from backup a scary proposition given the downtime it will
take.
But back on topic - I'd be very willing to pay a 5% disk space penalty
for the sort of performance that DualFS is promising, but I'd need a
2.6 series kernel with the O(1) scheduler or performance would go down
the toilet in the other direction.
Bron.
P.S. do you have any figures for high MMAP load? We run big Cyrus, and
it has a serious MMAP fetish.
On Tue, 20 February 2007 00:57:50 +0100, Juan Piernas Canovas wrote:
>
> I understand the problem that you describe with respect to the GC, but
> let me explain why I think that it has a small impact on DualFS.
>
> Actually, the GC may become a problem when the number of free segments is
> 50% or less. If your LFS always guarantees, at least, 50% of free
> "segments" (note that I am talking about segments, not free space), the
> deadlock problem disappears, right? This is a quite naive solution, but it
> works.
I don't see how you can guarantee 50% free segments. Can you explain
that bit?
> In a traditional LFS, with data and meta-data blocks, 50% of free segments
> represents a huge amount of wasted disk space. But, in DualFS, 50% of free
> segments in the meta-data device is not too much. In a typical Ext2,
> or Ext3 file system, there are 20 data blocks for every meta-data block
> (that is, meta-data blocks are 5% of the disk blocks used by files).
> Since files are implemented in DualFS in the same way, we can suppose the
> same ratio for DualFS (1).
This will work fairly well for most people. It is possible to construct
metadata-heavy workloads, however. Many large directories containing
symlinks or special files (char/block devices, sockets, fifos,
whiteouts) come to mind. Most likely none of your users will ever want
that, but a malicious attacker might.
That, btw, brings me to a completely unrelated topic. Having a fixed
ratio of metadata to data is simple to implement, but allowing this ratio
to dynamically change would be nicer for administration. You can add
that to the Christmas wishlist for the nice boys, if you like.
> Remember, I am supposing a naive implementation of the cleaner. With a
> cleverer one, the meta-data device can be smaller, and the amount of
> disk space finally wasted can be smaller too. The following paper proposes
> some improvements:
>
> - Jeanna Neefe Matthews, Drew Roselli, Adam Costello, Randy Wang, and
> Thomas Anderson. "Improving the Performance of Log-structured File
> Systems with Adaptive Methods". Proc. Sixteenth ACM Symposium on
> Operating Systems Principles (SOSP), October 1997, pages 238 - 251.
>
> BTW, I think that what they propose is very similar to the two-strategies
> GC that you propose in a separate e-mail.
Will have to read it up after I get some sleep. It is late.
> The point of all the above is that you must improve the common case, and
> manage the worst case correctly. And that is the idea behind DualFS :)
A fine principle to work with. Surprisingly, what is the worst case for
you is the common case for LogFS, so maybe I'm more interested in it
than most people. Or maybe I'm just more paranoid.
Anyway, keep up the work. It is an interesting idea to pursue.
Jörn
--
He who knows that enough is enough will always have enough.
-- Lao Tsu
Juan Piernas Canovas wrote:
>
> The point of all the above is that you must improve the common case,
> and manage the worst case correctly.
That statement made it to my quote file. Of course "correctly" hopefully
means getting to the desired behavior without a performance hit so bad
it becomes a "jackpot case" and is correct in result but too slow to be
useful.
--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
Hi Jörn,
On Tue, 20 Feb 2007, Jörn Engel wrote:
> On Tue, 20 February 2007 00:57:50 +0100, Juan Piernas Canovas wrote:
>>
>> Actually, the GC may become a problem when the number of free segments is
>> 50% or less. If your LFS always guarantees, at least, 50% of free
>> "segments" (note that I am talking about segments, not free space), the
>> deadlock problem disappears, right? This is a quite naive solution, but it
>> works.
>
> I don't see how you can guarantee 50% free segments. Can you explain
> that bit?
It is quite simple. If 50% of your segments are busy, and the other 50%
are free, and the file system needs a new segment, the cleaner starts
freeing some of the busy ones. If the cleaner is unable to free at least
one segment, your file system gets "full" (and it returns a nice ENOSPC
error). This solution wastes half of your storage device, but it is
deadlock-free. Obviously, there are better approaches.
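As a minimal sketch of that naive policy (the function and its names are
invented for illustration; this is not the actual DualFS cleaner):

import errno

def allocate_segment(free, total, cleaner_frees_one):
    """Allocate one segment, keeping at least half of all segments free.
    Returns the number of free segments remaining."""
    if free - 1 < total // 2:            # allocation would break the invariant
        if not cleaner_frees_one():      # try to reclaim a busy segment first
            raise OSError(errno.ENOSPC, "meta-data device full")
        free += 1                        # the cleaner freed one segment
    return free - 1                      # one segment handed out

print(allocate_segment(free=6, total=10, cleaner_frees_one=lambda: True))  # 5
print(allocate_segment(free=5, total=10, cleaner_frees_one=lambda: True))  # 5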
>
>> In a traditional LFS, with data and meta-data blocks, 50% of free segments
>> represents a huge amount of wasted disk space. But, in DualFS, 50% of free
>> segments in the meta-data device is not too much. In a typical Ext2,
>> or Ext3 file system, there are 20 data blocks for every meta-data block
>> (that is, meta-data blocks are 5% of the disk blocks used by files).
>> Since files are implemented in DualFS in the same way, we can suppose the
>> same ratio for DualFS (1).
>
> This will work fairly well for most people. It is possible to construct
> metadata-heavy workloads, however. Many large directories containing
> symlinks or special files (char/block devices, sockets, fifos,
> whiteouts) come to mind. Most likely none of your users will ever want
> that, but a malicious attacker might.
>
Quotas, bigger meta-data device, cleverer cleaner,... there are
solutions :)
>> The point of all the above is that you must improve the common case, and
>> manage the worst case correctly. And that is the idea behind DualFS :)
>
> A fine principle to work with. Surprisingly, what is the worst case for
> you is the common case for LogFS, so maybe I'm more interested in it
> than most people. Or maybe I'm just more paranoid.
>
No, you are right. It is the common case for LogFS because it has data and
meta-data blocks in the same address space, but that is not the case for
DualFS. Anyway, I'm very interested in your work because any solution to
the problem of the GC will also be applicable to DualFS. So, keep up the
good work. ;-)
Juan.
On Wed, 21 February 2007 05:36:22 +0100, Juan Piernas Canovas wrote:
> >
> >I don't see how you can guarantee 50% free segments. Can you explain
> >that bit?
> It is quite simple. If 50% of your segments are busy, and the other 50%
> are free, and the file system needs a new segment, the cleaner starts
> freeing some of busy ones. If the cleaner is unable to free one segment at
> least, your file system gets "full" (and it returns a nice ENOSPC error).
> This solution wastes the half of your storage device, but it is
> deadlock-free. Obviously, there are better approaches.
Ah, ok. It is deadlock-free if the maximal height of your tree is 2.
It is not 100% deadlock-free if the height is 3 or more.
Also, I strongly suspect that your tree is higher than 2. A medium
sized directory will have data blocks, indirect blocks and the inode
proper, which gives you a height of 3. Your inodes need to get accessed
somehow and unless they have fixed positions like in ext2, you need a
further tree structure of some sort, so you're more likely looking at a
height of 5.
With a height of 5, you would need to keep 80% of your metadata free.
That is starting to get wasteful.
So I suspect that my proposed alternate cleaner mechanism or the even
better "hole plugging" mechanism proposed in the paper a few posts above
would be a better path to follow.
> >A fine principle to work with. Surprisingly, what is the worst case for
> >you is the common case for LogFS, so maybe I'm more interested in it
> >than most people. Or maybe I'm just more paranoid.
>
> No, you are right. It is the common case for LogFS because it has data and
> meta-data blocks in the same address space, but that is not the case of
> DualFS. Anyway, I'm very interested in your work because any solution to
> the problem of the GC will be also applicable to DualFS. So, keep up with
> it. ;-)
Actually, no. It is the common case for LogFS because it is designed
for flash media. Unlike hard disks, flash lifetime is limited by the
amount of data written to it. Therefore, having a cleaner run when the
filesystem is idle would cause unnecessary writes and reduce lifetime.
As a result, the LogFS cleaner runs as lazily as possible and the
filesystem tries hard not to mix data with different lifetimes in one
segment. LogFS tries to avoid the cleaner like the plague. But if it
ever needs to run it, the deadlock scenario is very close and I need to
be very aware of it. :)
In a way, the DualFS approach does not change the rules for the
log-structured filesystem at all. If you had designed your filesystem
in such a way that you simply used two existing filesystems and wrote
Actual Data (AD) to one, Metadata (MD) to another, what is MD to DualFS
is plain data to one of your underlying filesystems. It can cause a bit
of confusion, because I tend to call MD "data" and you tend to call AD
"data", but that is about all.
Jörn
--
But this is not to say that the main benefit of Linux and other GPL
software is lower-cost. Control is the main benefit--cost is secondary.
-- Bruce Perens
Hi Jörn,
On Wed, 21 Feb 2007, Jörn Engel wrote:
> On Wed, 21 February 2007 05:36:22 +0100, Juan Piernas Canovas wrote:
>>>
>>> I don't see how you can guarantee 50% free segments. Can you explain
>>> that bit?
>> It is quite simple. If 50% of your segments are busy, and the other 50%
>> are free, and the file system needs a new segment, the cleaner starts
>> freeing some of busy ones. If the cleaner is unable to free one segment at
>> least, your file system gets "full" (and it returns a nice ENOSPC error).
>> This solution wastes the half of your storage device, but it is
>> deadlock-free. Obviously, there are better approaches.
>
> Ah, ok. It is deadlock free, if the maximal height of your tree is 2.
> It is not 100% deadlock free if the height is 3 or more.
>
> Also, I strongly suspect that your tree is higher than 2. A medium
> sized directory will have data blocks, indirect blocks and the inode
> proper, which gives you a height of 3. Your inodes need to get accessed
> somehow and unless they have fixed positions like in ext2, you need a
> further tree structure of some sorts, so you're more likely looking at a
> height of 5.
>
> With a height of 5, you would need to keep 80% of you metadata free.
> That is starting to get wasteful.
>
> So I suspect that my proposed alternate cleaner mechanism or the even
> better "hole plugging" mechanism proposed in the paper a few posts above
> would be a better path to follow.
I do not understand. Do you mean that if I have 10 segments, 5 busy and 5
free, after cleaning I could need 6 segments? How? Where do the extra
blocks come from?
Juan.
On Wed, 21 February 2007 19:31:40 +0100, Juan Piernas Canovas wrote:
>
> I do not understand. Do you mean that if I have 10 segments, 5 busy and 5
> free, after cleaning I could need 6 segments? How? Where do the extra
> blocks come from?
This is a fairly complicated subject and I have trouble explaining it to
people - even though I hope that maybe one or two dozen understand it by
now. So let me try to give you an example:
In LogFS, inodes are stored in an inode file. There are no B-Trees yet,
so the regular unix indirect blocks are used. My example will be
writing to a directory, so that should only involve metadata by your
definition and be a valid example for DualFS as well. If it is not,
please tell me where the difference lies.
The directory is large, so appending to it involves writing a datablock
(D0), an indirect block (D1) and a doubly indirect block (D2).
Before:
Segment 1: [some data] [ D1 ] [more data]
Segment 2: [some data] [ D0 ] [more data]
Segment 3: [some data] [ D2 ] [more data]
Segment 4: [ empty ]
...
After:
Segment 1: [some data] [garbage] [more data]
Segment 2: [some data] [garbage] [more data]
Segment 3: [some data] [garbage] [more data]
Segment 4: [D0][D1][D2][ empty ]
...
Ok. After this, the position of D2 on the medium has changed. So we
need to update the inode and write that as well. If the inode number
for this directory is high, we will need to write the inode (I0), an
indirect block (I1) and a doubly indirect block (I2). The picture
becomes a bit more complicated.
Before:
Segment 1: [some data] [ D1 ] [more data]
Segment 2: [some data] [ D0 ] [more data]
Segment 3: [some data] [ D2 ] [more data]
Segment 4: [ empty ]
Segment 5: [some data] [ I1 ] [more data]
Segment 6: [some data] [ I0 ] [more data]
Segment 7: [some data] [ I2 ] [more data]
...
After:
Segment 1: [some data] [garbage] [more data]
Segment 2: [some data] [garbage] [more data]
Segment 3: [some data] [garbage] [more data]
Segment 4: [D0][D1][D2][I0][I1][I2][ empty ]
Segment 5: [some data] [garbage] [more data]
Segment 6: [some data] [garbage] [more data]
Segment 7: [some data] [garbage] [more data]
...
So what has just happened? The user did a single "touch foo" in a large
directory and has caused six objects to move. Unless some of those
objects were in the same segment before, we now have six segments
containing a tiny amount of garbage.
And there is almost no way you can squeeze that garbage back out.
The cleaner will fundamentally do the same thing as a regular write - it
will move objects. So if you want to clean a segment containing the
block of a different directory, you may again have to move five
additional objects, the indirect blocks, inode and ifile indirect
blocks.
At this point, your cleaner is becoming a threat. There is a real
danger that it will create more garbage in unrelated segments than it
frees up. I claim that you cannot keep 50% clean segments, unless you
move away from the simplistic cleaner I described above.
Jörn
--
If you're willing to restrict the flexibility of your approach,
you can almost always do something better.
-- John Carmack
Hi Jörn,
I have been thinking about the problem that you describe, and,
definitely, DualFS does not have that problem. I could be wrong, but
I actually believe that the GC implemented by DualFS is deadlock-free.
The key is the design of the log-structured file system used by DualFS for
the meta-data device, which is different from the design that you propose.
On Wed, 21 Feb 2007, Jörn Engel wrote:
> On Wed, 21 February 2007 19:31:40 +0100, Juan Piernas Canovas wrote:
>>
>> I do not understand. Do you mean that if I have 10 segments, 5 busy and 5
>> free, after cleaning I could need 6 segments? How? Where the extra blocks
>> come from?
>
> This is a fairly complicated subject and I have trouble explaining it to
> people - even though I hope that maybe one or two dozen understand it by
> now. So let me try to give you an example:
>
> In LogFS, inodes are stored in an inode file. There are no B-Trees yet,
> so the regular unix indirect blocks are used. My example will be
> writing to a directory, so that should only involve metadata by your
> definition and be a valid example for DualFS as well. If it is not,
> please tell me where the difference lies.
>
> The directory is large, so appending to it involves writing a datablock
> (D0), an indirect block (D1) and a doubly indirect block (D2).
>
> Before:
> Segment 1: [some data] [ D1 ] [more data]
> Segment 2: [some data] [ D0 ] [more data]
> Segment 3: [some data] [ D2 ] [more data]
> Segment 4: [ empty ]
> ...
DualFS writes meta-blocks in variable-sized chunks that we call partial
segments. The meta-data device, however, is divided into segments, which
have the same size. A partial segment can be as large as a segment, but a
segment usually has more than one partial segment. Besides, a partial
segment cannot cross a segment boundary.
A partial segment is a transaction unit, and contains "all" the blocks
modified by a file system operation, including indirect blocks and i-nodes
(actually, it contains the blocks modified by several file system
operations, but let us assume that every partial segment only contains the
blocks modified by a single file system operation).
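As a sketch of what such a transaction unit might look like (the layout
and field names below are invented for illustration; they are not the
actual DualFS on-disk format):

#include <stdint.h>
#include <stdio.h>

struct partial_segment_hdr {
        uint32_t magic;     /* marks the start of a partial segment        */
        uint32_t nblocks;   /* data + indirect + i-node blocks that follow */
        uint64_t seq;       /* log sequence number, used for recovery      */
        uint32_t checksum;  /* covers the header and the following blocks  */
};

int main(void)
{
        printf("header size: %zu bytes\n", sizeof(struct partial_segment_hdr));
        return 0;
}

The only property that matters for this discussion is that the direct
blocks, their indirect blocks and the i-node are written together and
never split across a segment boundary.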
So, the above figure is as follows in DualFS:
Before:
Segment 1: [some data] [ D0 D1 D2 I ] [more data]
Segment 2: [ some data ]
Segment 3: [ empty ]
If the datablock D0 is modified, what you get is:
Segment 1: [some data] [ garbage ] [more data]
Segment 2: [ some data ]
Segment 3: [ D0 D1 D2 I ] [ empty ]
This is very similar to what the cleaner does. Therefore, moving a direct
datablock (D0) to a new segment does not require more space than in the
original segment. That is, cleaning a segment in DualFS requires just one
free segment, and not more.
The result is that you can use all the free segments in DualFS, and its
cleaner is simple and deadlock-free. Probably the design is not the most
space-efficient in the world, but it removes some other serious problems.
And, remember, we are talking about meta-data (which is a small part of
the file system), and disk space (which is quite inexpensive).
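To illustrate the claim that one free segment is enough, here is a
self-contained toy in C (made-up segment size and layout, not the real
DualFS cleaner):

#include <stdio.h>
#include <string.h>

#define SEG_SIZE 8                         /* blocks per segment (toy value) */

struct segment { char blocks[SEG_SIZE]; }; /* 'D' = live block, '.' = dead   */

/* Because a partial segment already carries its indirect blocks and
 * i-node, its live blocks do not grow when moved, so the live content
 * of one victim segment always fits into a single free segment. */
static void clean_one_segment(struct segment *victim, struct segment *spare)
{
        int out = 0;

        for (int i = 0; i < SEG_SIZE; i++)
                if (victim->blocks[i] == 'D')
                        spare->blocks[out++] = 'D';

        memset(victim->blocks, '.', SEG_SIZE);  /* the victim is free again */
}

int main(void)
{
        struct segment victim = { "D.DD..D." };  /* 4 live blocks, 4 garbage */
        struct segment spare  = { "........" };

        clean_one_segment(&victim, &spare);
        printf("victim: %.8s  spare: %.8s\n", victim.blocks, spare.blocks);
        return 0;
}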
Regards,
Juan.
>
> After:
> Segment 1: [some data] [garbage] [more data]
> Segment 2: [some data] [garbage] [more data]
> Segment 3: [some data] [garbage] [more data]
> Segment 4: [D0][D1][D2][ empty ]
> ...
>
> Ok. After this, the position of D2 on the medium has changed. So we
> need to update the inode and write that as well. If the inode number
> for this directory is high, we will need to write the inode (I0), an
> indirect block (I1) and a doubly indirect block (I2). The picture
> becomes a bit more complicated.
>
> Before:
> Segment 1: [some data] [ D1 ] [more data]
> Segment 2: [some data] [ D0 ] [more data]
> Segment 3: [some data] [ D2 ] [more data]
> Segment 4: [ empty ]
> Segment 5: [some data] [ I1 ] [more data]
> Segment 6: [some data] [ I0 ] [more data]
> Segment 7: [some data] [ I2 ] [more data]
> ...
>
> After:
> Segment 1: [some data] [garbage] [more data]
> Segment 2: [some data] [garbage] [more data]
> Segment 3: [some data] [garbage] [more data]
> Segment 4: [D0][D1][D2][I0][I1][I2][ empty ]
> Segment 5: [some data] [garbage] [more data]
> Segment 6: [some data] [garbage] [more data]
> Segment 7: [some data] [garbage] [more data]
> ...
>
> So what has just happened? The user did a single "touch foo" in a large
> directory and has caused six objects to move. Unless some of those
> objects were in the same segment before, we now have six segments
> containing a tiny amount of garbage.
>
> And there is almost no way you can squeeze that garbage back out.
> The cleaner will fundamentally do the same thing as a regular write - it
> will move objects. So if you want to clean a segment containing the
> block of a different directory, you may again have to move five
> additional objects, the indirect blocks, inode and ifile indirect
> blocks.
>
> At this point, your cleaner is becoming a threat. There is a real
> danger that it will create more garbage in unrelated segments than it
> frees up. I claim that you cannot keep 50% clean segments, unless you
> move away from the simplistic cleaner I described above.
>
> Jörn
>
>
--
D. Juan Piernas Cánovas
Departamento de Ingeniería y Tecnología de Computadores
Facultad de Informática. Universidad de Murcia
Campus de Espinardo - 30080 Murcia (SPAIN)
Tel.: +34968367657 Fax: +34968364151
email: [email protected]
PGP public key:
http://pgp.rediris.es:11371/pks/lookup?search=piernas%40ditec.um.es&op=index
*** Please send me your documents in text, HTML, PDF or PostScript format :-) ***
On Thu, 22 February 2007 05:30:03 +0100, Juan Piernas Canovas wrote:
>
> DualFS writes meta-blocks in variable-sized chunks that we call partial
> segments. The meta-data device, however, is divided into segments, which
> have the same size. A partial segment can be as large as a segment, but a
> segment usually has more than one partial segment. Besides, a partial
> segment cannot cross a segment boundary.
Sure, that's a fairly common approach.
> A partial segment is a transaction unit, and contains "all" the blocks
> modified by a file system operation, including indirect blocks and i-nodes
> (actually, it contains the blocks modified by several file system
> operations, but let us assume that every partial segment only contains the
> blocks modified by a single file system operation).
>
> So, the above figure is as follows in DualFS:
>
> Before:
> Segment 1: [some data] [ D0 D1 D2 I ] [more data]
> Segment 2: [ some data ]
> Segment 3: [ empty ]
>
> If the datablock D0 is modified, what you get is:
>
> Segment 1: [some data] [ garbage ] [more data]
> Segment 2: [ some data ]
> Segment 3: [ D0 D1 D2 I ] [ empty ]
You have fairly strict assumptions about the "Before:" picture. But
what happens if those assumptions fail? To give you an example, imagine
the following small script:
$ for i in `seq 1000000`; do touch $i; done
This will create a million dentries in one directory. It will also
create a million inodes, but let us ignore those for a moment. It is
fairly unlikely that you can fit a million dentries into [D0], so you
will need more than one block. Let's call them [DA], [DB], [DC], etc.
So you have to write out the first block [DA].
Before:
Segment 1: [some data] [ DA D1 D2 I ] [more data]
Segment 2: [ some data ]
Segment 3: [ empty ]
If the datablock DA is modified, what you get is:
Segment 1: [some data] [ garbage ] [more data]
Segment 2: [ some data ]
Segment 3: [ DA D1 D2 I ] [ empty ]
That is exactly your picture. Fine. Next you write [DB].
Before: see above
After:
Segment 1: [some data] [ garbage ] [more data]
Segment 2: [ some data ]
Segment 3: [ DA][garbage] [ DB D1 D2 I ] [ empty]
You write [DC]. Note that Segment 3 does not have enough space for
another partial segment:
Segment 1: [some data] [ garbage ] [more data]
Segment 2: [ some data ]
Segment 3: [ DA][garbage] [ DB][garbage] [wasted]
Segment 4: [ DC D1 D2 I ] [ empty ]
You write [DD] and [DE]:
Segment 1: [some data] [ garbage ] [more data]
Segment 2: [ some data ]
Segment 3: [ DA][garbage] [ DB][garbage] [wasted]
Segment 4: [ DC][garbage] [ DD][garbage] [wasted]
Segment 5: [ DE D1 D2 I ] [ empty ]
And some time later you even have to switch to a new indirect block, so
you get before:
Segment n : [ DX D1 D2 I ] [ empty ]
After:
Segment n : [ DX D1][garb] [ DY DI D2 I ] [ empty]
What you end up with after all this is quite unlike your "Before"
picture. Instead of this:
> Segment 1: [some data] [ D0 D1 D2 I ] [more data]
You may have something closer to this:
> >Segment 1: [some data] [ D1 ] [more data]
> >Segment 2: [some data] [ D0 ] [more data]
> >Segment 3: [some data] [ D2 ] [more data]
You should try the testcase and look at a dump of your filesystem
afterwards. I usually just read the raw device in a hex editor.
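For example, assuming the meta-data device were /dev/sdb2 (substitute
whatever device you actually use), something like "xxd /dev/sdb2 | less"
or "od -A x -t x1z /dev/sdb2 | less" is enough to eyeball the segment
contents and spot the [garbage] and [wasted] regions.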
Jörn
--
Beware of bugs in the above code; I have only proved it correct, but
not tried it.
-- Donald Knuth
Hi Jörn,
On Thu, 22 Feb 2007, Jörn Engel wrote:
>> A partial segment is a transaction unit, and contains "all" the blocks
>> modified by a file system operation, including indirect blocks and i-nodes
>> (actually, it contains the blocks modified by several file system
>> operations, but let us assume that every partial segment only contains the
>> blocks modified by a single file system operation).
>>
>> So, the above figure is as follows in DualFS:
>>
>> Before:
>> Segment 1: [some data] [ D0 D1 D2 I ] [more data]
>> Segment 2: [ some data ]
>> Segment 3: [ empty ]
>>
>> If the datablock D0 is modified, what you get is:
>>
>> Segment 1: [some data] [ garbage ] [more data]
>> Segment 2: [ some data ]
>> Segment 3: [ D0 D1 D2 I ] [ empty ]
>
> You have fairly strict assumptions about the "Before:" picture. But
The "Before" figure is intentionally simple because it is enough to
understand how the meta-data device of DualFS works, and why the cleaner
is deadlock-free ;-)
> what happens if those assumptions fail? To give you an example, imagine
> the following small script:
>
> $ for i in `seq 1000000`; do touch $i; done
>
> This will create a million dentries in one directory. It will also
> create a million inodes, but let us ignore those for a moment. It is
> fairly unlikely that you can fit a million dentries into [D0], so you
> will need more than one block. Let's call them [DA], [DB], [DC], etc.
> So you have to write out the first block [DA].
>
> Before:
> Segment 1: [some data] [ DA D1 D2 I ] [more data]
> Segment 2: [ some data ]
> Segment 3: [ empty ]
>
> If the datablock DA is modified, what you get is:
>
> Segment 1: [some data] [ garbage ] [more data]
> Segment 2: [ some data ]
> Segment 3: [ DA D1 D2 I ] [ empty ]
>
> That is exactly your picture. Fine. Next you write [DB].
>
> Before: see above
> After:
> Segment 1: [some data] [ garbage ] [more data]
> Segment 2: [ some data ]
> Segment 3: [ DA][garbage] [ DB D1 D2 I ] [ empty]
>
> You write [DC]. Note that Segment 3 does not have enough space for
> another partial segment:
>
> Segment 1: [some data] [ garbage ] [more data]
> Segment 2: [ some data ]
> Segment 3: [ DA][garbage] [ DB][garbage] [wasted]
> Segment 4: [ DC D1 D2 I ] [ empty ]
>
> You write [DD] and [DE]:
> Segment 1: [some data] [ garbage ] [more data]
> Segment 2: [ some data ]
> Segment 3: [ DA][garbage] [ DB][garbage] [wasted]
> Segment 4: [ DC][garbage] [ DD][garbage] [wasted]
> Segment 5: [ DE D1 D2 I ] [ empty ]
>
> And some time later you even have to switch to a new indirect block, so
> you get before:
>
> Segment n : [ DX D1 D2 I ] [ empty ]
>
> After:
>
> Segment n : [ DX D1][garb] [ DY DI D2 I ] [ empty]
>
> What you end up with after all this is quite unlike your "Before"
> picture. Instead of this:
>
>> Segment 1: [some data] [ D0 D1 D2 I ] [more data]
>
I agree with all the above, although it displays the worst case, because a
partial segment usually contains several datablocks and a few indirect
blocks. But it is fine for our purposes.
> You may have something closer to this:
>
>>> Segment 1: [some data] [ D1 ] [more data]
>>> Segment 2: [some data] [ D0 ] [more data]
>>> Segment 3: [some data] [ D2 ] [more data]
>
I do not agree with this picture, because it does not show that all the
indirect blocks which point to a direct block are along with it in the
same segment. That figure should look like:
Segment 1: [some data] [ DA D1' D2' ] [more data]
Segment 2: [some data] [ D0 D1' D2' ] [more data]
Segment 3: [some data] [ DB D1 D2 ] [more data]
where D0, DA, and DB are datablocks, D1 and D2 indirect blocks which
point to the datablocks, and D1' and D2' obsolete copies of those
indirect blocks. By using this figure, it is clear that if you need to
move D0 to clean segment 2, you will need only one free segment at
most, and not more. You will get:
Segment 1: [some data] [ DA D1' D2' ] [more data]
Segment 2: [ free ]
Segment 3: [some data] [ DB D1' D2' ] [more data]
......
Segment n: [ D0 D1 D2 ] [ empty ]
That is, D0 needs in the new segment the same space that it needs in the
previous one.
The differences are subtle but important.
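A toy comparison in C of the two layouts (the counts are assumptions
drawn from the figures above, and this is neither LogFS nor DualFS
code):

#include <stdio.h>

int main(void)
{
        /* Scattered layout: D0, D1, D2, I0, I1, I2 each live in a
         * different segment, so moving the tree leaves a little garbage
         * in up to six places. */
        int scattered_segments_dirtied = 6;

        /* Co-located layout: D0, D1, D2 and I travel together in one
         * partial segment, so moving them leaves garbage in exactly one
         * place, which a single free segment can absorb. */
        int colocated_segments_dirtied = 1;

        printf("scattered tree blocks dirty up to %d segments per move\n",
               scattered_segments_dirtied);
        printf("co-located tree blocks dirty %d segment per move\n",
               colocated_segments_dirtied);
        return 0;
}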
Regards,
Juan.
> You should try the testcase and look at a dump of your filesystem
> afterwards. I usually just read the raw device in a hex editor.
>
> Jörn
>
>
--
D. Juan Piernas Cánovas
Departamento de Ingeniería y Tecnología de Computadores
Facultad de Informática. Universidad de Murcia
Campus de Espinardo - 30080 Murcia (SPAIN)
Tel.: +34968367657 Fax: +34968364151
email: [email protected]
PGP public key:
http://pgp.rediris.es:11371/pks/lookup?search=piernas%40ditec.um.es&op=index
*** Please send me your documents in text, HTML, PDF or PostScript format :-) ***
On Thu, 22 February 2007 20:57:12 +0100, Juan Piernas Canovas wrote:
>
> I do not agree with this picture, because it does not show that all the
> indirect blocks which point to a direct block are along with it in the
> same segment. That figure should look like:
>
> Segment 1: [some data] [ DA D1' D2' ] [more data]
> Segment 2: [some data] [ D0 D1' D2' ] [more data]
> Segment 3: [some data] [ DB D1 D2 ] [more data]
>
> where D0, DA, and DB are datablocks, D1 and D2 indirect blocks which
> point to the datablocks, and D1' and D2' obsolete copies of those
> indirect blocks. By using this figure, it is clear that if you need to
> move D0 to clean segment 2, you will need only one free segment at
> most, and not more. You will get:
>
> Segment 1: [some data] [ DA D1' D2' ] [more data]
> Segment 2: [ free ]
> Segment 3: [some data] [ DB D1' D2' ] [more data]
> ......
> Segment n: [ D0 D1 D2 ] [ empty ]
>
> That is, D0 needs in the new segment the same space that it needs in the
> previous one.
>
> The differences are subtle but important.
Ah, now I see. Yes, that is deadlock-free. If you are not accounting
the bytes of used space but the number of used segments, and you count
each partially used segment the same as a 100% used segment, there is no
deadlock.
Some people may consider this to be cheating, however. It will cause
more than 50% wasted space. All obsolete copies are garbage, after all.
With a maximum tree height of N, you can have up to (N-1) / N of your
filesystem occupied by garbage.
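A quick worked example of that bound, as I read the argument (toy C,
assuming each partial segment holds one leaf block plus its N-1
ancestors, and that later writes supersede the ancestors while the leaf
stays live):

#include <stdio.h>

int main(void)
{
        for (int n = 2; n <= 5; n++)
                printf("tree height %d: up to %.0f%% of the space can be garbage\n",
                       n, 100.0 * (n - 1) / n);
        return 0;
}

For a height of 3 (data block, indirect block, i-node) that is already
up to 66% garbage that still counts as used space.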
It also means that "df" will have unexpected output. You cannot
estimate how much data can fit into the filesystem, as that depends on
how much garbage you will accumulate in the segments. Admittedly this
is not a problem for DualFS, as the uncertainty only exists for
metadata, so "df" for DualFS still makes sense.
Another downside is that with large amounts of garbage between otherwise
useful data, your disk cache hit rate goes down. Read performance is
suffering. But that may be a fair tradeoff and will only show up in
large metadata reads in the uncached (per Linux) case. Seems fair.
Quite interesting, actually. The costs of your design are disk space,
depending on the amount and depth of your metadata, and metadata read
performance. Disk space is cheap and metadata reads tend to be slow for
most filesystems, in comparison to data reads. You gain faster metadata
writes and loss of journal overhead. I like the idea.
Jörn
--
All art is but imitation of nature.
-- Lucius Annaeus Seneca
Jörn,
I am very fond of all your comments and your positive attitude
toward DualFS. I also understand that you have much more experience
than us in regard to GC and "cleaners". The DualFS implementation may
be using old technology that can definitely be improved. Although
we understand the value of DualFS, we never had the intention to
solve all file system issues, only to address some that are critical
for Linux users. We would be very happy to improve the DualFS cleaner
based on your experience. So, if you are interested, we can
rewrite/rearchitect the cleaner using your ideas. After all, we all
want better file systems for Linux, and we believe that there is
goodness in DualFS.
On Fri, 23 Feb 2007 08:26:45 -0500, Jörn Engel <[email protected]>
wrote:
> On Thu, 22 February 2007 20:57:12 +0100, Juan Piernas Canovas wrote:
>>
>> I do not agree with this picture, because it does not show that all the
>> indirect blocks which point to a direct block are along with it in the
>> same segment. That figure should look like:
>>
>> Segment 1: [some data] [ DA D1' D2' ] [more data]
>> Segment 2: [some data] [ D0 D1' D2' ] [more data]
>> Segment 3: [some data] [ DB D1 D2 ] [more data]
>>
>> where D0, DA, and DB are datablocks, D1 and D2 indirect blocks which
>> point to the datablocks, and D1' and D2' obsolete copies of those
>> indirect blocks. By using this figure, it is clear that if you need to
>> move D0 to clean segment 2, you will need only one free segment at
>> most, and not more. You will get:
>>
>> Segment 1: [some data] [ DA D1' D2' ] [more data]
>> Segment 2: [ free ]
>> Segment 3: [some data] [ DB D1' D2' ] [more data]
>> ......
>> Segment n: [ D0 D1 D2 ] [ empty ]
>>
>> That is, D0 needs in the new segment the same space that it needs in the
>> previous one.
>>
>> The differences are subtle but important.
>
> Ah, now I see. Yes, that is deadlock-free. If you are not accounting
> the bytes of used space but the number of used segments, and you count
> each partially used segment the same as a 100% used segment, there is no
> deadlock.
>
> Some people may consider this to be cheating, however. It will cause
> more than 50% wasted space. All obsolete copies are garbage, after all.
> With a maximum tree height of N, you can have up to (N-1) / N of your
> filesystem occupied by garbage.
>
> It also means that "df" will have unexpected output. You cannot
> estimate how much data can fit into the filesystem, as that depends on
> how much garbage you will accumulate in the segments. Admittedly this
> is not a problem for DualFS, as the uncertainty only exists for
> metadata, so "df" for DualFS still makes sense.
>
> Another downside is that with large amounts of garbage between otherwise
> useful data, your disk cache hit rate goes down. Read performance is
> suffering. But that may be a fair tradeoff and will only show up in
> large metadata reads in the uncached (per Linux) case. Seems fair.
>
> Quite interesting, actually. The costs of your design are disk space,
> depending on the amount and depth of your metadata, and metadata read
> performance. Disk space is cheap and metadata reads tend to be slow for
> most filesystems, in comparison to data reads. You gain faster metadata
> writes and loss of journal overhead. I like the idea.
>
> Jörn
>
/Sorin
Hi Jörn,
On Fri, 23 Feb 2007, Jörn Engel wrote:
> On Thu, 22 February 2007 20:57:12 +0100, Juan Piernas Canovas wrote:
>>
>> I do not agree with this picture, because it does not show that all the
>> indirect blocks which point to a direct block are along with it in the
>> same segment. That figure should look like:
>>
>> Segment 1: [some data] [ DA D1' D2' ] [more data]
>> Segment 2: [some data] [ D0 D1' D2' ] [more data]
>> Segment 3: [some data] [ DB D1 D2 ] [more data]
>>
>> where D0, DA, and DB are datablocks, D1 and D2 indirect blocks which
>> point to the datablocks, and D1' and D2' obsolete copies of those
>> indirect blocks. By using this figure, it is clear that if you need to
>> move D0 to clean segment 2, you will need only one free segment at
>> most, and not more. You will get:
>>
>> Segment 1: [some data] [ DA D1' D2' ] [more data]
>> Segment 2: [ free ]
>> Segment 3: [some data] [ DB D1' D2' ] [more data]
>> ......
>> Segment n: [ D0 D1 D2 ] [ empty ]
>>
>> That is, D0 needs in the new segment the same space that it needs in the
>> previous one.
>>
>> The differences are subtle but important.
>
> Ah, now I see. Yes, that is deadlock-free. If you are not accounting
> the bytes of used space but the number of used segments, and you count
> each partially used segment the same as a 100% used segment, there is no
> deadlock.
>
> Some people may consider this to be cheating, however. It will cause
> more than 50% wasted space. All obsolete copies are garbage, after all.
> With a maximum tree height of N, you can have up to (N-1) / N of your
> filesystem occupied by garbage.
I do not agree. Fortunately, most files are written at once, so what
you usually have is:
Segment 1: [ data ]
Segment 2: [some data] [ D0 DA DB D1 D2 ] [more data]
Segment 3: [ data ]
......
On the other hand, the DualFS cleaner tries to clean several segments
every time it runs. Therefore, if you have the following case:
Segment 1: [some data] [ DA D1' D2' ] [more data]
Segment 2: [some data] [ D0 D1' D2' ] [more data]
Segment 3: [some data] [ DB D1' D2' ] [more data]
......
after cleaning, you can have this one:
Segment 1: [           free              ]
Segment 2: [           free              ]
Segment 3: [           free              ]
......
Segment i: [D0 DA DB D1 D2 ] [ more data ]
Moreover, if the cleaner starts running when the free space drops below a
specific threshold, it is very difficult to waste more than 50% of disk
space, especially with meta-data (actually, I am unable to imagine that
situation :).
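A minimal sketch of that policy (the threshold and batch size below are
made-up values, not numbers taken from DualFS):

#include <stdio.h>

#define TOTAL_SEGMENTS 100
#define THRESHOLD_PCT   20   /* kick the cleaner below 20% free (assumed) */
#define BATCH_PER_RUN    8   /* victims cleaned per run (assumed)         */

int main(void)
{
        int free_segments = 15;  /* currently below the threshold */

        if (free_segments * 100 < TOTAL_SEGMENTS * THRESHOLD_PCT)
                printf("cleaner kicked: %d%% free, cleaning %d segments\n",
                       free_segments * 100 / TOTAL_SEGMENTS, BATCH_PER_RUN);
        else
                printf("enough free segments, cleaner stays idle\n");
        return 0;
}

Because the cleaner runs well before space is exhausted and re-packs the
live partial segments of several victims together, in practice it is hard
to get anywhere near the worst-case garbage ratio.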
> Another downside is that with large amounts of garbage between otherwise
> useful data, your disk cache hit rate goes down. Read performance is
> suffering. But that may be a fair tradeoff and will only show up in
> large metadata reads in the uncached (per Linux) case. Seems fair.
Well, our experimental results say otherwise. As I have said, most files
are written at once, so their meta-data blocks are together on disk. This
allows DualFS to implement an explicit prefetching of meta-data blocks
which is quite effective, especially when there are several processes
reading from disk at the same time.
On the other hand, DualFS also implements an on-line meta-data relocation
mechanism which can help to improve meta-data prefetching, and garbage
collection.
Obviously, there can be some slow-growing files that can produce some
garbage, but they do not hurt the overall performance of the file system.
>
> Quite interesting, actually. The costs of your design are disk space,
> depending on the amount and depth of your metadata, and metadata read
> performance. Disk space is cheap and metadata reads tend to be slow for
> most filesystems, in comparison to data reads. You gain faster metadata
> writes and loss of journal overhead. I like the idea.
>
Yeah :) If you have taken a look at my presentation at LFS07, you will
have seen that the disk traffic of meta-data blocks is dominated by writes.
> Jörn
>
Juan.
--
D. Juan Piernas Cánovas
Departamento de Ingeniería y Tecnología de Computadores
Facultad de Informática. Universidad de Murcia
Campus de Espinardo - 30080 Murcia (SPAIN)
Tel.: +34968367657 Fax: +34968364151
email: [email protected]
PGP public key:
http://pgp.rediris.es:11371/pks/lookup?search=piernas%40ditec.um.es&op=index
*** Please send me your documents in text, HTML, PDF or PostScript format :-) ***
On Sun, 25 February 2007 03:41:40 +0100, Juan Piernas Canovas wrote:
>
> Well, our experimental results say otherwise. As I have said, most files
> are written at once, so their meta-data blocks are together on disk. This
> allows DualFS to implement an explicit prefetching of meta-data blocks
> which is quite effective, especially when there are several processes
> reading from disk at the same time.
>
> On the other hand, DualFS also implements an on-line meta-data relocation
> mechanism which can help to improve meta-data prefetching, and garbage
> collection.
>
> Obviously, there can be some slow-growing files that can produce some
> garbage, but they do not hurt the overall performance of the file system.
Well, my concerns about the design have gone. There remain some
concerns about the source code and I hope they will disappear just as
fast. :)
Obviously, a patch against 2.4.x is fairly useless. Iirc, you claimed
somewhere to have a patch against 2.6.11, but I was unable to find that.
Porting 2.6.11 to 2.6.20 should be simple enough.
Then there is some assembly code inside the patch that you seem to have
copied from some other project. I would be surprised if that is really
required. If you can replace it with C code, please do.
If the assembly actually is a performance gain (and I consider it your
duty to prove that), you can have a two-patch series with the first
introducing DualFS and the second adding the assembly as a config option
for one architecture.
> Yeah :) If you have taken a look at my presentation at LFS07, you will
> have seen that the disk traffic of meta-data blocks is dominated by writes.
Last time I tried it was only available to members. Is it generally
available now?
Jörn
--
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra
On Sun, 25 Feb 2007, Jörn Engel wrote:
> On Sun, 25 February 2007 03:41:40 +0100, Juan Piernas Canovas wrote:
>>
>> Well, our experimental results say otherwise. As I have said, most files
>> are written at once, so their meta-data blocks are together on disk. This
>> allows DualFS to implement an explicit prefetching of meta-data blocks
>> which is quite effective, especially when there are several processes
>> reading from disk at the same time.
>>
>> On the other hand, DualFS also implements an on-line meta-data relocation
>> mechanism which can help to improve meta-data prefetching, and garbage
>> collection.
>>
>> Obviously, there can be some slow-growing files that can produce some
>> garbage, but they do not hurt the overall performance of the file system.
>
> Well, my concerns about the design have gone. There remain some
> concerns about the source code and I hope they will disappear just as
> fast. :)
>
This is a bit more complicated ;)
> Obviously, a patch against 2.4.x is fairly useless. Iirc, you claimed
> somewhere to have a patch against 2.6.11, but I was unable to find that.
> Porting 2.6.11 to 2.6.20 should be simple enough.
I'm working on a public patch of DualFS for Linux 2.6.x. It's a matter of
time.
>
> Then there is some assembly code inside the patch that you seem to have
> copied from some other project. I would be surprised if that is really
> required. If you can replace it with C code, please do.
>
> If the assembly actually is a performance gain (and I consider it your
> duty to prove that), you can have a two-patch series with the first
> introducing DualFS and the second adding the assembly as a config option
> for one architecture.
>
No problem. I will see if the assembly code can be replaced with bare C.
Regards,
Juan.
--
D. Juan Piernas Cánovas
Departamento de Ingeniería y Tecnología de Computadores
Facultad de Informática. Universidad de Murcia
Campus de Espinardo - 30080 Murcia (SPAIN)
Tel.: +34968367657 Fax: +34968364151
email: [email protected]
PGP public key:
http://pgp.rediris.es:11371/pks/lookup?search=piernas%40ditec.um.es&op=index
*** Please send me your documents in text, HTML, PDF or PostScript format :-) ***
On 2/14/07, sfaibish <[email protected]> wrote:
> On Sat, 10 Feb 2007 22:06:37 -0500, Sorin Faibish <[email protected]> wrote:
>
> > Introducing DualFS
> >
> > File System developers played with the idea of separation of
> > meta-data from data in file systems for a while. The idea was
> > lately revived by a small group of file system enthusiasts
> > from Spain (from the little known University of Murcia) and
> > it is called DualFS. We believe that the separation idea
> > will bring to Linux file systems great value.
> >
> > We see DualFS as a next-generation journaling file system which
> > has the same consistency guaranties as traditional journaling
> > file systems but better performance characteristics. The new file
> > system puts data and meta-data on different devices (usually, two
> > partitions on the same disk or different disks)
Do you guys have an option of using just one partition, and
dividing it into two fixed parts at mkfs time, one part of X% (for metadata)
and a second part of (100-X)% (for file data blocks)? Would not this option
be an easier-to-administer choice for naive users? Or do you not
view your FS as an option for naive users in the first place?
Yakov
Yakov Lerner wrote:
> On 2/14/07, sfaibish <[email protected]> wrote:
>> On Sat, 10 Feb 2007 22:06:37 -0500, Sorin Faibish <[email protected]>
>> wrote:
>>
>> > Introducing DualFS
>> >
>> > File System developers played with the idea of separation of
>> > meta-data from data in file systems for a while. The idea was
>> > lately revived by a small group of file system enthusiasts
>> > from Spain (from the little known University of Murcia) and
>> > it is called DualFS. We believe that the separation idea
>> > will bring to Linux file systems great value.
>> >
>> > We see DualFS as a next-generation journaling file system which
>> > has the same consistency guaranties as traditional journaling
>> > file systems but better performance characteristics. The new file
>> > system puts data and meta-data on different devices (usually, two
>> > partitions on the same disk or different disks)
>
> Do you guys have an option of using just one partition, and
> dividing it into two fixed parts at mkfs time, one part of X% (for metadata)
> and a second part of (100-X)% (for file data blocks)? Would not this option
> be an easier-to-administer choice for naive users? Or do you not
> view your FS as an option for naive users in the first place?
Personally, I would prefer to put the metadata part in a file inside another filesystem.
- The file can shrink and grow as needed.
- If you have several partitions like that, you do not have that much overhead from fixed-size metadata.
For example, I have 41 external HDDs (11TB, currently with XFS) where I could split the metadata from the actual data.
And if it were possible to mount the metadata without its data, I would gain the following niceties:
- I could gather all the metadata in one place.
- No filesystem "overhead" on the external HDDs, just "pure" data.
- They would be searchable, at least at the metadata level, so I could search for something and then connect the needed HDD for full access.
In my case the metadata overhead should be quite small, as the smallest file is about 400MB and they range up to 5GB. For example, I have HDDs which are filled by only about 90 files.
The files are mostly write-once, so the current overhead of DualFS would be 'extreme' in my case.
AFAIR the overhead of XFS (with the log reduced to near minimum) was about 4-5MB in total for a 200GB HDD, but I'm currently at work so I can't verify that.
--
Real Programmers consider "what you see is what you get" to be just as
bad a concept in Text Editors as it is in women. No, the Real Programmer
wants a "you asked for it, you got it" text editor -- complicated,
cryptic, powerful, unforgiving, dangerous.
On Mon, 26 Feb 2007 06:49:05 -0500, Yakov Lerner <[email protected]> wrote:
> On 2/14/07, sfaibish <[email protected]> wrote:
>> On Sat, 10 Feb 2007 22:06:37 -0500, Sorin Faibish <[email protected]>
>> wrote:
>>
>> > Introducing DualFS
>> >
>> > File System developers played with the idea of separation of
>> > meta-data from data in file systems for a while. The idea was
>> > lately revived by a small group of file system enthusiasts
>> > from Spain (from the little known University of Murcia) and
>> > it is called DualFS. We believe that the separation idea
>> > will bring to Linux file systems great value.
>> >
>> > We see DualFS as a next-generation journaling file system which
>> > has the same consistency guaranties as traditional journaling
>> > file systems but better performance characteristics. The new file
>> > system puts data and meta-data on different devices (usually, two
>> > partitions on the same disk or different disks)
>
> Do you guys have an option of using just one partition, and
> dividing it into two fixed parts at mkfs time, one part of X% (for metadata)
> and a second part of (100-X)% (for file data blocks)?
From an ease-of-use perspective I agree that this could be an option,
but the point is to be able to manage the MD completely separately,
for example in cases when you want to extend the MD volumes or
data volumes independently. It is all about different
address spaces and fencing. I am not sure it is good for fsck tasks.
> Would not this option
> be an easier-to-administer choice for naive users?
From a code perspective it should be possible. We could put it on a TODO
list if it is of interest to many. It is not a matter of naive users
or not. We are all naive when we use an FS and are not experts in FS.
> Or do you not
> view your FS as an option for naive users in the first place?
If we are all naive, then of course we want DualFS to be useful for
all. We are open to proposed features and criticism from
the community. It is just a matter of resources.
>
> Yakov
>
>
>
--
Best Regards
Sorin Faibish
Senior Technologist
Senior Consulting Software Engineer Network Storage Group
EMC² where information lives
Phone: 508-435-1000 x 48545
Cellphone: 617-510-0422
Email : [email protected]