LinuxLists.cc - Porting Zfs features to ext2/3

[permalink] [raw]

Subject: Re: Porting Zfs features to ext2/3

On Sun, Jul 27, 2008 at 01:49:53AM -0700, postrishi wrote:
> I want to know that has any work been done to port the Zfs features to
> ext2/3

There are some new features in ext4, but the primary goal of ext4
development is to improve the filesystem by adding features such as
extents, delayed allocation, etc., in an evolutionary way, so that an
existing ext3 filesystem could be upgraded to ext4. So the primary
goal is not to try to complete with the ZFS features set.

The btrfs filesystem effort is an attempt to create a filesystem that
will leapfrog the ZFS feature set, but it will probably take longer to
reach production ready status than ext4.

Regards,

- Ted

2008-07-27 22:54:43

[permalink] [raw]

Subject: Re: Porting Zfs features to ext2/3

On Sun, 2008-07-27 at 01:49 -0700, postrishi wrote:
> Hello
>
> I want to know that has any work been done to port the Zfs features to
> ext2/3

Did you know that ZFS is available for Linux?

Cheers,
Eric

Attachments:

signature.asc (189.00 B)
This is a digitally signed message part

2008-07-27 23:04:25

by Shehjar Tikoo

[permalink] [raw]

Subject: Re: Porting Zfs features to ext2/3

Hi Ted

Theodore Tso wrote:
> The btrfs filesystem effort is an attempt to create a filesystem
> that will leapfrog the ZFS feature set, but it will probably take
> longer to reach production ready status than ext4.

Since you mention btrfs here and since I've read this earlier too, do
you know if btrfs will be the default Linux file system in the future,
like extX has been?

>
> Regards,
>
> - Ted

2008-07-27 23:37:29

[permalink] [raw]

Subject: Re: Porting Zfs features to ext2/3

On Mon, Jul 28, 2008 at 08:49:18AM +1000, Shehjar Tikoo wrote:
> Hi Ted
>
> Theodore Tso wrote:
>> The btrfs filesystem effort is an attempt to create a filesystem that
>> will leapfrog the ZFS feature set, but it will probably take longer to
>> reach production ready status than ext4.
>
> Since you mention btrfs here and since I've read this earlier too, do
> you know if btrfs will be the default Linux file system in the future,
> like extX has been?

The nature of Linux is such that these sorts of decisions are not made
by anyone other than each individual system administrator, and by the
distributions who choose which filesystem they which to use as the
"default". There is no such thing as an "official" default
filesystem. For example, the Maemo distribution, which Nokia
distributes for use on the N800/N810 devices, uses jffs2 as its
default filesystem, since those devices use a flash storage device.
DragonLinux, which is designed to be installed on top of DOS/Windows
uses UMSDOS as its default distribution.

What happens in the future, who can say? At some point the ext2/3/4
filesystem, which is based fundamentally on a BSD Fast Filesystem
design base, may get displaced by a filesystem which uses some very
different design as a starting point, when the advantages of starting
with that different design outweighs the advantages of backwards
compatibility and broad base of support which is enjoyed by ext2/3/4.

To give one example from the past, filesystems like JFS were
theoretically better than ext3 at the time, but unfortunately all of
the expertise was concentrated in one company (IBM), and so
distributions were slow to accept it. In the meantime, ext3 was able
to add enough features (htree directories, better SMP scalability) to
eventually meet and then surpass JFS's technical advantages.

XFS has a number of technical advantages over ext3, but the number of
people who understand it are small, and people seem to like the tools
built for ext3 --- and now ext4 has a number of features that were
previously exclusive to XFS. XFS is still the best filesystem for
very large, SGI-class machines, however. But for general purpose
computing, most people are more comfortable with ext3.

Yet the fact that we are retaining backwards compatibility with ext3
does constraint our ability to add radical new features. So
eventually some filesystem will probably overtake ext2/3/4. Will that
btrfs? I don't think anyone can answer that question. I *have* been
helping out the btrfs design team, though, giving them advice such as
making sure that they try to gather contributors from a wide variety
of distributions and other Linux companies. So I hope they do become
successful. In the meantime, though, ext4 is a great extension to the
ext2/3 filesystem family. But in the long run, it may very well be
that btrfs will be more successful than some future attempt to create
an ext5; and that's fine.

Regards,

- Ted

2008-07-27 23:38:56

[permalink] [raw]

Subject: Re: Porting Zfs features to ext2/3

On Sun, Jul 27, 2008 at 04:54:41PM -0600, Eric Anopolsky wrote:
> On Sun, 2008-07-27 at 01:49 -0700, postrishi wrote:
> > Hello
> >
> > I want to know that has any work been done to port the Zfs features to
> > ext2/3
>
> Did you know that ZFS is available for Linux?

ZFS is available in a FUSE filesystem. As a userspace filesystem, it
means a huge number of context switches to get data between the disk,
to the kernel, to the FUSE userspace, back to the kernel, and to the
process trying to access the ZFS file. That's not going to be high
performance. For someone who wants to migrate from Solaris to Linux,
it might be useful, but I'm not sure you would really want to use a
ZFS/FUSE implementation in production.

- Ted

2008-07-28 03:59:32

by Shehjar Tikoo

[permalink] [raw]

Subject: Re: Porting Zfs features to ext2/3

Theodore Tso wrote:
> On Mon, Jul 28, 2008 at 08:49:18AM +1000, Shehjar Tikoo wrote:
>> Hi Ted
>>
>> Theodore Tso wrote:
>>> The btrfs filesystem effort is an attempt to create a filesystem that
>>> will leapfrog the ZFS feature set, but it will probably take longer to
>>> reach production ready status than ext4.
>> Since you mention btrfs here and since I've read this earlier too, do
>> you know if btrfs will be the default Linux file system in the future,
>> like extX has been?
>
> ...
> What happens in the future, who can say? At some point the ext2/3/4
> filesystem, which is based fundamentally on a BSD Fast Filesystem
> design base, may get displaced by a filesystem which uses some very
> different design as a starting point, when the advantages of starting
> with that different design outweighs the advantages of backwards
> compatibility and broad base of support which is enjoyed by ext2/3/4.

Thanks. That sums up the trade-off pretty clearly.

-Shehjar

2008-07-28 04:16:11

[permalink] [raw]

Subject: Re: Porting Zfs features to ext2/3

On Sun, 2008-07-27 at 19:38 -0400, Theodore Tso wrote:
> On Sun, Jul 27, 2008 at 04:54:41PM -0600, Eric Anopolsky wrote:
> > On Sun, 2008-07-27 at 01:49 -0700, postrishi wrote:
> > > Hello
> > >
> > > I want to know that has any work been done to port the Zfs features to
> > > ext2/3
> >
> > Did you know that ZFS is available for Linux?
>
> ZFS is available in a FUSE filesystem. As a userspace filesystem, it
> means a huge number of context switches to get data between the disk,
> to the kernel, to the FUSE userspace, back to the kernel, and to the
> process trying to access the ZFS file. That's not going to be high
> performance. For someone who wants to migrate from Solaris to Linux,
> it might be useful, but I'm not sure you would really want to use a
> ZFS/FUSE implementation in production.

It's true that ZFS on FUSE performance isn't all it could be right now.
However, ZFS on FUSE is currently not taking advantage of mechanisms
FUSE provides to improve performance. For an example of what can be
achieved, check out http://www.ntfs-3g.org/performance.html .

FWIW, I am satisfied with its performance for backups of my home
directory and for my fileserver, which is limited by a 100Mbps
connection to the rest of the network. I do not recommend it as a root
filesystem yet.

Cheers,
Eric

Attachments:

signature.asc (189.00 B)
This is a digitally signed message part

2008-07-28 12:41:20

[permalink] [raw]

Subject: Re: Porting Zfs features to ext2/3

On Sun, Jul 27, 2008 at 10:15:59PM -0600, Eric Anopolsky wrote:
> It's true that ZFS on FUSE performance isn't all it could be right now.
> However, ZFS on FUSE is currently not taking advantage of mechanisms
> FUSE provides to improve performance. For an example of what can be
> achieved, check out http://www.ntfs-3g.org/performance.html .

Yes... and take a look at the metadata operations numbers. FUSE can
do things to accellerate bulk read/write, but metadata-intensive
operations will (I suspect) always be slow. I also question whether
the FUSE implementation will have the safety that has always been the
Raison d'?tre of ZFS. Have you or the ZFS/FUSE developers done tests
where you are writing to the filesystem, and then someone pulls the
plug on the fileserver while ZFS is writing? Does the filesystem
recovery cleanly from such a scenario?

- Ted

2008-07-29 03:58:36

[permalink] [raw]

Subject: Re: Porting Zfs features to ext2/3

Please let me know if I'm getting off topic for the ext4-devel list. My
point is not to advocate ZFS over ext3/4 since ZFS still has its share
of issues. No resizing raidz vdevs, for example, and performance in
certain areas. My only point is to make it clear that ZFS on Linux is
available (and not necessarily a bad choice) to people reading the
ext4-devel mailing list looking for ZFS-like features like the original
poster.

On Mon, 2008-07-28 at 08:40 -0400, Theodore Tso wrote:
> On Sun, Jul 27, 2008 at 10:15:59PM -0600, Eric Anopolsky wrote:
> > It's true that ZFS on FUSE performance isn't all it could be right now.
> > However, ZFS on FUSE is currently not taking advantage of mechanisms
> > FUSE provides to improve performance. For an example of what can be
> > achieved, check out http://www.ntfs-3g.org/performance.html .
>
> Yes... and take a look at the metadata operations numbers. FUSE can
> do things to accellerate bulk read/write, but metadata-intensive
> operations will (I suspect) always be slow.

It doesn't seem too much worse than the other non-ext3 filesystems in
the comparison. I'm sure everyone would prefer a non-FUSE implementation
and the licensing issues aren't going to go away, but this post on Jeff
Bonwick's blog gives some hope:
http://blogs.sun.com/bonwick/entry/casablanca . Even so, not everyone
needs a whole lot of speed in the metadata operations area.

> I also question whether
> the FUSE implementation will have the safety that has always been the
> Raison d'être of ZFS. Have you or the ZFS/FUSE developers done tests
> where you are writing to the filesystem, and then someone pulls the
> plug on the fileserver while ZFS is writing? Does the filesystem
> recovery cleanly from such a scenario?

I haven't personally tried pulling the plug, but I've tried holding down
the power button on my laptop until it powers off. Everything works fine
and scrubs (the closest ZFS gets to fsck) don't report any checksum
errors. The filesystem driver updates the on-disk filesystem atomically
every five seconds (less time in special circumstances) so there's never
any point at which the filesystem would need recovery. The next time the
filesystem is mounted the system sees the state the filesystem was in up
to five seconds before the power went out. The FUSEness of the
filesystem driver doesn't seem to affect this.

Cheers,
Eric

Attachments:

signature.asc (189.00 B)
This is a digitally signed message part

2008-07-29 16:46:39

by Ric Wheeler

[permalink] [raw]

Subject: Re: Porting Zfs features to ext2/3

Eric Anopolsky wrote:
> Please let me know if I'm getting off topic for the ext4-devel list. My
> point is not to advocate ZFS over ext3/4 since ZFS still has its share
> of issues. No resizing raidz vdevs, for example, and performance in
> certain areas. My only point is to make it clear that ZFS on Linux is
> available (and not necessarily a bad choice) to people reading the
> ext4-devel mailing list looking for ZFS-like features like the original
> poster.
>
> On Mon, 2008-07-28 at 08:40 -0400, Theodore Tso wrote:
>
>> On Sun, Jul 27, 2008 at 10:15:59PM -0600, Eric Anopolsky wrote:
>>
>>> It's true that ZFS on FUSE performance isn't all it could be right now.
>>> However, ZFS on FUSE is currently not taking advantage of mechanisms
>>> FUSE provides to improve performance. For an example of what can be
>>> achieved, check out http://www.ntfs-3g.org/performance.html .
>>>
>> Yes... and take a look at the metadata operations numbers. FUSE can
>> do things to accellerate bulk read/write, but metadata-intensive
>> operations will (I suspect) always be slow.
>>
>
> It doesn't seem too much worse than the other non-ext3 filesystems in
> the comparison. I'm sure everyone would prefer a non-FUSE implementation
> and the licensing issues aren't going to go away, but this post on Jeff
> Bonwick's blog gives some hope:
> http://blogs.sun.com/bonwick/entry/casablanca . Even so, not everyone
> needs a whole lot of speed in the metadata operations area.
>
>
>> I also question whether
>> the FUSE implementation will have the safety that has always been the
>> Raison d'être of ZFS. Have you or the ZFS/FUSE developers done tests
>> where you are writing to the filesystem, and then someone pulls the
>> plug on the fileserver while ZFS is writing? Does the filesystem
>> recovery cleanly from such a scenario?
>>
>
> I haven't personally tried pulling the plug, but I've tried holding down
> the power button on my laptop until it powers off. Everything works fine
> and scrubs (the closest ZFS gets to fsck) don't report any checksum
> errors. The filesystem driver updates the on-disk filesystem atomically
> every five seconds (less time in special circumstances) so there's never
> any point at which the filesystem would need recovery. The next time the
> filesystem is mounted the system sees the state the filesystem was in up
> to five seconds before the power went out. The FUSEness of the
> filesystem driver doesn't seem to affect this.
>
> Cheers,
> Eric
>
Does that mean you always lose the last 5 seconds of data before the
power outage?

We had an earlier thread where Chris had a good test for making a case
for the write barrier code being enabled by default. It would be neat to
try that on ZFS ;-) The expected behaviour should be that any fsync()'ed
files should be there (regardless of the 5 seconds) and other
non-fsync'ed files might or might not be there, but that all file system
integrity is complete.

It would also be very interesting to try and do a drive hot pull.

Thanks!

Ric

2008-07-29 21:05:04

[permalink] [raw]

Subject: Re: Porting Zfs features to ext2/3

Theodore Tso <tytso <at> mit.edu> writes:
> On Sun, Jul 27, 2008 at 10:15:59PM -0600, Eric Anopolsky wrote:
> > It's true that ZFS on FUSE performance isn't all it could be right now.
> > However, ZFS on FUSE is currently not taking advantage of mechanisms
> > FUSE provides to improve performance. For an example of what can be
> > achieved, check out http://www.ntfs-3g.org/performance.html .
>
> Yes... and take a look at the metadata operations numbers.

Those are old numbers of the unoptimized ntfs-3g driver, which could be
at least 3-30 times better.

- create: until recently, when ext3 defaulted htree, the unoptimized
ntfs-3g was 2-4x faster. But nobody seems to really care because it's
not a real-world benchmark (creation of 0 byte size files).

- lookup: by enabling FUSE entry cache the performance will be exactly
the same (no user space involvement), or the bottleneck will be the
disk seek time and how an fs optimizes for it.

> FUSE can do things to accellerate bulk read/write,

FUSE can also cache attributes, positive/negative lookups, file data
and hopefully the new performance improving features, infrastructure
being worked on will be added in the future too.

> but metadata-intensive operations will (I suspect) always be slow.

Basically FUSE file systems can be considered as in-kernel network
file systems where the network latency is a context switch. Yes,
things need to be done a bit differently sometimes but achieving
high-performance even for metadata operations is not impossible.

> I also question whether
> the FUSE implementation will have the safety that has always been the
> Raison d'être of ZFS. Have you or the ZFS/FUSE developers done tests
> where you are writing to the filesystem, and then someone pulls the
> plug on the fileserver while ZFS is writing? Does the filesystem
> recovery cleanly from such a scenario?

This is an implementation detail, irrelevant to FUSE.

Regards, Szaka

--
NTFS-3G: http://ntfs-3g.org

2008-07-29 22:52:39

[permalink] [raw]

Subject: Re: Porting Zfs features to ext2/3

Theodore Tso <tytso <at> mit.edu> writes:
> On Sun, Jul 27, 2008 at 04:54:41PM -0600, Eric Anopolsky wrote:
> > On Sun, 2008-07-27 at 01:49 -0700, postrishi wrote:
> > >
> > > I want to know that has any work been done to port the Zfs features to
> > > ext2/3
> >
> > Did you know that ZFS is available for Linux?
>
> ZFS is available in a FUSE filesystem. As a userspace filesystem, it
> means a huge number of context switches to get data between the disk,
> to the kernel, to the FUSE userspace, back to the kernel, and to the
> process trying to access the ZFS file.

Transferring 4 kB from a commodity disk (disk seek + rotational delay +
transfer) takes usually between 100-4,000 usec. Two context switches
take about 2 usec, i.e. only 2-0.05% of the full data transfer time.

In other words, there isn't really time for huge number of context
switches because most of the time is spent waiting for the disk.

> That's not going to be high performance.

I did also an in memory test on a [email protected], with disk I/O completely
eliminated. Results:

tmpfs: 975 MB/sec
ntfs-3g: 889 MB/sec (note, this FUSE driver is not optimized yet)
ext3: 675 MB/sec

> For someone who wants to migrate from Solaris to Linux,
> it might be useful, but I'm not sure you would really want to use a
> ZFS/FUSE implementation in production.

It seems the problem is that the only active ZFS-FUSE developer was
hired by SUN and since then not much (visible) is happening.

Regards, Szaka

--
NTFS-3G: http://ntfs-3g.org

2008-07-30 01:29:47

[permalink] [raw]

Subject: Re: Porting Zfs features to ext2/3

On Tue, Jul 29, 2008 at 10:52:26PM +0000, Szabolcs Szakacsits wrote:
> I did also an in memory test on a [email protected], with disk I/O completely
> eliminated. Results:
>
> tmpfs: 975 MB/sec
> ntfs-3g: 889 MB/sec (note, this FUSE driver is not optimized yet)
> ext3: 675 MB/sec

Again, I agree that you can optimize bulk data transfer. It'll be
metadata operations where I'm really not convinced FUSE will be able
to be acceptable for many workloads. If you are doing a sequential
I/O in huge chunks, sure you can amortize the overhead of the
userspace context switches.

The test you did above looks bad because ext3 does lots of small I/O
to the loop device. The CPU overhead is not a big deal for real
disks, but when you do a pure memory test, it definitely becomes an
issue. Try doing an in-memory test with ext2, and you'll see much
better results, much closer to tmpfs. The reason? Blktrace tells the
tale. Ext2 looks like this:

254,4 1 1 0.000000000 23109 Q W 180224 + 96 [pdflush]
254,4 1 2 0.000030032 23109 Q W 180320 + 8 [pdflush]
254,4 1 3 0.000328538 23109 Q W 180328 + 1024 [pdflush]
254,4 1 4 0.000628162 23109 Q W 181352 + 1024 [pdflush]
254,4 1 5 0.000925550 23109 Q W 182376 + 1024 [pdflush]
254,4 1 6 0.001317715 23109 Q W 183400 + 1024 [pdflush]
254,4 1 7 0.001619783 23109 Q W 184424 + 1024 [pdflush]
254,4 1 8 0.001913400 23109 Q W 185448 + 1024 [pdflush]
254,4 1 9 0.002206738 23109 Q W 186472 + 1024 [pdflush]

Ext3 looks like this:

254,4 0 1 0.000000000 23109 Q W 131072 + 8 [pdflush]
254,4 0 2 0.000040578 23109 Q W 131080 + 8 [pdflush]
254,4 0 3 0.000059575 23109 Q W 131088 + 8 [pdflush]
254,4 0 4 0.000076617 23109 Q W 131096 + 8 [pdflush]
254,4 0 5 0.000093728 23109 Q W 131104 + 8 [pdflush]
254,4 0 6 0.000110211 23109 Q W 131112 + 8 [pdflush]
254,4 0 7 0.000127253 23109 Q W 131120 + 8 [pdflush]
254,4 0 8 0.000143735 23109 Q W 131128 + 8 [pdflush]

So it's issueing lots of 4k writes, one page at a time, because it
needs to track the completion of each block. This creates a
significant CPU overhead, which dominates in an all-memory test.
Although this is not an issue in real-life today, it will likely
become an issue in real-life solid state disks (SSD's).

Fortunately, ext4's blktrace when copying a large file looks like
this:

254,4 1 1 0.000000000 24574 Q R 648 + 8 [cp]
254,4 1 2 0.000059855 24574 U N [cp] 0
254,4 0 1 0.000427435 0 C R 648 + 8 [0]
254,4 1 3 0.385530672 24313 Q R 520 + 8 [pdflush]
254,4 1 4 0.385558400 24313 U N [pdflush] 0
254,4 1 5 0.385969143 0 C R 520 + 8 [0]
254,4 1 6 0.387101706 24313 Q W 114688 + 1024 [pdflush]
254,4 1 7 0.387269327 24313 Q W 115712 + 1024 [pdflush]
254,4 1 8 0.387434854 24313 Q W 116736 + 1024 [pdflush]
254,4 1 9 0.387598425 24313 Q W 117760 + 1024 [pdflush]
254,4 1 10 0.387831698 24313 Q W 118784 + 1024 [pdflush]
254,4 1 11 0.387996037 24313 Q W 119808 + 1024 [pdflush]
254,4 1 12 0.388162890 24313 Q W 120832 + 1024 [pdflush]
254,4 1 13 0.388325204 24313 Q W 121856 + 1024 [pdflush]

*Much* better. :-)

- Ted

2008-07-30 01:35:08

[permalink] [raw]

Subject: Re: Porting Zfs features to ext2/3

2008-07-30 06:00:29

[permalink] [raw]

Subject: Re: Porting Zfs features to ext2/3

On Tue, 2008-07-29 at 12:46 -0400, Ric Wheeler wrote:
> > I haven't personally tried pulling the plug, but I've tried holding down
> > the power button on my laptop until it powers off. Everything works fine
> > and scrubs (the closest ZFS gets to fsck) don't report any checksum
> > errors. The filesystem driver updates the on-disk filesystem atomically
> > every five seconds (less time in special circumstances) so there's never
> > any point at which the filesystem would need recovery. The next time the
> > filesystem is mounted the system sees the state the filesystem was in up
> > to five seconds before the power went out. The FUSEness of the
> > filesystem driver doesn't seem to affect this.
> >
> Does that mean you always lose the last 5 seconds of data before the
> power outage?

> The expected behaviour should be that any fsync()'ed
> files should be there (regardless of the 5 seconds) and other
> non-fsync'ed files might or might not be there, but that all file
> system integrity is complete.

fsync()s are one of the triggers that cause the filesystem driver to
update the disk more than once every five seconds. This can also happen
when the filesystem is very full and someone is writing to the disk.

> We had an earlier thread where Chris had a good test for making a case
> for the write barrier code being enabled by default. It would be neat to
> try that on ZFS ;-)

To my knowledge, write barriers are not available to FUSE filesystems.
Also, the maintainer of ZFS FUSE discovered that calling fsync() on an
open device file just plain does not work on Linux, so currently the
driver requests write cache flushes directly from the hardware.

> It would also be very interesting to try and do a drive hot pull.

I've got some vacation coming up...sounds like fun! Unless someone
specifically asks on here, I'll post the results in the ZFS on FUSE
mailing list.

Cheers,
Eric

Attachments:

signature.asc (189.00 B)
This is a digitally signed message part

2008-07-31 01:14:20

[permalink] [raw]

Subject: Re: Porting Zfs features to ext2/3

On Tue, 29 Jul 2008, Theodore Tso wrote:
> On Tue, Jul 29, 2008 at 10:52:26PM +0000, Szabolcs Szakacsits wrote:
> > I did also an in memory test on a [email protected], with disk I/O completely
> > eliminated. Results:
> >
> > tmpfs: 975 MB/sec
> > ntfs-3g: 889 MB/sec (note, this FUSE driver is not optimized yet)
> > ext3: 675 MB/sec
>
> Am I write in guessing that this test involved copying a single large
> file, with no seeks?

Yes, it was writing a large file on a newly created, loop mounted image
file.

> What happens if you try benchmarking unpacking a kernel source tar.bz2
> file? My guess is that ntfs-3g won't look as good. :-)

It seems the tar.bz2 number is not so bad relatively but the metadata
performance difference is much more visible by eliminating the compression
overhead. The results are in second.

ext3 ntfs-3g
tar.bz2 7.7 12.4
tar.gz 3.1 8.7
tar 1.4 7.6

A few seconds could be improved but I think getting similar numbers will
need major FUSE changes and non-trivial work to get rid of most contexts
switches which indeed seem to be the bottleneck according to the profiling
data.

Szaka

--
NTFS-3G: http://ntfs-3g.org

2008-08-04 20:38:21