2005-11-21 09:28:42

by Alfred Brons

Subject: what is our answer to ZFS?

Hi All,

I just noticed in the news this link:

http://www.opensolaris.org/os/community/zfs/demos/basics

I wonder what our response to this beast would be?

BTW, you can try it live using the Nexenta
GNU/Solaris LiveCD at
http://www.gnusolaris.org/gswiki/Download, which is an
Ubuntu-based OpenSolaris distribution.

So what is ZFS?

ZFS is a new kind of filesystem that provides simple
administration, transactional semantics, end-to-end
data integrity, and immense scalability. ZFS is not an
incremental improvement to existing technology; it is
a fundamentally new approach to data management. We've
blown away 20 years of obsolete assumptions,
eliminated complexity at the source, and created a
storage system that's actually a pleasure to use.

ZFS presents a pooled storage model that completely
eliminates the concept of volumes and the associated
problems of partitions, provisioning, wasted bandwidth
and stranded storage. Thousands of filesystems can
draw from a common storage pool, each one consuming
only as much space as it actually needs.

All operations are copy-on-write transactions, so the
on-disk state is always valid. There is no need to
fsck(1M) a ZFS filesystem, ever. Every block is
checksummed to prevent silent data corruption, and the
data is self-healing in replicated (mirrored or RAID)
configurations.
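
To make the checksum/self-healing idea concrete, here is a rough sketch
of the general approach in Python (this is not ZFS code; the hash choice
and the simple two-way mirror are just illustrative assumptions):

import hashlib

def checksum(data):
    # The checksum is recorded with the block *pointer*, not inside the
    # block itself, so a corrupt block cannot vouch for itself.
    return hashlib.sha256(data).hexdigest()

class MirroredStore:
    """Toy two-way mirror: every block is written to both sides and the
    expected checksum is recorded in the metadata at write time."""

    def __init__(self):
        self.side_a = {}
        self.side_b = {}
        self.expected = {}  # block id -> checksum recorded on write

    def write(self, block_id, data):
        self.side_a[block_id] = data
        self.side_b[block_id] = data
        self.expected[block_id] = checksum(data)

    def read(self, block_id):
        want = self.expected[block_id]
        for primary, other in ((self.side_a, self.side_b),
                               (self.side_b, self.side_a)):
            data = primary[block_id]
            if checksum(data) == want:
                # Self-healing: repair the other copy if it is bad.
                if checksum(other[block_id]) != want:
                    other[block_id] = data
                return data
        raise IOError("both copies of block %r are corrupt" % block_id)

The point is simply that corruption is detected on every read and
repaired from the good mirror copy instead of being silently returned.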

ZFS provides unlimited constant-time snapshots and
clones. A snapshot is a read-only point-in-time copy
of a filesystem, while a clone is a writable copy of a
snapshot. Clones provide an extremely space-efficient
way to store many copies of mostly-shared data such as
workspaces, software installations, and diskless
clients.
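
And a similarly rough sketch of why copy-on-write makes snapshots and
clones cheap (again purely illustrative, nothing to do with the real
on-disk format):

class CowFilesystem:
    """A filesystem is just a table mapping names to immutable data
    blocks; a write installs a new block, old blocks stay valid."""

    def __init__(self, table=None):
        self.table = dict(table or {})  # name -> immutable bytes

    def write(self, name, data):
        self.table[name] = bytes(data)  # new block; nothing overwritten

    def snapshot(self):
        # Copy only the references, not the data; the real thing keeps
        # a tree of block pointers so even this copy is not needed.
        return dict(self.table)

    def clone(self, snap):
        # A clone is a writable filesystem seeded from a snapshot; it
        # shares every block until it overwrites one.
        return CowFilesystem(snap)

fs = CowFilesystem()
fs.write("a", b"version 1")
snap = fs.snapshot()
clone = fs.clone(snap)
clone.write("a", b"version 2")
assert snap["a"] == b"version 1"         # snapshot is frozen
assert fs.table["a"] == b"version 1"     # original untouched
assert clone.table["a"] == b"version 2"  # only the clone diverged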

ZFS administration is both simple and powerful. The
tools are designed from the ground up to eliminate all
the traditional headaches relating to managing
filesystems. Storage can be added, disks replaced, and
data scrubbed with straightforward commands.
Filesystems can be created instantaneously, snapshots
and clones taken, native backups made, and a
simplified property mechanism allows for setting of
quotas, reservations, compression, and more.

Alfred





2005-11-21 09:44:45

by Paulo Jorge Matos

Subject: Re: what is our answer to ZFS?

Check Tarkan's "Sun ZFS and Linux" topic from 18th Nov on this mailing list.
http://marc.theaimsgroup.com/?l=linux-kernel&m=113235728212352&w=2

Cheers,

Paulo Matos

On 21/11/05, Alfred Brons <[email protected]> wrote:
> Hi All,
>
> I just noticed in the news this link:
>
> http://www.opensolaris.org/os/community/zfs/demos/basics
>
> [...]


--
Paulo Jorge Matos - pocm at sat inesc-id pt
Web: http://sat.inesc-id.pt/~pocm
Computer and Software Engineering
INESC-ID - SAT Group

2005-11-21 09:59:17

by Alfred Brons

Subject: Re: what is our answer to ZFS?

Thanks Paulo!
I wasn't aware of this thread.

But my question was: do we have similar functionality
in the Linux kernel?

Taking into account that ZFS is available as 100% open
source, I'm starting to think about migrating some of my
servers to Nexenta OS just because of this feature...

Alfred

--- Paulo Jorge Matos <[email protected]> wrote:

> Check Tarkan "Sun ZFS and Linux" topic on 18th Nov,
> on this mailing list.
> http://marc.theaimsgroup.com/?l=linux-kernel&m=113235728212352&w=2
>
> Cheers,
>
> Paulo Matos
>
> On 21/11/05, Alfred Brons <[email protected]> wrote:
> > [...]






2005-11-21 10:08:38

by Bernd Petrovitsch

Subject: Re: what is our answer to ZFS?

On Mon, 2005-11-21 at 01:59 -0800, Alfred Brons wrote:
[...]
> But my question was: do we have similar functionality
> in Linux kernel?

From reading over the marketing stuff, it seems that they now have LVM2
+ a journalling filesystem + some more nice-to-have features developed
(more or less) from scratch (or not?).
The Reiser folks would have to comment on the differences (or lack
thereof) from an LVM2 + reiser4 combination.

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services

2005-11-21 10:20:04

by Jörn Engel

Subject: Re: what is our answer to ZFS?

On Mon, 21 November 2005 01:59:15 -0800, Alfred Brons wrote:
>
> I wasn't aware of this thread.
>
> But my question was: do we have similar functionality
> in Linux kernel?

If you have a simple, technical list of the functionality, your
question will be easily answered. I still haven't found the time to
dig for all the information underneath the marketing blur.

o Checksums for data blocks
Done by jffs2, not done by any hard disk filesystems I'm aware of.

o Snapshots
Use device mapper.
Some log structured filesystems are also under development. For
them, snapshots will be trivial to add. But they don't really exist
yet. (I barely consider reiser4 to exist. Any filesystem that is
not considered good enough for kernel inclusion is effectively still
in development phase.)

o Merge of LVM and filesystem layer
Not done. This has some advantages, but also more complexity than
separate LVM and filesystem layers. Might be considered "not worth
it" for some years.

o 128 bit
On 32-bit machines, you can't even fully utilize a 64-bit filesystem
without VFS changes. Have you ever noticed? Thought so. (Rough
arithmetic after this list.)

o other
Dunno, what else they do. There's the official marketing feature
lists, but that's rather useless for comparisons.
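
Rough arithmetic behind the 32-bit remark above, assuming 4 KiB pages
and a page cache index that is a 32-bit unsigned long:

PAGE_SIZE = 4096                     # assuming 4 KiB pages
max_pages = 2 ** 32                  # 32-bit page index
max_bytes = max_pages * PAGE_SIZE
print(max_bytes)                     # 17592186044416
print(max_bytes // 2 ** 40, "TiB")   # 16 TiB -- nowhere near 2**64,
                                     # let alone 2**128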

Jörn

--
Measure. Don't tune for speed until you've measured, and even then
don't unless one part of the code overwhelms the rest.
-- Rob Pike

2005-11-21 11:20:13

by Andreas Happe

Subject: Re: what is our answer to ZFS?

On 2005-11-21, Alfred Brons <[email protected]> wrote:
> Thanks Paulo!
> I wasn't aware of this thread.
>
> But my question was: do we have similar functionality
> in Linux kernel?

>>> Every block is checksummed to prevent silent data corruption,
>>> and the data is self-healing in replicated (mirrored or RAID)
>>> configurations.

should not be filesystem specific.

>>> ZFS provides unlimited constant-time snapshots and clones. A
>>> snapshot is a read-only point-in-time copy of a filesystem, while a
>>> clone is a writable copy of a snapshot. Clones provide an extremely
>>> space-efficient way to store many copies of mostly-shared data such
>>> as workspaces, software installations, and diskless clients.

lvm2 can do those too (with any filesystem that supports resizing).
Clones would be the snapshot functionality of lvm2.

>>> ZFS administration is both simple and powerful. The tools are
>>> designed from the ground up to eliminate all the traditional
>>> headaches relating to managing filesystems. Storage can be added,
>>> disks replaced, and data scrubbed with straightforward commands.

lvm2.

>>> Filesystems can be created instantaneously, snapshots and clones
>>> taken, native backups made, and a simplified property mechanism
>>> allows for setting of quotas, reservations, compression, and more.

Except for per-file compression, all of this should be doable with normal
in-kernel filesystems. Per-file compression may be doable with ext2 and special
patches, an overlay filesystem, or reiser4.

Andreas

2005-11-21 11:30:48

by Anton Altaparmakov

Subject: Re: what is our answer to ZFS?

On Mon, 21 Nov 2005, Andreas Happe wrote:
> On 2005-11-21, Alfred Brons <[email protected]> wrote:
> > Thanks Paulo!
> > I wasn't aware of this thread.
> >
> > But my question was: do we have similar functionality
> > in Linux kernel?
[snip]
> >>> Filesystems can be created instantaneously, snapshots and clones
> >>> taken, native backups made, and a simplified property mechanism
> >>> allows for setting of quotas, reservations, compression, and more.
>
> excepct per-file compression all thinks should be doable with normal in-kernel
> fs. per-file compression may be doable with ext2 and special patches, an
> overlay filesystem or reiser4.

NTFS has per-file compression although I admit that in Linux this is
read-only at present (mostly because it is low priority).

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2005-11-21 11:45:56

by Diego Calleja

Subject: Re: what is our answer to ZFS?

On Mon, 21 Nov 2005 01:59:15 -0800 (PST),
Alfred Brons <[email protected]> wrote:

> Thanks Paulo!
> I wasn't aware of this thread.
>
> But my question was: do we have similar functionality
> in Linux kernel?
>
> Taking in account ZFS availability as 100% open
> source, I'm starting think about migration to Nexenta
> OS some of my servers just because of this feature...



There are some rumors saying that Sun might be considering a Linux port.

http://www.sun.com/emrkt/campaign_docs/expertexchange/knowledge/solaris_zfs_gen.html#10

Q: Any thoughts on porting ZFS to Linux, AIX, or HPUX?
A: No plans of porting to AIX and HPUX. Porting to Linux is currently
being investigated.

(Personally I doubt it; that FAQ was written some time ago and Sun's
executives change their opinion more often than Linus does ;)

2005-11-21 11:47:03

by Matthias Andree

Subject: Re: what is our answer to ZFS?

On Mon, 21 Nov 2005, Jörn Engel wrote:

> o Checksums for data blocks
> Done by jffs2, not done my any hard disk filesystems I'm aware of.

Then allow me to point you to the Amiga file systems. The variants
commonly dubbed "Old File System" use only 448 (IIRC) out of 512 bytes
in a data block for payload and put their block chaining information,
checksum and other "interesting" things into the blocks. This helps
recoverability a lot but kills performance, so many people (used to) use
the "Fast File System" that uses the full 512 bytes for data blocks.

Whether the Amiga FFS, even with multi-user and directory index updates,
has a lot of importance today, is a different question that you didn't
pose :-)

> yet. (I barely consider reiser4 to exist. Any filesystem that is
> not considered good enough for kernel inclusion is effectively still
> in development phase.)

What the heck is reiserfs? I faintly recall some weirdo crap that broke
NFS throughout the better parts of 2.2 and 2.4, would slowly write junk
into its structures that reiserfsck could only fix months later.

ReiserFS 3.6 still doesn't work right (you cannot create an arbitrary
amount of arbitrary filenames in any one directory even if there's
sufficient space), after a while in production, still random flaws in
the file systems that then require rebuild-tree that works only halfway.
No thanks.

Why would ReiserFS 4 be any different? IMO reiserfs4 should be blocked
from kernel baseline until:

- reiserfs 3.6 is fully fixed up

- reiserfs 4 has been debugged in production outside the kernel for at
least 24 months with a reasonable installed base, by for instance a
large distro using it for the root fs

- there are guarantees that reiserfs 4 will be maintained until the EOL
of the kernel branch it is included into, rather than the current "oh
we have a new toy and don't give a shit about 3.6" behavior.

Harsh words, I know, but either version of reiserfs is totally out of
the game while I have the systems administrator hat on, and the recent
fuss between Namesys and Christoph Hellwig certainly doesn't raise my
trust in reiserfs.

--
Matthias Andree

2005-11-21 11:59:34

by Diego Calleja

Subject: Re: what is our answer to ZFS?

On Mon, 21 Nov 2005 11:19:59 +0100,
Jörn Engel <[email protected]> wrote:

> question will be easily answered. I still haven't found the time to
> dig for all the information underneith the marketing blur.

Me neither, but now that we are talking about marketing impact, has anyone
run benchmarks on it? (I'd do it myself but downloading an ISO over a dialup
link takes some time 8)

I've found numbers against other kernels:
http://mail-index.netbsd.org/tech-perform/2005/11/18/0000.html
http://blogs.sun.com/roller/page/roch?entry=zfs_to_ufs_performance_comparison
http://blogs.sun.com/roller/page/erickustarz?entry=fs_perf_201_postmark
http://blogs.sun.com/roller/page/erickustarz?entry=fs_perf_102_filesystem_bw

2005-11-21 12:07:16

by Kasper Sandberg

Subject: Re: what is our answer to ZFS?

On Mon, 2005-11-21 at 12:46 +0100, Matthias Andree wrote:
> On Mon, 21 Nov 2005, J?rn Engel wrote:
>
> > o Checksums for data blocks
> > Done by jffs2, not done my any hard disk filesystems I'm aware of.
>
> Then allow me to point you to the Amiga file systems. The variants
> commonly dubbed "Old File System" use only 448 (IIRC) out of 512 bytes
> in a data block for payload and put their block chaining information,
> checksum and other "interesting" things into the blocks. This helps
> recoverability a lot but kills performance, so many people (used to) use
> the "Fast File System" that uses the full 512 bytes for data blocks.
>
> Whether the Amiga FFS, even with multi-user and directory index updates,
> has a lot of importance today, is a different question that you didn't
> pose :-)
>
> > yet. (I barely consider reiser4 to exist. Any filesystem that is
> > not considered good enough for kernel inclusion is effectively still
> > in development phase.)
That isn't true. Just because it isn't following the kernel coding style
and therefore has to be changed does not make it any bit more unstable.


>
> What the heck is reiserfs? I faintly recall some weirdo crap that broke
> NFS throughout the better parts of 2.2 and 2.4, would slowly write junk
> into its structures that reiserfsck could only fix months later.
Well... I remember that Linux 2.6.0 had a lot of bugs. Is 2.6.14 still
crap now that those particular bugs are fixed?

>
> ReiserFS 3.6 still doesn't work right (you cannot create an arbitrary
> amount of arbitrary filenames in any one directory even if there's
> sufficient space), after a while in production, still random flaws in
> the file systems that then require rebuild-tree that works only halfway.
> No thanks.
I have used reiserfs for a long time, and have never had a problem
that required me to use rebuild-tree, nor have issues requiring other
actions come up, unless I had been hard rebooting/shutting down, in which
case the journal simply replayed a few transactions.

>
> Why would ReiserFS 4 be any different? IMO reiserfs4 should be blocked
> from kernel baseline until:

You seem to believe that reiser4 (note: reiser4, NOT reiserfs4) is just
some simple new revision of reiserfs. Well, guess what: it's an entirely
different filesystem, which, before they began the changes to have it
merged, was completely stable, and I have confidence that it will be
just as stable again soon.

>
> - reiserfs 3.6 is fully fixed up
>
So you are saying that if for some reason the VIA IDE driver for old
chipsets is broken, we can't merge a VIA IDE driver for new IDE
controllers?

> - reiserfs 4 has been debugged in production outside the kernel for at
> least 24 months with a reasonable installed base, by for instance a
> large distro using it for the root fs
No distro will ever use it (except perhaps Linspire) before it's included in
the kernel.
>
> - there are guarantees that reiserfs 4 will be maintained until the EOL
> of the kernel branch it is included into, rather than the current "oh
> we have a new toy and don't give a shit about 3.6" behavior.
Why do you think that reiser4 will not be maintained? If there are bugs
in 3.6, Hans is still interested. But really, do you expect him to still
spend all his time trying to find bugs in 3.6, when people don't seem to
have issues, and when he has in fact created an entirely new
filesystem?
>
> Harsh words, I know, but either version of reiserfs is totally out of
> the game while I have the systems administrator hat on, and the recent
> fuss between Namesys and Christoph Hellwig certainly doesn't raise my
> trust in reiserfs.
So you are saying that if two people don't get along, the product the
one person creates somehow falls in quality?
>

2005-11-21 13:18:36

by Matthias Andree

Subject: Re: what is our answer to ZFS?

On Mon, 21 Nov 2005, Kasper Sandberg wrote:

> On Mon, 2005-11-21 at 12:46 +0100, Matthias Andree wrote:
> > On Mon, 21 Nov 2005, Jörn Engel wrote:
> >
> > > o Checksums for data blocks
> > > Done by jffs2, not done my any hard disk filesystems I'm aware of.
> >
> > Then allow me to point you to the Amiga file systems. The variants
> > commonly dubbed "Old File System" use only 448 (IIRC) out of 512 bytes

Make that 488. Amiga's traditional file system loses 6 longs (at 32 bits
each) according to Ralph Babel's "The Amiga Guru Book".

> > in a data block for payload and put their block chaining information,
> > checksum and other "interesting" things into the blocks. This helps
> > recoverability a lot but kills performance, so many people (used to) use
> > the "Fast File System" that uses the full 512 bytes for data blocks.
> >
> > Whether the Amiga FFS, even with multi-user and directory index updates,
> > has a lot of importance today, is a different question that you didn't
> > pose :-)
> >
> > > yet. (I barely consider reiser4 to exist. Any filesystem that is
> > > not considered good enough for kernel inclusion is effectively still
> > > in development phase.)

> that isnt true, just because it isnt following the kernel coding style
> and therefore has to be changed, does not make it any bit more unstable.

If the precondition is "adhere to CodingStyle or you don't get it in",
and the CodingStyle has been established for years, I have zero sympathy
with the maintainer if he's told "no, you didn't follow that well-known
style".

> > What the heck is reiserfs? I faintly recall some weirdo crap that broke
> > NFS throughout the better parts of 2.2 and 2.4, would slowly write junk
> > into its structures that reiserfsck could only fix months later.
> well.. i remember that linux 2.6.0 had alot of bugs, is 2.6.14 still
> crap because those particular bugs are fixed now?

Of course not. The point is, it will take many months to shake the bugs
out that are still in it and will only be revealed as it is tested in
more diverse configurations.

> > ReiserFS 3.6 still doesn't work right (you cannot create an arbitrary
> > amount of arbitrary filenames in any one directory even if there's
> > sufficient space), after a while in production, still random flaws in
> > the file systems that then require rebuild-tree that works only halfway.
> > No thanks.

> i have used reiserfs for a long time, and have never had the problem
> that i was required to use rebuild-tree, not have issues requiring other
> actions come, unless i have been hard rebooting/shutting down, in which
> case the journal simply replayed a few transactions.

I have had, without hard shutdowns, problems with reiserfs, and
occasionally problems that couldn't be fixed easily. I have never had
such with ext3 on the same hardware.

> > Why would ReiserFS 4 be any different? IMO reiserfs4 should be blocked
> > from kernel baseline until:
>
> you seem to believe that reiser4 (note, reiser4, NOT reiserfs4) is just
> some simple new revision of reiserfs. well guess what, its an entirely
> different filesystem, which before they began the changes to have it
> merged, was completely stable, and i have confidence that it will be
> just as stable again soon.

I don't care what its name is. I am aware it is a rewrite, and that is
reason to be all the more chary about adopting it early. People believed
3.5 to be stable, too, before someone tried NFS...

Historical fact is, ext3fs was very usable already in the later 0.0.2x
versions, and pretty stable in 0.0.7x, where x is some letter. All that
happened was applying some polish to make it shine, and that it does.

reiserfs was declared stable and then the problems only began. Certainly
merging kernel-space NFS was an additional obstacle at that time, so we
may speak in favor of Namesys because reiserfs was being merged into a
moving target.

However, as reiser4 is a major (or full) rewrite, I won't consider it
for anything except perhaps /var/cache before 2H2007.

> > - reiserfs 3.6 is fully fixed up
>
> so you are saying that if for some reason the via ide driver for old
> chipsets are broken, we cant merge a via ide driver for new ide
> controllers?

More generally, quality should be the prime directive. And before the
reiser4 guys focus on getting their gear merged and then the many bugs
shaken out (there will be bugs found), they should have a chance to
reschedule their internal work to get 3.6 fixed. If they can't, well,
time to mark it DEPRECATED before the new work is merged, and the new
stuff should be marked EXPERIMENTAL for a year.

> > - reiserfs 4 has been debugged in production outside the kernel for at
> > least 24 months with a reasonable installed base, by for instance a
> > large distro using it for the root fs
> no dist will ever use (except perhaps linspire) before its included in
> the kernel.

So you think? I beg to differ. SUSE adopted reiserfs pretty early,
and it has never shown the promised speed advantages over ext[23]fs in
my testing. SUSE have adopted submount, which also still lives outside
the kernel AFAIK.

> > - there are guarantees that reiserfs 4 will be maintained until the EOL
> > of the kernel branch it is included into, rather than the current "oh
> > we have a new toy and don't give a shit about 3.6" behavior.
> why do you think that reiser4 will not be maintained? if there are bugs
> in 3.6 hans is still interrested, but really, do you expect him to still
> spend all the time trying to find bugs in 3.6, when people dont seem to

I do expect Namesys to fix the *known* bugs, such as hash table overflow
preventing creation of new files. See above about DEPRECATED.

As long as reiserfs 3.6 and/or reiser4 are standalone projects that
live outside the kernel, nobody cares, but I think pushing forward to
adoption into the kernel baseline constitutes a commitment to maintaining
the code.

> have issues, and while he in fact has created an entirely new
> filesystem.

Yup. So the test and fix cycles that were needed for reiserfs 3.5 and
3.6 will start all over. I hope the Namesys guys were clueful enough to
run all their reiserfs 3.X regression tests against 4.X with all
plugins and switches, too.

> > Harsh words, I know, but either version of reiserfs is totally out of
> > the game while I have the systems administrator hat on, and the recent
> > fuss between Namesys and Christoph Hellwig certainly doesn't raise my
> > trust in reiserfs.
> so you are saying that if two people doesent get along the product the
> one person creates somehow falls in quality?

I wrote "trust", not "quality".

Part of my aversion to stuff that bears "reiser" in its name is the
way it is supposed to be merged upstream, and there Namesys is a bit
lacking. After all, they want their pet in the kernel; it's not that the
kernel wants reiser4.

--
Matthias Andree

2005-11-21 14:18:42

by Kasper Sandberg

Subject: Re: what is our answer to ZFS?

On Mon, 2005-11-21 at 14:18 +0100, Matthias Andree wrote:
> On Mon, 21 Nov 2005, Kasper Sandberg wrote:
>
> > On Mon, 2005-11-21 at 12:46 +0100, Matthias Andree wrote:
> > > On Mon, 21 Nov 2005, Jörn Engel wrote:
> > >
> > > > o Checksums for data blocks
> > > > Done by jffs2, not done my any hard disk filesystems I'm aware of.
> > >
> > > Then allow me to point you to the Amiga file systems. The variants
> > > commonly dubbed "Old File System" use only 448 (IIRC) out of 512 bytes
>
> Make that 488. Amiga's traditional file system loses 6 longs (at 32 bit
> each) according to Ralph Babel's "The Amiga Guru Book".
>
> > > in a data block for payload and put their block chaining information,
> > > checksum and other "interesting" things into the blocks. This helps
> > > recoverability a lot but kills performance, so many people (used to) use
> > > the "Fast File System" that uses the full 512 bytes for data blocks.
> > >
> > > Whether the Amiga FFS, even with multi-user and directory index updates,
> > > has a lot of importance today, is a different question that you didn't
> > > pose :-)
> > >
> > > > yet. (I barely consider reiser4 to exist. Any filesystem that is
> > > > not considered good enough for kernel inclusion is effectively still
> > > > in development phase.)
>
> > that isnt true, just because it isnt following the kernel coding style
> > and therefore has to be changed, does not make it any bit more unstable.
>
> If the precondition is "adhere to CodingStyle or you don't get it in",
> and the CodingStyle has been established for years, I have zero sympathy
> with the maintainer if he's told "no, you didn't follow that well-known
> style".

That was not the question. The question is whether the code is in a
development phase or not (i.e. stable or not). Agreed, it's their own fault
for not writing code which matches the kernel coding style; however,
that doesn't make it the least bit more unstable.

>
> > > What the heck is reiserfs? I faintly recall some weirdo crap that broke
> > > NFS throughout the better parts of 2.2 and 2.4, would slowly write junk
> > > into its structures that reiserfsck could only fix months later.
> > well.. i remember that linux 2.6.0 had alot of bugs, is 2.6.14 still
> > crap because those particular bugs are fixed now?
>
> Of course not. The point is, it will take many months to shake the bugs
> out that are still in it and will only be revealed as it is tested in
> more diverse configurations.

> > > ReiserFS 3.6 still doesn't work right (you cannot create an arbitrary
> > > amount of arbitrary filenames in any one directory even if there's
> > > sufficient space), after a while in production, still random flaws in
> > > the file systems that then require rebuild-tree that works only halfway.
> > > No thanks.
>
> > i have used reiserfs for a long time, and have never had the problem
> > that i was required to use rebuild-tree, not have issues requiring other
> > actions come, unless i have been hard rebooting/shutting down, in which
> > case the journal simply replayed a few transactions.
>
> I have had, without hard shutdowns, problems with reiserfs, and
> occasionally problems that couldn't be fixed easily. I have never had
> such with ext3 on the same hardware.
>
You wouldn't want to know what ext3 did to me, which reiserfs AND reiser4
never did.

> > > Why would ReiserFS 4 be any different? IMO reiserfs4 should be blocked
> > > from kernel baseline until:
> >
> > you seem to believe that reiser4 (note, reiser4, NOT reiserfs4) is just
> > some simple new revision of reiserfs. well guess what, its an entirely
> > different filesystem, which before they began the changes to have it
> > merged, was completely stable, and i have confidence that it will be
> > just as stable again soon.
>
> I don't care what its name is. I am aware it is a rewrite, and that is
> reason to be all the more chary about adopting it early. People believed
> 3.5 to be stable, too, before someone tried NFS...
NFS works fine with reiser4. You are judging reiser4 by the problems
reiserfs had.
>
> Historical fact is, ext3fs was very usable already in the later 0.0.2x
> versions, and pretty stable in 0.0.7x, where x is some letter. All that
> happened was applying some polish to make it shine, and that it does.
>
> reiserfs was declared stable and then the problems only began. Certainly
> merging kernel-space NFS was an additional obstacle at that time, so we
> may speak in favor of Namesys because reiserfs was into a merging
> target.
>
> However, as reiser4 is a major (or full) rewrite, I won't consider it
> for anything except perhaps /var/cache before 2H2007.
>
I have had less trouble using the reiser4 patches, before even Hans
considered them stable, than I had using ext3.

> > > - reiserfs 3.6 is fully fixed up
> >
> > so you are saying that if for some reason the via ide driver for old
> > chipsets are broken, we cant merge a via ide driver for new ide
> > controllers?
>
> More generally, quality should be the prime directive. And before the
> reiser4 guys focus on getting their gear merged and then the many bugs
> shaken out (there will be bugs found), they should have a chance to
> reschedule their internal work to get 3.6 fixed. If they can't, well,
> time to mark it DEPRECATED before the new work is merged, and the new
> stuff should be marked EXPERIMENTAL for a year.
So then mark reiser4 experimental, as Namesys themselves wanted.

>
> > > - reiserfs 4 has been debugged in production outside the kernel for at
> > > least 24 months with a reasonable installed base, by for instance a
> > > large distro using it for the root fs
> > no dist will ever use (except perhaps linspire) before its included in
> > the kernel.
>
> So you think? I beg to differ. SUSE have adopted reiserfs pretty early,
> and it has never shown the promised speed advantages over ext[23]fs in
> my testing. SUSE have adopted submount, which also still lives outside
> the kernel AFAIK.
There is quite a big difference between stuff like submount and the
filesystem itself... and as you pointed out, reiserfs was a disappointment
in the beginning; do you seriously think they are willing to take
the chance again?

>
> > > - there are guarantees that reiserfs 4 will be maintained until the EOL
> > > of the kernel branch it is included into, rather than the current "oh
> > > we have a new toy and don't give a shit about 3.6" behavior.
> > why do you think that reiser4 will not be maintained? if there are bugs
> > in 3.6 hans is still interrested, but really, do you expect him to still
> > spend all the time trying to find bugs in 3.6, when people dont seem to
>
> I do expect Namesys to fix the *known* bugs, such as hash table overflow
> preventing creation of new files. See above about DEPRECATED.
>
Reiser4 is meant to be better than reiserfs, which is perhaps also one
reason he wants it merged. But agreed, known bugs should be fixed.

> As long as reiserfs 3.6 and/or reiser 4 are standalone projects that
> live outside the kernel, nobody cares, but I think pushing forward to
> adoption into kernel baseline consistutes a commitment to maintaining
> the code.
>
> > have issues, and while he in fact has created an entirely new
> > filesystem.
>
> Yup. So the test and fix cycles that were needed for reiserfs 3.5 and
> 3.6 will start all over. I hope the Namesys guys were to clueful as to
> run all their reiserfs 3.X regression tests against 4.X with all
> plugins and switches, too.
You will find that reiser4 is actually very, very good.
>
> > > Harsh words, I know, but either version of reiserfs is totally out of
> > > the game while I have the systems administrator hat on, and the recent
> > > fuss between Namesys and Christoph Hellwig certainly doesn't raise my
> > > trust in reiserfs.
> > so you are saying that if two people doesent get along the product the
> > one person creates somehow falls in quality?
>
> I wrote "trust", not "quality".
my bad.
>
> Part of my aversion against stuff that bears "reiser" in its name is the
> way how it is supposed to be merged upstream, and there Namesys is a bit
> lacking. After all, they want their pet in the kernel, not the kernel
> wants reiser4.
>

2005-11-21 14:19:41

by Tarkan Erimer

Subject: Re: what is our answer to ZFS?

On 11/21/05, Diego Calleja <[email protected]> wrote:
>
> There're some rumors saying that sun might be considering a linux port.
>
> http://www.sun.com/emrkt/campaign_docs/expertexchange/knowledge/solaris_zfs_gen.html#10
>
> Q: Any thoughts on porting ZFS to Linux, AIX, or HPUX?
> A: No plans of porting to AIX and HPUX. Porting to Linux is currently
> being investigated.
>
> (personally I doubt it, that FAQ was written some time ago and Sun's
> executives change their opinion more often than Linus does ;)

If it happened, Sun or someone would have ported it to Linux.
We would need some VFS changes to handle 128-bit filesystems, as Jörn Engel
mentioned in a previous mail in this thread. Is there any plan or action
to make the VFS handle 128-bit filesystems like ZFS or future 128-bit
filesystems? Any VFS people, please reply to this.

Regards

2005-11-21 14:41:58

by Matthias Andree

Subject: Re: what is our answer to ZFS?

On Mon, 21 Nov 2005, Kasper Sandberg wrote:

> > If the precondition is "adhere to CodingStyle or you don't get it in",
> > and the CodingStyle has been established for years, I have zero sympathy
> > with the maintainer if he's told "no, you didn't follow that well-known
> > style".
>
> that was not the question, the question is if the code is in development
> phase or not (being stable or not), where agreed, its their own fault
> for not writing code which matches the kernel in coding style, however
> that doesent make it the least bit more unstable.

As mentioned, a file system cannot possibly be stable right after merge.
Having to change formatting is a sweeping change, and it certainly is a
barrier that makes auditing across it all the more difficult.

> > I have had, without hard shutdowns, problems with reiserfs, and
> > occasionally problems that couldn't be fixed easily. I have never had
> > such with ext3 on the same hardware.
> >
> you wouldnt want to know what ext3 did to me, which reiserfs AND reiser4
> never did

OK, we have diametrically opposed experiences, and I'm not asking, since I
trust you that I don't want to know :) Let's leave it at that.

> > I don't care what its name is. I am aware it is a rewrite, and that is
> > reason to be all the more chary about adopting it early. People believed
> > 3.5 to be stable, too, before someone tried NFS...

> nfs works fine with reiser4. you are judging reiser4 by the problems
> reiserfs had.

Of course I do, same project lead, and probably many of the same
developers. While they may (and probably will) learn from mistakes,
changing style is more difficult - and that was one of the major
reasons for the non-acceptance reiser4 suffered.

I won't subscribe to reiser4 specific topics before I've tried it, so
I'll quit. Same about ZFS by the way, it'll be fun some day to try on a
machine that it can trash at will, but for production, it will have to
prove itself first. After all, Sun are still fixing ufs and/or logging
bugs in Solaris 8. (And that's good, they still fix things, and it also
shows how long it takes to really get a file system stable.)

> i have had less trouble by using the reiser4 patches before even hans
> considered it stable than i had by using ext3.

Lucky you. I haven't dared try it yet for lack of a test computer to
trash.

> there is quite a big difference between stuff like submount and the
> filesystem itself.. and as you pointed out, reiserfs in the beginning
> was a disappointment, do you seriously think they are willing to take
> the chance again?

I think naught about what they're going to put at stake. reiserfs 3 was
an utter failure for me. It was raved about, hyped, and the bottom line
was wasted time and a major disappointment.

> > Yup. So the test and fix cycles that were needed for reiserfs 3.5 and
> > 3.6 will start all over. I hope the Namesys guys were to clueful as to
> > run all their reiserfs 3.X regression tests against 4.X with all
> > plugins and switches, too.
> you will find that reiser4 is actually very very good.

I haven't asked what I'd find, because I'm not searching. And I might
find something other than you did - perhaps because you'll have picked up
all the good things already by the time I finally get there ;-)

--
Matthias Andree

2005-11-21 15:08:37

by Kasper Sandberg

Subject: Re: what is our answer to ZFS?

On Mon, 2005-11-21 at 15:41 +0100, Matthias Andree wrote:
> On Mon, 21 Nov 2005, Kasper Sandberg wrote:
>
> > > If the precondition is "adhere to CodingStyle or you don't get it in",
> > > and the CodingStyle has been established for years, I have zero sympathy
> > > with the maintainer if he's told "no, you didn't follow that well-known
> > > style".
> >
> > that was not the question, the question is if the code is in development
> > phase or not (being stable or not), where agreed, its their own fault
> > for not writing code which matches the kernel in coding style, however
> > that doesent make it the least bit more unstable.
>
> As mentioned, a file system cannot possibly be stable right after merge.
> Having to change formatting is a sweeping change and certainly is a
> barrier across which to look for auditing is all the more difficult.
Before reiser4 was changed a lot to match the coding style (agreed, they
have to abide by the kernel's coding style), it was stable, so had it been
merged then it wouldn't have been any less stable.

>
> > > I have had, without hard shutdowns, problems with reiserfs, and
> > > occasionally problems that couldn't be fixed easily. I have never had
> > > such with ext3 on the same hardware.
> > >
> > you wouldnt want to know what ext3 did to me, which reiserfs AND reiser4
> > never did
>
> OK, we have diametral experiences, and I'm not asking since I trust you
> that I don't want to know, too :) Let's leave it at that.
>
> > > I don't care what its name is. I am aware it is a rewrite, and that is
> > > reason to be all the more chary about adopting it early. People believed
> > > 3.5 to be stable, too, before someone tried NFS...
>
> > nfs works fine with reiser4. you are judging reiser4 by the problems
> > reiserfs had.
>
> Of course I do, same project lead, and probably many of the same
> developers. While they may (and probably will) learn from mistakes,
> changing style is more difficult - and that resulted in one of the major
> non-acceptance reasons reiser4 suffered.
>
> I won't subscribe to reiser4 specific topics before I've tried it, so
> I'll quit. Same about ZFS by the way, it'll be fun some day to try on a
> machine that it can trash at will, but for production, it will have to
> prove itself first. After all, Sun are still fixing ufs and/or logging
> bugs in Solaris 8. (And that's good, they still fix things, and it also
> shows how long it takes to really get a file system stable.)
>
> > i have had less trouble by using the reiser4 patches before even hans
> > considered it stable than i had by using ext3.
>
> Lucky you. I haven't dared try it yet for lack of a test computer to
> trash.
I too was reluctant; I ended up using it for the things I REALLY don't
want to lose.
>
> > there is quite a big difference between stuff like submount and the
> > filesystem itself.. and as you pointed out, reiserfs in the beginning
> > was a disappointment, do you seriously think they are willing to take
> > the chance again?
>
> I thing naught about what they're going to put at stake. reiserfs 3 was
> an utter failure for me. It was raved about, hyped, and the bottom line
> was wasted time and a major disappointment.
>
> > > Yup. So the test and fix cycles that were needed for reiserfs 3.5 and
> > > 3.6 will start all over. I hope the Namesys guys were to clueful as to
> > > run all their reiserfs 3.X regression tests against 4.X with all
> > > plugins and switches, too.
> > you will find that reiser4 is actually very very good.
>
> I haven't asked what I'd find, because I'm not searching. And I might
> find something else than you did - perhaps because you've picked up all
> the good things already when I'll finally go there ;-)
>

2005-11-21 18:17:45

by Rob Landley

Subject: Re: what is our answer to ZFS?

On Monday 21 November 2005 05:45, Diego Calleja wrote:
> El Mon, 21 Nov 2005 01:59:15 -0800 (PST),
> There're some rumors saying that sun might be considering a linux port.
>
> http://www.sun.com/emrkt/campaign_docs/expertexchange/knowledge/solaris_zfs_gen.html#10
>
> Q: Any thoughts on porting ZFS to Linux, AIX, or HPUX?
> A: No plans of porting to AIX and HPUX. Porting to Linux is currently
> being investigated.

Translation: We'd like to dangle a carrot in front of Linux users in hopes
they'll try out this feature and possibly get interested in switching to
Solaris because of it. Don't hold your breath on us actually shipping
anything. But we didn't open source Solaris due to competitive pressure from
AIX or HPUX users, so they don't even get the carrot.

Rob

2005-11-21 18:52:15

by Rob Landley

Subject: Re: what is our answer to ZFS?

On Monday 21 November 2005 08:19, Tarkan Erimer wrote:
> On 11/21/05, Diego Calleja <[email protected]> wrote:
> If It happenned, Sun or someone has port it to linux.
> We will need some VFS changes to handle 128 bit FS as "Jörn ENGEL"
> mentionned previous mail in this thread. Is there any plan or action
> to make VFS handle 128 bit File Sytems like ZFS or future 128 bit
> File Systems ? Any VFS people reply to this, please?

I believe that on 64 bit platforms, Linux has a 64 bit clean VFS. Python says
2**64 is 18446744073709551616, and that's roughly:
18,446,744,073,709,551,616 bytes
18,446,744,073,709 megs
18,446,744,073 gigs
18,446,744 terabytes
18,446 ... what are those, petabytes?
18 Really big lumps of data we won't be using for a while yet.
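
Here's the short version, for anyone who wants to check my arithmetic
(decimal units, same as above):

n = 2 ** 64
print(n, "bytes")        # 18446744073709551616
n //= 1000 ** 2
for unit in ("megs", "gigs", "terabytes", "petabytes", "exabytes"):
    print(n, unit)       # ends with 18446 petabytes, 18 exabytes
    n //= 1000

So yes, the 18,446 are petabytes and the "really big lumps" are exabytes.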

And that's just 64 bits. Keep in mind it took us around fifty years to burn
through the _first_ thirty two (which makes sense, since Moore's Law says we
need 1 more bit every 18 months). We may go through it faster than we went
through the first 32 bits, but it'll last us a couple decades at least.

Now I'm not saying we won't exhaust 64 bits eventually. Back to chemistry, it
takes 6.02*10^23 protons to weigh 1 gram, and that's just about 2^79, so it's
feasible that someday we might be able to store more than 64 bits of data per
gram, let alone in big room-sized clusters. But it's not going to be for
years and years, and that's a design problem for Sun.
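
Quick sanity check on that 2^79 figure:

import math
print(math.log(6.02e23, 2))   # ~78.99, so 6.02*10^23 is indeed about 2**79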

Sun is proposing it can predict what storage layout will be efficient for as
yet unheard of quantities of data, with unknown access patterns, at least a
couple decades from now. It's also proposing that data compression and
checksumming are the filesystem's job. Hands up anybody who spots
conflicting trends here already? Who thinks the 128 bit requirement came
from marketing rather than the engineers?

If you're worried about being able to access your data 2 or 3 decades from
now, you should _not_ be worried about choice of filesystem. You should be
worried about making it _independent_ of what filesystem it's on. For
example, none of the current journaling filesystems in Linux were available
20 years ago, because fsck didn't emerge as a bottleneck until filesystem
sizes got really big.

Rob

2005-11-21 19:29:17

by Diego Calleja

Subject: Re: what is our answer to ZFS?

On Mon, 21 Nov 2005 12:52:04 -0600,
Rob Landley <[email protected]> wrote:

> If you're worried about being able to access your data 2 or 3 decades from
> now, you should _not_ be worried about choice of filesystem. You should be

Sun has invested $4.1bn in buying StorageTek and more money in buying other
small storage companies (i.e. they're focusing a lot on "storage"). ZFS
fits perfectly there.

2005-11-21 20:03:04

by Bernd Petrovitsch

Subject: Re: what is our answer to ZFS?

On Mon, 2005-11-21 at 12:52 -0600, Rob Landley wrote:
[...]
> couple decades from now. It's also proposing that data compression and
> checksumming are the filesystem's job. Hands up anybody who spots
> conflicting trends here already? Who thinks the 128 bit requirement came
> from marketing rather than the engineers?

Without compression you probably need 256 bits.

SCNR,
Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services



2005-11-21 20:48:50

by jdow

Subject: Re: what is our answer to ZFS?

From: "Matthias Andree" <[email protected]>

> On Mon, 21 Nov 2005, Kasper Sandberg wrote:
>
>> On Mon, 2005-11-21 at 12:46 +0100, Matthias Andree wrote:
>> > On Mon, 21 Nov 2005, Jörn Engel wrote:
>> >
>> > > o Checksums for data blocks
>> > > Done by jffs2, not done my any hard disk filesystems I'm aware of.
>> >
>> > Then allow me to point you to the Amiga file systems. The variants
>> > commonly dubbed "Old File System" use only 448 (IIRC) out of 512 bytes
>
> Make that 488. Amiga's traditional file system loses 6 longs (at 32 bit
> each) according to Ralph Babel's "The Amiga Guru Book".

FYI it was not used very often on hard disk file systems. The effect on
performance was "remarkable". Each disk block contained a simple ulong
checksum, a pointer to the next block in the file, and a pointer to the
previous block in the file. The entire file system was built of doubly
linked lists. It was possible to effect remarkable levels of "unerase"
and recover from disk corruption better than most other filesystems I
have seen. But it made watching glass flow seem fast when you tried to
use it. So as soon as the Amiga Fast File System, FFS, was developed,
OFS became a floppy-only tool. That lasted until FFS was enabled for
floppies, months later. Then OFS became a legacy compatibility feature
that was seldom if ever used by real people. I am not sure how I would
apply a checksum to each block of a file and still maintain reasonable
access speeds. It would be entertaining to see what the ZFS file system
does in this regard so that it doesn't slow down to essentially
single-block-per-transaction disk reads or huge RAM buffer areas such as
had to be used with OFS.
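
From memory, an OFS-style data block looked roughly like the sketch
below. The field names are mine and the details may be off; the point is
the 24 bytes of per-block metadata (6 longs, per Matthias's correction)
and the "all longwords sum to zero" style of checksum I remember:

import struct

BLOCK = 512
META_LONGS = 6                        # 6 longs of metadata per block
DATA_BYTES = BLOCK - 4 * META_LONGS   # 488 payload bytes

def make_data_block(prev_ptr, next_ptr, seq, size, payload):
    # Block-type tag, chain pointers, sequence number, payload size,
    # checksum. The checksum is chosen so that all 128 longwords of the
    # block sum to zero modulo 2**32.
    assert len(payload) <= DATA_BYTES
    payload = payload.ljust(DATA_BYTES, b"\0")
    head = [8, prev_ptr, next_ptr, seq, size, 0]   # checksum filled below
    block = struct.pack(">6L", *head) + payload
    total = sum(struct.unpack(">128L", block)) & 0xFFFFFFFF
    head[5] = (-total) & 0xFFFFFFFF
    return struct.pack(">6L", *head) + payload

def block_ok(block):
    return sum(struct.unpack(">128L", block)) & 0xFFFFFFFF == 0

blk = make_data_block(prev_ptr=0, next_ptr=0, seq=1, size=11,
                      payload=b"hello world")
assert block_ok(blk)

Maintaining that per-block metadata on every transfer is part of why it
felt so slow.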

>> > in a data block for payload and put their block chaining information,
>> > checksum and other "interesting" things into the blocks. This helps
>> > recoverability a lot but kills performance, so many people (used to) use
>> > the "Fast File System" that uses the full 512 bytes for data blocks.
>> >
>> > Whether the Amiga FFS, even with multi-user and directory index updates,
>> > has a lot of importance today, is a different question that you didn't
>> > pose :-)

Amiga FFS has some application today, generally for archival data
recovery. I am quite happy that potential is available. The Amiga FFS
and OFS had some features mildly incompatible with 'ix type filesystems,
and these features were used frequently. So it is easier to perpetuate
the old Amiga FFS images than to copy them over in many cases.

>> that isnt true, just because it isnt following the kernel coding style
>> and therefore has to be changed, does not make it any bit more unstable.
>
> If the precondition is "adhere to CodingStyle or you don't get it in",
> and the CodingStyle has been established for years, I have zero sympathy
> with the maintainer if he's told "no, you didn't follow that well-known
> style".

Personally I am not a fan of the Linux coding style. However, if I am
going to commit a patch or a large block of Linux only code then its
style will match my understanding of the Linux coding style. This is
merely a show of professionalism on the part of the person creating the
code or patch. A brand new religious war over the issue is the mark of
a stupid boor at this time. It is best to go with the flow. The worst
code to maintain is code that contains eleven thousand eleven hundred
eleven individual idiosyncratic coding styles.

> Matthias Andree

{^_^} Joanne Dow, who pretty much knows Amiga filesystems inside and
out if I feel a need to refresh my working memory on the subject.


2005-11-21 22:39:18

by Bill Davidsen

Subject: Re: what is our answer to ZFS?

Kasper Sandberg wrote:
> On Mon, 2005-11-21 at 14:18 +0100, Matthias Andree wrote:

>>I don't care what its name is. I am aware it is a rewrite, and that is
>>reason to be all the more chary about adopting it early. People believed
>>3.5 to be stable, too, before someone tried NFS...
>
> nfs works fine with reiser4. you are judging reiser4 by the problems
> reiserfs had.

reiser4 will have far more problems than 3.5 without a doubt. The NFS
problem was because it was a use which had not been properly tested, and
that was because it had not been envisioned. You test for the cases you
can envision, the "this is how people will use it" cases. He is judging
by the problems of any increasingly complex software.

reiser4 has a ton of new features not found in other filesystems, and
the developers can't begin to guess how people will use them because
people never had these features before. When files were read, write,
create, delete, permissions and seek, you could think of the ways people
would use them because there were so few things you could do. Then came
attrs, ACLs, etc, etc. All of a sudden people were doing things they
never did before, and there were unforeseen, unintended, unsupported
interactions which went off on code paths which reminded people of "the
less traveled way" in the poem. Developers looked at bug reports and
asked why anyone would ever do THAT? But the bugs got fixed and ext3
became stable.

People are going to do things the reiser4 developers didn't envision,
they are going to run it over LVM on top of multilevel RAID using nbd as
part of the array, on real-time, preemptable, NUMA-enabled kernels, on
hardware platforms at best lightly tested... and reiser4 will regularly
lose bladder control because someone has just found another "can't
happen" or "no one would do that" path.

This isn't a criticism of reiser4, Matthias and others are just pointing
out that once any complex capability is added, people will use it in
unexpected ways and it will fail. So don't bother to even think that it
matters that it's been stable for you, because you haven't begun to
drive the wheels of it, no one person can.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2005-11-21 23:03:05

by Bill Davidsen

Subject: Re: what is our answer to ZFS?

Rob Landley wrote:
> On Monday 21 November 2005 08:19, Tarkan Erimer wrote:
>
>>On 11/21/05, Diego Calleja <[email protected]> wrote:
>>If It happenned, Sun or someone has port it to linux.
>>We will need some VFS changes to handle 128 bit FS as "Jörn ENGEL"
>>mentionned previous mail in this thread. Is there any plan or action
>>to make VFS handle 128 bit File Sytems like ZFS or future 128 bit
>>File Systems ? Any VFS people reply to this, please?
>
>
> I believe that on 64 bit platforms, Linux has a 64 bit clean VFS. Python says
> 2**64 is 18446744073709551616, and that's roughly:
> 18,446,744,073,709,551,616 bytes
> 18,446,744,073,709 megs
> 18,446,744,073 gigs
> 18,446,744 terabytes
> 18,446 ... what are those, petabytes?
> 18 Really big lumps of data we won't be using for a while yet.
>
> And that's just 64 bits. Keep in mind it took us around fifty years to burn
> through the _first_ thirty two (which makes sense, since Moore's Law says we
> need 1 more bit every 18 months). We may go through it faster than we went
> through the first 32 bits, but it'll last us a couple decades at least.
>
> Now I'm not saying we won't exhaust 64 bits eventually. Back to chemistry, it
> takes 6.02*10^23 protons to weigh 1 gram, and that's just about 2^79, so it's
> feasible that someday we might be able to store more than 64 bits of data per
> gram, let alone in big room-sized clusters. But it's not going to be for
> years and years, and that's a design problem for Sun.

There's a more limiting problem: energy. Assume that the energy to set
one bit is the energy to reverse the spin of an electron; call that s.
If each value of a 128-bit address corresponds to a single byte, then
T = s * 8 * 2^128 and T > B
where T is the total energy to low-level format the storage, and B is
the energy to boil all the oceans of the earth. That was in one of the
physics magazines earlier this year. There just isn't enough usable
energy to write that much data.
>
> Sun is proposing it can predict what storage layout will be efficient for as
> yet unheard of quantities of data, with unknown access patterns, at least a
> couple decades from now. It's also proposing that data compression and
> checksumming are the filesystem's job. Hands up anybody who spots
> conflicting trends here already? Who thinks the 128 bit requirement came
> from marketing rather than the engineers?

Not me. If you are going larger than 64 bits, there is no good reason not
to double the size: it avoids some problems by fitting nicely into two
64-bit registers without truncation or extension. And we will never need
more than 128 bits, so the addressing problems are solved.
>
> If you're worried about being able to access your data 2 or 3 decades from
> now, you should _not_ be worried about choice of filesystem. You should be
> worried about making it _independent_ of what filesystem it's on. For
> example, none of the current journaling filesystems in Linux were available
> 20 years ago, because fsck didn't emerge as a bottleneck until filesystem
> sizes got really big.

I'm gradually copying backups from the 90's off DC600 tapes to CDs,
knowing that they will require at least one more copy in my lifetime
(hopefully).
--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2005-11-22 00:15:50

by Bernd Eckenfels

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

In article <[email protected]> you wrote:
> I believe that on 64 bit platforms, Linux has a 64 bit clean VFS. Python says
> 2**64 is 18446744073709551616, and that's roughly:
> 18,446,744,073,709,551,616 bytes
> 18,446,744,073,709 megs
> 18,446,744,073 gigs
> 18,446,744 terabytes
> 18,446 ... what are those, petabytes?
> 18 Really big lumps of data we won't be using for a while yet.

The problem is not about file size. It is about, for example, unique inode
numbers. If you have a file system which spans multiple volumes and maybe
nodes, you need more bits to uniquely address the files and blocks.

Gruss
Bernd

2005-11-22 00:24:44

by Jeffrey V. Merkey

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

Bernd Eckenfels wrote:

>In article <[email protected]> you wrote:
>
>
>>I believe that on 64 bit platforms, Linux has a 64 bit clean VFS. Python says
>>2**64 is 18446744073709551616, and that's roughly:
>>18,446,744,073,709,551,616 bytes
>>18,446,744,073,709 megs
>>18,446,744,073 gigs
>>18,446,744 terabytes
>>18,446 ... what are those, pedabytes (petabytes?)
>>18 zetabytes
>>
There you go. I deal with this a lot, so those are the names.

Linux is currently limited to 16 TB per VFS mount point, it's all mute, unless VFS gets fixed.
mmap won't go above this at present.

Jeff



2005-11-22 01:13:41

by Pavel Machek

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

Hi!

> > If It happenned, Sun or someone has port it to linux.
> > We will need some VFS changes to handle 128 bit FS as "Jörn ENGEL"
> > mentionned previous mail in this thread. Is there any plan or action
> > to make VFS handle 128 bit File Sytems like ZFS or future 128 bit
> > File Systems ? Any VFS people reply to this, please?
>
> I believe that on 64 bit platforms, Linux has a 64 bit clean VFS. Python says
> 2**64 is 18446744073709551616, and that's roughly:
> 18,446,744,073,709,551,616 bytes
> 18,446,744,073,709 megs
> 18,446,744,073 gigs
> 18,446,744 terabytes
> 18,446 ... what are those, petabytes?
> 18 Really big lumps of data we won't be using for a while yet.
>
> And that's just 64 bits. Keep in mind it took us around fifty years to burn
> through the _first_ thirty two (which makes sense, since Moore's Law says we
> need 1 more bit every 18 months). We may go through it faster than we went
> through the first 32 bits, but it'll last us a couple decades at least.
>
> Now I'm not saying we won't exhaust 64 bits eventually. Back to chemistry, it
> takes 6.02*10^23 protons to weigh 1 gram, and that's just about 2^79, so it's
> feasible that someday we might be able to store more than 64 bits of data per
> gram, let alone in big room-sized clusters. But it's not going to be for
> years and years, and that's a design problem for Sun.
>
> Sun is proposing it can predict what storage layout will be efficient for as
> yet unheard of quantities of data, with unknown access patterns, at least a
> couple decades from now. It's also proposing that data compression and
> checksumming are the filesystem's job. Hands up anybody who spots
> conflicting trends here already? Who thinks the 128 bit requirement came
> from marketing rather than the engineers?

Actually, if you are storing information in single protons, I'd say
you _need_ checksumming :-).

[I actually agree with Sun here, not trusting disk is good idea. At
least you know kernel panic/oops/etc can't be caused by bit corruption on
the disk.]

Pavel
--
Thanks, Sharp!

2005-11-22 05:43:03

by Rob Landley

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Monday 21 November 2005 14:02, Bernd Petrovitsch wrote:
> On Mon, 2005-11-21 at 12:52 -0600, Rob Landley wrote:
> [...]
>
> > couple decades from now. It's also proposing that data compression and
> > checksumming are the filesystem's job. Hands up anybody who spots
> > conflicting trends here already? Who thinks the 128 bit requirement came
> > from marketing rather than the engineers?
>
> Without compressing you probably need 256 bits.

I assume this is sarcasm. Once again assuming you can someday manage to store
1 bit per electron, 2^256 bits implies a corresponding 2^256 protons*, which
would weigh (in grams):

> print 2**256/(6.02*(10**23))
1.92345663185e+53

Google for the weight of the earth:
http://www.ecology.com/earth-at-a-glance/earth-at-a-glance-feature/
Earth's Weight (Mass): 5.972 sextillion (1,000 trillion) metric tons.
Yeah, alright, mass... So that's 5.972*10^21 metric tons, and a metric ton is
a million grams, so 5.972*10^27 grams...

Google for the mass of the sun says that's 2*10^33 grams. Still nowhere
close.

Basically, as far as I can tell, any device capable of storing 2^256 bits
would collapse into a black hole under its own weight.

By the way, 2^128/avogadro gives 5.65253101198e+14, or 565 million metric
tons. For comparison, the empire state building:
http://www.newyorktransportation.com/info/empirefact2.html
Is 365,000 tons. (Probably not metric, but you get the idea.) Assuming I
haven't screwed up the math, an object capable of storing anywhere near 2^128
bits (constructed as a single giant molecule) would probably be in the size
ballpark of New York, London, or Tokyo.

2^64 we may actually live to see the end of someday, but it's not guaranteed.
2^128 becoming relevant in our lifetimes is a touch unlikely.
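
A quick Python sketch of the same arithmetic, assuming one proton per
stored bit and Avogadro's number as above:

    avogadro = 6.02e23                 # protons per gram, roughly
    print(2**256 / avogadro)           # ~1.92e53 g, far heavier than the sun's ~2e33 g
    print(2**128 / avogadro)           # ~5.65e14 g
    print(2**128 / avogadro / 1e6)     # ~5.65e8, i.e. roughly 565 million metric tons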

Rob

* Yeah, I'm glossing over neutrons. I'm also glossing over the possibility of
storing more than one bit per electron and other quantum strangeness. I
have no idea how you'd _build_ one of these suckers. Nobody does yet.
They're working on it...

2005-11-22 06:34:44

by Rob Landley

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Monday 21 November 2005 18:45, Pavel Machek wrote:
> Hi!
> > Sun is proposing it can predict what storage layout will be efficient for
> > as yet unheard of quantities of data, with unknown access patterns, at
> > least a couple decades from now. It's also proposing that data
> > compression and checksumming are the filesystem's job. Hands up anybody
> > who spots conflicting trends here already? Who thinks the 128 bit
> > requirement came from marketing rather than the engineers?
>
> Actually, if you are storing information in single protons, I'd say
> you _need_ checksumming :-).

You need error correcting codes at the media level. A molecular storage
system like this would probably look a lot more like flash or dram than it
would magnetic media. (For one thing, I/O bandwidth and seek times become a
serious bottleneck with high density single point of access systems.)

> [I actually agree with Sun here, not trusting disk is good idea. At
> least you know kernel panic/oops/etc can't be caused by bit corruption on
> the disk.]

But who said the filesystem was the right level to do this at?

Rob

2005-11-22 07:15:25

by Rob Landley

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Monday 21 November 2005 18:15, Bernd Eckenfels wrote:
> In article <[email protected]> you wrote:
> > I believe that on 64 bit platforms, Linux has a 64 bit clean VFS. Python
> > says 2**64 is 18446744073709551616, and that's roughly:
> > 18,446,744,073,709,551,616 bytes
> > 18,446,744,073,709 megs
> > 18,446,744,073 gigs
> > 18,446,744 terabytes
> > 18,446 ... what are those, petabytes?
> > 18 Really big lumps of data we won't be using for a while yet.
>
> The prolem is not about file size. It is about for example unique inode
> numbers. If you have a file system which spans multiple volumnes and maybe
> nodes, you need more unqiue methods of addressing the files and blocks.

18 quintillion inodes are enough to give every IPv4 address on the internet 4
billion unique inodes. I take it this is not enough space for Sun to work
out a reasonable allocation strategy in?
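
The arithmetic is just a split of the 64-bit space, e.g. in Python:

    ipv4_addresses = 2**32             # every possible IPv4 address
    print(2**64 // ipv4_addresses)     # 4294967296, i.e. ~4 billion inodes per address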

Rob


2005-11-22 07:45:58

by Christoph Hellwig

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Mon, Nov 21, 2005 at 03:59:52PM -0700, Jeff V. Merkey wrote:
> Linux is currently limited to 16 TB per VFS mount point, it's all mute,
> unless VFS gets fixed.
> mmap won't go above this at present.

You're thinking of 32bit architectures. There is no such limit for
64 bit architectures. There are XFS volumes in the 100TB range in production
use.

2005-11-22 07:51:58

by Christoph Hellwig

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

> o 128 bit
> On 32bit machines, you can't even fully utilize a 64bit filesystem
> without VFS changes. Have you ever noticed? Thought so.

What is a '128 bit' or '64 bit' filesystem anyway? This description doesn't
make any sense, as there are many different things that can be
addressed in filesystems, and those can be addressed in different ways.
I guess from the marketing documents that they do 128 bit _byte_ addressing
for diskspace. All the interesting Linux filesystems do _block_ addressing
though, and 64 bits addressing large enough blocks is quite huge.
128-bit inode numbers again are something we couldn't easily implement; it
would mean a non-scalar ino_t type, which is guaranteed to break userspace.
A 128-bit i_size? Again, that would totally break userspace because it
expects off_t to be a scalar, so every single file must fit into 64-bit
_byte_ addressing. If the surrounding environment changes (e.g. we get a
128-bit scalar type on 64-bit architectures) that could change pretty
easily, similarly to how ext2 got a 64-bit i_size during the 2.3.x LFS work.
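
To put a rough number on the block-addressing point (a sketch assuming
4 KiB blocks; larger blocks push the limit out even further):

    block_size = 4096                  # bytes per block, a common filesystem block size
    max_bytes = 2**64 * block_size     # 64-bit *block* addresses
    print(max_bytes)                   # 75557863725914323419136 bytes
    print(max_bytes / 2.0**70)         # 64.0, i.e. 64 ZiB of addressable space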

2005-11-22 08:16:41

by Bernd Eckenfels

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

In article <[email protected]> you wrote:
> 18 quintillion inodes are enough to give every ipv4 address on the internet 4
> billion unique inodes. I take it this is not enough space for Sun to work
> out a reasonable allocation strategy in?

Yes, I think that's why they did it. However, with IPv6 it becomes one inode per node.

Gruss
Bernd

2005-11-22 08:52:43

by Matthias Andree

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Mon, 21 Nov 2005, Kasper Sandberg wrote:

> > As mentioned, a file system cannot possibly be stable right after merge.
> > Having to change formatting is a sweeping change and certainly is a
> > barrier across which to look for auditing is all the more difficult.
> before reiser4 was changed alot, to match the codingstyle (agreed, they
> have to obey by the kernels codingstyle), it was stable, so had it been
> merged there it wouldnt have been any less stable.

Code reformatting, unless 100% automatic with a 100% proven and C99
aware formatting tool, also introduces instability.

> > Lucky you. I haven't dared try it yet for lack of a test computer to
> > trash.
> i too was reluctant, i ended up using it for the things i REALLY dont
> want to loose.

So did many when reiser3 was fresh; there was much raving about its speed,
stability, its alleged recoverability and recovery speed, and then
people started sending full filesystem dumps on tape and other media to
Namesys...

It's impossible to fully test nontrivial code; every option and every
possible state multiplies the number of cases you have to
test to claim 100% coverage.

--
Matthias Andree

2005-11-22 09:20:51

by Matthias Andree

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Mon, 21 Nov 2005, Rob Landley wrote:

> On Monday 21 November 2005 08:19, Tarkan Erimer wrote:
> > On 11/21/05, Diego Calleja <[email protected]> wrote:
> > If It happenned, Sun or someone has port it to linux.
> > We will need some VFS changes to handle 128 bit FS as "Jörn ENGEL"
> > mentionned previous mail in this thread. Is there any plan or action
> > to make VFS handle 128 bit File Sytems like ZFS or future 128 bit
> > File Systems ? Any VFS people reply to this, please?
>
> I believe that on 64 bit platforms, Linux has a 64 bit clean VFS. Python says
> 2**64 is 18446744073709551616, and that's roughly:
> 18,446,744,073,709,551,616 bytes
> 18,446,744,073,709 megs

18,446,744,073,710 Mbytes (round up)

> 18,446,744,073 gigs
> 18,446,744 terabytes
> 18,446 ... what are those, petabytes?

18,447 Pbytes, right.

> 18 Really big lumps of data we won't be using for a while yet.

18 Exabytes, indeed.

Sun decided it won't have to think about sizing again for a long while, and
looking at how long UFS has been around, Sun may have the better laugh in the end.

> Sun is proposing it can predict what storage layout will be efficient for as
> yet unheard of quantities of data, with unknown access patterns, at least a
> couple decades from now. It's also proposing that data compression and
> checksumming are the filesystem's job. Hands up anybody who spots
> conflicting trends here already? Who thinks the 128 bit requirement came
> from marketing rather than the engineers?

Is that important? Who says Sun isn't going to put checksumming and
compression hardware into its machines, and tell ZFS and their hardware
drivers to use it? Keep ZFS tuned for new requirements as they emerge?

AFAIK, no-one has suggested ZFS yet for floppies (including LS120, ZIP
and that stuff - it was also a major hype, now with DVD-RAM, DVD+RW and
DVD-RW few people talk about LS120 or ZIP any more).

What if some breakthrough in storage gives us vastly larger storage
densities (larger than the predicted hard disk density increases) in 10
years, for the same price as a 200 or 300 GB disk drive now?

--
Matthias Andree

2005-11-22 09:25:12

by Matthias Andree

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Mon, 21 Nov 2005, Rob Landley wrote:

> 2^64 we may actually live to see the end of someday, but it's not guaranteed.
> 2^128 becoming relevant in our lifetimes is a touch unlikely.

Some people suggested we don't know usage and organization patterns yet;
perhaps something that is very sparse can benefit from linear addressing
in a huge (not to say vastly oversized) address space. Perhaps not.

One real-world example is that we've been doing RAM overcommit for a
long time to account for but not actually perform memory allocations,
and on 32-bit machines, 1 GB of RAM already required highmem until
recently. So here, 64-bit address space comes as an advantage.

--
Matthias Andree

2005-11-22 09:45:57

by Jeffrey V. Merkey

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

Christoph Hellwig wrote:

>On Mon, Nov 21, 2005 at 03:59:52PM -0700, Jeff V. Merkey wrote:
>
>
>>Linux is currently limited to 16 TB per VFS mount point, it's all mute,
>>unless VFS gets fixed.
>>mmap won't go above this at present.
>>
>>
>
>You're thinking of 32bit architectures. There is no such limit for
>64 bit architectures. There are XFS volumes in the 100TB range in production
>use.
>
>
>
>
I have 128 TB volumes in production use on 32 bit processors.

Jeff

2005-11-22 10:00:42

by Tarkan Erimer

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On 11/22/05, Matthias Andree <[email protected]> wrote:
> What if some breakthrough in storage gives us vastly larger (larger than
> predicted harddisk storage density increases) storage densities in 10
> years for the same price of a 200 or 300 GB disk drive now?

If all the speculations about AtomChip Corp.'s
(http://www.atomchip.com) optical technology are true, we will begin to use
really large RAM and storage much earlier than we expected.
Their prototypes already start at 1 TB (both for RAM and storage).
It's not hard to imagine that, a few years later, we could be using
storage and RAM of 100-200 TB and up.

Regards

2005-11-22 10:28:33

by Jörn Engel

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Tue, 22 November 2005 07:51:48 +0000, Christoph Hellwig wrote:
>
> > o 128 bit
> > On 32bit machines, you can't even fully utilize a 64bit filesystem
> > without VFS changes. Have you ever noticed? Thought so.
>
> What is a '128 bit' or '64 bit' filesystem anyway? This description doesn't
> make any sense, as there are many different things that can be
> addresses in filesystems, and those can be addressed in different ways.
> I guess from the marketing documents that they do 128 bit _byte_ addressing
> for diskspace. All the interesting Linux filesystems do _block_ addressing
> though, and 64bits addressing large enough blocks is quite huge.
> 128bit inodes again is something could couldn't easily implement, it would
> mean a non-scalar ino_t type which guarantees to break userspace. 128
> i_size? Again that would totally break userspace because it expects off_t
> to be a scalar, so every single file must fit into 64bit _byte_ addressing.
> If the surrounding enviroment changes (e.g. we get a 128bit scalar type
> on 64bit architectures) that could change pretty easily, similarly to how
> ext2 got a 64bit i_size during the 2.3.x LFS work.

...once the need arises. Even with byte addressing, 64 bits are enough
to handle roughly 46116860 of the biggest hard disks currently
available. Looks like we still have a bit of time to think about the
problem before action is required.
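
That figure assumes drives of roughly 400 GB, about the biggest
available at the time:

    print(2**64 / 400e9)               # ~46116860 such 400 GB drives fit in 64-bit byte addressing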

Jörn

--
Victory in war is not repetitious.
-- Sun Tzu

2005-11-22 11:17:30

by Jörn Engel

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Mon, 21 November 2005 12:48:44 -0800, jdow wrote:
>
> that was seldom if ever used by real people. I am not sure how I would
> apply a checksum to each block of a file and still maintain reasonable
> access speeds. It would be entertaining to see what the ZFS file system
> does in this regard so that it doesn't slow down to essentially single
> block per transaction disk reads or huge RAM buffer areas such as had
> to be used with OFS.

The design should be just as ZFS allegedly does it: store the checksum
near the indirect block pointers. Seeks for checksums basically don't
exist, as you need to seek for the indirect block pointers anyway.
The only drawback is the effective growth of the area for the
pointers+checksum blocks, which has a small impact on your caches.
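
A toy sketch of that layout (not ZFS's actual on-disk format, just the
idea that each child pointer carries the checksum of the block it points
to, so verification costs no extra seek):

    import zlib

    class IndirectBlock:
        def __init__(self):
            self.entries = []          # list of (child block number, checksum) pairs

        def add_child(self, block_number, data):
            # the checksum is stored next to the pointer, not next to the data
            self.entries.append((block_number, zlib.crc32(data)))

        def verify_child(self, index, data):
            # reading the indirect block already fetched the checksum,
            # so checking the child needs no additional seek
            block_number, stored = self.entries[index]
            return zlib.crc32(data) == stored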

Jörn

--
Courage is not the absence of fear, but rather the judgement that
something else is more important than fear.
-- Ambrose Redmoon

2005-11-22 14:51:15

by Theodore Ts'o

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Tue, Nov 22, 2005 at 07:51:48AM +0000, Christoph Hellwig wrote:
>
> What is a '128 bit' or '64 bit' filesystem anyway? This description doesn't
> make any sense, as there are many different things that can be
> addresses in filesystems, and those can be addressed in different ways.
> I guess from the marketing documents that they do 128 bit _byte_ addressing
> for diskspace. All the interesting Linux filesystems do _block_ addressing
> though, and 64bits addressing large enough blocks is quite huge.
> 128bit inodes again is something could couldn't easily implement, it would
> mean a non-scalar ino_t type which guarantees to break userspace. 128
> i_size? Again that would totally break userspace because it expects off_t
> to be a scalar, so every single file must fit into 64bit _byte_ addressing.
> If the surrounding enviroment changes (e.g. we get a 128bit scalar type
> on 64bit architectures) that could change pretty easily, similarly to how
> ext2 got a 64bit i_size during the 2.3.x LFS work.

I will note though that there are people who are asking for 64-bit
inode numbers on 32-bit platforms, since 2**32 inodes are not enough
for certain distributed/clustered filesystems. And this is something
we don't yet support today, and probably will need to think about much
sooner than 128-bit filesystems....


- Ted

2005-11-22 15:27:50

by Jan Harkes

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Tue, Nov 22, 2005 at 09:50:47AM -0500, Theodore Ts'o wrote:
> I will note though that there are people who are asking for 64-bit
> inode numbers on 32-bit platforms, since 2**32 inodes are not enough
> for certain distributed/clustered filesystems. And this is something
> we don't yet support today, and probably will need to think about much
> sooner than 128-bit filesystems....

As far as the kernel is concerned this hasn't been a problem in a while
(2.4.early). The iget4 operation that was introduced by reiserfs (now
iget5) pretty much makes it possible for a filesystem to use anything to
identify its inodes. The 32-bit inode numbers are simply used as a hash
index.

The only things that tend to break are userspace archiving tools like
tar, which assume that two objects with the same 32-bit st_ino value are
identical. I think that by now several actually double-check that the
inode link count is larger than 1.
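
The usual userspace approach is roughly this (a sketch of what tar-like
tools do, not any particular tool's code):

    import os

    seen = {}                          # (st_dev, st_ino) -> first path encountered

    def classify(path):
        st = os.lstat(path)
        if st.st_nlink > 1:            # only multi-link files can be hardlinks
            key = (st.st_dev, st.st_ino)
            if key in seen:
                return ('hardlink of', seen[key])
            seen[key] = path
        return ('regular file', path)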

Jan

2005-11-22 15:49:30

by Jan Dittmer

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

Tarkan Erimer wrote:
> On 11/22/05, Matthias Andree <[email protected]> wrote:
>
>>What if some breakthrough in storage gives us vastly larger (larger than
>>predicted harddisk storage density increases) storage densities in 10
>>years for the same price of a 200 or 300 GB disk drive now?
>
>
> If all the speculations are true for AtomChip Corp.'s
> (http://www.atomchip.com) Optical Technology. We wil begin to use
> really large RAMs and Storages very early than we expected.
> Their prototypes already begin with 1 TB (both for RAM and Storage).
> It's not hard to imagine, a few years later, we can use 100-200 and up
> TB Storages and RAMs.

http://www.portablegadgets.net/article/59/atomchip-is-a-hoax

Jan

2005-11-22 15:57:54

by Bill Davidsen

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

Jeff V. Merkey wrote:
> Bernd Eckenfels wrote:
>
>> In article <[email protected]> you wrote:
>>
>>
>>> I believe that on 64 bit platforms, Linux has a 64 bit clean VFS.
>>> Python says 2**64 is 18446744073709551616, and that's roughly:
>>> 18,446,744,073,709,551,616 bytes
>>> 18,446,744,073,709 megs
>>> 18,446,744,073 gigs
>>> 18,446,744 terabytes
>>> 18,446 ... what are those, pedabytes (petabytes?)
>>> 18 zetabytes
>>>
> There you go. I deal with this a lot so, those are the names.
>
> Linux is currently limited to 16 TB per VFS mount point, it's all mute,
> unless VFS gets fixed.
> mmap won't go above this at present.
>
What does "it's all mute" mean?

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2005-11-22 16:15:02

by Randy Dunlap

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Tue, 22 Nov 2005, Bill Davidsen wrote:

> Jeff V. Merkey wrote:
> > Bernd Eckenfels wrote:
> >
> >> In article <[email protected]> you wrote:
> >>
> >>
> >>> I believe that on 64 bit platforms, Linux has a 64 bit clean VFS.
> >>> Python says 2**64 is 18446744073709551616, and that's roughly:
> >>> 18,446,744,073,709,551,616 bytes
> >>> 18,446,744,073,709 megs
> >>> 18,446,744,073 gigs
> >>> 18,446,744 terabytes
> >>> 18,446 ... what are those, pedabytes (petabytes?)
> >>> 18 zetabytes
> >>>
> > There you go. I deal with this a lot so, those are the names.
> >
> > Linux is currently limited to 16 TB per VFS mount point, it's all mute,
> > unless VFS gets fixed.
> > mmap won't go above this at present.
> >
> What does "it's all mute" mean?

It means "it's all moot."

--
~Randy

2005-11-22 16:17:24

by Chris Adams

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

Once upon a time, Jan Harkes <[email protected]> said:
>The only thing that tends to break are userspace archiving tools like
>tar, which assume that 2 objects with the same 32-bit st_ino value are
>identical.

That assumption is probably made because that's what POSIX and Single
Unix Specification define: "The st_ino and st_dev fields taken together
uniquely identify the file within the system." Don't blame code that
follows standards for breaking.

>I think that by now several actually double check that the inode
>linkcount is larger than 1.

That is not a good check. I could have two separate files that have
multiple links; if st_ino is the same, how can tar make sense of it?
--
Chris Adams <[email protected]>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.

2005-11-22 16:24:48

by Bill Davidsen

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

Tarkan Erimer wrote:
> On 11/22/05, Matthias Andree <[email protected]> wrote:
>
>>What if some breakthrough in storage gives us vastly larger (larger than
>>predicted harddisk storage density increases) storage densities in 10
>>years for the same price of a 200 or 300 GB disk drive now?
>
>
> If all the speculations are true for AtomChip Corp.'s
> (http://www.atomchip.com) Optical Technology. We wil begin to use
> really large RAMs and Storages very early than we expected.
> Their prototypes already begin with 1 TB (both for RAM and Storage).
> It's not hard to imagine, a few years later, we can use 100-200 and up
> TB Storages and RAMs.
>
Amazing technology: run XP on a 256-bit 6.8GHz proprietary quantum CPU,
by breaking the words into 64-bit pieces and passing them to XP via a
"RAM packet counter" device.

And they run four copies of XP at once, too, and you don't need to boot
them, they run instantly because... the web page says so?

I assume this is a joke, a scam would have prices ;-)
--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2005-11-22 16:28:54

by Theodore Ts'o

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Tue, Nov 22, 2005 at 10:25:31AM -0500, Jan Harkes wrote:
> On Tue, Nov 22, 2005 at 09:50:47AM -0500, Theodore Ts'o wrote:
> > I will note though that there are people who are asking for 64-bit
> > inode numbers on 32-bit platforms, since 2**32 inodes are not enough
> > for certain distributed/clustered filesystems. And this is something
> > we don't yet support today, and probably will need to think about much
> > sooner than 128-bit filesystems....
>
> As far as the kernel is concerned this hasn't been a problem in a while
> (2.4.early). The iget4 operation that was introduced by reiserfs (now
> iget5) pretty much makes it possible for a filesystem to use anything to
> identify it's inodes. The 32-bit inode numbers are simply used as a hash
> index.

iget4 wasn't even strictly necessary, unless you want to use the inode
cache (which has always been strictly optional for filesystems, even
inode-based ones) --- Linux's VFS is dentry-based, not inode-based, so
we don't use inode numbers to index much of anything inside the
kernel, other than the aforementioned optional inode cache.

The main issue is the lack of a 64-bit interface to extract inode
numbers, which is needed as you point out for userspace archiving
tools like tar. There are also other programs or protocols that in the
past have broken as a result of inode number collisions.

As another example, a quick Google search indicates that some mail
programs can use inode numbers as a part of a technique to create
unique filenames in maildir directories. One could easily also
imagine using inode numbers as part of creating unique ids returned by
an IMAP server --- not something I would recommend, but it's an
example of what some people might have done, since everybody _knows_
they can count on inode numbers on Unix systems, right? POSIX
promises that they won't break!
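
For illustration, such a maildir-style name generator might look like
this (hypothetical code, not any particular mail program's; the inode
number is exactly the part that breaks if st_ino wraps or is reused):

    import os, socket, time

    def maildir_unique_name(delivered_path):
        st = os.stat(delivered_path)
        return "%d.P%dI%d.%s" % (int(time.time()), os.getpid(),
                                 st.st_ino, socket.gethostname())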

> The only thing that tends to break are userspace archiving tools like
> tar, which assume that 2 objects with the same 32-bit st_ino value are
> identical. I think that by now several actually double check that the
> inode linkcount is larger than 1.

Um, that's not good enough to avoid failure modes; consider what might
happen if you have two inodes that have hardlinks, so that st_nlink >
1, but whose inode numbers are the same if you only look at the low 32
bits? Oops.

It's not a bad heuristic if you don't have that many hard-linked
files on your system, but if you have a huge number of hard-linked
trees (such as you might find on a kernel developer's machine),
I wouldn't want to count on this always working.
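
For example, two distinct 64-bit inode numbers that collide once
truncated to 32 bits:

    low32 = 0xffffffff
    ino_a = (1 << 32) + 12345          # two different 64-bit inode numbers...
    ino_b = (7 << 32) + 12345
    print(ino_a != ino_b)                       # True
    print((ino_a & low32) == (ino_b & low32))   # True: identical in the low 32 bits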

- Ted

2005-11-22 16:38:39

by Steve Flynn

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On 22/11/05, Randy.Dunlap <[email protected]> wrote:
> On Tue, 22 Nov 2005, Bill Davidsen wrote:
> > Jeff V. Merkey wrote:
> > > Bernd Eckenfels wrote:
> > > Linux is currently limited to 16 TB per VFS mount point, it's all mute,
> > > unless VFS gets fixed.
> > > mmap won't go above this at present.
> > >
> > What does "it's all mute" mean?
>
> It means "it's all moot."

On the contrary, "all mute" is correct - indicating that it doesn't
really matter. All moot means it's open to debate, which is the
opposite of what Bernd meant.

I'll get back to lurking and being boggled by the stuff on the AtomChip website.
--
Steve
Despair - It's always darkest just before it goes pitch black...

2005-11-22 16:55:15

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Tue, 22 Nov 2005, Chris Adams wrote:
> Once upon a time, Jan Harkes <[email protected]> said:
> >The only thing that tends to break are userspace archiving tools like
> >tar, which assume that 2 objects with the same 32-bit st_ino value are
> >identical.
>
> That assumption is probably made because that's what POSIX and Single
> Unix Specification define: "The st_ino and st_dev fields taken together
> uniquely identify the file within the system." Don't blame code that
> follows standards for breaking.

The standards are insufficient, however. For example, named streams or
extended attributes, if exposed as "normal files", would naturally have
the same st_ino (given they are the same inode as the normal file data)
and st_dev fields.

> >I think that by now several actually double check that the inode
> >linkcount is larger than 1.
>
> That is not a good check. I could have two separate files that have
> multiple links; if st_ino is the same, how can tar make sense of it?

Now that is true. In addition to checking that the link count is larger than
1, they should check the file size and, if that matches, compute the SHA-1
digest of the data (or the MD5 sum or whatever), and probably should also
check the various stat fields for equality before bothering with the
checksum of the file contents.
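
A sketch of that comparison order, cheapest checks first (illustrative
only, not from any existing tool):

    import hashlib, os

    def digest(path):
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(65536), b''):
                h.update(chunk)
        return h.hexdigest()

    def probably_same_file(a, b):
        sa, sb = os.stat(a), os.stat(b)
        if sa.st_nlink < 2 or sb.st_nlink < 2:   # not hardlink candidates at all
            return False
        if sa.st_size != sb.st_size:             # cheap stat checks first
            return False
        return digest(a) == digest(b)            # only then read and checksum the data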

Or Linux just needs a backup API that programs like this can use to
save/restore files. (Analogous to the MS Backup API but hopefully
less horrid...)

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2005-11-22 17:18:53

by Theodore Ts'o

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Tue, Nov 22, 2005 at 04:55:08PM +0000, Anton Altaparmakov wrote:
> > That assumption is probably made because that's what POSIX and Single
> > Unix Specification define: "The st_ino and st_dev fields taken together
> > uniquely identify the file within the system." Don't blame code that
> > follows standards for breaking.
>
> The standards are insufficient however. For example dealing with named
> streams or extended attributes if exposed as "normal files" would
> naturally have the same st_ino (given they are the same inode as the
> normal file data) and st_dev fields.

Um, but that's why even Solaris's openat(2) proposal doesn't expose
streams or extended attributes as "normal files". The answer is that
you can't just expose named streams or extended attributes as "normal
files" without screwing yourself.

Also, I haven't checked to see what Solaris does, but technically
their UFS implementation does actually use separate inodes for their
named streams, so stat(2) could return separate inode numbers for the
named streams. (In fact, if you take a Solaris UFS filesystem with
extended attributes and run a Solaris 8 fsck on it, the directory
containing named streams/extended attributes will show up in
lost+found.)

- Ted

2005-11-22 17:33:57

by Jeffrey V. Merkey

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

Bill Davidsen wrote:

> Jeff V. Merkey wrote:
>
>> Bernd Eckenfels wrote:
>>
>>> In article <[email protected]> you wrote:
>>>
>>>
>>>> I believe that on 64 bit platforms, Linux has a 64 bit clean VFS.
>>>> Python says 2**64 is 18446744073709551616, and that's roughly:
>>>> 18,446,744,073,709,551,616 bytes
>>>> 18,446,744,073,709 megs
>>>> 18,446,744,073 gigs
>>>> 18,446,744 terabytes
>>>> 18,446 ... what are those, pedabytes (petabytes?)
>>>> 18 zetabytes
>>>>
>> There you go. I deal with this a lot so, those are the names.
>>
>> Linux is currently limited to 16 TB per VFS mount point, it's all
>> mute, unless VFS gets fixed.
>> mmap won't go above this at present.
>>
> What does "it's all mute" mean?
>
Should be spelled "moot". It's a legal term that means "it doesn't matter".

Jeff

2005-11-22 17:43:14

by Jan Harkes

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Tue, Nov 22, 2005 at 11:28:36AM -0500, Theodore Ts'o wrote:
> On Tue, Nov 22, 2005 at 10:25:31AM -0500, Jan Harkes wrote:
> > On Tue, Nov 22, 2005 at 09:50:47AM -0500, Theodore Ts'o wrote:
> > > I will note though that there are people who are asking for 64-bit
> > > inode numbers on 32-bit platforms, since 2**32 inodes are not enough
> > > for certain distributed/clustered filesystems. And this is something
> > > we don't yet support today, and probably will need to think about much
> > > sooner than 128-bit filesystems....
> >
> > As far as the kernel is concerned this hasn't been a problem in a while
> > (2.4.early). The iget4 operation that was introduced by reiserfs (now
> > iget5) pretty much makes it possible for a filesystem to use anything to
> > identify it's inodes. The 32-bit inode numbers are simply used as a hash
> > index.
>
> iget4 wasn't even strictly necessary, unless you want to use the inode
> cache (which has always been strictly optional for filesystems, even
> inode-based ones) --- Linux's VFS is dentry-based, not inode-based, so
> we don't use inode numbers to index much of anything inside the
> kernel, other than the aforementioned optional inode cache.

Ah yes, you're right.

> The main issue is the lack of a 64-bit interface to extract inode
> numbers, which is needed as you point out for userspace archiving
> tools like tar. There are also other programs or protocols that in the
> past have broken as a result of inode number collisions.

64-bit? Coda has been using 128-bit file identifiers for a while now.
And I can imagine someone trying to plug something like git into the VFS
might want to use 160 bits (the width of a SHA-1 hash). Or even more for a
CAS-based storage that identifies objects by their SHA256 or SHA512 checksum.

On the other hand, any large-scale distributed/cluster-based file system
will probably have some sort of snapshot-based backup strategy as part
of the file system design. Using tar to back up a couple of tera- or
petabytes just seems like asking for trouble; even keeping track of the
possible hardlinks by remembering previously seen inode numbers over
vast amounts of files will become difficult at some point.

> As another example, a quick google search indicates that the some mail
> programs can use inode numbers as a part of a technique to create
> unique filenames in maildir directories. One could easily also

Hopefully it is only part of the technique. Like combining it with
grabbing a timestamp, the hostname/MAC address where the operation
occurred, etc.

> imagine using inode numbers as part of creating unique ids returned by
> an IMAP server --- not something I would recommend, but it's an
> example of what some people might have done, since everybody _knows_
> they can count on inode numbers on Unix systems, right? POSIX
> promises that they won't break!

Under limited conditions. Not sure how stable/unique 32-bit inode
numbers are on NFS clients, taking into account client-reboots, failing
disks that are restored from tape, or when the file system reuses inode
numbers of recently deleted files, etc. It doesn't matter how much
stability and uniqueness POSIX demands, I simply can't see how it can be
guaranteed in all cases.

> > The only thing that tends to break are userspace archiving tools like
> > tar, which assume that 2 objects with the same 32-bit st_ino value are
> > identical. I think that by now several actually double check that the
> > inode linkcount is larger than 1.
>
> Um, that's not good enough to avoid failure modes; consider what might
> happen if you have two inodes that have hardlinks, so that st_nlink >
> 1, but whose inode numbers are the same if you only look at the low 32
> bits? Oops.
>
> It's not a bad hueristic, if you don't have that many hard-linked
> files on your system, but if you have a huge number of hard-linked
> trees (such as you might find on a kernel developer with tons of
> hard-linked trees), I wouldn't want to count on this always working.

Yeah, bad example for the typical case. But there must be some check to
at least avoid problems when files are removed/created and the inode
numbers are reused during a backup run.

Jan

2005-11-22 18:01:16

by Jeffrey V. Merkey

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

Jan Harkes wrote:

>On Tue, Nov 22, 2005 at 11:28:36AM -0500, Theodore Ts'o wrote:
>
>
>>On Tue, Nov 22, 2005 at 10:25:31AM -0500, Jan Harkes wrote:
>>
>>
>>>On Tue, Nov 22, 2005 at 09:50:47AM -0500, Theodore Ts'o wrote:
>>>
>>>
>>>>I will note though that there are people who are asking for 64-bit
>>>>inode numbers on 32-bit platforms, since 2**32 inodes are not enough
>>>>for certain distributed/clustered filesystems. And this is something
>>>>we don't yet support today, and probably will need to think about much
>>>>sooner than 128-bit filesystems....
>>>>
>>>>
>>>As far as the kernel is concerned this hasn't been a problem in a while
>>>(2.4.early). The iget4 operation that was introduced by reiserfs (now
>>>iget5) pretty much makes it possible for a filesystem to use anything to
>>>identify it's inodes. The 32-bit inode numbers are simply used as a hash
>>>index.
>>>
>>>
>>iget4 wasn't even strictly necessary, unless you want to use the inode
>>cache (which has always been strictly optional for filesystems, even
>>inode-based ones) --- Linux's VFS is dentry-based, not inode-based, so
>>we don't use inode numbers to index much of anything inside the
>>kernel, other than the aforementioned optional inode cache.
>>
>>
>
>Ah yes, you're right.
>
>
>
>>The main issue is the lack of a 64-bit interface to extract inode
>>numbers, which is needed as you point out for userspace archiving
>>tools like tar. There are also other programs or protocols that in the
>>past have broken as a result of inode number collisions.
>>
>>
>
>64-bit? Coda has been using 128-bit file identifiers for a while now.
>And I can imagine someone trying to plug something like git into the VFS
>might want to use 168-bits. Or even more for a CAS-based storage that
>identifies objects by their SHA256 or SHA512 checksum.
>
>On the other hand, any large scale distributed/cluster based file system
>probably will have some sort of snapshot based backup strategy as part
>of the file system design. Using tar to back up a couple of tera/peta
>bytes just seems like asking for trouble, even keeping track of the
>possible hardlinks by remembering previously seen inode numbers over
>vast amounts of files will become difficult at some point.
>
>
>
>>As another example, a quick google search indicates that the some mail
>>programs can use inode numbers as a part of a technique to create
>>unique filenames in maildir directories. One could easily also
>>
>>
>
>Hopefully it is only part of the technique. Like combining it with
>grabbing a timestamp, the hostname/MAC address where the operation
>occurred, etc.
>
>
>
>>imagine using inode numbers as part of creating unique ids returned by
>>an IMAP server --- not something I would recommend, but it's an
>>example of what some people might have done, since everybody _knows_
>>they can count on inode numbers on Unix systems, right? POSIX
>>promises that they won't break!
>>
>>
>
>Under limited conditions. Not sure how stable/unique 32-bit inode
>numbers are on NFS clients, taking into account client-reboots, failing
>disks that are restored from tape, or when the file system reuses inode
>numbers of recently deleted files, etc. It doesn't matter how much
>stability and uniqueness POSIX demands, I simply can't see how it can be
>guaranteed in all cases.
>
>
>
>>>The only thing that tends to break are userspace archiving tools like
>>>tar, which assume that 2 objects with the same 32-bit st_ino value are
>>>identical. I think that by now several actually double check that the
>>>inode linkcount is larger than 1.
>>>
>>>
>>Um, that's not good enough to avoid failure modes; consider what might
>>happen if you have two inodes that have hardlinks, so that st_nlink >
>>1, but whose inode numbers are the same if you only look at the low 32
>>bits? Oops.
>>
>>It's not a bad hueristic, if you don't have that many hard-linked
>>files on your system, but if you have a huge number of hard-linked
>>trees (such as you might find on a kernel developer with tons of
>>hard-linked trees), I wouldn't want to count on this always working.
>>
>>
>
>Yeah, bad example for the typical case. But there must be some check to
>at least avoid problems when files are removed/created and the inode
>numbers are reused during a backup run.
>
> Jan

Someone needs to fix the mmap problems with some clever translation for
supporting huge files and filesystems beyond 16 TB. Increasing block sizes
will help (increase to 64K in the buffer cache). I have a lot of input here,
and I am supporting huge data storage volumes at present with the 32-bit
version, but I have had to insert my own 64K management layer to interface
with the VFS, and I have also had to put some restrictions on file sizes.
Packet capture based FSs can generate more data than any of these
traditional FSs do.
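
The 16 TB figure falls out of the 32-bit page index in the page cache
(a rough sketch, assuming 4 KiB pages):

    page_index_bits = 32                # page/block index width on a 32-bit kernel
    print(2**page_index_bits * 4096)    # 17592186044416 bytes, i.e. 16 TiB with 4 KiB pages
    print(2**page_index_bits * 65536)   # 281474976710656 bytes, i.e. 256 TiB with 64 KiB blocks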

Jeff

2005-11-22 19:07:17

by Pavel Machek

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

Hi!

> > > Sun is proposing it can predict what storage layout will be efficient for
> > > as yet unheard of quantities of data, with unknown access patterns, at
> > > least a couple decades from now. It's also proposing that data
> > > compression and checksumming are the filesystem's job. Hands up anybody
> > > who spots conflicting trends here already? Who thinks the 128 bit
> > > requirement came from marketing rather than the engineers?
> >
> > Actually, if you are storing information in single protons, I'd say
> > you _need_ checksumming :-).
>
> You need error correcting codes at the media level. A molecular storage
> system like this would probably look a lot more like flash or dram than it
> would magnetic media. (For one thing, I/O bandwidth and seek times become a
> serious bottleneck with high density single point of access systems.)
>
> > [I actually agree with Sun here, not trusting disk is good idea. At
> > least you know kernel panic/oops/etc can't be caused by bit corruption on
> > the disk.]
>
> But who said the filesystem was the right level to do this at?

Filesystem level may not be the best level to do it at, but doing it
at all is still better than the current state of the art. Doing it at
the media level is not enough, because then you can still get
interference on the IDE cable, driver bugs, etc.

The DM layer might be a better place to do checksums, but perhaps the
filesystem can do it more efficiently (it knows its own access
patterns), and it is definitely easier to set up for the end user.

If you want compression anyway (and you want it -- for performance
reasons, if you are working with big texts or geographical data),
doing checksums at the same level just makes sense.
Pavel
--
Thanks, Sharp!

2005-11-22 19:25:41

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Tue, 22 Nov 2005, Theodore Ts'o wrote:
> On Tue, Nov 22, 2005 at 04:55:08PM +0000, Anton Altaparmakov wrote:
> > > That assumption is probably made because that's what POSIX and Single
> > > Unix Specification define: "The st_ino and st_dev fields taken together
> > > uniquely identify the file within the system." Don't blame code that
> > > follows standards for breaking.
> >
> > The standards are insufficient however. For example dealing with named
> > streams or extended attributes if exposed as "normal files" would
> > naturally have the same st_ino (given they are the same inode as the
> > normal file data) and st_dev fields.
>
> Um, but that's why even Solaris's openat(2) proposal doesn't expose
> streams or extended attributes as "normal files". The answer is that
> you can't just expose named streams or extended attributes as "normal
> files" without screwing yourself.

Reiser4 does I believe...

> Also, I haven't checked to see what Solaris does, but technically
> their UFS implementation does actually use separate inodes for their
> named streams, so stat(2) could return separate inode numbers for the
> named streams. (In fact, if you take a Solaris UFS filesystem with
> extended attributs, and run it on a Solaris 8 fsck, the directory
> containing named streams/extended attributes will show up in
> lost+found.)

I was not talking about Solaris/UFS. NTFS has named streams and extended
attributes, and both are stored as separate attribute records inside the
same inode as the data attribute. (A bit simplified, as multiple inodes
can be in use for one "file" when an inode's attributes become larger than
the inode can hold - in that case attributes are either moved whole to a
new inode and/or are chopped up in bits and each bit goes to a different inode.)

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2005-11-22 19:47:44

by Alan

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Maw, 2005-11-22 at 10:17 -0600, Chris Adams wrote:
> That assumption is probably made because that's what POSIX and Single
> Unix Specification define: "The st_ino and st_dev fields taken together
> uniquely identify the file within the system." Don't blame code that
> follows standards for breaking.

It was a nice try but there is a giant gotcha most people forget. It's
only safe to make this assumption while you have all of the
files/directories in question open.

2005-11-22 19:52:09

by Theodore Ts'o

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Tue, Nov 22, 2005 at 07:25:20PM +0000, Anton Altaparmakov wrote:
> > > The standards are insufficient however. For example dealing with named
> > > streams or extended attributes if exposed as "normal files" would
> > > naturally have the same st_ino (given they are the same inode as the
> > > normal file data) and st_dev fields.
> >
> > Um, but that's why even Solaris's openat(2) proposal doesn't expose
> > streams or extended attributes as "normal files". The answer is that
> > you can't just expose named streams or extended attributes as "normal
> > files" without screwing yourself.
>
> Reiser4 does I believe...

Reiser4 violates POSIX. News at 11....

> I was not talking about Solaris/UFS. NTFS has named streams and extended
> attributes and both are stored as separate attribute records inside the
> same inode as the data attribute. (A bit simplified as multiple inodes
> can be in use for one "file" when an inode's attributes become large than
> an inode - in that case attributes are either moved whole to a new inode
> and/or are chopped up in bits and each bit goes to a different inode.)

NTFS violates POSIX. News at 11....

- Ted

2005-11-22 19:57:14

by Chris Adams

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

Once upon a time, Alan Cox <[email protected]> said:
> It was a nice try but there is a giant gotcha most people forget. Its
> only safe to make this assumption while you have all of the
> files/directories in question open.

Tru64 adds a "st_gen" field to struct stat. It is an unsigned int that
is a "generation" counter for a particular inode. To get a collision
while creating and removing files, you'd have to remove and create a
file with the same inode 2^32 times while tar (or whatever) is running.
Here's what stat(2) says:

Two structure members in <sys/stat.h> uniquely identify a file in a file
system: st_ino, the file serial number, and st_dev, the device id for the
directory that contains the file.

[Tru64 UNIX] However, in the rare case when a user application has been
deleting open files, and a file serial number is reused, a third structure
member in <sys/stat.h>, the file generation number, is needed to uniquely
identify a file. This member, st_gen, is used in addition to st_ino and
st_dev.

--
Chris Adams <[email protected]>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.

2005-11-22 20:01:04

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Tue, 22 Nov 2005, Theodore Ts'o wrote:
> On Tue, Nov 22, 2005 at 07:25:20PM +0000, Anton Altaparmakov wrote:
> > > > The standards are insufficient however. For example dealing with named
> > > > streams or extended attributes if exposed as "normal files" would
> > > > naturally have the same st_ino (given they are the same inode as the
> > > > normal file data) and st_dev fields.
> > >
> > > Um, but that's why even Solaris's openat(2) proposal doesn't expose
> > > streams or extended attributes as "normal files". The answer is that
> > > you can't just expose named streams or extended attributes as "normal
> > > files" without screwing yourself.
> >
> > Reiser4 does I believe...
>
> Reiser4 violates POSIX. News at 11....
>
> > I was not talking about Solaris/UFS. NTFS has named streams and extended
> > attributes and both are stored as separate attribute records inside the
> > same inode as the data attribute. (A bit simplified as multiple inodes
> > can be in use for one "file" when an inode's attributes become large than
> > an inode - in that case attributes are either moved whole to a new inode
> > and/or are chopped up in bits and each bit goes to a different inode.)
>
> NTFS violates POSIX. News at 11....

What is your point? I personally couldn't care less about POSIX (or any
other similarly old-fashioned standards for that matter). What counts is
reality and having a working system that does what I want/need it to do.
If that means violating POSIX, so be it. I am not going to bury my head
in the sand just because POSIX says "you can't do that". Utilities can be
taught to work with the system instead of blindly following standards.

And anyway the Linux kernel defies POSIX left, right, and centre so if you
care that much you ought to be off fixing all those violations... (-;

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2005-11-22 23:03:01

by Theodore Ts'o

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Tue, Nov 22, 2005 at 08:00:58PM +0000, Anton Altaparmakov wrote:
>
> What is your point? I personally couldn't care less about POSIX (or any
> other simillarly old-fashioned standards for that matter). What counts is
> reality and having a working system that does what I want/need it to do.
> If that means violating POSIX, so be it. I am not going to burry my head
> in the sand just because POSIX says "you can't do that". Utilities can be
> taught to work with the system instead of blindly following standards.

Finding all of the utilities and userspace applications that depend on
some specific POSIX behavior is hard, and convincing them to change,
instead of fixing the buggy OS, is even harder. But that's OK; no one
has to use your filesystem (or operating system) if it deviates from the
standards enough that your applications start breaking.

> And anyway the Linux kernel defies POSIX left, right, and centre so if you
> care that much you ought to be off fixing all those violations... (-;

Um, where? Actually, we're pretty close, and we often spend quite a
bit of time fixing places where we don't conform to the standards
correctly. Look at all of the work that's gone into the kernel to
make Linux's threads support POSIX compliant, for example. We did
*not* tell everyone to go rewrite their applications to use
LinuxThreads, even if certain aspects of Posix threads are a little
brain-damaged.

- Ted

2005-11-23 14:53:55

by Andi Kleen

[permalink] [raw]
Subject: Generation numbers in stat was Re: what is slashdot's answer to ZFS?

Chris Adams <[email protected]> writes:
>
> [Tru64 UNIX] However, in the rare case when a user application has been
> deleting open files, and a file serial number is reused, a third structure
> member in <sys/stat.h>, the file generation number, is needed to uniquely
> identify a file. This member, st_gen, is used in addition to st_ino and
> st_dev.

Sounds like a cool idea. Many fs already maintain this information
in the kernel. We still had some unused pad space in the struct stat
so it could be implemented without any compatibility issues
(e.g. in place of __pad0). On old kernels it would be always 0.

-Andi

2005-11-23 15:50:07

by Bill Davidsen

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

Jeff V. Merkey wrote:
> Bill Davidsen wrote:
>
>> Jeff V. Merkey wrote:
>>
>>> Bernd Eckenfels wrote:
>>>
>>>> In article <[email protected]> you wrote:
>>>>
>>>>
>>>>> I believe that on 64 bit platforms, Linux has a 64 bit clean VFS.
>>>>> Python says 2**64 is 18446744073709551616, and that's roughly:
>>>>> 18,446,744,073,709,551,616 bytes
>>>>> 18,446,744,073,709 megs
>>>>> 18,446,744,073 gigs
>>>>> 18,446,744 terabytes
>>>>> 18,446 ... what are those, petabytes?
>>>>> 18 exabytes
>>>>>
>>> There you go. I deal with this a lot, so those are the names.
>>>
>>> Linux is currently limited to 16 TB per VFS mount point, it's all
>>> mute, unless VFS gets fixed.
>>> mmap won't go above this at present.
>>>
>> What does "it's all mute" mean?
>>
> Should be spelled "moot". It's a legal term that means "it doesn't matter".

Yes, I am well aware of what moot means, had you used that.


--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2005-11-23 15:51:01

by Bill Davidsen

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

Anton Altaparmakov wrote:
> On Tue, 22 Nov 2005, Chris Adams wrote:
>
>>Once upon a time, Jan Harkes <[email protected]> said:
>>
>>>The only things that tend to break are userspace archiving tools like
>>>tar, which assume that 2 objects with the same 32-bit st_ino value are
>>>identical.
>>
>>That assumption is probably made because that's what POSIX and Single
>>Unix Specification define: "The st_ino and st_dev fields taken together
>>uniquely identify the file within the system." Don't blame code that
>>follows standards for breaking.
>
>
> The standards are insufficient, however. For example, named streams
> or extended attributes, if exposed as "normal files", would
> naturally have the same st_ino (given they are the same inode as the
> normal file data) and st_dev fields.
>
>
>>>I think that by now several actually double-check that the inode
>>>link count is larger than 1.
>>
>>That is not a good check. I could have two separate files that have
>>multiple links; if st_ino is the same, how can tar make sense of it?
>
>
> Now that is true. In addition to checking that the link count is
> larger than 1, they should check the file size and, if that matches,
> compute the SHA-1 digest of the data (or the MD5 sum, or whatever);
> they probably should also check the various stat fields for equality
> before bothering with the checksum of the file contents.
>
> Or Linux just needs a backup API that programs like this can use to
> save/restore files. (Analogous to the MS Backup API, but hopefully
> less horrid...)
>
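
Just to make the above concrete, the check being described might look
something like this in a tar-style archiver (an untested sketch: a
byte-for-byte comparison stands in for the SHA-1/MD5 step since both
files are at hand anyway, and the choice of which stat fields are
"enough" is a guess):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Placeholder for the digest step: compare the two files byte for
 * byte.  A real archiver might hash instead so results can be cached. */
static bool contents_match(const char *path_a, const char *path_b)
{
        char buf_a[8192], buf_b[8192];
        size_t na, nb;
        bool same = true;
        FILE *fa = fopen(path_a, "rb");
        FILE *fb = fopen(path_b, "rb");

        if (!fa || !fb) {
                same = false;
                goto out;
        }
        do {
                na = fread(buf_a, 1, sizeof(buf_a), fa);
                nb = fread(buf_b, 1, sizeof(buf_b), fb);
                if (na != nb || memcmp(buf_a, buf_b, na) != 0) {
                        same = false;
                        break;
                }
        } while (na == sizeof(buf_a));
out:
        if (fa)
                fclose(fa);
        if (fb)
                fclose(fb);
        return same;
}

/* Decide whether two directory entries really are the same file.  A
 * matching (st_dev, st_ino) pair is only trusted when the link counts
 * say hard links exist, and the cheap stat fields are compared before
 * paying for a full content comparison. */
static bool same_file(const struct stat *a, const struct stat *b,
                      const char *path_a, const char *path_b)
{
        if (a->st_dev != b->st_dev || a->st_ino != b->st_ino)
                return false;

        if (a->st_nlink < 2 || b->st_nlink < 2)
                return false;   /* no hard links, so a collision is suspect */

        if (a->st_size  != b->st_size  ||
            a->st_mode  != b->st_mode  ||
            a->st_uid   != b->st_uid   ||
            a->st_gid   != b->st_gid   ||
            a->st_mtime != b->st_mtime)
                return false;

        return contents_match(path_a, path_b);
}

With a generation number along the lines being discussed in the other
subthread, the content comparison could also be skipped whenever the
generation numbers differ.
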
In order to prevent the problems mentioned AND satisfy SuS, I would
think that the st_dev field is the value which should be unique, which
is not always the case currently. st_ino is a file id within st_dev,
and it would be less confusing if the inodes on each st_dev were
unique. Not to mention that some backup programs do look at st_dev and
could be mightily confused if its meaning is not deterministic.

Historical application usage assumes that it is invariant; many
applications were written before pluggable devices and network mounts.
In a perfect world where nothing broke when things were changed, a
filesystem would carry some UUID so that it looks the same whether
mounted over the network, mounted directly, loopback mounted, etc.,
and there would be no confusion.

A backup API would really be nice if it could somehow provide such a
unique ID, so that a network or direct backup of the same data would
produce the same IDs.
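
For what it's worth, the kind of ID being asked for is easy to sketch,
assuming something (a backup API, a superblock field, whatever) could
hand userspace a per-filesystem UUID; everything below apart from the
struct stat usage is hypothetical:

#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* A location-independent file identity: the containing filesystem's
 * UUID plus the inode number within it.  Unlike st_dev, this would
 * not change across reboots, network mounts, or loopback mounts. */
struct file_id {
        unsigned char   fs_uuid[16];    /* filesystem UUID */
        ino_t           ino;            /* inode within that filesystem */
};

/* Build an identity for a stat'ed file.  How the UUID is obtained is
 * exactly the open question here -- it is simply passed in by the
 * caller. */
static void make_file_id(struct file_id *id,
                         const unsigned char fs_uuid[16],
                         const struct stat *st)
{
        memcpy(id->fs_uuid, fs_uuid, sizeof(id->fs_uuid));
        id->ino = st->st_ino;
}

static int same_file_id(const struct file_id *a, const struct file_id *b)
{
        return memcmp(a->fs_uuid, b->fs_uuid, sizeof(a->fs_uuid)) == 0 &&
               a->ino == b->ino;
}

(None of this exists today, of course; it is just what such an ID might
look like.)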
--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2005-11-23 15:51:07

by Bill Davidsen

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

Chris Adams wrote:
> Once upon a time, Alan Cox <[email protected]> said:
>
>>It was a nice try, but there is a giant gotcha most people forget. It's
>>only safe to make this assumption while you have all of the
>>files/directories in question open.
>

Right; at the time the structures were created, removable (in any sense)
media usually meant 1/2-inch mag tape, not block storage. The inode was
pretty well set by SysIII, IIRC.
>
> Tru64 adds a "st_gen" field to struct stat. It is an unsigned int that
> is a "generation" counter for a particular inode. To get a collision
> while creating and removing files, you'd have to remove and create a
> file with the same inode 2^32 times while tar (or whatever) is running.
> Here's what stat(2) says:
>
> Two structure members in <sys/stat.h> uniquely identify a file in a file
> system: st_ino, the file serial number, and st_dev, the device id for the
> directory that contains the file.
>
> [Tru64 UNIX] However, in the rare case when a user application has been
> deleting open files, and a file serial number is reused, a third structure
> member in <sys/stat.h>, the file generation number, is needed to uniquely
> identify a file. This member, st_gen, is used in addition to st_ino and
> st_dev.
>
Shades of VMS! Of course that's not unique to Tru64; I believe iso9660
(CD) has versioning, which is almost never used.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2005-11-23 15:52:01

by Bill Davidsen

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

Theodore Ts'o wrote:
> On Tue, Nov 22, 2005 at 07:25:20PM +0000, Anton Altaparmakov wrote:
>
>>>>The standards are insufficient, however. For example, named streams
>>>>or extended attributes, if exposed as "normal files", would
>>>>naturally have the same st_ino (given they are the same inode as the
>>>>normal file data) and st_dev fields.
>>>
>>>Um, but that's why even Solaris's openat(2) proposal doesn't expose
>>>streams or extended attributes as "normal files". The answer is that
>>>you can't just expose named streams or extended attributes as "normal
>>>files" without screwing yourself.
>>
>>Reiser4 does I believe...
>
>
> Reiser4 violates POSIX. News at 11....
>
>
>>I was not talking about Solaris/UFS. NTFS has named streams and extended
>>attributes and both are stored as separate attribute records inside the
>>same inode as the data attribute. (A bit simplified as multiple inodes
>>can be in use for one "file" when an inode's attributes become large than
>>an inode - in that case attributes are either moved whole to a new inode
>>and/or are chopped up in bits and each bit goes to a different inode.)
>
>
> NTFS violates POSIX. News at 11....
>
True, but perhaps in this case it's time for POSIX to move; the things
stored in filesystems, and the things used as filesystems, have changed
a bunch.

It would be nice to have a neutral standard rather than adopting one of
the existing extended implementations, simply because of the politics:
everyone but MS would hate NTFS, MS would hate any of the existing
others, and a new standard would have the same impact on everyone and
therefore might be viable. Not a quick fix, however; standards take a
LONG time.
--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2005-11-24 01:52:01

by art

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

Check this out, but remember that developers can be contaminated
and sued by Sun over patent stuff:

http://www.opensolaris.org/os/community/zfs/source/

xboom

2005-11-24 05:15:35

by Chris Adams

[permalink] [raw]
Subject: Re: Generation numbers in stat was Re: what is slashdot's answer to ZFS?

Once upon a time, Andi Kleen <[email protected]> said:
> Chris Adams <[email protected]> writes:
> > [Tru64 UNIX] However, in the rare case when a user application has been
> > deleting open files, and a file serial number is reused, a third structure
> > member in <sys/stat.h>, the file generation number, is needed to uniquely
> > identify a file. This member, st_gen, is used in addition to st_ino and
> > st_dev.
>
> Sounds like a cool idea. Many fs already maintain this information
> in the kernel. We still have some unused pad space in the struct stat,
> so it could be implemented without any compatibility issues
> (e.g. in place of __pad0). On old kernels it would always be 0.

Searching around some, I see that OS X has st_gen, but the man page I
found says it is only available to the super-user. It also appears that
AIX and at least some of the BSDs have it (which would make sense, I
guess, as Tru64, OS X, and IIRC AIX are all BSD-derived).

Also, I see someone pitched it to linux-kernel several years ago but it
didn't appear to go anywhere. Maybe time to rethink that?
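
If it did come back, userspace could pick it up behind a feature test
without breaking anything else. A rough sketch, assuming an autoconf
check along the lines of AC_CHECK_MEMBERS([struct stat.st_gen])
defining HAVE_STRUCT_STAT_ST_GEN, and treating zero as "no
information" (my guess, given the OS X super-user behaviour above):

#include <stdbool.h>
#include <sys/stat.h>

/* Compare file identity, using the generation number where the
 * platform's struct stat provides one.  A zero st_gen is treated as
 * "unknown" rather than as a real generation, since at least OS X
 * reportedly reports 0 to non-root callers. */
static bool same_identity(const struct stat *a, const struct stat *b)
{
        if (a->st_dev != b->st_dev || a->st_ino != b->st_ino)
                return false;

#ifdef HAVE_STRUCT_STAT_ST_GEN
        if (a->st_gen != 0 && b->st_gen != 0 && a->st_gen != b->st_gen)
                return false;   /* same inode number, different life */
#endif

        return true;
}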
--
Chris Adams <[email protected]>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.

2005-11-24 08:47:44

by Andi Kleen

[permalink] [raw]
Subject: Re: Generation numbers in stat was Re: what is slashdot's answer to ZFS?

> Also, I see someone pitched it to linux-kernel several years ago but it
> didn't appear to go anywhere. Maybe time to rethink that?

It just needs someone to post a patch.

-Andi

2005-11-28 12:54:11

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On 2005-11-21T11:19:59, Jörn Engel <[email protected]> wrote:

> o Merge of LVM and filesystem layer
> Not done. This has some advantages, but also more complexity than
> separate LVM and filesystem layers. Might be considered "not worth
> it" for some years.

This is one of the cooler ideas IMHO. In effect, LVM is just a special
case filesystem - huge blocksizes, few files, mostly no directories,
exports block instead of character/streams "files".

Why do we need to implement a clustered LVM as well as a clustered
filesystem? Because we can't re-use code across this boundary and can't
stack "real" filesystems, so we need a pseudo-layer we call volume
management. And then, if we happen to need a block device from a
filesystem again, we get to use loop devices. Does that make sense? Not
really.

(Same as the distinction between character and block devices in the
kernel.)

Look at how people want to use Xen: host the images on OCFS2/GFS backing
stores. In effect, this uses the CFS as a cluster-enabled volume
manager.

If they were better integrated (i.e. able to stack filesystems), we
could snapshot/RAID single files (or ultimately, even directory trees)
just as today we can snapshot whole block devices.


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
  -- Charles Darwin

2005-11-29 05:04:43

by Theodore Ts'o

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Mon, Nov 28, 2005 at 01:53:51PM +0100, Lars Marowsky-Bree wrote:
> On 2005-11-21T11:19:59, Jörn Engel <[email protected]> wrote:
>
> > o Merge of LVM and filesystem layer
> > Not done. This has some advantages, but also more complexity than
> > separate LVM and filesystem layers. Might be considered "not worth
> > it" for some years.
>
> This is one of the cooler ideas IMHO. In effect, LVM is just a special
> case filesystem - huge blocksizes, few files, mostly no directories,
> exports block instead of character/streams "files".

This isn't actually a new idea, BTW. Digital's advfs had storage
pools and the ability to have a single advfs filesystem span multiple
volumes, and to have multiple advfs filesystems share a storage pool,
something like ten years ago. Something to keep in mind for those
people looking for prior art for any potential Sun patents covering
ZFS.... (not that I am giving legal advice, of course!)

- Ted

2005-11-29 05:58:03

by Willy Tarreau

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

On Tue, Nov 29, 2005 at 12:04:39AM -0500, Theodore Ts'o wrote:
> On Mon, Nov 28, 2005 at 01:53:51PM +0100, Lars Marowsky-Bree wrote:
> > On 2005-11-21T11:19:59, Jörn Engel <[email protected]> wrote:
> >
> > > o Merge of LVM and filesystem layer
> > > Not done. This has some advantages, but also more complexity than
> > > separate LVM and filesystem layers. Might be considered "not worth
> > > it" for some years.
> >
> > This is one of the cooler ideas IMHO. In effect, LVM is just a special
> > case filesystem - huge blocksizes, few files, mostly no directories,
> > exports block instead of character/streams "files".
>
> This isn't actually a new idea, BTW. Digital's advfs had storage
> pools and the ability to have a single advfs filesystem span multiple
> volumes, and to have multiple advfs filesystems share a storage pool,
> something like ten years ago. Something to keep in mind for those
> people looking for prior art for any potential Sun patents covering
> ZFS.... (not that I am giving legal advice, of course!)
>
> - Ted

Having played a few months with a machine installed with advfs, I
can say that I *loved* this FS. It could be hot-resized, mounted
into several places at once (a bit like we can do now with --bind),
and best of all, it was by far the fastest FS I had ever seen. I
think that the 512 MB cache for the metadata helped a lot ;-)

Regards,
Willy

2005-11-29 09:30:48

by Andi Kleen

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

Theodore Ts'o <[email protected]> writes:

> On Mon, Nov 28, 2005 at 01:53:51PM +0100, Lars Marowsky-Bree wrote:
> > On 2005-11-21T11:19:59, Jörn Engel <[email protected]> wrote:
> >
> > > o Merge of LVM and filesystem layer
> > > Not done. This has some advantages, but also more complexity than
> > > separate LVM and filesystem layers. Might be considered "not worth
> > > it" for some years.
> >
> > This is one of the cooler ideas IMHO. In effect, LVM is just a special
> > case filesystem - huge blocksizes, few files, mostly no directories,
> > exports block instead of character/streams "files".
>
> This isn't actually a new idea, BTW. Digital's advfs had storage
> pools and the ability to have a single advfs filesystem span multiple
> volumes, and to have multiple advfs filesystems share a storage pool,
> something like ten years ago.

The old JFS code base had something similar before it got ported
to Linux (I believe it came from OS/2), but it was removed.
And Miguel did a prototype of it with ext2 at some point long ago.

But to me it's unclear whether it's really a good idea. Having at least
the option to control where physical storage is placed is nice,
especially if you cannot mirror everything (ZFS seems to assume
everything is mirrored), and separate devices and LVM make that easier.

-Andi

2005-11-29 14:43:03

by John Stoffel

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

>>>>> "Willy" == Willy Tarreau <[email protected]> writes:

Willy> Having played a few months with a machine installed with advfs,
Willy> I can say that I *loved* this FS. It could be hot-resized,
Willy> mounted into several places at once (a bit like we can do now
Willy> with --bind), and best of all, it was by far the fastest FS I
Willy> had ever seen. I think that the 512 MB cache for the metadata
Willy> helped a lot ;-)

It was a wonderful FS, but if you used a PrestoServer NFS accelerator
board with 4MB of RAM on the system and forgot to actually enable the
battery, bad things happened when the system crashed... you got a nice
4MB hole in the filesystem which caused wonderfully obtuse panics.
All the while, the hardware kept insisting that the battery on the
NVRAM board was just fine... it turned out to be a hardware bug on the
NVRAM board, which screwed us completely.

Once that was solved, back in the Oct '93 time frame as I recall, the
AdvFS filesystem just ran and ran and ran. Too bad DEC/Compaq/HP won't
release it nowadays....

John

2005-11-29 16:03:08

by Chris Adams

[permalink] [raw]
Subject: Re: what is our answer to ZFS?

Once upon a time, Theodore Ts'o <[email protected]> said:
>This isn't actually a new idea, BTW. Digital's advfs had storage
>pools and the ability to have a single advfs filesystem span multiple
>volumes, and to have multiple advfs filesystems share a storage pool,
>something like ten years ago.

A really nice feature of AdvFS is fileset-level snapshots. For my Alpha
servers, I don't have to allocate disk space to snapshot storage; the
fileset simply uses its own free space for changes while a snapshot
is active. For my Linux servers using LVM, I have to leave a chunk of
space free in the volume group, make sure it is big enough, make sure
only one snapshot exists at a time (or make sure there's enough free
space for multiple snapshots), etc.

AdvFS is also fully integrated with TruCluster; when I started
clustering, I didn't have to change anything for most of my storage.

I will miss AdvFS when we turn off our Alphas for the last time (which
won't be far off I guess; final order date for an HP Alpha system is
less than a year away now).
--
Chris Adams <[email protected]>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.