2013-03-19 23:00:41

by Martin Steigerwald

Subject: Re: Tux3 Report: Initial fsck has landed

On Tuesday, 29 January 2013, Daniel Phillips wrote:
> On Mon, Jan 28, 2013 at 5:40 PM, Theodore Ts'o <[email protected]> wrote:
> > On Mon, Jan 28, 2013 at 04:20:11PM -0800, Darrick J. Wong wrote:
> >> On Mon, Jan 28, 2013 at 03:27:38PM -0800, David Lang wrote:
> >> > The situation I'm thinking of is when dealing with VMs, you make a
> >> > filesystem image once and clone it multiple times. Won't that end up
> >> > with the same UUID in the superblock?
> >>
> >> Yes, but one ought to be able to change the UUID a la tune2fs
> >> -U. Even still... so long as the VM images have a different UUID
> >> than the fs that they live on, it ought to be fine.
> >
> > ... and this is something most system administrators should be
> > familiar with. For example, it's one of those things that Norton
> > Ghost does when it makes file system image copies (the equivalent of
> > "tune2fs -U random /dev/XXX")
>
> Hmm, maybe I missed something but it does not seem like a good idea
> to use the volume UID itself to generate unique-per-volume metadata
> hashes, if users expect to be able to change it. All the metadata hashes
> would need to be changed.

I believe that is what BTRFS is doing.

And yes, AFAIK there is no easy way to change the UUID of a BTRFS filesystem
after it has been created.
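
As a toy illustration of why that is hard (invented structures, not actual
btrfs code): if every metadata block embeds the volume UUID in its
checksummed header, then a tune2fs -U style change of the UUID alone
invalidates every block, and all metadata has to be rewritten:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

struct meta_block {
    uint8_t  fsid[16];              /* volume UUID copied into every block */
    uint64_t csum;                  /* covers fsid + payload */
    uint8_t  payload[4096 - 16 - 8];
};

/* Simple FNV-1a hash standing in for the real checksum algorithm. */
static uint64_t csum_of(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint64_t h = 14695981039346656037ULL;

    while (len--) {
        h ^= *p++;
        h *= 1099511628211ULL;
    }
    return h;
}

static void seal(struct meta_block *b)
{
    b->csum = 0;
    b->csum = csum_of(b, sizeof(*b));
}

static int verify(const struct meta_block *b)
{
    struct meta_block copy = *b;

    copy.csum = 0;
    return csum_of(&copy, sizeof(copy)) == b->csum;
}

int main(void)
{
    struct meta_block b;

    memset(&b, 0, sizeof(b));
    memcpy(b.fsid, "old-uuid-0123456", 16);
    strcpy((char *)b.payload, "directory entries, extent records, ...");
    seal(&b);
    printf("before UUID change: csum %s\n", verify(&b) ? "ok" : "BAD");

    /* A superblock-only UUID change: every block that embeds the old
     * UUID now fails verification until it is rewritten and resealed. */
    memcpy(b.fsid, "new-uuid-6543210", 16);
    printf("after UUID change:  csum %s\n", verify(&b) ? "ok" : "BAD");
    return 0;
}

At least that is my understanding of why a simple superblock edit is not
enough for btrfs.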

Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7


2013-03-20 04:08:44

by Daniel Phillips

Subject: Re: Tux3 Report: Initial fsck has landed

On Tue, Mar 19, 2013 at 9:04 PM, David Lang <[email protected]> wrote:
> On Wed, 20 Mar 2013, Martin Steigerwald wrote:
>
>> On Tuesday, 29 January 2013, Daniel Phillips wrote:
>>>
>>> On Mon, Jan 28, 2013 at 5:40 PM, Theodore Ts'o <[email protected]> wrote:
>>>>
>>>> On Mon, Jan 28, 2013 at 04:20:11PM -0800, Darrick J. Wong wrote:
>>>>>
>>>>> On Mon, Jan 28, 2013 at 03:27:38PM -0800, David Lang wrote:
>>>>>>
>>>>>> The situation I'm thinking of is when dealing with VMs, you make a
>>>>>> filesystem image once and clone it multiple times. Won't that end up
>>>>>> with the same UUID in the superblock?
>>>>>
>>>>> Yes, but one ought to be able to change the UUID a la tune2fs
>>>>> -U. Even still... so long as the VM images have a different UUID
>>>>> than the fs that they live on, it ought to be fine.
>>>>
>>>> ... and this is something most system administrators should be
>>>> familiar with. For example, it's one of those things that Norton
>>>> Ghost does when it makes file system image copies (the equivalent of
>>>> "tune2fs -U random /dev/XXX")
>>>
>>> Hmm, maybe I missed something but it does not seem like a good idea
>>> to use the volume UID itself to generate unique-per-volume metadata
>>> hashes, if users expect to be able to change it. All the metadata hashes
>>> would need to be changed.
>>
>> I believe that is what BTRFS is doing.
>>
>> And yes, AFAIK there is no easy way to change the UUID of a BTRFS
>> filesystem after it has been created.
>
> In a world where systems are cloned, and many VMs are started from one
> master copy of a filesystem, a UUID is about as far from unique as anything
> you can generate.
>
> BTRFS may have this problem, but why should Tux3 copy the problem?

Tux3 won't copy that problem. We have enough real problems to deal with
as it is, without manufacturing new ones.

Regards,

Daniel

2013-03-20 04:18:19

by David Lang

Subject: Re: Tux3 Report: Initial fsck has landed

On Wed, 20 Mar 2013, Martin Steigerwald wrote:

> On Tuesday, 29 January 2013, Daniel Phillips wrote:
>> On Mon, Jan 28, 2013 at 5:40 PM, Theodore Ts'o <[email protected]> wrote:
>>> On Mon, Jan 28, 2013 at 04:20:11PM -0800, Darrick J. Wong wrote:
>>>> On Mon, Jan 28, 2013 at 03:27:38PM -0800, David Lang wrote:
>>>>> The situation I'm thinking of is when dealing with VMs, you make a
>>>>> filesystem image once and clone it multiple times. Won't that end up
>>>>> with the same UUID in the superblock?
>>>>
>>>> Yes, but one ought to be able to change the UUID a la tune2fs
>>>> -U. Even still... so long as the VM images have a different UUID
>>>> than the fs that they live on, it ought to be fine.
>>>
>>> ... and this is something most system administrators should be
>>> familiar with. For example, it's one of those things that Norton
>>> Ghost does when it makes file system image copies (the equivalent of
>>> "tune2fs -U random /dev/XXX")
>>
>> Hmm, maybe I missed something but it does not seem like a good idea
>> to use the volume UID itself to generate unique-per-volume metadata
>> hashes, if users expect to be able to change it. All the metadata hashes
>> would need to be changed.
>
> I believe that is what BTRFS is doing.
>
> And yes, AFAIK there is no easy way to change the UUID of a BTRFS filesystem
> after it has been created.

In a world where systems are cloned, and many VMs are started from one master
copy of a filesystem, a UUID is about as far from unique as anything you can
generate.

BTRFS may have this problem, but why should Tux3 copy the problem?

David Lang

2013-03-20 10:29:47

by Martin Steigerwald

Subject: Re: Tux3 Report: Initial fsck has landed

On Wednesday, 20 March 2013, David Lang wrote:
> On Wed, 20 Mar 2013, Martin Steigerwald wrote:
> >> On Tuesday, 29 January 2013, Daniel Phillips wrote:
> >> On Mon, Jan 28, 2013 at 5:40 PM, Theodore Ts'o <[email protected]> wrote:
> >>> On Mon, Jan 28, 2013 at 04:20:11PM -0800, Darrick J. Wong wrote:
> >>>> On Mon, Jan 28, 2013 at 03:27:38PM -0800, David Lang wrote:
> >>>>> The situation I'm thinking of is when dealing with VMs, you make a
> >>>>> filesystem image once and clone it multiple times. Won't that end
> >>>>> up with the same UUID in the superblock?
> >>>>
> >>>> Yes, but one ought to be able to change the UUID a la tune2fs
> >>>> -U. Even still... so long as the VM images have a different UUID
> >>>> than the fs that they live on, it ought to be fine.
> >>>
> >>> ... and this is something most system administrators should be
> >>> familiar with. For example, it's one of those things that Norton
> >>> Ghost does when it makes file system image copies (the equivalent of
> >>> "tune2fs -U random /dev/XXX")
> >>
> >> Hmm, maybe I missed something but it does not seem like a good idea
> >> to use the volume UID itself to generate unique-per-volume metadata
> >> hashes, if users expect to be able to change it. All the metadata
> >> hashes would need to be changed.
> >
> > I believe that is what BTRFS is doing.
> >
> > And yes, AFAIK there is no easy way to change the UUID of a BTRFS
> > filesystem after it has been created.
>
> In a world where systems are cloned, and many VMs are started from one
> master copy of a filesystem, a UUID is about as far from unique as
> anything you can generate.
>
> BTRFS may have this problem, but why should Tux3 copy the problem?

I wasn't asking anyone to copy that behavior. I just mentioned it :)

--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7

2013-03-20 15:00:25

by Rob Landley

Subject: Re: Tux3 Report: Initial fsck has landed

On 03/19/2013 06:00:32 PM, Martin Steigerwald wrote:
> On Tuesday, 29 January 2013, Daniel Phillips wrote:
> > On Mon, Jan 28, 2013 at 5:40 PM, Theodore Ts'o <[email protected]> wrote:
> > > On Mon, Jan 28, 2013 at 04:20:11PM -0800, Darrick J. Wong wrote:
> > >> On Mon, Jan 28, 2013 at 03:27:38PM -0800, David Lang wrote:
> > >> > The situation I'm thinking of is when dealing with VMs, you make a
> > >> > filesystem image once and clone it multiple times. Won't that end up
> > >> > with the same UUID in the superblock?
> > >>
> > >> Yes, but one ought to be able to change the UUID a la tune2fs
> > >> -U. Even still... so long as the VM images have a different UUID
> > >> than the fs that they live on, it ought to be fine.
> > >
> > > ... and this is something most system administrators should be
> > > familiar with. For example, it's one of those things that Norton
> > > Ghost does when it makes file system image copies (the equivalent of
> > > "tune2fs -U random /dev/XXX")
> >
> > Hmm, maybe I missed something but it does not seem like a good idea
> > to use the volume UID itself to generate unique-per-volume metadata
> > hashes, if users expect to be able to change it. All the metadata hashes
> > would need to be changed.
>
> I believe that is what BTRFS is doing.
>
> And yes, AFAIK there is no easy way to change the UUID of a BTRFS
> filesystem after it has been created.

I'm confused, http://tux3.org/ lists a bunch of dates from 5 years ago,
then nothing. Is this project dead or not?

Rob-

2013-03-21 01:49:53

by Daniel Phillips

Subject: Re: Tux3 Report: Initial fsck has landed

On Tue, Mar 19, 2013 at 11:54 PM, Rob Landley <[email protected]> wrote:
> I'm confused, http://tux3.org/ lists a bunch of dates from 5 years ago, then
> nothing. Is this project dead or not?

Not. We haven't done much to update tux3.org lately; however, you
will find plenty of activity here:

https://github.com/OGAWAHirofumi/tux3/tree/master/user

You will also find fairly comprehensive updates on where we are and
where this is going, here:

http://phunq.net/pipermail/tux3/

At the moment we're being pretty quiet because of being in the middle
of developing the next-gen directory index. Not such a small task, as
you might imagine.

Regards,

Daniel

2013-03-22 01:57:19

by Dave Chinner

Subject: Re: Tux3 Report: Initial fsck has landed

On Wed, Mar 20, 2013 at 06:49:49PM -0700, Daniel Phillips wrote:
> On Tue, Mar 19, 2013 at 11:54 PM, Rob Landley <[email protected]> wrote:
> > I'm confused, http://tux3.org/ lists a bunch of dates from 5 years ago, then
> > nothing. Is this project dead or not?
>
> Not. We haven't done much to update tux3.org lately; however, you
> will find plenty of activity here:
>
> https://github.com/OGAWAHirofumi/tux3/tree/master/user
>
> You will also find fairly comprehensive updates on where we are and
> where this is going, here:
>
> http://phunq.net/pipermail/tux3/
>
> At the moment we're being pretty quiet because of being in the middle
> of developing the next-gen directory index. Not such a small task, as
> you might imagine.

Hi Daniel,

The "next-gen directory index" comment made me curious. I wanted to
know if there's anything I could learn from what you are doing and
whether anything of your new algorithms could be applied to, say,
the XFS directory structure to improve it.

I went looking for design docs and found this:

http://phunq.net/pipermail/tux3/2013-January/001938.html

In a word: Disappointment.

Compared to the XFS directory structure, the most striking
architectural similarity that I see is this:

"the file bteee[sic] effectively is a second directory index
that imposes a stable ordering on directory blocks".

That was the key architectural innovation in the XFS directory
structure that allowed it to provide the correct seekdir/telldir/NFS
readdir semantics and still scale. i.e. virtually mapped directory
entries. I explained this layout recently here:

http://marc.info/?l=linux-ext4&m=136081996316453&w=2
http://marc.info/?l=linux-ext4&m=136082221117399&w=2
http://marc.info/?l=linux-ext4&m=136089526928538&w=2

We could swap the relevant portions of your PHTree design doc with
my comments (and vice versa) and both sets of references would still
make perfect sense. :P
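
To make that concrete, here is a toy sketch (invented names and layout,
nothing like the real XFS structures): lookups go through a hash index
whose ordering is unstable, while readdir walks a second index of stable,
never-reused cookies, so a telldir() cookie handed to userspace stays
valid across block splits:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define MAX_ENTRIES 64

struct dirent_rec {
    char     name[32];
    uint32_t hash;      /* lookup key: order changes as blocks split */
    uint64_t cookie;    /* stable virtual offset handed to readdir */
};

static struct dirent_rec dir[MAX_ENTRIES];
static int nentries;
static uint64_t next_cookie = 1;

static uint32_t name_hash(const char *s)
{
    uint32_t h = 2166136261u;

    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;
    }
    return h;
}

static void dir_create(const char *name)
{
    struct dirent_rec *d = &dir[nentries++];

    snprintf(d->name, sizeof(d->name), "%s", name);
    d->hash = name_hash(name);
    d->cookie = next_cookie++;      /* never reused, never renumbered */
}

/* Name lookup goes through the hash key (here just a linear scan). */
static struct dirent_rec *dir_lookup(const char *name)
{
    uint32_t h = name_hash(name);

    for (int i = 0; i < nentries; i++)
        if (dir[i].hash == h && !strcmp(dir[i].name, name))
            return &dir[i];
    return NULL;
}

/* readdir resumes from a telldir() cookie by walking the stable index. */
static void dir_readdir_from(uint64_t cookie)
{
    for (int i = 0; i < nentries; i++)
        if (dir[i].cookie >= cookie)
            printf("  %llu: %s\n",
                   (unsigned long long)dir[i].cookie, dir[i].name);
}

int main(void)
{
    dir_create("alpha");
    dir_create("bravo");
    dir_create("charlie");

    printf("lookup bravo: %s\n", dir_lookup("bravo") ? "found" : "missing");
    printf("readdir resumed at cookie 2:\n");
    dir_readdir_from(2);
    return 0;
}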

Further, the PHTree description of tag based freespace tracking is
rather close to how XFS uses tags to track free space regions,
including the fact that XFS can be lazy at updating global free
space indexes. The global freespace tree indexing is slightly
different to the XFS method - it's closer to the original V1 dir
code in XFS (that didn't scale at all well) than the current code.
However, that's really a fine detail compared to all the major
structural and algorithmic similarities.
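
Roughly, the shared idea looks like this (again, invented structures rather
than the real XFS or PHTree layout): each directory block carries a tag for
its largest free region, and the global summary of those tags is only
repaired lazily, when an allocation trips over a stale entry:

#include <stdio.h>
#include <stdint.h>

#define NBLOCKS 8

struct dir_block {
    uint16_t best_free;     /* largest contiguous free region in block */
    /* ... directory entries would follow ... */
};

static struct dir_block blocks[NBLOCKS];
static uint16_t summary[NBLOCKS];   /* lazily maintained copy of the tags */

/* Find a block with room for 'need' bytes, repairing stale summary
 * entries only as we trip over them. */
static int find_block(uint16_t need)
{
    for (int i = 0; i < NBLOCKS; i++) {
        if (summary[i] < need)
            continue;                           /* summary says too small */
        if (blocks[i].best_free < need) {
            summary[i] = blocks[i].best_free;   /* stale: fix it lazily */
            continue;
        }
        return i;
    }
    return -1;
}

int main(void)
{
    for (int i = 0; i < NBLOCKS; i++)
        summary[i] = blocks[i].best_free = 512;

    /* Entries land in the first three blocks; only the in-block tags
     * are updated, the global summary is left stale. */
    for (int i = 0; i < 3; i++)
        blocks[i].best_free = 100;

    printf("block for a 400-byte entry: %d\n", find_block(400));
    printf("summary[0] after lazy repair: %u\n", (unsigned)summary[0]);
    return 0;
}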

Hence it appears to me that at a fundamental level PHTree is just a
re-implementation of the XFS directory architecture. It's definitely
a *major* step forward from HTree, but it can hardly be considered
revolutionary or "next-gen". It's not even state of the art. Hence:
disappointment.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-03-22 05:42:02

by Daniel Phillips

Subject: Re: Tux3 Report: Initial fsck has landed

Hi Dave,

Thank you for your insightful post. The answer to the riddle is that
the PHTree scheme as described in the link you cited has already
become "last gen" and that, after roughly ten years of searching, I am
cautiously optimistic that I have discovered a satisfactory next gen
indexing scheme with the properties I was seeking. This is what
Hirofumi and I have been busy prototyping and testing for the last few
weeks. More below...

On Thu, Mar 21, 2013 at 6:57 PM, Dave Chinner <[email protected]> wrote:
> On Wed, Mar 20, 2013 at 06:49:49PM -0700, Daniel Phillips wrote:
>> At the moment we're being pretty quiet because of being in the middle
>> of developing the next-gen directory index. Not such a small task, as
>> you might imagine.
>
> The "next-gen directory index" comment made me curious. I wanted to
> know if there's anything I could learn from what you are doing and
> whether anything of your new algorithms could be applied to, say,
> the XFS directory structure to improve it.
>
> I went looking for design docs and found this:
>
> http://phunq.net/pipermail/tux3/2013-January/001938.html
>
> In a word: Disappointment.

Me too. While I convinced myself that the PHTree scheme would scale
significantly better than HTree, and would be only modestly slower in
the smaller range (millions of files), even the new scheme began
hitting significant difficulties in the form of write multiplication
in the larger range (billions of files). Most probably,
you discovered the same thing. The problem is not so much about
randomly thrashing the index, because these days even a cheap desktop
can cache the entire index, but rather getting the index onto disk
with proper atomic update at reasonable intervals. We can't accept a
situation where crashing on the 999,999,999th file create requires the
entire index to be rebuilt, or even a significant portion of it. That
means we need ACID commit at normal intervals all the way through the
heavy create load, and unfortunately, that's where the write
multiplication issue rears its ugly head. It turned out that most of
the PHTree index blocks would end up being written to disk hundreds of
times each, effectively stretching out what should be a 10-minute test
to hours.
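
A back-of-the-envelope simulation shows the shape of the problem (all
numbers invented for illustration): with hash-distributed inserts, nearly
every index leaf is dirty in every commit interval, so over a long create
run each leaf ends up rewritten roughly once per commit:

#include <stdio.h>
#include <stdlib.h>

#define LEAVES        10000       /* index leaf blocks once fully built */
#define CREATES       10000000    /* file creates in the benchmark */
#define COMMIT_EVERY  100000      /* creates per atomic commit */

int main(void)
{
    static unsigned char dirty[LEAVES];
    long long writes = 0;

    srand(1);
    for (long long i = 1; i <= CREATES; i++) {
        dirty[rand() % LEAVES] = 1;     /* hashed name hits a random leaf */
        if (i % COMMIT_EVERY == 0) {    /* atomic commit: flush dirty leaves */
            for (int b = 0; b < LEAVES; b++) {
                writes += dirty[b];
                dirty[b] = 0;
            }
        }
    }
    printf("leaf writes: %lld (a single final flush would be %d)\n",
           writes, LEAVES);
    printf("each leaf written about %lld times\n", writes / LEAVES);
    return 0;
}

With those made-up numbers the run does on the order of a hundred times
more leaf writes than the final index size, which is the same kind of
multiplication we were seeing.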

To solve this, I eventually came up with a secondary indexing scheme
that would kick in under heavy file create load, to take care of
committing enough state to disk at regular intervals that remounting
after a crash would only lose a few seconds of work. With this, PHTree
would satisfy all the performance goals we set out for it, which can
be summarized as: scale smoothly all the way from one file per
directory to one billion files per directory.

The only really distasteful element remaining was the little matter of
having two different directory indexes, the PHTree and the temporary
secondary index. That seems like one index too many. Then the big aha
landed out of the blue: we can actually throw away the primary BTree
and the secondary index will work fine all on its own. So the
secondary index was suddenly promoted to a "next gen" primary index.
This new index (which does not yet have a name) is based on hashing
and sharding and has nothing whatsoever to do with BTrees. It
currently exists in prototype with enough implementation in place to
get some early benchmarks, but we are not quite ready to provide those
details yet. You are completely welcome to call me a tease or be
sceptical, which is just what I would do coming from the other side,
but just now we're in the thick of the heavy work, and the key as we
see it is to keep on concentrating until the time is right. After all,
this amounts to the end of a ten-year search that began around the
time HTree went into service in Ext3. Another couple of weeks hardly
seems worth worrying about.
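
I won't describe our design here, but purely as a generic illustration of
the hashing-and-sharding flavor (invented structures, emphatically not the
actual Tux3 index): the top bits of a name hash pick a shard, a full shard
table is split by taking one more hash bit, and no btree appears anywhere:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

struct entry {
    char name[32];
    struct entry *next;
};

struct shard {
    struct entry *entries;
    int count;
};

static struct shard *shards;
static int shard_bits = 1;              /* number of hash bits in use */

static uint64_t hash_name(const char *s)
{
    uint64_t h = 14695981039346656037ULL;

    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 1099511628211ULL;
    }
    return h;
}

static int shard_of(uint64_t hash)
{
    return hash >> (64 - shard_bits);
}

/* Double the shard table, redistributing entries by one more hash bit. */
static void split_shards(void)
{
    int old_n = 1 << shard_bits++;
    struct shard *grown = calloc(1 << shard_bits, sizeof(*grown));

    for (int i = 0; i < old_n; i++) {
        struct entry *e = shards[i].entries;
        while (e) {
            struct entry *next = e->next;
            struct shard *s = &grown[shard_of(hash_name(e->name))];
            e->next = s->entries;
            s->entries = e;
            s->count++;
            e = next;
        }
    }
    free(shards);
    shards = grown;
}

static void create(const char *name)
{
    struct shard *s = &shards[shard_of(hash_name(name))];
    struct entry *e;

    if (s->count >= 4) {                /* tiny threshold for the demo */
        split_shards();
        s = &shards[shard_of(hash_name(name))];
    }
    e = calloc(1, sizeof(*e));
    snprintf(e->name, sizeof(e->name), "%s", name);
    e->next = s->entries;
    s->entries = e;
    s->count++;
}

static int lookup(const char *name)
{
    struct entry *e = shards[shard_of(hash_name(name))].entries;

    for (; e; e = e->next)
        if (!strcmp(e->name, name))
            return 1;
    return 0;
}

int main(void)
{
    char name[32];

    shards = calloc(1 << shard_bits, sizeof(*shards));
    for (int i = 0; i < 1000; i++) {
        snprintf(name, sizeof(name), "file%d", i);
        create(name);
    }
    printf("shard bits now: %d\n", shard_bits);
    printf("lookup file123: %s\n", lookup("file123") ? "found" : "missing");
    return 0;
}

Whether anything like that resembles what we actually built is exactly the
detail we are holding back for now.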

> Compared to the XFS directory structure, the most striking
> architectural similarity that I see is this:
>
> "the file bteee[sic] effectively is a second directory index
> that imposes a stable ordering on directory blocks".
>
> That was the key architectural innovation in the XFS directory
> structure that allowed it to provide the correct seekdir/telldir/NFS
> readdir semantics and still scale. i.e. virtually mapped directory
> entries. I explained this layout recently here:
>
> http://marc.info/?l=linux-ext4&m=136081996316453&w=2
> http://marc.info/?l=linux-ext4&m=136082221117399&w=2
> http://marc.info/?l=linux-ext4&m=136089526928538&w=2
>
> We could swap the relevant portions of your PHTree design doc with
> my comments (and vice versa) and both sets of references would still
> make perfect sense. :P
>
> Further, the PHTree description of tag based freespace tracking is
> rather close to how XFS uses tags to track free space regions,
> including the fact that XFS can be lazy at updating global free
> space indexes. The global freespace tree indexing is slightly
> different to the XFS method - it's closer to the original V1 dir
> code in XFS (that didn't scale at all well) than the current code.
> However, that's really a fine detail compared to all the major
> structural and algorithmic similarities.
>
> Hence it appears to me that at a fundamental level PHTree is just a
> re-implementation of the XFS directory architecture. It's definitely
> a *major* step forward from HTree, but it can hardly be considered
> revolutionary or "next-gen". It's not even state of the art. Hence:
> disappointment.

Insightful indeed, and right on the money. I had no idea we were
reinventing XFS to that extent and would love to spend some time later
dwelling on the details. But at this point, I am willing to cautiously
characterize all that as history, based on the performance numbers we
are seeing from the next gen prototype. We plan to publish details
fairly soon. I will apologize again for the lag. I can only plead that
this kind of work just seems to take more time than it reasonably
should.

Regards,

Daniel