I'd like to give some of my thoughts regarding tux2 (phase-change-tree fs):
* FILES *
If a file's data has been changed, it suffices to update the inode and the
of free blocks bitmap (fbb).
But updating them in one go is not possible - the fbb is located at the
superblock, the inode can be (nearly) anywhere on disk.
- Updating another inode and rewriting the directory isn't possible.
This would change the inode number which can be referenced in more than one
directories; and some programs (nfs) rely on the inode number.
- So we have to have two inode spaces per inode, which are switched with
the superblock.
To do this we give a generation counter into the inode, which is checked
against the latest valid superblock.
(Alternatively, we could use the modification time of the inode and have a
modification time in the superblock).
Of two inode spaces we'd regard the one as active, which has a higher
generation counter/mtime than the other, but only if it's lower or equal
than the one of the superblock.
* FILESYSTEM STRUCTURE / SUPERBLOCKs *
As you wrote it's necessary to have several versions of superblocks around.
As the fbb MUST be updated in sync with the superblocks, they should be
near to each other - and the superblock should be later on disk to be able
to write the fbb first and the mark this version as active without a seek.
Some calculation:
Let's assume a 4kB blocksize.
4kB are 32kBit, which multiplied with 4kB blocksize give 128MB of
adressable memory. So we need one 4kB block for every 128MB of fs size.
So 12.8GB would amount to 100 blocks or 400kB.
- Every copy/version (3 or 4) of the superblock has it's own fbb.
This amounts on the above example of 12.8 GB (with 4 versions) to 1/8192 of
the entire fs space - I think that's ok, especially if there will be more
space needed for the inodes and the partly duplicated other data.
- Every single fbb/sb version is (on the harddisk) a linked list with the
following structure:
- fbb
- magic number
- reference to the next block (block number)
- entries of fbb.
This takes 2*32Bit or 2*64Bit of the 4kB.
- sb: all normal sb entries, enlarged with
- a generation counter/mtime.
- a reference to the next assumed fbb/sb location.
I'd like to make a linked list on the harddisk in order to have this
filesystem dynamically resizeable (at least enlargeable):
- a new phase is generated, with more fbb space and so on.
- a sufficiently large space is taken from the harddisk (which is easy,
since the harddisk is now bigger) - preferable continously
- the old phase is written to disk, with the sb "next reference" pointing
to the new space.
- if the old phase is completed, the new phase is populated like normal
operation OR immediately phased again.
As soon as this new fbb/sb block is written on disk, the newest version of
this filesystem says the new size - voila! Online resizing!
I think there are some loose ends in there.
All comments welcome.
Regards,
Phil
On Tue, 3 Jul 2001, Ph. Marek wrote:
> If a file's data has been changed, it suffices to update the inode and the
> of free blocks bitmap (fbb).
> But updating them in one go is not possible
You seem to have missed some fundamental understanding of
exactly how phase tree works; the wohle point of phase
tree is to make atomic updates like this possible!
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/ http://distro.conectiva.com/
Send all your spam to [email protected] (spam digging piggy)
>> If a file's data has been changed, it suffices to update the inode and the
>> of free blocks bitmap (fbb).
>> But updating them in one go is not possible
>
>You seem to have missed some fundamental understanding of
>exactly how phase tree works; the wohle point of phase
>tree is to make atomic updates like this possible!
Well, my point was, that with several thousand inodes spread over the disk
it won't always be possible to update the inode AND the fbb in one go.
So I proposed the 2nd inode with generation counter!
Regards,
Phil
On Wednesday 04 July 2001 08:16, Ph. Marek wrote:
> >> If a file's data has been changed, it suffices to update the inode and
> >> the of free blocks bitmap (fbb).
> >> But updating them in one go is not possible
> >
> >You seem to have missed some fundamental understanding of
> >exactly how phase tree works; the wohle point of phase
> >tree is to make atomic updates like this possible!
>
> Well, my point was, that with several thousand inodes spread over the disk
> it won't always be possible to update the inode AND the fbb in one go.
> So I proposed the 2nd inode with generation counter!
The cool thing is, it *is* possible, read how here:
http://nl.linux.org/~phillips/tux2/phase.tree.tutorial.html
--
Daniel
>> >> If a file's data has been changed, it suffices to update the inode and
>> >> the of free blocks bitmap (fbb).
>> >> But updating them in one go is not possible
>> >
>> >You seem to have missed some fundamental understanding of
>> >exactly how phase tree works; the wohle point of phase
>> >tree is to make atomic updates like this possible!
>>
>> Well, my point was, that with several thousand inodes spread over the disk
>> it won't always be possible to update the inode AND the fbb in one go.
>> So I proposed the 2nd inode with generation counter!
>
>The cool thing is, it *is* possible, read how here:
>
> http://nl.linux.org/~phillips/tux2/phase.tree.tutorial.html
Well, ok. Your split the inode "files" too.
Hmmm...
That sound more complex than my version (at least now, until I've seen the
implementation - maybe it's easier because it has less special cases than
mine).
And of course the memory usage on the harddisk is much less with your
version as you split your inode data and don't have it duplicated.
Well, I hope to see an implementation soon - I'd like to help, even if it's
only testing.
Thanks for the answer!
Regards,
Phil
On Wednesday 04 July 2001 09:53, Ph. Marek wrote:
> >> Well, my point was, that with several thousand inodes spread over the
> >> disk it won't always be possible to update the inode AND the fbb in one
> >> go. So I proposed the 2nd inode with generation counter!
> >
> >The cool thing is, it *is* possible, read how here:
> >
> > http://nl.linux.org/~phillips/tux2/phase.tree.tutorial.html
>
> Well, ok. Your split the inode "files" too.
>
> Hmmm...
> That sound more complex than my version (at least now, until I've seen the
> implementation - maybe it's easier because it has less special cases than
> mine).
Yes, it's more complex, but not horribly so. It's a lot more efficient, and
that's the point.
> And of course the memory usage on the harddisk is much less with your
> version as you split your inode data and don't have it duplicated.
Yep.
> Well, I hope to see an implementation soon - I'd like to help, even if it's
> only testing.
See you on the list. By the way, if you want to help out right now, could
you run some benchmarks my latest early flush patch? See "[RFC] Early Flush
with Bandwidth Estimation".
--
Daniel