LinuxLists.cc - Dot and dotdot need to be physically present?

2008-08-13 23:37:01

Subject: Dot and dotdot need to be physically present?

Hi Ted,

Many years ago we had a discussion about whether or not the . and ..
directory entries had to be physically present in htree, and I remember
the conclusion was that they had to be, but I forget the argument and
lost track of the email thread. I think the VFS will happily supply
the . and .. entries to getdents on its own. So what was the issue?
Something about telldir?

This is in relation to implementing a cleaner successor to htree that
could end up being useful to Ext4 and Lustre as well as Tux3. See
"phtree" here:

http://tux3.org/design.html

Regards,

Daniel

p.s. Wow, you still have your MIT email. There must be a war story.

2008-08-14 03:47:04

by Theodore Ts'o

Subject: Re: Dot and dotdot need to be physically present?

On Wed, Aug 13, 2008 at 04:36:59PM -0700, Daniel Phillips wrote:
>
> Many years ago we had a discussion about whether or not the . and ..
> directory entries had to be physically present in htree, and I remember
> the conclusion was that they had to be, but I forget the argument and
> lost track of the email thread. I think the VFS will happily supply
> the . and .. entries to getdents on its own. So what was the issue?
> Something about telldir?

. and .. are needed for backwards compatibility. If you aren't going
to do backwards compatibility, then you might as well not bother
putting the btree in the directory nodes. Just use physical block
numbers directly.

The other reason why '..' is useful is that it helps to knit the
filesystem back together in case of corruption. (For example, e2fsck
uses the '..' so we can display full pathnames which is very helpful
to system administrators.)

The '.' pointer is slightly less useful, but it is helpful as an
additional sanity check.

If I were doing things all over in a completely incompatible way, I'd
probably put at the beginning of the first directory block (a) a magic
number, (b) the current inode number (as a sanity check), (c) the
parent inode number (i.e., '..'), and (d) a pointer to a physical
block which is the root of the index tree.
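
A minimal sketch of such a header in C, for illustration only. The
struct name, field widths, magic value and error codes are all
assumptions rather than anything proposed in this thread, and
on-disk byte order is ignored for brevity:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical magic value; not taken from any real filesystem. */
#define DIRHEAD_MAGIC 0x44495248u

/* Hypothetical header at the start of the first directory block. */
struct dir_head {
	uint32_t magic;      /* (a) marks the block as a directory head */
	uint32_t self_ino;   /* (b) inode of this directory, sanity check */
	uint32_t parent_ino; /* (c) parent inode, stands in for '..' */
	uint32_t pad;        /* keeps index_root 8-byte aligned */
	uint64_t index_root; /* (d) physical block of the index root */
};

/*
 * Copy the header out of a raw block and sanity check it.
 * Returns 0 on success, a negative value if the block is too small,
 * has the wrong magic, or belongs to a different inode.
 */
int dir_head_read(const void *block, size_t block_size,
		  uint32_t expected_ino, struct dir_head *out)
{
	assert(block != NULL && out != NULL);
	if (block_size < sizeof(*out))
		return -1;
	memcpy(out, block, sizeof(*out)); /* avoids alignment assumptions */
	if (out->magic != DIRHEAD_MAGIC)
		return -2;
	if (out->self_ino != expected_ino)
		return -3;
	return 0;
}
```

A checker could call dir_head_read() on the first block of each
directory and follow parent_ino upward to rebuild pathnames.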

- Ted

2008-08-20 23:52:08

by Daniel Phillips

Subject: Re: Dot and dotdot need to be physically present?

Hi Ted,

Sorry for the lag, I was a little busy.

On Wednesday 13 August 2008 20:47, Theodore Tso wrote:
> On Wed, Aug 13, 2008 at 04:36:59PM -0700, Daniel Phillips wrote:
> >
> > Many years ago we had a discussion about whether or not the . and ..
> > directory entries had to be physically present in htree, and I remember
> > the conclusion was that they had to be, but I forget the argument and
> > lost track of the email thread. I think the VFS will happily supply
> > the . and .. entries to getdents on its own. So what was the issue?
> > Something about telldir?
>
> . and .. are needed for backwards compatibility.

Thank you, I think I remember now. We had to put . and .. in there to
be able to fall back from indexed to linear scan on old kernels that
know nothing about the index. So my inclination is to leave these out
of the dirent data proper but record them in block headers for
redundancy as you suggest.

> If you aren't going
> to do backwards compatibility, then you might as well not bother
> putting the btree in the directory nodes. Just use physical block
> numbers directly.

Even without any backward compatibility requirement, putting the btree
into a file is a win:

* CPU: for a terabyte volume each radix tree lookup requires 6
dereferences (2^6 fanout) vs 0, 1 or 2 for a modest sized directory
mapped logically in the page cache. This matters because CPU is the
main cost and cause of latency for big directories that are small
enough to fit in cache. (On page cache miss the logical mapping
needs one extra radix tree probe, but these are orders of magnitude
rarer than hits.)

* Index fanout for a file-mapped btree is 2^9 while a direct mapped
btree is less, probably 2^8. Less to load, less cache pressure.

* Deferred allocation is harder with physical block pointers because
you have to choose a physical address before you can put data in
the buffer cache. With the page cache, this decision can be
deferred till sync time, when better information is available.

* No need to implement new physical block goal algorithms; the file
locality algorithms will already do the right thing (if possible!)
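
The depth and fanout figures in the first two points can be checked
with a little arithmetic. The sketch below is only illustrative and
its inputs are assumptions: a 31-bit physical index (a terabyte
counted in 512-byte sectors; with 4K blocks the index is 28 bits and
the result is 5 levels), and 8-byte versus 16-byte index entries in a
4K block to give 2^9 and 2^8 fanout. The function names are
hypothetical.

```c
#include <assert.h>

/*
 * Levels of radix tree needed to resolve an index that is index_bits
 * wide when each node resolves fanout_bits of it (rounded up).
 */
unsigned radix_levels(unsigned index_bits, unsigned fanout_bits)
{
	assert(fanout_bits > 0);
	return (index_bits + fanout_bits - 1) / fanout_bits;
}

/* Number of index entries that fit in one block. */
unsigned index_fanout(unsigned block_size, unsigned entry_size)
{
	assert(entry_size > 0);
	return block_size / entry_size;
}
```

With 2^6 fanout, radix_levels(31, 6) gives 6 for the physical case,
while a directory of one block, up to 2^6 blocks, or up to 2^12
blocks needs 0, 1 or 2 levels respectively.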

> The other reason why '..' is useful is that it helps to knit the
> filesystem back together in case of corruption. (For example, e2fsck
> uses the '..' so we can display full pathnames which is very helpful
> to system administrators.)
>
> The '.' pointer is slightly less useful, but it is helpful as an
> additional sanity check.
>
> If I were doing things all over in a completely incompatible way, I'd
> probably put at the beginning of the first directory block (a) a magic
> number, (b) the current inode number (as a sanity check), (c) the
> parent inode number (i.e., '..'), and (d) a pointer to a physical
> block which is the root of the index tree.

Sensible, and I will do it much like that, but probably in every dirent
block, not just the first one. Recording the physical root of the tree
seems like overkill since the inode number will be there, giving the
root index block via the inode table. That is, if a physical pointer
design were to be used. For a logical mapping the directory index root is
always block zero, a simplification that is not possible with physical
pointers.

The directory index itself can be reconstructed on demand, so adding
redundancy only for fsck reconstruction does not seem like a win. We
just want to be able to spot the raw dirents reliably. To that end, a
commit sequence number might be helpful as well, to reduce the chance
of mistaking stale or migrated blocks, or plain data blocks, for live
dirent blocks.
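
One way a per-block header with a commit sequence number could be
used when scanning for raw dirent blocks is sketched here. This is
only a rough illustration: the struct, the magic value and the rule
"reject anything newer than the last commit, prefer the later of two
candidates" are assumptions, not a settled design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical magic value for a dirent block. */
#define DIRENT_BLOCK_MAGIC 0x50485442u

/* Hypothetical header carried by every dirent block. */
struct dirent_block_head {
	uint32_t magic;      /* marks the block as holding dirents */
	uint32_t dir_ino;    /* directory that owns the block */
	uint32_t parent_ino; /* parent directory, stands in for '..' */
	uint32_t pad;        /* keeps commit_seq 8-byte aligned */
	uint64_t commit_seq; /* commit that last wrote the block */
};

/* Nonzero if the block could be a live dirent block of dir_ino. */
int dirent_block_plausible(const struct dirent_block_head *h,
			   uint32_t dir_ino, uint64_t last_commit)
{
	assert(h != NULL);
	return h->magic == DIRENT_BLOCK_MAGIC &&
	       h->dir_ino == dir_ino &&
	       h->commit_seq <= last_commit; /* nothing from the future */
}

/*
 * Of two candidates for the same logical block (either may be NULL),
 * return the plausible one with the later commit, or NULL if neither
 * is plausible.
 */
const struct dirent_block_head *
pick_current(const struct dirent_block_head *a,
	     const struct dirent_block_head *b,
	     uint32_t dir_ino, uint64_t last_commit)
{
	int ok_a = a && dirent_block_plausible(a, dir_ino, last_commit);
	int ok_b = b && dirent_block_plausible(b, dir_ino, last_commit);

	if (ok_a && ok_b)
		return a->commit_seq >= b->commit_seq ? a : b;
	if (ok_a)
		return a;
	if (ok_b)
		return b;
	return NULL;
}
```

A stale copy left behind by block migration would then lose to the
copy written by the later commit, and a data block would normally
fail the magic and inode checks.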

Regards,

Daniel