> [1.] One line summary of the problem:
> telldir() broken on large ext3 dir_index'd directories because
> getdents() gives d_off==0 for the first entry
Here's a patch which fixes the problem, but note the following warning
from the readdir man page:
According to POSIX, the dirent structure contains a field char d_name[]
of unspecified size, with at most NAME_MAX characters preceding the
terminating null character. Use of other fields will harm the porta-
bility of your programs.
Also, as always, telldir() and seekdir() are truly awful interfaces
because they implicitly assume that (a) a directory is a linear data
structure, and (b) that the position in a directory can be expressed
in a cookie which hsa only 31 bits on 32-bit systems.
So there will be hash colliions that will cause programs that assume
that seekdir(dirent->d_off) will always return the next directory
entry to sometimes lose directory entries in the
not-as-unlikely-as-we-would wish case of a 31-bit hash collision.
Really, any program which is using telldir/seekdir really should be
rewritten to not use these interfaces if at all possible. So with
these caveats....
- Ted
Here is a patch which causes d_off of '.' to be 1, and for seekdir(1)
to cause readdir to return the directory entry of '..'.
Signed-off-by: "Theodore Ts'o" <[email protected]>
===== fs/ext3/namei.c 1.59 vs edited =====
--- 1.59/fs/ext3/namei.c 2004-10-19 05:40:30 -04:00
+++ edited/fs/ext3/namei.c 2004-11-17 17:14:06 -05:00
@@ -610,10 +610,15 @@ int ext3_htree_fill_tree(struct file *di
de = (struct ext3_dir_entry_2 *) frames[0].bh->b_data;
if ((err = ext3_htree_store_dirent(dir_file, 0, 0, de)) != 0)
goto errout;
+ count++;
+ start_hash=2;
+ }
+ if (start_hash==2 && start_minor_hash==0) {
+ de = (struct ext3_dir_entry_2 *) frames[0].bh->b_data;
de = ext3_next_entry(de);
- if ((err = ext3_htree_store_dirent(dir_file, 0, 0, de)) != 0)
+ if ((err = ext3_htree_store_dirent(dir_file, 2, 0, de)) != 0)
goto errout;
- count += 2;
+ count++;
}
while (1) {
Hi,
On Wed, 2004-11-17 at 22:34, Theodore Ts'o wrote:
> Here is a patch which causes d_off of '.' to be 1, and for seekdir(1)
> to cause readdir to return the directory entry of '..'.
Doesn't this make things worse? The old problem was that
seekdir/telldir were broken (which we already knew for certain cases
like hash collisions). But with...
> + start_hash=2;
don't we end up silently ignoring all dirents with a major hash <= 1,
even for unbroken getdents() with no intervening seekdir? Previously
we'd at least fill them into the rbtree, so sequential readdir would
find them even if the hash collided. Now we'll skip them entirely.
If we're going to do this, I think we need to stuff . and .. into the
rbtree with the right hashes, but without ignoring other existing
dirents with colliding hashes.
--Stephen
On Thu, Nov 18, 2004 at 12:00:05AM +0000, Stephen C. Tweedie wrote:
>
> Doesn't this make things worse?
> don't we end up silently ignoring all dirents with a major hash <= 1,
> even for unbroken getdents() with no intervening seekdir?
Oops, yes, I screwed up.
> If we're going to do this, I think we need to stuff . and .. into the
> rbtree with the right hashes, but without ignoring other existing
> dirents with colliding hashes.
We can't just do that, because there are programs that's assume '.'
and '..' are the first and second entries in the directory. Yes, they
are broken and non-portable, but so are programs that depend on
d_off....
So instead what we need to do is wire '.' and '..' to have hash values
of (0,0) and (2,0), respectively, without ignoring other existing
dirents with colliding hashes. (In those cases the programs will
break, but they are statistically rare, and there's not much we can do
in those cases anyway.)
This patch should do this.
Signed-off-by: "Theodore Ts'o" <[email protected]>
--- 1.59/fs/ext3/namei.c 2004-10-19 05:40:30 -04:00
+++ edited/fs/ext3/namei.c 2004-11-17 23:05:35 -05:00
@@ -610,10 +610,14 @@ int ext3_htree_fill_tree(struct file *di
de = (struct ext3_dir_entry_2 *) frames[0].bh->b_data;
if ((err = ext3_htree_store_dirent(dir_file, 0, 0, de)) != 0)
goto errout;
+ count++;
+ }
+ if (start_hash < 2 || (start_hash ==2 && start_minor_hash==0)) {
+ de = (struct ext3_dir_entry_2 *) frames[0].bh->b_data;
de = ext3_next_entry(de);
- if ((err = ext3_htree_store_dirent(dir_file, 0, 0, de)) != 0)
+ if ((err = ext3_htree_store_dirent(dir_file, 2, 0, de)) != 0)
goto errout;
- count += 2;
+ count++;
}
while (1) {
>So instead what we need to do is wire '.' and '..' to have hash values
>of (0,0) and (2,0), respectively, without ignoring other existing
>dirents with colliding hashes. (In those cases the programs will
>break, but they are statistically rare, and there's not much we can do
>in those cases anyway.)
IMO it's better to fix the mess all at once to have it weeded out for some
months.
Jan Engelhardt
--
Gesellschaft für Wissenschaftliche Datenverarbeitung
Am Fassberg, 37077 Göttingen, http://www.gwdg.de
On Thu, Nov 18, 2004 at 12:22:38PM +0100, Jan Engelhardt wrote:
> >So instead what we need to do is wire '.' and '..' to have hash values
> >of (0,0) and (2,0), respectively, without ignoring other existing
> >dirents with colliding hashes. (In those cases the programs will
> >break, but they are statistically rare, and there's not much we can do
> >in those cases anyway.)
>
> IMO it's better to fix the mess all at once to have it weeded out for some
> months.
Programs that assume that '.' and '..' are the first and second
entries of a directory are intrinsically broken; POSIX never
guaranteed this to be the case. Unfortunately, historically things
have always worked that way, and so there may be some broken programs
lurking out there. But there's really not much we can do.
Before, we hard-wired '.' and '..' to always be first, at the cost of
breaking programs that used the (broken by design) POSIX
telldir/seekdir interfaces. Since telldir/seekdir, however badly
designed, are part of POSIX, it seems appropriate to let those
programs work, but the cost is a statistical probability that programs
making assumptions about the order of '.' and '..' will break. We
don't really have a choice here.
(Actually, I guess we could define a new hash function that never
produces certain hash values, but that would break compatibility with
all existing deployed filesystems that use ext3 htree. That's not an
option, either. So again, making a best effort, but breaking programs
that are fundamentally broken is the best we can do.)
- Ted
Hi,
On Thu, 2004-11-18 at 04:53, Theodore Ts'o wrote:
> > If we're going to do this, I think we need to stuff . and .. into the
> > rbtree with the right hashes, but without ignoring other existing
> > dirents with colliding hashes.
>
> We can't just do that, because there are programs that's assume '.'
> and '..' are the first and second entries in the directory. Yes, they
> are broken and non-portable, but so are programs that depend on
> d_off....
Sorry, by "right hashes" I meant adding them with hashes 0 and 2 but in
the correct place in the stream; so if we do have a hash collision on
major-hash==0, we'll get ".." slightly out of order, but will still
correctly provide all the entries. And your second patch seems to do
exactly that. Thanks!
> This patch should do this.
Looks good to me --- have you tested it much?
The only remaining thing I can think of is what happens if the while()
loop at the end of ext3_htree_fill_tree() exits successfully after
filling in hash-major==0. Then we'll restart at 2 next time, and will
return ".." twice.
I'm not sure that's actually possible, though. Moving the filling-in of
".." into the while loop would deal with this rare possibility.
--Stephen
>> >So instead what we need to do is wire '.' and '..' to have hash values
>> >of (0,0) and (2,0), respectively, without ignoring other existing
>> >dirents with colliding hashes. (In those cases the programs will
>> >break, but they are statistically rare, and there's not much we can do
>> >in those cases anyway.)
>>
>> IMO it's better to fix the mess all at once to have it weeded out for some
>> months.
>
>Programs that assume that '.' and '..' are the first and second [...]
>But there's really not much we can do.
>
>Before, we hard-wired '.' and '..' to always be first [...]
>We don't really have a choice here.
>
>(Actually, I guess we could define a new hash function [...]
>So again, making a best effort, but breaking programs
>that are fundamentally broken is the best we can do.)
Looks like there's only way -- or: two ways and one choice:
- Either leave it all as it is ATM (kind of unsatisfying) or
- (unintentionally) break all apps now, or announce that they could brake if they
relied on special features (as outlined above), and at the same time implement
proper directory traversal.
Jan Engelhardt
--
Gesellschaft für Wissenschaftliche Datenverarbeitung
Am Fassberg, 37077 Göttingen, http://www.gwdg.de