2001-02-22 23:08:48

by Bill Crawford

Subject: Hashing and directories

I was hoping to point out that in real life, most systems that
need to access large numbers of files are already designed to do
some kind of hashing, or at least to divide-and-conquer by using
multi-level directory structures.

A particular reason for this, apart from filesystem efficiency,
is to make it easier for people to find things, as it is usually
easier to spot what you want amongst a hundred things than among
a thousand or ten thousand.

A couple of practical examples from work here at Netcom UK (now
Ebone :) would be, say, DNS zone files or user authentication data.
We use Solaris and NFS a lot, too, so large directories are a bad
thing in general for us; we tend to subdivide things using a very
simple scheme: taking the first letter, and then sometimes the second
letter or a pair of letters, from the filename. This works extremely
well in practice and, as mentioned above, provides some positive
side-effects.
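
Purely for illustration, a rough sketch of that kind of bucketing (a
made-up helper, not our actual code); with a depth of two,
"zone-example.co.uk" would land under z/zo/:

/* Purely illustrative sketch of the first-letter(s) bucketing described
 * above -- a hypothetical helper, not our actual code.  With depth 2,
 * "zone-example.co.uk" maps to "z/zo/zone-example.co.uk". */
#include <stdio.h>
#include <string.h>

static void bucket_path(const char *name, int depth,
                        char *out, size_t outlen)
{
        size_t len = strlen(name);
        size_t pos = 0;
        int i;

        for (i = 1; i <= depth && (size_t)i <= len
                 && pos + (size_t)i + 1 < outlen; i++) {
                memcpy(out + pos, name, (size_t)i);   /* first i letters */
                pos += (size_t)i;
                out[pos++] = '/';                     /* one directory level */
        }
        snprintf(out + pos, outlen - pos, "%s", name);
}

int main(void)
{
        char path[256];

        bucket_path("zone-example.co.uk", 2, path, sizeof(path));
        printf("%s\n", path);    /* prints z/zo/zone-example.co.uk */
        return 0;
}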

So I don't think it would actually be sensible to encourage anyone
to use massive directories for too many tasks. It also makes human
intervention on a broken system harder, for example, if it takes a
long time to find the file you're looking for.

I guess what I really mean is that I think Linus' strategy of
generally optimizing for the "usual case" is a good thing. It
is actually quite annoying in general to have that many files in
a single directory (think \winnt\... here). So maybe it would
be better to focus on the normal situation of, say, a few hundred
files in a directory rather than thousands ...

I still think it's a good idea to do anything you can to speed
up large directory operations on ext2 though :)

On the plus side, hashes or anything resembling a tree structure
would tend to speed up insertion and removal of entries in even
moderately sized directories, which would probably be a net gain for
many folks.

--
/* Bill Crawford, Unix Systems Developer, ebOne, formerly GTS Netcom */
#include "stddiscl.h"


2001-02-22 23:23:22

by H. Peter Anvin

Subject: Re: Hashing and directories

Bill Crawford wrote:
>
> A particular reason for this, apart from filesystem efficiency,
> is to make it easier for people to find things, as it is usually
> easier to spot what you want amongst a hundred things than among
> a thousand or ten thousand.
>
> A couple of practical examples from work here at Netcom UK (now
> Ebone :), would be say DNS zone files or user authentication data.
> We use Solaris and NFS a lot, too, so large directories are a bad
> thing in general for us, so we tend to subdivide things using a
> very simple scheme: taking the first letter and then sometimes
> the second letter or a pair of letters from the filename. This
> actually works extremely well in practice, and as mentioned above
> provides some positive side-effects.
>

This is sometimes feasible, but sometimes it is a hack with painful
consequences in the form of software incompatibilities.

> I guess what I really mean is that I think Linus' strategy of
> generally optimizing for the "usual case" is a good thing. It
> is actually quite annoying in general to have that many files in
> a single directory (think \winnt\... here). So maybe it would
> be better to focus on the normal situation of, say, a few hundred
> files in a directory rather than thousands ...

Linus' strategy is to not let optimizations for uncommon cases penalize
the common case. However, I think we can make an improvement here that
will work well even on moderate-sized directories.

My main problem with the fixed-depth tree proposal is that it seems to
work well for a certain range of directory sizes, but the range seems a
bit arbitrary. The case of very small directories is quite important,
too.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-22 23:54:18

by Bill Crawford

Subject: Re: Hashing and directories

"H. Peter Anvin" wrote:
> Bill Crawford wrote:
...
> > We use Solaris and NFS a lot, too, so large directories are a bad
> > thing in general for us, so we tend to subdivide things using a
> > very simple scheme: taking the first letter and then sometimes
> > the second letter or a pair of letters from the filename. This
> > actually works extremely well in practice, and as mentioned above
> > provides some positive side-effects.

> This is sometimes feasible, but sometimes it is a hack with painful
> consequences in the form of software incompatibilites.

*grin*

We did change the scheme between different versions of our local
software, and that caused one or two small nightmares for me and a
couple of other guys who were developing/maintaining systems here.

...

I don't mind improving performance on big directories -- Solaris
sucks when listing a large directory, for example, but it is rock
solid, which is important where we use it.

My worry is that old thing about giving people enough rope to hang
themselves; I'm humanitarian enough that I don't like doing that.

In other words, if we go out and tell people they can put millions
of files in a directory on Linux+ext2, they'll do it, and then they
are going to be upset because 'ls -l' takes a few minutes :)

> > I guess what I really mean is that I think Linus' strategy of
> > generally optimizing for the "usual case" is a good thing. It
> > is actually quite annoying in general to have that many files in
> > a single directory (think \winnt\... here). So maybe it would
> > be better to focus on the normal situation of, say, a few hundred
> > files in a directory rather than thousands ...

> Linus' strategy is to not let optimizations for uncommon cases penalize
> the common case. However, I think we can make an improvement here that
> will work well even on moderate-sized directories.

That's a good point ... I have mis-stated Linus' intention.
I guess he may be along to tick me off in a minute :)

I have no quibbles with that at all ... improvements to the
general case never hurt, even if the greater gain is elsewhere ...

> My main problem with the fixed-depth tree proposal is that it seems to
> work well for a certain range of directory sizes, but the range seems a
> bit arbitrary. The case of very small directories is quite important,
> too.

Yup.

Sounds like a pretty good idea; however, I would be concerned about
the side-effects of, say, getting a lot of hash collisions from a
pathological data set. Very concerned.

I prefer the idea of a real tree structure ... ReiserFS already
gives very good performance for searching with find, and "rm -rf"
truly is very fast. I would actually like the benefits of that
structure without the journalling overhead for some filesystems;
I'm thinking especially of /usr and /usr/src here ...

> -hpa

> "Unix gives you enough rope to shoot yourself in the foot."

Doesn't it just? That was my fear ...

Anyway, 'nuff said, just wanted to comment from my experiences.

> http://www.zytor.com/~hpa/puzzle.txt

--
/* Bill Crawford, Unix Systems Developer, ebOne, formerly GTS Netcom */
#include "stddiscl.h"

2001-03-01 20:35:14

by Pavel Machek

Subject: Re: Hashing and directories

Hi!

> I was hoping to point out that in real life, most systems that
> need to access large numbers of files are already designed to do
> some kind of hashing, or at least to divide-and-conquer by using
> multi-level directory structures.

Yes -- because they're working around kernel slowness.

I had to do this kind of hashing because the kernel disliked 70000 HTML
files (a copy of train timetables).

BTW try rm * with 70000 files in a directory -- the command line will overflow.

> A particular reason for this, apart from filesystem efficiency,
> is to make it easier for people to find things, as it is usually
> easier to spot what you want amongst a hundred things than among
> a thousand or ten thousand.

Yes? Is it easier to type cat timetab1/2345 than cat timetab12345? With a
bigger command line size, putting it all into *one* directory is definitely easier.


> A couple of practical examples from work here at Netcom UK (now
> Ebone :), would be say DNS zone files or user authentication data.
> We use Solaris and NFS a lot, too, so large directories are a bad
> thing in general for us, so we tend to subdivide things using a
> very simple scheme: taking the first letter and then sometimes
> the second letter or a pair of letters from the filename. This
> actually works extremely well in practice, and as mentioned above
> provides some positive side-effects.

Positive? Try listing all names that contain "linux" with such a scheme.
I'll do ls *linux*. You'll need ls */*linux* ?l/inux* li/nux*. Seems ugly
to me.
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

2001-03-01 20:56:17

by Alexander Viro

Subject: Re: Hashing and directories



On Thu, 1 Mar 2001, Pavel Machek wrote:

> Hi!
>
> > I was hoping to point out that in real life, most systems that
> > need to access large numbers of files are already designed to do
> > some kind of hashing, or at least to divide-and-conquer by using
> > multi-level directory structures.
>
> Yes -- because they're working around kernel slowness.

Pavel, I'm afraid that you are missing the point. Several, actually:
* limits of _human_ capability to deal with large unstructured
sets of objects
* userland issues (what, you thought that limits on the
command size will go away?)
* portability

The point being: applications and users _do_ know better what structure
is there. The kernel can try to second-guess them and be really good at
that, but inability to second-guess is the last of the reasons why the
aforementioned strategies are used.

2001-03-01 21:06:09

by Tigran Aivazian

Subject: Re: Hashing and directories

On Thu, 1 Mar 2001, Alexander Viro wrote:
> * userland issues (what, you thought that limits on the
> command size will go away?)

the space allowed for arguments is not a userland issue; it is a kernel
limit defined by MAX_ARG_PAGES in binfmts.h, so one could tweak it if one
wanted to without breaking any userland.
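
For illustration, a minimal userland check of the resulting limit (a
sketch of mine, not from this thread; on a stock 2.4-era tree the figure
should be MAX_ARG_PAGES * PAGE_SIZE, i.e. 32 pages = 128K on i386, though
libc may just report its compiled-in ARG_MAX constant rather than asking
the running kernel):

/* Minimal sketch: report the exec() argument-space limit as seen from
 * userland.  On a stock 2.4-era setup this is 131072 bytes, matching
 * MAX_ARG_PAGES (32) * PAGE_SIZE (4096) on i386.  Note that libc may
 * return its compiled-in ARG_MAX rather than querying the running
 * kernel, so a tweaked MAX_ARG_PAGES may not show up here. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        long arg_max = sysconf(_SC_ARG_MAX);

        if (arg_max < 0) {
                perror("sysconf(_SC_ARG_MAX)");
                return 1;
        }
        printf("ARG_MAX: %ld bytes (%ld KB)\n", arg_max, arg_max / 1024);
        return 0;
}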

Regards,
Tigran


2001-03-01 21:05:58

by H. Peter Anvin

Subject: Re: Hashing and directories

Alexander Viro wrote:
> >
> > Yes -- because they're working around kernel slowness.
>
> Pavel, I'm afraid that you are missing the point. Several, actually:
> * limits of _human_ capability to deal with large unstructured
> sets of objects

Not an issue if you're a machine.

> * userland issues (what, you thought that limits on the
> command size will go away?)

Last I checked, the command line size limit wasn't a userland issue, but
rather a limit of the kernel exec(). This might have changed.

> * portability

> The point being: applications and users _do_ know better what structure
> is there. Kernel can try to second-guess them and be real good at that, but
> inability to second-guess is the last of the reasons why aforementioned
> strategies are used.

However, there are cases where users and applications don't want to
structure their namespace for the convenience of the kernel, for good or
bad reasons. There are structures which are known to produce vastly
better performance even in the not-very-uncommon cases, although of
course they provide dramatic improvements in the extreme. I personally
happen to like hash-indexed B-trees because of their extremely high
fanout and because they don't impose any penalty at the other extreme (if
your directory is small enough to fit in a single block, you skip the
B-tree and have only the linear non-hash leaf node), but there are other
structures which work as well.

I don't see any fundamental reason not to do such an improvement, except
the one Alan Cox mentioned -- crash recovery -- which I think can be dealt
with: in my example above, as long as the leaf nodes can be recovered, the
tree can be rebuilt. Consider starting each leaf-node block with a magic
number and a pointer to its home inode; combined with a leaf-node block
count in the home inode, that should be quite robust. It would be
particularly nice to implement this as an enhancement to ext3 rather than
ext2, even though the issue is orthogonal, since ext3 should add a layer
of inherent integrity protection.
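
Just as a sketch of that recovery idea (illustrative only, not a real or
proposed on-disk format), each leaf block could start with something like:

/* Illustrative only -- not a real or proposed on-disk format.  The point
 * is that a leaf block is self-identifying: given just the leaves, fsck
 * can reattach them to their directory and rebuild the index above them. */
#include <stdint.h>

#define DIRLEAF_MAGIC 0x4c454146u       /* "LEAF" -- made-up value */

struct dirleaf_header {
        uint32_t magic;          /* DIRLEAF_MAGIC */
        uint32_t home_inode;     /* directory inode this leaf belongs to */
        uint32_t block_index;    /* which leaf of that directory this is */
        uint32_t entry_count;    /* live entries in this block */
        /* ordinary linear (non-hashed) directory entries follow */
};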

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-03-01 21:16:05

by Alexander Viro

Subject: Re: Hashing and directories



On Thu, 1 Mar 2001, H. Peter Anvin wrote:

> > * userland issues (what, you thought that limits on the
> > command size will go away?)
>
> Last I checked, the command line size limit wasn't a userland issue, but
> rather a limit of the kernel exec(). This might have changed.

I _really_ don't want to trust the ability of the shell to deal with long
command lines. I also don't like the failure modes with history expansion
causing OOM, etc.

AFAICS right now we hit the kernel limit first, but I really doubt that
raising said limit is a good idea.

xargs is there for a purpose...
Cheers,
Al

2001-03-01 21:26:32

by Bill Crawford

Subject: Re: Hashing and directories

Before I reply: I apologise for starting this argument, or at least
making it worse, and please let me say again that I really would like
to see improvements in directory searching etc. ... my original point
was simply a half-joking aside to the effect that we should not
encourage people to put thousands of files in a single directory ...

"H. Peter Anvin" wrote:

> > * userland issues (what, you thought that limits on the
> > command size will go away?)

> Last I checked, the command line size limit wasn't a userland issue, but
> rather a limit of the kernel exec(). This might have changed.

Actually this is also a serious problem. We have a ticketing system
at work that stores all its data in files in large directories, and I
discovered recently that I can only pass about a tenth of the current
file names on a command line. Yes, we have xargs, but ...

Also (this happens to be Solaris on an 8 x 40MHz Sparc system) there
is a significant delay just to read the directory.

> > * portability

Also an issue. Fortunately we store a lot of data on NetApps, so
the performance of the filesystem as such is less of an issue; but that
means the size of the data transfer across the network involved in
reading a directory can become an issue too.

> > The point being: applications and users _do_ know better what structure
> > is there. Kernel can try to second-guess them and be real good at that, but
> > inability to second-guess is the last of the reasons why aforementioned
> > strategies are used.

All good points ...

> However, there are issues where users and applications don't want to
> structure their namespace for the convenience of the kernel, for good or
> bad reasons.

But there are other reasons (besides the kernel's and filesystems'
handling of directories and name lookups). I think you're spot on
about the performance issues and on-disk structures, though.

> -hpa

--
/* Bill Crawford, Unix Systems Developer, ebOne, formerly GTS Netcom */
#include "stddiscl.h"

2001-03-01 21:24:52

by H. Peter Anvin

Subject: Re: Hashing and directories

Alexander Viro wrote:
>
> I _really_ don't want to trust the ability of shell to deal with long
> command lines. I also don't like the failure modes with history expansion
> causing OOM, etc.
>
> AFAICS right now we hit the kernel limit first, but I really doubt that
> raising said limit is a good idea.
>

Arbitrary limits are generally bad. Yes, using a very long command line
is usually a bad idea, but there are cases for which it is the only
reasonable way to do something. Categorically blocking them is not a
good idea either.

> xargs is there for purpose...

Well, yes; using xargs is a good idea, not least because it enables
some parallelism that wouldn't otherwise be there.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-03-01 21:26:32

by Andreas Dilger

Subject: Re: Hashing and directories

H. Peter Anvin writes [re hashed directories]:
> I don't see there being any fundamental reason to not do such an
> improvement, except the one Alan Cox mentioned -- crash recovery --
> (which I think can be dealt with; in my example above as long as the leaf
> nodes can get recovered, the tree can be rebuilt.

Actually, with Daniel's implementation, the index blocks will be in
the same file as the directory leaf nodes, so there should be no problem
with losing leaf blocks after a crash (no more so than with the current
ext2 setup).

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-03-02 08:56:41

by Pavel Machek

Subject: Re: Hashing and directories

Hi!

> > * userland issues (what, you thought that limits on the
> > command size will go away?)
>
> the space allowed for arguments is not a userland issue, it is a kernel
> limit defined by MAX_ARG_PAGES in binfmts.h, so one could tweak it if one
> wanted to without breaking any userland.

Which is exactly what I did on my system. 2MB for the command line is
very nice.
Pavel
--
The best software in life is free (not shareware)! Pavel
GCM d? s-: !g p?:+ au- a--@ w+ v- C++@ UL+++ L++ N++ E++ W--- M- Y- R+

2001-03-02 09:01:01

by Pavel Machek

Subject: Re: Hashing and directories

Hi!

> > > I was hoping to point out that in real life, most systems that
> > > need to access large numbers of files are already designed to do
> > > some kind of hashing, or at least to divide-and-conquer by using
> > > multi-level directory structures.
> >
> > Yes -- because they're working around kernel slowness.
>
> Pavel, I'm afraid that you are missing the point. Several, actually:
> * limits of _human_ capability to deal with large unstructured
> sets of objects

Okay, then. Let's take an example where I ran into huge directories.

I'm working with timetables (timetab.sourceforge); at one point, I
downloaded 100000 HTML files, each with one train. I'm able to deal with
that just fine. After all, there is no structure in there. If I
wanted to add structure, I'd have to invent some. I don't want
to. I feel just fine with 100000 files. I type the train's number, and it
works.

Due to kernel/NFS slowness with such huge directories, I had to split
it into several directories (each with 10000 entries). Now it gets
ugly. I was perfectly able to deal with 100000 entries, but having 10
directories is a real pain.

> The point being: applications and users _do_ know better what structure
> is there. Kernel can try to second-guess them and be real good at

And if there's none? Did you take a look at the Netscape cache? It does
hashing to work around kernel slowness. Yes, I think it would be much
nicer if it did not have to work around anything.
Pavel
--
The best software in life is free (not shareware)! Pavel
GCM d? s-: !g p?:+ au- a--@ w+ v- C++@ UL+++ L++ N++ E++ W--- M- Y- R+

2001-03-02 09:04:42

by Pavel Machek

Subject: Re: Hashing and directories

Hi!

> > > * userland issues (what, you thought that limits on the
> > > command size will go away?)
> >
> > Last I checked, the command line size limit wasn't a userland issue, but
> > rather a limit of the kernel exec(). This might have changed.
>
> I _really_ don't want to trust the ability of shell to deal with long
> command lines. I also don't like the failure modes with history expansion
> causing OOM, etc.
>
> AFAICS right now we hit the kernel limit first, but I really doubt that
> raising said limit is a good idea.

I am running with a 2MB limit right now. I doubt 2MB will lead to OOM.

> xargs is there for purpose...

xargs is very ugly. I want to rm 12*. Just plain "rm 12*". *Not* "find
. -name "12*" | xargs rm", which has terrible issues with file names like

"xyzzy"
"bla"
"xyzzy bla"
"12 xyzzy bla"

(plain xargs splits on whitespace, so the last one would make rm delete
the unrelated files "xyzzy" and "bla" as well)!

I do not want to deal with xargs. xargs was made to work around the
limitation on command line size (and is broken in itself). Now that we
have hardware that can handle bigger command lines just fine, xargs
should be killed.
Pavel
--
The best software in life is free (not shareware)! Pavel
GCM d? s-: !g p?:+ au- a--@ w+ v- C++@ UL+++ L++ N++ E++ W--- M- Y- R+

2001-03-02 12:01:36

by Oystein Viggen

Subject: Re: Hashing and directories

Pavel Machek wrote:

> xargs is very ugly. I want to rm 12*. Just plain "rm 12*". *Not* "find
> . -name "12*" | xargs rm, which has terrible issues with files names
>
> "xyzzy"
> "bla"
> "xyzzy bla"
> "12 xyzzy bla"

These you work around using the smarter, \0 terminated, version:

find . -name "12*" -print0 | xargs -0 rm

"This version would also deal with
this filename ;)"

Oystein
--
ssh -c rot13 otherhost

2001-03-02 12:26:30

by Tobias Ringstrom

Subject: Re: Hashing and directories

On 2 Mar 2001, Oystein Viggen wrote:
> Pavel Machek wrote:
> > xargs is very ugly. I want to rm 12*. Just plain "rm 12*". *Not* "find
> These you work around using the smarter, \0 terminated, version:

Another example demonstrating why xargs is not always good (and why a
bigger command line is needed) is when you combine it with e.g. wc:

find . -type f -print0 | xargs -0 wc

You cannot trust the summary line from wc, since xargs may have decided to
run wc many times, and thus you may have many summary lines. If the kernel
allowed a larger command line, you could run

wc `find . -type f`

and get exactly what you want. And if I'm not mistaken, Linux accepts a
much smaller command line than other "unices" such as Solaris.

...but it's not _that_ important... obviously there has to be an upper
limit somewhere...

/Tobias


2001-03-02 12:59:38

by David Weinehall

Subject: Re: Hashing and directories

On Fri, Mar 02, 2001 at 10:04:10AM +0100, Pavel Machek wrote:
> Hi!
>
> > > > * userland issues (what, you thought that limits on the
> > > > command size will go away?)
> > >
> > > Last I checked, the command line size limit wasn't a userland issue, but
> > > rather a limit of the kernel exec(). This might have changed.
> >
> > I _really_ don't want to trust the ability of shell to deal with long
> > command lines. I also don't like the failure modes with history expansion
> > causing OOM, etc.
> >
> > AFAICS right now we hit the kernel limit first, but I really doubt that
> > raising said limit is a good idea.
>
> I am running with 2MB limit right now. I doubt 2MB will lead to OOM.

You know, with a box with 4MB of RAM (or indeed 2MB, which should still
be possible on a Linux system), a 2MB command line is a very effective
DoS :^)

> > xargs is there for purpose...
>
> xargs is very ugly. I want to rm 12*. Just plain "rm 12*". *Not* "find
> . -name "12*" | xargs rm, which has terrible issues with files names
>
> "xyzzy"
> "bla"
> "xyzzy bla"
> "12 xyzzy bla"
>
> !
>
> I do not want to deal with xargs. Xargs was made to workaround
> limitation at command line size (and is broken in itself). Now we have
> hardware that can handle bigger commandlines just fine, xargs should
> be killed.

/David Weinehall
_ _
// David Weinehall <[email protected]> /> Northern lights wander \\
// Project MCA Linux hacker // Dance across the winter sky //
\> http://www.acc.umu.se/~tao/ </ Full colour fire </

2001-03-02 19:34:58

by Tim Wright

Subject: Re: Hashing and directories

On Fri, Mar 02, 2001 at 10:04:10AM +0100, Pavel Machek wrote:
>
> xargs is very ugly. I want to rm 12*. Just plain "rm 12*". *Not* "find
> . -name "12*" | xargs rm, which has terrible issues with files names
>
> "xyzzy"
> "bla"
> "xyzzy bla"
> "12 xyzzy bla"
>

Getting a bit OffTopic(TM) here, but that's why the GNU versions of the tools
wisely added options to output '\0' rather than '\n' as a separator for the
data, i.e.
find . -name '12*' -print0 | xargs -0 rm
does exactly what you want it to - no surprises.

The point about arbitrary limits is, however, well taken. The fact that the
space for exec args and environment historically was static and of a fixed
size is not a good reason to perpetuate the limitation.

Tim

--
Tim Wright - [email protected] or [email protected] or [email protected]
IBM Linux Technology Center, Beaverton, Oregon
Interested in Linux scalability ? Look at http://lse.sourceforge.net/
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI

2001-03-03 00:04:26

by Bill Crawford

Subject: Re: Hashing and directories

Pavel Machek wrote:

> Hi!

> > I was hoping to point out that in real life, most systems that
> > need to access large numbers of files are already designed to do
> > some kind of hashing, or at least to divide-and-conquer by using
> > multi-level directory structures.

> Yes -- because they're working around kernel slowness.

Not just the kernel ... because we use NFS a lot, directory searching is
a fair bit quicker with smaller directories (especially when looking
at things manually).

> I had to do this kind of hashing because kernel disliked 70000 html
> files (copy of train time tables).

> BTW try rm * with 70000 files in directory -- command line will overflow.

Sort of my point, again. There are limits to what is sane.

Another example I have cited -- our ticketing system -- is a good one.
If there is subdivision, it can be easier to search subsets of the data.
Can you imagine a source tree with 10k files, all in one directory? I
think *people* need subdivision more than the machines do, a lot of the
time. Another example would be mailboxes ... I have started to build a
hierarchy of mail folders because I have more than a screenful.

> Yes? Is it easier to type cat timetab1/2345 than cat timetab12345? With a
> bigger command line size, putting it all into *one* directory is definitely easier.

IMO (strictly my own) it is often easier to have things subdivided.
I have had to split up my archive of linux tarballs and patches because
it was getting too big to vgrep.

> > A couple of practical examples from work here at Netcom UK (now
> > Ebone :), would be say DNS zone files or user authentication data.
> > We use Solaris and NFS a lot, too, so large directories are a bad
> > thing in general for us, so we tend to subdivide things using a
> > very simple scheme: taking the first letter and then sometimes
> > the second letter or a pair of letters from the filename. This
> > actually works extremely well in practice, and as mentioned above
> > provides some positive side-effects.

> Positive? Try listing all names that contain "linux" with such case. I'll
> do ls *linux*. You'll need ls */*linux* ?l/inux* li/nux*. Seems ugly to
> me.

It's not that bad, as we tend to be fairly consistent about the scheme. I
only have to remember one of those combinations at a time :)

Anyway, again I apologise for starting or continuing (I forget which)
this thread. I really do understand (and agree with) the arguments for
better directory performance. I have moved to ReiserFS, mainly for the
avoidance of long fsck (power failure, children pushing buttons, alpha
and beta testing of 3D graphics drivers). I *love* being able to type
"rm -rf linux-x.y.z-acNN" and have the command prompt reappear in less
than a second. I intended merely to highlight the danger inherent in
saying to people "oh look you can put a million entries in a directory
now" :)

*whack* bad thread *die* *die*

> Pavel

--
/* Bill Crawford, Unix Systems Developer, Ebone (formerly GTS Netcom) */
#include <stddiscl>
const char *addresses[] = {
"[email protected]", "[email protected]", // work
"[email protected]", "[email protected]" // home
};

2001-03-07 00:38:24

by Jamie Lokier

Subject: Re: Hashing and directories

Pavel Machek wrote:
> > the space allowed for arguments is not a userland issue, it is a kernel
> > limit defined by MAX_ARG_PAGES in binfmts.h, so one could tweak it if one
> > wanted to without breaking any userland.
>
> Which is exactly what I done on my system. 2MB for command line is
> very nice.

Mine too (until recently). A proper patch to remove the limit, and copy
the args into swappable memory as they go (to avoid DoS) would be nicer,
but a quick change to MAX_ARG_PAGES seemed so much easier :-)

In my case, it was a Makefile generating the huge command lines. There
were about 20000 source files and 80000 target files, and the occasional
long line "update the archive with these changed files: ..." ;-)

Splitting the file name list seemed so much more difficult. You can't
even do "echo $(FILES) | xargs", as the "echo" command line is too long!

-- Jamie

2001-03-07 04:04:58

by Linus Torvalds

Subject: Re: Hashing and directories

In article <[email protected]>,
Jamie Lokier <[email protected]> wrote:
>Pavel Machek wrote:
>> > the space allowed for arguments is not a userland issue, it is a kernel
>> > limit defined by MAX_ARG_PAGES in binfmts.h, so one could tweak it if one
>> > wanted to without breaking any userland.
>>
>> Which is exactly what I done on my system. 2MB for command line is
>> very nice.
>
>Mine too (until recently). A proper patch to remove the limit, and copy
>the args into swappable memory as they go (to avoid DoS) would be nicer,
>but a quick change to MAX_ARG_PAGES seemed so much easier :-)

You can't just change MAX_ARG_PAGES. The space for the array is
allocated on the stack, and you'll just overflow the stack if you start
upping it a lot.

The long-term solution for this is to create the new VM space for the
new process early, and add it to the list of mm_struct's that the
swapper knows about, and then just get rid of the pages[MAX_ARG_PAGES]
array completely and instead just populate the new VM directly. That
way the destination is swappable etc, and you can also remove the
"put_dirty_page()" loop later on, as the pages will already be in their
right places.

It's definitely not a one-liner, but if somebody really feels strongly
about this, then I can tell already that the above is the only way to do
it sanely.

Linus

2001-03-07 13:42:33

by Jamie Lokier

Subject: Re: Hashing and directories

Linus Torvalds wrote:
> The long-term solution for this is to create the new VM space for the
> new process early, and add it to the list of mm_struct's that the
> swapper knows about, and then just get rid of the pages[MAX_ARG_PAGES]
> array completely and instead just populate the new VM directly. That
> way the destination is swappable etc, and you can also remove the
> "put_dirty_page()" loop later on, as the pages will already be in their
> right places.
>
> It's definitely not a one-liner, but if somebody really feels strongly
> about this, then I can tell already that the above is the only way to do
> it sanely.

Yup. We discussed this years ago, and nobody thought it important
enough. mm->mmlist didn't exist then, and creating it _just_ for
this feature seemed too intrusive. I agree it's the only sane way to
completely remove the limit.

-- Jamie

2001-03-07 15:57:49

by Manfred Spraul

Subject: Re: Hashing and directories

Jamie wrote:

> Linus Torvalds wrote:
> > The long-term solution for this is to create the new VM space for the
> > new process early, and add it to the list of mm_struct's that the
> > swapper knows about, and then just get rid of the pages[MAX_ARG_PAGES]
> > array completely and instead just populate the new VM directly. That
> > way the destination is swappable etc, and you can also remove the
> > "put_dirty_page()" loop later on, as the pages will already be in their
> > right places.
> >
> > It's definitely not a one-liner, but if somebody really feels strongly
> > about this, then I can tell already that the above is the only way to do
> > it sanely.
>
> Yup. We discussed this years ago, and nobody thought it important
> enough. mm->mmlist didn't exist then, and creating it _just_ for
> this feature seemed too intrusive. I agree it's the only sane way to
> completely remove the limit.

I'm not sure that this is the right way: it means that every exec() must
call dup_mmap(), and usually only to copy a few hundred bytes. But I
don't see a sane alternative. I won't propose to create a temporary file
in a kernel tmpfs mount ;-)

--

Manfred




2001-03-07 16:10:49

by Jamie Lokier

Subject: Re: Hashing and directories

Manfred Spraul wrote:
> I'm not sure that this is the right way: It means that every exec() must
> call dup_mmap(), and usually only to copy a few hundert bytes. But I
> don't see a sane alternative. I won't propose to create a temporary file
> in a kernel tmpfs mount ;-)

Every exec creates a whole new mm anyway, after copying data from the
old mm. The suggestion is to create the new mm before copying the data,
and to copy the data from the old mm directly to the new one.

-- Jamie

2001-03-07 16:23:49

by Manfred Spraul

Subject: Re: Hashing and directories

From: "Jamie Lokier" <[email protected]>
> Manfred Spraul wrote:
> > I'm not sure that this is the right way: It means that every exec()
> > must call dup_mmap(), and usually only to copy a few hundert
> > bytes. But I don't see a sane alternative. I won't propose to
> > create a temporary file in a kernel tmpfs mount ;-)
>
> Every exec creates a whole new mm anyway, after copying data from the
> old mm. The suggestion is to create the new mm before copying the
> data, and to copy the data from the old mm directly to the new one.
>

exec_mmap currently avoids mm_alloc()/activate_mm()/mm_drop() for
single-threaded apps, and that would become impossible.
I'm not sure how expensive these calls are.

--
Manfred


2001-03-07 18:23:37

by Linus Torvalds

Subject: Re: Hashing and directories

In article <003701c0a722$f6b02700$5517fea9@local>,
Manfred Spraul <[email protected]> wrote:
>
>exec_mmap currenly avoids mm_alloc()/activate_mm()/mm_drop() for single
>threaded apps, and that would become impossible.
>I'm not sure how expensive these calls are.

They aren't that expensive: activate_mm() needs to flush the TLB, but we
do that anyway in the re-use case too, so on the whole the expense is
limited to having to do the extra allocation.

That expense might be at least partially offset by being able to do some
things more simply - right now the mm re-use case actually has to be
more careful than I really like for security and protection reasons, for
example. We've had bugs in this area before - simplifying it would be
nice. And it also gets rid of the special cases.

I don't think the patches to do this should be all that huge, but who
knows? It's not a trivial part of the kernel.

Linus

2001-03-10 12:37:02

by kaih

Subject: Re: Hashing and directories

[email protected] (Bill Crawford) wrote on 22.02.01 in <[email protected]>:

> A particular reason for this, apart from filesystem efficiency,
> is to make it easier for people to find things, as it is usually
> easier to spot what you want amongst a hundred things than among
> a thousand or ten thousand.
>
> A couple of practical examples from work here at Netcom UK (now
> Ebone :), would be say DNS zone files or user authentication data.
> We use Solaris and NFS a lot, too, so large directories are a bad
> thing in general for us, so we tend to subdivide things using a
> very simple scheme: taking the first letter and then sometimes
> the second letter or a pair of letters from the filename. This
> actually works extremely well in practice, and as mentioned above
> provides some positive side-effects.

So the practical difference between finding a file in a hierarchy if you
already know the first N characters (because you need them to find the
subdirectory it's in), and finding the same file in a flat directory still
knowing the first N characters, is ... well, maybe tab completion is a tad
slower.

Sorry, but I can't see the human angle.


MfG Kai

2001-03-12 10:07:47

by Herbert Xu

Subject: Re: Hashing and directories

Pavel Machek <[email protected]> wrote:

> xargs is very ugly. I want to rm 12*. Just plain "rm 12*". *Not* "find
> . -name "12*" | xargs rm, which has terrible issues with files names

Try

printf "%s\0" 12* | xargs -0 rm
--
Debian GNU/Linux 2.2 is out! ( http://www.debian.org/ )
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2001-03-12 10:46:29

by Xavier Bestel

Subject: Re: Hashing and directories

On 12 Mar 2001 21:05:58 +1100, Herbert Xu wrote:
> Pavel Machek <[email protected]> wrote:
>
> > xargs is very ugly. I want to rm 12*. Just plain "rm 12*". *Not* "find
> > . -name "12*" | xargs rm, which has terrible issues with files names
>
> Try
>
> printf "%s\0" 12* | xargs -0 rm

Or find -print0 ... | xargs -0 ...

Xav

2001-04-27 11:09:44

by Goswin Brederlow

Subject: Re: Hashing and directories

>>>>> " " == Pavel Machek <[email protected]> writes:

> Hi!
>> I was hoping to point out that in real life, most systems that
>> need to access large numbers of files are already designed to
>> do some kind of hashing, or at least to divide-and-conquer by
>> using multi-level directory structures.

> Yes -- because they're working around kernel slowness.

> I had to do this kind of hashing because kernel disliked 70000
> html files (copy of train time tables).

> BTW try rm * with 70000 files in directory -- command line will
> overflow.

There are filesystems that use btrees (reiserfs) or hashing (affs) for
directories.

That way you get an O(log(n)) or even O(1) access time for files.
Sadly, the hash table size for affs depends on the block size, and
Linux AFAIK only allows far too small block sizes (512 bytes) for affs.
It was designed for floppies, so the lack of dynamically resizing hash
tables is excused.

What could also be done is to keep directories sorted. Creating files
would cost O(n) time, but a lookup could be done in O(log(log(n)))
most of the time with a reasonable name distribution (essentially an
interpolation search). This could be done with ext2 without breaking
any compatibility. One would need to re-sort all directories every
time the FS was mounted read-write by an older ext2, but otherwise
nothing changes.
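
A rough user-space sketch of such a lookup (illustrative only, with a
made-up 4-byte key and assuming reasonably evenly distributed names):

/* Rough sketch (user-space, illustrative only) of how a sorted directory
 * allows an O(log(log(n))) lookup via interpolation search, assuming the
 * names are reasonably evenly distributed.  The 4-byte key is made up. */
#include <stdint.h>
#include <string.h>

/* Map the first 4 bytes of a name to a number so we can interpolate. */
uint32_t name_key(const char *name)
{
        size_t len = strlen(name);
        uint32_t key = 0;
        int i;

        for (i = 0; i < 4; i++) {
                key <<= 8;
                if ((size_t)i < len)
                        key |= (unsigned char)name[i];
        }
        return key;
}

/* names[] is sorted; returns the index of target, or -1 if absent. */
long dir_lookup(const char **names, long n, const char *target)
{
        uint32_t tkey = name_key(target);
        long lo = 0, hi = n - 1;

        while (lo <= hi) {
                uint32_t lkey = name_key(names[lo]);
                uint32_t hkey = name_key(names[hi]);
                long mid;
                int cmp;

                if (lkey == hkey)
                        mid = lo;               /* degenerate: step linearly */
                else if (tkey < lkey || tkey > hkey)
                        return -1;              /* out of range, not present */
                else                            /* guess position from key */
                        mid = lo + (long)(((uint64_t)(tkey - lkey) *
                                (uint64_t)(hi - lo)) / (hkey - lkey));

                cmp = strcmp(target, names[mid]);
                if (cmp == 0)
                        return mid;
                if (cmp < 0)
                        hi = mid - 1;
                else
                        lo = mid + 1;
        }
        return -1;
}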

Would you like to write support for this?

MfG
Goswin

2001-04-27 16:20:33

by Daniel Phillips

Subject: Re: Hashing and directories

On Thursday 08 March 2001 13:42, Goswin Brederlow wrote:
> >>>>> " " == Pavel Machek <[email protected]> writes:
> > Hi!
> >
> >> I was hoping to point out that in real life, most systems that
> >> need to access large numbers of files are already designed to
> >> do some kind of hashing, or at least to divide-and-conquer by
> >> using multi-level directory structures.
> >>
> > Yes -- because they're working around kernel slowness.
> >
> > I had to do this kind of hashing because kernel disliked 70000
> > html files (copy of train time tables).
> >
> > BTW try rm * with 70000 files in directory -- command line will
> > overflow.
>
> There are filesystems that use btrees (reiserfs) or hashing (affs) for
> directories.
>
> That way you get a O(log(n)) or even O(1) access time for
> files. Saddly the hashtable for affs depends on the blocksize and
> linux AFAIK only allows far too small block sizes (512 byte) for affs.
> It was designed for floppies, so the lack of dynamically resizing hash
> tables is excused.
>
> What also could be done is to keed directories sorted. Creating of
> files would cost O(N) time but a lookup could be done in
> O(log(log(n))) most of the time with reasonable name distribution.
> This could be done with ext2 without breaking any compatibility. One
> would need to convert (sort all directories) every time the FS was
> mounted RW by an older ext2, but otherwise nothing changes.
>
> Would you like to write support for this?

Hi, I missed this whole thread at the time, ironically, because I was
too busy working on my hash-keyed directory index.

How do you get log(log(n))? The best I can do is log_b(n), with
b large. For practical purposes this is O(1).

The only problem I ran into is the mismatch between the storage order
of the sorted directory and that of the inodes, which causes thrashing
in the inode table. I was able to eliminate this thrashing completely
from user space by processing the files in inode order. I went on to
examine ways of eliminating the thrashing without help from user space,
and eventually came up with a good way of doing that, which relies on
setting an inode allocation target that corresponds loosely to the sort
order.
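
A rough sketch of that user-space workaround (a hypothetical little tool,
not my actual test code): read the whole directory, sort the entries by
inode number, and only then unlink them, so the inode table is walked
roughly sequentially:

/* Illustrative only: delete a directory's entries in *inode* order rather
 * than readdir order, so the inode table is touched roughly sequentially
 * instead of thrashing.  A hypothetical tool with minimal error handling.
 * WARNING: it really does unlink everything in the given directory. */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

struct entry {
        ino_t ino;
        char name[256];
};

static int by_ino(const void *a, const void *b)
{
        const struct entry *x = a, *y = b;
        return (x->ino > y->ino) - (x->ino < y->ino);
}

int main(int argc, char **argv)
{
        const char *dir = argc > 1 ? argv[1] : ".";
        struct entry *list = NULL;
        size_t n = 0, cap = 0, i;
        struct dirent *de;
        DIR *d;

        if (chdir(dir)) { perror(dir); return 1; }
        d = opendir(".");
        if (!d) { perror(dir); return 1; }

        while ((de = readdir(d)) != NULL) {
                if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
                        continue;
                if (n == cap) {
                        cap = cap ? cap * 2 : 1024;
                        list = realloc(list, cap * sizeof(*list));
                        if (!list) { perror("realloc"); return 1; }
                }
                list[n].ino = de->d_ino;
                snprintf(list[n].name, sizeof(list[n].name), "%s", de->d_name);
                n++;
        }
        closedir(d);

        qsort(list, n, sizeof(*list), by_ino);  /* inode order, not name order */

        for (i = 0; i < n; i++)
                if (unlink(list[i].name))
                        perror(list[i].name);

        free(list);
        return 0;
}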

The thrashing doesn't hurt much anyway compared to the current N**2
behaviour. For a million files I saw a 4X slowdown for delete vs
create. Create shows no thrashing because it works in storage order
in the inodes, with the directory blocks themselves being smaller by
a factor of 6-7, so not contributing significantly to the cache
pressure. Compare this to the 150 times slowdown you see with normal
Ext2, creating 100,000 files.

--
Daniel