2002-10-07 18:28:10

by Andrew Morton

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))

Daniel Phillips wrote:
>
> On Sunday 06 October 2002 17:19, Martin J. Bligh wrote:
> > > Then there's the issue of application startup. There's not enough
> > > read ahead. This is especially sad, as the order of page faults is
> > > at least partially predictable.
> >
> > Is the problem really, fundamentally a lack of readahead in the
> > kernel? Or is it that your application is a huge bloated pig?
>
> Readahead isn't the only problem, but it is a huge problem. The current
> readahead model is per-inode, which is very little help with lots of small
> files, especially if they are fragmented or out of order. There are various
> ways to fix this; they are all difficult[1]. Fortunately, we can call this
> "tuning work" so it can be done during the stable series.
>
> [1] We could teach each filesystem how to read ahead across directories, or
> we could teach the vfs how to do physical readahead. Choose your poison.

Devices do physical readahead, and it works nicely.

Go into ext2_new_inode, replace the call to find_group_dir with
find_group_other. Then untar a kernel tree, unmount the fs,
remount it and see how long it takes to do a

`find . -type f | xargs cat > /dev/null'

on that tree. If your disk is like my disk, you will achieve
full disk bandwidth.
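
A rough user-space equivalent of the timing run above, as a sketch only. It reads every regular file under a directory and returns the byte count and elapsed time, from which throughput follows. The name `read_tree` and the 1 MiB chunk size are illustrative assumptions, not part of the original message, and the files are visited in `os.walk` order, which may differ from `find`'s. For a cold-cache number the filesystem still has to be unmounted and remounted first, as described.

```python
import os
import time


def read_tree(root, chunk_size=1 << 20):
    """Read every regular file under root; return (bytes_read, seconds)."""
    total = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files, roughly like `find -type f`.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)
    return total, time.perf_counter() - start
```

Calling `read_tree("linux")` on an untarred tree and dividing the first result by the second gives bytes per second, to compare against the raw sequential bandwidth of the disk.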


2002-10-07 18:47:20

by Linus Torvalds

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))


On Mon, 7 Oct 2002, Andrew Morton wrote:
>
> Devices do physical readahead, and it works nicely.

Indeed. There isn't any reasonable device where this isn't the case: the
_device_ (and sometimes the driver - floppy.c) does a lot better at
readahead than higher layers can do anyway.

> Go into ext2_new_inode, replace the call to find_group_dir with
> find_group_other.

I hate that thing. Hate hate hate. Maybe we should just do this, and hope
that somebody will do a proper off-line cleanup tool.

In the meantime, it might just be possible to take a look at the uid, and
if the uid matches use find_group_other, but for non-matching uids use
find_group_dir. That gives a "compact for same users, distribute for
different users" heuristic, which might be acceptable for normal use (and
the theoretical cleanup tool could fix it up).

Add some other heuristics ("if the difference between free group sizes is
bigger than a factor of two"), and maybe it would be useful.

The current approach sucks for everybody, and makes it impossible to get
good throughput on a disk on many very common loads.

Linus
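
The heuristic proposed above can be modelled in a few lines of user-space Python. This is a sketch under stated assumptions, not ext2 code: `Group`, `spread` and `choose_group` are hypothetical names, `spread` is a simplified stand-in for find_group_dir (it just picks the group with the most free inodes), staying in the parent's group stands in for find_group_other, and the "factor of two" test is one possible reading of the remark about free group sizes.

```python
from dataclasses import dataclass


@dataclass
class Group:
    index: int
    free_inodes: int


def spread(groups):
    """Simplified stand-in for find_group_dir: the emptiest group wins."""
    return max(groups, key=lambda g: g.free_inodes)


def choose_group(groups, parent_group, parent_uid, new_uid, imbalance=2):
    """Pick a block group index for a new directory.

    Different owner than the parent directory: distribute.
    Same owner: compact (stay in the parent's group), unless the emptiest
    group has more than `imbalance` times the parent's free inodes.
    """
    parent = groups[parent_group]
    emptiest = spread(groups)
    if new_uid != parent_uid:
        return emptiest.index
    if emptiest.free_inodes > imbalance * parent.free_inodes:
        return emptiest.index
    return parent.index
```

With this model, untarring a tree as one user keeps the whole tree in one group until that group falls well behind the others, while directories created for other users still get spread across the disk.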

2002-10-07 18:59:50

by Daniel Phillips

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))

On Monday 07 October 2002 20:31, Andrew Morton wrote:
> Daniel Phillips wrote:
> > [1] We could teach each filesystem how to read ahead across directories, or
> > we could teach the vfs how to do physical readahead. Choose your poison.
>
> Devices do physical readahead, and it works nicely.

Devices have a few MB of readahead cache, the kernel can have thousands of
times as much.

--
Daniel

2002-10-07 19:19:43

by Linus Torvalds

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))


On Mon, 7 Oct 2002, Daniel Phillips wrote:
>
> Devices have a few MB of readahead cache, the kernel can have thousands of
> times as much.

I don't think that is in the least realistic.

There's _no_ way that the kernel could do physical readahead for more than
a few tens or hundreds of kB - the latency impact would just be too much
to handle, and the VM impact is not likely insignificant either.

So the device readahead is _not_ noticeably smaller than what the kernel
can reasonably do, and it does a better job of it (ie disks can fill track
buffers optimally, depending on where the head hits etc).

Linus

2002-10-07 20:00:27

by Alan

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))

On Mon, 2002-10-07 at 19:51, Linus Torvalds wrote:
> In the meantime, it might just be possible to take a look at the uid, and
> if the uid matches use find_group_other, but for non-matching uids use
> find_group_dir. That gives a "compact for same users, distribute for
> different users" heuristic, which might be acceptable for normal use (and
> the theoretical cleanup tool could fix it up).

Factoring the uid/gid/pid in actually may help in other ways. If we are
doing it by pid or by uid we will reduce the interleave of multiple
files thing you sometimes get

2002-10-07 19:57:24

by Daniel Phillips

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))

On Monday 07 October 2002 21:24, Linus Torvalds wrote:
> On Mon, 7 Oct 2002, Daniel Phillips wrote:
> >
> > Devices have a few MB of readahead cache, the kernel can have thousands of
> > times as much.
>
> I don't think that is in the least realistic.
>
> There's _no_ way that the kernel could do physical readahead for more than
> a few tens or hundreds of kB

If that's a bet, I'll take you up on it.

> - the latency impact would just be too much
> to handle, and the VM impact is not likely insignificant either.

I did say difficult. It really is, but there are big gains to be had.

This is easy to verify: say you have 100 MB of kernel source stored in, say,
50 different clumps on disk. Complete with seeks, a perfectly prescient
readahead algorithm can read that into memory in about 5 seconds, even with
my lame scsi raid controller[1]. So two of those needs 10 seconds, and I
can diff those two trees in 2 seconds, in cache. In practice it takes 90
seconds, so there is obviously a lot of room for improvement.

Note that if the disks really were capable of handling the readahead
themselves they would already give me the 12 second result, not the 90
seconds. They simply can't, because they haven't got enough cache.

[1] If the controller wasn't lame it would read the 100 MB in less than a
second, with its (peak) total of 200 MB/s media bandwidth, less 20% worth
of parity blocks.

--
Daniel
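
The arithmetic in the message above, spelled out as a small sketch. All inputs are the figures quoted in the message; the variable names are illustrative, and the derived throughput and speedup values are simple consequences of those figures rather than measurements.

```python
MB_PER_TREE = 100        # size of one kernel tree, from the message
PRESCIENT_READ_S = 5     # claimed read time per tree with ideal readahead
DIFF_IN_CACHE_S = 2      # claimed time to diff two trees already in cache
OBSERVED_S = 90          # observed time for the diff in practice

ideal_s = 2 * PRESCIENT_READ_S + DIFF_IN_CACHE_S    # 12 s
speedup = OBSERVED_S / ideal_s                       # 7.5x
prescient_mb_s = MB_PER_TREE / PRESCIENT_READ_S      # 20 MB/s
observed_mb_s = 2 * MB_PER_TREE / OBSERVED_S         # about 2.2 MB/s

# Footnote [1]: 200 MB/s peak media bandwidth, less 20% parity blocks.
usable_mb_s = 200 * 0.8                              # 160 MB/s
footnote_read_s = MB_PER_TREE / usable_mb_s          # 0.625 s

print(f"ideal {ideal_s} s vs observed {OBSERVED_S} s: {speedup:.1f}x slower")
```

So the claimed room for improvement is a factor of 7.5, and the footnote's "less than a second" corresponds to about 0.6 seconds per tree at 160 MB/s.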

2002-10-07 20:39:41

by Linus Torvalds

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))


On 7 Oct 2002, Alan Cox wrote:
>
> Factoring the uid/gid/pid in actually may help in other ways. If we are
> doing it by pid or by uid we will reduce the interleave of multiple
> files thing you sometimes get

'pid' would probably work better than what we have now, even though I bet
it would get confused by a large number of installers (ie "make install"
in just about any project will use multiple different processes to copy
over separate subdirectories. In the X11R6 tree it uses individual "cp"
processes for each file!)

The session ID would avoid some of that, but they both have a fundamental
problem: neither pid nor session ID is actually saved in any directory
structure, so it's quite hard to use that as a heuristic for whether a new
file should go into the same directory group as the directory it is
created in.

That's why "uid" would work better. The uid has a different issue, though,
namely the fact that when user directories are created, they are basically
always created as uid 0 first, and then a "chown" - which means that the
user heuristic wouldn't actually trigger at the right time. So the
heuristic couldn't be just "newfile->uid == directory->uid", it would have
to be something better.

I think last time we had the discussion, time-based things were also felt
to be good heuristics in many cases.

It could also be good to have an additional static hint on whether
directories should be spread out or not. Administrators could set the
"spread out" bit on the /, /home and /var/spool/(news|mail) directories,
for example, causing those to spread out their subdirectories, but not
causing normal user activity to do so.

Yeah, yeah, I know there are papers on this. I don't care. I think
something has to be done, and last time the discussion petered out at
about this point.

Linus
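
The static "spread out" hint described above could look roughly like this in a user-space model. It is a sketch under assumptions: `Dir`, `spread_children` and `place_subdir` are hypothetical names, no such flag is claimed to exist in ext2, and "emptiest group" is a simplified stand-in for whatever spreading policy the filesystem uses. Because the hint lives on the parent directory, it does not depend on the uid of the new directory, so the create-as-root-then-chown problem does not arise.

```python
from dataclasses import dataclass


@dataclass
class Dir:
    path: str
    group: int                      # block group holding this directory
    spread_children: bool = False   # hypothetical admin-set hint


def place_subdir(parent, free_inodes):
    """Return the block group index for a new subdirectory of `parent`.

    free_inodes[i] is the free-inode count of block group i.
    """
    if parent.spread_children:
        # Spread: send the new subdirectory to the emptiest group.
        return max(range(len(free_inodes)), key=free_inodes.__getitem__)
    # Compact: keep the subdirectory in its parent's group.
    return parent.group
```

Setting the hint on /home would send each new home directory to a different group, while everything a user later creates below their own directory stays together.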

2002-10-07 20:23:36

by Linus Torvalds

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))


On Mon, 7 Oct 2002, Daniel Phillips wrote:
>
> If that's a bet, I'll take you up on it.

Sure. The mey is:
- we can more easily fix the f*cking filesystems to be sane
- than trying to add prescient read-ahead to the kernel

In other words, trying to do an impossibly good job on read-ahead is
_stupid_, when the real problem is that ext2 lays out files in total crap
ways.

> I did say difficult. It really is, but there are big gains to be had.

But why do the horribly stupid thing, when Andrew has already shown that a
one-liner change to ext2/3 gives you platter speeds (and better speeds
than your approach _can_ get, since you still are going to end up seeking
a lot, even if you can make your read-ahead prescient)?

In other words, you're overcompensating.

Linus

2002-10-07 21:11:31

by Daniel Phillips

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))

On Monday 07 October 2002 22:28, Linus Torvalds wrote:
> On Mon, 7 Oct 2002, Daniel Phillips wrote:
> >
> > If that's a bet, I'll take you up on it.
>
> Sure. The mey is:
^^^ <---- "bet" ?
> - we can more easily fix the f*cking filesystems to be sane
> - than trying to add prescient read-ahead to the kernel

--
Daniel

2002-10-07 21:51:49

by Linus Torvalds

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))


On Mon, 7 Oct 2002, Daniel Phillips wrote:
> >
> > Sure. The mey is:
> ^^^ <---- "bet" ?

Yeah. What the heck happened to my fingers?

Linus "spastic" Torvalds

2002-10-07 21:59:14

by Daniel Phillips

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))

On Monday 07 October 2002 23:55, Linus Torvalds wrote:
> On Mon, 7 Oct 2002, Daniel Phillips wrote:
> > >
> > > Sure. The mey is:
> > ^^^ <---- "bet" ?
>
> Yeah. What the heck happened to my fingers?

Apparently, one of them missed the key it was aiming for and the other one
changed hands.

--
Daniel

2002-10-07 22:09:02

by Charles Cazabon

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))

Daniel Phillips wrote:
> >
> > Sure. The mey is:
> ^^^ <---- "bet" ?

Kubys typed that line.

Charles
--
-----------------------------------------------------------------------
Charles Cazabon <[email protected]>
GPL'ed software available at: http://www.qcc.ca/~charlesc/software/
-----------------------------------------------------------------------

2002-10-30 18:19:49

by lee leahu

Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))

Pardon my ignorance,

How does readahead relate to journaling filesystems such as ReiserFS or XFS?

Do they have the same or similar problems that I've been reading about with ext2/3?

Andrew Morton scribbled something about Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)):

> Daniel Phillips wrote:
> >
> > On Sunday 06 October 2002 17:19, Martin J. Bligh wrote:
> > > > Then there's the issue of application startup. There's not enough
> > > > read ahead. This is especially sad, as the order of page faults is
> > > > at least partially predictable.
> > >
> > > Is the problem really, fundamentally a lack of readahead in the
> > > kernel? Or is it that your application is a huge bloated pig?
> >
> > Readahead isn't the only problem, but it is a huge problem. The current
> > readahead model is per-inode, which is very little help with lots of small
> > files, especially if they are fragmented or out of order. There are various
> > ways to fix this; they are all difficult[1]. Fortunately, we can call this
> > "tuning work" so it can be done during the stable series.
> >
> > [1] We could teach each filesystem how to read ahead across directories, or
> > we could teach the vfs how to do physical readahead. Choose your poison.
>
> Devices do physical readahead, and it works nicely.
>
> Go into ext2_new_inode, replace the call to find_group_dir with
> find_group_other. Then untar a kernel tree, unmount the fs,
> remount it and see how long it takes to do a
>
> `find . -type f | xargs cat > /dev/null'
>
> on that tree. If your disk is like my disk, you will achieve
> full disk bandwidth.


--
+----------------------------------+---------------------------------+
| Lee Leahu | voice -> 708-444-2690 |
| Internet Technology Specialist | fax -> 708-444-2697 |
| RICIS, Inc. | email -> [email protected] |
+----------------------------------+---------------------------------+
| I cannot conceive that anybody will require multiplications at the |
| rate of 40,000 or even 4,000 per hour ... |
| -- F. H. Wales (1936) |
+--------------------------------------------------------------------+