2002-09-11 15:38:22

by Xuan Baldauf

Subject: Heuristic readahead for filesystems

Hello,

I wonder whether Linux implements a kind of heuristic
readahead for filesystems:

If an application reads a directory with getdents() and if,
in the past, it stat()ed a significant part of the directory
entries, it is likely that it will stat() every entry of
every directory it reads with getdents() in the future. Thus
readahead for the stat data could improve the performance,
especially if the stat data of the entries is located close
together on disk.

If an application did a stat()..open()..read() sequence on a
file, it is likely that, after the next stat(), it will open
and read the mentioned file. Thus, one could readahead the
start of a file on stat() of that file.

Combined: If an application walks a directory tree and
visits each file, it is likely that it will continue up to
the end of that tree.
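
To make the pattern concrete, here is a rough user-space sketch of
such a tree walk (purely illustrative, not kernel code; error
handling is minimal). A recursive diff over two trees effectively
does this twice, in lockstep:

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Walk a directory tree the way the heuristic would see it:
 * getdents() on each directory, stat() on every entry, then
 * open() and read() every regular file. */
static void walk(const char *dir)
{
	DIR *d = opendir(dir);	/* open() + getdents() underneath */
	struct dirent *de;
	char path[4096], buf[4096];

	if (!d)
		return;
	while ((de = readdir(d)) != NULL) {
		struct stat st;

		if (strcmp(de->d_name, ".") == 0 ||
		    strcmp(de->d_name, "..") == 0)
			continue;
		snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		if (stat(path, &st) < 0)	/* stat() every entry ... */
			continue;
		if (S_ISDIR(st.st_mode)) {
			walk(path);	/* ... recurse into subdirectories */
		} else if (S_ISREG(st.st_mode)) {
			/* ... then open and read the whole file */
			int fd = open(path, O_RDONLY);
			if (fd >= 0) {
				while (read(fd, buf, sizeof(buf)) > 0)
					;
				close(fd);
			}
		}
	}
	closedir(d);
}

int main(int argc, char **argv)
{
	walk(argc > 1 ? argv[1] : ".");
	return 0;
}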

Example: See this diff strace:

stat64("linux/drivers/usb/net", {st_mode=S_IFDIR|0775,
st_size=400, ...}) = 0
stat64("linux-2.5.24/drivers/usb/net",
{st_mode=S_IFDIR|0775, st_size=400, ...}) = 0
open("linux/drivers/usb/net",
O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
fstat64(3, {st_mode=S_IFDIR|0775, st_size=400, ...}) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
getdents64(0x3, 0x8057718, 0x1000, 0) = 432
getdents64(0x3, 0x8057718, 0x1000, 0) = 0
close(3) = 0
open("linux-2.5.24/drivers/usb/net",
O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
fstat64(3, {st_mode=S_IFDIR|0775, st_size=400, ...}) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
getdents64(0x3, 0x8057718, 0x1000, 0) = 432
getdents64(0x3, 0x8057718, 0x1000, 0) = 0
close(3) = 0
stat64("linux/drivers/usb/net/Config.help",
{st_mode=S_IFREG|0644, st_size=5230, ...}) = 0
stat64("linux-2.5.24/drivers/usb/net/Config.help",
{st_mode=S_IFREG|0644, st_size=5230, ...}) = 0
open("linux/drivers/usb/net/Config.help", O_RDONLY) = 3
open("linux-2.5.24/drivers/usb/net/Config.help", O_RDONLY) =
4
read(3, "CONFIG_USB_CATC\n Say Y if you w"..., 4096) = 4096

read(4, "CONFIG_USB_CATC\n Say Y if you w"..., 4096) = 4096

read(3, "ule ( = code which can be\n inse"..., 1134) = 1134

read(4, "ule ( = code which can be\n inse"..., 1134) = 1134

close(3) = 0
close(4) = 0
stat64("linux/drivers/usb/net/Config.in",
{st_mode=S_IFREG|0644, st_size=931, ...}) = 0
stat64("linux-2.5.24/drivers/usb/net/Config.in",
{st_mode=S_IFREG|0644, st_size=931, ...}) = 0
open("linux/drivers/usb/net/Config.in", O_RDONLY) = 3
open("linux-2.5.24/drivers/usb/net/Config.in", O_RDONLY) = 4

read(3, "#\n# USB Network devices configur"..., 4096) = 931
read(4, "#\n# USB Network devices configur"..., 4096) = 931
close(3) = 0
close(4) = 0
stat64("linux/drivers/usb/net/Makefile",
{st_mode=S_IFREG|0644, st_size=299, ...}) = 0
stat64("linux-2.5.24/drivers/usb/net/Makefile",
{st_mode=S_IFREG|0644, st_size=299, ...}) = 0
open("linux/drivers/usb/net/Makefile", O_RDONLY) = 3
open("linux-2.5.24/drivers/usb/net/Makefile", O_RDONLY) = 4
read(3, "#\n# Makefile for USB Network dri"..., 4096) = 299
read(4, "#\n# Makefile for USB Network dri"..., 4096) = 299
close(3) = 0
close(4) = 0
stat64("linux/drivers/usb/net/catc.c",
{st_mode=S_IFREG|0644, st_size=23960, ...}) = 0
stat64("linux-2.5.24/drivers/usb/net/catc.c",
{st_mode=S_IFREG|0644, st_size=23960, ...}) = 0
open("linux/drivers/usb/net/catc.c", O_RDONLY) = 3
open("linux-2.5.24/drivers/usb/net/catc.c", O_RDONLY) = 4
read(3, "/*\n * Copyright (c) 2001 Vojtec"..., 4096) = 4096

read(4, "/*\n * Copyright (c) 2001 Vojtec"..., 4096) = 4096

read(3, " long last_stats;\n\n\tu8 multicast"..., 19864) =
19864
read(4, " long last_stats;\n\n\tu8 multicast"..., 19864) =
19864
close(3) = 0
close(4) = 0
stat64("linux/drivers/usb/net/cdc-ether.c",
{st_mode=S_IFREG|0644, st_size=45669, ...}) = 0
stat64("linux-2.5.24/drivers/usb/net/cdc-ether.c",
{st_mode=S_IFREG|0644, st_size=45669, ...}) = 0
open("linux/drivers/usb/net/cdc-ether.c", O_RDONLY) = 3
open("linux-2.5.24/drivers/usb/net/cdc-ether.c", O_RDONLY) =
4
read(3, "// Portions of this file taken f"..., 4096) = 4096
read(4, "// Portions of this file taken f"..., 4096) = 4096
read(3, "SY;\n}\n\nstatic void write_bulk_ca"..., 41573) =
41573
read(4, "SY;\n}\n\nstatic void write_bulk_ca"..., 41573) =
41573
close(3) = 0
close(4) = 0


Diff takes only 2% CPU time for diffing two Linux kernel
trees on an AMD Duron 1 GHz with DMA enabled. Dumb readahead
of only some blocks after the original block does not work
well here: diff always switches between the two kernel
trees, and I assume the disk head seeks back and forth
accordingly. If the trees were read (at least partly) in
parallel, I assume that reading would be much faster. Hence
the question.

Xuân.



2002-09-11 16:24:05

by Davide Libenzi

Subject: Re: Heuristic readahead for filesystems

On Wed, 11 Sep 2002, Xuan Baldauf wrote:

> Hello,
>
> I wonder wether Linux implements a kind of heuristic
> readahead for filesystems:
>
> If an application reads a directory with getdents() and if
> in the past, it stat()ed a significant part of the directory
> entries, it is likely that it will stat() every entry of
> every directory it reads with getdents() in the future. Thus
> readahead for the stat data could improve the perfomance,
> especially if the stat data is located closely to each other
> on disk.
>
> If an application did a stat()..open()..read() sequence on a
> file, it is likely that, after the next stat(), it will open
> and read the mentioned file. Thus, one could readahead the
> start of a file on stat() of that file.
>
> Combined: If an application walks a directory tree and
> visits each file, it is likely that it will continue up to
> the end of that tree.

M$ Win XP does something exactly like this and keeps application
( windows\prefetch ) and boot profiles that it uses to prefetch disk data
and avoid long page-fault latencies. It does kind of work, but care should
be taken adopting a similar technique on Linux ( patents ).



- Davide


2002-09-11 16:37:38

by Rik van Riel

Subject: Re: Heuristic readahead for filesystems

On Wed, 11 Sep 2002, Xuan Baldauf wrote:

> I wonder wether Linux implements a kind of heuristic
> readahead for filesystems:

> If an application did a stat()..open()..read() sequence on a
> file, it is likely that, after the next stat(), it will open
> and read the mentioned file. Thus, one could readahead the
> start of a file on stat() of that file.

> Example: See this diff strace:

Your observation is right, but I'm not sure how much it will
matter if we start reading the file at stat() time or at
read() time.

This is because one disk seek takes about 10 million CPU
cycles on modern systems and we'll have completed the stat(),
open() and started the read() before the disk arm has started
moving ;)
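
(Rough arithmetic, assuming a ~10 ms average seek and a ~1 GHz CPU:
0.01 s * 10^9 cycles/s = 10^7 cycles, i.e. about 10 million.)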

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

Spamtraps of the month: [email protected] [email protected]

2002-09-11 16:58:18

by jdow

Subject: Re: Heuristic readahead for filesystems

From: "Davide Libenzi" <[email protected]>
> On Wed, 11 Sep 2002, Xuan Baldauf wrote:
>
> > Hello,
> >
> > I wonder wether Linux implements a kind of heuristic
> > readahead for filesystems:
> >
> > If an application reads a directory with getdents() and if
> > in the past, it stat()ed a significant part of the directory
> > entries, it is likely that it will stat() every entry of
> > every directory it reads with getdents() in the future. Thus
> > readahead for the stat data could improve the perfomance,
> > especially if the stat data is located closely to each other
> > on disk.
> >
> > If an application did a stat()..open()..read() sequence on a
> > file, it is likely that, after the next stat(), it will open
> > and read the mentioned file. Thus, one could readahead the
> > start of a file on stat() of that file.
> >
> > Combined: If an application walks a directory tree and
> > visits each file, it is likely that it will continue up to
> > the end of that tree.
>
> M$ Win XP does exactly something like this and keep applications
> ( windows\prefetch ) and boot profiles that it uses to prefetch disk data
> and avoid long page fault latencies. It does kind-of-work but care should
> be taken adopting a similar technique on Linux ( patents ).

Davide, when was the patent on readahead taken out? It has either expired
or I can prove prior art I did myself on the old StarDrive and HardFrame
controllers for the Amiga made by Microbotics, Inc.

{^_^} Joanne Dow, [email protected]


2002-09-11 17:10:39

by Davide Libenzi

Subject: Re: Heuristic readahead for filesystems

On Wed, 11 Sep 2002, jdow wrote:

> Davide, when was the patent on readahead taken out? It has either expired
> or I can prove prior art I did myself on the old StarDrive and HardFrame
> controllers for the Amiga made by Microbotics, Inc.

I'm definitely not a patent attorney :) but since there's the tendency to
patent almost everything in big corporations ( my Co. gives $ 2K for each
patent filed, plus another $ 1K if it's approved ) I bet that there's a
patent pending somewhere about this. Even if someone has prior art on
hw prefetching, imho there's the possibility to patent a software ( kernel )
based version of the art. I also do not think M$ would be so dumb as to
adopt a technique that is described inside a patent owned by someone else
in the US. But again, I'm not a patent attorney ...




- Davide



2002-09-11 17:11:37

by Hans Reiser

Subject: Re: Heuristic readahead for filesystems

Davide Libenzi wrote:

>On Wed, 11 Sep 2002, Xuan Baldauf wrote:
>
>
>
>>Hello,
>>
>>I wonder wether Linux implements a kind of heuristic
>>readahead for filesystems:
>>
>>If an application reads a directory with getdents() and if
>>in the past, it stat()ed a significant part of the directory
>>entries, it is likely that it will stat() every entry of
>>every directory it reads with getdents() in the future. Thus
>>readahead for the stat data could improve the perfomance,
>>especially if the stat data is located closely to each other
>>on disk.
>>
Nikita should comment on this.

>>
>>If an application did a stat()..open()..read() sequence on a
>>file, it is likely that, after the next stat(), it will open
>>and read the mentioned file. Thus, one could readahead the
>>start of a file on stat() of that file.
>>
>>Combined: If an application walks a directory tree and
>>visits each file, it is likely that it will continue up to
>>the end of that tree.
>>
>>
>
>M$ Win XP does exactly something like this and keep applications
>( windows\prefetch ) and boot profiles that it uses to prefetch disk data
>and avoid long page fault latencies. It does kind-of-work but care should
>be taken adopting a similar technique on Linux ( patents ).
>
>
>
>- Davide
>
>
There was a Usenix paper on profiling-based prefetching; does it predate
the MS patents?

Rik, the more reads you put into the elevator, the more efficiently it
can sort the reads, so even if latency reduction is not important,
prefetching can still help (in principle).

Hans


2002-09-11 17:19:42

by Oliver Neukum

Subject: Re: Heuristic readahead for filesystems

On Wednesday, 11 September 2002 18:42, Rik van Riel wrote:

> Your observation is right, but I'm not sure how much it will
> matter if we start reading the file at stat() time or at
> read() time.
>
> This is because one disk seek takes about 10 million CPU
> cycles on modern systems and we'll have completed the stat(),
> open() and started the read() before the disk arm has started
> moving ;)

Do we gain by sorting the disk accesses?
How about savings due to better cooperation with the IO
scheduler?

Regards
Oliver

2002-09-11 17:46:50

by jdow

Subject: Re: Heuristic readahead for filesystems

From: "Davide Libenzi" <[email protected]>

> On Wed, 11 Sep 2002, jdow wrote:
>
> > Davide, when was the patent on readahead taken out? It has either expired
> > or I can prove prior art I did myself on the old StarDrive and HardFrame
> > controllers for the Amiga made by Microbotics, Inc.
>
> I'm definitely not a patent attorney :) but since there's the tendency to
> patent almost everything in big corporations ( my Co. gives $ 2K for each
> patent filed, plus another $ 1K if it's approved ) I bet that there's a
> patent pending somewhere about this. Even if someone have prior art about
> hw prefecting, imho there's the possibility to patent a software ( kernel
> ) based version of the art. I also do not think M$ to be so dumb to adopt
> a technique that is described inside a someone else owned patent in US.
> But again, i'm not a patent attorney ...

This was kernel software. It was committed in the mid 80s. (Hm, so a patent
MIGHT not be expired by a year or two.) The predecessor device was used to
overturn a patent a company claimed on parallel port to SCSI converters. The
Microbotics units were rather forward looking for their day.

{^_^} Joanne Dow, [email protected]


2002-09-11 17:55:25

by Xuan Baldauf

Subject: Re: Heuristic readahead for filesystems



Rik van Riel wrote:

> On Wed, 11 Sep 2002, Xuan Baldauf wrote:
>
> > I wonder wether Linux implements a kind of heuristic
> > readahead for filesystems:
>
> > If an application did a stat()..open()..read() sequence on a
> > file, it is likely that, after the next stat(), it will open
> > and read the mentioned file. Thus, one could readahead the
> > start of a file on stat() of that file.
>
> > Example: See this diff strace:
>
> Your observation is right, but I'm not sure how much it will
> matter if we start reading the file at stat() time or at
> read() time.
>
> This is because one disk seek takes about 10 million CPU
> cycles on modern systems and we'll have completed the stat(),
> open() and started the read() before the disk arm has started
> moving ;)
>
> regards,
>
> Rik

The point here is not to optimize latency but to optimize throughput: if the
filesystem is able to recognize that a whole tree is being read, it may issue
read requests for all the blocks of that tree, which are (with high
probability) located so close to each other that all the read requests can
result in a single, large, megabytes-long disk read burst, taking a few
seconds instead of minutes.

In theory, this could also be implemented explicitly if the application could
tell the kernel "I'm going to read these 100 files in the very near future,
please make them ready for me". But wait, maybe the application can do this
(for regular files, not for directory entries and stat() data): could it be
efficient if the application used open(file,O_NONBLOCK) for the next 100 files
and subsequent read()s on each of the returned file descriptors?
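
Something like this rough sketch, say (file names taken from the command
line; whether the kernel actually starts I/O early here, and whether
read() really avoids waiting on a regular file, is exactly what I am
unsure about):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char buf[4096];
	int i;

	for (i = 1; i < argc; i++) {
		int fd = open(argv[i], O_RDONLY | O_NONBLOCK);

		if (fd < 0) {
			perror(argv[i]);
			continue;
		}
		/* Hope: this kicks off readahead for the file ... */
		read(fd, buf, sizeof(buf));
		/* ... and the data stays in the page cache for the
		 * real read()s that follow later. */
		close(fd);
	}
	return 0;
}

If O_NONBLOCK changes nothing for regular files, this just reads the
first 4k of every file synchronously, which may or may not help.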

Xuân.


2002-09-11 18:28:32

by Oliver Neukum

Subject: Re: Heuristic readahead for filesystems


> In theory, this also could be implemented explicitly if the application
> could tell the kernel "I'm going to read these 100 files in the very
> near future, please make them ready for me". But wait, maybe the
> application can do this (for regular files, not for directory entries
> and stat() data): Could it be efficient if the application used
> open(file,O_NONBLOCK) for the next 100 files and subsequent read()s on
> each of the returned filedescriptors?

What do you want to trigger the readahead, open() or read()?
Please correct me if I am wrong, but wouldn't read() block?

Aio should be able to do it. But even that won't help you with the stat data.

Regards
Oliver

2002-09-11 18:38:43

by Xuan Baldauf

Subject: Re: Heuristic readahead for filesystems



Oliver Neukum wrote:

> > In theory, this also could be implemented explicitly if the application
> > could tell the kernel "I'm going to read these 100 files in the very
> > near future, please make them ready for me". But wait, maybe the
> > application can do this (for regular files, not for directory entries
> > and stat() data): Could it be efficient if the application used
> > open(file,O_NONBLOCK) for the next 100 files and subsequent read()s on
> > each of the returned filedescriptors?
>
> What do you want to trigger the reading ahead, open() or read() ?

As open() immediately returns, it does not matter by which call the readahead
is triggered. But I'm unsure whether it is triggered at all for the amount of
data the read() requested.

>
> Please correct me, if I am wrong, but wouldn't read() block ?

AFAIK, "man open" says:

[...]
int open(const char *pathname, int flags);
[...]
O_NONBLOCK or O_NDELAY
The file is opened in non-blocking mode. Neither the open nor any
__subsequent__ operations on the file descriptor
which is returned will cause the calling process to wait.
[...]

So read won't block if the file has been opened with O_NONBLOCK.


>
>
> Aio should be able to do it. But even that want help you with the stat data.

Aio would help me announce stat() usage for the future?

>
>
> Regards
> Oliver

Xuân.


2002-09-11 18:59:22

by Tomas Szepe

Subject: Re: Heuristic readahead for filesystems

> > I wonder wether Linux implements a kind of heuristic
> > readahead for filesystems:
>
> > If an application did a stat()..open()..read() sequence on a
> > file, it is likely that, after the next stat(), it will open
> > and read the mentioned file. Thus, one could readahead the
> > start of a file on stat() of that file.
>
> > Example: See this diff strace:
>
> Your observation is right, but I'm not sure how much it will
> matter if we start reading the file at stat() time or at
> read() time.
>
> This is because one disk seek takes about 10 million CPU
> cycles on modern systems and we'll have completed the stat(),
> open() and started the read() before the disk arm has started
> moving ;)

Please correct me if I'm missing something: I tend to assume
that if the cost of a disk seek is known to be a relatively
huge value, a fraction of it could be used as a timeout for disk
read request reordering without performance deteriorating (except
for the heuristics overhead, of course). I.e., if some code in the
block layer were able to detect that a lot of "parallel yet
sequential" reads were going on, the number of seeks could be
minimized at the price of the final timeout.

T.

2002-09-11 19:02:13

by Oliver Neukum

Subject: Re: Heuristic readahead for filesystems

On Wednesday, 11 September 2002 20:43, Xuan Baldauf wrote:

> > Please correct me, if I am wrong, but wouldn't read() block ?
>
> AFAIK, "man open" tells
>
> [...]
> int open(const char *pathname, int flags);
> [...]
> O_NONBLOCK or O_NDELAY
> The file is opened in non-blocking mode. Neither the open
> nor any __subsequent__ operations on the file descriptor
> which is returned will cause the calling process to wait.
> [...]
>
> So read won't block if the file has been opened with O_NONBLOCK.

Well, so the man page tells you. The kernel sources tell otherwise, unless
I am badly mistaken.

> > Aio should be able to do it. But even that want help you with the stat
> > data.
>
> Aio would help me announcing stat() usage for the future?

No, it won't. But it would solve the issue of reading ahead.
Stat()ing needs a kernel implementation of 'stat ahead'.
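
For the file data, a minimal sketch of what I mean (POSIX AIO from
<aio.h>, file names taken from the command line, compile with -lrt;
just an illustration, not a claim about how well glibc's aio performs):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAXFILES 64
#define CHUNK    4096

int main(int argc, char **argv)
{
	static struct aiocb cb[MAXFILES];
	static char buf[MAXFILES][CHUNK];
	int i, n = 0;

	/* Queue one read per file; aio_read() only queues the request,
	 * it does not wait for the data. */
	for (i = 1; i < argc && n < MAXFILES; i++, n++) {
		int fd = open(argv[i], O_RDONLY);

		if (fd < 0) {
			perror(argv[i]);
			exit(1);
		}
		memset(&cb[n], 0, sizeof(cb[n]));
		cb[n].aio_fildes = fd;
		cb[n].aio_buf    = buf[n];
		cb[n].aio_nbytes = CHUNK;
		cb[n].aio_offset = 0;
		if (aio_read(&cb[n]) < 0) {
			perror("aio_read");
			exit(1);
		}
	}

	/* Now collect the results; the reads can have been in flight
	 * at the same time. */
	for (i = 0; i < n; i++) {
		const struct aiocb *list[1] = { &cb[i] };
		ssize_t got;

		while (aio_error(&cb[i]) == EINPROGRESS)
			aio_suspend(list, 1, NULL);
		got = aio_return(&cb[i]);
		printf("%s: %ld bytes\n", argv[i + 1], (long)got);
		close(cb[i].aio_fildes);
	}
	return 0;
}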

2002-09-11 19:14:00

by Richard B. Johnson

Subject: Re: Heuristic readahead for filesystems

On Wed, 11 Sep 2002, Oliver Neukum wrote:

> Am Mittwoch, 11. September 2002 20:43 schrieb Xuan Baldauf:
>
> > > Please correct me, if I am wrong, but wouldn't read() block ?
> >
> > AFAIK, "man open" tells
> >
> > [...]
> > int open(const char *pathname, int flags);
> > [...]
> > O_NONBLOCK or O_NDELAY
> > The file is opened in non-blocking mode. Neither the open
> > nor any __subsequent__ operations on the file descriptor
> > which is returned will cause the calling process to wait.
> > [...]
> >
> > So read won't block if the file has been opened with O_NONBLOCK.
>
> Well, so the man page tells you. The kernel sources tell otherwise, unless
> I am badly mistaken.
>
> > > Aio should be able to do it. But even that want help you with the stat
> > > data.
> >
> > Aio would help me announcing stat() usage for the future?
>
> No, it won't. But it would solve the issue of reading ahead.
> Stating needs a kernel implementation of 'stat ahead'
> -

I think this is discussed in the future. Write-ahead is the
next problem solved. ;)

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.

2002-09-12 00:40:57

by jw schultz

Subject: Re: Heuristic readahead for filesystems

On Wed, Sep 11, 2002 at 03:21:37PM -0400, Richard B. Johnson wrote:
> On Wed, 11 Sep 2002, Oliver Neukum wrote:
> > Am Mittwoch, 11. September 2002 20:43 schrieb Xuan Baldauf:

> > > > Aio should be able to do it. But even that want help you with the stat
> > > > data.
> > >
> > > Aio would help me announcing stat() usage for the future?
> >
> > No, it won't. But it would solve the issue of reading ahead.
> > Stating needs a kernel implementation of 'stat ahead'
> > -
>
> I think this is discussed in the future. Write-ahead is the
> next problem solved. ?;)

Getting back to the original issue, which was "readahead" of
stat() info...

The userland open of a directory could trigger an advance
reading of the directory data and of the inode structs of
all its immediate members. Almost all instances of a
usermode open on a directory will be doing fstats. Even a
command-line ls often has options (colour, -F, etc.) turned on
by default that require fstat on all the entries.
The question would be how far ahead of the user app
the kernel would be.

I could possibly see having an fcntl() for directories to
pre-read just the first block of each file, to accelerate
file managers that use magic and perhaps forestall readahead
pulling in more than magic will use.

--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt

2002-09-12 01:38:32

by Andrew Morton

Subject: Re: Heuristic readahead for filesystems

jw schultz wrote:
>
> ...
> Gating back to the original issue which was "readahead" of
> stat() info...
>

ext3 does directory readahead. ext2 does not. If someone can
demonstrate any advantage which ext3 gets from this, we can put
it back into ext2 as well.

inode readahead would be pretty simple to do.

2002-09-12 11:34:55

by Richard B. Johnson

Subject: Re: Heuristic readahead for filesystems

On Wed, 11 Sep 2002, jw schultz wrote:

> On Wed, Sep 11, 2002 at 03:21:37PM -0400, Richard B. Johnson wrote:
> > On Wed, 11 Sep 2002, Oliver Neukum wrote:
> > > Am Mittwoch, 11. September 2002 20:43 schrieb Xuan Baldauf:
>
> > > > > Aio should be able to do it. But even that want help you with the stat
> > > > > data.
> > > >
> > > > Aio would help me announcing stat() usage for the future?
> > >
> > > No, it won't. But it would solve the issue of reading ahead.
> > > Stating needs a kernel implementation of 'stat ahead'
> > > -
> >
> > I think this is discussed in the future. Write-ahead is the
> > next problem solved. ?;)
>
> Gating back to the original issue which was "readahead" of
> stat() info...
>
> The userland open of a directory could trigger an advance
> reading of the directory data and of the inode structs of
> all it's immediate members. Almost all instances of a
> usermode open on a directory will be doing fstats. Even a
> command line ls often has options (colour, -F, etc) turned on
> by default that require fstat on all the entries.
> The question would be how far ahead of the user app would
> the kernel be.
>
> I could possibly see having a fcntl() for directories to
> pre-read just the first block of each file to accelerate
> file-managers that use magic and perhaps forestall readahead
> pulling in more than magic will use.

Then you are tuning a file-system for a single program
like `ls`. Most real-world I/O to file-systems is not done
by `ls` or even `make`. The extra read-ahead overhead is
just that, 'overhead'. Since disk I/O is expensive,
you certainly do not want to read any more than is absolutely
necessary. There were a lot of studies about this in the
70's, when disks were very, very slow. The disk-to-RAM speed
ratio hasn't changed much even though both are much faster.
Therefore, the conclusions of these studies, made by people
from DEC and IBM, should not have changed. From what I recall,
all the studies showed that read-ahead always reduced performance,
but keeping what was already read in RAM always increased
performance.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.

2002-09-12 12:37:29

by Jesse Pollard

Subject: Re: Heuristic readahead for filesystems

On Wednesday 11 September 2002 07:45 pm, jw schultz wrote:
> On Wed, Sep 11, 2002 at 03:21:37PM -0400, Richard B. Johnson wrote:
> > On Wed, 11 Sep 2002, Oliver Neukum wrote:
> > > Am Mittwoch, 11. September 2002 20:43 schrieb Xuan Baldauf:
> > > > > Aio should be able to do it. But even that want help you with the
> > > > > stat data.
> > > >
> > > > Aio would help me announcing stat() usage for the future?
> > >
> > > No, it won't. But it would solve the issue of reading ahead.
> > > Stating needs a kernel implementation of 'stat ahead'
> > > -
> >
> > I think this is discussed in the future. Write-ahead is the
> > next problem solved. ?;)
>
> Gating back to the original issue which was "readahead" of
> stat() info...
>
> The userland open of a directory could trigger an advance
> reading of the directory data and of the inode structs of
> all it's immediate members. Almost all instances of a
> usermode open on a directory will be doing fstats. Even a
> command line ls often has options (colour, -F, etc) turned on
> by default that require fstat on all the entries.
> The question would be how far ahead of the user app would
> the kernel be.
>
> I could possibly see having a fcntl() for directories to
> pre-read just the first block of each file to accelerate
> file-managers that use magic and perhaps forestall readahead
> pulling in more than magic will use.

The problem now is that this becomes filesystem dependent.
Some of the filesystems will already have the inode loaded, and
if the inode is loaded, then the first block of the file is also loaded.

If the proposed readahead is done at the VFS level, then multiple
reads will be issued for the same block, with some physical IO
reduction due to cache hits.

This is starting to sound like a bit of overoptimization.

Still... try it, on different filesystems, and see what happens.

--
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]

Any opinions expressed are solely my own.

2002-09-12 13:30:59

by jw schultz

Subject: Re: Heuristic readahead for filesystems

On Thu, Sep 12, 2002 at 07:41:09AM -0400, Richard B. Johnson wrote:
> On Wed, 11 Sep 2002, jw schultz wrote:
>
> > On Wed, Sep 11, 2002 at 03:21:37PM -0400, Richard B. Johnson wrote:
> > > On Wed, 11 Sep 2002, Oliver Neukum wrote:
> > > > No, it won't. But it would solve the issue of reading ahead.
> > > > Stating needs a kernel implementation of 'stat ahead'
> > > > -
> > >
> > > I think this is discussed in the future. Write-ahead is the
> > > next problem solved. ?;)
> >
> > Gating back to the original issue which was "readahead" of
> > stat() info...
> >
> > The userland open of a directory could trigger an advance
> > reading of the directory data and of the inode structs of
> > all it's immediate members. Almost all instances of a
> > usermode open on a directory will be doing fstats. Even a
> > command line ls often has options (colour, -F, etc) turned on
> > by default that require fstat on all the entries.
> > The question would be how far ahead of the user app would
> > the kernel be.
> >
> > I could possibly see having a fcntl() for directories to
> > pre-read just the first block of each file to accelerate
> > file-managers that use magic and perhaps forestall readahead
> > pulling in more than magic will use.
>
> Then you are tuning a file-system for a single program
> like `ls`. Most real-world I/O to file-systems are not done
> by `ls` or even `make`. The extra read-ahead overhead is

Most real-world filesystem I/O doesn't open(2) directories.
Most filesystem I/O is stat, open and unlink of files with
full paths, with no open(2) of the directory. Notice that I
refer to the system call, not the internal functions that path
lookup invokes. The list of things that open(2) directories
is very short, and almost all of them stat the majority
of the directory's contents.

> just that, 'overhead'. Since the cost of disk I/O is expensive,
> you certainly do not want to read any more than is absolutely
> necessary. There had been a lot so studies about this in the
> 70's when disks were very, very, slow. The disk-to-RAM speed
> ratio hasn't changed much even though both are much faster.
> Therefore, the conclusions of these studies, made by persons
> from DEC and IBM, should not have changed. From what I recall,
> all studies showed that read-ahead always reduced performance,
> but keeping what was already read in RAM always increased
> performance.

I'm sure there will be others who can show you the numbers.
Things have changed since those studies. Disks are still
slooooooooow. However, the OS doesn't have to nursemaid the
transfer. In most cases we queue requests and receive an
interrupt when the data is in memory. _IF_ the disk
isn't otherwise kept busy, readahead reduces latency.
Most of the associated blocks are in near proximity, so there
is a good chance of doing the readahead in a minimum number of
requests if they are fed to the queues in a batch.

> > The question would be how far ahead of the user app would
> > the kernel be.
I repeat this because I think it is a central point. Much
of the I/O associated with directory scanning is in tight
loops that would mirror the kernel's behavior. I have
doubts that it would produce a performance boost because it
might just be a synchronized duplication of effort.

I am not advocating doing this. I was just exploring the
idea and bringing the thread back to the original question.

--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt

2002-09-12 21:28:22

by jdow

Subject: Re: Heuristic readahead for filesystems

From: "Richard B. Johnson" <[email protected]>

> Then you are tuning a file-system for a single program
> like `ls`. Most real-world I/O to file-systems are not done
> by `ls` or even `make`. The extra read-ahead overhead is
> just that, 'overhead'. Since the cost of disk I/O is expensive,
> you certainly do not want to read any more than is absolutely
> necessary. There had been a lot so studies about this in the
> 70's when disks were very, very, slow. The disk-to-RAM speed
> ratio hasn't changed much even though both are much faster.
> Therefore, the conclusions of these studies, made by persons
> from DEC and IBM, should not have changed. From what I recall,
> all studies showed that read-ahead always reduced performance,
> but keeping what was already read in RAM always increased
> performance.

Dick, those studies are simply not meaningful. The speedup for
general applications that I generated in the mid 80s with a pair
of SCSI controllers for the Amiga was rather dramatic. At that
time every PC controller I ran down was reading 512 bytes per
transaction. They could not read contiguous sectors unless they
were VERY fast. For these readahead would generate no benefit.
(Even some remarkably expensive SCSI controllers for PCs fell
into that trap and defective mindset.) The controllers I re-
engineered were capable of reading large blocks of data multiple
sectors in size in a single transaction. I experimented with
several programs and discovered that a 16k readahead was about
my optimum compromise between the read time overhead vs the
transaction time overhead. I even found that for the average case
ONE buffer was sufficient, which boggled me. (As a developer I
was used to reading multiple files at a time to create object
files and linked targets.)

Back in the 70s did anyone ever read more than a single block
at a time, other than me that is? (I had readahead in CP/M back
in the late 70s. But the evidence for that has evaporated I am
afraid. I repeatedly boggled other CP/M users with my 8" floppy
speeds as a result. I also added blocking and deblocking so I
could use large sectors for even more speed.)

{^_^} Joanne Dow, [email protected], not a gray beard solely
because I am severely beard challenged.


2002-09-16 12:45:39

by Richard B. Johnson

Subject: Re: Heuristic readahead for filesystems

On Thu, 12 Sep 2002, jdow wrote:

> From: "Richard B. Johnson" <[email protected]>
>
> > Then you are tuning a file-system for a single program
> > like `ls`. Most real-world I/O to file-systems are not done
> > by `ls` or even `make`. The extra read-ahead overhead is
> > just that, 'overhead'. Since the cost of disk I/O is expensive,
> > you certainly do not want to read any more than is absolutely
> > necessary. There had been a lot so studies about this in the
> > 70's when disks were very, very, slow. The disk-to-RAM speed
> > ratio hasn't changed much even though both are much faster.
> > Therefore, the conclusions of these studies, made by persons
> > from DEC and IBM, should not have changed. From what I recall,
> > all studies showed that read-ahead always reduced performance,
> > but keeping what was already read in RAM always increased
> > performance.
>
> Dick, those studies are simply not meaningful. The speedup for
> general applications that I generated in the mid 80s with a pair
> of SCSI controllers for the Amiga was rather dramatic. At that
> time every PC controller I ran down was reading 512 bytes per
> transaction. They could not read contiguous sectors unless they
> were VERY fast. For these readahead would generate no benefit.
> (Even some remarkably expensive SCSI controllers for PCs fell
> into that trap and defective mindset.) The controllers I re-
> engineered were capable of reading large blocks of data multiple
> sectors in size in a single transaction. I experimented with
> several programs and discovered that a 16k readahead was about
> my optimum compromise between the read time overhead vs the
> transaction time overhead. I even found that for the average case
> ONE buffer was sufficient, which boggled me. (As a developer I
> was used to reading multiple files at a time to create object
> files and linked targets.)

Well, they could read contiguous sectors if the sector interleave
was correctly determined and the correct interleave was set
while low-level formatting. Nowadays, interleave is either ignored
or unavailable because there is a sector buffer that can contain
an entire track of data. Some SCSI drives have sector buffers
that can contain a whole cylinder of data.

>
> Back in the 70s did anyone ever read more than a single block
> at a time, other than me that is? (I had readahead in CP/M back
> in the late 70s. But the evidence for that has evaporated I am
> afraid. I repeatedly boggled other CP/M users with my 8" floppy
> speeds as a result. I also added blocking and deblocking so I
> could use large sectors for even more speed.)
>

When we were doing direct-to-disk data writing here at
Analogic around 20 years ago, I experimented with changing
the sector interleave to speed up disc I/O. The DEC disks
in use at that time on our systems had ST-506 interfaces.
I had made a similar utility, called Spin-Ok (a put-down
of Spin-Right who stole my public-domain software and
made a commercial version), for the IBM/PC.

I had to get the primitives from Digital so I could write
a disk formatting driver. Our Chairman Bernie Gordon and
DEC's chairman Ken Olson were not exactly buddies, but we
were allowed to obtain information that might not be given
to everybody.

Some DEC engineers were interested in the results of my experiments.
It turned out that DEC's 1:1 interleave was nowhere near
correct for the maximum data transfer rate. I don't remember the
numbers, but I think the best speed was with a 5:1 or, perhaps,
3:1. Anyway, experiments led to using DEC file-system disks
as well as raw disks.

The file-system of the day was RMS. It had a fixed-length allocation
map which was owned by a "file" called BITMAP.SYS. This map was
contiguous and each bit in that map represented an allocation unit
on the physical structure. I proposed to DEC engineering that any
access to this file area result in the entire map being read
into memory. I am quite aware of the "not-invented-here" syndrome
commonplace at Digital. However, one Engineer went to great lengths
presenting data and the results of his group's tests on just that
idea plus the idea of reading ahead, in general. It was shown, at
least to my satisfaction 20 years ago, that any read-ahead was
quite counter productive.

And the numbers were obvious! They didn't require any thought!

Somebody else on this thread asked for a URL! There wasn't any
such thing when this work was going on. We had BBS Systems and
1200 baud modems, eventually up to 9600. However, I did search
for some info. An interesting result is RAPID (Read Ahead for Parallel
Independent Disks)
http://www.unet.univie.ac.at/~a9405327/glossary/node93.html

This was experimental and seems to have been abandoned in the
late 1990s although it provided a lot of meat for several Masters
and PhD theses.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.

2002-09-17 01:40:34

by jdow

Subject: Re: Heuristic readahead for filesystems

From: "Richard B. Johnson" <[email protected]>
> On Thu, 12 Sep 2002, jdow wrote:
>
> > From: "Richard B. Johnson" <[email protected]>
> >
> > > Then you are tuning a file-system for a single program
> > > like `ls`. Most real-world I/O to file-systems are not done
> > > by `ls` or even `make`. The extra read-ahead overhead is
> > > just that, 'overhead'. Since the cost of disk I/O is expensive,
> > > you certainly do not want to read any more than is absolutely
> > > necessary. There had been a lot so studies about this in the
> > > 70's when disks were very, very, slow. The disk-to-RAM speed
> > > ratio hasn't changed much even though both are much faster.
> > > Therefore, the conclusions of these studies, made by persons
> > > from DEC and IBM, should not have changed. From what I recall,
> > > all studies showed that read-ahead always reduced performance,
> > > but keeping what was already read in RAM always increased
> > > performance.
> >
> > Dick, those studies are simply not meaningful. The speedup for
> > general applications that I generated in the mid 80s with a pair
> > of SCSI controllers for the Amiga was rather dramatic. At that
> > time every PC controller I ran down was reading 512 bytes per
> > transaction. They could not read contiguous sectors unless they
> > were VERY fast. For these readahead would generate no benefit.
> > (Even some remarkably expensive SCSI controllers for PCs fell
> > into that trap and defective mindset.) The controllers I re-
> > engineered were capable of reading large blocks of data multiple
> > sectors in size in a single transaction. I experimented with
> > several programs and discovered that a 16k readahead was about
> > my optimum compromise between the read time overhead vs the
> > transaction time overhead. I even found that for the average case
> > ONE buffer was sufficient, which boggled me. (As a developer I
> > was used to reading multiple files at a time to create object
> > files and linked targets.)
>
> Well they could read contiguous sectors if the sector interleave
> was correctly determined and the correct interleave was set
> while low-level formatting. Now-days, interleave is either ignored
> or unavailable because there is a sector buffer that can contain
> an entire track of data. Some SCSI drives have sector buffers
> that can contain a whole cylinder of data.

When I say contiguous I mean contiguous, not interleaved, sonny. I had
CP/M (and UCSD Pascal) reading physically contiguous sectors on the
disk with no lost speed. That means that, with my DSSD format of
9 sectors of 512 bytes per side, I read 18 full tracks in 19 revolutions
of the disk. I did skew the sector numbers to allow for seeks. But I
did not interleave the tracks. It was not necessary with clean and
correct code. I rather resent the presumption that I am a dumb bitch
here.

> > Back in the 70s did anyone ever read more than a single block
> > at a time, other than me that is? (I had readahead in CP/M back
> > in the late 70s. But the evidence for that has evaporated I am
> > afraid. I repeatedly boggled other CP/M users with my 8" floppy
> > speeds as a result. I also added blocking and deblocking so I
> > could use large sectors for even more speed.)
> >
>
> When we were doing direct-to-disk data writing here at
> Analogic around 20 years ago, I experimented with changing
> the sector interleave to speed up disc I/O. The DEC disks
> in use at that time on our systems had ST-506 interfaces.
> I had made a similar utility, called Spin-Ok (a put-down
> of Spin-Right who stole my public-domain software and
> made a commercial version), for the IBM/PC.

I had my CP/M hard disks also working on 1:1 with a somewhat
tweaked Morrow Designs controller and a good BIOS.

> I had to get the privatives from Digital so I could write
> a disk formatting driver. Our Chairman Bernie Gordon and
> DEC's chairman Ken Olson were not exactly buddies, but we
> were allowed to obtain information that might not be given
> to everybody.

Morrow Designs provided code. That made it easy. {^_-}

> Some DEC engineers were interested in the results of my experiments.
> It turned out that DEC's 1:1 interleave was no where near
> correct for the maximum data transfer rate. I don't remember the
> numbers, but I think the best speed was with a 5:1 or, perhaps,
> 3:1. Anyway experiments lead to using DEC file-system disks
> as well as raw disks.

<snicker> It all depends on how much overhead you install between
individual sector read requests. I worked very hard to get that
overhead down to a bare minimum. I also discovered that having a
local buffer from which program read requests could take place
was an invaluable speedup technique. (So in effect I was doing
readahead back in my CP/M days. Oh, I also rewrote CP/M itself
for the exercise. It proved entertaining to add some of the MP/M
features into CP/M with automatic code relocation on TSRs so that
when any one was unloaded the others moved to remove any gaps in
the resultant memory map. It was rather fun. Thanks for reminding
me of it.)

> The file-system of the day was RMS. It had a fixed-length allocation
> map which was owned by a "file" called BITMAP.SYS. This map was
> contiguous and each bit in that map represented an allocation unit
> on the physical structure. I proposed to DEC engineering that any
> access to this file area, result in the entire map being read
> into memory. I am quite aware of the "not-invented-here" syndrome
> commonplace at Digital. However, one Engineer went to great lengths
> presenting data and the results of his group's tests on just that
> idea plus the idea of reading ahead, in general. It was shown, at
> least to my satisfaction 20 years ago, that any read-ahead was
> quite counter productive.

That sounds like the filesystem UCSD Pascal used. I had it doing
1:1 interleave very early on instead of the 2:1 that it came with.

> And the numbers were obvious! They didn't require any thought!

Oh really. {^_-}

> Somebody else on this thread asked for a URL! There wasn't any
> such thing when this work was going on. We had BBS Systems and
> 1200 baud modems, eventually up to 9600. However, I did search
> for some info. An interesing result is RAPID (Read Ahead for Parallel
> Independent Disks)
> http://www.unet.univie.ac.at/~a9405327/glossary/node93.html
>
> This was experimental and seems to have been abandoned in the
> late 1990s although it provided a lot of meat for several Masters
> and PhD thesis.

I'll have to take a look at it, David. (That was all back in my
RF Engineering days before I "formally" adopted software as a
hobby to get paid for indulging. I did things I bloody well knew
could be done, however erroneously my knowledge existed. I tended
to do that with RF, too. But the field is wider for software so
I find software more fun these days.)
{^_^}

2002-09-17 05:14:19

by jdow

Subject: Re: Heuristic readahead for filesystems

From: "Richard B. Johnson" <[email protected]>

> On Thu, 12 Sep 2002, jdow wrote:

> > Dick, those studies are simply not meaningful. The speedup for
> > general applications that I generated in the mid 80s with a pair
> > of SCSI controllers for the Amiga was rather dramatic. At that
> > time every PC controller I ran down was reading 512 bytes per
> > transaction. They could not read contiguous sectors unless they
> > were VERY fast. For these readahead would generate no benefit.
> > (Even some remarkably expensive SCSI controllers for PCs fell
> > into that trap and defective mindset.) The controllers I re-
> > engineered were capable of reading large blocks of data multiple
> > sectors in size in a single transaction. I experimented with
> > several programs and discovered that a 16k readahead was about
> > my optimum compromise between the read time overhead vs the
> > transaction time overhead. I even found that for the average case
> > ONE buffer was sufficient, which boggled me. (As a developer I
> > was used to reading multiple files at a time to create object
> > files and linked targets.)
>
> Well they could read contiguous sectors if the sector interleave
> was correctly determined and the correct interleave was set
> while low-level formatting. Now-days, interleave is either ignored
> or unavailable because there is a sector buffer that can contain
> an entire track of data. Some SCSI drives have sector buffers
> that can contain a whole cylinder of data.

Please allow me to add to my previous comment that the StarDrive and
HardFrame controllers were SCSI controllers. They ALWAYS made multi-
block reads. The smallest read they made was 16k. If more was requested
the StarDrive would read as much as was requested in one read. The
HardFrame had to break it into smaller, 128k, pieces due to a strange
DMA problem with the DMA controller that didn't like to be recycled
too quickly. So if a drive could send contiguous sectors on a 1:1
interleave that was what was used. (The ONLY thing I ever had that
could not was an Adaptec A4000 SCSI to ST506 controller that was rather
poorly written inside and could not, itself, manage 1:1 interleave on
the attached ST506 drive.) With CP/M I was doing 1:1 interleave on
8" floppies in the late 70s at least as reliably as other folks did
their 13:1 (CP/M) or 2:1 (UCSD Pascal). The controller was "dumb",
a WD 1771 chip in a board called a "VersaFloppy".

{^_^} Joanne Dow, [email protected], who "gets it off" doing things
like this that the experts tell her are impossible.


2002-09-17 10:25:48

by jbradford

Subject: Re: Heuristic readahead for filesystems

> > Well they could read contiguous sectors if the sector interleave
> > was correctly determined and the correct interleave was set
> > while low-level formatting. Now-days, interleave is either ignored
> > or unavailable because there is a sector buffer that can contain
> > an entire track of data. Some SCSI drives have sector buffers
> > that can contain a whole cylinder of data.
>
> When I say contiguous I mean contiguous not interleaved, sonny. I had
> CP/M (and UCSD Pascal) reading physically contiguous sectors on the
> disk with no lost speed. That means I read, with my DSSD format of
> 9 sectors each 512 bytes in size per side 18 full tracks 19 revolutions
> of the disk. I did skew the sector numbers to allow for seeks. But I
> did not interleave the tracks. It was not necessary with clean and
> correct code. I rather resent the presumption that I am a dumb bitch
> here.

Ah, but the *really* clever thing to do at the time, on systems where
you couldn't optimally achieve 1:1 interleave on a floppy, was to
allocate sectors on alternating sides of the disk. So, you could, for
example, read two tracks in 3 revolutions, instead of 6, in the case
of 3:1 interleave :-).

John.