2010-04-19 15:51:36

by Phillip Susi

[permalink] [raw]
Subject: readahead on directories

I have been trying to improve boot times with ureadahead and have
identified a period of time of almost zero IO throughput caused by calls
to open() blocking during name lookup while fetching directory blocks.
At first I thought this could simply be fixed by calling readahead() on
the directories first before open()ing all of the normal files for
readahead.

Unfortunately it seems that readahead() fails when called on a
directory. I was wondering if I could get some help understanding why
this is, and how it could be fixed.
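
For illustration, a minimal sketch of the failing call (the /etc path and the
128 KB length are arbitrary placeholders):

#define _GNU_SOURCE
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc", O_RDONLY | O_DIRECTORY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Works on a regular file; errors out on a directory. */
    if (readahead(fd, 0, 128 * 1024) < 0)
        fprintf(stderr, "readahead: %s\n", strerror(errno));
    close(fd);
    return 0;
}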


2010-04-21 00:44:37

by Jamie Lokier

[permalink] [raw]
Subject: Re: readahead on directories

Phillip Susi wrote:
> I have been trying to improve boot times with ureadahead and have
> identified a period of time of almost zero IO throughput caused by calls
> to open() blocking during name lookup while fetching directory blocks.
> At first I thought this could simply be fixed by calling readahead() on
> the directories first before open()ing all of the normal files for
> readahead.
>
> Unfortunately it seems that readahead() fails when called on a
> directory. I was wondering if I could get some help understanding why
> this is, and how it could be fixed.

readahead() doesn't make much sense on a directory - the offset and
size aren't meaningful.

But does plain opendir/readdir/closedir solve the problem?

-- Jamie

2010-04-21 14:57:10

by Phillip Susi

[permalink] [raw]
Subject: Re: readahead on directories

On 4/20/2010 8:44 PM, Jamie Lokier wrote:
> readahead() doesn't make much sense on a directory - the offset and
> size aren't meaningful.
>
> But does plain opendir/readdir/closedir solve the problem?

No, since those are synchronous. I want to have readahead() queue up
reading the entire directory in the background to avoid blocking, and
get the queue filled with a bunch of requests that can be merged into
larger segments before being dispatched to the hardware.

I don't actually care to have the contents of the directories returned,
so readdir() does more than I need in that respect, and also it performs
a blocking read of one disk block at a time, which is horribly slow with
a cold cache.

2010-04-21 16:12:14

by Jamie Lokier

[permalink] [raw]
Subject: Re: readahead on directories

Phillip Susi wrote:
> On 4/20/2010 8:44 PM, Jamie Lokier wrote:
> > readahead() doesn't make much sense on a directory - the offset and
> > size aren't meaningful.
> >
> > But does plain opendir/readdir/closedir solve the problem?
>
> No, since those are synchronous. I want to have readahead() queue up
> reading the entire directory in the background to avoid blocking, and
> get the queue filled with a bunch of requests that can be merged into
> larger segments before being dispatched to the hardware.

Asynchronous is available: Use clone or pthreads.

More broadly: One of the ways to get better I/O sorting is to make sure
you've got enough things in parallel that the I/O queue is never
empty, so what you issue has time to get sorted before it reaches the
head of the queue for dispatch. On the other hand, not so many things
in parallel that the queues fill up and throttle. Unfortunately it
only works if things aren't serialised by kernel locks - but there's been
a lot of work on lockless this and that in the kernel, which may help.
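
A rough sketch of the pthread variant, just to show the shape of it - the
directory list and thread count are arbitrary, and each thread only issues
the ordinary synchronous calls (link with -lpthread):

#include <dirent.h>
#include <pthread.h>
#include <stddef.h>

static void *read_dir(void *arg)
{
    DIR *d = opendir(arg);
    struct dirent *de;

    if (!d)
        return NULL;
    while ((de = readdir(d)) != NULL)
        ;                   /* entries are discarded; we only want the I/O */
    closedir(d);
    return NULL;
}

int main(void)
{
    const char *dirs[] = { "/usr/bin", "/usr/lib", "/etc", "/var/lib" };
    pthread_t tid[4];
    int i;

    for (i = 0; i < 4; i++)
        pthread_create(&tid[i], NULL, read_dir, (void *)dirs[i]);
    for (i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    return 0;
}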

Back to your problem: You need a bunch of scattered block requests to
be queued and sorted sanely, and readdir doesn't do that, and even
waits for each block before issuing the next request.

Or does it?

A quick skim of fs/{ext3,ext4}/dir.c finds a call to
page_cache_sync_readahead. Doesn't that do any reading ahead? :-)

> I don't actually care to have the contents of the
> directories returned, so readdir() does more than I need in that
> respect, and also it performs a blocking read of one disk block at a
> time, which is horribly slow with a cold cache.

I/O is probably the biggest cost, so it's more important to get
the I/O pattern you want than worrying about return values you'll discard.

If readdir() is slowed down by the sheer number of calls and libc overhead,
consider using the getdirentries system call directly.
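
On Linux the underlying system call is getdents64 (getdirentries is the
library wrapper); a sketch of driving it directly, with a big buffer so each
call returns a large batch of entries (buffer size arbitrary, contents
discarded):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>

int slurp_dir(const char *path)
{
    char buf[64 * 1024];
    long n;
    int fd = open(path, O_RDONLY | O_DIRECTORY);

    if (fd < 0)
        return -1;
    /* Each call returns as many dirents as fit in the buffer. */
    while ((n = syscall(SYS_getdents64, fd, buf, sizeof(buf))) > 0)
        ;
    close(fd);
    return n < 0 ? -1 : 0;
}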

If not, fs/ext4/namei.c:ext4_dir_inode_operations points to
ext4_fiemap. So you may have luck calling FIEMAP or FIBMAP on the
directory, and then reading blocks using the block device. I'm not
sure if the cache loaded via the block device (when mounted) will then
be used for directory lookups.
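
If a filesystem does implement FIEMAP for directories, the idea would look
something like this sketch - whether the cache loaded via the block device is
later used for directory lookups is exactly the open question, and the extent
limit is arbitrary.  The device passed in must be the partition the filesystem
lives on, since fe_physical is an offset from the start of the filesystem:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

#define MAX_EXTENTS 32

int prefetch_dir_blocks(const char *dir, const char *blkdev)
{
    struct fiemap *fm;
    size_t sz = sizeof(*fm) + MAX_EXTENTS * sizeof(struct fiemap_extent);
    int ret = -1;
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    int bfd = open(blkdev, O_RDONLY);
    unsigned i;

    fm = calloc(1, sz);
    if (dfd < 0 || bfd < 0 || !fm)
        goto out;

    fm->fm_start = 0;
    fm->fm_length = ~0ULL;              /* map the whole directory */
    fm->fm_flags = FIEMAP_FLAG_SYNC;
    fm->fm_extent_count = MAX_EXTENTS;

    if (ioctl(dfd, FS_IOC_FIEMAP, fm) < 0)
        goto out;

    /* Queue readahead on the partition device for every extent found. */
    for (i = 0; i < fm->fm_mapped_extents; i++)
        readahead(bfd, fm->fm_extents[i].fe_physical,
                  fm->fm_extents[i].fe_length);
    ret = 0;
out:
    free(fm);
    if (dfd >= 0)
        close(dfd);
    if (bfd >= 0)
        close(bfd);
    return ret;
}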

-- Jamie

2010-04-21 18:51:32

by Jamie Lokier

[permalink] [raw]
Subject: Re: readahead on directories

Evgeniy Polyakov wrote:
> On Wed, Apr 21, 2010 at 05:12:11PM +0100, Jamie Lokier ([email protected]) wrote:
> > A quick skim of fs/{ext3,ext4}/dir.c finds a call to
> > page_cache_sync_readahead. Doesn't that do any reading ahead? :-)
>
> It goes down to the fs callbacks for data reading, which are not applicable
> to directories.
>
> To implement directory 'readahead' we use a separate thread to call
> readdir(). It is damn slow indeed, but it can populate the cache in advance
> of the actual data reading. As a higher-level crutch there is a 'find'
> running in the background.

Fwiw, I found sorting directories by inode and reading them in that
order helped to reduce seeks, some 10 years ago. I implemented
something like 'find' which works like that, keeping a queue of
directories to read and things to open/stat, ordered by inode number
seen in d_ino before open/stat and st_ino after. However it did not
try to readahead the blocks inside a directory, or sort operations by
block number. It reduced some 'find'-like operations to about a
quarter of the time on cold cache. I still use that program sometimes
before "git status" ;-) Google "treescan" and "lokier" if you're
interested in trying it (though I use 0.7 which isn't published).
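
The core of that trick is small; a rough sketch (not the actual treescan
code, with error handling and the clever queueing trimmed):

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

struct ent {
    ino_t ino;
    char  name[4096];
};

static int by_ino(const void *a, const void *b)
{
    const struct ent *x = a, *y = b;
    return (x->ino > y->ino) - (x->ino < y->ino);
}

void scan_sorted(const char *path)
{
    struct ent *v = NULL, *tmp;
    size_t n = 0, i;
    struct dirent *de;
    struct stat st;
    DIR *d = opendir(path);

    if (!d)
        return;
    /* Collect every entry with its d_ino. */
    while ((de = readdir(d)) != NULL) {
        tmp = realloc(v, (n + 1) * sizeof(*v));
        if (!tmp)
            break;
        v = tmp;
        v[n].ino = de->d_ino;
        snprintf(v[n].name, sizeof(v[n].name), "%s/%s", path, de->d_name);
        n++;
    }
    closedir(d);

    /* Inode order roughly tracks on-disk order, so stat in that order. */
    qsort(v, n, sizeof(*v), by_ino);
    for (i = 0; i < n; i++)
        stat(v[i].name, &st);
    free(v);
}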

> > > I don't actually care to have the contents of the
> > > directories returned, so readdir() does more than I need in that
> > > respect, and also it performs a blocking read of one disk block at a
> > > time, which is horribly slow with a cold cache.
> >
> > I/O is the probably the biggest cost, so it's more important to get
> > the I/O pattern you want than worrying about return values you'll discard.
> >
> > If readdir() calls are slowed by lots of calls and libc, consider
> > using the getdirentries system call directly.
>
> it is not about readdir(). Plain read() is synchronous too. But
> filesystem can respond to readahead calls and read next block to current
> one, while it won't do this for next direntry.

I'm surprised it makes much difference, as directories are usually not
very large anyway.

But if it does, go on, try FIEMAP and blockdev reading, you know you
want to :-)

-- Jamie

2010-04-21 18:57:00

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: readahead on directories

On Wed, Apr 21, 2010 at 07:51:24PM +0100, Jamie Lokier ([email protected]) wrote:
> Fwiw, I found sorting directories by inode and reading them in that
> order help to reduce seeks, some 10 years ago. I implemented
> something like 'find' which works like that, keeping a queue of
> directories to read and things to open/stat, ordered by inode number
> seen in d_ino before open/stat and st_ino after. However it did not
> try to readahead the blocks inside a directory, or sort operations by
> block number. It reduced some 'find'-like operations to about a
> quarter of the time on cold cache. I still use that program sometimes
> before "git status" ;-) Google "treescan" and "lokier" if you're
> interested in trying it (though I use 0.7 which isn't published).

As you might expect, it is not really a directory readahead :)
And I'm not really sure ext234 can implement it in the kernel more optimally
without breaking backward compatibility though.

> > it is not about readdir(). Plain read() is synchronous too. But
> > filesystem can respond to readahead calls and read next block to current
> > one, while it won't do this for next direntry.
>
> I'm surprised it makes much difference, as directories are usually not
> very large anyway.

Well, having several tens of millions of files in 64k dirs takes from tens of
seconds to minutes to read just because of that.

> But if it does, go on, try FIEMAP and blockdev reading, you know you
> want to :-)

Well, it requires substantial underlying fs knowledge, is not simple,
and, well, is not appropriate to do in some cases.

--
Evgeniy Polyakov

2010-04-21 19:05:57

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: readahead on directories

On Wed, Apr 21, 2010 at 05:12:11PM +0100, Jamie Lokier ([email protected]) wrote:
> A quick skim of fs/{ext3,ext4}/dir.c finds a call to
> page_cache_sync_readahead. Doesn't that do any reading ahead? :-)

It goes down to the fs callbacks for data reading, which are not applicable
to directories.

To implement directory 'readahead' we use a separate thread to call
readdir(). It is damn slow indeed, but it can populate the cache in advance
of the actual data reading. As a higher-level crutch there is a 'find'
running in the background.

> > I don't actually care to have the contents of the
> > directories returned, so readdir() does more than I need in that
> > respect, and also it performs a blocking read of one disk block at a
> > time, which is horribly slow with a cold cache.
>
> I/O is the probably the biggest cost, so it's more important to get
> the I/O pattern you want than worrying about return values you'll discard.
>
> If readdir() calls are slowed by lots of calls and libc, consider
> using the getdirentries system call directly.

It is not about readdir(). Plain read() is synchronous too. But the
filesystem can respond to readahead calls and read the next block in addition
to the current one, while it won't do this for the next direntry.

--
Evgeniy Polyakov

2010-04-21 19:12:33

by Phillip Susi

[permalink] [raw]
Subject: Re: readahead on directories

On 4/21/2010 12:12 PM, Jamie Lokier wrote:
> Asynchronous is available: Use clone or pthreads.

Synchronous in another process is not the same as async. It seems I'm
going to have to do this for now as a workaround, but one of the reasons
that aio was created was to avoid the inefficiencies this introduces.
Why create a new thread context, switch to it, put a request in the
queue, then sleep, when you could just drop the request in the queue in
the original thread and move on?

> A quick skim of fs/{ext3,ext4}/dir.c finds a call to
> page_cache_sync_readahead. Doesn't that do any reading ahead? :-)

Unfortunately it does not help when it is synchronous. The process
still sleeps until it has fetched the blocks it needs. I believe that
code just ends up doing a single 4kb read if the directory is no larger
than that, or if it is, then it reads up to readahead_size. It puts the
request in the queue then sleeps until all the data has been read, even
if only the first 4kb was required before readdir() could return.

This means that a single thread calling readdir() is still going to
block reading the directory before it can move on to trying to read
other directories that are also needed.

> I/O is the probably the biggest cost, so it's more important to get
> the I/O pattern you want than worrying about return values you'll discard.

True, but it would be nice not to waste cpu cycles copying unneeded data
around.

> If not, fs/ext4/namei.c:ext4_dir_inode_operations points to
> ext4_fiemap. So you may have luck calling FIEMAP or FIBMAP on the
> directory, and then reading blocks using the block device. I'm not
> sure if the cache loaded via the block device (when mounted) will then
> be used for directory lookups.

Yes, I had considered that. ureadahead already makes use of ext2fslibs
to open the block device and read the inode tables so they are already
in the cache for later use. It seems a bit silly to do that though,
when that is exactly what readahead() SHOULD do for you.

2010-04-21 19:23:17

by Phillip Susi

[permalink] [raw]
Subject: Re: readahead on directories

On 4/21/2010 2:51 PM, Jamie Lokier wrote:
> Fwiw, I found sorting directories by inode and reading them in that
> order help to reduce seeks, some 10 years ago. I implemented
> something like 'find' which works like that, keeping a queue of
> directories to read and things to open/stat, ordered by inode number
> seen in d_ino before open/stat and st_ino after. However it did not
> try to readahead the blocks inside a directory, or sort operations by
> block number. It reduced some 'find'-like operations to about a
> quarter of the time on cold cache. I still use that program sometimes
> before "git status" ;-) Google "treescan" and "lokier" if you're
> interested in trying it (though I use 0.7 which isn't published).

That helps with open()ing or stat()ing the files since you access the
inodes in order, but ureadahead already preloads all of the inode tables
so this won't help.

>> it is not about readdir(). Plain read() is synchronous too. But
>> filesystem can respond to readahead calls and read next block to current
>> one, while it won't do this for next direntry.
>
> I'm surprised it makes much difference, as directories are usually not
> very large anyway.

That's just it; it doesn't help. That's why I want to readahead() all
of the directories at once instead of reading them one block at a time.

> But if it does, go on, try FIEMAP and blockdev reading, you know you
> want to :-)

Why reinvent the wheel when that's readahead()'s job? As a workaround
I'm about to try just threading all of the calls to open(). Each one
will queue a read and block, but with them all doing so at once, that
should fill the queue with plenty of reads. It is inefficient, but better than
one block at a time.
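
Roughly this shape - a hypothetical helper, and in practice the number of
threads in flight would need to be capped (link with -lpthread):

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static void *prefetch_one(void *arg)
{
    /* open() blocks on the cold directory blocks during name lookup... */
    int fd = open(arg, O_RDONLY);

    if (fd >= 0) {
        /* ...then queue the file data itself and move on. */
        readahead(fd, 0, 1024 * 1024);
        close(fd);
    }
    return NULL;
}

void prefetch_all(char **paths, int count)
{
    pthread_t *tid = calloc(count, sizeof(*tid));
    int i;

    if (!tid)
        return;
    for (i = 0; i < count; i++)
        pthread_create(&tid[i], NULL, prefetch_one, paths[i]);
    for (i = 0; i < count; i++)
        pthread_join(tid[i], NULL);
    free(tid);
}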

2010-04-21 20:01:09

by Jamie Lokier

[permalink] [raw]
Subject: Re: readahead on directories

Phillip Susi wrote:
> On 4/21/2010 2:51 PM, Jamie Lokier wrote:
> > Fwiw, I found sorting directories by inode and reading them in that
> > order help to reduce seeks, some 10 years ago. I implemented
> > something like 'find' which works like that, keeping a queue of
> > directories to read and things to open/stat, ordered by inode number
> > seen in d_ino before open/stat and st_ino after. However it did not
> > try to readahead the blocks inside a directory, or sort operations by
> > block number. It reduced some 'find'-like operations to about a
> > quarter of the time on cold cache. I still use that program sometimes
> > before "git status" ;-) Google "treescan" and "lokier" if you're
> > interested in trying it (though I use 0.7 which isn't published).
>
> That helps with open()ing or stat()ing the files since you access the
> inodes in order, but ureadahead already preloads all of the inode tables
> so this won't help.

It helps a little with data access too, because of block group
locality tending to follow inode numbers. Don't read inodes and data
in the same batch though.

> >> it is not about readdir(). Plain read() is synchronous too. But
> >> filesystem can respond to readahead calls and read next block to current
> >> one, while it won't do this for next direntry.
> >
> > I'm surprised it makes much difference, as directories are usually not
> > very large anyway.
>
> That's just it; it doesn't help. That's why I want to readahead() all
> of the directories at once instead of reading them one block at a time.

Ok, this discussion has got a bit confused. Text above refers to
needing to asynchronously read next block in a directory, but if they
are small then that's not important.

> > But if it does, go on, try FIEMAP and blockdev reading, you know you
> > want to :-)
>
> Why reinvent the wheel when that's readahead()'s job? As a workaround
> I'm about to try just threading all of the calls to open().

FIEMAP suggestion is only if you think you need to issue reads for
multiple blocks in the _same_ directory in parallel. From what you say,
I doubt that's important.

FIEMAP is not relevant for reading different directories in parallel.
You'd still have to thread the FIEMAP calls for that - it's a
different problem.

> Each one will queue a read and block, but with them all doing so at
> once should fill the queue with plenty of reads. It is inefficient,
> but better than one block at a time.

That was my first suggestion: threads with readdir(); I thought it had
been rejected hence the further discussion.

(Actually I would use clone + open + getdirentries + tiny userspace
stack to avoid using tons of memory. But that's just a tweak, only to
be used if the threading is effective.)

-- Jamie

2010-04-21 20:02:48

by Jamie Lokier

[permalink] [raw]
Subject: Re: readahead on directories

Evgeniy Polyakov wrote:
> > But if it does, go on, try FIEMAP and blockdev reading, you know you
> > want to :-)
>
> Well, it requires substantial underlying fs knowledge and is not simple
> and, well, appropriate to do in some cases.

FIEMAP might not be the answer, but what part of it requires fs
knowledge? It's supposed to be fs-independent. I agree it's not
always appropriate to use, and I don't know if it would be effective
anyway.

-- Jamie

2010-04-21 20:14:06

by Phillip Susi

[permalink] [raw]
Subject: Re: readahead on directories

On 4/21/2010 4:01 PM, Jamie Lokier wrote:
> Ok, this discussion has got a bit confused. Text above refers to
> needing to asynchronously read next block in a directory, but if they
> are small then that's not important.

It is very much important since if you read each small directory one
block at a time, it is very slow. You want to queue up reads to all of
them at once so they can be batched.

> FIEMAP suggestion is only if you think you need to issue reads for
> multiple blocks in the _same_ directory in parallel. From what you say,
> I doubt that's important.

That may be why you suggested it, but it is also exactly what
readahead() does. It also queues the read asynchronously which is what
I really want so that I can queue more reads on other directories in one
big batch.

> That was my first suggestion: threads with readdir(); I thought it had
> been rejected hence the further discussion.

Yes, it was sort of rejected, which is why I said it's just a workaround
for now until readahead() works on directories. It will produce the
desired IO pattern but at the expense of ram and cpu cycles creating a
bunch of short lived threads that go to sleep almost immediately after
being created, and exit when they wake up. readahead() would be much
more efficient.

2010-04-21 20:21:41

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: readahead on directories

On Wed, Apr 21, 2010 at 09:02:43PM +0100, Jamie Lokier ([email protected]) wrote:
> FIEMAP might not be the answer, but what part of it requires fs
> knowledge? It's supposed to be fs-independent. I agree it's not
> always appropriate to use, and I don't know if it would be effective
> anyway.

At least we have to know whether a given fs supports such an interface.
And more complex is knowing how the underlying fs is organized: what an
extent is, which types it can have, where exactly the information about
extent metadata is stored, i.e. where we can find what this object is about.
And how do we actually populate the appropriate blocks into RAM to speed up
readdir()?

FIEMAP (which is a file mapper btw :) is useful for gathering information
about how the fs is organized, but that's all, I'm afraid.

--
Evgeniy Polyakov

2010-04-21 20:22:13

by Jamie Lokier

[permalink] [raw]
Subject: Re: readahead on directories

Phillip Susi wrote:
> On 4/21/2010 12:12 PM, Jamie Lokier wrote:
> > Asynchronous is available: Use clone or pthreads.
>
> Synchronous in another process is not the same as async. It seems I'm
> going to have to do this for now as a workaround, but one of the reasons
> that aio was created was to avoid the inefficiencies this introduces.
> Why create a new thread context, switch to it, put a request in the
> queue, then sleep, when you could just drop the request in the queue in
> the original thread and move on?

Because tests have found that it's sometimes faster than AIO anyway!

...for those things where AIO is supported at all. The problem with
more complicated fs operations (like, say, buffered file reads and
directory operations) is you can't just put a request in a queue.

Some of it has to be done in a context with stack and occasional
sleeping. It's just too complicated to make all filesystem operations
_entirely_ async, and that is the reason Linux AIO has never gotten
very far trying to do that.

Those things where putting a request on a queue works tend to move the
sleepable metadata fetching to the code _before_ the request is queued
to get around that. Which is one reason why Linux O_DIRECT AIO can
still block when submitting a request... :-/

The most promising direction for AIO at the moment is in fact spawning
kernel threads on demand to do the work that needs a context, and
swizzling some pointers so that, to userspace, it doesn't look like threads
were used.

Kernel threads on demand, especially magical demand at the point where
the thread would block, are faster than clone() in userspace - but not
expected to be much faster if you're reading from cold cache anyway,
with lots of blocking happening.

You might even find that calling readahead() on *files* goes a bit
faster if you have several threads working in parallel calling it,
because of the ability to parallelise metadata I/O.

> > A quick skim of fs/{ext3,ext4}/dir.c finds a call to
> > page_cache_sync_readahead. Doesn't that do any reading ahead? :-)
>
> Unfortunately it does not help when it is synchronous. The process
> still sleeps until it has fetched the blocks it needs. I believe that
> code just ends up doing a single 4kb read if the directory is no larger
> than that, or if it is, then it reads up to readahead_size. It puts the
> request in the queue then sleeps until all the data has been read, even
> if only the first 4kb was required before readdir() could return.

So you're saying it _does_ readahead_size if needed. That's great!
Evgeniy's concern about sequentially reading blocks one by one
isn't anything to care about then. That's one problem solved. :-)

> This means that a single thread calling readdir() is still going to
> block reading the directory before it can move on to trying to read
> other directories that are also needed.

Of course.

> > If not, fs/ext4/namei.c:ext4_dir_inode_operations points to
> > ext4_fiemap. So you may have luck calling FIEMAP or FIBMAP on the
> > directory, and then reading blocks using the block device. I'm not
> > sure if the cache loaded via the block device (when mounted) will then
> > be used for directory lookups.
>
> Yes, I had considered that. ureadahead already makes use of ext2fslibs
> to open the block device and read the inode tables so they are already
> in the cache for later use. It seems a bit silly to do that though,
> when that is exactly what readahead() SHOULD do for you.

Don't bother with FIEMAP then. It sounds like all the preloadable
metadata is already loaded. FIEMAP would have still needed to be
threaded for parallel directories.

Filesystem-independent readahead() on directories is out of the
question (except by using a kernel background thread, which is
pointless because you can do that yourself.)

Some filesystems have directories which aren't stored like a file's
data, and the process of reading the directory needs to work through
its logic, and needs a sleepable context to work in. Generic page
reading won't work for all of them.

readahead() on directories in specific filesystem types may be possible.
It would have to be implemented in each fs.

-- Jamie

2010-04-21 20:37:25

by Jamie Lokier

[permalink] [raw]
Subject: Re: readahead on directories

Phillip Susi wrote:
> On 4/21/2010 4:01 PM, Jamie Lokier wrote:
> > Ok, this discussion has got a bit confused. Text above refers to
> > needing to asynchronously read next block in a directory, but if they
> > are small then that's not important.
>
> It is very much important since if you ready each small directory one
> block at a time, it is very slow. You want to queue up reads to all of
> them at once so they can be batched.

I don't understand what you are saying at this point. Or you don't
understand what I'm saying. Or I didn't understand what Evgeniy was
saying :-)

Small directories don't _have_ next blocks; this is not a problem for
them. And you've explained that filesystems of interest already fetch
readahead_size in larger directories, so they don't have the "next
block" problem either.

> > That was my first suggestion: threads with readdir(); I thought it had
> > been rejected hence the further discussion.
>
> Yes, it was sort of rejected, which is why I said it's just a workaround
> for now until readahead() works on directories. It will produce the
> desired IO pattern but at the expense of ram and cpu cycles creating a
> bunch of short lived threads that go to sleep almost immediately after
> being created, and exit when they wake up. readahead() would be much
> more efficient.

Some test results comparing AIO with kernel threads indicate that
threads are more efficient than you might expect for this. Especially
in the cold I/O cache cases. readahead() has to do a lot of the same
work, in a different way and with less opportunity to parallelise the
metadata stage.

clone() threads with tiny stacks (you can even preallocate the stacks,
and they can be smaller than a page) aren't especially slow or big,
and ideally you'll use *long-lived* threads with an efficient
multi-consumer queue that they pull requests from, written to by the
main program and kept full enough to avoid blocking the threads.

Also since you're discarding the getdirentries() data, you can read
all of it into the same memory for hot cache goodness. (One per CPU
please.)
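
A sketch of that worker-pool shape - the queue here is pre-filled and the
workers simply exit when it runs dry; a real version would block on a
condition variable so the main program can keep it topped up, as described
above (link with -lpthread):

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>

#define NWORKERS 4

static char *queue[1024];       /* directory paths, filled by the caller */
static int qhead, qtail;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

void enqueue(char *path)
{
    pthread_mutex_lock(&qlock);
    if (qtail < 1024)
        queue[qtail++] = path;
    pthread_mutex_unlock(&qlock);
}

static char *next_path(void)
{
    char *p = NULL;

    pthread_mutex_lock(&qlock);
    if (qhead < qtail)
        p = queue[qhead++];
    pthread_mutex_unlock(&qlock);
    return p;
}

static void *worker(void *arg)
{
    char buf[64 * 1024];        /* reused for every directory; data discarded */
    char *path;

    (void)arg;
    while ((path = next_path()) != NULL) {
        int fd = open(path, O_RDONLY | O_DIRECTORY);

        if (fd < 0)
            continue;
        while (syscall(SYS_getdents64, fd, buf, sizeof(buf)) > 0)
            ;
        close(fd);
    }
    return NULL;
}

void run_pool(void)
{
    pthread_t tid[NWORKERS];
    int i;

    for (i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
}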

I don't know what performance that'll get you, but I think it'll be
faster than you are expecting - *if* the directory locking is
sufficiently scalable at this point. That's an unknown.

Try it with files if you want to get a comparative picture.

-- Jamie

2010-04-21 20:39:40

by Jamie Lokier

[permalink] [raw]
Subject: Re: readahead on directories

Evgeniy Polyakov wrote:
> On Wed, Apr 21, 2010 at 09:02:43PM +0100, Jamie Lokier ([email protected]) wrote:
> > FIEMAP might not be the answer, but what part of it requires fs
> > knowledge? It's supposed to be fs-independent. I agree it's not
> > always appropriate to use, and I don't know if it would be effective
> > anyway.
>
> At least we have to know whether given fs supports such interface.
> And more complex is to know how underlying fs is organized. What is
> extent, which types can it have, where exactly information about extent
> metadata is stored, i.e. where can we find what this object is about?

Ummm... Does any of that matter?

> And how to actually populate appropriate blocks into ram to speedup
> readdir()?

Blockdev readahead() :-)

> FIEMAP (which is file mapper btw :) is useful for information gathering
> about how fs is organized, but that's all I'm afraid.

That's all you need to start fetching from the blockdev. You can't
*use* the blockdev data, but that doesn't matter for this readahead
operation, only that they are approximately the right data blocks.

- Jamie

2010-04-21 20:59:34

by Phillip Susi

[permalink] [raw]
Subject: Re: readahead on directories

On 4/21/2010 4:22 PM, Jamie Lokier wrote:
> Because tests have found that it's sometimes faster than AIO anyway!

Not when the aio is working properly ;)

This is getting a bit off topic, but aio_read() and readahead() have to
map the disk blocks before they can queue a read. In the case of ext2/3
this often requires reading an indirect block from the disk so the
kernel has to wait for that read to finish before it can queue the rest
of the reads and return. With ext4 extents, usually all of the mapping
information is in the inode so all of the reads can be queued without
delay, and the kernel returns to user space immediately.

So older testing done on ext3 likely ran into this and led to the
conclusion that threading can be faster, but it would be preferable when
using ext4 with extents to drop the read requests in the queue without
the bother of setting up and tearing down threads, which is really just
a workaround for a shortcoming in aio_read and readahead() when using
indirect blocks. For that matter aio_read and readahead() could
probably benefit from some reworking to fix this so that they can return
as soon as they have queued the read of the indirect block, and queueing
the remaining reads can be deferred until the indirect block comes in.

> ...for those things where AIO is supported at all. The problem with
> more complicated fs operations (like, say, buffered file reads and
> directory operations) is you can't just put a request in a queue.

Unfortunately there aren't async versions of the calls that do
directory operations, but aio_read() performs a buffered file read
asynchronously just fine. Right now though I'm only concerned with
reading lots of data into the cache at boot time to speed things up.
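
For what it's worth, a minimal buffered aio_read() example - note this is the
POSIX aio_* interface, which glibc implements with helper threads, not the
kernel io_submit interface the O_DIRECT folklore is about (link with -lrt;
the path and sizes are arbitrary):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buf[128 * 1024];
    struct aiocb cb;
    int fd = open("/etc/services", O_RDONLY);

    if (fd < 0)
        return 1;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) < 0)          /* queue the read and return at once */
        return 1;

    /* ... go do other work here ... */

    while (aio_error(&cb) == EINPROGRESS)
        usleep(1000);               /* or aio_suspend() to wait properly */
    printf("read %zd bytes\n", aio_return(&cb));
    close(fd);
    return 0;
}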

> Those things where putting a request on a queue works tend to move the
> sleepable metadata fetching to the code _before_ the request is queued
> to get around that. Which is one reason why Linux O_DIRECT AIO can
> still block when submitting a request... :-/

Yep, as I just described. Would be nice to fix this.

> The most promising direction for AIO at the moment is in fact spawning
> kernel threads on demand to do the work that needs a context, and
> swizzling some pointers so that it doesn't look like threads were used
> to userspace.

NO! This is how aio was implemented at first and it was terrible.
Context is only required because it is easier to write the code linearly
instead of as a state machine. It would be better for example, to have
readahead() register a callback function to be called when the read of
the indirect block completes, and the callback needs zero context to
queue reads of the data blocks referred to by the indirect block.

> You might even find that calling readahead() on *files* goes a bit
> faster if you have several threads working in parallel calling it,
> because of the ability to parallelise metadata I/O.

Indeed... or you can use extents, or fix the implementation of
readahead() ;)

> So you're saying it _does_ readahead_size if needed. That's great!

I'm not sure, I'm just saying that if it does, it does not help much
since most directories fit in a single 4kb block anyhow. I need to get
a number of different directories read quickly.

> Filesystem-independent readahead() on directories is out of the
> question (except by using a kernel background thread, which is
> pointless because you can do that yourself.)

No need for a thread. readahead() does not need one for files, reading
the contents of a directory should be no different.

> Some filesystems have directories which aren't stored like a file's
> data, and the process of reading the directory needs to work through
> its logic, and needs a sleepable context to work in. Generic page
> reading won't work for all of them.

If the fs absolutely has to block that's ok, since that is no different
from the way readahead() works on files, but most of the time it
shouldn't have to and should be able to throw the read in the queue and
return.

2010-04-21 22:06:14

by Jamie Lokier

[permalink] [raw]
Subject: Re: readahead on directories

Phillip Susi wrote:
> > ...for those things where AIO is supported at all. The problem with
> > more complicated fs operations (like, say, buffered file reads and
> > directory operations) is you can't just put a request in a queue.
>
> Unfortunately there aren't async versions of the calls that make
> directory operations, but aio_read() performs a buffered file read
> asynchronously just fine.

Why am I reading all over the place that Linux AIO only works with O_DIRECT?
Is it out of date? :-)

I admit I haven't even _tried_ buffered files with Linux AIO due to
the evil propaganda.

> > The most promising direction for AIO at the moment is in fact spawning
> > kernel threads on demand to do the work that needs a context, and
> > swizzling some pointers so that it doesn't look like threads were used
> > to userspace.
>
> NO! This is how aio was implemented at first and it was terrible.
> Context is only required because it is easier to write the code linearly
> instead of as a state machine. It would be better for example, to have
> readahead() register a callback function to be called when the read of
> the indirect block completes, and the callback needs zero context to
> queue reads of the data blocks referred to by the indirect block.

To read an indirect block, you have to allocate memory: another
callback after you've slept waiting for memory to be freed up.

Then you allocate a request: another callback while you wait for the
request queue to drain.

Then you submit the request: that's the callback you mentioned,
waiting for the result.

But then triple, double, single indirect blocks: each of the above
steps repeated.

In the case of writing, another group of steps for bitmap blocks,
inode updates, and heaven knows how fiddly it gets with ordered
updates to the journal, synchronised with other writes.

Plus every little mutex / rwlock is another place where you need those
callback functions. We don't even _have_ an async mutex facility in
the kernel. So every user of a mutex has to be changed to use
waitqueues or something. No more lockdep checking, no more RT
priority inheritance.

There are a _lot_ of places that can sleep on the way to a trivial
file I/O, and quite a lot of state to be passed along to the continuation
functions.

It's possible but by no means obvious that it's better.

I think people have mostly given up on that approach due to how
much it complicates all the filesystem code, and how much goodness
there is in being able to call things which can sleep when you look at
all the different places. It seemed like a good idea for a while.

And it's not _that_ certain that it would be faster at high
loads after all the work.

A compromise where just a few synchronisation points are made async is
ok. But then it's a compromise... so you still need a multi-threaded
caller to keep the queues full in all situations.

> > Filesystem-independent readahead() on directories is out of the
> > question (except by using a kernel background thread, which is
> > pointless because you can do that yourself.)
>
> No need for a thread. readahead() does not need one for files, reading
> the contents of a directory should be no different.
>
> > Some filesystems have directories which aren't stored like a file's
> > data, and the process of reading the directory needs to work through
> > its logic, and needs a sleepable context to work in. Generic page
> > reading won't work for all of them.
>
> If the fs absolutely has to block that's ok, since that is no different
> from the way readahead() works on files, but most of the time it
> shouldn't have to and should be able to throw the read in the queue and
> return.

For specific filesystems, you could do it. readahead() on directories
is not an unreasonable thing to add on.

Generically is not likely. It's not about blocking, it's about the
fact that directories don't always consist of data blocks on the store
organised similarly to a file. For example NFS, CIFS, or (I'm not
sure), maybe even reiserfs/btrfs?

-- Jamie

2010-04-22 07:02:12

by Brad Boyer

[permalink] [raw]
Subject: Re: readahead on directories

On Wed, Apr 21, 2010 at 11:06:12PM +0100, Jamie Lokier wrote:
> For specific filesystems, you could do it. readahead() on directories
> is not an unreasonable thing to add on.
>
> Generically is not likely. It's not about blocking, it's about the
> fact that directories don't always consist of data blocks on the store
> organised similarly to a file. For example NFS, CIFS, or (I'm not
> sure), maybe even reiserfs/btrfs?

Some non-UNIX file systems don't have anything that looks like a
directory either. Just as an example, HFS and HFS+ both have a single
catalog file for the whole file system. The directory listing method
involves walking the tree from this file and picking a few fields out
of each record matching the appropriate parent directory. This would
make it hard to do something generic, although it would be possible
to readahead some range of blocks of the catalog and produce a
similar effect. This would really need to be FS specific, and the
current readahead implementation is mostly common code.

Brad Boyer
[email protected]

2010-04-22 14:26:43

by Phillip Susi

[permalink] [raw]
Subject: Re: readahead on directories

On 4/21/2010 6:06 PM, Jamie Lokier wrote:
> Why am I reading all over the place that Linux AIO only works with O_DIRECT?
> Is it out of date? :-)

Dunno, where did you read that? If you are using O_DIRECT then you
really should be using aio or you will suffer a pretty heavy performance
loss from all of the sleeping, but strictly speaking the two do not have
to be used together.

Personally I wish there was another flag besides O_DIRECT that split the
two semantics O_DIRECT now carries. Right now it FORCES the cache to be
bypassed and the IO to go to the disk, even if it's already in the
cache. It would be nice if you could ask for a read to be done such that IF
it's already cached, it is copied from there; otherwise, the read is sent
direct down to the disk to DMA into my buffer.

> To read an indirect block, you have to allocate memory: another
> callback after you've slept waiting for memory to be freed up.

You allocate the cache pages in the initial readahead() before
returning. No need to do it from the bio completion callback.

> Then you allocate a request: another callback while you wait for the
> request queue to drain.

Same thing. Get everything set up and ready to go in readahead() and
the only thing that has to wait on the indirect block to be read is
filling in the block addresses of the bios and submitting them. This
last part can be done in the bio completion callback.

As an added optimization, you only need to allocate one bio in
readahead() since it is likely that only one will be needed if all of
the blocks are sequential. Then the callback can use the gfp_mask flags
to prevent allocations from sleeping, and if more cannot be allocated,
then you submit what you've got and when THAT completes, you try to
build more requests.

> Plus every little mutex / rwlock is another place where you need those
> callback functions. We don't even _have_ an async mutex facility in
> the kernel. So every user of a mutex has to be changed to use
> waitqueues or something. No more lockdep checking, no more RT
> priority inheritance.

Yes, it looks like ext4_get_blocks() does use mutexes so it can't be
called from bh context. Perhaps it could be changed to avoid this if
possible and if it must, return -EWOULDBLOCK and the completion callback
would have to punt to a work queue to retry. In the common case though,
it looks like it would be possible for ext4_get_blocks() to avoid using
mutexes and just parse the newly read indirect block and return, then
the completion callback can queue its bios and be done.

> A compromise where just a few synchronisation points are made async is
> ok. But then it's a compromise... so you still need a multi-threaded
> caller to keep the queues full in all situations.

Right, which tends to negate most of the gains of having any async at
all. For example, if we have multiple threads calling readahead()
instead of just one since it may sleep reading an indirect block, then
we can end up with this:

Thread 1 queues reads of the first 12 blocks of the first file, and the
indirect block. Thread then sleeps waiting for the indirect block.

Thread 2 queues reads of the first 12 blocks of the second file and its
indirect block. Thread then sleeps waiting for the indirect block.

Now we have the disk read 12 contiguous blocks of data + indirect from
the first file, then 12 contiguous blocks of data + indirect from the
second file, which are further down the disk, so the head has to seek
forward. Then thread 1 wakes up and parses the indirect block and
queues reading of the subsequent sectors, which now requires a backwards
seek since we skipped reading those sectors to move ahead to the second
file.

So in our attempt to use threads to keep the queue full, we have
introduced more seeking, which tends to have a higher penalty than just
using a single thread and having the queue drain and the disk idle for a
few ns while we wake up and parse the indirect block.

Come to think of it, I guess that is a good argument NOT to make
readahead() fully async.

> Generically is not likely. It's not about blocking, it's about the
> fact that directories don't always consist of data blocks on the store
> organised similarly to a file. For example NFS, CIFS, or (I'm not
> sure), maybe even reiserfs/btrfs?

The contents are stored in blocks somewhere. It doesn't really matter
how or where as long as the fs figures out what it will need to resolve
names in that directory and reads that into the cache.

2010-04-22 17:53:25

by Jamie Lokier

[permalink] [raw]
Subject: Re: readahead on directories

Phillip Susi wrote:
> On 4/21/2010 6:06 PM, Jamie Lokier wrote:
> > Generically is not likely. It's not about blocking, it's about the
> > fact that directories don't always consist of data blocks on the store
> > organised similarly to a file. For example NFS, CIFS, or (I'm not
> > sure), maybe even reiserfs/btrfs?
>
> The contents are stored in blocks somewhere. It doesn't really matter
> how or where as long as the fs figures out what it will need to resolve
> names in that directory and reads that into the cache.

Right, but finding those blocks is highly filesystem-dependent which
is why making it a generic feature would need support in each filesystem.

However, there could be generic readahead() support in the VFS for
filesystems with the right block-getting hook. All those which
support FIEMAP on directories should work. We're back to why not do
it yourself then, as very few programs need directory readahead.

> > Why am I reading all over the place that Linux AIO only works with O_DIRECT?
> > Is it out of date? :-)
>
> Dunno, where did you read that? If you are using O_DIRECT then you
> really should be using aio or you will suffer a pretty heavy performance
> loss from all of the sleeping, but strictly speaking the two do not have
> to be used together.

Linux AIO is fickle: Whether it's _really_ async depends on the
filesystem type, kernel version, distro-specific patches, and whether
you used O_DIRECT or not. Unfortunately there is no API to find out.
Even when really async, it's not guaranteed to never block.

So it's a lot like the "compromise" readahead we've discussed: Mostly
async (if you've checked filesystem type etc.), but not fully async.

> As an added optimization, you only need to allocate one bio in
> readahead() since it is likely that only one will be needed if all of
> the blocks are sequential. Then the callback can use the gfp_mask flags
> to prevent allocations from sleeping and if more can not be allocated,
> then you sumbit what you've got and when THAT completes, you try to
> build more requests.

True, you can use gfp_mask flags for allocations and stop readahead
where it fails. Probably a good plan. But you can't use it for, say,
taking mutexes.

> > Plus every little mutex / rwlock is another place where you need those
> > callback functions. We don't even _have_ an async mutex facility in
> > the kernel. So every user of a mutex has to be changed to use
> > waitqueues or something. No more lockdep checking, no more RT
> > priority inheritance.
>
> Yes, it looks like ext4_get_blocks() does use mutexes so it can't be
> called from bh context. Perhaps it could be changed to avoid this if
> possible and if it must, return -EWOULDBLOCK and the completion callback
> would have to punt to a work queue to retry. In the common case though,
> it looks like it would be possible for ext4_get_blocks() to avoid using
> mutexes and just parse the newly read indirect block and return, then
> the completion callback can queue its bios and be done.

If you're interested, try finding all the places which could sleep for
a write() call... Note that POSIX requires a mutex for write; you
can't easily change that. Reading is easier to make fully async than
writing.

> > To read an indirect block, you have to allocate memory: another
> > callback after you've slept waiting for memory to be freed up.
>
> You allocate the cache pages in the initial readahead() before
> returning. No need to do it from the bio completion callback.

Then readahead() isn't async, which was your request... It can block
waiting for memory and other things when you call it.

So that would be the "compromise" version which you complain about
below. So I don't know why you're suggesting it :-)

> > A compromise where just a few synchronisation points are made async is
> > ok. But then it's a compromise... so you still need a multi-threaded
> > caller to keep the queues full in all situations.
>
> Right, which tends to negate most of the gains of having any async at all.

Exactly. And making it so it _never_ blocks when called is a ton of
work, more lines of code (in C anyway), a maintainability nightmare,
and adds some different bottlenecks you've not thought of. At this
point I suggest you look up the 2007 discussions about fibrils which
are quite good: They cover the overheads of setting up state for async
calls when unnecessary, and the beautiful simplicity of treating stack
frames as states in their own right.

> For example, if we have multiple threads calling readahead()
> instead of just one since it may sleep reading an indirect block, then
> we can end up with this:
>
> Thread 1 queues reads of the first 12 blocks of the first file, and the
> indirect block. Thread then sleeps waiting for the indirect block.
>
> Thread 2 queues reads of the first 12 blocks of the second file and its
> indirect block. Thread then sleeps waiting for the indirect block.
>
> Now we have the disk read 12 contiguous blocks of data + indirect from
> the first file, then 12 contiguous blocks of data + indirect from the
> second file, which are further down the disk, so the head has to seek
> forward. Then thread 1 wakes up and parses the indirect block and
> queues reading of the subsequent sectors, which now requires a backwards
> seek since we skipped reading those sectors to move ahead to the second
> file.
>
> So in our attempt to use threads to keep the queue full, we have
> introduced more seeking, which tends to have a higher penalty than just
> using a single thread and having the queue drain and the disk idle for a
> few ns while we wake up and parse the indirect block.
>
> Come to think of it, I guess that is a good argument NOT to make
> readahead() fully async.

No: In that particular case, waiting while the indirect block is
parsed is advantageous. But suppose the first indirect block is
located close to the second file's data blocks. Or the second file's
data blocks are on a different MD backing disk. Or the disk has
different seeking characteristics (flash, DRBD).

Then the I/O scheduler _should_ overlap the reads from both files.
The information to decide that is dependent on filesystem layout (and
block device layout if MD). Forcing the order at the generic
readahead() level would be suboptimal, even if it sort of works out in
common situations.

Going a bit off-topic (the 'd' key if bored...)

What you've almost identified ;-) is a theoretical scheduling
advantage of AIO _sometimes_: If the original I/O submission order has
good properties _and_ the I/O queue is close to empty preventing
sorting, the hints implicit in the submission order are better
preserved with AIO.

For example: If you use O_DIRECT with AIO to read a file in
block-sequential order, the I/Os reach the disk in that order. If you
implement a queue in userspace serviced by multiple threads to overlap
synchronous I/O, there will be _slight_ timing randomness which
shuffles the order. It doesn't matter when the I/O queue has enough
entries, but it'll make a difference when starting up in that example
- but even then only rarely - most times even the threads will end up
queuing things in the preferred order.

But if your AIO submission order doesn't have any particularly special
properties anyway, or if you can submit enough in parallel that the
kernel can sort things, or if the randomness from threads is small
enough that it doesn't matter... that AIO benefit is lost entirely.

With Linux AIO, because submission can block unpredictably, I'm
guessing the sweet spot is probably to use a small number of threads -
number not easy to determine, and dependent on many factors.

I reckon the same applies to your readahead() calls: A queue which you
make sure is always full enough that threads never block, sorted by
inode number or better hints where known, with a small number of
threads calling readahead() for files, and doing whatever is useful
for directories.

It is most unfortunate that such ideas are rather complicated to
implement well, just to try them out :-)

Fwiw, my view of this is after writing userspace programs using a very
async-callback sort of structure and trying to make the most of
O_DIRECT + AIO (i.e. I'm not thinking like an "I love threads and
don't understand the point of state machines" sort of person), only to
read the fibril discussion and agree that the same issues occur in
userspace: There is a balance between thread-like scheduling and
event-like scheduling: Certain complex things are better using
coroutine-like call stacks, using explicit state callbacks only at
particular well-defined sleeping points.

In some ways (zoom way, way out of the picture...) the balance between
implicit state via threads and explicit state ("event-driven...") is
more fundamentally a continuum than it appears close up. Both for
simplicity of code, and performance. Neither extreme is optimal in general.

-- Jamie

2010-04-22 19:24:03

by Phillip Susi

[permalink] [raw]
Subject: Re: readahead on directories

On 4/22/2010 1:53 PM, Jamie Lokier wrote:
> Right, but finding those blocks is highly filesystem-dependent which
> is why making it a generic feature would need support in each filesystem.

It already exists, it's called ->get_blocks(). That's how readahead()
figures out which blocks need to be read.

> support FIEMAP on directories should work. We're back to why not do
> it yourself then, as very few programs need directory readahead.

Because there's already a system call to accomplish that exact task; why
reinvent the wheel?

> If you're interested, try finding all the places which could sleep for
> a write() call... Note that POSIX requires a mutex for write; you
> can't easily change that. Reading is easier to make fully async than
> writing.

POSIX doesn't say anything about how write() must be implemented
internally. You can do without mutexes just fine. A good deal of the
current code does use mutexes, but does not have to. If your data is
organized well then the critical sections of code that modify it can be
kept very small, and guarded with either atomic access functions or a
spin lock. A mutex is more convenient since it allows you to have
much larger critical sections and sleep, but we don't really like having
coarse grained locking in the kernel.

> Then readahead() isn't async, which was your request... It can block
> waiting for memory and other things when you call it.

It doesn't have to block; it can return -ENOMEM or -EWOULDBLOCK.

> Exactly. And making it so it _never_ blocks when called is a ton of
> work, more lines of code (in C anyway), a maintainability nightmare,
> and adds some different bottlenecks you've not thought off. At this
> point I suggest you look up the 2007 discussions about fibrils which
> are quite good: They cover the overheads of setting up state for async
> calls when unnecessary, and the beautiful simplicty of treating stack
> frames as states in their own right.

Sounds like an interesting compromise. I'll look it up.

> No: In that particular case, waiting while the indirect block is
> parsed is advantageous. But suppose the first indirect block is
> located close to the second file's data blocks. Or the second file's
> data blocks are on a different MD backing disk. Or the disk has
> different seeking characteristics (flash, DRBD).

Hrm... true, so knowing this, defrag could lay out the indirect block of
the first file after the first 12 blocks of the second file to maintain
optimal reading. Hrm... I might have to try that.

> I reckon the same applies to your readahead() calls: A queue which you
> make sure is always full enough that threads never block, sorted by
> inode number or better hints where known, with a small number of
> threads calling readahead() for files, and doing whatever is useful
> for directories.

Yes, and ureadahead already orders the calls to readahead() based on
disk block order. Multithreading it leads to the problem with backward
seeks right now, but a tweak to the way defrag lays out the indirect
blocks should fix that. The more I think about it the better this idea
sounds.

2010-04-22 20:36:07

by Jamie Lokier

[permalink] [raw]
Subject: Re: readahead on directories

Phillip Susi wrote:
> > If you're interested, try finding all the places which could sleep for
> > a write() call... Note that POSIX requires a mutex for write; you
> > can't easily change that. Reading is easier to make fully async than
> > writing.
>
> POSIX doesn't say anything about how write() must be implemented
> internally. You can do without mutexes just fine.

POSIX requires concurrent, overlapping writes don't interleave the
data (at least, I have read that numerous times), which is usually
implemented with a mutex even though there are other ways.

Many implementations relax this for O_DIRECT, because it's non-POSIX
and concurrent writes are good.

> A good deal of the current code does use mutexes, but does not have
> to. If your data is organized well then the critical sections of
> code that modify it can be kept very small, and guarded with either
> atomic access functions or a spin lock. A mutex is more convenient
> since it it allows you to have much larger critical sections and
> sleep, but we don't really like having coarse grained locking in the
> kernel.

The trickier stuff in proper AIO is sleeping waiting for memory to be
freed up, sleeping waiting for a rate-limited request queue entry
repeatedly, prior to each of the triple, double, single indirect
blocks, which you then sleep waiting to complete, sleeping waiting for
an atime update journal node, sleeping on requests and I/O on every
step through b-trees, etc... That's just reads; writing adds just as
much again. Changing those to async callbacks in every
filesystem - it's not worth it and it'd be a maintainability
nightmare. We're talking about changes to the kernel
memory allocator among other things. You can't gfp_mask it away -
except for readahead() because it's an abortable hint.

Oh, and fine-grained locking makes the async transformation harder,
not easier :-)

> > Then readahead() isn't async, which was your request... It can block
> > waiting for memory and other things when you call it.
>
> It doesn't have to block; it can return -ENOMEM or -EWOULDBLOCK.

For readahead yes because it's just an abortable hint.
For general AIO, no.

> > No: In that particular case, waiting while the indirect block is
> > parsed is advantageous. But suppose the first indirect block is
> > located close to the second file's data blocks. Or the second file's
> > data blocks are on a different MD backing disk. Or the disk has
> > different seeking characteristics (flash, DRBD).
>
> Hrm... true, so knowing this, defrag could lay out the indirect block of
> the first file after the first 12 blocks of the second file to maintain
> optimal reading. Hrm... I might have to try that.
>
> > I reckon the same applies to your readahead() calls: A queue which you
> > make sure is always full enough that threads never block, sorted by
> > inode number or better hints where known, with a small number of
> > threads calling readahead() for files, and doing whatever is useful
> > for directories.
>
> Yes, and ureadahead already orders the calls to readahead() based on
> disk block order. Multithreading it leads to the problem with backward
> seeks right now, but a tweak to the way defrag lays out the indirect
> blocks should fix that. The more I think about it, the better this idea
> sounds.

Ah, you didn't mention defragging for optimising readahead before.

In that case, just trace the I/O done a few times and order your
defrag to match the trace, it should handle consistent patterns
without special defrag rules. I'm surprised it doesn't already.
Does ureadahead not use prior I/O traces for guidance?

Also, having defragged readahead files into a few compact zones, and
gotten the last boot's I/O trace, why not readahead those areas of the
blockdev first in perfect order, before finishing the job with
filesystem operations? The redundancy from no-longer needed blocks is
probably small compared with the gain from perfect order in few big
zones, and if you store the I/O trace of the filesystem stage every
time to use for the block stage next time, the redundancy should stay low.

Just a few ideas.

-- Jamie

2010-04-22 21:23:31

by Phillip Susi

[permalink] [raw]
Subject: Re: readahead on directories

On 4/22/2010 4:35 PM, Jamie Lokier wrote:
> POSIX requires concurrent, overlapping writes don't interleave the
> data (at least, I have read that numerous times), which is usually
> implemented with a mutex even though there are other ways.

I think what you are getting at here is that write() needs to atomically
update the file pointer, which does not need a mutex.
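
A minimal sketch of that idea, using a compiler atomic rather than any
real VFS code; reserve_pos() is a made-up name, and a short write would
still need the position fixed up afterwards.

#include <stdint.h>

static uint64_t f_pos;          /* shared position, e.g. one per open file */

static uint64_t reserve_pos(uint64_t len)
{
        /* Returns the offset this write starts at and advances the
         * position past it, without taking any lock. */
        return __sync_fetch_and_add(&f_pos, len);
}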

> The trickier stuff in proper AIO is sleeping waiting for memory to be
> freed up, sleeping waiting for a rate-limited request queue entry
> repeatedly, prior to each of the triple, double, single indirect
> blocks, which you then sleep waiting to complete, sleeping waiting for
> an atime update journal node, sleeping on requests and I/O on every

There's no reason to wait for updating the atime, and I already said if
there isn't enough memory then you just return -EAGAIN or -ENOMEM
instead of waiting. Whether it's reading indirect blocks or b-trees
doesn't make much difference; the fs ->get_blocks() tries not to sleep
if possible, and if it must, returns -EAGAIN and the calling code can
punt to a work queue to try again in a context that can sleep.
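
A hedged sketch of that pattern; my_get_block() and
my_get_block_nonblock() are hypothetical hooks, not existing kernel
interfaces, and error handling and freeing are omitted.

#include <linux/buffer_head.h>
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

/* Hypothetical fs hooks (not real kernel interfaces): */
int my_get_block(struct inode *, sector_t, struct buffer_head *, int create);
int my_get_block_nonblock(struct inode *, sector_t,
                          struct buffer_head *, int create);

struct ra_request {
        struct work_struct work;
        struct inode *inode;
        sector_t iblock;
};

static void ra_retry_blocking(struct work_struct *work)
{
        struct ra_request *req = container_of(work, struct ra_request, work);
        struct buffer_head bh = { .b_size = req->inode->i_sb->s_blocksize };

        /* Sleeping is fine here, so map the block the ordinary way ... */
        my_get_block(req->inode, req->iblock, &bh, 0);
        /* ... then submit the read for bh and free req (omitted). */
}

static int ra_map_block(struct ra_request *req)
{
        struct buffer_head bh = { .b_size = req->inode->i_sb->s_blocksize };
        int err;

        err = my_get_block_nonblock(req->inode, req->iblock, &bh, 0);
        if (err != -EAGAIN)
                return err;     /* mapped (or failed) without sleeping */

        /* Would have to sleep: punt to a context that can. */
        INIT_WORK(&req->work, ra_retry_blocking);
        schedule_work(&req->work);
        return 0;
}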

> step through b-trees, etc... That's just reads; writing adds just as
> much again. Changing those to async callbacks in every
> filesystem - it's not worth it and it'd be a maintainability
> nightmare. We're talking about changes to the kernel
> memory allocator among other things. You can't gfp_mask it away -
> except for readahead() because it's an abortable hint.

The fs specific code just needs to support a flag like gfp_mask so it
can be told we aren't in a context that can sleep; do your best and if
you must block, return -EAGAIN. It looks like it almost already does
something like that based on this comment from fs/mpage.c:

* We pass a buffer_head back and forth and use its buffer_mapped() flag to
* represent the validity of its disk mapping and to decide when to do the
* next get_block() call.
*/

If it fixes up a buffer_head for the blocks it needs to finish and
returns, then do_mpage_readpage() could queue those reads with a
completion routine that calls get_block() again once the data has been
read; when get_block() finally maps the blocks, it can queue the reads
for those blocks.
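
A hedged sketch of that completion-driven shape; every helper here is
hypothetical, and only the control flow is the point: each time a
missing metadata block arrives, retry the mapping, and once it succeeds,
queue the data read.

#include <linux/buffer_head.h>
#include <linux/fs.h>

struct map_state {
        struct inode *inode;
        sector_t iblock;                /* file block we want mapped */
        struct buffer_head bh;          /* reused across retries */
};

/* Hypothetical helpers: */
void submit_metadata_read(struct buffer_head *bh,
                          void (*done)(struct map_state *),
                          struct map_state *st);
void submit_data_read(struct inode *inode, struct buffer_head *bh);
int my_get_block_nonblock(struct inode *, sector_t,
                          struct buffer_head *, int create);

static void mapping_step(struct map_state *st)
{
        st->bh.b_size = st->inode->i_sb->s_blocksize;

        if (my_get_block_nonblock(st->inode, st->iblock, &st->bh, 0) == -EAGAIN) {
                /* The buffer_head now names the indirect block we are
                 * missing: read it and come back here when it arrives. */
                submit_metadata_read(&st->bh, mapping_step, st);
                return;
        }
        /* Mapping is complete: queue the actual data read. */
        submit_data_read(st->inode, &st->bh);
}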

> Oh, and fine-grained locking makes the async transformation harder,
> not easier :-)

How so? With fine grained locking you can avoid the use of mutexes and
opt for atomic functions or spin locks, so no need to sleep.

> For readahead yes because it's just an abortable hint.
> For general AIO, no.

Why not? aio_read() is perfectly allowed to fail if there is not enough
memory to satisfy the request.

> Ah, you didn't mention defragging for optimising readahead before.
>
> In that case, just trace the I/O done a few times and order your
> defrag to match the trace, it should handle consistent patterns
> without special defrag rules. I'm surprised it doesn't already.
> Does ureadahead not use prior I/O traces for guidance?

Yes, it traces the I/O, then on the next boot calls readahead() on the
files that were read during the trace, after sorting them by on-disk
block location. I've been trying to improve things by having defrag
pack those files tightly at the start of the disk, and have run into the
problem with the indirect blocks and the open() calls blocking because
the directories have not been read yet; hence my desire to readahead()
on the directories.
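
A simplified userspace stand-in for that ordering step (not
ureadahead's actual code): look up each file's first data block with
FIBMAP, which needs root, sort by that, then open and readahead() each
file in disk order.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>           /* FIBMAP */
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

struct entry { const char *path; int first_block; };

static int by_block(const void *a, const void *b)
{
        const struct entry *x = a, *y = b;
        return (x->first_block > y->first_block) -
               (x->first_block < y->first_block);
}

int main(int argc, char **argv)
{
        struct entry *e = calloc(argc, sizeof(*e));
        int i, n = 0;

        for (i = 1; i < argc; i++) {
                int fd = open(argv[i], O_RDONLY);
                int blk = 0;    /* logical block 0 -> physical block */

                if (fd < 0)
                        continue;
                if (ioctl(fd, FIBMAP, &blk) == 0) {
                        e[n].path = argv[i];
                        e[n].first_block = blk;
                        n++;
                }
                close(fd);
        }
        qsort(e, n, sizeof(*e), by_block);

        for (i = 0; i < n; i++) {
                struct stat st;
                int fd = open(e[i].path, O_RDONLY); /* may block on dir blocks */

                if (fd >= 0 && fstat(fd, &st) == 0)
                        readahead(fd, 0, st.st_size);
                if (fd >= 0)
                        close(fd);
        }
        free(e);
        return 0;
}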

Right now defrag lays down the indirect block immediately after the 12
direct blocks, which makes the most sense if you are just reading that
one file. Threading the readahead() calls and moving the indirect block
to after the next file's direct blocks would make ureadahead faster, at
the expense of any one single file read. Probably a good tradeoff that
I will have to try.

That still leaves the problem of all the open() calls blocking to read
one disk directory block at a time, since ureadahead opens all of the
files first, then calls readahead() on each of them. This is where it
would really help to be able to readahead() the directories first, then
try to open all of the files.

> Also, having defragged readahead files into a few compact zones, and
> gotten the last boot's I/O trace, why not readahead those areas of the
> blockdev first in perfect order, before finishing the job with
> filesystem operations? The redundancy from no-longer needed blocks is
> probably small compared with the gain from perfect order in few big
> zones, and if you store the I/O trace of the filesystem stage every
> time to use for the block stage next time, the redundancy should stay low.

Good point, though I was hoping to be able to accomplish effectively the
same thing purely with readahead() and other filesystem calls instead of
going direct to the block device.

2010-04-22 22:43:30

by Jamie Lokier

[permalink] [raw]
Subject: Re: readahead on directories

Phillip Susi wrote:
> On 4/22/2010 4:35 PM, Jamie Lokier wrote:
> > POSIX requires concurrent, overlapping writes don't interleave the
> > data (at least, I have read that numerous times), which is usually
> > implemented with a mutex even though there are other ways.
>
> I think what you are getting at here is that write() needs to atomically
> update the file pointer, which does not need a mutex.

No, that is not the reason. pwrite needs the mutex too.

> > The trickier stuff in proper AIO is sleeping waiting for memory to be
> > freed up, sleeping waiting for a rate-limited request queue entry
> > repeatedly, prior to each of the triple, double, single indirect
> > blocks, which you then sleep waiting to complete, sleeping waiting for
> > an atime update journal node, sleeping on requests and I/O on every
>
> There's no reason to wait for updating the atime, and

> Whether it's reading indirect blocks or b-trees
> doesn't make much difference; the fs ->get_blocks() tries not to sleep
> if possible, and if it must, returns -EAGAIN and the calling code can
> punt to a work queue to try again in a context that can sleep.

Now you are describing using threads in the blocking cases. (Work
queues, thread pools, same thing.) Earlier you were saying threads
are the wrong approach.... Good, good :-)

> The fs specific code just needs to support a flag like gfp_mask so it
> can be told we aren't in a context that can sleep; do your best and if
> you must block, return -EAGAIN. It looks like it almost already does
> something like that based on this comment from fs/mpage.c:

Yes, it's not a bad pattern. Simple to understand.

There's a slight overhead compared with saving the stack frame
fibril-style: the second, sleepable call has to redo much of the work
done in the non-sleepable call, and queuing the work requires
serialising state, plus extra code for that. The work queue also adds
a bit more scheduling overhead.

On the other hand, the queue uses less memory than a stack frame.

For the in-cache cases, there's no overhead so it's fine.

A big problem with it, apart from having to change lots of places in
all the filesystems, is that the work-queues run with the wrong
security and I/O context. Network filesystems break permissions, quotas
break, ionice doesn't work, etc. It's obviously fixable but more
involved than just putting a read request on a work queue.

That's why the fibril/acall discussions talked about spawning threads
from the caller's context or otherwise magically swizzling contexts
around to do it with the efficiency of a preexisting thread pool.

Once you're doing task security & I/O context swizzling (which turns
out to be quite fiddly), the choice between swizzling stack frames or
using EAGAIN and work queue type objects becomes a less radical design
decision, and could even be a per-filesystem, per-operation choice.

> > Oh, and fine-grained locking makes the async transformation harder,
> > not easier :-)
>
> How so? With fine grained locking you can avoid the use of mutexes and
> opt for atomic functions or spin locks, so no need to sleep.

Fine-grained locking isn't the same thing as using non-sleepable locks.

> > For readahead yes because it's just an abortable hint.
> > For general AIO, no.
>
> Why not? aio_read() is perfectly allowed to fail if there is not enough
> memory to satisfy the request.

So is read(). And then the calling application usually exits, because
there's nothing else it can do usefully. Same if aio_read() ever returns ENOMEM.

That way lies an application getting ENOMEM often and having to retry
aio_read in a loop, probably a busy one, which isn't how the interface
is supposed to work, and is not efficient either.

The only atomic allocation you might conceivably want is a small one
to enqueue the AIO and return immediately. But really even that
should sleep. That's the one case where you actually do want
aio_read() to sleep.

> That still leaves the problem of all the open() calls blocking to read
> one disk directory block at a time, since ureadahead opens all of the
> files first, then calls readahead() on each of them. This is where it
> would really help to be able to readahead() the directories first, then
> try to open all of the files.

Put open() in threads too! Actually I don't have any idea how well
that really goes.

> > Also, having defragged readahead files into a few compact zones, and
> > gotten the last boot's I/O trace, why not readahead those areas of the
> > blockdev first in perfect order, before finishing the job with
> > filesystem operations? The redundancy from no-longer needed blocks is
> > probably small compared with the gain from perfect order in few big
> > zones, and if you store the I/O trace of the filesystem stage every
> > time to use for the block stage next time, the redundancy should stay low.
>
> Good point, though I was hoping to be able to accomplish effectively the
> same thing purely with readahead() and other filesystem calls instead of
> going direct to the block device.

It depends on how accurate your block-level traces are, but if the
blocks are consolidated into few contiguous zones, readahead on the
blockdev should give perfect seek order, minimal IOPS and maximum I/O
sizes. It won't even need any particular order from the defrag. It's
hard to see even file readahead() approaching that for speed because
it's so simple.
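
A sketch of that block-device pass; the device path and zone list are
invented for illustration, and in practice the zones would come from
the previous boot's trace, sorted by offset. This warms the block
device's page cache, which is where a filesystem like ext3 reads its
directory and indirect blocks from; how much of the ordinary file data
read this way gets reused is a separate question.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

struct zone { off64_t offset; size_t len; };

int main(void)
{
        static const struct zone zones[] = {    /* hypothetical traced zones */
                {  0ULL << 20, 16 << 20 },
                { 64ULL << 20, 32 << 20 },
        };
        unsigned int i;
        int fd = open("/dev/sda", O_RDONLY);

        if (fd < 0)
                return 1;
        for (i = 0; i < sizeof(zones) / sizeof(zones[0]); i++)
                readahead(fd, zones[i].offset, zones[i].len);
        close(fd);
        return 0;
}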

Consider: boot may read 50MB of data in countless files, including
scripts, parts of shared libs, etc. That's just 0.5 seconds on any
modern system. How long does the ureadahead run take?

-- Jamie

2010-04-23 04:13:19

by Phillip Susi

[permalink] [raw]
Subject: Re: readahead on directories

On Thu, 2010-04-22 at 23:43 +0100, Jamie Lokier wrote:
> No, that is not the reason. pwrite needs the mutex too.

Which mutex and what for?

> Now you are describing using threads in the blocking cases. (Work
> queues, thread pools, same thing.) Earlier you were saying threads
> are the wrong approach.... Good, good :-)

Sure, in some cases, just not ALL. If you can't control whether or not
the call blocks then you HAVE to use threads. If you can be sure it
won't block most of the time, then most of the time you don't need any
other threads, and when you finally do, you need very few.

> A big problem with it, apart from having to change lots of places in
> all the filesystems, is that the work-queues run with the wrong
> security and I/O context. Network filesystems break permissions, quotas
> break, ionice doesn't work, etc. It's obviously fixable but more
> involved than just putting a read request on a work queue.

Hrm... good point.

> Fine-grained locking isn't the same thing as using non-sleepable locks.

Yes, it is not the same, but non-sleepable locks can ONLY be used with
fine-grained locking. The two reasons to use a mutex instead of a spin
lock are that you can sleep while holding it, and that it isn't a
problem to hold it for an extended period of time.
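
A small kernel-style sketch of that distinction, with made-up names: a
spinlock guards a critical section of a few instructions that never
sleeps, while a mutex covers a longer section that may allocate and
sleep.

#include <linux/mutex.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(counter_lock);
static DEFINE_MUTEX(table_mutex);
static unsigned long counter;

static void bump_counter(void)
{
        spin_lock(&counter_lock);       /* a few instructions, never sleeps */
        counter++;
        spin_unlock(&counter_lock);
}

static void rebuild_table(void)
{
        mutex_lock(&table_mutex);       /* long section, may allocate and sleep */
        /* ... kmalloc(..., GFP_KERNEL), copy the table, etc. ... */
        mutex_unlock(&table_mutex);
}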

> So is read(). And then the calling application usually exits, because
> there's nothing else it can do usefully. Same if aio_read() ever returns ENOMEM.
>
> That way lies an application getting ENOMEM often and having to retry
> aio_read in a loop, probably a busy one, which isn't how the interface
> is supposed to work, and is not efficient either.

Simply retrying in a loop would be very stupid. The programs using aio
are not that simple-minded, so they would take more appropriate action.
For example, a server might decide it already has enough data in the
pipe and forget about asking for more until the queues empty, or it
might decide to drop that client, which would free up some more memory,
or it might decide it has some cache it can free up. Something like
readahead could decide that if there isn't enough memory left then it
has no business trying to read any more, and exit. All of these are
preferable to waiting for something else to free up enough memory to
continue.
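
A sketch of that behaviour for the readahead case, with a hypothetical
helper: treat ENOMEM (or EAGAIN) as a signal to stop issuing requests
rather than something to retry in a busy loop.

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

/* Returns 0 to keep going, -1 if the tool should stop issuing requests. */
static int readahead_or_back_off(int fd, size_t size)
{
        if (readahead(fd, 0, size) == 0)
                return 0;
        if (errno == ENOMEM || errno == EAGAIN)
                return -1;      /* memory is tight: no business reading more */
        return 0;               /* other errors: skip this file, carry on */
}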