Pre-caching reloaded ;)
I have been exploring cache prefetching on desktop systems for quite
some time, and recently found ways that have _proven_ effects.
The basic ideas are described in the following Google SoC proposal,
with a proof-of-concept patch for I/O priority support attached.
I'd like to know whether the two proposed kernel components would be
acceptable for the mainline kernel, and any recommendations for improving them.
Thanks,
Wu
------------------------------------------------------------------------
SOC PROPOSAL
Rapid linux desktop startup through pre-caching
MOTIVATION
KDE, Gnome, OpenOffice, and Firefox all take too long to start up.
Boot time pre-caching seems to be the single most straightforward and
effective way to improve it and make the Linux desktop experience more
comfortable. It is a great pleasure for me to take up the work.
QUALIFICATION
I have been working on Linux kernel readahead for over a year, and
developed an adaptive readahead patch (http://lwn.net/Articles/155510/)
that brings a bunch of new capabilities to Linux. During that time I
gained knowledge about the VM/IO subsystems of the Linux kernel, became
acquainted with the slow startup problem and the existing solutions,
why they do not work as one would expect, and how to get the job done better.
PREVIOUS WORKS
There have been some similar efforts, e.g.
- Linux: Boot Time Speedups Through Precaching
http://kerneltrap.org/node/2157
- Andrew Morton's kernel module solution
http://www.zip.com.au/~akpm/linux/fboot.tar.gz
- preload - adaptive readahead daemon
http://sourceforge.net/projects/preload
The former are mainly kernel-side solutions, while the latter is a pure
user-land solution. But none seems to do the trick for system startup
time. Andrew saw a ~10% speedup, while preload actually saw slowdowns.
IMHO, Andrew's kernel module to dump the contents of the pagecache and to
restore it at boot time is one step towards the final solution.
However, it takes one more step to actually win the time: to optimize
the I/O at restore time.
I/O ANALYSIS
How come the prefetcher gains little or even negative time?
After some hunting in the source tree and some experiments,
I came to the following conclusions:
1. the prefetcher has to readahead one file _after_ another, and thus loses
the opportunity to reorder the blocks and reduce seek time.
== reasoning ==
It tends to be blocked on calling open(), where the kernel has to do
some dir/inode/bmap lookups and do tiny I/Os _synchronously_ and with
_locks_ held.
2. the readahead I/O stands in the way of normal I/O, thus the prefetcher
blocks normal apps and loses the opportunity to overlap CPU/IO time.
== reasoning ==
Support for I/O priority is still incomplete in Linux. Of the three
I/O elevators, anticipatory & deadline simply overlook priority;
CFQ is built for fair I/O priorities, though that is still not enough for
the case of readahead. Imagine that the prefetcher first issues an
I/O request for page A with low priority, then the real app comes along,
needs page A, and simply waits for the page to become available,
which will take a rather long time because the elevator still treats
the request as a low priority one.
So we get the amazing fact that prefetching actually slows things down!
SCHEME/GOAL
1) kernel facility to provide necessary I/O priority support
- add basic I/O priority support to AS/deadline elevators:
never have readahead I/O stand in the way of normal I/O
- enable AS/CFQ/deadline to react on I/O priority changes:
reschedule a previous readahead I/O that is now actually
demanded by a legacy program
2) kernel module to query the file cache
- on loading: create /proc/filecache
- setting up: echo "param value" > /proc/filecache
- info query: cat /proc/filecache
- sample sessions:
# modprobe filecache
$ cat /proc/filecache
# file ALL
# mask 0
#
# ino size cached status refcnt name
0 2929846 3181 D 1 /dev/hda1
81455 9 9 _ 1 /sbin/init
......
$ echo "file /dev/hda1" > /proc/filecache
$ cat /proc/filecache
# file /dev/hda1
# mask 0
#
# idx len
0 24
48 2
53 5
......
3) user-land tools to dump the current contents of the file cache,
and to restore them at boot time (see the restore sketch below)
- store as plain text files to be script/user friendly
- be able to run at the very beginning of the boot process
- be able to trim down the cache records (for small systems)
- optional poor man's defrag ;)
- record and replay I/O for any task by taking two cache
snapshots and doing a set-subtract
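
As an illustration of the restore step, here is a minimal user-land sketch
(assuming the dump tool has already flattened /proc/filecache output into
plain-text records of the form "<filename> <page-index> <page-count>";
posix_fadvise(POSIX_FADV_WILLNEED) is used so the requests are only hints):

/*
 * filecache-restore sketch: replay a plain-text cache dump by hinting
 * the kernel to read the recorded ranges back into the page cache.
 */
#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	FILE *dump = fopen(argc > 1 ? argv[1] : "filecache.dump", "r");
	char name[4096];
	unsigned long idx, len;
	long psize = sysconf(_SC_PAGESIZE);

	if (!dump)
		return 1;
	/* one record per line: filename, starting page index, page count */
	while (fscanf(dump, "%4095s %lu %lu", name, &idx, &len) == 3) {
		int fd = open(name, O_RDONLY);

		if (fd < 0)
			continue;
		/* non-blocking hint: start reading this range into the cache */
		posix_fadvise(fd, (off_t)idx * psize, (off_t)len * psize,
			      POSIX_FADV_WILLNEED);
		close(fd);
	}
	fclose(dump);
	return 0;
}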
A proof of concept implementation has been developed and run on my
desktop. According to the experimental results, the expected effect
of the final work would be:
- the power-on to login time remains roughly the same
- most GUI files are ready by login time
- the login to usable desktop time comes close to a warm startup
BIO
I am currently pursuing a PhD degree at the University of Science and
Technology of China. I enjoy computing and GNU/Linux.
- programming since 1993
- using linux since 1998
- playing PXE since 2000
- developing OSS since 2004
- developing adaptive readahead for linux since 2005
------------------------------------------------------------------------
block/deadline-iosched.c | 35 +++++++++++++++++++++++++++++++++++
block/ll_rw_blk.c | 8 ++++++++
fs/buffer.c | 5 +++--
include/linux/elevator.h | 2 ++
4 files changed, 48 insertions(+), 2 deletions(-)
--- linux.orig/block/deadline-iosched.c
+++ linux/block/deadline-iosched.c
@@ -310,6 +310,40 @@ deadline_add_request(struct request_queu
}
/*
+ * kick a page for io
+ */
+static int
+deadline_kick_page(struct request_queue *q, struct page *page)
+{
+ struct deadline_data *dd = q->elevator->elevator_data;
+ struct deadline_rq *drq;
+ struct request *rq;
+ struct list_head *pos;
+ struct bio_vec *bvec;
+ struct bio *bio;
+ int i;
+
+ list_for_each(pos, &dd->fifo_list[READ]) {
+ drq = list_entry_fifo(pos);
+ rq = drq->request;
+ rq_for_each_bio(bio, rq) {
+ bio_for_each_segment(bvec, bio, i) {
+ if (page == bvec->bv_page)
+ goto found;
+ }
+ }
+ }
+
+ return -1;
+
+found:
+ rq->ioprio = 1;
+ list_del(&drq->fifo);
+ deadline_add_drq_fifo(dd, rq);
+ return 0;
+}
+
+/*
* remove rq from rbtree, fifo, and hash
*/
static void deadline_remove_request(request_queue_t *q, struct request *rq)
@@ -784,6 +818,7 @@ static struct elevator_type iosched_dead
.elevator_merge_req_fn = deadline_merged_requests,
.elevator_dispatch_fn = deadline_dispatch_requests,
.elevator_add_req_fn = deadline_add_request,
+ .elevator_kick_page_fn = deadline_kick_page,
.elevator_queue_empty_fn = deadline_queue_empty,
.elevator_former_req_fn = deadline_former_request,
.elevator_latter_req_fn = deadline_latter_request,
--- linux.orig/block/ll_rw_blk.c
+++ linux/block/ll_rw_blk.c
@@ -1620,6 +1620,14 @@ static void blk_backing_dev_unplug(struc
{
request_queue_t *q = bdi->unplug_io_data;
+ if (page &&
+ IOPRIO_PRIO_CLASS(current->ioprio) != IOPRIO_CLASS_IDLE &&
+ q->elevator && q->elevator->ops->elevator_kick_page_fn) {
+ spin_lock_irq(q->queue_lock);
+ q->elevator->ops->elevator_kick_page_fn(q, page);
+ spin_unlock_irq(q->queue_lock);
+ }
+
/*
* devices don't necessarily have an ->unplug_fn defined
*/
--- linux.orig/fs/buffer.c
+++ linux/fs/buffer.c
@@ -63,8 +63,9 @@ static int sync_buffer(void *word)
smp_mb();
bd = bh->b_bdev;
- if (bd)
- blk_run_address_space(bd->bd_inode->i_mapping);
+ if (bd && bd->bd_inode && bd->bd_inode->i_mapping)
+ blk_run_backing_dev(bd->bd_inode->i_mapping->backing_dev_info,
+ bh->b_page);
io_schedule();
return 0;
}
--- linux.orig/include/linux/elevator.h
+++ linux/include/linux/elevator.h
@@ -20,6 +20,7 @@ typedef int (elevator_set_req_fn) (reque
typedef void (elevator_put_req_fn) (request_queue_t *, struct request *);
typedef void (elevator_activate_req_fn) (request_queue_t *, struct request *);
typedef void (elevator_deactivate_req_fn) (request_queue_t *, struct request *);
+typedef int (elevator_kick_page_fn) (request_queue_t *, struct page *);
typedef int (elevator_init_fn) (request_queue_t *, elevator_t *);
typedef void (elevator_exit_fn) (elevator_t *);
@@ -34,6 +35,7 @@ struct elevator_ops
elevator_add_req_fn *elevator_add_req_fn;
elevator_activate_req_fn *elevator_activate_req_fn;
elevator_deactivate_req_fn *elevator_deactivate_req_fn;
+ elevator_kick_page_fn *elevator_kick_page_fn;
elevator_queue_empty_fn *elevator_queue_empty_fn;
elevator_completed_req_fn *elevator_completed_req_fn;
>
> PREVIOUS WORKS
>
> There has been some similar efforts, i.e.
> - Linux: Boot Time Speedups Through Precaching
> http://kerneltrap.org/node/2157
> - Andrew Morton's kernel module solution
> http://www.zip.com.au/~akpm/linux/fboot.tar.gz
> - preload - adaptive readahead daemon
> http://sourceforge.net/projects/preload
you missed the solution Fedora has deployed for over a year using readahead
On Tue, May 02, 2006 at 09:58:44AM +0200, Arjan van de Ven wrote:
>
> >
> > PREVIOUS WORKS
> >
> > There has been some similar efforts, i.e.
> > - Linux: Boot Time Speedups Through Precaching
> > http://kerneltrap.org/node/2157
> > - Andrew Morton's kernel module solution
> > http://www.zip.com.au/~akpm/linux/fboot.tar.gz
> > - preload - adaptive readahead daemon
> > http://sourceforge.net/projects/preload
>
> you missed the solution Fedora deploys since over a year using readahead
Thanks, and sorry for the other previous works that I failed to mention here :)
Wu
On Tue, May 02 2006, Wu Fengguang wrote:
> I'd like to know whether the two proposed kernel components would be
> acceptable for the mainline kernel, and any recommends to improve them.
I tried something very similar to this years ago, except I made it
explicit instead of hiding it in the blk_run_backing_dev() which we
didn't have at that time. My initial results showed that you would get a
load of requests for different pages so would end up doing io randomly
instead again.
Your patch isn't exactly pretty, but that is fixable. I'm more
interested in how you'd control that sanely.
--
Jens Axboe
On Tue, May 02, 2006 at 10:09:36AM +0200, Jens Axboe wrote:
> I tried something very similar to this years ago, except I made it
> explicit instead of hiding it in the blk_run_backing_dev() which we
> didn't have at that time. My initial results showed that you would get a
> load of requests for different pages so would end up doing io randomly
> instead again.
Yes, the hard part would be to _not_ impact normal I/Os in any way.
A simple solution for the case of the deadline scheduler would be:
- static const int read_expire = HZ / 2; /* max time before a read is submitted. */
+ static const int read_expire = HZ / 2; /* max time before an impending read is submitted. */
+ static const int reada_expire = HZ * 30; /* max time before a read-ahead is submitted. */
Wu
On Tue, 2006-05-02 at 16:06 +0800, Wu Fengguang wrote:
> On Tue, May 02, 2006 at 09:58:44AM +0200, Arjan van de Ven wrote:
> >
> > >
> > > PREVIOUS WORKS
> > >
> > > There has been some similar efforts, i.e.
> > > - Linux: Boot Time Speedups Through Precaching
> > > http://kerneltrap.org/node/2157
> > > - Andrew Morton's kernel module solution
> > > http://www.zip.com.au/~akpm/linux/fboot.tar.gz
> > > - preload - adaptive readahead daemon
> > > http://sourceforge.net/projects/preload
> >
> > you missed the solution Fedora deploys since over a year using readahead
>
> Thanks, and sorry for more previous works that I failed to mention here :)
one interesting thing that came out of the fedora readahead work is that
most of the bootup isn't actually IO bound. My test machine for example
can load all the data into ram in about 10 seconds, so that the rest of
the boot is basically IO-less. But that still takes 2 minutes....
So I'm not entirely sure how much you can win by just attacking this.
Another interesting approach would be to actually put all the data you
want to use in a non-fragmented, sequential area on disk somehow (there
is an OLS paper submitted about that by Ben) so that at least the disk
side is seekless...
On Tue, May 02, 2006 at 10:30:17AM +0200, Arjan van de Ven wrote:
> one interesting thing that came out of the fedora readahead work is that
> most of the bootup isn't actually IO bound. My test machine for example
> can load all the data into ram in about 10 seconds, so that the rest of
> the boot is basically IO-less. But that still takes 2 minutes....
> So I'm not entirely sure how much you can win by just attacking this.
Yes, I find it hard to improve the boot time of the init.d stage.
However, it is perfectly ok to preload all the GUI stuff during that
timespan, by overlapping CPU/IO activities.
> Another interesting approach would be to actually put all the data you
> want to use in a non-fragmented, sequential area on disk somehow (there
> is an OLS paper submitted about that by Ben) so that at least the disk
> side is seekless...
You are right, reducing seek distances does not help much. My fluxbox
desktop requires nearly 3k seeks, which can be loaded in the 20s init.d
booting time. But for KDE/GNOME desktops, some defragging would be
necessary to fit them into the 20s time span.
I found ext3 to be a rather good filesystem for supporting the poor man's
defrag described by Chris Mason:
http://www.gelato.unsw.edu.au/archives/git/0504/1690.html
This certainly can help. Based on some ideas from andrea I
made a poor man's defrag script last year that was similar.
It worked by copying files into a flat dir in the order you
expected to read them in, deleting the original, then hard
linking them into their original name.
Making the 'flat dir' a top level dir would do the trick for ext3.
We have a good chance of merging 10k seeks into 3k seeks with this trick :)
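
For illustration, a minimal sketch of that trick for a single file (a
hypothetical helper written against plain POSIX calls; a real tool would
preserve ownership, permissions and timestamps, and process files in the
recorded read order):

/*
 * Poor man's defrag for one file: copy it into a flat directory so the new
 * blocks get allocated (roughly) in read order, unlink the fragmented
 * original, then hard-link the copy back under the original name.
 */
#include <stdio.h>
#include <unistd.h>

int relocate(const char *orig, const char *flat_copy)
{
	char buf[64 * 1024];
	size_t n;
	FILE *in = fopen(orig, "rb");
	FILE *out = fopen(flat_copy, "wb");

	if (!in || !out)
		goto fail;
	while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
		if (fwrite(buf, 1, n, out) != n)
			goto fail;
	if (ferror(in))
		goto fail;
	fclose(in);
	fclose(out);

	if (unlink(orig))		/* drop the fragmented original */
		return -1;
	return link(flat_copy, orig);	/* old name now points at the new blocks */
fail:
	if (in)
		fclose(in);
	if (out)
		fclose(out);
	return -1;
}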
Wu
On Tue, 2006-05-02 at 16:53 +0800, Wu Fengguang wrote:
> On Tue, May 02, 2006 at 10:30:17AM +0200, Arjan van de Ven wrote:
> > one interesting thing that came out of the fedora readahead work is that
> > most of the bootup isn't actually IO bound. My test machine for example
> > can load all the data into ram in about 10 seconds, so that the rest of
> > the boot is basically IO-less. But that still takes 2 minutes....
> > So I'm not entirely sure how much you can win by just attacking this.
>
> Yes, I find it hard to improve the boot time of the init.d stage.
> However, it is perfectly ok to preload all GUI staffs during that
> timespan, by overlapping CPU/IO activities.
fwiw fedora even loads a bunch of GUI apps into memory already
>
> > Another interesting approach would be to actually put all the data you
> > want to use in a non-fragmented, sequential area on disk somehow (there
> > is an OLS paper submitted about that by Ben) so that at least the disk
> > side is seekless...
>
> You are right, reducing seeking distances helps not much. My fluxbox
> desktop requires near 3k seeks, which can be loaded in the 20s init.d
> booting time. But for KDE/GNOME desktops, some defragging would be
> necessary to fit them into the 20s time span.
or just move the blocks (or copy them) to a reserved area...
>> > Another interesting approach would be to actually put all the data you
>> > want to use in a non-fragmented, sequential area on disk somehow (there
>> > is an OLS paper submitted about that by Ben) so that at least the disk
>> > side is seekless...
>>
>> You are right, reducing seeking distances helps not much. My fluxbox
>> desktop requires near 3k seeks, which can be loaded in the 20s init.d
>> booting time. But for KDE/GNOME desktops, some defragging would be
>> necessary to fit them into the 20s time span.
>
>or just move the blocks (or copy them) to a reserved area...
Or even put it into a ramdisk, and then fix userspace. When Gnome loads
more than 10000 files [http://kerneltrap.org/node/2157], it sounds like a
design problem.
Jan Engelhardt
--
On Tue, May 02, 2006 at 10:55:29AM +0200, Arjan van de Ven wrote:
> > Yes, I find it hard to improve the boot time of the init.d stage.
> > However, it is perfectly ok to preload all GUI staffs during that
> > timespan, by overlapping CPU/IO activities.
>
> fwiw fedora even loads a bunch of GUI apps into memory already
Hopefully the warm cache time has been improving.
I have very good performance with fluxbox. Lubos Lunak said
(http://wiki.kde.org/tiki-index.php?page=KDE%20Google%20SoC%202006%20ideas#id106304)
that "KDE startup with cold caches is several times slower than with
warm caches." The latest GNOME release also saw good speedup.
> > > Another interesting approach would be to actually put all the data you
> > > want to use in a non-fragmented, sequential area on disk somehow (there
> > > is an OLS paper submitted about that by Ben) so that at least the disk
> > > side is seekless...
> >
> > You are right, reducing seeking distances helps not much. My fluxbox
> > desktop requires near 3k seeks, which can be loaded in the 20s init.d
> > booting time. But for KDE/GNOME desktops, some defragging would be
> > necessary to fit them into the 20s time span.
>
> or just move the blocks (or copy them) to a reserved area...
Ah, we can save discontinuous pages into the journal file at unmount time
and restore them at mount time. But let's do it the simple way first :)
Wu
On Tue, 2 May 2006 15:50:49 +0800,
Wu Fengguang <[email protected]> wrote:
> 2) kernel module to query the file cache
Can't mincore() + /proc/$PID/* stuff be used to replace that ?
Improving boot time is nice and querying the file cache would work
for that, but improving the boot time of some programs once the system
is running (ie: running openoffice 6 hours after booting) is something
that other preloaders do in other OSes as well, querying the full file
cache wouldn't be that useful for such cases.
The main reason why I believe that the pure userspace (preload.sf.net)
solution slows things down in some cases is because it uses Bayesian heuristics
(!) as a magic ball to guess the future, which is a flawed idea IMHO.
I started (but didn't finish) a preloader which uses the process event
connector to get notifications of what processes are being launched,
then it profiles it (using mincore(), /proc/$PID/* stuff, etc) and
preloads things optimally the next time it gets a notification of the
same app.
Mac OS X has a program that implements your idea, available (the sources)
at http://darwinsource.opendarwin.org/projects/apsl/BootCache-25/
Also, mac os x uses launchd, as init/init.d replacement
(http://darwinsource.opendarwin.org/projects/apsl/launchd-106/)
If (as Arjan noticed) the bootup is not really IO-bound, launchd could
help to reduce the amount of CPU time wasted if the CPU time being wasted
is in shell interpreters (a more unixy port has been done by the freebsd
people - http://wikitest.freebsd.org/moin.cgi/launchd)
Diego,
On Tue, May 02, 2006 at 02:46:41PM +0200, Diego Calleja wrote:
> On Tue, 2 May 2006 15:50:49 +0800,
> Wu Fengguang <[email protected]> wrote:
>
> > 2) kernel module to query the file cache
>
> Can't mincore() + /proc/$PID/* stuff be used to replace that ?
Nope. mincore() only provides info about files that are currently
opened by the process itself. The majority of the files in the cache
are closed files.
> Improving boot time is nice and querying the file cache would work
> for that, but improving the boot time of some programs once the system
> is running (ie: running openoffice 6 hours after booting) is something
> that other preloaders do in other OSes aswell, querying the full file
> cache wouldn't be that useful for such cases.
Yes, it can still be useful after booting :) One can get the cache
footprint of any task started at any time by taking snapshots of the
cache before and after the task, and doing a set-subtract on them.
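
A minimal sketch of that set-subtract (assuming the two snapshots are
plain-text dumps with one record per line, small enough to hold in memory):

/*
 * Print every record of the "after" snapshot that does not appear in the
 * "before" snapshot -- i.e. the cache footprint of whatever ran in between.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmp(const void *a, const void *b)
{
	return strcmp(*(char * const *)a, *(char * const *)b);
}

static char **read_lines(const char *path, size_t *count)
{
	FILE *f = fopen(path, "r");
	char **lines = NULL;
	char buf[8192];
	size_t n = 0;

	if (!f) {
		perror(path);
		exit(1);
	}
	while (fgets(buf, sizeof(buf), f)) {
		lines = realloc(lines, (n + 1) * sizeof(*lines));
		lines[n++] = strdup(buf);
	}
	fclose(f);
	*count = n;
	return lines;
}

int main(int argc, char **argv)
{
	size_t nb, na, i;
	char **before, **after;

	if (argc != 3) {
		fprintf(stderr, "usage: %s before.dump after.dump\n", argv[0]);
		return 1;
	}
	before = read_lines(argv[1], &nb);
	after  = read_lines(argv[2], &na);

	if (nb)
		qsort(before, nb, sizeof(*before), cmp);
	for (i = 0; i < na; i++)
		if (!nb || !bsearch(&after[i], before, nb, sizeof(*before), cmp))
			fputs(after[i], stdout);
	return 0;
}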
> The main reason why I believe that the pure userspace (preload.sf.net)
> solution slows down in some cases is becauses it uses bayesian heuristics
> (!) as a magic ball to guess the future, which is a flawed idea IMHO.
> I started (but didn't finish) a preloader which uses the process event
> connector to get notifications of what processes are being launched,
> then it profiles it (using mincore(), /proc/$PID/* stuff, etc) and
> preloads things optimally the next time it gets a notification of the
> same app.
My thought is that any prefetcher that ignores the thousands of small
data files is incomplete. Only the kernel can provide that info.
> Mac OS X has a program that implements your idea, available (the sources)
> at http://darwinsource.opendarwin.org/projects/apsl/BootCache-25/
Ah, thanks for mentioning it.
It seems to do the job mostly in the kernel, comprising 2400+ LOC.
My plan, in contrast, is to write a module of ~300 LOC that _only_ provides
the necessary info, and leaves the other jobs to smart user-land tools.
Wu
On Tue, 2 May 2006, Wu Fengguang wrote:
>
> block/deadline-iosched.c | 35 +++++++++++++++++++++++++++++++++++
> block/ll_rw_blk.c | 8 ++++++++
> fs/buffer.c | 5 +++--
> include/linux/elevator.h | 2 ++
> 4 files changed, 48 insertions(+), 2 deletions(-)
Now, regardless of the other issues people have brought up, I'd like to
say that I think this is broken.
Doing prefetching on a physical block basis is simply not a valid
approach, for several reasons:
- it misses several important cases (you can surely prefetch over NFS
too)
- it gives you only a very limited view into what is actually going on.
- it doesn't much allow you to _fix_ any problems, it just allows you to
try to paper them over.
- it's useless anyway, since pretty much all kernel caching is based on
virtual caches, so if you "pre-read" the physical buffers, it won't
help: you'll just waste time reading the data into a buffer that will
never be used, and when the real request comes in, the read will
be done _again_.
So I would _seriously_ claim that the place to do all the statistics
allocation is in anything that ends up having to call "->readpage()", and
do it all on a virtual mapping level.
Yes, it isn't perfect either (I'll mention some problems), but it's a
_lot_ better. It means that when you gather the statistics, you can see
the actual _files_ and offsets being touched. You can even get the
filenames by following the address space -> inode -> i_dentry list.
This is important for several reasons:
(a) it makes it a hell of a lot more readable, and the user gets a
lot more information that may make him see the higher-level issues
involved.
(b) it's in the form that we cache things, so if you read-ahead in
that form, you'll actually get real information.
(c) it's in a form where you can actually _do_ something about things
like fragmentation etc ("Oh, I could move these files all to a
separate area")
Now, admittedly it has a few downsides:
- right now "readpage()" is called in several places, and you'd have to
create some kind of nice wrapper for the most common
"mapping->a_ops->readpage()" thing and hook into there to avoid
duplicating the effort.
Alternatively, you could decide that you only want to do this at the
filesystem level, which actually simplifies some things. If you
instrument "mpage_readpage[2]()", you'll already get several of the
ones you care about, and you could do the others individually.
[ As a third alternative, you might decide that the only thing you
actually care about is when you have to wait on a locked page, and
instrument the page wait-queues instead. ]
- it will miss any situation where a filesystem does a read some other
way. Notably, in many loads, the _directory_ accesses are the important
ones, and if you want statistics for those you'd often have to do that
separately (not always - some of the filesystems just use the same
page reading stuff).
The downsides basically boil down to the fact that it's not as clearly
just one single point. You can't just look at the request queue and see
what physical requests go out.
NOTE! You can obviously do both, and try to correlate one against the
other, and you'd get the best possible information ("is this seek because
we started reading another file, or is it because the file itself is
fragmented" kind of stuff). So physical statistics aren't meaningless.
They're just _less_ important than the virtual ones, and you should do
them only if you already do virtual stats.
Linus
On Tue, 2 May 2006 22:42:03 +0800,
Wu Fengguang <[email protected]> wrote:
> Nope. mincore() only provides info about files that are currently
> opened, by the process itself. The majority in the file cache are
> closed files.
Yes; what I meant was:
- start listening to the process event connector as the first task on startup
- get a list of what files are being used (every second or so)
by looking at the /proc/$PID/* stuff
- mincore() all those files at the end of the bootup
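
A minimal sketch of that last step for one file (note it only sees files the
tool can open and mmap itself, which is exactly the limitation discussed
above):

/*
 * mincore() sketch: report how many pages of a given file are resident in
 * the page cache.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	int fd;
	struct stat st;
	long psize = sysconf(_SC_PAGESIZE);
	size_t pages, i, resident = 0;
	unsigned char *vec;
	void *map;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0 || fstat(fd, &st))
		return 1;
	if (st.st_size == 0)
		return 0;

	pages = (st.st_size + psize - 1) / psize;
	vec = malloc(pages);
	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (!vec || map == MAP_FAILED)
		return 1;

	if (mincore(map, st.st_size, vec) == 0)
		for (i = 0; i < pages; i++)
			resident += vec[i] & 1;	/* bit 0: page is resident */

	printf("%s: %zu/%zu pages in core\n", argv[1], resident, pages);
	return 0;
}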
> Yes, it can still be useful after booting :) One can get the cache
> footprint of any task started at any time by taking snapshots of the
> cache before and after the task, and do a set-subtract on them.
Although this certainly looks simpler for userspace (and complete, if
you want to get absolutely all the info about files that get opened and
closed faster than the profile interval of a prefetcher)
(another useful tool would be a dtrace-like thing)
Linus Torvalds <[email protected]> writes:
>
> The downsides basically boil down to the fact that it's not as clearly
> just one single point. You can't just look at the request queue and see
> what physical requests go out.
The other problem is that readpages has no idea about the layout
of the disk so just doing it dumbly might end up with a lot of seeking.
I suppose you would at least need some tweaks to the scheduler to make
sure it doesn't give these reads any priority and keeps them in the
queue long enough to get real good sorting and large requests.
Problem is that this would likely need to be asynchronous to be of any
use, unless you were willing to block a thread for a potentially
long time for each file to be prefetched.
-Andi
Hi!
> SOC PROPOSAL
>
> Rapid linux desktop startup through pre-caching
Looks nice to me...
> SCHEME/GOAL
> 2) kernel module to query the file cache
> - on loading: create /proc/filecache
> - setting up: echo "param value" > /proc/filecache
> - info query: cat /proc/filecache
> - sample sessions:
>
> # modprobe filecache
> $ cat /proc/filecache
> # file ALL
> # mask 0
> #
> # ino size cached status refcnt name
> 0 2929846 3181 D 1 /dev/hda1
> 81455 9 9 _ 1 /sbin/init
> ......
>
> $ echo "file /dev/hda1" > /proc/filecache
> $ cat /proc/filecache
> # file /dev/hda1
> # mask 0
> #
> # idx len
> 0 24
> 48 2
> 53 5
> ......
Could we use this instead of blockdev freezing/big suspend image
support? It should permit us to resume quickly (with small image), and
then do readahead. ... that will give us usable machine quickly, still
very responsive desktop after resume?
Pavel
--
Thanks for all the (sleeping) penguins.
On Tue, May 02, 2006 at 10:30:17AM +0200, Arjan van de Ven wrote:
> one interesting thing that came out of the fedora readahead work is that
> most of the bootup isn't actually IO bound.
Here's another interesting datapoint.
A huge proportion of I/O done during bootup is _completely_ unnecessary.
Profiling just a few of the places that seemed to stall yielded some
lovely things, like cupsd reading in and parsing descriptions for
every printer ever known to man, even if there's no printer connected.
Or my favorite.. hald reloading and reparsing the same XML hardware description
files.. ~50 times, _each_.
Utterly insane.
(And these aren't Fedora specific problems btw, every distro shipping those
bits will have the same dumb behaviour [modulo versions])
Dave
--
http://www.codemonkey.org.uk
Hi.
On Wednesday 03 May 2006 05:10, Pavel Machek wrote:
> Could we use this instead of blockdev freezing/big suspend image
> support? It should permit us to resume quickly (with small image), and
> then do readahead. ... that will give us usable machine quickly, still
> very responsive desktop after resume?
Unrelated to bdev freezing, and will involve far more seeking than reading a
contiguous image (as they normally are).
Regards,
Nigel
--
See our web page for Howtos, FAQs, the Wiki and mailing list info.
http://www.suspend2.net IRC: #suspend2 on Freenode
Hi Pavel,
On Tue, May 02, 2006 at 09:10:01PM +0200, Pavel Machek wrote:
> > SOC PROPOSAL
> >
> > Rapid linux desktop startup through pre-caching
>
> Looks nice me...
Thanks.
> > SCHEME/GOAL
> > 2) kernel module to query the file cache
> > - on loading: create /proc/filecache
> > - setting up: echo "param value" > /proc/filecache
> > - info query: cat /proc/filecache
> > - sample sessions:
> >
> > # modprobe filecache
> > $ cat /proc/filecache
> > # file ALL
> > # mask 0
> > #
> > # ino size cached status refcnt name
> > 0 2929846 3181 D 1 /dev/hda1
> > 81455 9 9 _ 1 /sbin/init
> > ......
> >
> > $ echo "file /dev/hda1" > /proc/filecache
> > $ cat /proc/filecache
> > # file /dev/hda1
> > # mask 0
> > #
> > # idx len
> > 0 24
> > 48 2
> > 53 5
> > ......
>
> Could we use this instead of blockdev freezing/big suspend image
> support? It should permit us to resume quickly (with small image), and
> then do readahead. ... that will give us usable machine quickly, still
> very responsive desktop after resume?
Sure. Indeed I have considered the possibility of integrating my
work with the foresighted userland-swsusp idea. The added value of
this approach is that user-land tools can do all kinds of smart
processing of the contents, and present users with different policies.
Another imagined user of /proc/filecache is system admins :)
Wu
On Wed, May 03, 2006 at 09:36:12AM +1000, Nigel Cunningham wrote:
> > Could we use this instead of blockdev freezing/big suspend image
> > support? It should permit us to resume quickly (with small image), and
> > then do readahead. ... that will give us usable machine quickly, still
> > very responsive desktop after resume?
>
> Unrelated to bdev freezing, and will involve far more seeking than reading a
> contiguous image (as they normally are).
Yes, needs more pondering on this issue...
On Tue, May 02, 2006 at 08:55:06AM -0700, Linus Torvalds wrote:
> Doing prefetching on a physical block basis is simply not a valid
> approach, for several reasons:
Sorry!
I made a misleading introduction. I'll try to explain it in more detail.
DATA ACQUISITION
/proc/filecache provides an interface to query the cached pages of any
file. This information is expressed in tuples of <idx, len>, which
more specifically means <mapping-offset, pages>.
Normally one should use 'echo' to set up two parameters before doing
'cat':
@file
the filename;
use 'ALL' to get a list of all cached files
@mask
only show the pages with non-zero (page-flags & @mask);
for simplicity, use '0' to show all present pages (0 is taken as ~0)
Normally, one should first get the file list using the param 'file ALL',
and then iterate through all the files and pages of interest with the
params 'file filename' and 'mask pagemask'.
The param 'mask' acts as a filter for different users: it allows
sysadmins to know where their memory goes, and the prefetcher to ignore
pages from false readahead.
One can use 'mask hex(PG_active|PG_referenced|PG_mapped)' in its hex form
to show only accessed pages (here PG_mapped is a faked flag), and use
'mask hex(PG_dirty)' to show only dirtied pages.
One can use
$ echo "file /sbin/init" > /proc/filecache
$ echo "mask 0" > /proc/filecache
$ cat /proc/filecache
to get an idea which pages of /sbin/init are currently cached.
In the proposal, I used the following example, which proved to be
rather misleading:
$ echo "file /dev/hda1" > /proc/filecache
$ cat /proc/filecache
The intention of that example was to show that filesystem dir/inode
buffer status -- which is the key data for user-land pre-caching --
can also be retrieved through this interface.
So the proposed solution is to
- prefetch normal files on the virtual mapping level
- prefetch fs dir/inode buffers on a physical block basis
I/O SUBMISSION
How can we avoid unnecessary seeks when prefetching on the virtual mapping
level? The answer is to leave this job to the I/O elevators. What we
should do is present the elevators with most of the readahead requests before
too many requests have been submitted to the disk drivers.
The proposed scheme is to:
1) (first things first)
issue all readahead requests for filesystem buffers
2) (in background, often blocked)
issue all readahead requests for normal files
-) make sure the above requests are of really _low_ priority
3) regular system boot continues
4) promote the priority of any request that is now demanded by
legacy programs
In this scheme, most of the work is done by user-land tools. The required
kernel support is minimal and general-purpose:
- a /proc/filecache interface
- the ability to promote the I/O priority of demanded pages
With this approach, we avoid the complicated OS X BootCache solution,
which is a physical-block-based, special-handling-in-kernel solution
that is exactly what Linus is against.
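
As a rough sketch of steps 2) and -) above, the user-land prefetcher can drop
itself into the idle I/O class before touching any file, so that its readahead
never competes with normal boot I/O (the ioprio constants are copied from the
kernel's ioprio definitions; ioprio_set() has no glibc wrapper, hence the raw
syscall):

/*
 * Background prefetch sketch: demote ourselves to the idle I/O class, then
 * issue whole-file readahead hints for every file named on the command line.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>

#define IOPRIO_WHO_PROCESS	1
#define IOPRIO_CLASS_IDLE	3
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))

int main(int argc, char **argv)
{
	int i;

	/* all I/O issued from this process context is now idle-priority */
	if (syscall(__NR_ioprio_set, IOPRIO_WHO_PROCESS, getpid(),
		    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)) < 0)
		perror("ioprio_set");

	for (i = 1; i < argc; i++) {
		int fd = open(argv[i], O_RDONLY);

		if (fd < 0)
			continue;
		/* len == 0 means "to end of file"; the call only hints,
		 * it does not wait for the data */
		posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
		close(fd);
	}
	return 0;
}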
Thanks,
Wu
On Tue, May 02, 2006 at 06:07:53PM +0200, Diego Calleja wrote:
> > Nope. mincore() only provides info about files that are currently
> > opened, by the process itself. The majority in the file cache are
> > closed files.
>
> Yes; what I meant was:
> - start listening the process event connector as firsk task on startup
> - get a list of what files are being used (every second or something)
> by looking at /proc/$PID/* stuff
> - mincore() all those files at the end of the bootup
>
> > Yes, it can still be useful after booting :) One can get the cache
> > footprint of any task started at any time by taking snapshots of the
> > cache before and after the task, and do a set-subtract on them.
>
> Although this certainly looks simpler for userspace (and complete, if
> you want to get absolutely all the info about files that get opened and
> closed faster than the profile interval of a prefetcher)
Thanks, so it's a question of simplicity/completeness :)
> (another useful tool would be a dtrace-like thing)
Lubos Lunak also reminds me of SUSE's preload
(http://en.opensuse.org/index.php?title=SUPER_preloading_internals)
which is a user-land solution using strace to collect the info.
And there's Andrea Arcangeli's "bootcache userspace logging" kernel
patch(http://lkml.org/lkml/2004/8/6/216).
Wu
On Tue, May 02, 2006 at 08:55:06AM -0700, Linus Torvalds wrote:
> So I would _seriously_ claim that the place to do all the statistics
> allocation is in anything that ends up having to call "->readpage()", and
> do it all on a virtual mapping level.
>
> Yes, it isn't perfect either (I'll mention some problems), but it's a
> _lot_ better. It means that when you gather the statistics, you can see
> the actual _files_ and offsets being touched. You can even get the
> filenames by following the address space -> inode -> i_dentry list.
>
> This is important for several reasons:
> (a) it makes it a hell of a lot more readable, and the user gets a
> lot more information that may make him see the higher-level issues
> involved.
> (b) it's in the form that we cache things, so if you read-ahead in
> that form, you'll actually get real information.
> (c) it's in a form where you can actually _do_ something about things
> like fragmentation etc ("Oh, I could move these files all to a
> separate area")
There have been two alternatives for me:
1) a static/passive interface, i.e. the /proc/filecache querier
- user-land tools request a dump of the cache contents on demand
2) a dynamic/active interface, i.e. the readpage() logger
- a user-land daemon receives live page access/IO activity
> Now, admittedly it has a few downsides:
>
> - right now "readpage()" is called in several places, and you'd have to
> create some kind of nice wrapper for the most common
> "mapping->a_ops->readpage()" thing and hook into there to avoid
> duplicating the effort.
>
> Alternatively, you could decide that you only want to do this at the
> filesystem level, which actually simplifies some things. If you
> instrument "mpage_readpage[2]()", you'll already get several of the
> ones you care about, and you could do the others individually.
>
> [ As a third alternative, you might decide that the only thing you
> actually care about is when you have to wait on a locked page, and
> instrument the page wait-queues instead. ]
>
> - it will miss any situation where a filesystem does a read some other
> way. Notably, in many loads, the _directory_ accesses are the important
> ones, and if you want statistics for those you'd often have to do that
> separately (not always - some of the filesystems just use the same
> page reading stuff).
>
> The downsides basically boil down to the fact that it's not as clearly
> just one single point. You can't just look at the request queue and see
> what physical requests go out.
Good insights.
The readpage() activity logging idea has been appealing to me.
We might even go further and log mark_page_accessed() calls for more
information.
This approach is more precise, and provides process/page
correlations and time info that the /proc/filecache interface cannot
provide. Though it involves more complexity and overhead (which for me
means a higher possibility of being rejected :)).
Wu
On Tue, May 02, 2006 at 09:10:01PM +0200, Pavel Machek wrote:
> Could we use this instead of blockdev freezing/big suspend image
> support? It should permit us to resume quickly (with small image), and
> then do readahead. ... that will give us usable machine quickly, still
> very responsive desktop after resume?
It seems that it could help in reducing the image size:
write only small ranges of file pages to the suspend image (maybe 80MB
= 10k ranges * 8k avg size), and let the prefetcher restore the other large
chunks of code/data, depending on user-specified policies.
Wu
Linus Torvalds writes:
>
[...]
>
> Now, admittedly it has a few downsides:
>
> - right now "readpage()" is called in several places, and you'd have to
> create some kind of nice wrapper for the most common
> "mapping->a_ops->readpage()" thing and hook into there to avoid
> duplicating the effort.
>
> Alternatively, you could decide that you only want to do this at the
> filesystem level, which actually simplifies some things. If you
> instrument "mpage_readpage[2]()", you'll already get several of the
> ones you care about, and you could do the others individually.
>
> [ As a third alternative, you might decide that the only thing you
> actually care about is when you have to wait on a locked page, and
> instrument the page wait-queues instead. ]
>
> - it will miss any situation where a filesystem does a read some other
> way. Notably, in many loads, the _directory_ accesses are the important
> ones, and if you want statistics for those you'd often have to do that
> separately (not always - some of the filesystems just use the same
> page reading stuff).
Another disadvantage is that ->readpage() can only do read-ahead within a
single file, which is not helpful for the case of reading a lot of small
files (and this is what happens during startup).
And to implement reasonable multi-file read-ahead at the file system
layer, one needs an asynchronous inode loading interface implemented for
every file system.
Nikita.
On Wed, 2006-05-03 at 12:11 +0800, Wu Fengguang wrote:
> On Tue, May 02, 2006 at 08:55:06AM -0700, Linus Torvalds wrote:
> > Doing prefetching on a physical block basis is simply not a valid
> > approach, for several reasons:
>
> Sorry!
> I made a misleading introduction. I'll try to explain it in more detail.
>
> DATA ACQUISITION
>
> /proc/filecache provides an interface to query the cached pages of any
> file. This information is expressed in tuples of <idx, len>, which
> more specifically means <mapping-offset, pages>.
>
> Normally one should use 'echo' to setup two parameters before doing
> 'cat':
> @file
> the filename;
> use 'ALL' to get a list all files cached
> @mask
> only show the pages with non-zero (page-flags & @mask);
> for simplicity, use '0' to show all present pages(take 0 as ~0)
>
> Normally, one should first get the file list using param 'file ALL',
> and then iterate through all the files and pages of interested with
> params 'file filename' and 'mask pagemask'.
>
> The param 'mask' acts as a filter for different users: it allows
> sysadms to know where his memory goes, and the prefetcher to ignore
> pages from false readahead.
>
> One can use 'mask hex(PG_active|PG_referenced|PG_mapped)' in its hex form
> to show only accessed pages(here PG_mapped is a faked flag), and use
> 'mask hex(PG_dirty)' to show only dirtied pages.
>
> One can use
> $ echo "file /sbin/init" > /proc/filecache
> $ echo "mask 0" > /proc/filecache
> $ cat /proc/filecache
> to get an idea which pages of /sbin/init are currently cached.
>
> In the proposal, I used the following example, which is proved to be
> rather misleading:
> $ echo "file /dev/hda1" > /proc/filecache
> $ cat /proc/filecache
> The intention of that example was to show that filesystem dir/inode
> buffer status -- which is the key data for user-land pre-caching --
> can also be retrieved through this interface.
>
> So the proposed solution is to
> - prefetch normal files on the virtual mapping level
> - prefetch fs dir/inode buffers on a physical block basis
>
> I/O SUBMISSION
> How can we avoid unnecessary seeks when prefetching on virtual mapping
> level? The answer is to leave this job to i/o elevators. What we
> should do is to present elevators with most readahead requests before
> too many requests being submitted to disk drivers.
> The proposed scheme is to:
> 1) (first things first)
> issue all readahead requests for filesystem buffers
> 2) (in background, often blocked)
> issue all readahead requests for normal files
> -) make sure the above requests are of really _low_ priority
> 3) regular system boot continues
> 4) promote the priority of any request that is now demanded by
> legacy programs
>
> In the scheme, most work is done by user-land tools. The required
> kernel support is minimal and general-purpose:
> - an /proc/filecache interface
> - the ability to promote I/O priority on demanded pages
>
> By this approach, we avoided the complicated OSX bootcache solution,
> which is a physical-blocks-based, special-handlings-in-kernel solution
> that is exactly what Linus is against.
Wu,
A while ago, I hacked up a similar /proc interface
echo "<filesystem-name>" > /proc/pagecache-usage
which showed the pagecache usage of every file in that filesystem
(filename, #pages). My main objective was to shoot down the pagecache
for all the files in a given filesystem. I ended up using it to do
posix_fadvise(POSIX_FADV_DONTNEED) on those files. (Initially, I tried
to do this without it, by doing fadvise() on all files in the
filesystem - but that ended up bloating the inode and dentry caches.)
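
For reference, the drop side of that boils down to the following (a sketch for
a single file; the fsync() is there because dirty pages are not dropped):

/*
 * Drop one file's pages from the page cache.
 */
#include <fcntl.h>
#include <unistd.h>

int drop_file_cache(const char *path)
{
	int fd = open(path, O_RDONLY);
	int ret;

	if (fd < 0)
		return -1;
	fsync(fd);	/* dirty pages would otherwise stay cached */
	ret = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	close(fd);
	return ret;
}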
Yeah, having this kind of information would be useful. But I am not sure
how much of this can benefit regular workloads - unless one is willing
to tweak things heavily. Bottom line is, you need a better strategy
for how you would use the information.
Thanks,
Badari
On Wed, 3 May 2006 14:45:03 +0800,
Wu Fengguang <[email protected]> wrote:
> Lubos Lunak also reminds me of SUSE's preload
> (http://en.opensuse.org/index.php?title=SUPER_preloading_internals)
> which is a user-land solution using strace to collect the info.
>
> And there's Andrea Arcangeli's "bootcache userspace logging" kernel
> patch(http://lkml.org/lkml/2004/8/6/216).
Just for completeness, Windows Vista will include an enhanced prefetcher
called (sic) SuperFetch. The idea behind it seems to be to analyze I/O
patterns and then "mirror" the most frequently used disk blocks onto
a USB flash drive; when the USB flash drive is plugged in,
the system will read those blocks from it, as if it were the hard drive,
the next time you run the app
(http://www.windowsitpro.com/Windows/Article/ArticleID/48085/48085.html)
Wu Fengguang wrote:
> Pre-caching reloaded ;)
> I/O ANALYSIS...
> SCHEME/GOAL(s)...
>
Some good analysis and ideas. I don't know if it is wanted, but I'd
like to add a 'few cents' referring to the pre-fetch mechanism in
XP, which addresses both boot and application prefetch and has the
benefit of showing measurable performance improvements (compare the
boot time of an NT4 system to XP; maybe a 5-8x performance boost?).
1. As you mention; reading files "sequentially" through the file
system is "bad" for several reasons. Areas of interest:
a) don't go through the file system. Don't waste time doing
directory lookups and following file-allocation maps; Instead,
use raw-disk i/o and read sectors in using device & block number.
b) Be "dynamic"; "Trace" (record (dev&blockno/range) blocks
starting ASAP after system boot and continuing for some "configurable"
number of seconds past reaching the desired "run-level" (coinciding with
initial disk quiescence). Save as "configurable" (~6-8?) number of
traces to allow finding the common initial subset of blocks needed.
c) Allow specification of max# of blocks and max number of "sections"
(discontiguous areas on disk);
d) "Ideally", would have a way to "defrag" the common set of blocks.
I.e. -- moving the needed blocks from potentially disparate areas of
files into 1 or 2 contiguous areas, hopefully near the beginning of
the disk (or partition(s)).
That's the area of "boot" pre-caching.
Next is doing something similar for "application" starts. Start tracing
when an application is loaded & observe what blocks are requested for
that app for the first 20 ("configurable") seconds of execution. Store
traces on a per-application basis. Again, it would be ideal if the
different files (blocks, really), needed by an application could be
grouped so that sequentially needed disk-blocks are stored sequentially
on disk (this _could_ imply the containing files are not contiguous).
Essentially, one wants to do for applications, the same thing one does
for booting. On small applications, the benefit would likely be negligible,
but on loading a large app like a windowing system, IDE, or database app,
multiple configuration files could be read into the cache in one large
read.
That's "application" pre-caching.
A third area -- that can't be easily done in the kernel, but would
require a higher skill level on the part of application and library
developers, is to move towards using "delay-loaded" libraries. In
Windows, it seems common among system libraries to use this feature.
An obvious benefit -- if certain features of a program are not used,
the associated libraries are never loaded. Not loading unneeded parts
of a program should speed up initial application load time, significantly.
I don't know where the "cross-over" point is, but moving to demand
loaded "so's" can cause extreme benefits for interactive usage. In
addition to load-time benefits, additional benefits are gained by not
wasting memory on unused libraries and program features.
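
For what it's worth, the rough ELF equivalent of a delay-loaded library is to
dlopen() it at the point of first use instead of linking it in (a sketch; the
library and symbol names below are made up, and the program must be linked
with -ldl):

/*
 * Delay-loaded library sketch: resolve libldap only when (and if) the LDAP
 * feature is actually used.
 */
#include <stdio.h>
#include <dlfcn.h>

int ldap_feature(const char *server)
{
	void *handle = dlopen("libldap.so.2", RTLD_NOW | RTLD_LOCAL);
	int (*do_lookup)(const char *);

	if (!handle) {
		fprintf(stderr, "LDAP support unavailable: %s\n", dlerror());
		return -1;
	}
	/* hypothetical entry point exported by the library */
	do_lookup = (int (*)(const char *))dlsym(handle, "my_ldap_lookup");
	if (!do_lookup) {
		dlclose(handle);
		return -1;
	}
	return do_lookup(server);
}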
In looking at the distro I use, many unused libraries are linked in
with commonly used programs. For a small office or home setup, I rarely
have a need for LDAP, internationalized libraries, & application support
for virtually any hardware & software configuration.
This isn't a kernel problem -- the kernel has dynamically loadable
modules to allow loading only of hardware & software drivers that are
needed in a particular kernel. User applications that I have been
exposed to in Linux usually don't have this adaptability -- virtually
everything is loaded at execution time -- not as needed during
program execution.
I've seen this feature used on Unix systems to dynamically present
feature interfaces depending on the user's software configuration. On
linux, I more often see a "everything, including the kitchen sink"
approach, where every possible software configuration is supported
via "static", shared libraries that must be present and loaded into
memory before the program begins execution.
This has the potential to have a greater benefit as the application
environment becomes more complex if you think about how the number
of statically loaded, sharable libraries have increased (have seen
addition of ldap, pam, and most recently, selinux libraries that are
required for loading before application execution).
Good luck in speeding these things up. It might require some
level of cooperation in different areas (kernel, fs utils,
distro-configuration, application design and build...etc).
-linda
On Tue, May 02, 2006 at 10:09:36AM +0200, Jens Axboe wrote:
> Your patch isn't exactly pretty, but that is fixable. I'm more
> interested in how you'd control that sanely.
I have a quick and dirty hack called BootCache (only afterwards did I notice
that Apple used the same name) that snoops all the read requests during
startup and uses that to populate a file which gets read in to populate
the cache on subsequent boots. The results are impressive: eliminating
the seeks (since the file is contiguous) results in 15s (out of the 45s
from init starting to the login prompt) reduction in boot time on ext3.
Prefetch without reordering isn't particularly effective, as the cost of
the seeks to read in the inodes and file data tend to block critical path
io in the startup scripts.
I hope to release the code sometime soon and rework it to be a better fit
for merging with the kernel -- it should be possible to use the block io
tracing facility to log the blocks required and then feed that back into
the filesystem to reallocate blocks in the filesystem. That would eliminate
the messy part of bootcache that has to snoop block writes to maintain the
cache. This is all part of what I'm aiming to present at OLS.
-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.
On Wed, 2006-05-03 at 20:14 +0200, Diego Calleja wrote:
> Just for completeness, windows vista will include a enhanced prefetcher
> called (sic) SuperFetch. The idea behind it seems to be to analyze I/O
> patterns and then "mirror" the most frequently used disk blocks into
> the USB flash drive; so if when the usb flash drive is plugged in
> the system will read those blocks from it as it was the hard drive
> the next time you run the app
> (http://www.windowsitpro.com/Windows/Article/ArticleID/48085/48085.html)
Linux should be able to do something like this using unionfs. It could
be worthwhile to try it with one of the very fastest flash cards or USB
drives.
With slower cards and USB keys it's more of a loss, unless the faster seek
speed can make up for it, because sequential hard drive access is
faster.
For comparison, a OCZ USB 2.0 Flash Drive with dual channel works at
about 30 MB/s. One of my 7,200 RPM SATA desktop drives does 45 MB/s. A
15k SCSI drive can do over 60 MB/s.
It'd be great for laptops though. My slug of a laptop drive does 20
MB/s on a good day.
--
Zan Lynx <[email protected]>
Linus Torvalds wrote:
> Now, regardless of the other issues people have brought up, I'd like to
> say that I think this is broken.
>
> Doing prefetching on a physical block basis is simply not a valid
> approach, for several reasons:
>
> - it misses several important cases (you can surely prefetch over NFS
> too)
---
I don't think fetching over a network is a case that needs
to be "optimized". I don't think any there would be sufficient gain
in "pre-fetching" over a network (even if it was one long network read)
to outweigh network latencies. I would strongly suggest this be
ignored.
> - it gives you only a very limited view into what is actually going on.
---
??? In what way? I don't think we need a *complex* view of what
is going on. If anything, complexity can be a performance drag (though
in some cases, an excellent algorithm can outweigh the extra computational
complexity).
> - it doesn't much allow you to _fix_ any problems, it just allows
you to
> try to paper them over.
----
It isn't designed to be the one "true" fix for every problem known.
It's meant to provide speedups for a common subset of io-requests.
If one comes up with the one-uber algorithm in three years, then
replace this one, meanwhile...
> - it's useless anyway, since pretty all kernel caching is based on
> virtual caches, so if you "pre-read" the physical buffers, it won't
> help: you'll just waste time reading the data into a buffer that will
> never be used, and when the real request comes in, the read will
> be done _again_.
----
This is the big drawback/problem. But this leads me to an obvious
question. If I read a file into memory, it is cached, _seemingly_,
relative to the filename. If I read the the same physical block into
memory using _block_-i/o, relative to the device, the claim is that
two separate reads are done into the block cache? Can't this result
in inconsistent results between the two copies? If I change the
file relative block, how is this coordinated with the device-relative
block? Surely there is some block coelescing done at some point to
ensure the two different views of the block are consistent?
If not, I'd tend to view this as a "bug".
But presuming this is "dealt with" at some layer, can't the
same information used to merge the two blocks be used to identify that
the same block is already in memory?
If file-blockno addresses can't be converted to device-blockno's,
how can the blocks be written to disk?
If the file-blockno's are convertible, can't this be used to
compare which physical blocks are already in the cache? Certainly this
would be faster than the milliseconds it could take to do another
physical read, no?
> (a) it makes it a hell of a lot more readable, and the user gets a
> lot more information that may make him see the higher-level issues
> involved.
---
I'm not sure this is a benefit that outweighs the elimination of
having to go through the file system (following directory chains,
indirect blocks, extents, etc).
> (b) it's in the form that we cache things, so if you read-ahead in
> that form, you'll actually get real information.
----
This may not be a great reason. Using the fact that "this is the
way things are already done", may not allow optimal implementation of
faster algorithms.
> (c) it's in a form where you can actually _do_ something about
things
> like fragmentation etc ("Oh, I could move these files all to a
> separate area")
----
This could theoretically be done solely on a block basis. While
file pointers would have to be updated to point to "out-of-sequence"
blocks, it might be more optimal to store _some_ files discontiguously
in order to pack the desired blocks into a (relatively) "small" area
that can be read into memory in some minimal number of physical i/o
requests.
> - it will miss any situation where a filesystem does a read some other
> way. Notably, in many loads, the _directory_ accesses are the
important
> ones, and if you want statistics for those you'd often have to do
that
> separately (not always - some of the filesystems just use the same
> page reading stuff).
---
Recording physical block reads could automatically include this
situation.
>
> The downsides basically boil down to the fact that it's not as clearly
> just one single point. You can't just look at the request queue and see
> what physical requests go out.
---
You also have the downside of having to go through the file-system
and following each path and file-chain in order to read a potentially
discontiguous section of a file. It also, _seemingly_, has the downside
of reading in "entire files" when only subsections of files may be
needed on startup.
>
> NOTE! You can obviously do both, and try to correlate one against the
> other, and you'd get the best possible information ("is this seek
because
> we started reading another file, or is it because the file itself is
> fragmented" kind of stuff). So physical statistics aren't meaningless.
> They're just _less_ important than the virtual ones, and you should do
> them only if you already do virtual stats.
---
Sounds like you might need to have 2 sets of "addresses" / block:
file relative block addresses, and device relative block addresses.
-l
On Wed, 3 May 2006, Linda Walsh wrote:
>
> > - it gives you only a very limited view into what is actually going on.
>
> ??? In what way? I don't think we need a *complex* view of what
> is going on.
The block-based view is a very simple one, but it's _too_ simple. It
doesn't actually tell the user what is happening. It doesn't tell you why
the request happened in the first place, so it leaves you no way to really
do anything sane about it.
You can't prefetch (because the indexing is wrong), and you can't actually
even analyze it (because you don't know any background to what you're
seeing). In short, you can't _do_ anything with it.
Linus
On Wed, 03 May 2006 17:39:51 -0600,
Zan Lynx <[email protected]> wrote:
> Linux should be able to do something like this using unionfs. It could
> be worthwhile to try it with one of the very fastest flash cards or USB
> drives.
BTW, I forgot that the next Intel laptops (and apparently, mactel laptops)
will have a small flash memory built in for this very purpose. According
to an article at extremetech.com, the flash memory requires an "I/O
controller"; IOW, it is not transparent and seems to need support from the
OS (although I guess that vista's "superprefetch" was designed precisely
for this hardware). Apparently, the main purpose here is to improve the
battery life (disks seeking for too many seconds can eat lot of power,
I guess). "Prefetchers" are becoming trendy, it seems
(page 14) http://www.intel.com/pressroom/kits/events/idfspr_2006/20060307_MaloneyTranscript.pdf
> With slower cards and USB keys its more of a loss unless the faster seek
> speed can make up for it, because sequential hard drive access is
> faster.
That's where the gain is; if hard drive access were sequential, people
wouldn't be talking about prefetchers. My SATA desktop drive also does
~45 MB/s, but it doesn't go beyond 2 MB/s when seeking.
On Thursday 04 May 2006 03:31, Linus Torvalds wrote:
> On Wed, 3 May 2006, Linda Walsh wrote:
> > > - it gives you only a very limited view into what is actually going
> > > on.
> >
> > ??? In what way? I don't think we need a *complex* view of what
> > is going on.
>
> The block-based view is a very simple one, but it's _too_ simple. It
> doesn't actually tell the user what is happening. It doesn't tell you why
> the request happened in the first place, so it leaves you no way to really
> do anything sane about it.
>
> You can't prefetch (because the indexing is wrong), and you can't actually
> even analyze it (because you don't know any background to what you're
> seeing). In short, you can't _do_ anything with it.
Please forgive me if I'm dumb, but what about the following reasoning:
The user as such (the person just *using* the computer) won't do anything
with "program xyz uses blocks a, b, and c."
But it wouldn't help much either if he was told "program xyz uses files a, b,
and d." That's a task for package maintainers and distributions, to optimize
the *number* of accesses.
So the simple information of which block numbers we need would not benefit the
users (or the distribution developers), but it's enough for caching things:
ascending block numbers on disk can be read very fast, as the disk needs
little or no seeking. That's even true for stripes and mirrors. (I grant you
that there are complicated setups out there, but these could be handled
similarly.)
If we know which blocks are read (directories, inodes, data), an early
userspace application (possibly even from the initrd) could
- read these blocks into a ramdisk device (linearly),
- define a device-mapper mirror of
  - the original root partition, and
  - a sparse device composed from parts of the ramdisk, interleaved with
    non-existing (error-producing) ranges, and
- set up that device-mapper device as the root device.
- At a specified point during startup the mapping gets switched to just
  a linear mapping of the root partition, and the ramdisk gets
  discarded.
Example:
--------
N specifies a needed block, the numbers other blocks.
Harddisk: 1 2 3 N N N 7 8 N 10 N 12 13 14 N 16 N ...
RAM-Disk contains 4 5 6 9 11 15 17
The mirror set would look like this,
with E representing an error-returning block:
Harddisk: 1 2 3 N N N 7 8 N 10 N 12 13 14 N 16 N ...
mapped from RAM-Disk: E E E 4 5 6 E E 9 E 11 E E E 15 E 17 ...
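To make the mapping in the example concrete, here is a hypothetical
user-space sketch of the lookaside decision (plain C, not an actual
device-mapper target); it assumes the needed blocks were copied into the
ramdisk in ascending order, exactly as in the layout above:

/*
 * Hypothetical illustration of the lookaside decision above (not a real
 * device-mapper target): the "N" blocks were copied into the ramdisk in
 * ascending order; everything else falls through to the hard disk.
 */
#include <stdio.h>

static const unsigned long prefetched[] = { 4, 5, 6, 9, 11, 15, 17 };
static const unsigned long nr_prefetched =
	sizeof(prefetched) / sizeof(prefetched[0]);

/*
 * Return the block's position inside the linearly-filled ramdisk image,
 * or -1 if the block was not prefetched (an "E" slot) and must be read
 * from the hard disk.
 */
static long lookaside(unsigned long block)
{
	unsigned long lo = 0, hi = nr_prefetched;

	while (lo < hi) {			/* binary search: list is sorted */
		unsigned long mid = lo + (hi - lo) / 2;

		if (prefetched[mid] == block)
			return (long)mid;	/* offset in the ramdisk, in blocks */
		if (prefetched[mid] < block)
			lo = mid + 1;
		else
			hi = mid;
	}
	return -1;				/* not prefetched: go to the disk */
}

int main(void)
{
	unsigned long b;

	for (b = 1; b <= 17; b++)
		printf("block %2lu -> %s\n", b,
		       lookaside(b) >= 0 ? "ramdisk" : "hard disk");
	return 0;
}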
Advantages:
- As long as there's a good chance that the needed blocks are read from the
ramdisk (which was filled linearly, with 20-60MB/sec as reported by others),
*all* disk-access times (inodes, directories, data) should be nearly zero.
Of course, there will be blocks that have to be read now which weren't
needed during the last boot; those will have to be taken from the hard disk,
but the majority should come from the ramdisk.
Problems:
- We'd need to define a priority in the mirror set, so that *everything*
possible is taken from the ramdisk
- There has to be a way to *not* disable the ram-disk when an error-block is
read.
That's completely doable in userspace (except for knowing which blocks have to
be read), and doesn't sound like much overhead to me.
Suggestions, ideas?
Does that work? Does it not?
Regards,
Phil
> Ascending block numbers on disk can be read very fast, as the disk needs no or
> less seeking. That's even true for stripes and mirrors. (I grant you that
> there are complicated setups out there, but these could be handled similar.)
>
btw this all really spells out that you may want to do this as a device
mapper thing; eg have a device mapper module that can do "lookaside" to
a different order/mirror block whatever. The filesystem just doesn't
want to know; do it at the DM level ;) That also solves the entire
caching layering problem etc ;)
Wu Fengguang wrote:
> Rapid linux desktop startup through pre-caching
>
>
>MOTIVATION
>
> KDE, Gnome, OpenOffice, and Firefox all take too long to start up.
> Boot time pre-caching seems to be the single most straightforward and
> effective way to improve it and make linux desktop experience more
> comfortable. It is a great pleasure for me to take up the work.
>
>
Actually, the best way is to not run so much software. And yes,
that is an option. I won't say no to an improved kernel too, though. :-)
The apps mentioned are popular, but few need *all* of them.
One can do without KDE and Gnome and run a nice lightweight
window manager instead. Take the kde/gnome performance hit
only when you actually need some kde/gnome app. Not every day.
A nice window manager like icewm or fluxbox brings the login
delay down to 3s or so for me.
OpenOffice has lightweight alternatives for every task
(abiword, lyx, gnumeric, ...). Strange that this bloated sw is
as popular as it is, given the many alternatives. It is not something
I use every month, and I use linux exclusively for my office tasks.
Another alternative is to profile the slow apps and improve them.
Fix algorithms, optimize stuff.
The slow boot is fixable by:
1) Run boot scripts in parallel instead of sequentially - somewhat
experimental, but it helps. Especially if you can bring up X before the
slowest stuff completes.
2) Don't run what you don't use/need! Don't install everything and the
kitchen sink just because it is free software. I am guilty of installing
too much myself, so I suffer a 40-second bootup time. But I don't reboot
my office pc every week; normally I only have that 3s login delay.
Helge Hafting
On Tue, May 02, 2006 at 09:10:01PM +0200, Pavel Machek wrote:
> > $ echo "file /dev/hda1" > /proc/filecache
> > $ cat /proc/filecache
> > # file /dev/hda1
> > # mask 0
> > #
> > # idx len
> > 0 24
> > 48 2
> > 53 5
> > ......
>
> Could we use this instead of blockdev freezing/big suspend image
> support? It should permit us to resume quickly (with small image), and
> then do readahead. ... that will give us usable machine quickly, still
> very responsive desktop after resume?
Badari's usage case suggests that on suspend we can
- first invoke a user-land tool to do all the cache
  status-logging/analyzing/selective-dropping jobs,
- then let the kernel write all the remaining caches (made up of many
  small chunks) to the suspend image,
and do the reverse on resume.
With that we move all the strategies to userspace.
Wu
On Thu, 2006-05-04 at 20:14 +0800, Wu Fengguang wrote:
> On Thu, May 04, 2006 at 09:33:24AM +0200, Arjan van de Ven wrote:
> >
> >
> > > Ascending block numbers on disk can be read very fast, as the disk needs no or
> > > less seeking. That's even true for stripes and mirrors. (I grant you that
> > > there are complicated setups out there, but these could be handled similar.)
> > >
> >
> >
> > btw this all really spells out that you may want to do this as a device
> > mapper thing; eg have a device mapper module that can do "lookaside" to
> > a different order/mirror block whatever. The filesystem just doesn't
> > want to know; do it at the DM level ;) That also solves the entire
> > caching layering problem etc ;)
>
> I guess some big corps might want to install such a layer into their
> storage products ;)
maybe, could be.
Doing it at this level also has the advantage that fs metadata becomes
seek-free as well; since that is mostly at fixed locations inside the fs,
reordering it at the fs level is really hard.. at the dm level.. easy.
On Wed, May 03, 2006 at 02:45:53PM -0700, Linda Walsh wrote:
> 1. As you mention; reading files "sequentially" through the file
> system is "bad" for several reasons. Areas of interest:
> a) don't go through the file system. Don't waste time doing
> directory lookups and following file-allocation maps; Instead,
> use raw-disk i/o and read sectors in using device & block number.
Sorry, it does not fit Linux's cache model.
> b) Be "dynamic"; "Trace" (record (dev&blockno/range) blocks
> starting ASAP after system boot and continuing for some "configurable"
> number of seconds past reaching the desired "run-level" (coinciding with
> initial disk quiescence). Save as "configurable" (~6-8?) number of
> traces to allow finding the common initial subset of blocks needed.
It is an alternative way of doing the same job: more precise, but with more
complexity and more overhead. However, the 'blockno' way is not so
tasteful.
> c) Allow specification of max# of blocks and max number of "sections"
> (discontiguous areas on disk);
Good point, will do it in my work.
> d) "Ideally", would have a way to "defrag" the common set of blocks.
> I.e. -- moving the needed blocks from potentially disparate areas of
> files into 1 or 2 contiguous areas, hopefully near the beginning of
> the disk (or partition(s)).
> That's the area of "boot" pre-caching.
I guess poor man's defrag would be good enough for the seeking storm.
> Next is doing something similar for "application" starts. Start tracing
> when an application is loaded & observe what blocks are requested for
> that app for the first 20 ("configurable") seconds of execution. Store
> traces on a per-application basis. Again, it would be ideal if the
> different files (blocks, really), needed by an application could be
> grouped so that sequentially needed disk-blocks are stored sequentially
> on disk (this _could_ imply the containing files are not contiguous).
>
> Essentially, one wants to do for applications, the same thing one does
> for booting. On small applications, the benefit would likely be negligible,
> but on loading a large app like a windowing system, IDE, or database app,
> multiple configuration files could be read into the cache in one large
> read.
>
> That's "application" pre-caching.
Yes, it is a planned feature, will do it.
> A third area -- that can't be easily done in the kernel, but would
> require a higher skill level on the part of application and library
> developers, is to move towards using "delay-loaded" libraries. In
> Windows, it seems common among system libraries to use this feature.
> An obvious benefit -- if certain features of a program are not used,
> the associated libraries are never loaded. Not loading unneeded parts
> of a program should speed up initial application load time, significantly.
Linux already does lazy loading for linked libs. The only pitfall
is that /lib/ld-linux.so.2 seems to touch the first 512B of data of every
lib before doing mmap():
% strace date
execve("/bin/date", ["date"], [/* 41 vars */]) = 0
uname({sys="Linux", node="lark", ...}) = 0
brk(0) = 0x8053000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f0e000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f0d000
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=77643, ...}) = 0
mmap2(NULL, 77643, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7efa000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/tls/librt.so.1", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0`\35\0\000"..., 512) = 512
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~HERE~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fstat64(3, {st_mode=S_IFREG|0644, st_size=30612, ...}) = 0
mmap2(NULL, 29264, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7ef2000
mmap2(0xb7ef8000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6) = 0xb7ef8000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/tls/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\260O\1"..., 512) = 512
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~HERE~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fstat64(3, {st_mode=S_IFREG|0755, st_size=1270928, ...}) = 0
mmap2(NULL, 1276892, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7dba000
mmap2(0xb7ee8000, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x12e) = 0xb7ee8000
mmap2(0xb7ef0000, 7132, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7ef0000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/tls/libpthread.so.0", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\340G\0"..., 512) = 512
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~HERE~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fstat64(3, {st_mode=S_IFREG|0755, st_size=85770, ...}) = 0
mmap2(NULL, 70104, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7da8000
mmap2(0xb7db6000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xd) = 0xb7db6000
mmap2(0xb7db8000, 4568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7db8000
close(3) = 0
These header reads can lead to more seeks, and can also considerably disturb
the readahead logic.
Wu
On Thu, May 04, 2006 at 09:33:24AM +0200, Arjan van de Ven wrote:
>
>
> > Ascending block numbers on disk can be read very fast, as the disk needs no or
> > less seeking. That's even true for stripes and mirrors. (I grant you that
> > there are complicated setups out there, but these could be handled similar.)
> >
>
>
> btw this all really spells out that you may want to do this as a device
> mapper thing; eg have a device mapper module that can do "lookaside" to
> a different order/mirror block whatever. The filesystem just doesn't
> want to know; do it at the DM level ;) That also solves the entire
> caching layering problem etc ;)
I guess some big corps might want to install such a layer into their
storage products ;)
On Thu, 4 May 2006, Wu Fengguang wrote:
>
> On Wed, May 03, 2006 at 10:28:00AM -0700, Badari Pulavarty wrote:
> > While ago, I hacked up similar /proc interface
> > echo "<filesystem-name>" > /proc/pagecache-usage
> >
> > Which showed pagecache usage of every file in that filesystem
> > (filename, #num pages). My main objective was to shoot down pagecache
> > for all the files in a given filesystem. I ended up using it to do
> > posix_fadvise(POSIX_FADV_DONTNEED) on those files. (Initially, tried
> > to do this without this, by doing fadvise() on all files in the
> > filesystem - but ended up bloating up inode and dcache).
>
> Ah, I have not thought of the possibility of querying the page cache
> just to drop some caches -- a subset of sysctl_drop_caches functions.
Actually, I did something even simpler for a totally one-off thing: you
don't actually even need to drop caches or track pretty much anything,
it's sufficient for most analysis to just have a timestamp on each page
cache, and then have some way to read out the current cached contents.
You can then use the timestamps to get a pretty good idea of what order
things happened in.
You'll lose the temporary file information (files that are created and
deleted), because deleting a file will also flush the page cache for it,
but temporary files tend to not be hugely interesting.
The really nice thing was that you don't even have to set the timestamp in
any complex place: you do it at page _allocation_ time. That automatically
gets the right answer for any page cache page, and you can do it in a
single place.
Linus
On Thu, 2006-05-04 at 08:03 -0700, Linus Torvalds wrote:
>
> On Thu, 4 May 2006, Wu Fengguang wrote:
> >
> > On Wed, May 03, 2006 at 10:28:00AM -0700, Badari Pulavarty wrote:
> > > While ago, I hacked up similar /proc interface
> > > echo "<filesystem-name>" > /proc/pagecache-usage
> > >
> > > Which showed pagecache usage of every file in that filesystem
> > > (filename, #num pages). My main objective was to shoot down pagecache
> > > for all the files in a given filesystem. I ended up using it to do
> > > posix_fadvise(POSIX_FADV_DONTNEED) on those files. (Initially, tried
> > > to do this without this, by doing fadvise() on all files in the
> > > filesystem - but ended up bloating up inode and dcache).
> >
> > Ah, I have not thought of the possibility of querying the page cache
> > just to drop some caches -- a subset of sysctl_drop_caches functions.
>
> Actually, I did something even simpler for a totally one-off thing: you
> don't actually even need to drop caches or track pretty much anything,
> it's sufficient for most analysis to just have a timestamp on each page
> cache, and then have some way to read out the current cached contents.
>
> You can then use the timestamps to get a pretty good idea of what order
> things happened in.
>
> You'll lose the temporary file information (files that are created and
> deleted), because deleting a file will also flush the page cache for it,
> but temporary files tend to not be hugely interesting.
>
> The really nice thing was that you don't even have to set the timestamp in
> any complex place: you do it at page _allocation_ time. That automatically
> gets the right answer for any page cache page, and you can do it in a
> single place.
Sounds like an interesting idea.
The problem I was trying to address was that one of our large
database customers was complaining about the variation (degradation)
in database performance when other applications like backup, ftp, scp,
and tar were running on other filesystems on the system. The VM ends up
swapping out some of the database process data to make room for these
reads/writes. They wanted a way to completely flush out pagecache data
for a given filesystem to see if it improves their situation (although
"swappiness" should in *theory* do the same thing). In fact, they want a
way to control the amount of pagecache each filesystem populates, through
/proc :(
My patch was an initial attempt to unwind the mysteries of VM behaviour :)
(The VM may be doing the right thing from the kernel's perspective - but
the customer was complaining that it's thrashing their workload).
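For reference, the per-file shoot-down described above essentially comes
down to posix_fadvise(POSIX_FADV_DONTNEED); a minimal sketch (not Badari's
actual patch, and with the file-list source left out) might look like this:

/*
 * Minimal sketch: drop the cached pages of the files named on the
 * command line with posix_fadvise(POSIX_FADV_DONTNEED).  Where the file
 * list comes from (e.g. a pagecache-usage dump) is left out for brevity.
 */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	int i;

	for (i = 1; i < argc; i++) {
		int fd = open(argv[i], O_RDONLY);
		int err;

		if (fd < 0) {
			perror(argv[i]);
			continue;
		}
		/* offset 0, len 0 means "the whole file" */
		err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
		if (err)
			fprintf(stderr, "%s: fadvise: %s\n", argv[i], strerror(err));
		close(fd);
	}
	return 0;
}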
Thanks,
Badari
Wu Fengguang wrote:
> On Wed, May 03, 2006 at 02:45:53PM -0700, Linda Walsh wrote:
>
>> 1. As you mention; reading files "sequentially" through the file
>> system is "bad" for several reasons. Areas of interest:
>> a) don't go through the file system. Don't waste time doing
>> directory lookups and following file-allocation maps; Instead,
>> use raw-disk i/o and read sectors in using device & block number.
>>
>
> Sorry, it does not fit in the linux's cache model.
>
---
Maybe Linux's cache model needs to be _improved_ to better
allow for hardware acceleration?^** It is the job of the OS to provide
sufficiently low-level facilities to allow optimal use of the system
hardware, while at the same time providing enough high-level facilities
to support applications that don't require such tuning.
>> b) Be "dynamic"; "Trace" (record (dev&blockno/range) blocks
>> starting ASAP after system boot and continuing for some "configurable"
>> number of seconds past reaching the desired "run-level" (coinciding with
>> initial disk quiescence). Save as "configurable" (~6-8?) number of
>> traces to allow finding the common initial subset of blocks needed.
>>
>
> It is a alternative way of doing the same job: more precise, with more
> complexity and more overhead. However the 'blockno' way is not so
> tasteful.
>
----
Maybe not so tasteful to you, but it is an alternate path that
circumvents unnecessary i/o redirections. An additional optimization
is to have a "cache" of frequently used "system applications" that have
their pathnames registered, so no run-time searching of the user PATH
is necessary.
>> c) Allow specification of max# of blocks and max number of "sections"
>> (discontiguous areas on disk);
>>
>
> Good point, will do it in my work.
>
>
>> d) "Ideally", would have a way to "defrag" the common set of blocks.
>> I.e. -- moving the needed blocks from potentially disparate areas of
>> files into 1 or 2 contiguous areas, hopefully near the beginning of
>> the disk (or partition(s)).
>> That's the area of "boot" pre-caching.
>>
>
>
> I guess poor man's defrag would be good enough for the seeking storm.
>
---
I disagree. The "poor man's defrag, as you call it, puts entire files
into contiguous sections -- each of which will have to be referenced by
following
a PATH and directory chain. The optimization I'm talking about would
only store
the referenced data-blocks actually used in the files. This would allow a
directy read into memory of a *bunch* of needed sectors while not including
sectors from the same files that are not actually read from during the
boot *or*
app. initialization process.
>> That's "application" pre-caching.
>>
> Yes, it is a planned feature, will do it.
>
Très cool.
>> A third area -- that can't be easily done in the kernel, but would
>> require a higher skill level on the part of application and library
>> developers, is to move towards using "delay-loaded" libraries. In
>> Windows, it seems common among system libraries to use this feature.
>> An obvious benefit -- if certain features of a program are not used,
>> the associated libraries are never loaded. Not loading unneeded parts
>> of a program should speed up initial application load time, significantly.
>>
> Linux already does lazy loading for linked libs. The only one pitfall
> is that /lib/ld-linux.so.2 seems to touch the first 512B data of every
> libs before doing mmap():
>
----
Yes -- and this is the "killer". If my app may "potentially" use
50 run-time libs, but in any single invocation only uses 20 of those
libs, the page tables and fixups for the unused 30 libraries don't
need to be done. In fact, in some configurations, those 30 libs may
not even need to be present on disk!
Typical example - "Active Directory"; I don't use it. I don't
need the libraries on my system or need them "resolved" at runtime.
It would be far preferable if programs would only load those
libraries actually used at run-time -- and load them *dynamically*,
as needed (for libraries that may not actually be called or
used). This way, the initial program start-up time is reduced to
that of loading the "core" set of libraries needed to run the
program. Optional features are loaded from disk as those features
are called for. Delays from loading optional libraries one or two
at a time, interactively, are not likely to be noticed, but if you
load all of those "optional" libraries prior to execution,
the sum total will be noticed in an interactive environment.
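For what it's worth, an application can already get much of this behaviour
on Linux today by loading optional libraries explicitly with
dlopen(3)/dlsym(3) instead of linking against them at build time; a minimal
sketch (the library and symbol names below are made up):

/*
 * Sketch of on-demand library loading with dlopen(3)/dlsym(3).
 * "libfeature.so" and feature_run() are made-up names; build with -ldl.
 */
#include <dlfcn.h>
#include <stdio.h>

static int run_optional_feature(void)
{
	void *handle;
	int (*feature_run)(void);

	/* The library is not loaded from disk until this point,
	 * i.e. until the feature is actually used. */
	handle = dlopen("libfeature.so", RTLD_LAZY);
	if (!handle) {
		fprintf(stderr, "feature unavailable: %s\n", dlerror());
		return -1;
	}

	feature_run = (int (*)(void))dlsym(handle, "feature_run");
	if (!feature_run) {
		fprintf(stderr, "bad library: %s\n", dlerror());
		dlclose(handle);
		return -1;
	}

	return feature_run();	/* keep the handle open for later calls */
}

int main(void)
{
	/* The optional library is only touched if the user asks for the feature. */
	return run_optional_feature();
}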
^** -- "somewhere", it _seems_, the physical, device relative sector
must be resolved. If it is not, how is i/o-block buffer consistency
maintained when the user references "device "hda", sector "1000", then
the same sector as "hda1", sector 500, and also as file "/etc/passwd",
sector 0? _If_ cache consistency is maintained (and I _assume_^**2
it is), they all need to be mapped to a physical sector at some point.
^**2 - Use of assumption noted; feel free to correct me and tell me
this isn't the case if linux doesn't maintain disk-block cache
consistency.
-linda
On Thu, May 04, 2006 at 08:03:50AM -0700, Linus Torvalds wrote:
> Actually, I did something even simpler for a totally one-off thing: you
> don't actually even need to drop caches or track pretty much anything,
> it's sufficient for most analysis to just have a timestamp on each page
> cache, and then have some way to read out the current cached contents.
>
> You can then use the timestamps to get a pretty good idea of what order
> things happened in.
> The really nice thing was that you don't even have to set the timestamp in
> any complex place: you do it at page _allocation_ time. That automatically
> gets the right answer for any page cache page, and you can do it in a
> single place.
Looks nice. A detailed scheme might be:
1) ctime/atime for each radix tree node (or cluster of pages)
It seems to be a good accuracy/overhead compromise,
and is perfect for the most common case of sequential accesses.
struct radix_tree_node {
+ unsigned long ctime, atime;
}
radix_tree_node_alloc()
{
+ node->ctime = some_virtual_time();
}
radix_tree_lookup_slot()
{
+ node->atime = some_virtual_time();
}
2) eviction-time for each recently evicted page
Store eviction-time _in place_ in the slot that used to store the
page's address, with minimal space/time impact.
+#define SLOT_IS_EVICTION_TIME 1
radix_tree_delete()
{
- pathp->node->slots[pathp->offset] = NULL;
+ pathp->node->slots[pathp->offset] =
+ some_virtual_time() | SLOT_IS_EVICTION_TIME;
}
Regards,
Wu
On Thu, May 04, 2006 at 11:57:21AM -0700, Linda Walsh wrote:
> Wu Fengguang wrote:
> >> b) Be "dynamic"; "Trace" (record (dev&blockno/range) blocks
> >>starting ASAP after system boot and continuing for some "configurable"
> >>number of seconds past reaching the desired "run-level" (coinciding with
> >>initial disk quiescence). Save as "configurable" (~6-8?) number of
> >>traces to allow finding the common initial subset of blocks needed.
> >>
> >
> >It is a alternative way of doing the same job: more precise, with more
> >complexity and more overhead. However the 'blockno' way is not so
> >tasteful.
> >
> ----
> Maybe not so tasteful to you, but it is an alternate path that
> circumvents unnecessary i/o redirections. An additional optimization
Yes, it's a trade-off between the overhead and the generality that the
indirections bring.
> is to have a "cache" of frequently used "system applications" that have
> their pathnames registered, so no run-time searching of the user PATH
> is necessary.
The facility is already there: it is called dcache.
> >I guess poor man's defrag would be good enough for the seeking storm.
> >
> ---
> I disagree. The "poor man's defrag, as you call it, puts entire files
> into contiguous sections -- each of which will have to be referenced by
> following
> a PATH and directory chain. The optimization I'm talking about would
> only store
> the referenced data-blocks actually used in the files. This would allow a
> directy read into memory of a *bunch* of needed sectors while not including
> sectors from the same files that are not actually read from during the
> boot *or*
> app. initialization process.
The seek storm is mostly caused by small files (that are read as a whole)
and the corresponding scattered dir/inode buffers. By moving these
files into one single directory, both the file data and inode buffers are
kept contiguous enough.
> ^** -- "somewhere", it _seems_, the physical, device relative sector
> must be resolved. If it is not, how is i/o-block buffer consistency
> maintained when the user references "device "hda", sector "1000", then
> the same sector as "hda1", sector 500, and also as file "/etc/passwd",
> sector 0? _If_ cache consistency is maintained (and I _assume_^**2
> it is), they all need to be mapped to a physical sector at some point.
>
> ^**2 - Use of assumption noted; feel free to correct me and tell me
> this isn't the case if linux doesn't maintain disk-block cache
> consistency.
The linux cache model is something like:
physical drive file /dev/hda1
| <-----------------> |
file /bin/ls | |
|<---------------------> | |
|<---------------------> | |
| <-----------------> |
| |
| |
file /boot/vmlinuz | |
|<---------------------> | |
| | |
| | |
|<---------------------> | |
| <-----------------> |
| |
| |
| |
As opposed to this one:
file /dev/hda1 physical drive
| <-----------------> |
file /bin/ls | |
|<---------------------> | |
|<---------------------> | |
| <-----------------> |
| |
| |
file /boot/vmlinuz | |
|<---------------------> | |
| | |
| | |
|<---------------------> | |
| <-----------------> |
| |
| |
| |
Wu
On Tue, 2 May 2006, Linus Torvalds wrote:
> On Tue, 2 May 2006, Wu Fengguang wrote:
> >
> > 4 files changed, 48 insertions(+), 2 deletions(-)
>
> So I would _seriously_ claim that the place to do all the statistics
> allocation is in anything that ends up having to call "->readpage()", and
> do it all on a virtual mapping level.
Why not simply read everything in, a whole file at a time,
at boot time, while we still have enough free memory?
We can have a small modification to the readahead code
to read in the whole file on the first read or fault,
or maybe even on open.
Once the system is done booting, it can switch this
bootup readahead mode off through a tunable in /proc.
If we are booting on a system with not enough
memory to load everything file-at-a-time, the bootup
scripts can switch this off earlier (or not switch
it on at all).
The kernel modifications needed to make this work
are minimal. It might need some tweaking so we don't
try to read in truly enormous files, but that is easy.
Does this sound reasonable ?
--
All Rights Reversed
On Wed, May 03, 2006 at 06:20:29PM -0400, Rik van Riel wrote:
> Why not simply read everything in a whole file at a time
> at boot time, while we still have enough free memory ?
Sure it helps, though it may be unnecessary once we can rely on userspace
pre-caching tools to do the job in a more accurate and efficient way.
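For illustration only, such a userspace tool could be little more than a
loop over a traced file list using the Linux-specific readahead(2) call;
a rough sketch (the file-list source and the 64MB size cap are arbitrary):

/*
 * Sketch of a userspace pre-caching helper (not the actual tool): pull
 * each regular file named on stdin into the page cache with the
 * Linux-specific readahead(2) call.  The 64MB per-file cap is an
 * arbitrary safeguard against truly enormous files.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAX_PREFETCH	(64UL << 20)	/* 64MB, arbitrary */

int main(void)
{
	char path[4096];

	while (fgets(path, sizeof(path), stdin)) {
		struct stat st;
		int fd;

		path[strcspn(path, "\n")] = '\0';	/* strip the newline */
		fd = open(path, O_RDONLY);
		if (fd < 0 || fstat(fd, &st) < 0) {
			if (fd >= 0)
				close(fd);
			continue;
		}
		if (S_ISREG(st.st_mode))
			readahead(fd, 0, st.st_size > (off_t)MAX_PREFETCH ?
					 MAX_PREFETCH : (size_t)st.st_size);
		close(fd);
	}
	return 0;
}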
The main concern would be inter-file/inter-buffer readahead.
I have some data to support it:
% wc -l hda* startup.files
1743 hda2_buffers <== my /
335 hda3_buffers <== my /home
1130 startup.files
3208 total
The above cache sizes sum up to ~110MB over ~3200 entries, so it's about
36KB per seek.
Thanks,
Wu
On Tuesday 02 May 2006 11:53, Wu Fengguang wrote:
> On Tue, May 02, 2006 at 10:30:17AM +0200, Arjan van de Ven wrote:
> > one interesting thing that came out of the fedora readahead work is that
> > most of the bootup isn't actually IO bound. My test machine for example
> > can load all the data into ram in about 10 seconds, so that the rest of
> > the boot is basically IO-less. But that still takes 2 minutes....
> > So I'm not entirely sure how much you can win by just attacking this.
>
> Yes, I find it hard to improve the boot time of the init.d stage.
Then _get rid of your init.d_.
The SysV init sequence is insane.
I use a daemontools-based setup and it starts up to the login prompt
in 5-10 secs (counting from the root fs mount).
In my setup, after a few simple startup scripts an svscan process
is started, which starts all services in parallel.
Gettys are services, and start right away.
--
vda