2005-10-18 20:03:38

by Guido Fiala

[permalink] [raw]
Subject: large files unnecessary trashing filesystem cache?

(Please note that I'm not subscribed to the list; please CC me on replies.)

Story:
Once in a while we have a discussion on the vdr (video disk recorder) mailing
list about very large files trashing the filesystem's memory cache, leading to
unnecessary delays when accessing directory contents that are no longer cached.

With this program, and certainly with all applications that deal with very
large files read only once (much larger than the usual amount of memory), it
happens that all other cached blocks of the filesystem are removed from memory
solely to keep as much as possible of that file in memory, which seems to be a
bad strategy in most situations.

Of course one could always implement fadvise calls in all applications, but I
suggest discussing whether a (configurable) maximum in-memory cache on a
per-file basis should be implemented in linux/mm or wherever this belongs.
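
For illustration, here is a minimal sketch of what such a per-application
drop-behind loop could look like; the 8 MB window and the helper name are
arbitrary choices, not tested recommendations:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

#define WINDOW (8 * 1024 * 1024)        /* arbitrary drop-behind window */

int stream_file(const char *path)
{
        char buf[64 * 1024];
        off_t done = 0, dropped = 0;
        ssize_t n;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return -1;

        while ((n = read(fd, buf, sizeof(buf))) > 0) {
                done += n;
                /* hint that we will not reread what lies behind us, so
                 * those pages can be reclaimed instead of evicting the
                 * rest of the filesystem cache */
                if (done - dropped >= WINDOW) {
                        posix_fadvise(fd, dropped, done - dropped,
                                      POSIX_FADV_DONTNEED);
                        dropped = done;
                }
        }
        posix_fadvise(fd, dropped, 0, POSIX_FADV_DONTNEED); /* len 0: to EOF */
        close(fd);
        return n < 0 ? -1 : 0;
}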

My guess was that it has something to do with mm/readahead.c; a quick and
dirty test limiting the result of the function max_sane_readahead() to
8 MBytes did not solve the issue, but I might have done something wrong.

I've searched the archive but could not find a previous discussion - is this a
new idea?

It would be interesting to discuss whether and when this proposed feature
could lead to better performance, and whether it has any unwanted side effects.

Thanks for ideas on that issue.


2005-10-18 20:48:43

by Badari Pulavarty

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

On Tue, 2005-10-18 at 22:01 +0200, Guido Fiala wrote:
> (please note, i'am not subscribed to the list, please CC me on reply)
>
> Story:
> Once in while we have a discussion at the vdr (video disk recorder) mailing
> list about very large files trashing the filesystems memory cache leading to
> unnecessary delays accessing directory contents no longer cached.
>
> This program and certainly all applications that deal with very large files
> only read once (much larger than usual memory) - it happens that all other
> cached blocks of the filessystem are removed from memory solely to keep as
> much as possible of that file in memory, which seems to be a bad strategy in
> most situations.
>
> Of course one could always implement f_advise-calls in all applications, but i
> suggest a discussion if a maximum (configurable) in-memory-cache on a
> per-file base should be implemented in linux/mm or where this belongs.
>
> My guess was, it has something to do with mm/readahead.c, a test limiting the
> result of the function "max_sane_readahead(...) to 8 MBytes as a quick and
> dirty test did not solve the issue, but i might have done something wrong.
>
> I've searched the archive but could not find a previous discussion - is this a
> new idea?
>
> It would be interesting to discuss if and when this proposed feature could
> lead to better performance or has any unwanted side effects.
>
> Thanks for ideas on that issue.

Is there a reason why those applications couldn't use O_DIRECT ?

Thanks,
Badari

2005-10-18 21:58:56

by Bodo Eggert

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

Badari Pulavarty <[email protected]> wrote:
> On Tue, 2005-10-18 at 22:01 +0200, Guido Fiala wrote:

[large files trash cache]

> Is there a reason why those applications couldn't use O_DIRECT ?

The cache trashing will affect all programs handling large files:

mkisofs * > iso
dd < /dev/hdx42 | gzip > imagefile
perl -pe's/filenamea/filenameb/' < iso | cdrecord - # <- never tried

Changing a few programs will only partly cover the problems.

I guess the solution would be using random cache eviction rather than
a FIFO. I never took a look at the cache mechanism, so I may very well be
wrong here.
--
I thank GMX for sabotaging the use of my addresses by means of lies spread
via SPF.

2005-10-18 23:06:33

by Badari Pulavarty

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

On Tue, 2005-10-18 at 23:58 +0200, Bodo Eggert wrote:
> Badari Pulavarty <[email protected]> wrote:
> > On Tue, 2005-10-18 at 22:01 +0200, Guido Fiala wrote:
>
> [large files trash cache]
>
> > Is there a reason why those applications couldn't use O_DIRECT ?
>
> The cache trashing will affect all programs handling large files:
>
> mkisofs * > iso
> dd < /dev/hdx42 | gzip > imagefile
> perl -pe's/filenamea/filenameb/' < iso | cdrecord - # <- never tried
>

Are these examples that demonstrate the thrashing problem?
A few product (database) groups here are trying to get me to
work on a solution before demonstrating the problem. They
also claim exactly what you are saying. They want a control
on how many pages (per process, per file, per filesystem,
or system-wide) you can have in the filesystem cache.

That's why I am pressing to find out the real issue behind this.
If you have a demonstrable test case, please let me know.
I will be happy to take a look.


> Changing a few programs will only partly cover the problems.
>
> I guess the solution would be using random cache eviction rather than
> a FIFO. I never took a look the cache mechanism, so I may very well be
> wrong here.

Read-only pages should be recycled really easily and quickly. I can't
believe read-only pages are causing you all the trouble.


Thanks,
Badari

2005-10-19 00:21:03

by David Lang

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

On Tue, 18 Oct 2005, Badari Pulavarty wrote:

> On Tue, 2005-10-18 at 23:58 +0200, Bodo Eggert wrote:
>> Badari Pulavarty <[email protected]> wrote:
>>> On Tue, 2005-10-18 at 22:01 +0200, Guido Fiala wrote:
>>
>> [large files trash cache]
>>
>>> Is there a reason why those applications couldn't use O_DIRECT ?
>>
>> The cache trashing will affect all programs handling large files:
>>
>> mkisofs * > iso
>> dd < /dev/hdx42 | gzip > imagefile
>> perl -pe's/filenamea/filenameb/' < iso | cdrecord - # <- never tried
>>
>
> Are these examples which demonstrate the thrashing problem.
> Few product (database) groups here are trying to get me to
> work on a solution before demonstrating the problem. They
> also claim exactly what you are saying. They want a control
> on how many pages (per process or per file or per filesystem
> or system wide) you can have in filesystem cache.
>
> Thats why I am pressing to find out the real issue behind this.
> If you have a demonstratable testcase, please let me know.
> I will be happy to take a look.
>
>
>> Changing a few programs will only partly cover the problems.
>>
>> I guess the solution would be using random cache eviction rather than
>> a FIFO. I never took a look the cache mechanism, so I may very well be
>> wrong here.
>
> Read-only pages should be re-cycled really easily & quickly. I can't
> belive read-only pages are causing you all the trouble.

The problem is that there are many sources of read-only pages (how many
shared library pages are not read-only, for example?) and not all of them
are of equal value to the system.

The ideal situation would probably be something like the adaptive
read-ahead approach, where the system balances the saved pages between
processes/files rather than just benefiting the process that uses pages
the fastest.

I don't have any idea how to implement this sanely without a horrible
performance hit due to recordkeeping, but someone else may have a better
idea.

Thinking out loud here: how bad would it be to split the LRU list based on
the number of things that have a page mapped? Even if it only split it
into a small number of lists (say even just 0 and 1+) and then evicted pages
from the 0 list in preference to the 1+ list (or at least added a fixed value
to the age of the 0 pages to have them age faster), this would limit how
badly library and code pages get evicted by a large file access.

David Lang

--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

2005-10-19 00:33:32

by Fawad Lateef

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

On 10/19/05, Badari Pulavarty <[email protected]> wrote:
> On Tue, 2005-10-18 at 23:58 +0200, Bodo Eggert wrote:
> > Changing a few programs will only partly cover the problems.
> >
> > I guess the solution would be using random cache eviction rather than
> > a FIFO. I never took a look the cache mechanism, so I may very well be
> > wrong here.
>
> Read-only pages should be re-cycled really easily & quickly. I can't
> belive read-only pages are causing you all the trouble.
>

I don't think the file is marked read-only; when it is accessed for
reading, the cache will contain the new file's data and the previously
cached data will be lost. So how can you say that read-only pages, or
read pages, are not causing the problem?

And I think the problem of large files trashing the filesystem cache can
be handled by the application using direct I/O, or it must (and might
already) be managed by the filesystem itself, because besides the
application and the filesystem there isn't anything that can detect that
the file currently being accessed is a large file (the underlying layer
deals with blocks of data, or at the block level with sectors, and you
can't tell what kind of data it is).

Am I correct, or am I missing something?


--
Fawad Lateef

2005-10-19 01:42:06

by Bernd Eckenfels

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

In article <[email protected]> you wrote:
> Is I m correct ??? or missing something ??

Well, applications that know the file is not going to be cached should
clearly give that hint. Some heuristics for detecting "streaming, one-time"
access are useful; however, a stricter limitation at the process level would
make the system more sane and stop single files from occupying most of the
block buffer.

Regards
Bernd

2005-10-19 03:03:14

by Andrew James Wade

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

On Tuesday 18 October 2005 16:01, Guido Fiala wrote:
> (please note, i'am not subscribed to the list, please CC me on reply)
>
> Story:
> Once in while we have a discussion at the vdr (video disk recorder) mailing
> list about very large files trashing the filesystems memory cache leading to
> unnecessary delays accessing directory contents no longer cached.
>
> This program and certainly all applications that deal with very large files
> only read once (much larger than usual memory) - it happens that all other
> cached blocks of the filessystem are removed from memory solely to keep as
> much as possible of that file in memory, which seems to be a bad strategy in
> most situations.

For this particular workload, a heuristic to detect streaming and drop
pages a few MB back from the currently accessed pages would probably work well.
I believe the second part is already in the kernel (activated by an fadvise
call), but the heuristic is lacking.

> Of course one could always implement f_advise-calls in all applications, but i
> suggest a discussion if a maximum (configurable) in-memory-cache on a
> per-file base should be implemented in linux/mm or where this belongs.
>
> My guess was, it has something to do with mm/readahead.c, a test limiting the
> result of the function "max_sane_readahead(...) to 8 MBytes as a quick and
> dirty test did not solve the issue, but i might have done something wrong.
>
> I've searched the archive but could not find a previous discussion - is this a
> new idea?

I'd do searches on thrashing control and swap tokens. The problem with
thrashing is similar: a process that accesses large amounts of memory in a
short period of time blows away the caches. And the solution should be
similar: penalize the process doing so by preferentially reclaiming its pages.

> It would be interesting to discuss if and when this proposed feature could
> lead to better performance or has any unwanted side effects.

Sometimes you want a single file to take up most of the memory; databases
spring to mind. Perhaps files/processes that take up a large proportion of
memory should be penalized by preferentially reclaiming their pages, but
limit the aggressiveness so that they can still take up most of the memory
if sufficiently persistent (and the rest of the system isn't thrashing).

>
> Thanks for ideas on that issue.

2005-10-19 04:20:06

by Lee Revell

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

On Tue, 2005-10-18 at 22:01 +0200, Guido Fiala wrote:
> Of course one could always implement f_advise-calls in all
> applications

Um, this seems like the obvious answer. The application doing the read
KNOWS it's a streaming read, while the best the kernel can do is guess.

You don't really make much of a case that fadvise can't do the job.

Lee

2005-10-19 04:38:26

by Andrew Morton

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

Andrew James Wade <[email protected]> wrote:
>
> Sometimes you want a single file to take up most of the memory; databases
> spring to mind. Perhaps files/processes that take up a large proportion of
> memory should be penalized by preferentially reclaiming their pages, but
> limit the aggressiveness so that they can still take up most of the memory
> if sufficiently persistent (and the rest of the system isn't thrashing).

Yes. Basically any smart heuristic we apply here will have failure modes.
For example, the person whose application does repeated linear reads of the
first 100MB of a 4G file will get very upset.

So any such change really has to be opt-in. Yes, it can be done quite
simply via repeated application of posix_fadvise(). But practically
speaking, it's very hard to get upstream GPL'ed applications changed, let
alone proprietary ones.

An obvious approach would be an LD_PRELOAD thingy which modifies read() and
write(), perhaps controlled via an environment variable. AFAIK nobody has
even attempted this.
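
For what it's worth, a rough sketch of that preload idea; the environment
variable name (STREAM_DONTNEED) and the per-read hint are invented here,
and a real version would want to batch the fadvise calls:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* gcc -shared -fPIC -o dropbehind.so dropbehind.c -ldl
 * LD_PRELOAD=./dropbehind.so STREAM_DONTNEED=1 my-fave-backup-program ... */

static ssize_t (*real_read)(int, void *, size_t);

ssize_t read(int fd, void *buf, size_t count)
{
        ssize_t n;
        off_t pos;

        if (!real_read)
                real_read = (ssize_t (*)(int, void *, size_t))
                                dlsym(RTLD_NEXT, "read");

        n = real_read(fd, buf, count);

        /* only seekable descriptors (regular files) are worth the hint;
         * pipes and sockets fail the lseek() and are left alone */
        if (n > 0 && getenv("STREAM_DONTNEED") &&
            (pos = lseek(fd, 0, SEEK_CUR)) != (off_t)-1)
                posix_fadvise(fd, 0, pos, POSIX_FADV_DONTNEED);

        return n;
}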

For a kernel-based solution you could take a look at my old 2.4-based
O_STREAMING patch. It works OK, but it still needs modification of each
application (or an LD_PRELOAD hook into open()).

A decent kernel implementation would be to add a max_resident_pages to
struct file_struct and to use that to perform drop-behind within read() and
write(). That's a bit of arithmetic and a call to
invalidate_mapping_pages(). The userspace interface to that could be a
linux-specific extension to posix_fadvise() or to fcntl().

But that still requires that all the applications be modified.

So I'd also suggest a new resource limit which, if set, is copied into the
application's file_structs on open(). So you then write a little wrapper
app which does setrlimit()+exec():

limit-cache-usage -s 1000 my-fave-backup-program <args>

Which will cause every file which my-fave-backup-program reads or writes to
be limited to a maximum pagecache residency of 1000 kbytes.
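
Purely as a sketch of that wrapper: RLIMIT_FILECACHE is a made-up name
standing in for the proposed resource limit, so setrlimit() will simply
fail with it on today's kernels:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <unistd.h>

#ifndef RLIMIT_FILECACHE
#define RLIMIT_FILECACHE 16     /* placeholder, not a real resource limit */
#endif

int main(int argc, char **argv)
{
        struct rlimit rl;

        if (argc < 4 || strcmp(argv[1], "-s") != 0) {
                fprintf(stderr, "usage: %s -s <kbytes> <program> [args...]\n",
                        argv[0]);
                return 1;
        }

        /* limit pagecache residency per opened file to <kbytes> */
        rl.rlim_cur = rl.rlim_max = (rlim_t)atol(argv[2]) * 1024;
        if (setrlimit(RLIMIT_FILECACHE, &rl) != 0)
                perror("setrlimit");    /* expected until the limit exists */

        execvp(argv[3], &argv[3]);
        perror("execvp");
        return 1;
}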

This facility could trivially be used for a mini-DoS: shooting down other
people's pagecache so their apps run slowly. But you can use fadvise() for
that already.


Or raise a patch against glibc's read() and write() which uses some
environment string to control fadvise-based invalidation. That's pretty
simple.


2005-10-19 05:45:50

by Andrew James Wade

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

On Wednesday 19 October 2005 00:37, Andrew Morton wrote:
> Andrew James Wade <[email protected]> wrote:
> >
> > Sometimes you want a single file to take up most of the memory; databases
> > spring to mind. Perhaps files/processes that take up a large proportion of
> > memory should be penalized by preferentially reclaiming their pages, but
> > limit the aggressiveness so that they can still take up most of the memory
> > if sufficiently persistent (and the rest of the system isn't thrashing).
>
> Yes. Basically any smart heuristic we apply here will have failure modes.
> For example, the person whose application does repeated linear reads of the
> first 100MB of a 4G file will get very upset.

As will any dumb heuristic for that matter; we'd need precognition[1] to avoid
all of them. But we can hopefully make the failure modes rarer and more
predictable. I don't know how my proposal would fare, and as I do not have
the code to test the matter I think I shall drop it.

[1] Which could, on occasion, be provided by hinting.

2005-10-19 07:24:07

by Bodo Eggert

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

On Tue, 18 Oct 2005, Badari Pulavarty wrote:

> On Tue, 2005-10-18 at 23:58 +0200, Bodo Eggert wrote:
> > Badari Pulavarty <[email protected]> wrote:
> > > On Tue, 2005-10-18 at 22:01 +0200, Guido Fiala wrote:
> >
> > [large files trash cache]
> >
> > > Is there a reason why those applications couldn't use O_DIRECT ?
> >
> > The cache trashing will affect all programs handling large files:
> >
> > mkisofs * > iso
> > dd < /dev/hdx42 | gzip > imagefile
> > perl -pe's/filenamea/filenameb/' < iso | cdrecord - # <- never tried
> >
>
> Are these examples which demonstrate the thrashing problem.

You can also cat a big file into /dev/null. I made those examples in order
to demonstrate the problem with using O_DIRECT.

OTOH, I don't run realtime stuff on my computer, so I'm not really affected,
but I'll try to show it anyway.

> > Changing a few programs will only partly cover the problems.
> >
> > I guess the solution would be using random cache eviction rather than
> > a FIFO. I never took a look the cache mechanism, so I may very well be
> > wrong here.
>
> Read-only pages should be re-cycled really easily & quickly. I can't
> belive read-only pages are causing you all the trouble.

Just a q&d test:

$ time ls -l $DIR > /dev/null
real 0m0.442s
user 0m0.008s
sys 0m0.024s

$ time ls -l $DIR > /dev/null
real 0m0.077s
user 0m0.008s
sys 0m0.008s

cat $BIGFILES_1.5GB > /dev/null

$ time ls -l $DIR > /dev/null
real 0m0.270s
user 0m0.008s
sys 0m0.008s

$ time ls -l $DIR > /dev/null
real 0m0.078s
user 0m0.004s
sys 0m0.004s



BTW:
I suggested random eviction because it is more likely to evict pages from
large files than pages from small files, but I now think it will also cause
the evicted pages to be non-contiguous, and thereby make rereading them
slower. I don't know which effect would be worse.

--
Is reading in the bathroom considered Multitasking?

2005-10-19 11:01:22

by Guido Fiala

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

Quoting Andrew James Wade <[email protected]>:
> > For example, the person whose application does repeated linear reads of
> > the first 100MB of a 4G file will get very upset.
>
> As will any dumb heuristic for that matter; we'd need precognition[1]
> to avoid all of them. But we can hopefully make the failure modes rarer
> and more predictable. I don't know how my proposal would fare, and as I
> do not have the code to test the matter I think I shall drop it.
>
> [1] Which could, on occasion, be provided by hinting.

That is why I said it should be configurable by root via a /proc/sys/kernel
interface: if the system is intended primarily to run a database, set it to a
different heuristic than for a "desktop/multimedia workstation".

Much like the recent IO-scheduler additions (do they affect this behaviour?)

2005-10-19 11:06:54

by Guido Fiala

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

Quoting Bodo Eggert <[email protected]>:
> You can alyo cat a big file into /dev/null. I made those examples in order
> to demonstrate the problem with using O_DIRECT.

O_DIRECT also has too much impact on the mentioned "vdr" due to unwanted side
effects.

>
> OTOH, I don't realtime stuff on my computer, so I'm not really affected,
> but I'll try to show it anyway.
>
> > > Changing a few programs will only partly cover the problems.
> > >
> > > I guess the solution would be using random cache eviction rather than
> > > a FIFO. I never took a look the cache mechanism, so I may very well be
> > > wrong here.
> >
> > Read-only pages should be re-cycled really easily & quickly. I can't
> > belive read-only pages are causing you all the trouble.
>
> Just a q&d test:
>
> $ time ls -l $DIR > /dev/null
> real 0m0.442s
> user 0m0.008s
> sys 0m0.024s
>
> $ time ls -l $DIR > /dev/null
> real 0m0.077s
> user 0m0.008s
> sys 0m0.008s
>
> cat $BIGFILES_1.5GB > /Dev/null
>
> $ time ls -l $DIR > /dev/null
> real 0m0.270s
> user 0m0.008s
> sys 0m0.008s
>
> $ time ls -l $DIR > /dev/null
> real 0m0.078s
> user 0m0.004s
> sys 0m0.004s
>
>
Thanks for pointing this out; this clearly shows the effect.
Now consider a mildly loaded multitasking environment running X, some services,
a window manager, email, maybe some databases and a streaming video application
at once (as mine does): the video file will have an unwanted impact on all the
other applications, leading to unnecessary reloads of lots of files, inodes,
etc.

2005-10-19 11:10:36

by Guido Fiala

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

Quoting Andrew Morton <[email protected]>:

> An obvious approach would be an LD_PRELOAD thingy which modifies read() and
> write(), perhaps controlled via an environment variable. AFAIK nobody has
> even attempted this.

Sounds interesting.

> A decent kernel implementation would be to add a max_resident_pages to
> struct file_struct and to use that to perform drop-behind within read() and
> write(). That's a bit of arithmetic and a call to
> invalidate_mapping_pages(). The userspace interface to that could be a
> linux-specific extension to posix_fadvise() or to fcntl().

I would still like to have a way to configure a "default file policy/heuristic"
for the system, just as I can choose the IO scheduler.

>
> But that still requires that all the applications be modified.
>
> So I'd also suggest a new resource limit which, if set, is copied into the
> applications's file_structs on open(). So you then write a little wrapper
> app which does setrlimit()+exec():
>
> limit-cache-usage -s 1000 my-fave-backup-program <args>
>
> Which will cause every file which my-fave-backup-program reads or writes to
> be limited to a maximum pagecache residency of 1000 kbytes.

Or make it another 'ulimit' parameter...

2005-10-19 13:43:53

by Avi Kivity

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

Bodo Eggert wrote:

>I guess the solution would be using random cache eviction rather than
>a FIFO. I never took a look the cache mechanism, so I may very well be
>wrong here.
>
>
Instead of random cache eviction, you can make pages that were read in
contiguously age faster than pages that were read in singly.

The motivation is that the cost of reading 64K vs 4K is almost the same
(most of the cost is the seek), while the benefit for evicting 64K is 16
times that of evicting 4K. Over time, the kernel would favor expensive
random-access pages over cheap streaming pages.

In a way, this is already implemented for inodes, which are aged more
slowly than data pages.


2005-10-19 15:43:40

by Badari Pulavarty

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

On Wed, 2005-10-19 at 00:10 -0400, Lee Revell wrote:
> On Tue, 2005-10-18 at 22:01 +0200, Guido Fiala wrote:
> > Of course one could always implement f_advise-calls in all
> > applications
>
> Um, this seems like the obvious answer. The application doing the read
> KNOWS it's a streaming read, while the best the kernel can do is guess.
>
> You don't really make much of a case that fadvise can't do the job.

The issue is, how will "other/random" programs/applications affect the
performance of my application?

The complaint I hear most is from our database folks: they tune stuff
and they are happy with their performance. And then someone does
a tar/cp/cpio/ftp/backup/compile on some random files on the system, and
suddenly database performance drops. They want to see a system-wide or
per-filesystem tunable on how much pagecache it takes up.

Andrew, does this make sense at all?


Thanks,
Badari

2005-10-19 15:54:28

by Ingo Oeser

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

Hi,

On Wednesday 19 October 2005 13:10, [email protected] wrote:
> Zitat von Andrew Morton <[email protected]>:
> > So I'd also suggest a new resource limit which, if set, is copied into the
> > applications's file_structs on open(). So you then write a little wrapper
> > app which does setrlimit()+exec():
> >
> > limit-cache-usage -s 1000 my-fave-backup-program <args>
> >
> > Which will cause every file which my-fave-backup-program reads or writes to
> > be limited to a maximum pagecache residency of 1000 kbytes.
>
> Or make it another 'ulimit' parameter...

Which is already there: There is an ulimit for "maximum RSS",
which is at least a superset of "maximum pagecache residency".

This is already settable and known by many admins. But AFAIR it is not
honoured by the kernel completely, right?

But per file is a much better choice, since this would allow concurrent
streaming. This is needed to implement timeshifting, at least[1].

So either I am missing something, or this is not a proper solution yet.


Regards

Ingo Oeser

[1] Which is obviously done by some kind of on-disk FIFO.



2005-10-19 18:00:41

by Guido Fiala

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

On Wednesday 19 October 2005 06:10, Lee Revell wrote:
> On Tue, 2005-10-18 at 22:01 +0200, Guido Fiala wrote:
> > Of course one could always implement f_advise-calls in all
> > applications
>
> Um, this seems like the obvious answer. The application doing the read
> KNOWS it's a streaming read, while the best the kernel can do is guess.
>
> You don't really make much of a case that fadvise can't do the job.
>

The kernel could do its best to optimize default performance; applications
that want to consider their own optimal behaviour should do so, and all other
files are kept under the default heuristic policy (an adaptable, configurable
one).

The heuristic can be based on access statistics:

streaming/sequential access can be guessed from getting exactly a 100% cache
hit rate (drop behind pages immediately),

random access/repeated reads can be guessed from a >100% hit rate (keep as
much in memory as possible).

Less than a 100% hit rate is already handled sanely, I guess, by reducing
readahead; precognition would gather access patterns (every n-th block is
read, so read ahead every n-th block - an unlikely scenario I guess, but it
might happen in databases).

How about files that are read backwards? Others?

2005-10-19 18:43:12

by Kyle Moffett

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

On Oct 19, 2005, at 13:58:37, Guido Fiala wrote:
> Kernel could do the best to optimize default performance,
> applications that consider their own optimal behaviour should do
> so, all other files are kept under default heuristic policy
> (adaptable, configurable one)
>
> Heuristic can be based on access statistic:
>
> streaming/sequential can be guessed by getting exactly 100% cache
> hit rate (drop behind pages immediately),

What about a grep through my kernel sources, or another linear search
through a large directory tree? That would get exactly a 100% cache
hit rate, which would cause your method to drop the pages immediately,
meaning that subsequent greps are equally slow. I have enough memory
to hold a couple of kernel trees, and I want my grepping to push OO.org
out of RAM for a bit while I do my kernel development.


Cheers,
Kyle Moffett

--
I lost interest in "blade servers" when I found they didn't throw
knives at people who weren't supposed to be in your machine room.
-- Anthony de Boer


2005-10-19 18:54:15

by Guido Fiala

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

On Wednesday 19 October 2005 20:43, Kyle Moffett wrote:
> On Oct 19, 2005, at 13:58:37, Guido Fiala wrote:
> > Kernel could do the best to optimize default performance,
> > applications that consider their own optimal behaviour should do
> > so, all other files are kept under default heuristic policy
> > (adaptable, configurable one)
> >
> > Heuristic can be based on access statistic:
> >
> > streaming/sequential can be guessed by getting exactly 100% cache
> > hit rate (drop behind pages immediately),
>
> What about a grep through my kernel sources or other linear search
> through a large directory tree? That would get exactly 100% cache
> hit rate which would cause your method to drop the pages immediately,
> meaning that subsequent greps are equally slow. I have enough memory
> to hold a couple kernel trees and I want my grepping to push OO.org
> out of RAM for a bit while I do my kernel development.

OK, it seems this idea was too simple and needs some more thinking ;-)
Of course I have lots of similar workloads and don't like to lose the
speedup either.

What other data useful for the job do we already have in the structs?

2005-10-19 19:50:16

by Andrew Morton

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

Ingo Oeser <[email protected]> wrote:
>
> Hi,
>
> On Wednesday 19 October 2005 13:10, [email protected] wrote:
> > Zitat von Andrew Morton <[email protected]>:

Please don't edit Cc lines. Just do reply-to-all.

> > > So I'd also suggest a new resource limit which, if set, is copied into the
> > > applications's file_structs on open(). So you then write a little wrapper
> > > app which does setrlimit()+exec():
> > >
> > > limit-cache-usage -s 1000 my-fave-backup-program <args>
> > >
> > > Which will cause every file which my-fave-backup-program reads or writes to
> > > be limited to a maximum pagecache residency of 1000 kbytes.
> >
> > Or make it another 'ulimit' parameter...

That's what I said. ulimit is the shell interface to resource limits.

> Which is already there: There is an ulimit for "maximum RSS",
> which is at least a superset of "maximum pagecache residency".

RSS is a quite separate concept from pagecache.

> This is already settable and known by many admins. But AFAIR it is not
> honoured by the kernel completely, right?
>
> But per file is a much better choice, since this would allow
> concurrent streaming. This is needed to implement timeshifting at least[1].
>
> So either I miss something or this is no proper solution yet.

I described a couple of ways in which this can be done from userspace with
LD_PRELOAD.

2005-10-19 22:26:29

by Paul Jackson

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

Andrew wrote:
> Please don't edit Cc lines. Just do reply-to-all.

Don't worry, Ingo Oeser.

I think it was [email protected] who edited the Cc line in this thread,
not yourself.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-10-20 06:28:38

by Ingo Oeser

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

Hi,

On Wednesday 19 October 2005 21:49, Andrew Morton wrote:
> Ingo Oeser <[email protected]> wrote:
> > > > app which does setrlimit()+exec():
> > > >
> > > > limit-cache-usage -s 1000 my-fave-backup-program <args>
> > > >
> > > > Which will cause every file which my-fave-backup-program reads or writes to
> > > > be limited to a maximum pagecache residency of 1000 kbytes.
> > >
> > > Or make it another 'ulimit' parameter...
> > Which is already there: There is an ulimit for "maximum RSS",
> > which is at least a superset of "maximum pagecache residency".
>
> RSS is a quite separate concept from pagecache.

Yes, I know, but the amount of pagecache which is RESIDENT for a process
is not that separate from RSS, I think.

I always thought RSS is the amount of mapped and anonymous
pages of a process, which are in physical memory (aka resident).

So I consider the amount of mapped pagecache pages of
a process which are in physical memory (aka resident) a subset.

Or do you care about page cache pages not mapped into the process?
Is this the point I'm missing?

Please enlighten me :-)


Regards

Ingo Oeser



2005-10-20 15:26:07

by Guido Fiala

[permalink] [raw]
Subject: Re: large files unnecessary trashing filesystem cache?

On Tuesday 18 October 2005 22:48, Badari Pulavarty wrote:
> On Tue, 2005-10-18 at 22:01 +0200, Guido Fiala wrote:
> > Story:
> > Once in while we have a discussion at the vdr (video disk recorder)
> > mailing list about very large files trashing the filesystems memory cache
> > leading to unnecessary delays accessing directory contents no longer
> > cached.
> > [...]
> Is there a reason why those applications couldn't use O_DIRECT ?
>
> Thanks,
> Badari

I asked a vdr expert about this, and here is the reason why O_DIRECT is not
suitable:

O_DIRECT would be great if it were a simple option for opening files.
But as a matter of fact, O_DIRECT completely changes the semantics of
file access. You have to read blocks of a defined size into memory that
is aligned to defined block boundaries. Memory provided by a normal
malloc() or new() is not usable and results in I/O errors. So the result
is that you have to completely rewrite the whole I/O subsystem of the
affected program. Most maintainers of non-trivial applications are
completely resistant to such changes, for good reasons.
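
To illustrate the point, a minimal sketch of an O_DIRECT read loop; the
4096-byte alignment and transfer size are only an assumption, as the real
requirement depends on the device and filesystem:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLK 4096        /* assumed alignment and transfer size */

int read_direct(const char *path)
{
        void *buf;
        ssize_t n;
        int fd = open(path, O_RDONLY | O_DIRECT);

        if (fd < 0)
                return -1;

        /* memory from a plain malloc() is typically rejected with EINVAL;
         * the buffer has to be aligned, e.g. via posix_memalign() */
        if (posix_memalign(&buf, BLK, BLK)) {
                close(fd);
                return -1;
        }

        /* every transfer must be block sized and block aligned, which is
         * exactly the change of semantics described above */
        while ((n = read(fd, buf, BLK)) > 0)
                ;       /* hand the block to the application here */

        free(buf);
        close(fd);
        return n < 0 ? -1 : 0;
}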

If there were an O_DIRECT_EX32++ (or O_STREAMING) that didn't have this
change in semantics, it would be much easier to apply the necessary changes.

BTW: in the case of the VDR program, not even a per-process limit on used
buffer cache would help: the same program reads huge files _and_ huge
directory trees with a lot of small files that should be cached. A
heuristic for this case has to work on a per-file basis. It needs to
detect that some files are only used in a streaming manner, with very
occasional jumps in random directions (skipping commercials, reviewing a
scene). I don't know whether such a heuristic is possible and whether it
would not break other things.

PS: using fadvise helps a bit. One can keep the I/O semantics, but you
have to add a virtualisation layer for all streaming I/O. And you can't
combine posix_fadvise(POSIX_FADV_DONTNEED) with
posix_fadvise(POSIX_FADV_WILLNEED) when you possibly have jumps in your
access pattern, because you can't cancel (at least to my knowledge) the
POSIX_FADV_WILLNEED call when you see that the readahead is not needed
any more. It would be an interesting add-on if POSIX_FADV_DONTNEED
cancelled the read of a region that had been requested by
POSIX_FADV_WILLNEED before.

Ralf (forwarded by me at his request)

---
Hopefully I did the "reply all" correctly this time; sorry if I accidentally
caused some trouble.