2007-01-24 00:51:44

by Christoph Lameter

Subject: [RFC] Limit the size of the pagecache

This is a patch using some of Aubrey's work, plugging it in what is IMHO
the right way. Feel free to improve on it. I have gotten repeated
requests to be able to limit the pagecache. With the revised VM statistics
this is now actually possible. I'd like to know more about possible uses of
such a feature.




It may be useful to limit the size of the page cache for various reasons
such as

1. Ensure that anonymous pages that may contain performance
critical data are never subject to swap.

2. Ensure rapid turnaround of pages in the cache.

3. Reserve memory for other uses? (Aubrey?)

We add a new variable "pagecache_ratio" to /proc/sys/vm/ that
defaults to 100 (all memory usable for the pagecache).

The size of the pagecache is the number of file-backed
pages in a zone, which is available through NR_FILE_PAGES.

We skip zones that contain too many page cache pages in
the page allocator, which may cause us to enter reclaim.

If we enter reclaim and the number of page cache pages
is too high then we switch off swapping during reclaim
to avoid touching anonymous pages.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.20-rc5/include/linux/gfp.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/gfp.h 2007-01-12 12:54:26.000000000 -0600
+++ linux-2.6.20-rc5/include/linux/gfp.h 2007-01-23 17:54:51.750696888 -0600
@@ -46,6 +46,7 @@ struct vm_area_struct;
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_PAGECACHE ((__force gfp_t)0x80000u) /* Page cache allocation */

#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
Index: linux-2.6.20-rc5/include/linux/pagemap.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/pagemap.h 2007-01-12 12:54:26.000000000 -0600
+++ linux-2.6.20-rc5/include/linux/pagemap.h 2007-01-23 18:13:14.310062155 -0600
@@ -62,12 +62,13 @@ static inline struct page *__page_cache_

static inline struct page *page_cache_alloc(struct address_space *x)
{
- return __page_cache_alloc(mapping_gfp_mask(x));
+ return __page_cache_alloc(mapping_gfp_mask(x)| __GFP_PAGECACHE);
}

static inline struct page *page_cache_alloc_cold(struct address_space *x)
{
- return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
+ return __page_cache_alloc(mapping_gfp_mask(x) |
+ __GFP_COLD | __GFP_PAGECACHE);
}

typedef int filler_t(void *, struct page *);
Index: linux-2.6.20-rc5/include/linux/sysctl.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/sysctl.h 2007-01-12 12:54:26.000000000 -0600
+++ linux-2.6.20-rc5/include/linux/sysctl.h 2007-01-23 18:17:09.285324555 -0600
@@ -202,6 +202,7 @@ enum
VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
VM_VDSO_ENABLED=34, /* map VDSO into new processes? */
VM_MIN_SLAB=35, /* Percent pages ignored by zone reclaim */
+ VM_PAGECACHE_RATIO=36, /* percent of RAM to use as page cache */
};


@@ -956,7 +957,6 @@ extern ctl_handler sysctl_intvec;
extern ctl_handler sysctl_jiffies;
extern ctl_handler sysctl_ms_jiffies;

-
/*
* Register a set of sysctl names by calling register_sysctl_table
* with an initialised array of ctl_table's. An entry with zero
Index: linux-2.6.20-rc5/kernel/sysctl.c
===================================================================
--- linux-2.6.20-rc5.orig/kernel/sysctl.c 2007-01-12 12:54:26.000000000 -0600
+++ linux-2.6.20-rc5/kernel/sysctl.c 2007-01-23 18:24:04.763443772 -0600
@@ -1023,6 +1023,17 @@ static ctl_table vm_table[] = {
.extra2 = &one_hundred,
},
#endif
+ {
+ .ctl_name = VM_PAGECACHE_RATIO,
+ .procname = "pagecache_ratio",
+ .data = &sysctl_pagecache_ratio,
+ .maxlen = sizeof(sysctl_pagecache_ratio),
+ .mode = 0644,
+ .proc_handler = &sysctl_pagecache_ratio_sysctl_handler,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ .extra2 = &one_hundred,
+ },
#ifdef CONFIG_X86_32
{
.ctl_name = VM_VDSO_ENABLED,
Index: linux-2.6.20-rc5/mm/page_alloc.c
===================================================================
--- linux-2.6.20-rc5.orig/mm/page_alloc.c 2007-01-16 23:26:28.000000000 -0600
+++ linux-2.6.20-rc5/mm/page_alloc.c 2007-01-23 18:11:40.484617205 -0600
@@ -59,6 +59,8 @@ unsigned long totalreserve_pages __read_
long nr_swap_pages;
int percpu_pagelist_fraction;

+int sysctl_pagecache_ratio = 100;
+
static void __free_pages_ok(struct page *page, unsigned int order);

/*
@@ -1168,6 +1170,11 @@ zonelist_scan:
!cpuset_zone_allowed_softwall(zone, gfp_mask))
goto try_next_zone;

+ if ((gfp_mask & __GFP_PAGECACHE) &&
+ zone_page_state(zone, NR_FILE_PAGES) >
+ zone->max_pagecache_pages)
+ goto try_next_zone;
+
if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
unsigned long mark;
if (alloc_flags & ALLOC_WMARK_MIN)
@@ -2670,6 +2677,8 @@ static void __meminit free_area_init_cor
/ 100;
zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
#endif
+ zone->max_pagecache_pages =
+ (realsize * sysctl_pagecache_ratio) / 100;
zone->name = zone_names[j];
spin_lock_init(&zone->lock);
spin_lock_init(&zone->lru_lock);
@@ -3245,6 +3254,22 @@ int sysctl_min_slab_ratio_sysctl_handler
}
#endif

+int sysctl_pagecache_ratio_sysctl_handler(ctl_table *table, int write,
+ struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+ struct zone *zone;
+ int rc;
+
+ rc = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
+ if (rc)
+ return rc;
+
+ for_each_zone(zone)
+ zone->max_pagecache_pages = (zone->present_pages *
+ sysctl_pagecache_ratio) / 100;
+ return 0;
+}
+
/*
* lowmem_reserve_ratio_sysctl_handler - just a wrapper around
* proc_dointvec() so that we can call setup_per_zone_lowmem_reserve()
Index: linux-2.6.20-rc5/mm/vmscan.c
===================================================================
--- linux-2.6.20-rc5.orig/mm/vmscan.c 2007-01-23 17:35:53.000000000 -0600
+++ linux-2.6.20-rc5/mm/vmscan.c 2007-01-23 18:20:19.118051138 -0600
@@ -932,6 +932,14 @@ static unsigned long shrink_zone(int pri
else
nr_inactive = 0;

+ /*
+ * If the page cache is too big then focus on page cache
+ * and ignore anonymous pages
+ */
+ if (sc->may_swap && zone_page_state(zone, NR_FILE_PAGES)
+ > zone->max_pagecache_pages)
+ sc->may_swap = 0;
+
while (nr_active || nr_inactive) {
if (nr_active) {
nr_to_scan = min(nr_active,
Index: linux-2.6.20-rc5/include/linux/mmzone.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/mmzone.h 2007-01-17 22:06:02.000000000 -0600
+++ linux-2.6.20-rc5/include/linux/mmzone.h 2007-01-23 18:22:11.473419856 -0600
@@ -167,6 +167,8 @@ struct zone {
*/
unsigned long lowmem_reserve[MAX_NR_ZONES];

+ unsigned long max_pagecache_pages;
+
#ifdef CONFIG_NUMA
int node;
/*
@@ -540,6 +542,8 @@ int sysctl_min_unmapped_ratio_sysctl_han
struct file *, void __user *, size_t *, loff_t *);
int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
struct file *, void __user *, size_t *, loff_t *);
+int sysctl_pagecache_ratio_sysctl_handler(struct ctl_table *, int,
+ struct file *, void __user *, size_t *, loff_t *);

#include <linux/topology.h>
/* Returns the number of the current Node. */
Index: linux-2.6.20-rc5/include/linux/swap.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/swap.h 2007-01-12 12:54:26.000000000 -0600
+++ linux-2.6.20-rc5/include/linux/swap.h 2007-01-23 18:18:43.943851519 -0600
@@ -192,6 +192,8 @@ extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern long vm_total_pages;

+extern int sysctl_pagecache_ratio;
+
#ifdef CONFIG_NUMA
extern int zone_reclaim_mode;
extern int sysctl_min_unmapped_ratio;


2007-01-24 02:58:43

by Kamezawa Hiroyuki

Subject: Re: [RFC] Limit the size of the pagecache

On Tue, 23 Jan 2007 16:49:55 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> If we enter reclaim and the number of page cache pages
> is too high then we switch off swapping during reclaim
> to avoid touching anonymous pages.

In general, I like this (kind of) feature.

> + /*
> + * If the page cache is too big then focus on page cache
> + * and ignore anonymous pages
> + */
> + if (sc->may_swap && zone_page_state(zone, NR_FILE_PAGES)
> + > zone->max_pagecache_pages)
> + sc->may_swap = 0;
> +


How about adding this (kind of) check?

	if (sc->may_swap &&
			zone_page_state(zone, NR_FILE_PAGES) > zone->max_pagecache_pages &&
			!(current->flags & PF_MEMALLOC))
		sc->may_swap = 0;

-Kame

2007-01-24 03:00:58

by Nick Piggin

Subject: Re: [RFC] Limit the size of the pagecache

Christoph Lameter wrote:
> This is a patch using some of Aubrey's work, plugging it in what is IMHO
> the right way. Feel free to improve on it. I have gotten repeated
> requests to be able to limit the pagecache. With the revised VM statistics
> this is now actually possible. I'd like to know more about possible uses of
> such a feature.
>
>
>
>
> It may be useful to limit the size of the page cache for various reasons
> such as
>
> 1. Ensure that anonymous pages that may contain performance
> critical data are never subject to swap.
>
> 2. Ensure rapid turnaround of pages in the cache.

So if these two aren't working properly at 100%, then I want to know the
reason why. Or at least see what the workload and the numbers look like.

>
> 3. Reserve memory for other uses? (Aubrey?)

Maybe. This is still a bad hack, and I don't like to legitimise such use
though. I hope Aubrey isn't relying on this alone for his device to work
because his customers might end up hitting fragmentation problems sooner
or later.

--
SUSE Labs, Novell Inc.

2007-01-24 03:01:51

by Christoph Lameter

Subject: Re: [RFC] Limit the size of the pagecache

On Wed, 24 Jan 2007, KAMEZAWA Hiroyuki wrote:

> if (sc->may_swap &&
> zone_page_state(zone, NR_FILE_PAGES) > zone->max_pagecache_pages &&
> !(current->flags & PF_MEMALLOC))
> sc->may_swap = 0;

That is probably better than what we have so far.

2007-01-24 03:16:04

by Christoph Lameter

Subject: Re: [RFC] Limit the size of the pagecache

On Wed, 24 Jan 2007, Nick Piggin wrote:

> > 1. Ensure that anonymous pages that may contain performance
> > critical data are never subject to swap.
> >
> > 2. Ensure rapid turnaround of pages in the cache.
>
> So if these two aren't working properly at 100%, then I want to know the
> reason why. Or at least see what the workload and the numbers look like.

The reason for the anonymous-page case may be that the data is rarely
touched but for some reason the pages must stay in memory. Rapid turnaround
is just one of the reasons that I vaguely recall, but I never really
understood what the purpose was.

> > 3. Reserve memory for other uses? (Aubrey?)
>
> Maybe. This is still a bad hack, and I don't like to legitimise such use
> though. I hope Aubrey isn't relying on this alone for his device to work
> because his customers might end up hitting fragmentation problems sooner
> or later.

I surely wish that Aubrey would give us some more clarity on
how this should work. Maybe the others who want this feature could also
speak up? I am not that clear on its purpose.

2007-01-24 03:20:23

by Kamezawa Hiroyuki

Subject: Re: [RFC] Limit the size of the pagecache


one more thing...

On Tue, 23 Jan 2007 16:49:55 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> @@ -1168,6 +1170,11 @@ zonelist_scan:
> !cpuset_zone_allowed_softwall(zone, gfp_mask))
> goto try_next_zone;
>
> + if ((gfp_mask & __GFP_PAGECACHE) &&
> + zone_page_state(zone, NR_FILE_PAGES) >
> + zone->max_pagecache_pages)
> + goto try_next_zone;
> +

I would prefer not to cause zone fallback with this.
It may use ZONE_DMA before exhausting ZONE_NORMAL (ia64), or
ZONE_NORMAL before ZONE_HIGHMEM (x86).
Very rapid page allocation can eat some amount of the lower zone.

Regards,
-Kame

2007-01-24 03:51:41

by Aubrey Li

Subject: Re: [RFC] Limit the size of the pagecache

On 1/24/07, Christoph Lameter <[email protected]> wrote:
> On Wed, 24 Jan 2007, Nick Piggin wrote:
>
> > > 1. Ensure that anonymous pages that may contain performance
> > > critical data are never subject to swap.
> > >
> > > 2. Ensure rapid turnaround of pages in the cache.
> >
> > So if these two aren't working properly at 100%, then I want to know the
> > reason why. Or at least see what the workload and the numbers look like.
>
> The reason for the anonymous-page case may be that the data is rarely
> touched but for some reason the pages must stay in memory. Rapid turnaround
> is just one of the reasons that I vaguely recall, but I never really
> understood what the purpose was.
>
> > > 3. Reserve memory for other uses? (Aubrey?)
> >
> > Maybe. This is still a bad hack, and I don't like to legitimise such use
> > though. I hope Aubrey isn't relying on this alone for his device to work
> > because his customers might end up hitting fragmentation problems sooner
> > or later.
>
> I surely wish that Aubrey would give us some more clarity on
> how this should work. Maybe the others who want this feature could also
> speak up? I am not that clear on its purpose.
>
Sorry for the delay. Somehow this thread was put into the spam folder
of my gmail box. :(
The patch I posted several days ago works properly on my side. I'm
working on the blackfin-uclinux platform, so I'm not sure it works 100% on
other architectures. From the O_DIRECT threads, I know different
people suffer from VFS pagecache issues for different reasons, so I
really hope the patch can be improved.

On my side, when the VFS pagecache eats up all of the available memory,
applications that want to allocate a largish block (order = 4?) will
fail. So the logic is as follows:

	if request is for pagecache
		watermark = min + reserved_pagecache
	else
		watermark = min

Here, assume min = 123 pages and reserved_pagecache = 200 pages. That means
that when the VFS pagecache has eaten up all of its available memory, there
are still 200 pages available for application allocations. Does that
make sense?

> I hope Aubrey isn't relying on this alone for his device to work
> because his customers might end up hitting fragmentation problems sooner
> or later.

That's true. I wrote a replacement for the buddy system; it's here:
http://lkml.org/lkml/2006/12/30/36

It improves the fragmentation problems on our platform.

Christoph - I can't find your original patch; can you send it again?
It would be great if you merged all of the enhancements.

Thanks,
-Aubrey

2007-01-24 04:03:42

by Nick Piggin

Subject: Re: [RFC] Limit the size of the pagecache

Aubrey Li wrote:
> On 1/24/07, Christoph Lameter <[email protected]> wrote:
>
>> On Wed, 24 Jan 2007, Nick Piggin wrote:
>>
>> > > 1. Ensure that anonymous pages that may contain performance
>> > > critical data are never subject to swap.
>> > >
>> > > 2. Ensure rapid turnaround of pages in the cache.
>> >
>> > So if these two aren't working properly at 100%, then I want to know
>> the
>> > reason why. Or at least see what the workload and the numbers look
>> like.
>>
>> The reason for the anonymous-page case may be that the data is rarely
>> touched but for some reason the pages must stay in memory. Rapid turnaround
>> is just one of the reasons that I vaguely recall, but I never really
>> understood what the purpose was.
>>
>> > > 3. Reserve memory for other uses? (Aubrey?)
>> >
>> > Maybe. This is still a bad hack, and I don't like to legitimise such
>> use
>> > though. I hope Aubrey isn't relying on this alone for his device to
>> work
>> > because his customers might end up hitting fragmentation problems
>> sooner
>> > or later.
>>
>> I surely wish that Aubrey would give us some more clarity on
>> how this should work. Maybe the others who want this feature could also
>> speak up? I am not that clear on its purpose.
>>
> Sorry for the delay. Somehow this thread was put into the spam folder
> of my gmail box. :(
> The patch I posted several days ago works properly on my side. I'm
> working on the blackfin-uclinux platform, so I'm not sure it works 100% on
> other architectures. From the O_DIRECT threads, I know different
> people suffer from VFS pagecache issues for different reasons, so I
> really hope the patch can be improved.

So we need to work out what those issues are and fix them.

> On my side, when the VFS pagecache eats up all of the available memory,
> applications that want to allocate a largish block (order = 4?) will
> fail. So the logic is as follows:

Yeah, it will be failing at order=4, because the allocator won't try
very hard to reclaim pagecache pages at that cutoff point. This needs to
be fixed in the allocator.

>> I hope Aubrey isn't relying on this alone for his device to work
>> because his customers might end up hitting fragmentation problems sooner
>> or later.
>
>
> That's true. I wrote a replacement of buddy system, it's here:
> http://lkml.org/lkml/2006/12/30/36.
>
> That can improve the fragmentation problems on our platform.

That might be a good idea, but while the buddy system may not seem
efficient and can waste space, it is actually really good for fragmentation.

Anyway, point being that you can't eliminate fragmentation, so you need
to cope with allocation failures or implement reserve pools if you want a
robust system.

--
SUSE Labs, Novell Inc.

2007-01-24 04:32:23

by Christoph Lameter

Subject: Re: [RFC] Limit the size of the pagecache

On Wed, 24 Jan 2007, KAMEZAWA Hiroyuki wrote:

> I don't prefer to cause zone fallback by this.
> This may use ZONE_DMA before exhausting ZONE_NORMAL (ia64),

Hmmm... We could use node_page_state instead of zone_page_state.

> Very rapid page allocation can eat some amount of the lower zone.

One question: For what purpose would you be using the page cache size
limitation?

2007-01-24 05:18:28

by Kamezawa Hiroyuki

Subject: Re: [RFC] Limit the size of the pagecache

On Tue, 23 Jan 2007 20:30:16 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> On Wed, 24 Jan 2007, KAMEZAWA Hiroyuki wrote:
>
> > I don't prefer to cause zone fallback by this.
> > This may use ZONE_DMA before exhausting ZONE_NORMAL (ia64),
>
> Hmmm... We could use node_page_state instead of zone_page_state.
>
> > Very rapid page allocation can eat some amount of the lower zone.
>
> One question: For what purpose would you be using the page cache size
> limitation?
>
This is my experience from the support desk for RHEL4.
(Therefore, it may not apply to the current kernel.)

- One for stability.
When a customer constructs their database (Oracle), the system often goes OOM.
This is because the system cannot allocate ZONE_DMA memory for 32-bit devices
(USB or e100).
Not allowing almost all pages to be used as page cache (for temporary data)
would be some help.
(Note: the DB was constructed on ext3, so all writes were serialized and the
system couldn't free page cache.)

- One for tuning.
Sometimes our customers ask us to limit the size of the page cache.

Many customers' memory usage reaches 99.x%. (This is a very common situation.)
If almost all memory is used by page cache, we can assume that it can be freed.
But the customer cannot estimate what amount of page cache can be freed
(without a performance regression).

When a customer wants to add a new application, he tunes the system.
But memory usage is always 99%.
A page-cache limit is useful when the customer tunes his system and tries to
tell his application's data apart from page cache.
(Of course, we could use some other, more complicated resource management
system for this.)
This would allow users to decide whether or not they need extra memory.

And... some customers want to keep as much memory free as possible.
99% memory usage makes them feel insecure ;)

-Kame



2007-01-24 05:47:47

by Aubrey Li

Subject: Re: [RFC] Limit the size of the pagecache

Christoph's patch is better than mine. The only thing I would add is that
zone->max_pagecache_pages should be checked so that it is never less than
zone->pages_low.

The good part of the patch is that it uses the existing reclaimer. But the
problem with the idea, in my opinion, is the existing reclaimer too. Think
of when the VFS cache limit is hit: the reclaimer doesn't reclaim all of
the reclaimable pages, it just gives a few back. So on the next pagecache
request it is quite possible that the reclaimer is triggered again. That
means that after the limit is hit, reclaim will run every time an fs
operation allocates memory. That's the point that, in my mind, impacts
the performance of applications.

-Aubrey

2007-01-24 07:04:56

by Vaidyanathan Srinivasan

Subject: Re: [RFC] Limit the size of the pagecache



Christoph Lameter wrote:
> This is a patch using some of Aubrey's work, plugging it in what is IMHO
> the right way. Feel free to improve on it. I have gotten repeated
> requests to be able to limit the pagecache. With the revised VM statistics
> this is now actually possible. I'd like to know more about possible uses of
> such a feature.
>
>
>
>
> It may be useful to limit the size of the page cache for various reasons
> such as
>
> 1. Ensure that anonymous pages that may contain performance
> critical data are never subject to swap.
>
> 2. Ensure rapid turnaround of pages in the cache.
>
> 3. Reserve memory for other uses? (Aubrey?)
>
> We add a new variable "pagecache_ratio" to /proc/sys/vm/ that
> defaults to 100 (all memory usable for the pagecache).
>
> The size of the pagecache is the number of file backed
> pages in a zone which is available through NR_FILE_PAGES.
>
> We skip zones that contain too many page cache pages in
> the page allocator which may cause us to enter reclaim.

Skipping the zone may not be a good idea. We can have a threshold
for reclaim instead, to avoid running the reclaim code too often.

> If we enter reclaim and the number of page cache pages
> is too high then we switch off swapping during reclaim
> to avoid touching anonymous pages.

This is a good idea, however there could be the following problems:

1. We may not find many unmapped pages in the given number of
pages to scan. We will have to iterate too much in shrink_zone and
artificially increase memory pressure in order to scan more pages
and find sufficient pagecache pages to free to bring the count under
the limit.
2. NR_FILE_PAGES includes the count of mapped pagecache pages; if we
turn off may_swap, then reclaim_mapped will also be off and we will not
remove mapped pagecache. This is correct because these pages are
'in use' relative to unmapped pagecache pages.

But the problem is in the limit comparison: we need to subtract
mapped pages before checking for overlimit.

3. We may want to write out dirty and referenced pagecache pages and
free them. Current shrink_zone looks for easily freeable pagecache
pages only, but if we set a 200MB limit and write out a 1GB file,
then all of the pages will be dirty, active and referenced and still
we will have to force the reclaimer to remove those pages.

Adding more scan control flags in reclaim can give us better
control. Please review http://lkml.org/lkml/2007/01/17/96 which
used new scan control flags.


> Signed-off-by: Christoph Lameter <[email protected]>
>
> Index: linux-2.6.20-rc5/include/linux/gfp.h
> ===================================================================
> --- linux-2.6.20-rc5.orig/include/linux/gfp.h 2007-01-12 12:54:26.000000000 -0600
> +++ linux-2.6.20-rc5/include/linux/gfp.h 2007-01-23 17:54:51.750696888 -0600
> @@ -46,6 +46,7 @@ struct vm_area_struct;
> #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
> #define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
> #define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
> +#define __GFP_PAGECACHE ((__force gfp_t)0x80000u) /* Page cache allocation */
>
> #define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
> #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
> Index: linux-2.6.20-rc5/include/linux/pagemap.h
> ===================================================================
> --- linux-2.6.20-rc5.orig/include/linux/pagemap.h 2007-01-12 12:54:26.000000000 -0600
> +++ linux-2.6.20-rc5/include/linux/pagemap.h 2007-01-23 18:13:14.310062155 -0600
> @@ -62,12 +62,13 @@ static inline struct page *__page_cache_
>
> static inline struct page *page_cache_alloc(struct address_space *x)
> {
> - return __page_cache_alloc(mapping_gfp_mask(x));
> + return __page_cache_alloc(mapping_gfp_mask(x)| __GFP_PAGECACHE);
> }
>
> static inline struct page *page_cache_alloc_cold(struct address_space *x)
> {
> - return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
> + return __page_cache_alloc(mapping_gfp_mask(x) |
> + __GFP_COLD | __GFP_PAGECACHE);
> }
>
> typedef int filler_t(void *, struct page *);
> Index: linux-2.6.20-rc5/include/linux/sysctl.h
> ===================================================================
> --- linux-2.6.20-rc5.orig/include/linux/sysctl.h 2007-01-12 12:54:26.000000000 -0600
> +++ linux-2.6.20-rc5/include/linux/sysctl.h 2007-01-23 18:17:09.285324555 -0600
> @@ -202,6 +202,7 @@ enum
> VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
> VM_VDSO_ENABLED=34, /* map VDSO into new processes? */
> VM_MIN_SLAB=35, /* Percent pages ignored by zone reclaim */
> + VM_PAGECACHE_RATIO=36, /* percent of RAM to use as page cache */
> };
>
>
> @@ -956,7 +957,6 @@ extern ctl_handler sysctl_intvec;
> extern ctl_handler sysctl_jiffies;
> extern ctl_handler sysctl_ms_jiffies;
>
> -
> /*
> * Register a set of sysctl names by calling register_sysctl_table
> * with an initialised array of ctl_table's. An entry with zero
> Index: linux-2.6.20-rc5/kernel/sysctl.c
> ===================================================================
> --- linux-2.6.20-rc5.orig/kernel/sysctl.c 2007-01-12 12:54:26.000000000 -0600
> +++ linux-2.6.20-rc5/kernel/sysctl.c 2007-01-23 18:24:04.763443772 -0600
> @@ -1023,6 +1023,17 @@ static ctl_table vm_table[] = {
> .extra2 = &one_hundred,
> },
> #endif
> + {
> + .ctl_name = VM_PAGECACHE_RATIO,
> + .procname = "pagecache_ratio",
> + .data = &sysctl_pagecache_ratio,
> + .maxlen = sizeof(sysctl_pagecache_ratio),
> + .mode = 0644,
> + .proc_handler = &sysctl_pagecache_ratio_sysctl_handler,
> + .strategy = &sysctl_intvec,
> + .extra1 = &zero,
> + .extra2 = &one_hundred,
> + },
> #ifdef CONFIG_X86_32
> {
> .ctl_name = VM_VDSO_ENABLED,
> Index: linux-2.6.20-rc5/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.20-rc5.orig/mm/page_alloc.c 2007-01-16 23:26:28.000000000 -0600
> +++ linux-2.6.20-rc5/mm/page_alloc.c 2007-01-23 18:11:40.484617205 -0600
> @@ -59,6 +59,8 @@ unsigned long totalreserve_pages __read_
> long nr_swap_pages;
> int percpu_pagelist_fraction;
>
> +int sysctl_pagecache_ratio = 100;
> +
> static void __free_pages_ok(struct page *page, unsigned int order);
>
> /*
> @@ -1168,6 +1170,11 @@ zonelist_scan:
> !cpuset_zone_allowed_softwall(zone, gfp_mask))
> goto try_next_zone;
>
> + if ((gfp_mask & __GFP_PAGECACHE) &&
> + zone_page_state(zone, NR_FILE_PAGES) >
> + zone->max_pagecache_pages)
> + goto try_next_zone;
> +

We should add some threshold and start reclaim within the zone,
keeping the natural/default memory allocation order across zones.

> if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
> unsigned long mark;
> if (alloc_flags & ALLOC_WMARK_MIN)
> @@ -2670,6 +2677,8 @@ static void __meminit free_area_init_cor
> / 100;
> zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
> #endif
> + zone->max_pagecache_pages =
> + (realsize * sysctl_pagecache_ratio) / 100;
> zone->name = zone_names[j];
> spin_lock_init(&zone->lock);
> spin_lock_init(&zone->lru_lock);
> @@ -3245,6 +3254,22 @@ int sysctl_min_slab_ratio_sysctl_handler
> }
> #endif
>
> +int sysctl_pagecache_ratio_sysctl_handler(ctl_table *table, int write,
> + struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
> +{
> + struct zone *zone;
> + int rc;
> +
> + rc = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
> + if (rc)
> + return rc;
> +
> + for_each_zone(zone)
> + zone->max_pagecache_pages = (zone->present_pages *
> + sysctl_pagecache_ratio) / 100;
> + return 0;
> +}
> +
> /*
> * lowmem_reserve_ratio_sysctl_handler - just a wrapper around
> * proc_dointvec() so that we can call setup_per_zone_lowmem_reserve()
> Index: linux-2.6.20-rc5/mm/vmscan.c
> ===================================================================
> --- linux-2.6.20-rc5.orig/mm/vmscan.c 2007-01-23 17:35:53.000000000 -0600
> +++ linux-2.6.20-rc5/mm/vmscan.c 2007-01-23 18:20:19.118051138 -0600
> @@ -932,6 +932,14 @@ static unsigned long shrink_zone(int pri
> else
> nr_inactive = 0;
>
> + /*
> + * If the page cache is too big then focus on page cache
> + * and ignore anonymous pages
> + */
> + if (sc->may_swap && zone_page_state(zone, NR_FILE_PAGES)
> + > zone->max_pagecache_pages)
> + sc->may_swap = 0;
> +

If we turn off swap, then we should not count mapped pagecache pages:

We should check (zone_page_state(zone, NR_FILE_PAGES) -
global_page_state(NR_FILE_MAPPED)) > zone->max_pagecache_pages

else reclaim will always miss the target.

> while (nr_active || nr_inactive) {
> if (nr_active) {
> nr_to_scan = min(nr_active,

nr_to_scan in shrink_zone may need some tweaking, since we may
potentially be looking for the 5 to 10% of pagecache pages that sit
in the active list. It may take too many iterations to move the
candidates to the inactive list and reclaim them.

Use of separate pagecache active and inactive lists may help here.
Going through the complete LRU page list to find relatively few pagecache
pages will be a performance bottleneck.

Please review thread http://lkml.org/lkml/2007/01/22/219 on two LRU
lists.

--Vaidy

> Index: linux-2.6.20-rc5/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.20-rc5.orig/include/linux/mmzone.h 2007-01-17 22:06:02.000000000 -0600
> +++ linux-2.6.20-rc5/include/linux/mmzone.h 2007-01-23 18:22:11.473419856 -0600
> @@ -167,6 +167,8 @@ struct zone {
> */
> unsigned long lowmem_reserve[MAX_NR_ZONES];
>
> + unsigned long max_pagecache_pages;
> +
> #ifdef CONFIG_NUMA
> int node;
> /*
> @@ -540,6 +542,8 @@ int sysctl_min_unmapped_ratio_sysctl_han
> struct file *, void __user *, size_t *, loff_t *);
> int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
> struct file *, void __user *, size_t *, loff_t *);
> +int sysctl_pagecache_ratio_sysctl_handler(struct ctl_table *, int,
> + struct file *, void __user *, size_t *, loff_t *);
>
> #include <linux/topology.h>
> /* Returns the number of the current Node. */
> Index: linux-2.6.20-rc5/include/linux/swap.h
> ===================================================================
> --- linux-2.6.20-rc5.orig/include/linux/swap.h 2007-01-12 12:54:26.000000000 -0600
> +++ linux-2.6.20-rc5/include/linux/swap.h 2007-01-23 18:18:43.943851519 -0600
> @@ -192,6 +192,8 @@ extern int vm_swappiness;
> extern int remove_mapping(struct address_space *mapping, struct page *page);
> extern long vm_total_pages;
>
> +extern int sysctl_pagecache_ratio;
> +
> #ifdef CONFIG_NUMA
> extern int zone_reclaim_mode;
> extern int sysctl_min_unmapped_ratio;

2007-01-24 07:57:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On Tue, 2007-01-23 at 16:49 -0800, Christoph Lameter wrote:
> This is a patch using some of Aubrey's work plugging it in what is IMHO
> the right way. Feel free to improve on it. I have gotten repeatedly
> requests to be able to limit the pagecache. With the revised VM statistics
> this is now actually possile. I'd like to know more about possible uses of
> such a feature.
>
>
>
>
> It may be useful to limit the size of the page cache for various reasons
> such as
>
> 1. Insure that anonymous pages that may contain performance
> critical data is never subject to swap.

This is what we have mlock for, no?

> 2. Insure rapid turnaround of pages in the cache.

This sounds like we either need more fadvise hints and/or understand why
the VM doesn't behave properly.

> 3. Reserve memory for other uses? (Aubrey?)

He wants to make a nommu system act like an mmu system; this will just
never ever work. Memory fragmentation is a real issue, not some gimmick
thought up by the hardware folks to sell these mmu chips.

> We add a new variable "pagecache_ratio" to /proc/sys/vm/ that
> defaults to 100 (all memory usable for the pagecache).
>
> The size of the pagecache is the number of file backed
> pages in a zone which is available through NR_FILE_PAGES.
>
> We skip zones that contain too many page cache pages in
> the page allocator which may cause us to enter reclaim.
>
> If we enter reclaim and the number of page cache pages
> is too high then we switch off swapping during reclaim
> to avoid touching anonymous pages.
>
> Signed-off-by: Christoph Lameter <[email protected]>

Code looks nice, however earlier responses have raised good points. Esp.
the one pointing out you'd need to defeat swappiness too.

That said, I'm not much in favour of a limit pagecache knob.

Esp. the "my customers are scared of the 99.9% memory used scenario" is
a clear case of educate them. We don't go fix psychological problems
with code.

The only maybe valid point would be 2, and I'd like to see if we can't
solve that differently - a better use-once logic comes to mind.

2007-01-24 12:33:49

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache



Christoph Lameter wrote:
> This is a patch using some of Aubrey's work plugging it in what is IMHO
> the right way. Feel free to improve on it. I have gotten repeatedly
> requests to be able to limit the pagecache. With the revised VM statistics
> this is now actually possile. I'd like to know more about possible uses of
> such a feature.
>
>

[snip]

Hi Christoph,

With your patch, mmap of a file that crosses the pagecache limit hangs the
system. As I mentioned in my previous mail, without subtracting
NR_FILE_MAPPED, reclaim will retry infinitely and fail.

I have tested your patch with the attached fix on my PPC64 box.

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>

---
mm/page_alloc.c | 3 ++-
mm/vmscan.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)

--- linux-2.6.20-rc5.orig/mm/page_alloc.c
+++ linux-2.6.20-rc5/mm/page_alloc.c
@@ -1171,7 +1171,8 @@ zonelist_scan:
goto try_next_zone;

if ((gfp_mask & __GFP_PAGECACHE) &&
- zone_page_state(zone, NR_FILE_PAGES) >
+ (zone_page_state(zone, NR_FILE_PAGES) -
+ zone_page_state(zone, NR_FILE_MAPPED)) >
zone->max_pagecache_pages)
goto try_next_zone;

--- linux-2.6.20-rc5.orig/mm/vmscan.c
+++ linux-2.6.20-rc5/mm/vmscan.c
@@ -936,7 +936,8 @@ static unsigned long shrink_zone(int pri
* If the page cache is too big then focus on page cache
* and ignore anonymous pages
*/
- if (sc->may_swap && zone_page_state(zone, NR_FILE_PAGES)
+ if (sc->may_swap && (zone_page_state(zone, NR_FILE_PAGES) -
+ zone_page_state(zone, NR_FILE_MAPPED))
> zone->max_pagecache_pages)
sc->may_swap = 0;

2007-01-24 12:51:00

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

Peter Zijlstra wrote:
> On Tue, 2007-01-23 at 16:49 -0800, Christoph Lameter wrote:

>>2. Insure rapid turnaround of pages in the cache.

[...]

> The only maybe valid point would be 2, and I'd like to see if we can't
> solve that differently - a better use-once logic comes to mind.

There must be something I'm missing with that point. The faster
the turnaround of pagecache pages, the *less* efficiently the
pagecache is working (assuming a rapid turnaround means a high
rate of pages brought into, then reclaimed from pagecache).

I can't argue that a smaller pagecache will be subject to a
higher turnaround given the same workload, but I don't know why
that would be a good thing.

--
SUSE Labs, Novell Inc.

2007-01-24 12:58:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On Wed, 2007-01-24 at 23:50 +1100, Nick Piggin wrote:
> Peter Zijlstra wrote:
> > On Tue, 2007-01-23 at 16:49 -0800, Christoph Lameter wrote:
>
> >>2. Insure rapid turnaround of pages in the cache.
>
> [...]
>
> > The only maybe valid point would be 2, and I'd like to see if we can't
> > solve that differently - a better use-once logic comes to mind.
>
> There must be something I'm missing with that point. The faster
> the turnaround of pagecache pages, the *less* efficiently the
> pagecache is working (assuming a rapid turnaround means a high
> rate of pages brought into, then reclaimed from pagecache).
>
> I can't argue that a smaller pagecache will be subject to a
> higher turnaround given the same workload, but I don't know why
> that would be a good thing.

I interpreted the issue as selecting the wrong pages for the 'working
set'. Like not quickly evicting pages from a large streaming read, which
then pushes out more useful pages.

2007-01-24 14:22:44

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On 1/24/07, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2007-01-23 at 16:49 -0800, Christoph Lameter wrote:
> > This is a patch using some of Aubrey's work plugging it in what is IMHO
> > the right way. Feel free to improve on it. I have gotten repeatedly
> > requests to be able to limit the pagecache. With the revised VM statistics
> > this is now actually possile. I'd like to know more about possible uses of
> > such a feature.
> >
> >
> >
> >
> > It may be useful to limit the size of the page cache for various reasons
> > such as
> >
> > 1. Insure that anonymous pages that may contain performance
> > critical data is never subject to swap.
>
> This is what we have mlock for, no?
>
> > 2. Insure rapid turnaround of pages in the cache.
>
> This sounds like we either need more fadvise hints and/or understand why
> the VM doesn't behave properly.
>
> > 3. Reserve memory for other uses? (Aubrey?)
>
> He wants to make a nommu system act like a mmu system; this will just
> never ever work.

Nope. Actually my nommu system works great with some patches made by us.
What makes you think this will never work?

>Memory fragmentation is a real issue not some gimmick
> thought up by the hardware folks to sell these mmu chips.
>
I totally disagree. Memory fragmentation is an issue not only on nommu,
but also on mmu chips. That's not the reason mmu chips are sold.

-Aubrey

2007-01-24 14:57:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On Wed, 2007-01-24 at 22:22 +0800, Aubrey Li wrote:
> On 1/24/07, Peter Zijlstra <[email protected]> wrote:
> > On Tue, 2007-01-23 at 16:49 -0800, Christoph Lameter wrote:
> > > This is a patch using some of Aubrey's work plugging it in what is IMHO
> > > the right way. Feel free to improve on it. I have gotten repeatedly
> > > requests to be able to limit the pagecache. With the revised VM statistics
> > > this is now actually possile. I'd like to know more about possible uses of
> > > such a feature.
> > >
> > >
> > >
> > >
> > > It may be useful to limit the size of the page cache for various reasons
> > > such as
> > >
> > > 1. Insure that anonymous pages that may contain performance
> > > critical data is never subject to swap.
> >
> > This is what we have mlock for, no?
> >
> > > 2. Insure rapid turnaround of pages in the cache.
> >
> > This sounds like we either need more fadvise hints and/or understand why
> > the VM doesn't behave properly.
> >
> > > 3. Reserve memory for other uses? (Aubrey?)
> >
> > He wants to make a nommu system act like a mmu system; this will just
> > never ever work.
>
> Nope. Actually my nommu system works great with some of patches made by us.
> What let you think this will never work?

Because there are perfectly valid things user-space can do to mess you
up. I forgot the test-case but it had something to do with opening a
million files, this will scatter slab pages all over the place.

Also, if you cycle your large user-space allocations a bit unluckily
you'll also fragment it into oblivion.

So you can not guarantee it will not fragment into smithereens, stopping
your user-space from using larger-than-page-size allocations.

If your user-space consists of several applications that do dynamic
memory allocation of various sizes, it's a matter of (run-) time before
things start failing.

If you prealloc a large area at boot time (like we now do for hugepages)
and use that for user-space, you might 'reset' the status quo by cycling
the whole of userspace.

> > Memory fragmentation is a real issue not some gimmick
> > thought up by the hardware folks to sell these mmu chips.
> >
> I totally disagree. Memory fragmentations is the issue not only on
> nommu, it's also on mmu chips. That's not the reason mmu chips can be
> sold.

For MMU-enabled chips these fragmentation issues (at the page allocation
level) will never reach (regular - !hugepages) user-space, exactly
because of the MMU: it makes things virtually contiguous.

Yes, there are problems in kernel space, esp. when we want to use huge
pages.

2007-01-24 14:59:11

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On Wed, 24 Jan 2007, Vaidyanathan Srinivasan wrote:

> With your patch, MMAP of a file that will cross the pagecache limit hangs the
> system. As I mentioned in my previous mail, without subtracting the
> NR_FILE_MAPPED, the reclaim will infinitely try and fail.

Well, mapped pages are still pagecache pages.

> I have tested your patch with the attached fix on my PPC64 box.

Interesting. What is your reason for wanting to limit the size of the
pagecache?

2007-01-24 14:59:29

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On Wed, 24 Jan 2007, Nick Piggin wrote:

> I can't argue that a smaller pagecache will be subject to a
> higher turnaround given the same workload, but I don't know why
> that would be a good thing.

Neither do I. I wonder why we need this, but I keep getting
these requests. Could we either find a reason for limiting the pagecache
or get this out of our system for good?

2007-01-24 20:14:32

by Erik Andersen

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On Wed Jan 24, 2007 at 06:58:42AM -0800, Christoph Lameter wrote:
> On Wed, 24 Jan 2007, Nick Piggin wrote:
>
> > I can't argue that a smaller pagecache will be subject to a
> > higher turnaround given the same workload, but I don't know why
> > that would be a good thing.
>
> Neither do I. Wonder why we need this but I keep getting
> these requests. Could we either find a reason for limiting the pagecache
> or get this out of our system for good?

I think this paints with too broad a brushstroke...

Simply limiting the page cache with no regard to the potential
for particular content to be later reused seems a rather
pointless exercise which is guaranteed to diminish system
performance.

It would be far more useful if an application could hint to the
pagecache as to which files are and which files are not worth
caching, especially when the application knows a priori that data
from a particular file will or will not ever be reused.

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

2007-01-25 00:39:19

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On Wed, 24 Jan 2007 14:15:10 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> And...some customers want to keep memory Free as much as possible.
> 99% memory usage makes insecure them ;)
>
If there is a way for the "free" command to show "never used" memory,
they will not complain ;).

But I can't think of a way to show that.
==
[kamezawa@aworks src]$ free
total used free shared buffers cached
Mem: 741604 724628 16976 0 62700 564600
-/+ buffers/cache: 97328 644276
Swap: 1052216 2532 1049684
==

If anyone has a good idea, could you teach me?

Regards,
-Kame

2007-01-25 02:27:45

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On 1/24/07, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2007-01-24 at 22:22 +0800, Aubrey Li wrote:
> > On 1/24/07, Peter Zijlstra <[email protected]> wrote:
> > > On Tue, 2007-01-23 at 16:49 -0800, Christoph Lameter wrote:
> > > > This is a patch using some of Aubrey's work plugging it in what is IMHO
> > > > the right way. Feel free to improve on it. I have gotten repeatedly
> > > > requests to be able to limit the pagecache. With the revised VM statistics
> > > > this is now actually possile. I'd like to know more about possible uses of
> > > > such a feature.
> > > >
> > > >
> > > >
> > > >
> > > > It may be useful to limit the size of the page cache for various reasons
> > > > such as
> > > >
> > > > 1. Insure that anonymous pages that may contain performance
> > > > critical data is never subject to swap.
> > >
> > > This is what we have mlock for, no?
> > >
> > > > 2. Insure rapid turnaround of pages in the cache.
> > >
> > > This sounds like we either need more fadvise hints and/or understand why
> > > the VM doesn't behave properly.
> > >
> > > > 3. Reserve memory for other uses? (Aubrey?)
> > >
> > > He wants to make a nommu system act like a mmu system; this will just
> > > never ever work.
> >
> > Nope. Actually my nommu system works great with some of patches made by us.
> > What let you think this will never work?
>
> Because there are perfectly valid things user-space can do to mess you
> up. I forgot the test-case but it had something to do with opening a
> million files, this will scatter slab pages all over the place.
>
> Also, if you cycle your large user-space allocations a bit unluckily
> you'll also fragment it into oblivion.
>
> So you can not guarantee it will not fragment into smithereens stopping
> your user-space from using large than page size allocations.
>
> If your user-space consists of several applications that do dynamic
> memory allocation of various sizes its a matter of (run-) time before
> things will start failing.
>
> If you prealloc a large area at boot time (like we now do for hugepages)
> and use that for user-space, you might 'reset' the status quo by cycling
> the whole of userspace.
>

It seems you are talking about a perfect system. Opening a million
files will never be a requirement of my system. You know I'm working
on an embedded system; most of the time the whole system runs just
one application, and if I can guarantee this application works forever, I
think that's enough. I'm not trying to make a nommu system act like an
mmu system, which is impossible; I'm just making my nommu system work.

-Aubrey

2007-01-25 02:42:05

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On Thu, 25 Jan 2007, KAMEZAWA Hiroyuki wrote:

> On Wed, 24 Jan 2007 14:15:10 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > And...some customers want to keep memory Free as much as possible.
> > 99% memory usage makes insecure them ;)
> >
> If there is a way that the "free" command can show "never used" memory,
> they will not complain ;).
>
> But I can't think of the way to show that.
> ==
> [kamezawa@aworks src]$ free
> total used free shared buffers cached
> Mem: 741604 724628 16976 0 62700 564600
> -/+ buffers/cache: 97328 644276
> Swap: 1052216 2532 1049684
> ==

Could we call the free memory "unused memory" and not talk about free
memory at all?

2007-01-25 02:42:51

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On Wed, 24 Jan 2007, Erik Andersen wrote:

> It would be far more useful if an application could hint to the
> pagecache as to which files are and which files as not worth
> caching, especially when the application knows a priori that data
> from a particular file will or will not ever be reused.

It can give such hints via madvise(2).

2007-01-25 03:18:57

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On Wed, 24 Jan 2007 18:41:27 -0800 (PST)
Christoph Lameter <[email protected]> wrote:
> > But I can't think of the way to show that.
> > ==
> > [kamezawa@aworks src]$ free
> > total used free shared buffers cached
> > Mem: 741604 724628 16976 0 62700 564600
> > -/+ buffers/cache: 97328 644276
> > Swap: 1052216 2532 1049684
> > ==
>
> Could we call the free memory "unused memory" and not talk about free
> memory at all?
>
Ah, maybe that's better.

I have run into several memory troubles on users' systems recently (on older
kernels), with hundreds or thousands of processes running on them.

When I explain memory management to customers, I divide memory into:

(1) unused memory --- memory which is not used; in the free lists of the zones.

(2) reclaimable memory --- page cache, which is reclaimable:
clean pages --- can be reclaimed soon
dirty pages --- need to be written back
*BUT* busy pages are unreclaimable.

(3) swappable memory --- user processes' pages; basically reclaimable if
swap is available.
shmem pages are included here.

(4) locked memory --- mlocked memory, which is not reclaimable (but movable).

(5) kernel memory --- used by the kernel
(and we can't see how many pages are reclaimable).

We can know the amount of (1) and (5) and the total memory.
Basically, (3) = (Total) - (2) - (1).
The busy data-set of (2) and (3) is not reclaimable, but the amount of that
busy data-set is unknown. Many users take logs of 'ps' or 'sar' to estimate
their memory usage (and sometimes the page cache of the log file eats their
memory...).

The amount of (4) is unknown. But there was a system where 6GB of its 8GB
of memory was mlocked (--; and the OOM killer fired.

I'm sorry that I haven't caught up on how the current kernel can show memory
usage. I should investigate that.

FYI:
Because some customers are migrated from mainframes, they want to control
almost all features in OS, IOW, designing memory usages.

-Kame


2007-01-25 04:17:23

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache



Christoph Lameter wrote:
> On Wed, 24 Jan 2007, Vaidyanathan Srinivasan wrote:
>
>> With your patch, MMAP of a file that will cross the pagecache limit hangs the
>> system. As I mentioned in my previous mail, without subtracting the
>> NR_FILE_MAPPED, the reclaim will infinitely try and fail.
>
> Well mapped pages are still pagecache pages.
>

Yes, but they can be classified as process RSS pages. Whether it
is an anon page, shared mem, or an mmap of pagecache, it would show up
under RSS. Those pages can be limited by an RSS limiter similar to the
pagecache limiter we are discussing. In my opinion, once a
file page is mapped by a process, it should be treated on par
with anon pages. Application programs generally do not mmap a file
if the reuse of its content is very low.

>> I have tested your patch with the attached fix on my PPC64 box.
>
> Interesting. What is your reason for wanting to limit the size of the
> pagecache?

1. Systems primarily running database workloads would benefit if
background housekeeping applications like backup processes did not
fill the pagecache. Databases use O_DIRECT, and we do not want the
kernel to evict even cold pages belonging to that application to make
room for pagecache that is going to be used by an unimportant backup
application. The objective is to have some limit on pagecache usage,
making the backup application take all the performance hit with
zero impact on the main database workload.

Solutions:

* The backup applications could use O_DIRECT as well, but this is not
very flexible since there are restrictions on using O_DIRECT.

Please review http://lkml.org/lkml/2007/1/4/55 for issues with O_DIRECT

* Improve fadvise to specify caching behavior. Right now we only model
the readahead behavior. However, this would need a change in all
applications and more command line options.

* The technique we are discussing right now can serve the purpose.

2. In the context of 'containers' and per-container resource
management, there is a need to restrict the resources utilized by each of
the process groups within the container. Resources like CPU time,
RSS, pagecache usage, IO bandwidth, etc. may have to be controlled for
each process group.

Some of today's open virtualisation solutions, like UML and KVM
instances among others, also need to control CPU time, RSS, and
(unmapped) pagecache pages to be able to successfully execute
commercial workloads within their virtual environments. Each of these
instances is a normal Linux process within the host kernel.

--Vaidy

2007-01-25 04:23:13

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

Christoph Lameter wrote:
> This is a patch using some of Aubrey's work plugging it in what is IMHO
> the right way. Feel free to improve on it. I have gotten repeatedly
> requests to be able to limit the pagecache.

IMHO it's a bad hack.

It would be better to identify the problem this "feature" is
trying to fix, and then fix the root cause.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

2007-01-25 04:29:43

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

KAMEZAWA Hiroyuki wrote:

> FYI:
> Because some customers are migrated from mainframes, they want to control
> almost all features in OS, IOW, designing memory usages.

Don't you mean:

"Because some customers are migrating from mainframes, they are
used to needing to control all features in OS" ? :)


2007-01-25 04:53:56

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

Vaidyanathan Srinivasan wrote:

> In my opinion, once a
> file page is mapped by the process, then it should be treated at par
> with anon pages. Application programs generally do not mmap a file
> page if the reuse for the content is very low.

Why not have the VM measure this, instead of making wild
assumptions about every possible workload out there?

There are a few databases out there that mmap the whole
thing. Sleepycat for one...


2007-01-25 05:24:01

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On Wed, 24 Jan 2007 23:28:15 -0500
Rik van Riel <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:
>
> > FYI:
> > Because some customers are migrated from mainframes, they want to control
> > almost all features in OS, IOW, designing memory usages.
>
> Don't you mean:
>
> "Because some customers are migrating from mainframes, they are
> used to needing to control all features in OS" ? :)
>
Ah yes ;)
I always say Linux is different from mainframes.

--
Because some customers have been migrated from mainframes,
they expected that they could do what they did on mainframes.
They want to control almost all features of the OS, but they can't now.
This means they can't use their experience and schemes from the old days.
--

Because they are studying Linux now, the situation may change in the future, I think.


Thanks,
-Kame

2007-01-25 05:44:11

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

KAMEZAWA Hiroyuki wrote:
> On Wed, 24 Jan 2007 23:28:15 -0500
> Rik van Riel <[email protected]> wrote:
>
>> KAMEZAWA Hiroyuki wrote:
>>
>>> FYI:
>>> Because some customers are migrated from mainframes, they want to control
>>> almost all features in OS, IOW, designing memory usages.
>> Don't you mean:
>>
>> "Because some customers are migrating from mainframes, they are
>> used to needing to control all features in OS" ? :)
>>
> Ah yes ;)
> I always says Linux is different from mainframes.

It's not just about Linux.

Applications behave differently too from the way they were 15
years ago.

Some databases, eg. sleepycat's db, map the whole database in
memory. Other databases, like MySQL and postgresql, rely on
the kernel's page cache to cache the most frequently accessed
data.

To make matters more interesting, memory sizes have increased
by a factor of 1000, but disk seek times have only gotten 10 times
faster. This means that simplistic memory management algorithms
can hurt performance a lot more than they could back then.

In short, I am not convinced that any of the simple tunable knobs
from the "good old days" will do much to actually help people
with modern workloads on modern computers.


2007-01-25 05:49:46

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache



Rik van Riel wrote:
> Vaidyanathan Srinivasan wrote:
>
>> In my opinion, once a
>> file page is mapped by the process, then it should be treated at par
>> with anon pages. Application programs generally do not mmap a file
>> page if the reuse for the content is very low.
>
> Why not have the VM measure this, instead of making wild
> assumptions about every possible workload out there?

Yes, the VM page aging and page replacement algorithms should decide on the
relevance of an anon or mmapped page. However, we may still need to limit
the total pages in memory for a given set of processes.

> There are a few databases out there that mmap the whole
> thing. Sleepycat for one...
>

That is why my suggestion is not to touch mmapped pagecache
pages in the current pagecache limit code. The limit should concern
only unmapped pagecache pages.

When the application unmaps the pages, we instantly go over the
limit, and the now-unmapped pages can be reclaimed. This behavior has
been verified with my fix on top of Christoph's patch.

2007-01-25 06:02:26

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On Thu, 25 Jan 2007 00:40:54 -0500
Rik van Riel <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Wed, 24 Jan 2007 23:28:15 -0500
> > Rik van Riel <[email protected]> wrote:
> >
> >> KAMEZAWA Hiroyuki wrote:
> > I always says Linux is different from mainframes.
>
> It's not just about Linux.
>
> Applications behave differently too from the way they were 15
> years ago.
>
> Some databases, eg. sleepycat's db, map the whole database in
> memory. Other databases, like MySQL and postgresql, rely on
> the kernel's page cache to cache the most frequently accessed
> data.
>
> To make matters more interesting, memory sizes have increased
> by a factor 1000, but disk seek times have only gotten 10 times
> faster. This means that simplistic memory management algorithms
> can hurt performance a lot more than they could back then.
>
> In short, I am not convinced that any of the simple tunable knobs
> from the "good old days" will do much to actually help people
> with modern workloads on modern computers.
>
I agree.

My current concern is not adding knobs but how to show/explain what
users are doing. In most cases, users don't know what they are doing
and believe the system information can tell them.

For example, a user sometimes asks: "why are the amounts of pagecache
on system A and system B different from each other? I definitely ran
the same jobs on both systems."

...just because he used different data-sets ;)

Thanks,
-Kame


2007-01-25 06:35:31

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On 1/25/07, Vaidyanathan Srinivasan <[email protected]> wrote:
>
>
> Christoph Lameter wrote:
> > On Wed, 24 Jan 2007, Vaidyanathan Srinivasan wrote:
> >
> >> With your patch, MMAP of a file that will cross the pagecache limit hangs the
> >> system. As I mentioned in my previous mail, without subtracting the
> >> NR_FILE_MAPPED, the reclaim will infinitely try and fail.
> >
> > Well mapped pages are still pagecache pages.
> >
>
> Yes, but they can be classified under a process RSS pages. Whether it
> is an anon page or shared mem or mmap of pagecache, it would show up
> under RSS. Those pages can be limited by RSS limiter similar to the
> one we are discussing in pagecache limiter. In my opinion, once a
> file page is mapped by the process, then it should be treated at par
> with anon pages. Application programs generally do not mmap a file
> page if the reuse for the content is very low.
>

I agree, we shouldn't take mmapped pages into account.
But Vaidy - even with your patch, we are still using the existing
reclaimer, which means we don't ensure that only the page cache is
reclaimed/limited; mapped pages will be hit as well.
I think we still need to add a new scan control field to protect mmapped
pages and reclaim unmapped pagecache pages only.

-Aubrey

2007-01-25 06:40:33

by Al Boldi

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

Rik van Riel wrote:
> Christoph Lameter wrote:
> > This is a patch using some of Aubrey's work plugging it in what is IMHO
> > the right way. Feel free to improve on it. I have gotten repeatedly
> > requests to be able to limit the pagecache.
>
> IMHO it's a bad hack.
>
> It would be better to identify the problem this "feature" is
> trying to fix, and then fix the root cause.

Ok, here is the problem: kswapd.

Limiting the page-cache memory avoids invoking kswapd needlessly, aiding
performance and easing OOM pressure.

I tried the patch; it works.

But it needs a bit of debugging: setting pagecache_ratio = 1 either
deadlocks or reduces throughput to < 1 MB/s.


Thanks!

--
Al

2007-01-25 06:50:18

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On Thu, 25 Jan 2007, Aubrey Li wrote:

> But Vaidy - even with your patch, we are still using the existing
> reclaimer, that means we dont ensure that only page cache is
> reclaimed/limited. mapped pages will be hit also.
> I think we still need to add a new scancontrol field to lock mmaped
> pages and remove unmapped pagecache pages only.

Setting sc->swappiness to zero will make the reclaimer hit
unmapped pages until we get into problems. Maybe set it to some negative
value to avoid reclaim_mapped being set to 1 in shrink_active_list?

Oh, but reclaim_mapped stays at zero anyway if may_swap is off. So
we are already fine.

I still wonder why you are doing this at all. If you just run your own app
on the box, then preallocate your higher-order allocations from user space.
Much less trouble.


2007-01-25 07:51:34

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache



Aubrey Li wrote:
> On 1/25/07, Vaidyanathan Srinivasan <[email protected]> wrote:
>>
>> Christoph Lameter wrote:
>>> On Wed, 24 Jan 2007, Vaidyanathan Srinivasan wrote:
>>>
>>>> With your patch, MMAP of a file that will cross the pagecache limit hangs the
>>>> system. As I mentioned in my previous mail, without subtracting the
>>>> NR_FILE_MAPPED, the reclaim will infinitely try and fail.
>>> Well mapped pages are still pagecache pages.
>>>
>> Yes, but they can be classified under a process RSS pages. Whether it
>> is an anon page or shared mem or mmap of pagecache, it would show up
>> under RSS. Those pages can be limited by RSS limiter similar to the
>> one we are discussing in pagecache limiter. In my opinion, once a
>> file page is mapped by the process, then it should be treated at par
>> with anon pages. Application programs generally do not mmap a file
>> page if the reuse for the content is very low.
>>
>
> I agree, we shouldn't take mmapped page into account.
> But Vaidy - even with your patch, we are still using the existing
> reclaimer, that means we dont ensure that only page cache is
> reclaimed/limited. mapped pages will be hit also.
> I think we still need to add a new scancontrol field to lock mmaped
> pages and remove unmapped pagecache pages only.

I have tried to add scan control to Roy's patch at
http://lkml.org/lkml/2007/01/17/96

In that patch, we search and remove only pages that are not mapped.
We also remove referenced and hot pagecache pages which the normal
reclaimer is not expected to consider.

I will try to fit that logic in Christoph's patch and test.

--Vaidy

> -Aubrey
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2007-01-25 08:07:22

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache



Christoph Lameter wrote:
> On Wed, 24 Jan 2007, Erik Andersen wrote:
>
>> It would be far more useful if an application could hint to the
>> pagecache as to which files are and which files as not worth
>> caching, especially when the application knows a priori that data
>> from a particular file will or will not ever be reused.
>
> It can give such hints via madvise(2).

I think you meant fadvise. That is certainly a possibility which we
need to work on. Current implementation of fadvise only throttles
read ahead in case of sequential access and flushes the file in case
of DONTNEED. We leave it at default for NOREUSE.

In the case of DONTNEED and NOREUSE, we need to limit the pages used for
page cache and also reclaim them as soon as possible. The interaction of
mmap() and fadvise is a little more difficult to handle.

--Vaidy


2007-01-25 08:24:27

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache



Al Boldi wrote:
> Rik van Riel wrote:
>> Christoph Lameter wrote:
>>> This is a patch using some of Aubrey's work plugging it in what is IMHO
>>> the right way. Feel free to improve on it. I have gotten repeatedly
>>> requests to be able to limit the pagecache.
>> IMHO it's a bad hack.
>>
>> It would be better to identify the problem this "feature" is
>> trying to fix, and then fix the root cause.
>
> Ok, here is the problem: kswapd.
>
> Limiting the page-cache memory inhibits invoking kswapd needlessly, aiding
> performance and easing OOM pressures.

Apart from kswapd, limiting pagecache helps application performance by
not eating away their ANON pages or other parts of their resident data
set. When there is enough free memory there is no performance issue,
but memory is always utilized to the max. Hence every pagecache page
that is allocated must come either from some application's RSS or from
a cold pagecache page. If the page was stolen from an application,
that application pays the price of swapping or reading the page back
into memory. This is the scenario we want to avoid. All we are trying
to achieve is that the pagecache evicts an (unmapped) pagecache page
rather than stealing memory from an important application's resident
set.

Certainly this should be a configurable option and the kernel's
behavior should not be changed in general.

> I tried the patch; it works.

:)

> But it needs a bit of debugging. Setting pagecache_ratio = 1 either
> deadlocks or reduces thru-put to < 1mb/s.

Yes, going below 5% on my 1GB RAM machine causes severe performance
problems. We need to hard-wire a reasonable lower limit and not hand
the end user a noose to hang himself with!

--Vaidy

>
> Thanks!
>
> --
> Al
>

2007-01-25 10:29:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache


> Apart from kswapd, limiting pagecache helps performance of
> applications by not eating away their ANON pages or other parts of its
> resident data set. When there is enough free memory, then there is no
> performance issue. However memory is always utilized to the max.
> Hence every pagecache page that is allocated should come from some
> application's RSS, or from cold pagecache page. If that page was
> stolen from some application, then that application pays the price for
> swapping or reading the page back to memory. This scenario is what we
> want to avoid. All that we are trying to achieve is that pagecache
> eats a (unmapped) pagecache page and not steal memory from other
> important application's resident set.
>
> Certainly this should be a configurable option and kernel's behavior
> should not be changed in general.

Ah, this would be a clear case of the page reclaim selecting the wrong
working set.

It is perfectly fine for a page cache page to evict an app page (be it
anon or not) if that page cache page is used more frequently than the
app page in question.

Trouble seems to be that the current algorithm gets it quite wrong at
times.

Also, stating that free memory is somehow good for you is weird; free
memory is a loss, since you under-utilise your machine. Keeping clean
pagecache pages around that are likely to be referenced again is a
clear win; it saves the tediously slow load from disk.

So you're now proposing to limit the page cache, whereas it's clear that
the better solution would be to tune the replacement policy (and/or
provide hints to said mechanism using madvise/fadvise).

2007-01-25 11:24:11

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache



Peter Zijlstra wrote:
>> Apart from kswapd, limiting pagecache helps performance of
>> applications by not eating away their ANON pages or other parts of its
>> resident data set. When there is enough free memory, then there is no
>> performance issue. However memory is always utilized to the max.
>> Hence every pagecache page that is allocated should come from some
>> application's RSS, or from cold pagecache page. If that page was
>> stolen from some application, then that application pays the price for
>> swapping or reading the page back to memory. This scenario is what we
>> want to avoid. All that we are trying to achieve is that pagecache
>> eats a (unmapped) pagecache page and not steal memory from other
>> important application's resident set.
>>
>> Certainly this should be a configurable option and kernel's behavior
>> should not be changed in general.
>
> Ah, this would be a clear case of the page reclaim selecting the wrong
> working set.
>
> It is perfectly fine for a page cache page to evict a app page (be it
> anon or not) if that page cache page is used more frequently than the
> app page in question.

Well, this is true only as long as all applications running in the
system are graded equally and it is the kernel's job to provide the
best of the system's resources to all applications.

> Trouble seems to be that the current algorithm gets it quite wrong at
> times.

The current reclaim code does a good job based on the assumption that
pages belonging to different applications have equal priority. The
aging of the page is independent of application's priority or class.
This is good for best overall system performance.

The new use case challenging this assumption is that application
groups fall into different classes on the same system, and there is a
need to make a certain class perform better at the cost of certain
other classes of applications. In this scenario system performance is
judged not by overall average throughput but by the performance of
certain classes of applications only.

A backup job running on a database server can take any amount of
performance hit to marginally improve database performance, since that
is what the users care about. We would run into similar situations
when running various virtualization and consolidation solutions.

> Also stating that free memory somehow is good for you is weird, free
> memory is a loss, you under utilise your machine. Keeping clean
> pagecache pages in there that are likely to be referenced again is a
> clear win; it saves the tediously slow load from disk.

Agreed

>
> So you're now proposing to limit the page cache were as its clear that
> the better solution would be to tune replacement policy (and or provide
> hints to said mechanism using madvise/fadvise)

Well, we may need both approaches. Hints via madvise/fadvise are
definitely a good approach and the kernel should take these hints
aggressively. Yet even with these hints we may want to have limits in
the interest of other applications that do not use the pagecache.

A system-wide limit on pagecache may not sound very interesting, but
if we think about 'containers' and groups of processes having such
limits, there are more practical use cases. An aggregation of
processes with a limit on pagecache would give relative importance to
certain classes of pages during page replacement. Controlling limits
among groups of applications will help achieve peak performance for
the applications that we care about.

--Vaidy

2007-01-25 11:55:52

by Al Boldi

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

Peter Zijlstra wrote:
> > Apart from kswapd, limiting pagecache helps performance of
> > applications by not eating away their ANON pages or other parts of its
> > resident data set. When there is enough free memory, then there is no
> > performance issue. However memory is always utilized to the max.
> > Hence every pagecache page that is allocated should come from some
> > application's RSS, or from cold pagecache page. If that page was
> > stolen from some application, then that application pays the price for
> > swapping or reading the page back to memory. This scenario is what we
> > want to avoid. All that we are trying to achieve is that pagecache
> > eats a (unmapped) pagecache page and not steal memory from other
> > important application's resident set.
> >
> > Certainly this should be a configurable option and kernel's behavior
> > should not be changed in general.
>
> Ah, this would be a clear case of the page reclaim selecting the wrong
> working set.

Yes.

> It is perfectly fine for a page cache page to evict a app page (be it
> anon or not) if that page cache page is used more frequently than the
> app page in question.

It seems that there is currently a clear preference for a pagecache
page over an app page. Some form of priority selection could probably
aid the situation.

> Trouble seems to be that the current algorithm gets it quite wrong at
> times.

It breaks down when memory gets tight. You can actually hear the disk
thrashing, although it's not supposed to thrash, even with swap off.

> Also stating that free memory somehow is good for you is weird, free
> memory is a loss, you under utilise your machine. Keeping clean
> pagecache pages in there that are likely to be referenced again is a
> clear win; it saves the tediously slow load from disk.

That's the theory.

> So you're now proposing to limit the page cache

As a workaround.

> where as its clear that
> the better solution would be to tune replacement policy

Yes. Hopefully successfully.

> (and or provide
> hints to said mechanism using madvise/fadvise)

Not feasible; source is sometimes not immediately available.


Thanks!

--
Al

2007-01-25 12:02:37

by Al Boldi

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

Vaidyanathan Srinivasan wrote:
> Al Boldi wrote:
> > Rik van Riel wrote:
> >> Christoph Lameter wrote:
> >>> This is a patch using some of Aubrey's work plugging it in what is
> >>> IMHO the right way. Feel free to improve on it. I have gotten
> >>> repeatedly requests to be able to limit the pagecache.
> >>
> >> IMHO it's a bad hack.
> >>
> >> It would be better to identify the problem this "feature" is
> >> trying to fix, and then fix the root cause.
> >
> > Ok, here is the problem: kswapd.
> >
> > Limiting the page-cache memory inhibits invoking kswapd needlessly,
> > aiding performance and easing OOM pressures.
>
> Apart from kswapd, limiting pagecache helps performance of
> applications by not eating away their ANON pages or other parts of its
> resident data set. When there is enough free memory, then there is no
> performance issue. However memory is always utilized to the max.
> Hence every pagecache page that is allocated should come from some
> application's RSS, or from cold pagecache page. If that page was
> stolen from some application, then that application pays the price for
> swapping or reading the page back to memory. This scenario is what we
> want to avoid. All that we are trying to achieve is that pagecache
> eats a (unmapped) pagecache page and not steal memory from other
> important application's resident set.

Agreed 100%. Thanks for expanding exactly what I meant.

> Certainly this should be a configurable option and kernel's behavior
> should not be changed in general.
>
> > I tried the patch; it works.
> >
> :)
> > But it needs a bit of debugging. Setting pagecache_ratio = 1 either
> > deadlocks or reduces thru-put to < 1mb/s.
>
> Yes, going below 5% on my 1GB RAM machine causes severe performance
> problems. We need to hard wire a reasonable lower limit and not
> provide a noose for the end user to tie around!

One reason to test the full range of settings is to expose underlying
system problems, like scalability. By limiting the range, you only
hide a problem that would otherwise be exposed.


Thanks!

--
Al

2007-01-25 14:52:17

by Bodo Eggert

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

Peter Zijlstra <[email protected]> wrote:
> On Wed, 2007-01-24 at 22:22 +0800, Aubrey Li wrote:
>> On 1/24/07, Peter Zijlstra <[email protected]> wrote:

>> > He wants to make a nommu system act like a mmu system; this will just
>> > never ever work.
>>
>> Nope. Actually my nommu system works great with some of patches made by us.
>> What let you think this will never work?
>
> Because there are perfectly valid things user-space can do to mess you
> up. I forgot the test-case but it had something to do with opening a
> million files, this will scatter slab pages all over the place.

a) Limit the number of open files.
b) Don't do that then.

> Also, if you cycle your large user-space allocations a bit unluckily
> you'll also fragment it into oblivion.
>
> So you can not guarantee it will not fragment into smithereens stopping
> your user-space from using large than page size allocations.

Therefore you should purposely increase the mess up to the point where the
system is guaranteed not to work? IMO you should rather put the other issues
onto the TODO list.

BTW: I'm not sure a hard limit is the right thing to do for mmu
systems; I'd rather implement high and low watermarks: if one pool is
larger than its high watermark, it will be the next to get its pages
evicted, and it won't lose pages if it's at the lower watermark.

> If your user-space consists of several applications that do dynamic
> memory allocation of various sizes its a matter of (run-) time before
> things will start failing.
>
> If you prealloc a large area at boot time (like we now do for hugepages)
> and use that for user-space, you might 'reset' the status quo by cycling
> the whole of userspace.

Preallocating the page cache (and maybe the slab space?) may very well be
the right thing to do for nommu systems. It worked quite well in DOS times
and on old Macs.
--
Funny quotes:
30. Why is a person who plays the piano called a pianist but a person who
drives a race car not called a racist?

2007-01-25 16:14:48

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

Vaidyanathan Srinivasan wrote:
> Rik van Riel wrote:

>> There are a few databases out there that mmap the whole
>> thing. Sleepycat for one...
>
> That is why my suggestion would be not to touch mmapped pagecache
> pages in the current pagecache limit code. The limit should concern
> only unmapped pagecache pages.

So you want to limit how much data the kernel caches for mysql
or postgresql, but not limit how much of the rpm database is
cached ?!

IMHO your proposal does the exact opposite of what would be
right for my systems :)

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

2007-01-25 17:58:30

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

Rik van Riel wrote:
> Vaidyanathan Srinivasan wrote:
>> Rik van Riel wrote:
>
>>> There are a few databases out there that mmap the whole
>>> thing. Sleepycat for one...
>>
>> That is why my suggestion would be not to touch mmapped pagecache
>> pages in the current pagecache limit code. The limit should concern
>> only unmapped pagecache pages.
>
> So you want to limit how much data the kernel caches for mysql
> or postgresql, but not limit how much of the rpm database is
> cached ?!
>
> IMHO your proposal does the exact opposite of what would be
> right for my systems :)
>

<Jumping in late into the discussion>

One scenario I can think of is

A group of I/O intensive tasks can cause readahead and
dirty page I/O and make good forward progress, but
they'll hurt another group of processes by swapping
their pages out. How do we ensure fair forward progress?
The system administrator can currently control swappiness
by setting it, but swappiness is a reclaim-time control
parameter.

We can control dirty page I/O by setting vm_dirty_ratio.
Readahead is also tuneable with fadvise(), but not many
applications use fadvise.

The question now is: is it easier for the system administrator to
say, limit my page cache usage to, say, 30% of total memory, so that
other allocations do not have to wait on disk I/O or page reclaim
(consider slab allocations and other kernel data structures)?
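For reference, the existing knobs mentioned above can be inspected like this (a config fragment; /dev/sda is only an example device name):

```shell
# Reclaim-time bias between pagecache and anon pages (0..100)
cat /proc/sys/vm/swappiness

# Ceiling on dirty pagecache, as a percentage of RAM, before
# writers are throttled
cat /proc/sys/vm/dirty_ratio

# Per-device readahead window, in 512-byte sectors (example device)
blockdev --getra /dev/sda
```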

A low priority task might run infrequently and end up spending all
its time either swapping in pages or reclaiming memory, and by the
time it runs again it ends up doing the same thing.

I understand the swap token mitigates this problem to some extent,
but limiting the page cache will give the system administrator
control over system memory behaviour.

--
Balbir Singh
Linux Technology Center
IBM, ISTL

2007-01-25 18:37:47

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache



Al Boldi wrote:
> Vaidyanathan Srinivasan wrote:
>> Al Boldi wrote:
>>> Rik van Riel wrote:
>>>> Christoph Lameter wrote:
>>>>> This is a patch using some of Aubrey's work plugging it in what is
>>>>> IMHO the right way. Feel free to improve on it. I have gotten
>>>>> repeatedly requests to be able to limit the pagecache.
>>>> IMHO it's a bad hack.
>>>>
>>>> It would be better to identify the problem this "feature" is
>>>> trying to fix, and then fix the root cause.
>>> Ok, here is the problem: kswapd.
>>>
>>> Limiting the page-cache memory inhibits invoking kswapd needlessly,
>>> aiding performance and easing OOM pressures.
>> Apart from kswapd, limiting pagecache helps performance of
>> applications by not eating away their ANON pages or other parts of its
>> resident data set. When there is enough free memory, then there is no
>> performance issue. However memory is always utilized to the max.
>> Hence every pagecache page that is allocated should come from some
>> application's RSS, or from cold pagecache page. If that page was
>> stolen from some application, then that application pays the price for
>> swapping or reading the page back to memory. This scenario is what we
>> want to avoid. All that we are trying to achieve is that pagecache
>> eats a (unmapped) pagecache page and not steal memory from other
>> important application's resident set.
>
> Agreed 100%. Thanks for expanding exactly what I meant.
>
>> Certainly this should be a configurable option and kernel's behavior
>> should not be changed in general.
>>
>>> I tried the patch; it works.
>>>
>> :)
>>> But it needs a bit of debugging. Setting pagecache_ratio = 1 either
>>> deadlocks or reduces thru-put to < 1mb/s.
>> Yes, going below 5% on my 1GB RAM machine causes severe performance
>> problems. We need to hard wire a reasonable lower limit and not
>> provide a noose for the end user to tie around!
>
> One reason to test full range settings, is to expose underlying system
> problems, like scalability. By limiting the range, you only hide a problem
> that was exposed.

Agreed. This is a good point.

>
> Thanks!
>
> --
> Al
>

2007-01-26 10:36:18

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On Wed, 24 Jan 2007 14:15:10 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> - One for stability
> When a customer constructs their database (Oracle), the system often goes OOM.
> This is because the system cannot allocate ZONE_DMA memory for 32-bit devices
> (USB or e100).
> Not allowing almost all pages to be used as page cache (for temporary use) will be some help.
> (Note: the DB is constructed on ext3... so all writes are serialized and the system couldn't
> free page cache.)

I'm surprised that any reasonable driver has a dependency on ZONE_DMA. Are
you sure? Send full oom-killer output, please.


> - One for tuning.
> Sometimes our customers request us to limit the size of the page cache.
>
> Many customers' memory usage reaches 99.x%. (This is a very common situation.)
> If almost all memory is used by page cache, we can think we can free it.
> But the customer cannot estimate what amount of page cache can be freed (without
> a performance regression).
>
> When a customer wants to add a new application, he tunes the system.
> But memory usage is always 99%.
> A page-cache limit is useful when the customer tunes his system and finds the
> sets of data and page cache.
> (Of course, we can use some other complicated resource management system for this.)
> This will allow the users to decide whether they need extra memory or not.
>
> And... some customers want to keep memory free as much as possible.
> 99% memory usage makes them feel insecure ;)

Tell them to do "echo 3 > /proc/sys/vm/drop_caches", then wait three minutes?
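The interface Andrew refers to, spelled out (needs root; drop_caches only discards clean entries, so flush dirty data first):

```shell
sync                                 # write back dirty pagecache first
echo 1 > /proc/sys/vm/drop_caches    # drop clean pagecache only
echo 2 > /proc/sys/vm/drop_caches    # drop dentries and inodes only
echo 3 > /proc/sys/vm/drop_caches    # drop both
```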

2007-01-26 10:50:40

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On Wed, 24 Jan 2007 15:03:23 +1100
Nick Piggin <[email protected]> wrote:

>
> Yeah, it will be failing at order=4, because the allocator won't try
> very hard to reclaim pagecache pages at that cutoff point. This needs to
> be fixed in the allocator.

A simple and perhaps sufficient fix for this nommu problem would be to replace
the magic "3" in __alloc_pages() with a tunable.

2007-01-26 18:04:11

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC] Limit the size of the pagecache

On Fri, 26 Jan 2007 02:29:55 -0800
Andrew Morton <[email protected]> wrote:

> On Wed, 24 Jan 2007 14:15:10 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > - One for stability
> > When a customer constructs their database (Oracle), the system often goes OOM.
> > This is because the system cannot allocate ZONE_DMA memory for 32-bit devices
> > (USB or e100).
> > Not allowing almost all pages to be used as page cache (for temporary use) will be some help.
> > (Note: the DB is constructed on ext3... so all writes are serialized and the system couldn't
> > free page cache.)
>
> I'm surprised that any reasonable driver has a dependency on ZONE_DMA. Are
> you sure? Send full oom-killer output, please.
>
>
Our ia64 server's USB/e100 devices use 32-bit PCI, so sometimes OOM happens in the DMA zone.
(ia64's ZONE_DMA is the 0-4G area.)

But I'm very sorry... I was confused.

I looked at the issue above again and found that ZONE_NORMAL on x86 was exhausted.

This was an interesting incident:

Constructing the DB on a 4GB system caused no problem.
Constructing the DB on an 8GB system always caused OOM.

I asked the users to change the DB's parameters. (This happened on the RHEL4/linux-2.6.9 series.)


> > And... some customers want to keep memory free as much as possible.
> > 99% memory usage makes them feel insecure ;)
>
> Tell them to do "echo 3 > /proc/sys/vm/drop_caches", then wait three minutes?

Ah, maybe we can use it on RHEL5. We'll test it. Thank you.

Thanks,
-Kamezawa