Hi,
I would like to tune my kernel not to use as much memory for cache
as it currently does. I have 2GB RAM, but when I am running one program
that accesses a lot of files on my disk (like rsync), that program uses
most of the cache, and other programs wind up swapping out. I'd prefer to
have just rsync run slower because less of its data is cached, rather than
have all my other programs run more slowly. rsync is not allocating the
memory itself; the kernel is caching the files it reads at the expense of
other programs.
With 2GB on a system, I should never page out, but I consistently do and I
need to tune the kernel to avoid that. Cache usage is around 1.4 GB!
I never had this problem with earlier kernels. I've read a lot of comments
where so-called experts pooh-pooh this problem, but it is real and
repeatable, and I am ready to take matters into my own hands to fix it.
I am told the cache is replaced when another program needs more memory, so
it shouldn't swap, but that is not the behaviour I am seeing.
Can anyone help point me in the right direction?
Do any kernel developers care about this?
My kernel is stock 2.4.21; I run Red Hat 9 on a 3GHz P4. I'd give you
motherboard info, but I've seen this behaviour on other motherboards as well.
Thank you very much for your help.
-- tony
"Surrender to the Void."
-- John Lennon
Hello!
Anthony R. wrote:
[..snip..]
> With 2GB on a system, I should never page out, but I consistently do and I
One, very easy, solution is to do:
# swapoff -a
FWIW, I'd like an option to limit the cache size to a maximum amount...
Say: echo 500000 > /proc/sys/vm/max_disk_cache
But, AFAIK, that's not going to happen.
Regards,
Nuno Silva
On 19 August 2003 07:39, Anthony R. wrote:
> I would like to tune my kernel not to use as much memory for cache
> as it currently does. I have 2GB RAM, but when I am running one program
> that accesses a lot of files on my disk (like rsync), that program uses
> most of the cache, and other programs wind up swapping out. I'd prefer to
> have just rsync run slower because less of its data is cached, rather
> than have
> all my other programs run more slowly. rsync is not allocating memory,
> but the kernel is caching it at the expense of other programs.
There was a discussion (and patches) in the middle of the 2.5 series
about an O_STREAMING open flag which means "do not aggressively cache
this file". It was targeted at MP3/video playing, copying large files and such.
I don't know whether it actually was merged. If it was,
your program can use it.
> With 2GB on a system, I should never page out, but I consistently do and I
> need to tune the kernel to avoid that. Cache usage is around 1.4 GB!
So why did you configure your system to have huge swap?
That's a rather contradictory setup ;)
> I never had this problem with earlier kernels. I've read a lot of comments
> where so-called experts poo-poo this problem, but it is real and
> repeatable and I am
> ready to take matters into my own hands to fix it. I am told the cache
> is replaced when
> another program needs more memory, so it shouldn't swap, but that is not
> the
> behaviour I am seeing.
>
> Can anyone help point me in the right direction?
I'd say stop allocating insane amounts of swap.
Frankly, with 2G you may run without swap at all.
--
vda
Anthony R. wrote:
>I would like to tune my kernel not to use as much memory for cache
>as it currently does.
>[..snip..]
Hi Anthony,
If you're up for a bit of work, give the -aa series kernels a try, and also
see how 2.6-test goes; be sure to report any problems you encounter.
The VM in stock 2.4 is slow to pick up updates due to being a stable series.
The problems definitely won't get pooh-poohed here. Be sure you include a
good description of your workload and probably a log of "vmstat 1" to start
with.
Denis Vlasenko <[email protected]> wrote:
>
> There was a discussion (and patches) in the middle of 2.5 series
> about O_STREAMING open flag which mean "do not aggressively cache
> this file". Targeted at MP3/video playing, copying large files and such.
>
> I don't know whether it actually was merged. If it was,
> your program can use it.
It was not. Instead we have fadvise. So it would be appropriate to change
applications such as rsync to optionally run
posix_fadvise(fd, 0, -1, POSIX_FADV_DONTNEED)
against file descriptors just before closing them, so all the pagecache
gets thrown away. (Well, most of the pagecache - dirty pages won't get
dropped - the app must fsync the files by hand first if it wants this)
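As an illustration only, a minimal sketch of what such an application change
could look like (a hypothetical helper, not actual rsync code):

    #define _XOPEN_SOURCE 600       /* for posix_fadvise() */
    #include <fcntl.h>
    #include <unistd.h>

    /* Write back any dirty pages, then tell the kernel it can drop this
     * file's pagecache, and close the descriptor. */
    static int close_dropping_cache(int fd)
    {
            fsync(fd);
            /* offset 0, length 0 == "from the start to end of file" */
            posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
            return close(fd);
    }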
This would be a useful addition to rsync and such applications - it is
stronger and more specific and safer than banging on the VM for a special
case.
But if you want to bang on the VM for a special case, run 2.6 and set
/proc/sys/vm/swappiness to zero during the rsync run.
On 08.19, Andrew Morton wrote:
> Denis Vlasenko <[email protected]> wrote:
> >
> > There was a discussion (and patches) in the middle of 2.5 series
> > about O_STREAMING open flag which mean "do not aggressively cache
> > this file". Targeted at MP3/video playing, copying large files and such.
> >
> > I don't know whether it actually was merged. If it was,
> > your program can use it.
>
> It was not. Instead we have fadvise. So it would be appropriate to change
Does this work in 2.4?
If not, is there any patch flying around?
It would be interesting to have this functionality in 2.4 also, so
people can start modifying and testing things like DVD readers, rsync,
updatedb, grep and so on...
I have tested O_STREAMING in 2.4 and it is fine...
TIA
--
J.A. Magallon <[email protected]> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.22-rc2-jam1m (gcc 3.3.1 (Mandrake Linux 9.2 3.3.1-1mdk))
"J.A. Magallon" <[email protected]> wrote:
>
> > It was not. Instead we have fadvise. So it would be appropriate to change
>
> Does this work in 2.4 ?
> If not, any patch flying around ?
No. It would be fairly messy to implement in 2.4 because 2.4 does not have
the per-inode radix trees for pagecache. The implementation would need to
walk every page attached to the inode just to shoot down a single page.
And all of it underneath the global pagecache lock.
But it is certainly possible.
On 08.19, Andrew Morton wrote:
> "J.A. Magallon" <[email protected]> wrote:
> >
> > > It was not. Instead we have fadvise. So it would be appropriate to change
> >
> > Does this work in 2.4 ?
> > If not, any patch flying around ?
>
> No. It would be fairly messy to implement in 2.4 because 2.4 does not have
> the per-inode radix trees for pagecache. The implementation would need to
> walk every page attached to the inode just to shoot down a single page.
> And all of it underneath the global pagecache lock.
>
> But it is certainly possible.
>
So could O_STREAMING be included in 2.4, and let people do things like
#if 2.4
fcntl(...O_STREAMING...)
#else
posix_fadvise()
#endif
Or, if fadvise just fails with error code in 2.4,
if (fadvise() < 0)
        fcntl(O_STREAMING);
Or even:
fadvise();
fcntl(O_STREAMING);
and let whichever one succeeds take effect...
Or is it too dirty?
TIA
--
J.A. Magallon <[email protected]> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.22-rc2-jam1m (gcc 3.3.1 (Mandrake Linux 9.2 3.3.1-1mdk))
"J.A. Magallon" <[email protected]> wrote:
>
>
> So could O_STREAMING be included in 2.4, and let people do things like
Sounds fairly ugh, actually. It might be better to just implement
fadvise().
O_STREAMING is really designed for large streaming writes; the current
implementation only performs invalidation after each megabyte of I/O, so it
would fail to do anything at all in the lots-of-medium-size-files case
such as rsync.
Or use 2.6. It will take a while for the feature to usefully propagate into
applications anyway...
On Mon Aug 18, 2003 at 11:20:24PM -0700, Andrew Morton wrote:
> Denis Vlasenko <[email protected]> wrote:
> >
> > There was a discussion (and patches) in the middle of 2.5 series
> > about O_STREAMING open flag which mean "do not aggressively cache
> > this file". Targeted at MP3/video playing, copying large files and such.
> >
> > I don't know whether it actually was merged. If it was,
> > your program can use it.
>
> It was not. Instead we have fadvise. So it would be appropriate to change
> applications such as rsync to optionally run
>
> posix_fadvise(fd, 0, -1, POSIX_FADV_DONTNEED)
>
> against file descriptors just before closing them, so all the pagecache
> gets thrown away. (Well, most of the pagecache - dirty pages won't get
> dropped - the app must fsync the files by hand first if it wants this)
This is not supported in 2.4.x though, right?
What if I don't want to fill up the pagecache with garbage in the
first place? When closing a file descriptor, it is already too
late -- the one time only giant pile of data has already caused
the kernel to wastefully flush useful things out of cache...
-Erik
--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--
>>another program needs more memory, so it shouldn't swap, but that is not
>>the
>>behaviour I am seeing.
>>
>>Can anyone help point me in the right direction?
>>
>>
>
>I'd say stop allocating insane amounts of swap.
>Frankly, with 2G you may run without swap at all.
>
>
I'm not sure how you knew I had 2GB of swap. ;)
I just always thought it was a good idea to have some just in case.
I did not know having swap would actually, in some cases, degrade
performance.
Are you saying that, if I turn off swap, the amount of cache used will
be the same, but that when other programs need more memory, the kernel
will take it from cache? If so, I will try, since that would be
an ideal solution.
And while O_STREAMING sounds good, I'm not really up for rewriting
all the rsync-like apps. I want my OS to deal with it.
Thanks.
-- tony
"Surrender to the Void." -- John Lennon
On Tue, Aug 19, 2003 at 10:28:58AM -0400, Anthony R. wrote:
>
> >>another program needs more memory, so it shouldn't swap, but that is not
> >>the
> >>behaviour I am seeing.
> >>
> >>Can anyone help point me in the right direction?
> >>
> >>
> >
> >I'd say stop allocating insane amounts of swap.
> >Frankly, with 2G you may run without swap at all.
> >
> >
> I'm not sure how you knew I had 2GB of swap. ;)
> I just always thought it was a good idea to have some just in case.
> I did not know having swap would actually, in some cases, degrade
> performance.
>
> Are you saying that, if I turn off swap, the amount of cache used will
> be the same, but that when other programs need more memory, the kernel
> will take it from cache? If so, I will try, since that would be
> an ideal solution.
And the -aa and rmap kernels do that with swap on too.
If you test them and they don't do that for your workload, please get back
to the list and let us know.
It is well known that the stock 2.4 VM is WAY behind -aa and rmap in terms
of responsiveness and correct choices.
Erik Andersen <[email protected]> wrote:
>
> On Mon Aug 18, 2003 at 11:20:24PM -0700, Andrew Morton wrote:
> > Denis Vlasenko <[email protected]> wrote:
> > >
> > > There was a discussion (and patches) in the middle of 2.5 series
> > > about O_STREAMING open flag which mean "do not aggressively cache
> > > this file". Targeted at MP3/video playing, copying large files and such.
> > >
> > > I don't know whether it actually was merged. If it was,
> > > your program can use it.
> >
> > It was not. Instead we have fadvise. So it would be appropriate to change
> > applications such as rsync to optionally run
> >
> > posix_fadvise(fd, 0, -1, POSIX_FADV_DONTNEED)
> >
> > against file descriptors just before closing them, so all the pagecache
> > gets thrown away. (Well, most of the pagecache - dirty pages won't get
> > dropped - the app must fsync the files by hand first if it wants this)
>
> This is not supported in 2.4.x though, right?
No, it is not.
> What if I don't want to fill up the pagecache with garbage in the
> first place?
Call fadvise(POSIX_FADV_DONTNEED) more frequently or use O_DIRECT.
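To illustrate the "more frequently" part, a rough sketch of a read loop that
drops the cache behind itself as it goes (untested; the chunk size is
arbitrary):

    #define _XOPEN_SOURCE 600       /* for posix_fadvise() */
    #include <fcntl.h>
    #include <unistd.h>

    #define DROP_CHUNK (8 * 1024 * 1024)    /* drop cache every 8MB read */

    /* Read the file sequentially, periodically telling the kernel it may
     * toss the pages we have already consumed, so the pagecache never
     * fills up with this file's data in the first place. */
    static void read_without_polluting_cache(int fd)
    {
            char buf[64 * 1024];
            off_t done = 0, dropped = 0;
            ssize_t n;

            while ((n = read(fd, buf, sizeof(buf))) > 0) {
                    done += n;
                    if (done - dropped >= DROP_CHUNK) {
                            posix_fadvise(fd, dropped, done - dropped,
                                          POSIX_FADV_DONTNEED);
                            dropped = done;
                    }
            }
    }

(This only helps reads; for writes the fsync() + fadvise() combination
mentioned earlier in the thread would be needed, since dirty pages are not
dropped.)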
Hi.
On Tue, 19 Aug 2003 00:39:49 -0400, "Anthony R." wrote:
>I would like to tune my kernel not to use as much memory for cache
>as it currently does.
>[..snip..]
I also have the same problem with the pagecache. The pagecache grows until
most of memory is used, and that hurts the performance of other programs.
It is a serious problem on OLTP systems. When the transaction rate
increases, transaction processing stalls, because most of memory is used as
pagecache and it takes a long time to reclaim memory from it. That stalls
the whole system. To avoid this, limiting the pagecache is necessary.
Actually, on a system I built (Red Hat Advanced Server 2.1, 2.4.9-based
kernel), the problem occurred because of the pagecache. The system's maximum
response time had to be less than 4 seconds, but owing to the pagecache,
response times became uneven and the maximum grew to 10 seconds.
That trouble was solved by controlling the pagecache
using /proc/sys/vm/pagecache.
I made a patch that adds a new parameter, /proc/sys/vm/pgcache-max. It
controls the maximum number of pages used as pagecache.
The attached file is just a test patch, so it may contain bugs or ugly
code. Please let me know if you have advice, comments, a better
implementation, and so on.
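For example, assuming 4kB pages, something like
    echo 131072 > /proc/sys/vm/pgcache-max
should cap the pagecache at roughly 512MB (131072 pages x 4kB).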
Thanks.
--------------------------------------------------
Takao Indoh
E-Mail : [email protected]
diff -Nur linux-2.5.64/include/linux/gfp.h linux-2.5.64-new/include/linux/gfp.h
--- linux-2.5.64/include/linux/gfp.h Wed Mar 5 12:29:03 2003
+++ linux-2.5.64-new/include/linux/gfp.h Tue Apr 8 11:12:33 2003
@@ -18,6 +18,7 @@
#define __GFP_FS 0x80 /* Can call down to low-level FS? */
#define __GFP_COLD 0x100 /* Cache-cold page required */
#define __GFP_NOWARN 0x200 /* Suppress page allocation failure warning */
+#define __GFP_PGCACHE 0x400 /* Page-cache required */
#define GFP_ATOMIC (__GFP_HIGH)
#define GFP_NOIO (__GFP_WAIT)
diff -Nur linux-2.5.64/include/linux/mm.h linux-2.5.64-new/include/linux/mm.h
--- linux-2.5.64/include/linux/mm.h Wed Mar 5 12:28:56 2003
+++ linux-2.5.64-new/include/linux/mm.h Tue Apr 8 11:12:33 2003
@@ -22,6 +22,7 @@
extern unsigned long num_physpages;
extern void * high_memory;
extern int page_cluster;
+extern unsigned long max_pgcache;
#include <asm/page.h>
#include <asm/pgtable.h>
diff -Nur linux-2.5.64/include/linux/page-flags.h linux-2.5.64-new/include/linux/page-flags.h
--- linux-2.5.64/include/linux/page-flags.h Wed Mar 5 12:29:31 2003
+++ linux-2.5.64-new/include/linux/page-flags.h Tue Apr 8 11:12:33 2003
@@ -74,6 +74,7 @@
#define PG_mappedtodisk 17 /* Has blocks allocated on-disk */
#define PG_reclaim 18 /* To be reclaimed asap */
#define PG_compound 19 /* Part of a compound page */
+#define PG_pgcache 20 /* Page is used as pagecache */
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
@@ -255,6 +256,10 @@
#define PageCompound(page) test_bit(PG_compound, &(page)->flags)
#define SetPageCompound(page) set_bit(PG_compound, &(page)->flags)
#define ClearPageCompound(page) clear_bit(PG_compound, &(page)->flags)
+
+#define PagePgcache(page) test_bit(PG_pgcache, &(page)->flags)
+#define SetPagePgcache(page) set_bit(PG_pgcache, &(page)->flags)
+#define ClearPagePgcache(page) clear_bit(PG_pgcache, &(page)->flags)
/*
* The PageSwapCache predicate doesn't use a PG_flag at this time,
diff -Nur linux-2.5.64/include/linux/pagemap.h linux-2.5.64-new/include/linux/pagemap.h
--- linux-2.5.64/include/linux/pagemap.h Wed Mar 5 12:28:53 2003
+++ linux-2.5.64-new/include/linux/pagemap.h Tue Apr 8 11:12:33 2003
@@ -29,12 +29,12 @@
static inline struct page *page_cache_alloc(struct address_space *x)
{
- return alloc_pages(x->gfp_mask, 0);
+ return alloc_pages(x->gfp_mask|__GFP_PGCACHE, 0);
}
static inline struct page *page_cache_alloc_cold(struct address_space *x)
{
- return alloc_pages(x->gfp_mask|__GFP_COLD, 0);
+ return alloc_pages(x->gfp_mask|__GFP_COLD|__GFP_PGCACHE, 0);
}
typedef int filler_t(void *, struct page *);
@@ -80,6 +80,7 @@
list_add(&page->list, &mapping->clean_pages);
page->mapping = mapping;
page->index = index;
+ SetPagePgcache(page);
mapping->nrpages++;
inc_page_state(nr_pagecache);
diff -Nur linux-2.5.64/include/linux/sysctl.h linux-2.5.64-new/include/linux/sysctl.h
--- linux-2.5.64/include/linux/sysctl.h Wed Mar 5 12:29:21 2003
+++ linux-2.5.64-new/include/linux/sysctl.h Tue Apr 8 11:12:33 2003
@@ -155,6 +155,7 @@
VM_HUGETLB_PAGES=18, /* int: Number of available Huge Pages */
VM_SWAPPINESS=19, /* Tendency to steal mapped memory */
VM_LOWER_ZONE_PROTECTION=20,/* Amount of protection of lower zones */
+ VM_MAXPGCACHE=21,/* maximum number of page used as pagecache */
};
diff -Nur linux-2.5.64/kernel/sysctl.c linux-2.5.64-new/kernel/sysctl.c
--- linux-2.5.64/kernel/sysctl.c Wed Mar 5 12:28:58 2003
+++ linux-2.5.64-new/kernel/sysctl.c Tue Apr 8 11:12:33 2003
@@ -319,6 +319,8 @@
&sysctl_lower_zone_protection, sizeof(sysctl_lower_zone_protection),
0644, NULL, &proc_dointvec_minmax, &sysctl_intvec, NULL, &zero,
NULL, },
+ {VM_MAXPGCACHE, "pgcache-max", &max_pgcache, sizeof(unsigned long),
+ 0644, NULL,&proc_dointvec_minmax, &sysctl_intvec, NULL,&zero,NULL},
{0}
};
diff -Nur linux-2.5.64/mm/filemap.c linux-2.5.64-new/mm/filemap.c
--- linux-2.5.64/mm/filemap.c Wed Mar 5 12:29:15 2003
+++ linux-2.5.64-new/mm/filemap.c Tue Apr 8 11:12:33 2003
@@ -86,6 +86,7 @@
radix_tree_delete(&mapping->page_tree, page->index);
list_del(&page->list);
page->mapping = NULL;
+ ClearPagePgcache(page);
mapping->nrpages--;
dec_page_state(nr_pagecache);
@@ -437,7 +438,7 @@
page = find_lock_page(mapping, index);
if (!page) {
if (!cached_page) {
- cached_page = alloc_page(gfp_mask);
+ cached_page = alloc_page(gfp_mask|__GFP_PGCACHE);
if (!cached_page)
return NULL;
}
@@ -507,7 +508,7 @@
return NULL;
}
gfp_mask = mapping->gfp_mask & ~__GFP_FS;
- page = alloc_pages(gfp_mask, 0);
+ page = alloc_pages(gfp_mask|__GFP_PGCACHE, 0);
if (page && add_to_page_cache_lru(page, mapping, index, gfp_mask)) {
page_cache_release(page);
page = NULL;
diff -Nur linux-2.5.64/mm/page_alloc.c linux-2.5.64-new/mm/page_alloc.c
--- linux-2.5.64/mm/page_alloc.c Wed Mar 5 12:28:58 2003
+++ linux-2.5.64-new/mm/page_alloc.c Tue Apr 8 17:21:03 2003
@@ -39,6 +39,7 @@
int nr_swap_pages;
int numnodes = 1;
int sysctl_lower_zone_protection = 0;
+unsigned long max_pgcache = ULONG_MAX;
/*
* Used by page_zone() to look up the address of the struct zone whose
@@ -52,6 +53,9 @@
static int zone_balance_min[MAX_NR_ZONES] __initdata = { 20 , 20, 20, };
static int zone_balance_max[MAX_NR_ZONES] __initdata = { 255 , 255, 255, };
+extern int shrink_pgcache(struct zonelist *zonelist, unsigned int gfp_mask,
+ unsigned int max_nrpage, struct page_state *ps);
+
/*
* Temporary debugging check for pages not lying within a given zone.
*/
@@ -548,6 +552,19 @@
classzone = zones[0];
if (classzone == NULL) /* no zones in the zonelist */
return NULL;
+
+ if (gfp_mask & __GFP_PGCACHE) {
+ struct page_state ps;
+ int nr_page;
+
+ min = 1UL << order;
+ get_page_state(&ps);
+ if (ps.nr_pagecache + min >= max_pgcache) {
+ /* try to shrink pagecache */
+ nr_page = ps.nr_pagecache + min - max_pgcache;
+ shrink_pgcache(zonelist, gfp_mask, nr_page, &ps);
+ }
+ }
/* Go through the zonelist once, looking for a zone with enough free */
min = 1UL << order;
diff -Nur linux-2.5.64/mm/swap_state.c linux-2.5.64-new/mm/swap_state.c
--- linux-2.5.64/mm/swap_state.c Wed Mar 5 12:29:17 2003
+++ linux-2.5.64-new/mm/swap_state.c Tue Apr 8 11:12:33 2003
@@ -360,7 +360,7 @@
* Get a new page to read into from swap.
*/
if (!new_page) {
- new_page = alloc_page(GFP_HIGHUSER);
+ new_page = alloc_page(GFP_HIGHUSER|__GFP_PGCACHE);
if (!new_page)
break; /* Out of memory */
}
diff -Nur linux-2.5.64/mm/vmscan.c linux-2.5.64-new/mm/vmscan.c
--- linux-2.5.64/mm/vmscan.c Wed Mar 5 12:28:59 2003
+++ linux-2.5.64-new/mm/vmscan.c Tue Apr 8 11:12:33 2003
@@ -493,6 +493,13 @@
list_add(&page->lru, &zone->inactive_list);
continue;
}
+ if (gfp_mask & __GFP_PGCACHE) {
+ if (!PagePgcache(page)) {
+ SetPageLRU(page);
+ list_add(&page->lru, &zone->inactive_list);
+ continue;
+ }
+ }
list_add(&page->lru, &page_list);
page_cache_get(page);
nr_taken++;
@@ -737,6 +744,40 @@
}
return shrink_cache(nr_pages, zone, gfp_mask,
max_scan, nr_mapped);
+}
+
+/*
+ * Try to reclaim `nr_pages' from pagecache of this zone.
+ * Returns the number of reclaimed pages.
+ */
+int shrink_pgcache(struct zonelist *zonelist, unsigned int gfp_mask,
+ unsigned int nr_pages, struct page_state *ps)
+{
+ struct zone **zones;
+ struct zone *first_classzone;
+ struct zone *zone;
+ unsigned int ret = 0, reclaim;
+ unsigned long rest_nr_page;
+ int dummy, i;
+
+ zones = zonelist->zones;
+ for (i = 0; zones[i] != NULL; i++) {
+ zone = zones[i];
+ first_classzone = zone->zone_pgdat->node_zones;
+ for (; zone >= first_classzone; zone--) {
+ if (zone->all_unreclaimable) /* all pages pinned */
+ continue;
+
+ rest_nr_page = nr_pages - ret;
+ reclaim = max(((zone->nr_inactive)>>2)+1, rest_nr_page);
+ ret += shrink_zone(zone, zone->nr_inactive,
+ gfp_mask|__GFP_PGCACHE,
+ reclaim, &dummy, ps, DEF_PRIORITY);
+ if (ret >= nr_pages)
+ return ret;
+ }
+ }
+ return ret;
}
/*
Takao Indoh wrote:
>
> I made a patch to add new paramter /proc/sys/vm/pgcache-max. It controls
> maximum number of pages used as pagecache.
> An attached file is a mere test patch, so it may contain a bug or ugly
> code. Please let me know if there is an advice, comment, better
> implementation, and so on.
>
Do you have something like this for 2.4 kernels?
[ I expected to find that, by default, Linux stops polluting memory
with cache when there are no more free pages. But as I see, your patch is
hacking something somewhere in the middle... But I'm not a specialist in
VM... Gone reading sources. ]
Thanks for the patch.
On Thu, Aug 21, 2003 at 09:49:45AM +0900, Takao Indoh wrote:
> Actually, in the system I constructed(RedHat AdvancedServer2.1, kernel
> 2.4.9based), the problem occurred due to pagecache. The system's maximum
> response time had to be less than 4 seconds, but owing to the pagecache,
> response time get uneven, and maximum time became 10 seconds.
Please try the 2.4.18 based redhat kernel, or the 2.4-aa kernel.
On Thu, 21 Aug 2003 16:47:09 -0700, Mike Fedyk wrote:
>On Thu, Aug 21, 2003 at 09:49:45AM +0900, Takao Indoh wrote:
>> Actually, in the system I constructed(RedHat AdvancedServer2.1, kernel
>> 2.4.9based), the problem occurred due to pagecache. The system's maximum
>> response time had to be less than 4 seconds, but owing to the pagecache,
>> response time get uneven, and maximum time became 10 seconds.
>
>Please try the 2.4.18 based redhat kernel, or the 2.4-aa kernel.
I need a tuning parameter which can control the pagecache,
like /proc/sys/vm/pagecache, which Red Hat Linux has.
The latest 2.4 and 2.5 standard kernels do not have such a parameter.
Does the 2.4.18 kernel or the 2.4-aa kernel have an alternative method?
--------------------------------------------------
Takao Indoh
E-Mail : [email protected]
On Thu, Aug 21, 2003 at 09:49:45AM +0900, Takao Indoh wrote:
>>> Actually, in the system I constructed(RedHat AdvancedServer2.1, kernel
>>> 2.4.9based), the problem occurred due to pagecache. The system's maximum
>>> response time had to be less than 4 seconds, but owing to the pagecache,
>>> response time get uneven, and maximum time became 10 seconds.
On Thu, 21 Aug 2003 16:47:09 -0700, Mike Fedyk wrote:
>> Please try the 2.4.18 based redhat kernel, or the 2.4-aa kernel.
On Mon, Aug 25, 2003 at 11:45:58AM +0900, Takao Indoh wrote:
> I need a tuning parameter which can control pagecache
> like /proc/sys/vm/pagecache, which RedHat Linux has.
> The latest 2.4 or 2.5 standard kernel does not have such a parameter.
> 2.4.18 kernel or 2.4-aa kernel has a alternative method?
This is moderately misguided; essentially the only way userspace can
utilize RAM at all is via the pagecache. It's not useful to limit this;
you probably need inode-highmem or some such nonsense.
-- wli
Thank you for your interest in my patch.
On Thu, 21 Aug 2003 11:52:52 +0200, Ihar 'Philips' Filipau wrote:
>Takao Indoh wrote:
>>
>> I made a patch to add new paramter /proc/sys/vm/pgcache-max. It controls
>> maximum number of pages used as pagecache.
>> An attached file is a mere test patch, so it may contain a bug or ugly
>> code. Please let me know if there is an advice, comment, better
>> implementation, and so on.
>>
>
> Do you have something like this for 2.4 kernels?
No, I only have a patch for the 2.5 kernel.
But Red Hat Advanced Server 2.1 (2.4.9-based kernel) has a similar parameter
(/proc/sys/vm/pagecache). If you can see the source, please check it.
>
> [ I expected to find that by default Linux stops polluting memory
>with cache when there is no more pages. But as I see your patch is
>hacking something somewhere in the middle... But I'm not a specialist in
>VM... Gone reading sources. ]
>
> Thanks for the patch.
I'm not a specialist in the VM either, so the patch may have many bugs.
What the patch does is very simple:
1) Add a PG_pgcache flag to pages used as pagecache.
2) Watch the total amount of pagecache.
3) If the amount of pagecache exceeds the maximum,
   try to reclaim only pages which have the PG_pgcache flag.
Thanks.
--------------------------------------------------
Takao Indoh
E-Mail : [email protected]
On Sun, Aug 24, 2003 at 09:11:17PM -0700, William Lee Irwin III wrote:
> On Thu, Aug 21, 2003 at 09:49:45AM +0900, Takao Indoh wrote:
> >>> Actually, in the system I constructed(RedHat AdvancedServer2.1, kernel
> >>> 2.4.9based), the problem occurred due to pagecache. The system's maximum
> >>> response time had to be less than 4 seconds, but owing to the pagecache,
> >>> response time get uneven, and maximum time became 10 seconds.
>
> On Thu, 21 Aug 2003 16:47:09 -0700, Mike Fedyk wrote:
> >> Please try the 2.4.18 based redhat kernel, or the 2.4-aa kernel.
>
> On Mon, Aug 25, 2003 at 11:45:58AM +0900, Takao Indoh wrote:
> > I need a tuning parameter which can control pagecache
> > like /proc/sys/vm/pagecache, which RedHat Linux has.
> > The latest 2.4 or 2.5 standard kernel does not have such a parameter.
> > 2.4.18 kernel or 2.4-aa kernel has a alternative method?
>
Takao,
I doubt that there will be that option in the 2.4 stable series. I think
you are trying to fix the problem without understanding the entire picture.
If there is too much pagecache, then the kernel developers need to know
about your workload so that they can fix it. But you have to try -aa first
to see if it's already fixed.
> This is moderately misguided; essentially the only way userspace can
> utilize RAM at all is via the pagecache. It's not useful to limit this;
> you probably need inode-highmem or some such nonsense.
Exactly. Every program you have opened, and all of its libraries will show
up as pagecache memory also, so seeing a large pagecache in and of itself
may not be a problem.
Let's get past the tuning parameter you want in /proc, and tell us more
about what you are doing that is causing this problem to show up.
On Sun, Aug 24, 2003 at 09:11:17PM -0700, William Lee Irwin III wrote:
>> This is moderately misguided; essentially the only way userspace can
>> utilize RAM at all is via the pagecache. It's not useful to limit this;
>> you probably need inode-highmem or some such nonsense.
On Mon, Aug 25, 2003 at 03:58:47PM -0700, Mike Fedyk wrote:
> Exactly. Every program you have opened, and all of its libraries will show
> up as pagecache memory also, so seeing a large pagecache in and of itself
> may not be a problem.
> Let's get past the tuning paramenter you want in /proc, and tell us more
> about what you are doing that is causing this problem to be shown.
One thing I thought of after the post was whether they actually had in
mind tunable hard limits on _unmapped_ pagecache, which is, in fact,
useful. OTOH that's largely speculation and we really need them to
articulate the true nature of their problem.
-- wli
Mike Fedyk wrote:
>>On Mon, Aug 25, 2003 at 11:45:58AM +0900, Takao Indoh wrote:
>>
>>>I need a tuning parameter which can control pagecache
>>>like /proc/sys/vm/pagecache, which RedHat Linux has.
>>>The latest 2.4 or 2.5 standard kernel does not have such a parameter.
>>>2.4.18 kernel or 2.4-aa kernel has a alternative method?
>
> I doubt that there will be that option in the 2.4 stable series. I think
> you are trying to fix the problem without understanding the entire picture.
> If there is too much pagechache, then the kernel developers need to know
> about your workload so that they can fix it. But you have to try -aa first
> to see if it's already fixed.
>
Let me give my point of view.
Linux tries to scale up to the limits of the given hardware.
That is _*horribly*_ wrong.
If I have 1GB of memory and my applications use only 16MB, it
doesn't mean I want to fill the remaining 1GB-16MB with garbage like a file
my mommy viewed two weeks ago.
That's it: the OS should scale to the *application* *needs*.
Can you compare in your mind the overhead of managing 1GB of cache with
managing e.g. 16MB of cache?
So IMHO the problem is: needless OS overhead.
It is possible to minimize the overhead in several ways:
1) Optimize algorithms and data structures.
2) Minimize the amount of resources used.
3) As a compromise of 1 & 2 - teach the OS not to use unneeded resources
until they are really needed, and free them afterwards.
1) is already done; 3) is awful heuristics which will never work reliably.
And Takao's patch was trying to approach the problem from point 2).
So to me it is justified.
Comments are welcome.
On Tue, Aug 26, 2003 at 12:15:46PM +0200, Ihar 'Philips' Filipau wrote:
> If I have 1GB of memory and my applications for use only 16MB - it
> doesn't mean I want to fill 1GB-16MB with garbage like file my momy had
> viewed two weeks ago.
>
> That's it: OS should scale for *application* *needs*.
>
> Can you compare in your mind overhead of managing 1GB of cache with
> managing e.g. 16MB of cache?
>
Ok, let's benchmark it.
Yes, I can see the logic in your argument, but at this point, numbers are
needed to see if or how much of a win this might be.
Mike Fedyk wrote:
> On Tue, Aug 26, 2003 at 12:15:46PM +0200, Ihar 'Philips' Filipau wrote:
>> If I have 1GB of memory and my applications for use only 16MB - it
>>doesn't mean I want to fill 1GB-16MB with garbage like file my momy had
>>viewed two weeks ago.
>>
>> That's it: OS should scale for *application* *needs*.
>>
>> Can you compare in your mind overhead of managing 1GB of cache with
>>managing e.g. 16MB of cache?
>>
>
> Ok, let's benchmark it.
>
> Yes, I can see the logic in your argument, but at this point, numbers are
> needed to see if or how much of a win this might be.
[ I believe you have seen the thread about the O_STREAMING patch.
Not caching was giving a 10%-15% performance boost for gcc on kernel
compiles. Isn't that overhead? ]
I will try to produce some benchmarks tomorrow with different
'mem=%dMB' settings. I'm afraid it will confirm that it makes a difference.
But in advance: maintenance of page tables for 1GB and for 128MB of
RAM is going to make a difference.
On Tue, Aug 26, 2003 at 09:08:51PM +0200, Ihar 'Philips' Filipau wrote:
> Mike Fedyk wrote:
> >Ok, let's benchmark it.
> >
> >Yes, I can see the logic in your argument, but at this point, numbers are
> >needed to see if or how much of a win this might be.
>
> [ I beleive you can see those thread about O_STREAMING patch.
> Not-caching was giving 10%-15% peformance boost for gcc on kernel
> compiles. Isn't that overhead? ]
>
That was because they wanted the non-streaming files to be left in the cache.
> I will try to produce some benchmarktings tomorrow with different
> 'mem=%dMB'. I'm afraid to confirm that it will make difference.
> But in advance: mantainance of page tables for 1GB and for 128MB of
> RAM are going to make a difference.
I'm sorry to say, but you *will* get lower performance if you lower the mem=
value below your working set. This will also lower the total amount of
memory available for your applications, and force your apps to swap and to
balance cache and app memory.
That's not what you are looking to benchmark.
Thanks for the advice.
On Mon, 25 Aug 2003 15:58:47 -0700, Mike Fedyk wrote:
>I doubt that there will be that option in the 2.4 stable series. I think
>you are trying to fix the problem without understanding the entire picture.
>If there is too much pagechache, then the kernel developers need to know
>about your workload so that they can fix it. But you have to try -aa first
>to see if it's already fixed.
>
>> This is moderately misguided; essentially the only way userspace can
>> utilize RAM at all is via the pagecache. It's not useful to limit this;
>> you probably need inode-highmem or some such nonsense.
>
>Exactly. Every program you have opened, and all of its libraries will show
>up as pagecache memory also, so seeing a large pagecache in and of itself
>may not be a problem.
>
>Let's get past the tuning paramenter you want in /proc, and tell us more
>about what you are doing that is causing this problem to be shown.
This problem happened a few months ago and the detailed data no longer
remains. Therefore it is difficult to know the essential cause of the
problem, but I guessed that pagecache used as I/O cache grew gradually
while the system was running, and finally it oppressed memory.
Besides this problem, there are many cases where growth of the pagecache
causes trouble, I think.
For example, a DBMS.
A DBMS caches DB indexes in its own process space.
This index cache competes with the pagecache used by other applications,
and the index cache may be paged out. That causes uneven DBMS response
times. In this case, limiting the pagecache is effective.
On Tue, 26 Aug 2003 02:46:34 -0700, William Lee Irwin III wrote:
>One thing I thought of after the post was whether they actually had in
>mind tunable hard limits on _unmapped_ pagecache, which is, in fact,
>useful. OTOH that's largely speculation and we really need them to
>articulate the true nature of their problem.
I also think that would be effective. Empirically, in the cases where
pagecache causes a memory shortage, most of the pagecache is unmapped pages.
Of course the real problem may not be the pagecache, as you and Mike said.
--------------------------------------------------
Takao Indoh
E-Mail : [email protected]
On Wed, Aug 27, 2003 at 06:36:10PM +0900, Takao Indoh wrote:
> This problem happened a few month ago and the detailed data does not
> remain. Therefore it is difficult to know what is essential cause for
> this problem, but, I guessed that pagecache used as I/O cache grew
> gradually during system running, and finally it oppressed memory.
But this doesn't make any sense; the only memory you could "oppress"
is pagecache.
On Wed, Aug 27, 2003 at 06:36:10PM +0900, Takao Indoh wrote:
> Besides this problem, there are many cases where increase of pagecache
> causes trouble, I think.
> For example, DBMS.
> DBMS caches index of DB in their process space.
> This index cache conflicts with the pagecache used by other applications,
> and index cache may be paged out. It cause uneven response of DBMS.
> In this case, limiting pagecache is effective.
Why is it effective? You're describing pagecache vs. pagecache
competition and the DBMS outcompeting the cooperating applications for
memory to the detriment of the workload; this is a very different
scenario from what "limiting pagecache" sounds like.
How do you know it would be effective? Have you written a patch to
limit it in some way and tried running it?
On Tue, 26 Aug 2003 02:46:34 -0700, William Lee Irwin III wrote:
>> One thing I thought of after the post was whether they actually had in
>> mind tunable hard limits on _unmapped_ pagecache, which is, in fact,
>> useful. OTOH that's largely speculation and we really need them to
>> articulate the true nature of their problem.
On Wed, Aug 27, 2003 at 06:36:10PM +0900, Takao Indoh wrote:
> I also think that is effective. Empirically, in the case where pagecache
> causes memory shortage, most of pagecache is unmapped page. Of course
> real problem may not be pagecashe, as you or Mike said.
How do you know most of it is unmapped?
At any rate, the above assigns a meaningful definition to the words you
used; it does not necessarily have anything to do with the issue you're
trying to describe. If you could start from the very basics, reproduce
the problem, instrument the workload with top(1) and vmstat(1), and find
some way to describe how the performance is inadequate (e.g. performance
metrics for your running DBMS/whatever in MB/s or transactions/s etc.),
it would be much more helpful than proposing a solution up front.
Without any evidence, we can't know it is a solution at all, or that
it's the right solution.
-- wli
Mike Fedyk wrote:
>
> That was because they wanted the non-streaming files to be left in the cache.
>
>> I will try to produce some benchmarktings tomorrow with different
>>'mem=%dMB'. I'm afraid to confirm that it will make difference.
>> But in advance: mantainance of page tables for 1GB and for 128MB of
>>RAM are going to make a difference.
>
> I'm sorry to say, but you *will* get lower performance if you lower the mem=
> value below your working set. This will also lower the total amount of
> memory available for your applications, and force your apps, to swap and
> balance cache, and app memory.
>
> That's not what you are looking to benchmark.
>
Okay. I'm completely puzzled.
I will quote here only one test - and I really do not understand this
stuff.
Three boots with the same parameters and only mem=nMB, n =
{512,256,128} (I have 512MB RAM)
hdparm tests:
[root@hera ifilipau]# hdparm -t /dev/hda
/dev/hda:
Timing buffered disk reads: 64 MB in 1.56 seconds = 41.03 MB/sec
[root@hera ifilipau]# hdparm -T /dev/hda
/dev/hda:
Timing buffer-cache reads: 128 MB in 0.44 seconds =290.91 MB/sec
[root@hera ifilipau]#
Before the tests I did 'swapoff -a; sync'.
This is Red Hat's 2.4.20-20.9 kernel.
Here is what has really puzzled me.
Operation: "cat *.bz2 >big_file", where *.bz2 is just two bzipped
kernels. Total size: 29MB+32MB (2.4.22 + 2.6.0-test1).
To be absolutely fair in this unfair benchmark I have run the test only
once. Times in seconds as shown by bash's time:
          cat      sync
512MB:    1.565    0.007
256MB:    1.649    0.008
128MB:    2.184    0.007
Kill me - shoot me, but how can it be?
The resulting file fits in RAM.
It is not hard to guess that the source files, which no one cares about
any more, are still hanging around in RAM...
That's not right: as long as the resulting file fits in memory - and it fits
in memory in all (512MB, 256MB, 128MB) cases - this operation should take
the _same_ time. (Actually, before the 128MB test, vmstat was saying that I
had +70MB of free, untouched memory.)
So the summary is quite simple: the kernel loses *terribly* much time
reordering read()s against write()s. Way _too_ _much_ time.
I will try to download Red Hat's AS kernel and play with the page-cache
tunable. After all, if RH has included that feature in their kernels, that
means it really makes sense ;-)))
--
Ihar 'Philips' Filipau / with best regards from Saarbruecken.
- - - - - - - - - - - - - - - - - - - -
* Please avoid sending me Word/PowerPoint/Excel attachments.
* See http://www.fsf.org/philosophy/no-word-attachments.html
- - - - - - - - - - - - - - - - - - - -
There should be some SCO's source code in Linux -
my servers sometimes are crashing. -- People
Ihar 'Philips' Filipau wrote:
> [..snip..]
> So the summary is quite simple: the kernel loses *terribly* much time
> reordering read()s against write()s. Way _too_ _much_ time.
The kernel spends _very_ little time in the disk elevator actually. The
2.4 elevator can send very suboptimal orderings of requests to the disk
when reads and writes are going to the disk at the same time. That might
be happening here. The VM might also be doing more work if you have other
things in RAM as well, although it's unlikely to cause such a big
difference.
On Wed, 27 Aug 2003 02:45:12 -0700, William Lee Irwin III wrote:
>On Wed, Aug 27, 2003 at 06:36:10PM +0900, Takao Indoh wrote:
>> Besides this problem, there are many cases where increase of pagecache
>> causes trouble, I think.
>> For example, DBMS.
>> DBMS caches index of DB in their process space.
>> This index cache conflicts with the pagecache used by other applications,
>> and index cache may be paged out. It cause uneven response of DBMS.
>> In this case, limiting pagecache is effective.
>
>Why is it effective? You're describing pagecache vs. pagecache
>competition and the DBMS outcompeting the cooperating applications for
>memory to the detriment of the workload; this is a very different
>scenario from what "limiting pagecache" sounds like.
>
>How do you know it would be effective? Have you written a patch to
>limit it in some way and tried running it?
It's just my guess. Do you mean that the "index cache" is in the pagecache?
The "index cache" is allocated in user space by malloc,
so I think it is not in the pagecache.
>On Tue, 26 Aug 2003 02:46:34 -0700, William Lee Irwin III wrote:
>>> One thing I thought of after the post was whether they actually had in
>>> mind tunable hard limits on _unmapped_ pagecache, which is, in fact,
>>> useful. OTOH that's largely speculation and we really need them to
>>> articulate the true nature of their problem.
>
>On Wed, Aug 27, 2003 at 06:36:10PM +0900, Takao Indoh wrote:
>> I also think that is effective. Empirically, in the case where pagecache
>> causes memory shortage, most of pagecache is unmapped page. Of course
>> real problem may not be pagecashe, as you or Mike said.
>
>How do you know most of it is unmapped?
I checked /proc/meminfo.
For example, this is my /proc/meminfo (kernel 2.5.73):
MemTotal: 902728 kB
MemFree: 53096 kB
Buffers: 18520 kB
Cached: 732360 kB
SwapCached: 0 kB
Active: 623068 kB
Inactive: 179552 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 902728 kB
LowFree: 53096 kB
SwapTotal: 506036 kB
SwapFree: 506036 kB
Dirty: 33204 kB
Writeback: 0 kB
Mapped: 73360 kB
Slab: 32468 kB
Committed_AS: 167396 kB
PageTables: 988 kB
VmallocTotal: 122808 kB
VmallocUsed: 20432 kB
VmallocChunk: 102376 kB
According to this information, I thought that
the total pagecache was 732360 kB and the total mapped pages were 73360 kB,
so almost all of the pagecache was not mapped...
Am I misreading meminfo?
--------------------------------------------------
Takao Indoh
E-Mail : [email protected]
On Wed, 27 Aug 2003 02:45:12 -0700, William Lee Irwin III wrote:
>> How do you know it would be effective? Have you written a patch to
>> limit it in some way and tried running it?
On Wed, Aug 27, 2003 at 08:14:12PM +0900, Takao Indoh wrote:
> It's just my guess. You mean that "index cache" is on the pagecache?
> "index cache" is allocated in the user space by malloc,
> so I think it is not on the pagecache.
That will be in the pagecache.
On Wed, 27 Aug 2003 02:45:12 -0700, William Lee Irwin III wrote:
>> How do you know most of it is unmapped?
On Wed, Aug 27, 2003 at 08:14:12PM +0900, Takao Indoh wrote:
> I checked /proc/meminfo.
> For example, this is my /proc/meminfo(kernel 2.5.73)
[...]
> Buffers: 18520 kB
> Cached: 732360 kB
> SwapCached: 0 kB
> Active: 623068 kB
> Inactive: 179552 kB
[...]
> Dirty: 33204 kB
> Writeback: 0 kB
> Mapped: 73360 kB
> Slab: 32468 kB
> Committed_AS: 167396 kB
[...]
> According to this information, I thought that
> all pagecache was 732360 kB and all mapped page was 73360 kB, so
> almost of pagecache was not mapped...
> Do I misread meminfo?
No. Most of your pagecache is unmapped pagecache. This would correspond
to memory that caches files which are not being mmapped by any process.
This could result from either the page replacement policy favoring
filesystem cache too heavily or from lots of io causing the filesystem
cache to be too bloated and so defeating the swapper's heuristics (you
can do this by generating large amounts of read() traffic).
Limiting unmapped pagecache would resolve your issue. Whether it's the
right thing to do is still open to question without some knowledge of
application behavior (for instance, teaching userspace to do fadvise()
may be the right thing to do, as opposed to the /proc tunable).
Can you gather traces of system calls being made by the applications?
-- wli
I've had experience with unneeded *mapped* pages (which would ideally be
flushed) crowding out needed mapped and unmapped pages.
Test case: grep --mmap SOME_STRING_I_WONT_FIND some_multi-GB-file
Sure, it's bad programming etc., but in that case, once those pages are
mapped, they can't be forcibly unmapped, even though in a utopian VM they
would be discarded as unneeded.
Could this very well be the problem?
-joe
----- Original Message -----
From: "William Lee Irwin III" <[email protected]>
To: "Takao Indoh" <[email protected]>
Cc: "Mike Fedyk" <[email protected]>; <[email protected]>
Sent: Wednesday, August 27, 2003 5:45 AM
Subject: Re: cache limit
> [..snip..]
I was premature about the test case, but still, a process (or several) that
mmaps several GBs of files and doesn't unmap what it doesn't need has caused
issues in the past.
----- Original Message -----
From: "Joseph Malicki" <[email protected]>
To: "William Lee Irwin III" <[email protected]>; "Takao Indoh"
<[email protected]>
Cc: "Mike Fedyk" <[email protected]>; <[email protected]>
Sent: Wednesday, August 27, 2003 12:01 PM
Subject: Re: cache limit
> [..snip..]
> On Wed, 27 Aug 2003 02:45:12 -0700, William Lee Irwin III wrote:
> >> How do you know it would be effective? Have you written a patch to
> >> limit it in some way and tried running it?
>
> On Wed, Aug 27, 2003 at 08:14:12PM +0900, Takao Indoh wrote:
> > It's just my guess. You mean that "index cache" is on the pagecache?
> > "index cache" is allocated in the user space by malloc,
> > so I think it is not on the pagecache.
>
> That will be in the pagecache.
No. A DBMS usually uses direct I/O, which bypasses the pagecache.
So "index caches" in the DBMS's user space will not be in the pagecache.
On Wed, 27 Aug 2003 04:36:46 -0700, William Lee Irwin III wrote:
>[..snip..]
>Can you gather traces of system calls being made by the applications?
This is the output of strace -cf:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
47.91 57.459696 65 885531 read
20.27 24.309112 33 727702 write
17.01 20.405231 1058 19292 vfork
13.41 16.087846 11 1524468 lseek
0.32 0.379605 10 38586 close
0.31 0.368272 19 19290 wait4
0.27 0.326425 17 19292 pipe
0.19 0.227052 12 19296 old_mmap
0.16 0.192420 10 19291 munmap
0.13 0.158041 8 19302 fstat64
0.01 0.009983 4992 2 fsync
0.00 0.001233 6 202 brk
0.00 0.001029 515 2 unlink
0.00 0.000173 25 7 1 open
0.00 0.000128 64 2 chmod
0.00 0.000092 7 13 6 stat64
0.00 0.000019 19 1 getcwd
0.00 0.000016 5 3 access
0.00 0.000015 4 4 shmat
0.00 0.000012 3 4 rt_sigaction
0.00 0.000007 7 1 mprotect
0.00 0.000002 2 1 getpid
------ ----------- ----------- --------- --------- ----------------
100.00 119.926409 3292292 7 total
According to this information, heavy I/O grows the pagecache and causes
the memory shortage.
fadvise may be effective, but fadvise always releases the cache
even if there is enough free memory, and that may degrade performance.
With the /proc tunable,
the pagecache is not released until system memory actually runs short.
On Thu, 28 Aug 2003 01:02:45 +0900, YoshiyaETO wrote:
>> On Wed, 27 Aug 2003 02:45:12 -0700, William Lee Irwin III wrote:
>> >> How do you know it would be effective? Have you written a patch to
>> >> limit it in some way and tried running it?
>>
>> On Wed, Aug 27, 2003 at 08:14:12PM +0900, Takao Indoh wrote:
>> > It's just my guess. You mean that "index cache" is on the pagecache?
>> > "index cache" is allocated in the user space by malloc,
>> > so I think it is not on the pagecache.
>>
>> That will be in the pagecache.
>
> No. DBMS usually uses DIRECTIO that bypass the pagecache.
>So, "index caches" in the DBMS user space will not be in pagecache.
If so, limiting the pagecache seems to be effective for a DBMS.
--------------------------------------------------
Takao Indoh
E-Mail : [email protected]
On Tue, Sep 02, 2003 at 07:52:51PM +0900, Takao Indoh wrote:
> >> According to this information, I thought that
> >> all pagecache was 732360 kB and all mapped page was 73360 kB, so
> >> almost of pagecache was not mapped...
> >> Do I misread meminfo?
Can you try your workload again with:
echo 0 > /proc/sys/vm/swappiness
On Tue, Sep 02, 2003 at 07:52:51PM +0900, Takao Indoh wrote:
> According to this information, many I/O increase pagecache and cause
> memory shortage.
> fadvise may be effective, but fadvise always releases cache
> even if there are enough free memory, and may degrade performance.
> In the case of /proc tunable,
> pagecache is not released until system memory become lack.
[...]
> If so, limiting pagecache seems to be effective for DBMS.
There are reasons why databases use raw io and direct io; this is one
of them. I'd say the kernel shouldn't try to engage in such tunable
shenanigans.
-- wli