2004-01-03 11:36:08

by Alex Buell

Subject: inode_cache / dentry_cache not being reclaimed aggressively enough on low-memory PCs

I've just run across a problem with 2.4.x (and probably 2.6.x as well, if
reports I've seen are correct). When updatedb is run overnight, it builds
up large amounts of inode_cache and dentry_cache. This is a big problem on
low-memory boxes as those caches are not being reclaimed aggressively
enough, which means the box will be constantly swapping if it runs out of
free memory. I've looked at the archives and found similar reports going
back to 2.4.16; the problem doesn't seem to have been solved, as it is
apparently in 2.6.0 as well!

The only solution I've found so far is to run L*rry McV*y's lmdd to force
reclamation of those caches, but this isn't ideal. What patches are out
there that solve this problem?

Thanks,
Alex.
--
http://www.munted.org.uk

Your mother cooks socks in hell


2004-01-03 16:30:40

by John Lash

Subject: Re: inode_cache / dentry_cache not being reclaimed aggressively enough on low-memory PCs

On Sat, 3 Jan 2004 11:35:36 +0000 (GMT)
Alex Buell <[email protected]> wrote:

> I've just run across a problem with 2.4.x (and probably 2.6.x as well, if
> reports I've see are correct). When updatedb is run overnight, it builds
> up large amounts of inode_cache and dentry_cache. This is a big problem on
> low memory boxes as those caches are not being reclaimed aggressively
> enough, which means the box will be constantly swapping if it runs out of
> free memory. I've looked at archives and I find that there's similar
> reports going back to 2.4.16, and doesn't seem to have been solved as this
> problem is apparently in 2.6.0 as well!
>
> The only solution I've found so far is to run L*rry McV*y's lmdd to force
> reclaimation of those caches but this isn't ideal. What patches are out
> there that solves this problem?
>

After taking the 30-minute tour of the 2.6.0 dcache code, one thing that seems
possible is that shrink_dcache_memory() is causing your problem. It does some
manipulation of the nr_unused value to prevent all unused dcache entries from
being removed, which would blow away recently used entries as well as the
ones that are really stale.

As it stands, it will maintain as many unused entries as there are used entries.
If this low-memory system has a large, stable number of in-use dentry objects,
the unused entries will match it, thereby holding double the memory and possibly
causing the problem you see.

Here are the lines I mean:

	nr_unused = dentry_stat.nr_unused;
	nr_used = dentry_stat.nr_dentry - nr_unused;
	if (nr_unused < nr_used * unused_ratio)
		return 0;
	/* unused_ratio = 1 at the top of the fn */
	return nr_unused - nr_used * unused_ratio;

Check on your system, /proc/sys/fs/dentry-state, first two values appear to be
nr_dentry and nr_unused. Plug those values into the above code and if you get
something around zero, that's why the memory is stuck.

A couple of solutions come to mind. The one I like best would be to adjust the
above code to make it conscious of the total memory in the system and keep
nr_unused to a reasonable percentage. Another is to allow unused_ratio to be
less than 1: possibly some /proc entry to lower it (0.5, 0.25, whatever), or,
to avoid the float, another parameter to act as an integer divisor for
unused_ratio. Something like:

nr_unused - nr_used * unused_ratio / ratio_fraction

If that's not why the memory is stuck, then it's something deeper in the reclaim
code. Either way, I'd be curious to know what you find. Depending on what your
system shows, I could provide a patch to try some things out.

--john


> Thanks,
> Alex.
> --
> http://www.munted.org.uk
>
> Your mother cooks socks in hell
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2004-01-03 18:28:59

by Alex Buell

Subject: Re: inode_cache / dentry_cache not being reclaimed aggressively enough on low-memory PCs

On Sat, 3 Jan 2004, John Lash wrote:

> Check on your system, /proc/sys/fs/dentry-state, first two values appear
> to be nr_dentry and nr_unused. Plug those values into the above code and
> if you get something around zero, that's why the memory is stuck.

Right, I put together this simple C program as follows:

#include <stdio.h>

int main(void)
{
	int dentries_total, dentries_unused;
	int ratio = 1, used;
	char entries[80];
	FILE *fh;

	fh = fopen("/proc/sys/fs/dentry-state", "r");
	if (fh == NULL) {
		perror("/proc/sys/fs/dentry-state");
		return 1;
	}
	if (fgets(entries, sizeof(entries), fh) == NULL) {
		fclose(fh);
		return 1;
	}
	fclose(fh);

	/* the first two fields are nr_dentry and nr_unused */
	if (sscanf(entries, "%d %d", &dentries_total, &dentries_unused) != 2)
		return 1;

	printf("nr_dentry: %d\nnr_unused: %d\n\n",
	       dentries_total, dentries_unused);

	used = dentries_total - dentries_unused;
	printf("%d < %d * %d = %d\n", dentries_unused, used, ratio,
	       (dentries_unused < used * ratio));

	return 0;
}

This gives me interesting results:

1) On a box with humongous dentries:
./fs_cache
nr_dentry: 76637
nr_unused: 67869

67869 < 8768 * 1 = 0

2) On a box with not so many:
./fs_cache
nr_dentry: 7950
nr_unused: 572

572 < 7378 * 1 = 1

So it seems you're quite right.

> A couple of solutions come to mind. The one I like best would be to
> adjust the above code to make it conscious of the total memory in the
> system and keep nr_unused to a reasonable percentage. Another is to
> allow unused_ratio to be less than 1, Possibly some/proc entry to lower
> it (.5, .25, whatever), or to avoid the float, provide another parameter
> to do an integer divisor for unused_ratio. Something like:
>
> nr_unused - nr_used * unused_ratio / ratio_fraction

That solution does seem to be the best answer.

--
http://www.munted.org.uk

Your mother cooks socks in hell

2004-01-03 22:25:07

by Mike Fedyk

Subject: Re: inode_cache / dentry_cache not being reclaimed aggressively enough on low-memory PCs

On Sat, Jan 03, 2004 at 06:27:42PM +0000, Alex Buell wrote:
> On Sat, 3 Jan 2004, John Lash wrote:
> > A couple of solutions come to mind. The one I like best would be to
> > adjust the above code to make it conscious of the total memory in the
> > system and keep nr_unused to a reasonable percentage. Another is to
> > allow unused_ratio to be less than 1, Possibly some/proc entry to lower
> > it (.5, .25, whatever), or to avoid the float, provide another parameter
> > to do an integer divisor for unused_ratio. Something like:
> >
> > nr_unused - nr_used * unused_ratio / ratio_fraction
>
> That solution does seem be the best answer.

Be sure to run your changes by roger luetger. He's working on the
problems with lowmem machines and 2.6.

2004-01-03 22:56:01

by Andrew Morton

Subject: Re: inode_cache / dentry_cache not being reclaimed aggressively enough on low-memory PCs

John Lash <[email protected]> wrote:
>
> As it stands, it will maintain as many unused entries as there are used entries.
> If this low memory system las a large, stable, number of inuse dentry objects,
> the unused entries will match it thereby holding double the memory and possibly
> causing the problem you see.

Yup. There is a fix in 2.6.1-rc1 for this.

2004-01-04 00:15:54

by Alex Buell

Subject: Re: inode_cache / dentry_cache not being reclaimed aggressively enough on low-memory PCs

On Sat, 3 Jan 2004, Andrew Morton wrote:

> John Lash <[email protected]> wrote:
> >
> > As it stands, it will maintain as many unused entries as there are used entries.
> > If this low memory system las a large, stable, number of inuse dentry objects,
> > the unused entries will match it thereby holding double the memory and possibly
> > causing the problem you see.
>
> Yup. There is a fix in 2.6.1-rc1 for this.

Which change would that be? It would be nice to back-port that to 2.4.x if
that's possible?

--
http://www.munted.org.uk

Your mother cooks socks in hell

2004-01-04 03:05:58

by Andrew Morton

Subject: Re: inode_cache / dentry_cache not being reclaimed aggressively enough on low-memory PCs

Alex Buell <[email protected]> wrote:
>
> On Sat, 3 Jan 2004, Andrew Morton wrote:
>
> > John Lash <[email protected]> wrote:
> > >
> > > As it stands, it will maintain as many unused entries as there are used entries.
> > > If this low memory system las a large, stable, number of inuse dentry objects,
> > > the unused entries will match it thereby holding double the memory and possibly
> > > causing the problem you see.
> >
> > Yup. There is a fix in 2.6.1-rc1 for this.
>
> Which change would that be? It would be nice to back-port that to 2.4.x if
> that's possible?

It is not backportable.

You could try increasing `count' in shrink_dcache_memory() and
shrink_icache_memory(). Also you should be using 2.4.23 or later because
it does have improvements in the memory reclaim area.


2004-01-04 05:31:33

by John Lash

Subject: Re: inode_cache / dentry_cache not being reclaimed aggressively enough on low-memory PCs

ahh, good. I'll take a look. Thanks Andrew.

--john


On Sat, 3 Jan 2004 14:55:57 -0800
Andrew Morton <[email protected]> wrote:

> John Lash <[email protected]> wrote:
> >
> > As it stands, it will maintain as many unused entries as there are used
> > entries.
> > If this low memory system las a large, stable, number of inuse dentry
> > objects, the unused entries will match it thereby holding double the memory
> > and possibly causing the problem you see.
>
> Yup. There is a fix in 2.6.1-rc1 for this.

2004-01-04 07:23:19

by Mike Fedyk

Subject: Re: inode_cache / dentry_cache not being reclaimed aggressively enough on low-memory PCs

On Sat, Jan 03, 2004 at 07:05:43PM -0800, Andrew Morton wrote:
> Alex Buell <[email protected]> wrote:
> >
> > On Sat, 3 Jan 2004, Andrew Morton wrote:
> >
> > > John Lash <[email protected]> wrote:
> > > >
> > > > As it stands, it will maintain as many unused entries as there are used entries.
> > > > If this low memory system las a large, stable, number of inuse dentry objects,
> > > > the unused entries will match it thereby holding double the memory and possibly
> > > > causing the problem you see.
> > >
> > > Yup. There is a fix in 2.6.1-rc1 for this.
> >
> > Which change would that be? It would be nice to back-port that to 2.4.x if
> > that's possible?
>
> It is not backportable.
>
> You could try increasing `count' in shrink_dcache_memory() and
> shrink_icache_memory(). Also you should be using 2.4.23 or later because
> it does have improvements in the memory reclaim area.

Also, if any improvements are considered for the 2.4 VM, they should be
on top of the -aa series. That's where the latest updates are, and it
doesn't make sense to work from a base that already has separate
improvements available.

2004-01-05 17:44:01

by Marcelo Tosatti

Subject: Re: inode_cache / dentry_cache not being reclaimed aggressively enough on low-memory PCs



On Sat, 3 Jan 2004, Mike Fedyk wrote:

> On Sat, Jan 03, 2004 at 07:05:43PM -0800, Andrew Morton wrote:
> > Alex Buell <[email protected]> wrote:
> > >
> > > On Sat, 3 Jan 2004, Andrew Morton wrote:
> > >
> > > > John Lash <[email protected]> wrote:
> > > > >
> > > > > As it stands, it will maintain as many unused entries as there are used entries.
> > > > > If this low memory system las a large, stable, number of inuse dentry objects,
> > > > > the unused entries will match it thereby holding double the memory and possibly
> > > > > causing the problem you see.
> > > >
> > > > Yup. There is a fix in 2.6.1-rc1 for this.
> > >
> > > Which change would that be? It would be nice to back-port that to 2.4.x if
> > > that's possible?
> >
> > It is not backportable.
> >
> > You could try increasing `count' in shrink_dcache_memory() and
> > shrink_icache_memory(). Also you should be using 2.4.23 or later because
> > it does have improvements in the memory reclaim area.
>
> Also, if there are any improvements considered for the 2.4 VM, it should be
> on top of the -aa series. That's where the latest updates are, and it
> doesn't make sence to work from a base that already has seperate
> improvements available.

The fix in -aa seems to reclaim inodes very aggressively. The 2.4 RH tree
seems to contain a better version. Need to look into that.

2004-01-05 18:48:10

by Mike Fedyk

Subject: Re: inode_cache / dentry_cache not being reclaimed aggressively enough on low-memory PCs

On Mon, Jan 05, 2004 at 03:32:57PM -0200, Marcelo Tosatti wrote:
>
>
> On Sat, 3 Jan 2004, Mike Fedyk wrote:
> > Also, if there are any improvements considered for the 2.4 VM, it should be
> > on top of the -aa series. That's where the latest updates are, and it
> > doesn't make sence to work from a base that already has seperate
> > improvements available.
>
> The fix in -aa seems to reclaim inodes very aggressively. The 2.4 RH tree
> seems to contain a better version. Need to look into that.

http://www.matchmail.com/stats/lrrd/matchmail.com/fileserver.matchmail.com-memory.html

Comparing[1] week 51 (2.4.23-rc5) and week 01 (2.4.23-aa1) shows that the
slab cache can grow larger for the same workload in -aa right now.

I have a backup that runs every day at 4-6am that is somewhat memory
intensive since it uses smbfs (that's notorious for its bad memory usage
patterns), and it doesn't shrink the slab at all. The only thing that
affected slab size was closing one of my mutt instances that was running on
a maildir folder with 28k messages in it on Tuesday of week 01.



http://www.matchmail.com/stats/lrrd/matchmail.com/fileserver.matchmail.com-swap.html

2.4.23-aa may or may not have problems with inode/dentry reclaim (I haven't
checked other workloads), but it sure cuts down the amount of swap I/O
performed.



Here's the top slab users right now:

inode_cache     456388 457304  512 57163 57163 1 : 534792  4012606 132385  68961 0 : 124
dentry_cache    626116 641116  136 22897 22897 1 : 731108  7905948  51318  27096 0 : 252 12
buffer_head     127878 132732  108  3684  3687 1 : 163476 29891760  62063  58376 0 : 252 126
size-64         137960 141934   72  2678  2678 1 : 152534  2392141   3263    585 0 : 252 126 :
vm_area_struct    5586   6600   76   124   132 1 :   8650 17144108   1294   1162 0 : 252 126
blkdev_requests   4096   4120   96   103   103 1 :   4640     5018  11613        0 : 252 126 :
size-4096           98     98 4096    98    98 1 :    827   264784 173433 173335 0 : 60 3


2004-01-09 20:49:51

by Rik van Riel

Subject: Re: inode_cache / dentry_cache not being reclaimed aggressively enough on low-memory PCs

On Mon, 5 Jan 2004, Marcelo Tosatti wrote:

> The fix in -aa seems to reclaim inodes very aggressively. The 2.4 RH tree
> seems to contain a better version. Need to look into that.

Here it is, against yesterday's 2.4 bitkeeper tree.
Some comments:
- prune_icache() can be called with a priority of zero, so we
  should check for <= 0
- the main reason we don't strip the page cache from inodes
  with lots of cache in memory is the inefficiencies observed
  in invalidate_inode_pages()

I have tested this code with a trivial shell script that
touches a zillion and a half really small files over and
over again, more than the system can cache. Once the
inodes and dentries fill up more than about 800MB of low
memory the system gets very uncomfortable without the
patch. With the patch it simply frees a whole bunch of
inodes and things run fine.


===== fs/inode.c 1.46 vs edited =====
--- 1.46/fs/inode.c Wed Dec 31 07:31:15 2003
+++ edited/fs/inode.c Thu Jan 8 12:23:51 2004
@@ -49,7 +49,8 @@
* other linked list is the "type" list:
* "in_use" - valid inode, i_count > 0, i_nlink > 0
* "dirty" - as "in_use" but also dirty
- * "unused" - valid inode, i_count = 0
+ * "unused" - valid inode, i_count = 0, no pages in the pagecache
+ * "unused_pagecache" - valid inode, i_count = 0, data in the pagecache
*
* A "dirty" list is maintained for each super block,
* allowing for low-overhead inode sync() operations.
@@ -57,6 +58,7 @@

static LIST_HEAD(inode_in_use);
static LIST_HEAD(inode_unused);
+static LIST_HEAD(inode_unused_pagecache);
static struct list_head *inode_hashtable;
static LIST_HEAD(anon_hash_chain); /* for inodes with NULL i_sb */

@@ -286,6 +288,36 @@
inodes_stat.nr_unused--;
}

+static inline void __refile_inode(struct inode *inode)
+{
+ struct list_head *to;
+
+ if (inode->i_state & I_FREEING)
+ return;
+ if (list_empty(&inode->i_hash))
+ return;
+
+ if (inode->i_state & I_DIRTY)
+ to = &inode->i_sb->s_dirty;
+ else if (atomic_read(&inode->i_count))
+ to = &inode_in_use;
+ else if (inode->i_data.nrpages)
+ to = &inode_unused_pagecache;
+ else
+ to = &inode_unused;
+ list_del(&inode->i_list);
+ list_add(&inode->i_list, to);
+}
+
+void refile_inode(struct inode *inode)
+{
+ if (!inode)
+ return;
+ spin_lock(&inode_lock);
+ __refile_inode(inode);
+ spin_unlock(&inode_lock);
+}
+
static inline void __sync_one(struct inode *inode, int sync)
{
unsigned dirty;
@@ -312,17 +344,8 @@

spin_lock(&inode_lock);
inode->i_state &= ~I_LOCK;
- if (!(inode->i_state & I_FREEING)) {
- struct list_head *to;
- if (inode->i_state & I_DIRTY)
- to = &inode->i_sb->s_dirty;
- else if (atomic_read(&inode->i_count))
- to = &inode_in_use;
- else
- to = &inode_unused;
- list_del(&inode->i_list);
- list_add(&inode->i_list, to);
- }
+ if (!(inode->i_state & I_FREEING))
+ __refile_inode(inode);
wake_up(&inode->i_wait);
}

@@ -699,6 +722,7 @@
spin_lock(&inode_lock);
busy = invalidate_list(&inode_in_use, sb, &throw_away);
busy |= invalidate_list(&inode_unused, sb, &throw_away);
+ busy |= invalidate_list(&inode_unused_pagecache, sb, &throw_away);
busy |= invalidate_list(&sb->s_dirty, sb, &throw_away);
busy |= invalidate_list(&sb->s_locked_inodes, sb, &throw_away);
spin_unlock(&inode_lock);
@@ -762,7 +786,7 @@
{
LIST_HEAD(list);
struct list_head *entry, *freeable = &list;
- int count;
+ int count, avg_pages;
struct inode * inode;

spin_lock(&inode_lock);
@@ -785,7 +809,7 @@
list_add(tmp, freeable);
inode->i_state |= I_FREEING;
count++;
- if (!--goal)
+ if (--goal <= 0)
break;
}
inodes_stat.nr_unused -= count;
@@ -799,8 +823,70 @@
* from here or we're either synchronously dogslow
* or we deadlock with oom.
*/
- if (goal)
+ if (goal > 0)
schedule_task(&unused_inodes_flush_task);
+
+#ifdef CONFIG_HIGHMEM
+ /*
+ * On highmem machines it is possible to have low memory
+ * filled with inodes that cannot be reclaimed because they
+ * have page cache pages in highmem attached to them.
+ * This could deadlock the system if the memory used by
+ * inodes is significant compared to the amount of freeable
+ * low memory. In that case we forcefully remove the page
+ * cache pages from the inodes we want to reclaim.
+ *
+ * Note that this loop doesn't actually reclaim the inodes;
+ * once the last pagecache pages belonging to the inode is
+ * gone it will be placed on the inode_unused list and the
+ * loop above will prune it the next time prune_icache() is
+ * called.
+ */
+ if (goal <= 0)
+ return;
+ if (inodes_stat.nr_unused * sizeof(struct inode) * 10 <
+ freeable_lowmem() * PAGE_SIZE)
+ return;
+
+ wakeup_bdflush();
+
+ avg_pages = page_cache_size;
+ avg_pages -= atomic_read(&buffermem_pages) + swapper_space.nrpages;
+ avg_pages = avg_pages / (inodes_stat.nr_inodes + 1);
+ spin_lock(&inode_lock);
+ while (goal-- > 0) {
+ if (list_empty(&inode_unused_pagecache))
+ break;
+ entry = inode_unused_pagecache.prev;
+ list_del(entry);
+ list_add(entry, &inode_unused_pagecache);
+
+ inode = INODE(entry);
+ /* Don't nuke inodes with lots of page cache attached. */
+ if (inode->i_mapping->nrpages > 5 * avg_pages)
+ continue;
+ /* Because of locking we grab the inode and unlock the list .*/
+ if (inode->i_state & I_LOCK)
+ continue;
+ inode->i_state |= I_LOCK;
+ spin_unlock(&inode_lock);
+
+ /*
+ * If the inode has clean pages only, we can free all its
+ * pagecache memory; the inode will automagically be refiled
+ * onto the unused_list. The wakeup_bdflush above makes
+ * sure that all inodes become clean eventually.
+ */
+ if (list_empty(&inode->i_mapping->dirty_pages) &&
+ !inode_has_buffers(inode))
+ invalidate_inode_pages(inode);
+
+ /* Release the inode again. */
+ spin_lock(&inode_lock);
+ inode->i_state &= ~I_LOCK;
+ }
+ spin_unlock(&inode_lock);
+#endif /* CONFIG_HIGHMEM */
}

int shrink_icache_memory(int priority, int gfp_mask)
===== include/linux/fs.h 1.95 vs edited =====
--- 1.95/include/linux/fs.h Fri Dec 5 22:25:43 2003
+++ edited/include/linux/fs.h Thu Jan 8 15:03:26 2004
@@ -1399,6 +1399,7 @@
extern void inode_init_once(struct inode *);
extern void __inode_init_once(struct inode *);
extern void iput(struct inode *);
+extern void refile_inode(struct inode *inode);
extern void force_delete(struct inode *);
extern struct inode * igrab(struct inode *);
extern struct inode * ilookup(struct super_block *, unsigned long);
===== include/linux/swap.h 1.39 vs edited =====
--- 1.39/include/linux/swap.h Fri Sep 12 10:25:22 2003
+++ edited/include/linux/swap.h Thu Jan 8 15:03:26 2004
@@ -85,6 +85,7 @@

extern unsigned int nr_free_pages(void);
extern unsigned int nr_free_buffer_pages(void);
+extern unsigned int freeable_lowmem(void);
extern int nr_active_pages;
extern int nr_inactive_pages;
extern unsigned long page_cache_size;
===== mm/filemap.c 1.96 vs edited =====
--- 1.96/mm/filemap.c Wed Dec 31 07:31:15 2003
+++ edited/mm/filemap.c Tue Jan 6 16:04:15 2004
@@ -102,6 +102,8 @@
page->mapping = NULL;
wmb();
mapping->nrpages--;
+ if (!mapping->nrpages)
+ refile_inode(mapping->host);
}

static inline void remove_page_from_hash_queue(struct page * page)
===== mm/page_alloc.c 1.67 vs edited =====
--- 1.67/mm/page_alloc.c Thu Dec 18 12:47:08 2003
+++ edited/mm/page_alloc.c Thu Jan 8 14:53:31 2004
@@ -534,6 +534,23 @@

return pages;
}
+
+unsigned int freeable_lowmem(void)
+{
+ unsigned int pages = 0;
+ pg_data_t *pgdat;
+
+ for_each_pgdat(pgdat) {
+ pages += pgdat->node_zones[ZONE_DMA].free_pages;
+ pages += pgdat->node_zones[ZONE_DMA].nr_active_pages;
+ pages += pgdat->node_zones[ZONE_DMA].nr_inactive_pages;
+ pages += pgdat->node_zones[ZONE_NORMAL].free_pages;
+ pages += pgdat->node_zones[ZONE_NORMAL].nr_active_pages;
+ pages += pgdat->node_zones[ZONE_NORMAL].nr_inactive_pages;
+ }
+
+ return pages;
+}
#endif

#define K(x) ((x) << (PAGE_SHIFT-10))