2002-07-20 19:39:52

by Rik van Riel

Subject: [PATCH][1/2] return values shrink_dcache_memory etc

Hi,

this patch, against current 2.5.27, builds on the patch that let
kmem_cache_shrink return the number of pages freed. This value
is used as the return value for shrink_dcache_memory and friends.

This is useful not just for more accurate OOM detection, but also as
a preparation for putting these reclaimable slab pages on the LRU list.
This change was originally done by Ed Tomlinson.

please apply,
thank you,

Rik
--
Bravely reimplemented by the knights who say "NIH".

 fs/dcache.c |    3 +--
 fs/dquot.c  |    3 +--
 fs/inode.c  |    3 +--
 mm/vmscan.c |    6 +++---
 4 files changed, 6 insertions(+), 9 deletions(-)

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.660 -> 1.661
# fs/dcache.c 1.29 -> 1.30
# fs/dquot.c 1.43 -> 1.44
# mm/vmscan.c 1.85 -> 1.86
# fs/inode.c 1.66 -> 1.67
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/07/19 [email protected] 1.661
# use the return values from shrink_dcache_memory, shrink_icache_memory
# and shrink_dqcache_memory (thanks to Ed Tomlinson)
# --------------------------------------------
#
diff -Nru a/fs/dcache.c b/fs/dcache.c
--- a/fs/dcache.c Fri Jul 19 18:22:35 2002
+++ b/fs/dcache.c Fri Jul 19 18:22:35 2002
@@ -603,8 +603,7 @@
 	count = dentry_stat.nr_unused / priority;
 
 	prune_dcache(count);
-	kmem_cache_shrink(dentry_cache);
-	return 0;
+	return kmem_cache_shrink(dentry_cache);
 }
 
 #define NAME_ALLOC_LEN(len)	((len+16) & ~15)
diff -Nru a/fs/dquot.c b/fs/dquot.c
--- a/fs/dquot.c Fri Jul 19 18:22:35 2002
+++ b/fs/dquot.c Fri Jul 19 18:22:35 2002
@@ -498,8 +498,7 @@
 	count = dqstats.free_dquots / priority;
 	prune_dqcache(count);
 	unlock_kernel();
-	kmem_cache_shrink(dquot_cachep);
-	return 0;
+	return kmem_cache_shrink(dquot_cachep);
 }
 
 /*
diff -Nru a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c Fri Jul 19 18:22:35 2002
+++ b/fs/inode.c Fri Jul 19 18:22:35 2002
@@ -431,8 +431,7 @@
 	count = inodes_stat.nr_unused / priority;
 
 	prune_icache(count);
-	kmem_cache_shrink(inode_cachep);
-	return 0;
+	return kmem_cache_shrink(inode_cachep);
 }
 
 /*
diff -Nru a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c Fri Jul 19 18:22:35 2002
+++ b/mm/vmscan.c Fri Jul 19 18:22:35 2002
@@ -389,12 +389,12 @@

 	wakeup_bdflush();
 
-	shrink_dcache_memory(priority, gfp_mask);
+	nr_pages += shrink_dcache_memory(priority, gfp_mask);
 
 	/* After shrinking the dcache, get rid of unused inodes too .. */
-	shrink_icache_memory(1, gfp_mask);
+	nr_pages += shrink_icache_memory(1, gfp_mask);
 #ifdef CONFIG_QUOTA
-	shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
+	nr_pages += shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
 #endif
 
 	return nr_pages;



2002-07-20 20:07:37

by Linus Torvalds

Subject: Re: [PATCH][1/2] return values shrink_dcache_memory etc



On Sat, 20 Jul 2002, Rik van Riel wrote:
>
> this patch, against current 2.5.27, builds on the patch that let
> kmem_cache_shrink return the number of pages freed. This value
> is used as the return value for shrink_dcache_memory and friends.

I disagree with the whole approach of having shrink_cache() return the
number of pages freed.

The number is meaningless, since it has nothing to do with the actual
memory zones that are under pressure (right now, the memory zone is almost
always ZONE_NORMAL, which is correct, but that's just pure luck rather
than anything fundamental).

I'd be much more interested in the "put the cache pages on the dirty list,
and have memory pressure push them out in LRU order" approach. Somebody
already had preliminary patches.

That gets _rid_ of dcache_shrink() and friends, instead of making them
return meaningless numbers.

Linus

2002-07-20 20:38:58

by Rik van Riel

Subject: Re: [PATCH][1/2] return values shrink_dcache_memory etc

On Sat, 20 Jul 2002, Linus Torvalds wrote:
> On Sat, 20 Jul 2002, Rik van Riel wrote:
> >
> > this patch, against current 2.5.27, builds on the patch that let
> > kmem_cache_shrink return the number of pages freed. This value
> > is used as the return value for shrink_dcache_memory and friends.

> I'd be much more interested in the "put the cache pages on the dirty list,
> and have memory pressure push them out in LRU order" approach. Somebody
> already had preliminary patches.
>
> That gets _rid_ of dcache_shrink() and friends, instead of making them
> return meaningless numbers.

OK, I'll try to forward-port Ed's code to do that from 2.4 to 2.5
this weekend...

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-07-20 20:49:55

by Linus Torvalds

Subject: Re: [PATCH][1/2] return values shrink_dcache_memory etc



On Sat, 20 Jul 2002, Rik van Riel wrote:
>
> OK, I'll try to forward-port Ed's code to do that from 2.4 to 2.5
> this weekend...

Side note: while I absolutely think that is the right thing to do, that's
also the much more "interesting" change. As a result, I'd be happier if it
went through channels (ie probably Andrew) and had some wider testing
first at least in the form of a CFT on linux-kernel.

[ Or has it already been in 2.4.x in any major tree? (In which case my
testing argument is lessened to some degree and it's mainly just to
verify that the forward-port works). ]

Thanks,
Linus

2002-07-20 21:39:18

by Rik van Riel

Subject: Re: [PATCH][1/2] return values shrink_dcache_memory etc

On Sat, 20 Jul 2002, Linus Torvalds wrote:
> On Sat, 20 Jul 2002, Rik van Riel wrote:
> >
> > OK, I'll try to forward-port Ed's code to do that from 2.4 to 2.5
> > this weekend...
>
> Side note: while I absolutely think that is the right thing to do, that's
> also the much more "interesting" change. As a result, I'd be happier if it
> went through channels (ie probably Andrew) and had some wider testing
> first at least in the form of a CFT on linux-kernel.

Absolutely. This idea has been implemented but hasn't yet
received wider testing.

I guess I'll try to get this patch on LWN with a large
CFT attached ;)))

> [ Or has it already been in 2.4.x in any major tree?

Not yet. I'll pass it on through Andrew...

cheers,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-07-22 13:58:28

by Rik van Riel

Subject: Re: [PATCH][1/2] return values shrink_dcache_memory etc

On Mon, 22 Jul 2002, Martin J. Bligh wrote:

> > Was it purely Oracle which drove pte-highmem, or do you think
>
> I don't see how you can get into pathological crap without heavy
> sharing of large amounts of data .... without sharing, you're at
> a fixed percentage of phys mem - with sharing, I can have more
> PTEs needed than I have phys mem.

... for which pte_highmem wouldn't fix the problem, either.

From what I can see we really want/need 2 complementary
solutions to fix this problem:

1) large pages and/or shared page tables to reduce page
table overhead, which is a real "solution"

2) page table garbage collection, to reduce the amount
of (resident?) page tables when the shit hits the fan;
this is an "emergency" thing to have and we wouldn't
want to use it continuously, but it might be important
to keep the box alive

Since I've heard that (1) is already in use by some people
I've started working on (2) the moment Linus asked me to put
the dentry/icache pages in the LRU ;)))

cheers,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-07-22 13:31:24

by Rik van Riel

Subject: Re: [PATCH][1/2] return values shrink_dcache_memory etc

On Sun, 21 Jul 2002, Andrew Morton wrote:
> "Martin J. Bligh" wrote:

> > These large NUMA machines should actually be rmap's glory day in the
> > sun.
>
> "should be". Sigh. Be nice to see an "is" one day ;)

You asked for a "minimal rmap" patch and you got it. ;)

Bill and I actually have code for many of the things listed
but we haven't submitted it yet exactly because everybody
wanted the code merged in small, manageable chunks.

> Do you think that large pages alone would be enough to allow us
> to leave pte_chains (and page tables?) in ZONE_NORMAL, or would
> shared pagetables also be needed?

Large pages should reduce the page table overhead by a factor
of 1024 (or 512 for PAE) and have the same alignment restrictions
that shared page tables have.

OTOH, shared page tables would allow us to map in chunks smaller
than 4MB ... but at what seems like a pretty horrible locking and
accounting complexity, unless somebody comes up with a smart trick.

Apart from both of these we'll also need code to garbage collect
empty page tables so users can't clog up memory by mmaping a page
every 4 MB ;)

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-07-22 06:06:19

by Martin J. Bligh

Subject: Re: [PATCH][1/2] return values shrink_dcache_memory etc

>> > If we can get something in place which works acceptably on Martin
>> > Bligh's machines, and we can see that the gains of rmap (whatever
>> > they are ;)) are worth the as-yet uncoded pains then let's move on.
>> > But until then, adding new stuff to the VM just makes a `patch -R'
>> > harder to do.
>>
>> I have the same kinds of machines and have already been testing with
>> precisely the many tasks workloads he's concerned about for the sake of
>> correctness, and efficiency is also a concern here. highpte_chain is
>> already so high up on my priority queue that all other work is halted.
>
> OK. But we're adding non-trivial amounts of new code simply
> to get the reverse mapping working as robustly as the virtual
> scan. And we'll always have rmap's additional storage requirements.
>
> At some point we need to make a decision as to whether it's all
> worth it. Right now we do not even have the information on the
> pluses side to do this. That's worrisome.

These large NUMA machines should actually be rmap's glory day in the
sun. Per-node kswapd, being able to free mem pressure on one node
easily (without cross-node bouncing), breakup of the lru list into
smaller chunks, etc. These actually fix some of the biggest problems
that we have right now and are hard to solve in other ways.

The large rmap overheads we still have to kill seem to me to be the
memory usage and the fork overhead. There's also a certain amount of
overhead to managing any more data structures, of course. I think we
know how to kill most of it. I don't think adding highpte_chain is
the correct thing to do ... seems like adding insult to injury. I'd
rather see us drive a silver stake through the problem's heart and
kill it properly ...

M.

2002-07-22 07:19:57

by Martin J. Bligh

Subject: Re: [PATCH][1/2] return values shrink_dcache_memory etc

> Well that would be nice. And by extension, pte-highmem gets a stake
> as well.
>
> Do you think that large pages alone would be enough to allow us
> to leave pte_chains (and page tables?) in ZONE_NORMAL, or would
> shared pagetables also be needed?
>
> Was it purely Oracle which drove pte-highmem, or do you think

I don't see how you can get into pathological crap without heavy
sharing of large amounts of data .... without sharing, you're at
a fixed percentage of phys mem - with sharing, I can have more
PTEs needed than I have phys mem. So anything that shares heavily
is the worst problem, and databases seem to be the worst for this ...

If phys ram is significantly greater than virtual address space,
you can, of course, still kill yourself quite happily, particularly
as ZONE_NORMAL is full of struct page's etc at this point. But I am
under the impression we can make pte_highmem much nicer by putting
everything into UKVA, and thus just mapping our own process's mappings
rather than everyone's in the universe ... giving some sort of solution
to the problem. pte_chains don't do that nicely.

> that page table and pte_chain consumption could be a problem
> on applications which can't/won't use large pages?

Bill and Ben (desperately tempting to bring up English children's
programs at this point ...) were describing some sort of 32Kb
windowing Oracle uses which would render large pages useless
without some funky windowing we apparently used to do for PTX.
Ben ... could you describe this 32Kb stuff?

M.

2002-07-22 05:13:11

by William Lee Irwin III

Subject: Re: [PATCH][1/2] return values shrink_dcache_memory etc

On Sun, Jul 21, 2002 at 10:04:29PM -0700, Andrew Morton wrote:
> I'd suggest that we avoid putting any additional changes into
> the VM until we have solutions available for:
> 2: Make it work with pte-highmem (Bill Irwin is signed up for this)
> 4: Move the pte_chains into highmem too (Bill, I guess)
> 6: maybe GC the pte_chain backing pages. (Seems unavoidable. Rik?)
> Especially pte_chains in highmem. Failure to fix this well
> is a showstopper for rmap on large ia32 machines, which makes
> it a showstopper full stop.

I'll send you an update of my solution for (6), the initial version of
which was posted earlier today, in a separate post.

highpte_chain will do (2) and (4) simultaneously when it's debugged.


On Sun, Jul 21, 2002 at 10:04:29PM -0700, Andrew Morton wrote:
> If we can get something in place which works acceptably on Martin
> Bligh's machines, and we can see that the gains of rmap (whatever
> they are ;)) are worth the as-yet uncoded pains then let's move on.
> But until then, adding new stuff to the VM just makes a `patch -R'
> harder to do.

I have the same kinds of machines and have already been testing with
precisely the many tasks workloads he's concerned about for the sake of
correctness, and efficiency is also a concern here. highpte_chain is
already so high up on my priority queue that all other work is halted.


Cheers,
Bill

2002-07-22 04:53:26

by Andrew Morton

Subject: Re: [PATCH][1/2] return values shrink_dcache_memory etc

Linus Torvalds wrote:
>
> On Sat, 20 Jul 2002, Rik van Riel wrote:
> >
> > OK, I'll try to forward-port Ed's code to do that from 2.4 to 2.5
> > this weekend...
>
> Side note: while I absolutely think that is the right thing to do, that's
> also the much more "interesting" change. As a result, I'd be happier if it
> went through channels (ie probably Andrew) and had some wider testing
> first at least in the form of a CFT on linux-kernel.
>

I'd suggest that we avoid putting any additional changes into
the VM until we have solutions available for:

2: Make it work with pte-highmem (Bill Irwin is signed up for this)

4: Move the pte_chains into highmem too (Bill, I guess)

6: maybe GC the pte_chain backing pages. (Seems unavoidable. Rik?)


Especially pte_chains in highmem. Failure to fix this well
is a showstopper for rmap on large ia32 machines, which makes
it a showstopper full stop.

If we can get something in place which works acceptably on Martin
Bligh's machines, and we can see that the gains of rmap (whatever
they are ;)) are worth the as-yet uncoded pains then let's move on.
But until then, adding new stuff to the VM just makes a `patch -R'
harder to do.

-

2002-07-22 05:27:52

by Andrew Morton

Subject: Re: [PATCH][1/2] return values shrink_dcache_memory etc

William Lee Irwin III wrote:
>
> On Sun, Jul 21, 2002 at 10:04:29PM -0700, Andrew Morton wrote:
> > I'd suggest that we avoid putting any additional changes into
> > the VM until we have solutions available for:
> > 2: Make it work with pte-highmem (Bill Irwin is signed up for this)
> > 4: Move the pte_chains into highmem too (Bill, I guess)
> > 6: maybe GC the pte_chain backing pages. (Seems unavoidable. Rik?)
> > Especially pte_chains in highmem. Failure to fix this well
> > is a showstopper for rmap on large ia32 machines, which makes
> > it a showstopper full stop.
>
> I'll send you an update of my solution for (6), the initial version of
> which was posted earlier today, in a separate post.

Thanks, Bill. Yup, I'm playing with pte_chain_mempool at present.

> highpte_chain will do (2) and (4) simultaneously when it's debugged.
>
> On Sun, Jul 21, 2002 at 10:04:29PM -0700, Andrew Morton wrote:
> > If we can get something in place which works acceptably on Martin
> > Bligh's machines, and we can see that the gains of rmap (whatever
> > they are ;)) are worth the as-yet uncoded pains then let's move on.
> > But until then, adding new stuff to the VM just makes a `patch -R'
> > harder to do.
>
> I have the same kinds of machines and have already been testing with
> precisely the many tasks workloads he's concerned about for the sake of
> correctness, and efficiency is also a concern here. highpte_chain is
> already so high up on my priority queue that all other work is halted.

OK. But we're adding non-trivial amounts of new code simply
to get the reverse mapping working as robustly as the virtual
scan. And we'll always have rmap's additional storage requirements.

At some point we need to make a decision as to whether it's all
worth it. Right now we do not even have the information on the
pluses side to do this. That's worrisome.

-

2002-07-22 13:41:59

by Rik van Riel

Subject: Re: [PATCH][1/2] return values shrink_dcache_memory etc

On Mon, 22 Jul 2002, Rik van Riel wrote:

> Apart from both of these we'll also need code to garbage collect
> empty page tables so users can't clog up memory by mmaping a page
> every 4 MB ;)

Btw, I've started work on this code already.

Putting the dcache/icache pages on the LRU list in the way
Linus wanted is definately a lower priority thing for me at
this point, especially considering the fact that Ed Tomlinson's
way of having these pages on the LRU seems to work just fine ;)

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-07-22 06:35:49

by Andrew Morton

Subject: Re: [PATCH][1/2] return values shrink_dcache_memory etc

"Martin J. Bligh" wrote:
>
> >> > If we can get something in place which works acceptably on Martin
> >> > Bligh's machines, and we can see that the gains of rmap (whatever
> >> > they are ;)) are worth the as-yet uncoded pains then let's move on.
> >> > But until then, adding new stuff to the VM just makes a `patch -R'
> >> > harder to do.
> >>
> >> I have the same kinds of machines and have already been testing with
> >> precisely the many tasks workloads he's concerned about for the sake of
> >> correctness, and efficiency is also a concern here. highpte_chain is
> >> already so high up on my priority queue that all other work is halted.
> >
> > OK. But we're adding non-trivial amounts of new code simply
> > to get the reverse mapping working as robustly as the virtual
> > scan. And we'll always have rmap's additional storage requirements.
> >
> > At some point we need to make a decision as to whether it's all
> > worth it. Right now we do not even have the information on the
> > pluses side to do this. That's worrisome.
>
> These large NUMA machines should actually be rmap's glory day in the
> sun.

"should be". Sigh. Be nice to see an "is" one day ;)

> Per-node kswapd, being able to free mem pressure on one node
> easily (without cross-node bouncing), breakup of the lru list into
> smaller chunks, etc. These actually fix some of the biggest problems
> that we have right now and are hard to solve in other ways.
>
> The large rmap overheads we still have to kill seem to me to be the
> memory usage and the fork overhead. There's also a certain amount of
> overhead to managing any more data structures, of course. I think we
> know how to kill most of it. I don't think adding highpte_chain is
> the correct thing to do ... seems like adding insult to injury. I'd
> rather see us drive a silver stake through the problem's heart and
> kill it properly ...

Well that would be nice. And by extension, pte-highmem gets a stake
as well.

Do you think that large pages alone would be enough to allow us
to leave pte_chains (and page tables?) in ZONE_NORMAL, or would
shared pagetables also be needed?

Was it purely Oracle which drove pte-highmem, or do you think
that page table and pte_chain consumption could be a problem
on applications which can't/won't use large pages?

-