2008-11-15 09:39:40

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH] mm: evict streaming IO cache first

Hi Andrew,

I think we need this patch at 2.6.28.
Can this thinking get acception?


--------------------------------------------------
From: Rik van Riel <[email protected]>

Gene Heskett reported 2.6.28-rc3 often make unnecessary swap-out
on his system(4GB mem, 2GB swap).
and He has had to do a "swapoff -a; swapon -a" daily to clear the swap.


Actually, When there is a lot of streaming IO (or lite memory pressure workload)
going on, we do not want to scan or evict pages from the working set.
The old VM used to skip any mapped page, but still evict indirect blocks and
other data that is useful to cache.

This patch adds logic to skip scanning the anon lists and
the active file list if most of the file pages are on the
inactive file list (where streaming IO pages live), while
at the lowest scanning priority.

If the system is not doing a lot of streaming IO, eg. the
system is running a database workload, then more often used
file pages will be on the active file list and this logic
is automatically disabled.


IOW, Large server apparently doesn't need this patch. but
desktop or small server need it.


Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Ackted-by: Gene Heskett <[email protected]>
Tested-by: Gene Heskett <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/vmscan.c | 18 ++++++++++++++++--
2 files changed, 17 insertions(+), 2 deletions(-)

Index: b/include/linux/mmzone.h
===================================================================
--- a/include/linux/mmzone.h 2008-11-10 16:10:34.000000000 +0900
+++ b/include/linux/mmzone.h 2008-11-10 16:12:20.000000000 +0900
@@ -453,6 +453,7 @@ static inline int zone_is_oom_locked(con
* queues ("queue_length >> 12") during an aging round.
*/
#define DEF_PRIORITY 12
+#define PRIO_CACHE_ONLY (DEF_PRIORITY+1)

/* Maximum number of zones on a zonelist */
#define MAX_ZONES_PER_ZONELIST (MAX_NUMNODES * MAX_NR_ZONES)
Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c 2008-11-10 16:10:34.000000000 +0900
+++ b/mm/vmscan.c 2008-11-10 16:11:30.000000000 +0900
@@ -1443,6 +1443,20 @@ static unsigned long shrink_zone(int pri
}
}

+ /*
+ * If there is a lot of sequential IO going on, most of the
+ * file pages will be on the inactive file list. We start
+ * out by reclaiming those pages, without putting pressure on
+ * the working set. We only do this if the bulk of the file pages
+ * are not in the working set (on the active file list).
+ */
+ if (priority == PRIO_CACHE_ONLY &&
+ (nr[LRU_INACTIVE_FILE] > nr[LRU_ACTIVE_FILE]))
+ for_each_evictable_lru(l)
+ /* Scan only the inactive_file list. */
+ if (l != LRU_INACTIVE_FILE)
+ nr[l] = 0;
+
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
for_each_evictable_lru(l) {
@@ -1573,7 +1587,7 @@ static unsigned long do_try_to_free_page
}
}

- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for (priority = PRIO_CACHE_ONLY; priority >= 0; priority--) {
sc->nr_scanned = 0;
if (!priority)
disable_swap_token();
@@ -1735,7 +1749,7 @@ loop_again:
for (i = 0; i < pgdat->nr_zones; i++)
temp_priority[i] = DEF_PRIORITY;

- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for (priority = PRIO_CACHE_ONLY; priority >= 0; priority--) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;






2008-11-16 00:59:21

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first

KOSAKI Motohiro wrote:
> Hi Andrew,
>
> I think we need this patch at 2.6.28.

I agree. I thought we would need this patch from right before
the time I wrote it, but we had no good workload to demonstrate
it at the time.

Gene Heskett found that the problem happens on his system and
the patch fixes is.

> Can this thinking get acception?

One of the reasons that I could not find a justification for
the patch is that all the benchmarks that I tried were
unaffected by it. This makes me believe that the risk of
performance regressions is low, while the patch does fix a
real performance bug for Gene's desktop.

--
All rights reversed.

2008-11-16 05:01:17

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first

On Sat, 15 Nov 2008 18:38:59 +0900 (JST) KOSAKI Motohiro <[email protected]> wrote:

> Hi Andrew,
>
> I think we need this patch at 2.6.28.
> Can this thinking get acception?
>
>
> --------------------------------------------------
> From: Rik van Riel <[email protected]>
>
> Gene Heskett reported 2.6.28-rc3 often make unnecessary swap-out
> on his system(4GB mem, 2GB swap).
> and He has had to do a "swapoff -a; swapon -a" daily to clear the swap.
>
>
> Actually, When there is a lot of streaming IO (or lite memory pressure workload)
> going on, we do not want to scan or evict pages from the working set.
> The old VM used to skip any mapped page, but still evict indirect blocks and
> other data that is useful to cache.

Well yes, that was to stop precisely this problem from happening.

> This patch adds logic to skip scanning the anon lists and
> the active file list if most of the file pages are on the
> inactive file list (where streaming IO pages live), while
> at the lowest scanning priority.
>
> If the system is not doing a lot of streaming IO, eg. the
> system is running a database workload, then more often used
> file pages will be on the active file list and this logic
> is automatically disabled.
>
>
> IOW, Large server apparently doesn't need this patch. but
> desktop or small server need it.
>
>
> Signed-off-by: Rik van Riel <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Ackted-by: Gene Heskett <[email protected]>
> Tested-by: Gene Heskett <[email protected]>
> ---
> include/linux/mmzone.h | 1 +
> mm/vmscan.c | 18 ++++++++++++++++--
> 2 files changed, 17 insertions(+), 2 deletions(-)
>
> Index: b/include/linux/mmzone.h
> ===================================================================
> --- a/include/linux/mmzone.h 2008-11-10 16:10:34.000000000 +0900
> +++ b/include/linux/mmzone.h 2008-11-10 16:12:20.000000000 +0900
> @@ -453,6 +453,7 @@ static inline int zone_is_oom_locked(con
> * queues ("queue_length >> 12") during an aging round.
> */
> #define DEF_PRIORITY 12
> +#define PRIO_CACHE_ONLY (DEF_PRIORITY+1)
>
> /* Maximum number of zones on a zonelist */
> #define MAX_ZONES_PER_ZONELIST (MAX_NUMNODES * MAX_NR_ZONES)
> Index: b/mm/vmscan.c
> ===================================================================
> --- a/mm/vmscan.c 2008-11-10 16:10:34.000000000 +0900
> +++ b/mm/vmscan.c 2008-11-10 16:11:30.000000000 +0900
> @@ -1443,6 +1443,20 @@ static unsigned long shrink_zone(int pri
> }
> }
>
> + /*
> + * If there is a lot of sequential IO going on, most of the
> + * file pages will be on the inactive file list. We start
> + * out by reclaiming those pages, without putting pressure on
> + * the working set. We only do this if the bulk of the file pages
> + * are not in the working set (on the active file list).
> + */
> + if (priority == PRIO_CACHE_ONLY &&
> + (nr[LRU_INACTIVE_FILE] > nr[LRU_ACTIVE_FILE]))
> + for_each_evictable_lru(l)
> + /* Scan only the inactive_file list. */
> + if (l != LRU_INACTIVE_FILE)
> + nr[l] = 0;
> +
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> nr[LRU_INACTIVE_FILE]) {
> for_each_evictable_lru(l) {
> @@ -1573,7 +1587,7 @@ static unsigned long do_try_to_free_page
> }
> }
>
> - for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> + for (priority = PRIO_CACHE_ONLY; priority >= 0; priority--) {
> sc->nr_scanned = 0;
> if (!priority)
> disable_swap_token();
> @@ -1735,7 +1749,7 @@ loop_again:
> for (i = 0; i < pgdat->nr_zones; i++)
> temp_priority[i] = DEF_PRIORITY;
>
> - for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> + for (priority = PRIO_CACHE_ONLY; priority >= 0; priority--) {
> int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
> unsigned long lru_pages = 0;

This is a twiddle, and not a terribly good one, IMO.

See, the old reclaim_mapped logic tried to work by *observing* the
behaviour of page reclaim. The above tweak tries to predict a-priori
what will happen, and hence has far less information. This is a key
difference!

The guesses which the above code makes can and will go wrong, I expect.

For a start, it has a sudden transition point at (nr[LRU_INACTIVE_FILE]
== nr[LRU_ACTIVE_FILE). That is a completely arbitrary wet finger in
the air, and when it is crossed, there will be large changes in
behaviour. This cannot be right!

Secondly, the above code is dependent upon the size of the zone. We
scan 1/8192th of the zone's pages in "only reclaim file pages" mode,
then we flip into "scan anon pages as well" mode.

If the size of the zone is less than (SWAP_CLUSTER_MAX * 8192) (1GB
with 4k pages) then this code will _always_ fail to reclaim
SWAP_CLUSTER_MAX pages on that initial pass, and hence will always
decrement `priority' and will always fall into "scan anon pages as
well" mode.

So I suspect this code didn't fix the problem for small zones much at
all.


Really, I think that the old approach of observing the scanner
behaviour (rather than trying to predict it) was better. It has more
information. I see that bits of it are still left there - afaict
zone->prev_priority doesn't do anything any more?

2008-11-16 18:16:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first



On Sat, 15 Nov 2008, Andrew Morton wrote:
>
> Really, I think that the old approach of observing the scanner
> behaviour (rather than trying to predict it) was better.

That's generally true. Self-adjusting behaviour rather than a-priori rules
would be much nicer. However, we apparently need to fix this some way.
Anybody willing to re-introduce some of the old logic?

Linus

2008-11-16 21:20:49

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first

Linus Torvalds wrote:
> On Sat, 15 Nov 2008, Andrew Morton wrote:
>> Really, I think that the old approach of observing the scanner
>> behaviour (rather than trying to predict it) was better.
>
> That's generally true. Self-adjusting behaviour rather than a-priori rules
> would be much nicer. However, we apparently need to fix this some way.
> Anybody willing to re-introduce some of the old logic?

The old behaviour has big problems, especially on large memory
systems. If the old behaviour worked right, we would not have
been working on the split LRU code for the last year and a half.

Due to programs manipulating memory many pages at a time, the
LRU ends up getting mapped and cache pages on the list in bunches.

On large memory systems, after the scanner runs into a bunch
of mapped pages, it will switch to evicting mapped pages, even
if the next bunch of pages turns out to be cache pages.

I am not convinced that "reacting to what happened in the last
1/4096th of the LRU" is any better than "look at the list stats
and decide what to do".

Andrew's objection to how things behave on small memory systems
(the patch does not change anything) is valid, but going back
to the old behaviour does not seem like an option to me, either.

I will take a look at producing smoother self tuning behaviour
in get_scan_ratio(), with logic along these lines:
- the more file pages are inactive, the more eviction should
focus on file pages, because we are not eating away at the
working set yet
- the more file pages are active, the more there needs to be
a balance between file and anon scanning, because we are
starting to get to the working sets for both

I wonder if the "do not do mark_page_accessed at page fault time"
patch is triggering the current troublesome behaviour in the VM,
because actively used file pages are not moved out of the way of
the VM - which leads get_scan_ratio to believe that we are already
hitting the working set on the file side and should also start
scanning the anon LRUs.

--
All rights reversed.

2008-11-16 21:31:23

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first

On Sun, 16 Nov 2008, Rik van Riel wrote:
>
> I wonder if the "do not do mark_page_accessed at page fault time"
> patch is triggering the current troublesome behaviour in the VM,
> because actively used file pages are not moved out of the way of
> the VM - which leads get_scan_ratio to believe that we are already
> hitting the working set on the file side and should also start
> scanning the anon LRUs.

That patch is only in the -mm tree, not in 2.6.28-rc.

Hugh

2008-11-17 04:47:52

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first

On Sun, 16 Nov 2008 16:20:26 -0500 Rik van Riel <[email protected]> wrote:

> Linus Torvalds wrote:
> > On Sat, 15 Nov 2008, Andrew Morton wrote:
> >> Really, I think that the old approach of observing the scanner
> >> behaviour (rather than trying to predict it) was better.
> >
> > That's generally true. Self-adjusting behaviour rather than a-priori rules
> > would be much nicer. However, we apparently need to fix this some way.
> > Anybody willing to re-introduce some of the old logic?
>
> The old behaviour has big problems, especially on large memory
> systems. If the old behaviour worked right, we would not have
> been working on the split LRU code for the last year and a half.

Split LRU is (in this aspect) worse than the old code.

> Due to programs manipulating memory many pages at a time, the
> LRU ends up getting mapped and cache pages on the list in bunches.
>
> On large memory systems, after the scanner runs into a bunch
> of mapped pages, it will switch to evicting mapped pages, even
> if the next bunch of pages turns out to be cache pages.

Sure. But that sounds like theory to me. I've never seen anyone even
vaguely get anywhere near the level of instrumentation and
investigation and testing to be in a position to demonstrate that this
is a problem in practice.

> I am not convinced that "reacting to what happened in the last
> 1/4096th of the LRU" is any better than "look at the list stats
> and decide what to do".

I bet it is. The list stats are aggregated over the entire list and
aren't very useful for predicting the state of the few hundred pages at
the tail of the list.

> Andrew's objection to how things behave on small memory systems
> (the patch does not change anything) is valid, but going back
> to the old behaviour does not seem like an option to me, either.

There's also the behaviour change at the randomly-chosen
(nr[LRU_INACTIVE_FILE] == nr[LRU_ACTIVE_FILE) point..

> I will take a look at producing smoother self tuning behaviour
> in get_scan_ratio(), with logic along these lines:
> - the more file pages are inactive, the more eviction should
> focus on file pages, because we are not eating away at the
> working set yet
> - the more file pages are active, the more there needs to be
> a balance between file and anon scanning, because we are
> starting to get to the working sets for both

hm. I wonder if it would be prohibitive to say "hey, we did the wrong
thing in that scanning pass - rewind and try it again". Probably it
would be.

Anyway, we need to do something.

Shouldn't get_scan_ratio() be handling this case already?

2008-11-17 06:19:48

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first

On Monday 17 November 2008 15:47, Andrew Morton wrote:
> On Sun, 16 Nov 2008 16:20:26 -0500 Rik van Riel <[email protected]> wrote:

> > I will take a look at producing smoother self tuning behaviour
> > in get_scan_ratio(), with logic along these lines:
> > - the more file pages are inactive, the more eviction should
> > focus on file pages, because we are not eating away at the
> > working set yet
> > - the more file pages are active, the more there needs to be
> > a balance between file and anon scanning, because we are
> > starting to get to the working sets for both
>
> hm. I wonder if it would be prohibitive to say "hey, we did the wrong
> thing in that scanning pass - rewind and try it again". Probably it
> would be.
>
> Anyway, we need to do something.
>
> Shouldn't get_scan_ratio() be handling this case already?

I have a patch that was actually for the old vmscan logic that I
found really helps prevent working set get paged out. Actually
the old vmscan behaviour wasn't any good either at preventing
unmapped pagecache from being evicted, which is the main thing I
was trying to fix (eg. keep git tree in cache while doing other
use-once IO).

It ended up working really well, but I suspect it would still fall
over in the case where you were trying to populate your caches with
the git tree *while* the streaming IO is happening (if those pages
don't have a chance to get touched again for a while, they'll look
like use-once IO to the vm)... but still no worse than current
behaviour.

I need to forward port it to the new system, however...

But this would be a pretty big change and I can't see how it could
be appropriate for -rc6. Aren't we allergic to even single-line
changes in vmscan without seemingly multi-year "testing" phases? :)

2008-11-17 06:31:09

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first

On Sun, 16 Nov 2008 20:47:20 -0800
Andrew Morton <[email protected]> wrote:

> On Sun, 16 Nov 2008 16:20:26 -0500 Rik van Riel <[email protected]> wrote:
> Anyway, we need to do something.
>
> Shouldn't get_scan_ratio() be handling this case already?
>
Hmm, could I make a question ?

I think

- recent_rolated[LRU_FILE] is incremented when file cache is moved from
ACTIVE_FILE to INACTIVE_FILE.
- recent_scanned[LRU_FILE] is sum of scanning numbers on INACTIVE/ACTIVE list
of file.
- file caches are added to INACITVE_FILE, at first.
- get_scan_ratio() calculates %file to be

file recent rotated.
%file = IO_cost * ------------ / -------------
anon + file recent scanned.

But when "files are used by streaming or some touch once application",
there is no rotation because they are in INACTIVE FILE at first add_to_lru().
But recent_rotated will not increase while recent_scanned goes bigger and bigger.

Then %file goes to 0 rapidly.

Hmm?

Thanks,
-kame

2008-11-17 06:39:30

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first

2008/11/17 KAMEZAWA Hiroyuki <[email protected]>:
> On Sun, 16 Nov 2008 20:47:20 -0800
> Andrew Morton <[email protected]> wrote:
>
>> On Sun, 16 Nov 2008 16:20:26 -0500 Rik van Riel <[email protected]> wrote:
>> Anyway, we need to do something.
>>
>> Shouldn't get_scan_ratio() be handling this case already?
>>
> Hmm, could I make a question ?
>
> I think
>
> - recent_rolated[LRU_FILE] is incremented when file cache is moved from
> ACTIVE_FILE to INACTIVE_FILE.
> - recent_scanned[LRU_FILE] is sum of scanning numbers on INACTIVE/ACTIVE list
> of file.
> - file caches are added to INACITVE_FILE, at first.
> - get_scan_ratio() calculates %file to be
>
> file recent rotated.
> %file = IO_cost * ------------ / -------------
> anon + file recent scanned.

rewote by div to mul changing.


file recent scanned.
%file = IO_cost * ------------ * -------------
anon + file recent rotated.


> But when "files are used by streaming or some touch once application",
> there is no rotation because they are in INACTIVE FILE at first add_to_lru().
> But recent_rotated will not increase while recent_scanned goes bigger and bigger.

Yup.

> Then %file goes to 0 rapidly.

I think reverse.

The problem is, when streaming access started right after, recent
scanned isn't so much.
then %file don't reach 100%.

then, few anon pages swaped out althouth memory pressure isn't so heavy.

2008-11-17 06:55:20

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first

On Mon, 17 Nov 2008 15:39:20 +0900
"KOSAKI Motohiro" <[email protected]> wrote:

> rewote by div to mul changing.
>
>
> file recent scanned.
> %file = IO_cost * ------------ * -------------
> anon + file recent rotated.
>
>
Ah, sorry.

> > But when "files are used by streaming or some touch once application",
> > there is no rotation because they are in INACTIVE FILE at first add_to_lru().
> > But recent_rotated will not increase while recent_scanned goes bigger and bigger.
>
> Yup.
>
> > Then %file goes to 0 rapidly.
>
> I think reverse.
>
> The problem is, when streaming access started right after, recent
> scanned isn't so much.
> then %file don't reach 100%.
>
> then, few anon pages swaped out althouth memory pressure isn't so heavy.
>
"few" ?
85Mbytes of used swap while 1.2GBytes of free memory in Gene Heskett's report.
Hmm..

How about resetting zone->recent_scanned/rotated to be some value calculated from
INACTIVE_ANON/INACTIVE_FILE at some time (when the system is enough idle) ?

Thanks,
-Kame


2008-11-17 07:03:58

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first

>> then, few anon pages swaped out althouth memory pressure isn't so heavy.
>>
> "few" ?
> 85Mbytes of used swap while 1.2GBytes of free memory in Gene Heskett's report.
> Hmm..

because accumuration of a days.


> How about resetting zone->recent_scanned/rotated to be some value calculated from
> INACTIVE_ANON/INACTIVE_FILE at some time (when the system is enough idle) ?

in get_scan_ratio()

2008-11-17 08:22:57

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first

On Mon, 17 Nov 2008 16:03:48 +0900
"KOSAKI Motohiro" <[email protected]> wrote:
> > How about resetting zone->recent_scanned/rotated to be some value calculated from
> > INACTIVE_ANON/INACTIVE_FILE at some time (when the system is enough idle) ?
>
> in get_scan_ratio()
>
But active/inactive ratio (and mapped_ratio) is not handled there.

Follwoing 2 will return the same scan ratio.
==case 1==
active_anon = 480M
inactive_anon = 32M
active_file = 2M
inactive_file = 510M

==case 2==
active_anon = 480M
inactive_anon = 32M
active_file = 480M
inactive_file = 32M
==



-Kame

2008-11-17 08:32:45

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first

>> > How about resetting zone->recent_scanned/rotated to be some value calculated from
>> > INACTIVE_ANON/INACTIVE_FILE at some time (when the system is enough idle) ?
>>
>> in get_scan_ratio()
>>
> But active/inactive ratio (and mapped_ratio) is not handled there.

Yes.
I think akpm pointed out just this point.


> Follwoing 2 will return the same scan ratio.
> ==case 1==
> active_anon = 480M
> inactive_anon = 32M
> active_file = 2M
> inactive_file = 510M
>
> ==case 2==
> active_anon = 480M
> inactive_anon = 32M
> active_file = 480M
> inactive_file = 32M
> ==

Yes.
the patch handle this situation by special priority.


Umm..
Perhaps, I am missing your point?

2008-11-17 16:22:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first



On Mon, 17 Nov 2008, KAMEZAWA Hiroyuki wrote:
>
> How about resetting zone->recent_scanned/rotated to be some value calculated from
> INACTIVE_ANON/INACTIVE_FILE at some time (when the system is enough idle) ?

.. or how about just considering the act of adding a new page to the LRU
to be a "scan" event? IOW, "scanning" is not necessarily just an act of
the VM looking for pages to free, but would be a more general "activity"
meter.

IOW, when we calculate the percentages of anon-vs-file in get_scan_ratio()
we take into account how much anon-page activity vs how much file cache
activity there has been.

So if we've seen a lot of filesystem activity ("streaming"), we would tend
to prefer to scan the page cache. If we've seen a lot of anon page
mapping, we'd tend to prefer to scan the anon side.

That would seem to be the right kind of thing to do: if we literally have
a load that only does streaming and pages never get moved to the active
LRU, it should basically keep the page cache close to constant size -
which is just another way of saying that we should only be scanning page
cache pages.

Hmm?

Linus

2008-11-17 16:22:57

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first

KAMEZAWA Hiroyuki wrote:
> On Mon, 17 Nov 2008 16:03:48 +0900
> "KOSAKI Motohiro" <[email protected]> wrote:
>>> How about resetting zone->recent_scanned/rotated to be some value calculated from
>>> INACTIVE_ANON/INACTIVE_FILE at some time (when the system is enough idle) ?
>> in get_scan_ratio()
>>
> But active/inactive ratio (and mapped_ratio) is not handled there.
>
> Follwoing 2 will return the same scan ratio.

get_scan_ratio does not look at the sizes of the lists, but
at the ratio between "pages scanned" and "pages rotated".

A page that is moved from the inactive to the active list
is always counted as rotated.

A page that is moved from the active to the inactive list
is counted as rotated if it was mapped and referenced.

> ==case 1==
> active_anon = 480M
> inactive_anon = 32M
> active_file = 2M
> inactive_file = 510M
>
> ==case 2==
> active_anon = 480M
> inactive_anon = 32M
> active_file = 480M
> inactive_file = 32M
> ==
>
>
>
> -Kame
>


--
All rights reversed.

2008-11-17 16:27:32

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first

Linus Torvalds wrote:
>
> On Mon, 17 Nov 2008, KAMEZAWA Hiroyuki wrote:
>> How about resetting zone->recent_scanned/rotated to be some value calculated from
>> INACTIVE_ANON/INACTIVE_FILE at some time (when the system is enough idle) ?
>
> .. or how about just considering the act of adding a new page to the LRU
> to be a "scan" event? IOW, "scanning" is not necessarily just an act of
> the VM looking for pages to free, but would be a more general "activity"
> meter.

That might work.

Adding a new page to the inactive file list would increment
zone->recent_scanned[file].

Adding a new anonymous page to the active anon list could
increment both zone->recent_scanned[anon] and
zone->recent_rotated[anon].

That way adding anonymous memory would move some pressure
to the file side, while doing lots of streaming IO would
result in the same.

> That would seem to be the right kind of thing to do: if we literally have
> a load that only does streaming and pages never get moved to the active
> LRU, it should basically keep the page cache close to constant size -
> which is just another way of saying that we should only be scanning page
> cache pages.

The only thing left at that point is the fact that the
streaming IO puts pressure on the working set in the
page cache, by always decreasing the active file list
too. But that's a different thing entirely and can be
looked at later :)

Your idea looks like it should work. I'll whip up a patch
for Gene Heskett.

--
All rights reversed.

2008-11-17 16:38:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first



On Mon, 17 Nov 2008, Linus Torvalds wrote:
> On Mon, 17 Nov 2008, KAMEZAWA Hiroyuki wrote:
> >
> > How about resetting zone->recent_scanned/rotated to be some value calculated from
> > INACTIVE_ANON/INACTIVE_FILE at some time (when the system is enough idle) ?
>
> .. or how about just considering the act of adding a new page to the LRU
> to be a "scan" event? IOW, "scanning" is not necessarily just an act of
> the VM looking for pages to free, but would be a more general "activity"
> meter.

Another thing strikes me: it looks like the logic in "get_scan_ratio()"
has a tendency to get unbalanced - if we end up deciding that we should
scan a lot of anonymous pages, the scan numbers for anonymous pages will
go up, and we get even _more_ eager to scan those. Of course, "rotate"
events will then make us less likely again, but for streaming loads, you
wouldn't expect to see those at all.

There seems to be another bug there wrt the "aging" - we age anon page
events and file page events independently, which sounds like it would make
the math totally nonsensical. We do that whole

anon / (anon + file)

thing, but since anon and file counts are aged independently, that "math"
is not math, it looks like a totally random number that has no meaning.

So instead of having two independent aging things, if we age one side, we
should age the other. No?

But maybe I'm looking at it wrong. It doesn't seem sensible to me, but
maybe there's some deeper truth in there somewhere that I'm missing..

Linus

2008-11-17 16:54:56

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first

Linus Torvalds wrote:

> Another thing strikes me: it looks like the logic in "get_scan_ratio()"
> has a tendency to get unbalanced - if we end up deciding that we should
> scan a lot of anonymous pages, the scan numbers for anonymous pages will
> go up, and we get even _more_ eager to scan those. Of course, "rotate"
> events will then make us less likely again, but for streaming loads, you
> wouldn't expect to see those at all.

True for streaming loads - if we scan the file list and find
mostly pages from streaming loads, we will become more eager
to scan the file list.

I do not expect streaming loads to ever hit the anon list in
the same way, because anonymous pages start out referenced and
on the active list, which means an anonymous deactivation will
always be counted as a rotate event.

> There seems to be another bug there wrt the "aging" - we age anon page
> events and file page events independently, which sounds like it would make
> the math totally nonsensical. We do that whole
>
> anon / (anon + file)

That's an outdated comment. Andrew had a patch to update that
comment, but it must have gotten lost somewhere. I'll send you
a patch to update it.

If you look at the actual calculation, you'l see that the
scan percentages are keyed off just swappiness and the
rotated/scanned ratios for each page category.

--
All rights reversed.

2008-11-17 17:06:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: evict streaming IO cache first



On Mon, 17 Nov 2008, Rik van Riel wrote:

> Linus Torvalds wrote:
>
> > Another thing strikes me: it looks like the logic in "get_scan_ratio()" has
> > a tendency to get unbalanced - if we end up deciding that we should scan a
> > lot of anonymous pages, the scan numbers for anonymous pages will go up, and
> > we get even _more_ eager to scan those. Of course, "rotate" events will then
> > make us less likely again, but for streaming loads, you wouldn't expect to
> > see those at all.
>
> True for streaming loads - if we scan the file list and find
> mostly pages from streaming loads, we will become more eager
> to scan the file list.

The "count adding as activity" might hide that, but it does seem a big
iffy.

> > There seems to be another bug there wrt the "aging" - we age anon page
> > events and file page events independently, which sounds like it would make
> > the math totally nonsensical. We do that whole
> >
> > anon / (anon + file)
>
> That's an outdated comment. Andrew had a patch to update that
> comment, but it must have gotten lost somewhere. I'll send you
> a patch to update it.
>
> If you look at the actual calculation, you'l see that the
> scan percentages are keyed off just swappiness and the
> rotated/scanned ratios for each page category.

Ok, that makes sense. Yes, as ratios the math looks valid.

Linus

2008-11-17 17:17:18

by Rik van Riel

[permalink] [raw]
Subject: [PATCH] vmscan: fix get_scan_ratio comment

Fix the old comment on the scan ratio calculations.

Signed-off-by: Rik van Riel <[email protected]>
---
mm/vmscan.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6.28-rc5/mm/vmscan.c
===================================================================
--- linux-2.6.28-rc5.orig/mm/vmscan.c 2008-11-16 17:47:13.000000000 -0500
+++ linux-2.6.28-rc5/mm/vmscan.c 2008-11-17 12:05:03.000000000 -0500
@@ -1386,9 +1386,9 @@ static void get_scan_ratio(struct zone *
file_prio = 200 - sc->swappiness;

/*
- * anon recent_rotated[0]
- * %anon = 100 * ----------- / ----------------- * IO cost
- * anon + file rotate_sum
+ * recent_scanned[anon]
+ * %anon = -------------------- * sc->swappiness
+ * recent_rotated[anon]
*/
ap = (anon_prio + 1) * (zone->recent_scanned[0] + 1);
ap /= zone->recent_rotated[0] + 1;


Attachments:
get-scan-ratio-comment.patch (910.00 B)

2008-11-17 17:32:38

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] vmscan: fix get_scan_ratio comment

2008/11/18 Rik van Riel <[email protected]>:
>
> Fix the old comment on the scan ratio calculations.
>
> Signed-off-by: Rik van Riel <[email protected]>
> ---
> mm/vmscan.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> Index: linux-2.6.28-rc5/mm/vmscan.c
> ===================================================================
> --- linux-2.6.28-rc5.orig/mm/vmscan.c 2008-11-16 17:47:13.000000000 -0500
> +++ linux-2.6.28-rc5/mm/vmscan.c 2008-11-17 12:05:03.000000000 -0500
> @@ -1386,9 +1386,9 @@ static void get_scan_ratio(struct zone *
> file_prio = 200 - sc->swappiness;
>
> /*
> - * anon recent_rotated[0]
> - * %anon = 100 * ----------- / ----------------- * IO cost
> - * anon + file rotate_sum
> + * recent_scanned[anon]
> + * %anon = -------------------- * sc->swappiness
> + * recent_rotated[anon]
> */
> ap = (anon_prio + 1) * (zone->recent_scanned[0] + 1);
> ap /= zone->recent_rotated[0] + 1;

looks good to me, absolutely.
Reviewed-by: KOSAKI Motohiro <[email protected]>

2008-11-17 17:36:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] vmscan: fix get_scan_ratio comment



Grr. Don't use attachments. Now I can't sanely reply and quote things
with a sane default setting (yes, I could set "include attachements in
reply", but that is _not_ a sane default setting, and anybody who claims
it is is a moron).

Why don't people learn?

Anyway, without quoting, the thing is - your fix isn't any better. The
more interesting part is how the fractions get combined, and that is
indeed approximately "anon% = anon / (anon + file)".

So you in many ways made the comment worse. It wasn't good before, but
it's still not good, and now it comments on the part that isn't even
interesting (ie it comments the _trivial_ fractional part)

Linus

2008-11-17 18:54:57

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] vmscan: fix get_scan_ratio comment

Linus Torvalds wrote:

> Anyway, without quoting, the thing is - your fix isn't any better. The
> more interesting part is how the fractions get combined, and that is
> indeed approximately "anon% = anon / (anon + file)".

Well, the "anon" and "file" in that calculation are the
scanned/rotated ratios for anon and file pages, not the
sizes of the lists.

> So you in many ways made the comment worse. It wasn't good before, but
> it's still not good, and now it comments on the part that isn't even
> interesting (ie it comments the _trivial_ fractional part)

How about something like the following: ?

/*
* The amount of pressure on anon vs file pages is inversely
* proportional to the fraction of recently scanned pages on
* each list that were recently referenced and in active use.
*/

(I'll mail the next patches out with claws-mail - silly thunderbird)

--
All rights reversed.

2008-11-17 20:24:38

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] vmscan: fix get_scan_ratio comment

Fix the old comment on the scan ratio calculations.

Signed-off-by: Rik van Riel <[email protected]>
---
mm/vmscan.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6.28-rc5/mm/vmscan.c
===================================================================
--- linux-2.6.28-rc5.orig/mm/vmscan.c 2008-11-16 17:47:13.000000000 -0500
+++ linux-2.6.28-rc5/mm/vmscan.c 2008-11-17 15:22:22.000000000 -0500
@@ -1386,9 +1386,9 @@ static void get_scan_ratio(struct zone *
file_prio = 200 - sc->swappiness;

/*
- * anon recent_rotated[0]
- * %anon = 100 * ----------- / ----------------- * IO cost
- * anon + file rotate_sum
+ * The amount of pressure on anon vs file pages is inversely
+ * proportional to the fraction of recently scanned pages on
+ * each list that were recently referenced and in active use.
*/
ap = (anon_prio + 1) * (zone->recent_scanned[0] + 1);
ap /= zone->recent_rotated[0] + 1;

2008-11-18 00:07:06

by Rik van Riel

[permalink] [raw]
Subject: [PATCH] vmscan: evict streaming IO first

Count the insertion of new pages in the statistics used to drive the
pageout scanning code. This should help the kernel quickly evict
streaming file IO.

We count on the fact that new file pages start on the inactive file
LRU and new anonymous pages start on the active anon list. This
means streaming file IO will increment the recent scanned file
statistic, while leaving the recent rotated file statistic alone,
driving pageout scanning to the file LRUs.

Pageout activity does its own list manipulation.

Signed-off-by: Rik van Riel <[email protected]>
---
mm/swap.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

On Mon, 17 Nov 2008 08:22:13 -0800 (PST)
Linus Torvalds <[email protected]> wrote:

> .. or how about just considering the act of adding a new page to the LRU
> to be a "scan" event? IOW, "scanning" is not necessarily just an act of
> the VM looking for pages to free, but would be a more general "activity"
> meter.

Linus, this should implement your idea.

Gene, does this patch resolve the problem for you?

Index: linux-2.6.28-rc5/mm/swap.c
===================================================================
--- linux-2.6.28-rc5.orig/mm/swap.c 2008-11-16 17:47:13.000000000 -0500
+++ linux-2.6.28-rc5/mm/swap.c 2008-11-17 18:58:32.000000000 -0500
@@ -445,6 +445,7 @@ void ____pagevec_lru_add(struct pagevec
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
struct zone *pagezone = page_zone(page);
+ int file;

if (pagezone != zone) {
if (zone)
@@ -456,8 +457,12 @@ void ____pagevec_lru_add(struct pagevec
VM_BUG_ON(PageUnevictable(page));
VM_BUG_ON(PageLRU(page));
SetPageLRU(page);
- if (is_active_lru(lru))
+ file = is_file_lru(lru);
+ zone->recent_scanned[file]++;
+ if (is_active_lru(lru)) {
SetPageActive(page);
+ zone->recent_rotated[file]++;
+ }
add_page_to_lru_list(zone, page, lru);
}
if (zone)

2008-12-01 20:25:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] vmscan: evict streaming IO first

On Mon, 17 Nov 2008 19:06:42 -0500
Rik van Riel <[email protected]> wrote:

> Count the insertion of new pages in the statistics used to drive the
> pageout scanning code. This should help the kernel quickly evict
> streaming file IO.
>
> We count on the fact that new file pages start on the inactive file
> LRU and new anonymous pages start on the active anon list. This
> means streaming file IO will increment the recent scanned file
> statistic, while leaving the recent rotated file statistic alone,
> driving pageout scanning to the file LRUs.
>
> Pageout activity does its own list manipulation.
>
> Signed-off-by: Rik van Riel <[email protected]>
> ---
> mm/swap.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> On Mon, 17 Nov 2008 08:22:13 -0800 (PST)
> Linus Torvalds <[email protected]> wrote:
>
> > .. or how about just considering the act of adding a new page to the LRU
> > to be a "scan" event? IOW, "scanning" is not necessarily just an act of
> > the VM looking for pages to free, but would be a more general "activity"
> > meter.
>
> Linus, this should implement your idea.
>
> Gene, does this patch resolve the problem for you?

Has Gene had a chance to confirm this yet?

> Index: linux-2.6.28-rc5/mm/swap.c
> ===================================================================
> --- linux-2.6.28-rc5.orig/mm/swap.c 2008-11-16 17:47:13.000000000 -0500
> +++ linux-2.6.28-rc5/mm/swap.c 2008-11-17 18:58:32.000000000 -0500
> @@ -445,6 +445,7 @@ void ____pagevec_lru_add(struct pagevec
> for (i = 0; i < pagevec_count(pvec); i++) {
> struct page *page = pvec->pages[i];
> struct zone *pagezone = page_zone(page);
> + int file;
>
> if (pagezone != zone) {
> if (zone)
> @@ -456,8 +457,12 @@ void ____pagevec_lru_add(struct pagevec
> VM_BUG_ON(PageUnevictable(page));
> VM_BUG_ON(PageLRU(page));
> SetPageLRU(page);
> - if (is_active_lru(lru))
> + file = is_file_lru(lru);
> + zone->recent_scanned[file]++;
> + if (is_active_lru(lru)) {
> SetPageActive(page);
> + zone->recent_rotated[file]++;
> + }
> add_page_to_lru_list(zone, page, lru);
> }
> if (zone)

Were you not able to reproduce the problem? It looks like it'd be a
pretty simple test case to set up?