2010-11-01 07:07:04

by KOSAKI Motohiro

Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

Hi

> On ChromiumOS, we do not use swap. When memory is low, the only way to
> free memory is to reclaim pages from the file list. This results in a
> lot of thrashing under low memory conditions. We see the system become
> unresponsive for minutes before it eventually OOMs. We also see very
> slow browser tab switching under low memory. Instead of an unresponsive
> system, we'd really like the kernel to OOM as soon as it starts to
> thrash. If it can't keep the working set in memory, then OOM.
> Losing one of many tabs is a better behaviour for the user than an
> unresponsive system.
>
> This patch creates a new sysctl, min_filelist_kbytes, which disables reclaim
> of file-backed pages when there are less than min_filelist_kbytes worth
> of such pages in the cache. This tunable is handy for low memory systems
> using solid-state storage where interactive response is more important
> than not OOMing.
>
> With this patch and min_filelist_kbytes set to 50000, I see very little
> block layer activity during low memory. The system stays responsive under
> low memory and browser tab switching is fast. Eventually, a process gets
> killed by OOM. Without this patch, the system gets wedged for minutes
> before it eventually OOMs. Below is the vmstat output from my test runs.

I've heard similar requirements from embedded people from time to time; they
also don't use swap. So I don't think this is a hopeless idea, but I'd like
to clarify a few things first.

Yes, a system often has file caches that should not be evicted. Typically,
they are libc, libX11 and some GUI libraries. Traditionally, we would build a
tiny application that linked against these important libraries and called
mlockall() at startup; that technique prevents reclaim. So, Q1: why do you
think this traditional approach is insufficient?

Q2: Above, you used min_filelist_kbytes=50000. How did you decide on that
value? Can other users calculate a proper value for themselves?

In addition, I have two requests. R1: I think a Chromium-specific feature is
hard to accept because it's hard to maintain, but we have a good chance to
solve a generic embedded issue here. Please discuss this with Minchan and/or other embedded
developers. R2: If you want to deal with OOM in combination with this, please consider
combining it with the memcg OOM notifier too. It is the most flexible and powerful OOM
mechanism. Desktop and server people probably never use the bare OOM killer intentionally.

Thanks.



2010-11-01 18:24:37

by Mandeep Singh Baines

Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

KOSAKI Motohiro ([email protected]) wrote:
> Hi
>
> > On ChromiumOS, we do not use swap. When memory is low, the only way to
> > free memory is to reclaim pages from the file list. This results in a
> > lot of thrashing under low memory conditions. We see the system become
> > unresponsive for minutes before it eventually OOMs. We also see very
> > slow browser tab switching under low memory. Instead of an unresponsive
> > system, we'd really like the kernel to OOM as soon as it starts to
> > thrash. If it can't keep the working set in memory, then OOM.
> > Losing one of many tabs is a better behaviour for the user than an
> > unresponsive system.
> >
> > This patch creates a new sysctl, min_filelist_kbytes, which disables reclaim
> > of file-backed pages when there are less than min_filelist_kbytes worth
> > of such pages in the cache. This tunable is handy for low memory systems
> > using solid-state storage where interactive response is more important
> > than not OOMing.
> >
> > With this patch and min_filelist_kbytes set to 50000, I see very little
> > block layer activity during low memory. The system stays responsive under
> > low memory and browser tab switching is fast. Eventually, a process gets
> > killed by OOM. Without this patch, the system gets wedged for minutes
> > before it eventually OOMs. Below is the vmstat output from my test runs.
>
> I've heard similar requirements from embedded people from time to time; they
> also don't use swap. So I don't think this is a hopeless idea, but I'd like
> to clarify a few things first.
>

Swap would be interesting if we could somehow control swap thrashing. Maybe
we could add min_anonlist_kbytes. Just kidding :)

> Yes, a system often has file caches that should not be evicted. Typically,
> they are libc, libX11 and some GUI libraries. Traditionally, we would build a
> tiny application that linked against these important libraries and called
> mlockall() at startup; that technique prevents reclaim. So, Q1: why do you
> think this traditional approach is insufficient?
>

mlock is too coarse-grained. It requires locking the whole file in memory.
The chrome and X binaries are quite large so locking them would waste a lot
of memory. We could lock just the pages that are part of the working set but
that is difficult to do in practice. It's unmaintainable if you do it
statically. If you do it at runtime by mlocking the working set, you're
sort of giving up on mm's active list.

Like akpm, I'm sad that we need this patch. I'd rather the kernel did a better
job of identifying the working set. We did look at ways to do a better
job of keeping the working set in the active list but these were trickier
patches and never quite worked out. This patch is simple and works great.

Under memory pressure, I see the active list get smaller and smaller. It's
getting smaller because we're scanning it faster and faster, causing more
and more page faults, which slows forward progress, resulting in the active
list getting smaller still. One way to approach this might be to make the
scan rate constant and configurable. It doesn't seem right that we scan
memory faster and faster under low memory. For us, we'd rather OOM than
evict pages that are likely to be accessed again, so we'd prefer to make
a conservative estimate as to what belongs in the working set. Other
folks (long computations) might want to reclaim more aggressively.

> Q2: Above, you used min_filelist_kbytes=50000. How did you decide on that
> value? Can other users calculate a proper value for themselves?
>

50M was small enough that we were comfortable with keeping 50M of file pages
in memory and large enough that it is bigger than the working set. I tested
by loading up a bunch of popular web sites in chrome and then observing what
happened when I ran out of memory. With 50M, I saw almost no thrashing and
the system stayed responsive even under low memory. But I wanted to be
conservative since I'm really just guessing.

Other users could calculate their value by doing something similar. Load
up the system (exhaust free memory) with a typical load and then observe
file I/O via vmstat. They can then set min_filelist_kbytes to the value
where they see a tolerable amount of thrashing (page faults, block I/O).
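As a rough illustration, a tiny watcher like the sketch below can show the
major fault rate while tuning (pgmajfault is the usual /proc/vmstat counter;
the helper itself is hypothetical):

/* thrash-watch.c - sample pgmajfault from /proc/vmstat once a second
 * to gauge thrashing while tuning min_filelist_kbytes. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long read_vmstat(const char *key)
{
	char line[128];
	long val = -1;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		char name[64];
		long v;

		if (sscanf(line, "%63s %ld", name, &v) == 2 &&
		    strcmp(name, key) == 0) {
			val = v;
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	long prev = read_vmstat("pgmajfault");

	for (;;) {
		long cur;

		sleep(1);
		cur = read_vmstat("pgmajfault");
		printf("major faults/sec: %ld\n", cur - prev);
		prev = cur;
	}
}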

> In addition, I have two requests. R1: I think a Chromium-specific feature is
> hard to accept because it's hard to maintain, but we have a good chance to
> solve a generic embedded issue here. Please discuss this with Minchan and/or other embedded

I think this feature should be useful to a lot of embedded applications where
OOM is OK, especially web browsing applications where the user is OK with
losing one of many tabs they have open. However, I consider this patch a
stop-gap. I think the real solution is to do a better job of protecting
the active list.

> developers. R2: If you want to deal with OOM in combination with this, please consider
> combining it with the memcg OOM notifier too. It is the most flexible and powerful OOM
> mechanism. Desktop and server people probably never use the bare OOM killer intentionally.
>

Yes, I will definitely look at the OOM notifier. We're currently trying to see
if we can get by with oom_adj. With an OOM notifier you'd have to respond
earlier, so you might OOM more. However, with a notifier you might be able to
take action that prevents OOM altogether.

I see memcg more as an isolation mechanism but I guess you could use it to
isolate the working set from anon browser tab data as Kamezawa suggests.

Regards,
Mandeep

> Thanks.
>
>
>

2010-11-01 18:51:38

by Rik van Riel

Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

On 11/01/2010 02:24 PM, Mandeep Singh Baines wrote:

> Under memory pressure, I see the active list get smaller and smaller. It's
> getting smaller because we're scanning it faster and faster, causing more
> and more page faults, which slows forward progress, resulting in the active
> list getting smaller still. One way to approach this might be to make the
> scan rate constant and configurable. It doesn't seem right that we scan
> memory faster and faster under low memory. For us, we'd rather OOM than
> evict pages that are likely to be accessed again, so we'd prefer to make
> a conservative estimate as to what belongs in the working set. Other
> folks (long computations) might want to reclaim more aggressively.

Have you actually read the code?

The active file list is only ever scanned when it is larger
than the inactive file list.

>> Q2: Above, you used min_filelist_kbytes=50000. How did you decide on that
>> value? Can other users calculate a proper value for themselves?
>>
>
> 50M was small enough that we were comfortable with keeping 50M of file pages
> in memory and large enough that it is bigger than the working set. I tested
> by loading up a bunch of popular web sites in chrome and then observing what
> happened when I ran out of memory. With 50M, I saw almost no thrashing and
> the system stayed responsive even under low memory. But I wanted to be
> conservative since I'm really just guessing.
>
> Other users could calculate their value by doing something similar.

Maybe we can scale this by memory amount?

Say, make sure the total amount of page cache in the system
is at least twice as much as the sum of all the zone->pages_high
watermarks, and refuse to evict page cache if we have less
than that?
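
In rough code, the check might look like this (a sketch with stub types so it
stands alone; the names are illustrative, not actual kernel code):

#include <stdbool.h>

struct zone {
	unsigned long pages_high;	/* high watermark, in pages */
};

/* Refuse to evict page cache while the total page cache is below
 * twice the sum of all zones' high watermarks. */
static bool may_evict_page_cache(const struct zone *zones, int nr_zones,
				 unsigned long total_page_cache)
{
	unsigned long threshold = 0;
	int i;

	for (i = 0; i < nr_zones; i++)
		threshold += zones[i].pages_high;

	return total_page_cache >= 2 * threshold;
}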

This may need to be tunable for a few special use cases,
like HPC and virtual machine hosting nodes, but it may just
do the right thing for everybody else.

Another alternative could be to really slow down the
reclaiming of page cache once we hit this level, so virt
hosts and HPC nodes can still decrease the page cache to
something really small ... but only if it is not being
used.

Andrew, could a hack like the above be "good enough"?

Anybody - does the above hack inspire you to come up with
an even better idea?

2010-11-01 19:43:11

by Mandeep Singh Baines

Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

On Mon, Nov 1, 2010 at 11:50 AM, Rik van Riel <[email protected]> wrote:
> On 11/01/2010 02:24 PM, Mandeep Singh Baines wrote:
>
>> Under memory pressure, I see the active list get smaller and smaller. It's
>> getting smaller because we're scanning it faster and faster, causing more
>> and more page faults, which slows forward progress, resulting in the active
>> list getting smaller still. One way to approach this might be to make the
>> scan rate constant and configurable. It doesn't seem right that we scan
>> memory faster and faster under low memory. For us, we'd rather OOM than
>> evict pages that are likely to be accessed again, so we'd prefer to make
>> a conservative estimate as to what belongs in the working set. Other
>> folks (long computations) might want to reclaim more aggressively.
>
> Have you actually read the code?
>

I have, but really only recently. I consider myself an mm newb, so take any
conclusions I make with a grain of salt.

> The active file list is only ever scanned when it is larger
> than the inactive file list.
>

Yes, this prevents you from reclaiming the active list all at once. But if the
memory pressure doesn't go away, you'll start to reclaim the active list
little by little. First you'll empty the inactive list, and then you'll start
scanning the active list and moving pages from the active list to the
inactive list. The problem is that there is no minimum time limit to how long
a page will sit in the inactive list before it is reclaimed. It just depends
on the scan rate, which does not depend on time.

In my experiments, I saw the active list get smaller and smaller
over time until eventually it was only a few MB at which point the system came
grinding to a halt due to thrashing.

I played around with making the active/inactive ratio configurable. I sent
out a patch for an inactive_file_ratio. So instead of the default 50%, you'd
make the ratio configurable.

inactive_file_ratio = (inactive * 100) / (inactive + active)

I saw less thrashing at 10% but this patch wasn't nearly as effective as
min_filelist_kbytes. I can resend the patch if you think it's interesting.
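
The check was along these lines (a reconstructed sketch of the idea, not the
actual patch; the sysctl name follows the description above):

/* Sketch: only refill the inactive file list when its share of the
 * file pages drops below the configured ratio (default 50%). */
static int sysctl_inactive_file_ratio = 50;	/* percent; 10 thrashed less */

static int inactive_file_list_is_low(unsigned long inactive,
				     unsigned long active)
{
	if (inactive + active == 0)
		return 0;
	return inactive * 100 / (inactive + active) <
	       (unsigned long)sysctl_inactive_file_ratio;
}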

>>> Q2: Above, you used min_filelist_kbytes=50000. How did you decide on that
>>> value? Can other users calculate a proper value for themselves?
>>>
>>
>> 50M was small enough that we were comfortable with keeping 50M of file pages
>> in memory and large enough that it is bigger than the working set. I tested
>> by loading up a bunch of popular web sites in chrome and then observing what
>> happened when I ran out of memory. With 50M, I saw almost no thrashing and
>> the system stayed responsive even under low memory. But I wanted to be
>> conservative since I'm really just guessing.
>>
>> Other users could calculate their value by doing something similar.
>
> Maybe we can scale this by memory amount?
>
> Say, make sure the total amount of page cache in the system
> is at least twice as much as the sum of all the zone->pages_high
> watermarks, and refuse to evict page cache if we have less
> than that?
>
> This may need to be tunable for a few special use cases,
> like HPC and virtual machine hosting nodes, but it may just
> do the right thing for everybody else.
>
> Another alternative could be to really slow down the
> reclaiming of page cache once we hit this level, so virt
> hosts and HPC nodes can still decrease the page cache to
> something really small ... but only if it is not being
> used.
>
> Andrew, could a hack like the above be "good enough"?
>
> Anybody - does the above hack inspire you to come up with
> an even better idea?
>

2010-11-01 23:47:00

by Minchan Kim

Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

On Tue, Nov 2, 2010 at 3:24 AM, Mandeep Singh Baines <[email protected]> wrote:
> KOSAKI Motohiro ([email protected]) wrote:
>> Hi
>>
>> > On ChromiumOS, we do not use swap. When memory is low, the only way to
>> > free memory is to reclaim pages from the file list. This results in a
>> > lot of thrashing under low memory conditions. We see the system become
>> > unresponsive for minutes before it eventually OOMs. We also see very
>> > slow browser tab switching under low memory. Instead of an unresponsive
>> > system, we'd really like the kernel to OOM as soon as it starts to
>> > thrash. If it can't keep the working set in memory, then OOM.
>> > Losing one of many tabs is a better behaviour for the user than an
>> > unresponsive system.
>> >
>> > This patch creates a new sysctl, min_filelist_kbytes, which disables reclaim
>> > of file-backed pages when there are less than min_filelist_kbytes worth
>> > of such pages in the cache. This tunable is handy for low memory systems
>> > using solid-state storage where interactive response is more important
>> > than not OOMing.
>> >
>> > With this patch and min_filelist_kbytes set to 50000, I see very little
>> > block layer activity during low memory. The system stays responsive under
>> > low memory and browser tab switching is fast. Eventually, a process gets
>> > killed by OOM. Without this patch, the system gets wedged for minutes
>> > before it eventually OOMs. Below is the vmstat output from my test runs.
>>
>> I've heard similar requirements from embedded people from time to time; they
>> also don't use swap. So I don't think this is a hopeless idea, but I'd like
>> to clarify a few things first.
>>
>
> Swap would be interesting if we could somehow control swap thrashing. Maybe
> we could add min_anonlist_kbytes. Just kidding :)
>
>> Yes, a system often has file caches that should not be evicted. Typically,
>> they are libc, libX11 and some GUI libraries. Traditionally, we would build a
>> tiny application that linked against these important libraries and called
>> mlockall() at startup; that technique prevents reclaim. So, Q1: why do you
>> think this traditional approach is insufficient?
>>
>
> mlock is too coarse-grained. It requires locking the whole file in memory.
> The chrome and X binaries are quite large so locking them would waste a lot
> of memory. We could lock just the pages that are part of the working set but
> that is difficult to do in practice. It's unmaintainable if you do it
> statically. If you do it at runtime by mlocking the working set, you're
> sort of giving up on mm's active list.
>
> Like akpm, I'm sad that we need this patch. I'd rather the kernel did a better
> job of identifying the working set. We did look at ways to do a better
> job of keeping the working set in the active list but these were trickier
> patches and never quite worked out. This patch is simple and works great.
>
> Under memory pressure, I see the active list get smaller and smaller. It's
> getting smaller because we're scanning it faster and faster, causing more
> and more page faults, which slows forward progress, resulting in the active
> list getting smaller still. One way to approach this might be to make the
> scan rate constant and configurable. It doesn't seem right that we scan
> memory faster and faster under low memory. For us, we'd rather OOM than
> evict pages that are likely to be accessed again, so we'd prefer to make
> a conservative estimate as to what belongs in the working set. Other
> folks (long computations) might want to reclaim more aggressively.
>
>> Q2: Above, you used min_filelist_kbytes=50000. How did you decide on that
>> value? Can other users calculate a proper value for themselves?
>>
>
> 50M was small enough that we were comfortable with keeping 50M of file pages
> in memory and large enough that it is bigger than the working set. I tested
> by loading up a bunch of popular web sites in chrome and then observing what
> happened when I ran out of memory. With 50M, I saw almost no thrashing and
> the system stayed responsive even under low memory. But I wanted to be
> conservative since I'm really just guessing.
>
> Other users could calculate their value by doing something similar. Load
> up the system (exhaust free memory) with a typical load and then observe
> file I/O via vmstat. They can then set min_filelist_kbytes to the value
> where they see a tolerable amount of thrashing (page faults, block I/O).
>
>> In addition, I have two requests. R1: I think a Chromium-specific feature is
>> hard to accept because it's hard to maintain, but we have a good chance to
>> solve a generic embedded issue here. Please discuss this with Minchan and/or other embedded
>
> I think this feature should be useful to a lot of embedded applications where
> OOM is OK, especially web browsing applications where the user is OK with
> losing one of many tabs they have open. However, I consider this patch a
> stop-gap. I think the real solution is to do a better job of protecting
> the active list.
>
>> developers. R2: If you want to deal with OOM in combination with this, please consider
>> combining it with the memcg OOM notifier too. It is the most flexible and powerful OOM
>> mechanism. Desktop and server people probably never use the bare OOM killer intentionally.
>>
>
> Yes, I will definitely look at the OOM notifier. We're currently trying to see
> if we can get by with oom_adj. With an OOM notifier you'd have to respond
> earlier, so you might OOM more. However, with a notifier you might be able to
> take action that prevents OOM altogether.
>
> I see memcg more as an isolation mechanism but I guess you could use it to
> isolate the working set from anon browser tab data as Kamezawa suggests.


I don't think current VM behavior has a problem.
The real problem is that you are using more memory than the system actually has.
Since system memory without swap is low, the VM doesn't have many choices.
It ends up evicting your working set to satisfy user requests. That's a very
natural result for a greedy user.

Rather than an OOM notifier, what we need is a memory notifier.
AFAIR, some years ago KOSAKI tried something similar:
http://lwn.net/Articles/268732/
(I can't remember exactly why KOSAKI abandoned it. AFAIR, the timing of the
signal couldn't meet some requirements; I mean, by the time the user receives
the low-memory signal, it's too late. Maybe there were other reasons for
KOSAKI to drop it.)
Anyway, if system memory is low, your intelligent middleware can
control it much better than the VM.
While we're at it, how about improving it?
Mandeep, do you feel you need this feature?
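
For reference, middleware waiting on such a notifier might look like this
(a sketch assuming the /dev/mem_notify poll interface from the link above;
the details of that proposal may differ):

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	struct pollfd pfd;

	pfd.fd = open("/dev/mem_notify", O_RDONLY);
	if (pfd.fd < 0) {
		perror("open");
		return 1;
	}
	pfd.events = POLLIN;

	/* block until the kernel signals memory pressure */
	if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
		printf("low memory: free caches, close idle tabs, etc.\n");

	close(pfd.fd);
	return 0;
}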



> Regards,
> Mandeep
>
>> Thanks.
>>
>>
>>
>



--
Kind regards,
Minchan Kim

2010-11-02 03:12:00

by Rik van Riel

Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

On 11/01/2010 03:43 PM, Mandeep Singh Baines wrote:

> Yes, this prevents you from reclaiming the active list all at once. But if the
> memory pressure doesn't go away, you'll start to reclaim the active list
> little by little. First you'll empty the inactive list, and then you'll start
> scanning the active list and moving pages from the active list to the
> inactive list. The problem is that there is no minimum time limit to how long
> a page will sit in the inactive list before it is reclaimed. It just depends
> on the scan rate, which does not depend on time.
>
> In my experiments, I saw the active list get smaller and smaller
> over time until eventually it was only a few MB at which point the system came
> grinding to a halt due to thrashing.

I believe that changing the active/inactive ratio has other
potential thrashing issues. Specifically, when the inactive
list is too small, pages may not stick around long enough to
be accessed multiple times and get promoted to the active
list, even when they are in active use.

I prefer a more flexible solution, that automatically does
the right thing.

The problem you see is that the file list gets reclaimed
very quickly, even when it is already very small.

I wonder if a possible solution would be to limit how fast
file pages get reclaimed, when the page cache is very small.
Say, inactive_file * active_file < 2 * zone->pages_high ?

At that point, maybe we could slow down the reclaiming of
page cache pages to be significantly slower than they can
be refilled by the disk. Maybe 100 pages a second - that
can be refilled even by an actual spinning metal disk
without even the use of readahead.

That can be rounded up to one batch of SWAP_CLUSTER_MAX
file pages every 1/4 second, when the number of page cache
pages is very low.
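
A userspace model of that throttle might look like the sketch below (the
kernel version would use jiffies; the constants mirror the numbers above):

/* Sketch: when the page cache is below the threshold, allow at most one
 * batch of SWAP_CLUSTER_MAX file pages per quarter second. */
#include <stdbool.h>
#include <time.h>

#define SWAP_CLUSTER_MAX	32UL

static struct timespec next_allowed;

static bool may_reclaim_file_batch(void)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	if (now.tv_sec < next_allowed.tv_sec ||
	    (now.tv_sec == next_allowed.tv_sec &&
	     now.tv_nsec < next_allowed.tv_nsec))
		return false;	/* still inside the quarter-second window */

	next_allowed = now;
	next_allowed.tv_nsec += 250 * 1000 * 1000;	/* +250 ms */
	if (next_allowed.tv_nsec >= 1000 * 1000 * 1000) {
		next_allowed.tv_sec++;
		next_allowed.tv_nsec -= 1000 * 1000 * 1000;
	}
	return true;	/* caller may reclaim one SWAP_CLUSTER_MAX batch */
}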

This way HPC and virtual machine hosting nodes can still
get rid of totally unused page cache, but on any system
that actually uses page cache, some minimal amount of
cache will be protected under heavy memory pressure.

Does this sound like a reasonable approach?

I realize the threshold may have to be tweaked...

The big question is, how do we integrate this with the
OOM killer? Do we pretend we are out of memory when
we've hit our file cache eviction quota and kill something?

Would there be any downsides to this approach?

Are there any volunteers for implementing this idea?
(Maybe someone who needs the feature?)

--
All rights reversed

2010-11-03 00:48:46

by Minchan Kim

Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

Hi Rik,

On Tue, Nov 2, 2010 at 12:11 PM, Rik van Riel <[email protected]> wrote:
> On 11/01/2010 03:43 PM, Mandeep Singh Baines wrote:
>
>> Yes, this prevents you from reclaiming the active list all at once. But if the
>> memory pressure doesn't go away, you'll start to reclaim the active list
>> little by little. First you'll empty the inactive list, and then you'll start
>> scanning the active list and moving pages from the active list to the
>> inactive list. The problem is that there is no minimum time limit to how long
>> a page will sit in the inactive list before it is reclaimed. It just depends
>> on the scan rate, which does not depend on time.
>>
>> In my experiments, I saw the active list get smaller and smaller
>> over time until eventually it was only a few MB at which point the system
>> came grinding to a halt due to thrashing.
>
> I believe that changing the active/inactive ratio has other
> potential thrashing issues. Specifically, when the inactive
> list is too small, pages may not stick around long enough to
> be accessed multiple times and get promoted to the active
> list, even when they are in active use.
>
> I prefer a more flexible solution, that automatically does
> the right thing.

I agree. Ideally, it's best if we handle this well inside the kernel.

>
> The problem you see is that the file list gets reclaimed
> very quickly, even when it is already very small.
>
> I wonder if a possible solution would be to limit how fast
> file pages get reclaimed, when the page cache is very small.
> Say, inactive_file * active_file < 2 * zone->pages_high ?

Why do you multiply inactive_file by active_file?
What does that mean?

I think it's very difficult to fix _a_ threshold.
At the least, the user has to set it to a proper value to use the feature.
Anyway, we need a default value, and that will take some experiments on
desktop and embedded systems.

>
> At that point, maybe we could slow down the reclaiming of
> page cache pages to be significantly slower than they can
> be refilled by the disk. Maybe 100 pages a second - that
> can be refilled even by an actual spinning metal disk
> without even the use of readahead.
>
> That can be rounded up to one batch of SWAP_CLUSTER_MAX
> file pages every 1/4 second, when the number of page cache
> pages is very low.

How about reducing the scanning window size?
I think that could approximate the idea.

>
> This way HPC and virtual machine hosting nodes can still
> get rid of totally unused page cache, but on any system
> that actually uses page cache, some minimal amount of
> cache will be protected under heavy memory pressure.
>
> Does this sound like a reasonable approach?
>
> I realize the threshold may have to be tweaked...

Absolutely.

>
> The big question is, how do we integrate this with the
> OOM killer? Do we pretend we are out of memory when
> we've hit our file cache eviction quota and kill something?

I think "Yes".
But I think killing isn't best if oom_badness can't select proper victim.
Normally, embedded system doesn't have swap. And it could try to keep
many task in memory due to application startup latency.
It means some tasks never executed during long time and just stay in
memory with consuming the memory.
OOM have to kill it. Anyway it's off topic.

>
> Would there be any downsides to this approach?

My first feeling is a concern about unbalanced aging of anon/file.
But I think it's not a problem; it's the result the user wants. The user wants
to protect file-backed pages (e.g., code pages), so heavy anon swapout is the
natural result of keeping the system going. If the system has no swap, we have
no choice except OOM.

>
> Are there any volunteers for implementing this idea?
> (Maybe someone who needs the feature?)

I made a quick patch for discussion, combining your idea and Mandeep's.
(It just passes the compile test.)


diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7687228..98380ec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -29,6 +29,7 @@ extern unsigned long num_physpages;
extern unsigned long totalram_pages;
extern void * high_memory;
extern int page_cluster;
+extern int min_filelist_kbytes;

#ifdef CONFIG_SYSCTL
extern int sysctl_legacy_va_layout;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3a45c22..c61f0c9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1320,6 +1320,14 @@ static struct ctl_table vm_table[] = {
.extra2 = &one,
},
#endif
+ {
+ .procname = "min_filelist_kbytes",
+ .data = &min_filelist_kbytes,
+ .maxlen = sizeof(min_filelist_kbytes),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .extra1 = &zero,
+ },

/*
* NOTE: do not add new entries to this table unless you have read
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c5dfabf..3b0e95d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -130,6 +130,11 @@ struct scan_control {
int vm_swappiness = 60;
long vm_total_pages; /* The total number of pages which the VM controls */

+/*
+ * Low watermark used to prevent file cache thrashing during low memory.
+ * 20M is an arbitrary value. We need more discussion.
+ */
+int min_filelist_kbytes = 1024 * 20;
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);

@@ -1635,6 +1640,7 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
u64 fraction[2], denominator;
enum lru_list l;
int noswap = 0;
+ int low_pagecache = 0;

/* If we have no swap space, do not bother scanning anon pages. */
if (!sc->may_swap || (nr_swap_pages <= 0)) {
@@ -1651,6 +1657,7 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);

if (scanning_global_lru(sc)) {
+ unsigned long pagecache_threshold;
free = zone_page_state(zone, NR_FREE_PAGES);
/* If we have very few page cache pages,
force-scan anon pages. */
@@ -1660,6 +1667,10 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
denominator = 1;
goto out;
}
+
+ pagecache_threshold = min_filelist_kbytes >> (PAGE_SHIFT - 10);
+ if (file < pagecache_threshold)
+ low_pagecache = 1;
}

/*
@@ -1715,6 +1726,12 @@ out:
if (priority || noswap) {
scan >>= priority;
scan = div64_u64(scan * fraction[file], denominator);
+ /*
+ * If the system has low page cache, we slow down
+ * scanning speed with 1/8 to protect working set.
+ */
+ if (low_pagecache)
+ scan >>= 3;
}
nr[l] = nr_scan_try_batch(scan,
&reclaim_stat->nr_saved_scan[l]);



> --
> All rights reversed
>



--
Kind regards,
Minchan Kim



2010-11-03 02:01:04

by Rik van Riel

Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

On 11/02/2010 08:48 PM, Minchan Kim wrote:

>> I wonder if a possible solution would be to limit how fast
>> file pages get reclaimed, when the page cache is very small.
>> Say, inactive_file * active_file < 2 * zone->pages_high ?
>
> Why do you multiply inactive_file by active_file?
> What does that mean?

That was a stupid typo, it should have been a + :)

> I think it's very difficult to fix _a_ threshold.
> At the least, the user has to set it to a proper value to use the feature.
> Anyway, we need a default value, and that will take some experiments on
> desktop and embedded systems.

Yes, setting a threshold will be difficult. However,
if the behaviour below that threshold is harmless to
pretty much any workload, it doesn't matter a whole
lot where we set it...

>> At that point, maybe we could slow down the reclaiming of
>> page cache pages to be significantly slower than they can
>> be refilled by the disk. Maybe 100 pages a second - that
>> can be refilled even by an actual spinning metal disk
>> without even the use of readahead.
>>
>> That can be rounded up to one batch of SWAP_CLUSTER_MAX
>> file pages every 1/4 second, when the number of page cache
>> pages is very low.
>
> How about reducing the scanning window size?
> I think that could approximate the idea.

A good idea in principle, but if it results in the VM
simply calling the pageout code more often, I suspect
it will not have any effect.

Your patch looks like it would have that effect.

I suspect we will need a time-based approach to really
protect the last bits of page cache in a near-OOM
situation.

>> Would there be any downsides to this approach?
>
> My first feeling is a concern about unbalanced aging of anon/file.
> But I think it's not a problem; it's the result the user wants. The user wants
> to protect file-backed pages (e.g., code pages), so heavy anon swapout is the
> natural result of keeping the system going. If the system has no swap, we have
> no choice except OOM.

We already have imbalances in the aging of anon and file
pages, several of which are introduced on purpose.

In this proposal, there would only be an imbalance
if the number of file pages is really low.

--
All rights reversed

2010-11-03 03:03:21

by Minchan Kim

Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

On Wed, Nov 3, 2010 at 11:00 AM, Rik van Riel <[email protected]> wrote:
> On 11/02/2010 08:48 PM, Minchan Kim wrote:
>
>>> I wonder if a possible solution would be to limit how fast
>>> file pages get reclaimed, when the page cache is very small.
>>> Say, inactive_file * active_file < 2 * zone->pages_high ?
>>
>> Why do you multiply inactive_file by active_file?
>> What does that mean?
>
> That was a stupid typo, it should have been a + :)
>
>> I think it's very difficult to fix _a_ threshold.
>> At the least, the user has to set it to a proper value to use the feature.
>> Anyway, we need a default value, and that will take some experiments on
>> desktop and embedded systems.
>
> Yes, setting a threshold will be difficult. However,
> if the behaviour below that threshold is harmless to
> pretty much any workload, it doesn't matter a whole
> lot where we set it...

Okay. But I doubt we can come up with a default value that is effective
when we really need the function. Maybe whenever a user uses the feature,
they'll have to tweak the knob.

>
>>> At that point, maybe we could slow down the reclaiming of
>>> page cache pages to be significantly slower than they can
>>> be refilled by the disk. Maybe 100 pages a second - that
>>> can be refilled even by an actual spinning metal disk
>>> without even the use of readahead.
>>>
>>> That can be rounded up to one batch of SWAP_CLUSTER_MAX
>>> file pages every 1/4 second, when the number of page cache
>>> pages is very low.
>>
>> How about reducing the scanning window size?
>> I think that could approximate the idea.
>
> A good idea in principle, but if it results in the VM
> simply calling the pageout code more often, I suspect
> it will not have any effect.
>
> Your patch looks like it would have that effect.


It could.
But a time-based approach would be the same, IMHO.
First of all, I don't want long latency in the direct reclaim path;
it directly affects the responsiveness of foreground processes.

If the VM limits the number of pages reclaimed per second, direct reclaim
latency will be affected, so we should avoid throttling in the direct
reclaim path. Agree?

So if we slow down page reclaim in kswapd, processes will enter direct
reclaim, and the result is the VM simply calling the pageout code more often.

If I've misunderstood how you'd implement your idea, please let me know.

>
> I suspect we will need a time-based approach to really
> protect the last bits of page cache in a near-OOM
> situation.
>
>>> Would there be any downsides to this approach?
>>
>> My first feeling is a concern about unbalanced aging of anon/file.
>> But I think it's not a problem; it's the result the user wants. The user wants
>> to protect file-backed pages (e.g., code pages), so heavy anon swapout is the
>> natural result of keeping the system going. If the system has no swap, we have
>> no choice except OOM.
>
> We already have imbalances in the aging of anon and file
> pages, several of which are introduced on purpose.
>
> In this proposal, there would only be an imbalance
> if the number of file pages is really low.

Right.

>
> --
> All rights reversed
>



--
Kind regards,
Minchan Kim

2010-11-03 11:42:29

by Rik van Riel

Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

On 11/02/2010 11:03 PM, Minchan Kim wrote:

> It could.
> But a time-based approach would be the same, IMHO.
> First of all, I don't want long latency in the direct reclaim path;
> it directly affects the responsiveness of foreground processes.
>
> If the VM limits the number of pages reclaimed per second, direct reclaim
> latency will be affected, so we should avoid throttling in the direct
> reclaim path. Agree?

The idea would be to not throttle the processes trying to
reclaim page cache pages, but to only reclaim anonymous
pages when the page cache pages are low (and occasionally
a few page cache pages, say 128 a second).

If too many reclaimers come in when the page cache is
low and no swap is available, we will OOM kill instead
of stalling.

After all, the entire point of this patch would be to
avoid minutes-long latencies in triggering the OOM
killer.

--
All rights reversed

2010-11-03 15:43:03

by Minchan Kim

Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

On Wed, Nov 03, 2010 at 07:41:35AM -0400, Rik van Riel wrote:
> On 11/02/2010 11:03 PM, Minchan Kim wrote:
>
> > It could.
> > But a time-based approach would be the same, IMHO.
> > First of all, I don't want long latency in the direct reclaim path;
> > it directly affects the responsiveness of foreground processes.
> >
> > If the VM limits the number of pages reclaimed per second, direct reclaim
> > latency will be affected, so we should avoid throttling in the direct
> > reclaim path. Agree?
>
> The idea would be to not throttle the processes trying to
> reclaim page cache pages, but to only reclaim anonymous
> pages when the page cache pages are low (and occasionally
> a few page cache pages, say 128 a second).

Fair enough. Reclaiming only anon is better than thrashing code pages.

>
> If too many reclaimers come in when the page cache is
> low and no swap is available, we will OOM kill instead
> of stalling.

I understand why you use (file < pages_min).
We can keep the threshold a small value. Otherwise, we will see many OOM
questions like "Why does OOM happen although my system has enough
file LRU pages?"

>
> After all, the entire point of this patch would be to
> avoid minutes-long latencies in triggering the OOM
> killer.

I got your point. The patch's goal is not to fully protect the working set,
but to prevent page cache thrashing when the file LRU is low; that thrashing
is what causes minutes-long latencies before reaching OOM.

Okay. I will look into this idea.
Thanks for the good suggestion, Rik.

>
> --
> All rights reversed

--
Kind regards,
Minchan Kim

2010-11-03 22:41:11

by Mandeep Singh Baines

Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

Rik van Riel ([email protected]) wrote:
> On 11/01/2010 03:43 PM, Mandeep Singh Baines wrote:
>
> > Yes, this prevents you from reclaiming the active list all at once. But if the
> > memory pressure doesn't go away, you'll start to reclaim the active list
> > little by little. First you'll empty the inactive list, and then you'll start
> > scanning the active list and moving pages from the active list to the
> > inactive list. The problem is that there is no minimum time limit to how long
> > a page will sit in the inactive list before it is reclaimed. It just depends
> > on the scan rate, which does not depend on time.
> >
> > In my experiments, I saw the active list get smaller and smaller
> > over time until eventually it was only a few MB at which point the system
> > came grinding to a halt due to thrashing.
>
> I believe that changing the active/inactive ratio has other
> potential thrashing issues. Specifically, when the inactive
> list is too small, pages may not stick around long enough to
> be accessed multiple times and get promoted to the active
> list, even when they are in active use.
>
> I prefer a more flexible solution, that automatically does
> the right thing.
>
> The problem you see is that the file list gets reclaimed
> very quickly, even when it is already very small.
>
> I wonder if a possible solution would be to limit how fast
> file pages get reclaimed, when the page cache is very small.
> Say, inactive_file * active_file < 2 * zone->pages_high ?
>
> At that point, maybe we could slow down the reclaiming of
> page cache pages to be significantly slower than they can
> be refilled by the disk. Maybe 100 pages a second - that
> can be refilled even by an actual spinning metal disk
> without even the use of readahead.
>
> That can be rounded up to one batch of SWAP_CLUSTER_MAX
> file pages every 1/4 second, when the number of page cache
> pages is very low.
>
> This way HPC and virtual machine hosting nodes can still
> get rid of totally unused page cache, but on any system
> that actually uses page cache, some minimal amount of
> cache will be protected under heavy memory pressure.
>
> Does this sound like a reasonable approach?
>
> I realize the threshold may have to be tweaked...
>
> The big question is, how do we integrate this with the
> OOM killer? Do we pretend we are out of memory when
> we've hit our file cache eviction quota and kill something?
>
> Would there be any downsides to this approach?
>
> Are there any volunteers for implementing this idea?
> (Maybe someone who needs the feature?)
>

I've created a patch which takes a slightly different approach.
Instead of limiting how fast pages get reclaimed, the patch limits
how fast the active list gets scanned. This should result in the
active list being a better measure of the working set. I've seen
fairly good results with this patch and a scan interval of 1
centisecond. I see no thrashing when the scan interval is non-zero.

I've made it a tunable because I don't know what to set the scan
interval to. The final patch could set the value based on HZ and some
other system parameters. Maybe relate it to sched_period?

---

[PATCH] vmscan: add a configurable scan interval

On ChromiumOS, we see a lot of thrashing under low memory. We do not
use swap, so the mm system can only free file-backed pages. Eventually,
we are left with few file-backed pages remaining (a few MB) and the
system becomes unresponsive due to thrashing.

Our preference is for the system to OOM instead of becoming unresponsive.

This patch creates a tunable, vmscan_interval_centisecs, for controlling
the minimum interval between active list scans. At 0, I see the same
thrashing. At 1, I see no thrashing. The mm system does a good job
of protecting the working set. If a page has been referenced in the
last vmscan_interval_centisecs it is kept in memory.

Signed-off-by: Mandeep Singh Baines <[email protected]>
---
include/linux/mm.h | 2 ++
include/linux/mmzone.h | 9 +++++++++
kernel/sysctl.c | 7 +++++++
mm/page_alloc.c | 2 ++
mm/vmscan.c | 21 +++++++++++++++++++--
5 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 721f451..af058f6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -36,6 +36,8 @@ extern int sysctl_legacy_va_layout;
#define sysctl_legacy_va_layout 0
#endif

+extern unsigned int vmscan_interval;
+
#include <asm/page.h>
#include <asm/pgtable.h>
#include <asm/processor.h>
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 39c24eb..6c4b6e1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -415,6 +415,15 @@ struct zone {
unsigned long present_pages; /* amount of memory (excluding holes) */

+	/*
+	 * To avoid over-scanning, we store the time of the last
+	 * scan (in jiffies).
+	 *
+	 * The anon LRU stats live in [0], file LRU stats in [1]
+	 */
+
+	unsigned long last_scan[2];
+
+ /*
* rarely used fields:
*/
const char *name;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c33a1ed..c34251d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1318,6 +1318,13 @@ static struct ctl_table vm_table[] = {
.extra2 = &one,
},
#endif
+ {
+ .procname = "scan_interval_centisecs",
+ .data = &vmscan_interval,
+ .maxlen = sizeof(vmscan_interval),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },

/*
* NOTE: do not add new entries to this table unless you have read
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07a6544..46991d2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -51,6 +51,7 @@
#include <linux/kmemleak.h>
#include <linux/memory.h>
#include <linux/compaction.h>
+#include <linux/jiffies.h>
#include <trace/events/kmem.h>
#include <linux/ftrace_event.h>

@@ -4150,6 +4151,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
BUG_ON(ret);
memmap_init(size, nid, j, zone_start_pfn);
zone_start_pfn += size;
+ zone->last_scan[0] = zone->last_scan[1] = jiffies;
}
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b8a6fdc..be45b91 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -40,6 +40,7 @@
#include <linux/memcontrol.h>
#include <linux/delayacct.h>
#include <linux/sysctl.h>
+#include <linux/jiffies.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -136,6 +137,11 @@ struct scan_control {
int vm_swappiness = 60;
long vm_total_pages; /* The total number of pages which the VM controls */

+/*
+ * Minimum interval between active list scans.
+ */
+unsigned int vmscan_interval = 0;
+
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);

@@ -1659,14 +1665,25 @@ static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
return inactive_anon_is_low(zone, sc);
}

+static int list_scanned_recently(struct zone *zone, int file)
+{
+ unsigned long now = jiffies;
+ unsigned long delta = vmscan_interval * HZ / 100;
+
+ return time_after(zone->last_scan[file] + delta, now);
+}
+
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
struct zone *zone, struct scan_control *sc, int priority)
{
int file = is_file_lru(lru);

if (is_active_lru(lru)) {
- if (inactive_list_is_low(zone, sc, file))
- shrink_active_list(nr_to_scan, zone, sc, priority, file);
+ if (inactive_list_is_low(zone, sc, file) &&
+ !list_scanned_recently(zone, file)) {
+ shrink_active_list(nr_to_scan, zone, sc, priority, file);
+ zone->last_scan[file] = jiffies;
+ }
return 0;
}

--
1.7.3.1

2010-11-03 23:49:28

by Minchan Kim

Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

Hello.

On Thu, Nov 4, 2010 at 7:40 AM, Mandeep Singh Baines <[email protected]> wrote:
> Rik van Riel ([email protected]) wrote:
>> On 11/01/2010 03:43 PM, Mandeep Singh Baines wrote:
>>
>> > Yes, this prevents you from reclaiming the active list all at once. But if the
>> > memory pressure doesn't go away, you'll start to reclaim the active list
>> > little by little. First you'll empty the inactive list, and then you'll start
>> > scanning the active list and moving pages from the active list to the
>> > inactive list. The problem is that there is no minimum time limit to how long
>> > a page will sit in the inactive list before it is reclaimed. It just depends
>> > on the scan rate, which does not depend on time.
>> >
>> > In my experiments, I saw the active list get smaller and smaller
>> > over time until eventually it was only a few MB at which point the system
>> > came grinding to a halt due to thrashing.
>>
>> I believe that changing the active/inactive ratio has other
>> potential thrashing issues. Specifically, when the inactive
>> list is too small, pages may not stick around long enough to
>> be accessed multiple times and get promoted to the active
>> list, even when they are in active use.
>>
>> I prefer a more flexible solution, that automatically does
>> the right thing.
>>
>> The problem you see is that the file list gets reclaimed
>> very quickly, even when it is already very small.
>>
>> I wonder if a possible solution would be to limit how fast
>> file pages get reclaimed, when the page cache is very small.
>> Say, inactive_file * active_file < 2 * zone->pages_high ?
>>
>> At that point, maybe we could slow down the reclaiming of
>> page cache pages to be significantly slower than they can
>> be refilled by the disk. Maybe 100 pages a second - that
>> can be refilled even by an actual spinning metal disk
>> without even the use of readahead.
>>
>> That can be rounded up to one batch of SWAP_CLUSTER_MAX
>> file pages every 1/4 second, when the number of page cache
>> pages is very low.
>>
>> This way HPC and virtual machine hosting nodes can still
>> get rid of totally unused page cache, but on any system
>> that actually uses page cache, some minimal amount of
>> cache will be protected under heavy memory pressure.
>>
>> Does this sound like a reasonable approach?
>>
>> I realize the threshold may have to be tweaked...
>>
>> The big question is, how do we integrate this with the
>> OOM killer? Do we pretend we are out of memory when
>> we've hit our file cache eviction quota and kill something?
>>
>> Would there be any downsides to this approach?
>>
>> Are there any volunteers for implementing this idea?
>> (Maybe someone who needs the feature?)
>>
>
> I've created a patch which takes a slightly different approach.
> Instead of limiting how fast pages get reclaimed, the patch limits
> how fast the active list gets scanned. This should result in the
> active list being a better measure of the working set. I've seen
> fairly good results with this patch and a scan interval of 1
> centisecond. I see no thrashing when the scan interval is non-zero.
>
> I've made it a tunable because I don't know what to set the scan
> interval to. The final patch could set the value based on HZ and some
> other system parameters. Maybe relate it to sched_period?
>
> ---
>
> [PATCH] vmscan: add a configurable scan interval
>
> On ChromiumOS, we see a lot of thrashing under low memory. We do not
> use swap, so the mm system can only free file-backed pages. Eventually,
> we are left with few file-backed pages remaining (a few MB) and the
> system becomes unresponsive due to thrashing.
>
> Our preference is for the system to OOM instead of becoming unresponsive.
>
> This patch creates a tunable, vmscan_interval_centisecs, for controlling
> the minimum interval between active list scans. At 0, I see the same
> thrashing. At 1, I see no thrashing. The mm system does a good job
> of protecting the working set. If a page has been referenced in the
> last vmscan_interval_centisecs it is kept in memory.
>
> Signed-off-by: Mandeep Singh Baines <[email protected]>

vmscan already uses HZ/10 to calm down writeback congestion and the like.
(But I don't know why the VM used that value or who determined it by what
rationale; it might be a value determined by experiment.)
If there isn't any good math, we will depend on experiment this time, too.

Anyway, if the interval is long, it could make the inactive list very short
under many reclaim workloads and then cause unnecessary OOM kills.
So I hope that if the inactive list is very small compared to the active list,
we quit the check and refill the inactive list.
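
Something along these lines, perhaps (a sketch; the 1:4 ratio is an
arbitrary assumption):

/* Sketch: skip the interval check when the inactive list has shrunk
 * well below the active list, so a refill scan is always allowed. */
static int list_scanned_recently(unsigned long last_scan, /* jiffies */
				 unsigned long now,
				 unsigned long delta,
				 unsigned long inactive,
				 unsigned long active)
{
	if (inactive * 4 < active)
		return 0;	/* inactive list depleted: allow a refill scan */
	return (long)(last_scan + delta - now) > 0;	/* time_after() */
}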

Anyway, the approach makes sense to me,
but we need other people's opinions.

Nitpick: I expect you will include a description of the knob in
Documentation/sysctl/vm.txt in your formal patch.

--
Kind regards,
Minchan Kim

2010-11-04 01:53:09

by Mandeep Singh Baines

Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

Minchan Kim ([email protected]) wrote:
> On Tue, Nov 2, 2010 at 3:24 AM, Mandeep Singh Baines <[email protected]> wrote:
> > KOSAKI Motohiro ([email protected]) wrote:
> >> Hi
> >>
> >> > On ChromiumOS, we do not use swap. When memory is low, the only way to
> >> > free memory is to reclaim pages from the file list. This results in a
> >> > lot of thrashing under low memory conditions. We see the system become
> >> > unresponsive for minutes before it eventually OOMs. We also see very
> >> > slow browser tab switching under low memory. Instead of an unresponsive
> >> > system, we'd really like the kernel to OOM as soon as it starts to
> >> > thrash. If it can't keep the working set in memory, then OOM.
> >> > Losing one of many tabs is a better behaviour for the user than an
> >> > unresponsive system.
> >> >
> >> > This patch creates a new sysctl, min_filelist_kbytes, which disables reclaim
> >> > of file-backed pages when there are less than min_filelist_kbytes worth
> >> > of such pages in the cache. This tunable is handy for low memory systems
> >> > using solid-state storage where interactive response is more important
> >> > than not OOMing.
> >> >
> >> > With this patch and min_filelist_kbytes set to 50000, I see very little
> >> > block layer activity during low memory. The system stays responsive under
> >> > low memory and browser tab switching is fast. Eventually, a process gets
> >> > killed by OOM. Without this patch, the system gets wedged for minutes
> >> > before it eventually OOMs. Below is the vmstat output from my test runs.
> >>
> >> I've heard similar requirements from embedded people from time to time; they
> >> also don't use swap. So I don't think this is a hopeless idea, but I'd like
> >> to clarify a few things first.
> >>
> >
> > Swap would be interesting if we could somehow control swap thrashing. Maybe
> > we could add min_anonlist_kbytes. Just kidding :)
> >
> >> Yes, a system often has file caches that should not be evicted. Typically,
> >> they are libc, libX11 and some GUI libraries. Traditionally, we would build a
> >> tiny application that linked against these important libraries and called
> >> mlockall() at startup; that technique prevents reclaim. So, Q1: why do you
> >> think this traditional approach is insufficient?
> >>
> >
> > mlock is too coarse-grained. It requires locking the whole file in memory.
> > The chrome and X binaries are quite large so locking them would waste a lot
> > of memory. We could lock just the pages that are part of the working set but
> > that is difficult to do in practice. It's unmaintainable if you do it
> > statically. If you do it at runtime by mlocking the working set, you're
> > sort of giving up on mm's active list.
> >
> > Like akpm, I'm sad that we need this patch. I'd rather the kernel did a better
> > job of identifying the working set. We did look at ways to do a better
> > job of keeping the working set in the active list but these were trickier
> > patches and never quite worked out. This patch is simple and works great.
> >
> > Under memory pressure, I see the active list get smaller and smaller. It's
> > getting smaller because we're scanning it faster and faster, causing more
> > and more page faults, which slows forward progress, resulting in the active
> > list getting smaller still. One way to approach this might be to make the
> > scan rate constant and configurable. It doesn't seem right that we scan
> > memory faster and faster under low memory. For us, we'd rather OOM than
> > evict pages that are likely to be accessed again, so we'd prefer to make
> > a conservative estimate as to what belongs in the working set. Other
> > folks (long computations) might want to reclaim more aggressively.
> >
> >> Q2: Above, you used min_filelist_kbytes=50000. How did you decide on that
> >> value? Can other users calculate a proper value for themselves?
> >>
> >
> > 50M was small enough that we were comfortable with keeping 50M of file pages
> > in memory and large enough that it is bigger than the working set. I tested
> > by loading up a bunch of popular web sites in chrome and then observing what
> > happened when I ran out of memory. With 50M, I saw almost no thrashing and
> > the system stayed responsive even under low memory. But I wanted to be
> > conservative since I'm really just guessing.
> >
> > Other users could calculate their value by doing something similar. Load
> > up the system (exhaust free memory) with a typical load and then observe
> > file I/O via vmstat. They can then set min_filelist_kbytes to the value
> > where they see a tolerable amount of thrashing (page faults, block I/O).
> >
> >> In addition, I have two requests. R1: I think a Chromium-specific feature is
> >> hard to accept because it's hard to maintain, but we have a good chance to
> >> solve a generic embedded issue here. Please discuss this with Minchan and/or other embedded
> >
> > I think this feature should be useful to a lot of embedded applications where
> > OOM is OK, especially web browsing applications where the user is OK with
> > losing one of many tabs they have open. However, I consider this patch a
> > stop-gap. I think the real solution is to do a better job of protecting
> > the active list.
> >
> >> developers. R2: If you want to deal with OOM in combination with this, please consider
> >> combining it with the memcg OOM notifier too. It is the most flexible and powerful OOM
> >> mechanism. Desktop and server people probably never use the bare OOM killer intentionally.
> >>
> >
> > Yes, I will definitely look at the OOM notifier. We're currently trying to see
> > if we can get by with oom_adj. With an OOM notifier you'd have to respond
> > earlier, so you might OOM more. However, with a notifier you might be able to
> > take action that prevents OOM altogether.
> >
> > I see memcg more as an isolation mechanism but I guess you could use it to
> > isolate the working set from anon browser tab data as Kamezawa suggests.
>
>
> I don't think current VM behavior has a problem.
> The problem is that you use up more memory than you physically have.
> As system memory without swap is low, the VM doesn't have many choices.
> It ends up evicting your working set to meet the user's requests. That's
> a very natural result for a greedy user.
>
> Rather than an OOM notifier, what we need is a memory notifier.
> AFAIR, some years ago KOSAKI tried a similar thing:
> http://lwn.net/Articles/268732/

Thanks! This is perfect. I wonder why it's not merged. Was a different
solution eventually implemented? Is there another way of doing the
same thing?

> (I can't remember exactly why KOSAKI gave it up. AFAIR, the signal timing
> couldn't meet some requirements; I mean, by the time the user receives the
> low-memory signal, it's too late. Maybe there were other reasons for KOSAKI
> to give it up.)
> Anyway, if system memory is low, your intelligent middleware can
> control it much better than the VM can.

Agree.

> While we have the chance, how about improving it?
> Mandeep, do you feel you need this feature?
>

mem_notify seems perfect.
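For our use, a client could be very simple. A sketch, assuming the polled
/dev/mem_notify device the LWN article describes:

    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            struct pollfd pfd;

            pfd.fd = open("/dev/mem_notify", O_RDONLY);
            if (pfd.fd < 0)
                    return 1;
            pfd.events = POLLIN;

            for (;;) {
                    /* poll() returns when the kernel signals low memory. */
                    if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
                            fprintf(stderr, "low memory: shed some load\n");
            }
    }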

>
>
> > Regards,
> > Mandeep
> >
> >> Thanks.
> >>
> >>
> >>
> >
>
>
>
> --
> Kind regards,
> Minchan Kim

2010-11-04 15:31:46

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

On 11/03/2010 06:40 PM, Mandeep Singh Baines wrote:

> I've created a patch which takes a slightly different approach.
> Instead of limiting how fast pages get reclaimed, the patch limits
> how fast the active list gets scanned. This should result in the
> active list being a better measure of the working set. I've seen
> fairly good results with this patch and a scan interval of 1
> centisecond. I see no thrashing when the scan interval is non-zero.
>
> I've made it a tunable because I don't know what to set the scan
> interval to. The final patch could set the value based on HZ and some
> other system parameters. Maybe relate it to sched_period?

I like your approach. For file pages it looks like it
could work fine, since new pages always start on the
inactive file list.

However, for anonymous pages I could see your patch
leading to problems, because all anonymous pages start
on the active list. With a scan interval of 1
centisecond, that means there would be a limit of 3200
pages, or about 12MB of anonymous memory, that can be
moved to the inactive list per second.
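Spelling that out (assuming SWAP_CLUSTER_MAX = 32 and 4KB pages):

    100 scans/sec (one per centisecond) * 32 pages/scan = 3200 pages/sec
    3200 pages/sec * 4KB/page = 12800 KB/sec, i.e. roughly 12MB/sec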

I have seen systems with single SATA disks push out
several times that to swap per second, which matters
when someone starts up a program that is just too big
to fit in memory and requires that something is pushed
out.

That would reduce the size of the inactive list to
zero, reducing our page replacement to a slow FIFO
at best, causing false OOM kills at worst.

Staying with a default of 0 would of course not do
anything, which would make merging the code not too
useful.

I believe we absolutely need to preserve the ability
to evict pages quickly, when new pages are brought
into memory or allocated quickly.

However, speed limits are probably a very good idea
once a cache has been reduced to a smaller size, or
when most IO bypasses the reclaim-speed-limited cache.

--
All rights reversed

2010-11-05 02:36:08

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

On Thu, Nov 4, 2010 at 10:52 AM, Mandeep Singh Baines <[email protected]> wrote:
> Minchan Kim ([email protected]) wrote:
>> On Tue, Nov 2, 2010 at 3:24 AM, Mandeep Singh Baines <[email protected]> wrote:
>> > I see memcg more as an isolation mechanism but I guess you could use it to
>> > isolate the working set from anon browser tab data as Kamezawa suggests.
>>
>>
>> I don't think current VM behavior has a problem.
>> The problem is that you use up more memory than you physically have.
>> As system memory without swap is low, the VM doesn't have many choices.
>> It ends up evicting your working set to meet the user's requests. That's
>> a very natural result for a greedy user.
>>
>> Rather than an OOM notifier, what we need is a memory notifier.
>> AFAIR, some years ago KOSAKI tried a similar thing:
>> http://lwn.net/Articles/268732/
>
> Thanks! This is perfect. I wonder why it's not merged. Was a different
> solution eventually implemented? Is there another way of doing the
> same thing?

If I remember right, there was a timing issue: by the time the application
was notified, it was too late to handle it. Maybe KOSAKI can explain the
problem in more detail.

I think we need some leveling mechanism.
For example, the user could set limits at 30M, 20M, 10M, and 5M.

If free memory falls below 30M, the master application can ask
background sleeping applications to free extra memory.
If free memory falls below 20M, the master application can exit
background sleeping applications.
If free memory falls below 10M, the master application can kill
non-critical applications.
If free memory falls below 5M, the master application can ask
critical applications to free memory.

I think this mechanism would be useful for memcg, too.
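A rough sketch of that dispatch, with the thresholds hard-coded and
placeholder actions standing in for whatever the master application would
really do:

    #include <stdio.h>

    /* Thresholds in KB, harshest action first; the strings are
     * placeholders for the real per-level responses. */
    struct level {
            unsigned long below_kb;
            const char *action;
    };

    static const struct level levels[] = {
            {  5 * 1024, "ask critical apps to free memory" },
            { 10 * 1024, "kill non-critical apps" },
            { 20 * 1024, "exit background sleeping apps" },
            { 30 * 1024, "ask background apps to free extra memory" },
    };

    void on_low_memory(unsigned long free_kb)
    {
            unsigned int i;

            /* Run the harshest action whose threshold we are under. */
            for (i = 0; i < sizeof(levels) / sizeof(levels[0]); i++) {
                    if (free_kb < levels[i].below_kb) {
                            printf("free=%lukB: %s\n", free_kb,
                                   levels[i].action);
                            return;
                    }
            }
    }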

>
>> (I can't remember exactly why KOSAKI gave it up. AFAIR, the signal timing
>> couldn't meet some requirements; I mean, by the time the user receives the
>> low-memory signal, it's too late. Maybe there were other reasons for KOSAKI
>> to give it up.)
>> Anyway, if system memory is low, your intelligent middleware can
>> control it much better than the VM can.
>
> Agree.
>
>> While we have the chance, how about improving it?
>> Mandeep, do you feel you need this feature?
>>
>
> mem_notify seems perfect.

BTW, regardless of mem_notify, I think this patch is useful on general
systems, too. We should keep pushing this patch forward.

>
>>
>>
>> > Regards,
>> > Mandeep
>> >
>> >> Thanks.
>> >>
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Kind regards,
>> Minchan Kim
>



--
Kind regards,
Minchan Kim

2010-11-08 21:55:42

by Mandeep Singh Baines

[permalink] [raw]
Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

Rik van Riel ([email protected]) wrote:
> On 11/03/2010 06:40 PM, Mandeep Singh Baines wrote:
>
> >I've created a patch which takes a slightly different approach.
> >Instead of limiting how fast pages get reclaimed, the patch limits
> >how fast the active list gets scanned. This should result in the
> >active list being a better measure of the working set. I've seen
> >fairly good results with this patch and a scan interval of 1
> >centisecond. I see no thrashing when the scan interval is non-zero.
> >
> >I've made it a tunable because I don't know what to set the scan
> >interval to. The final patch could set the value based on HZ and some
> >other system parameters. Maybe relate it to sched_period?
>
> I like your approach. For file pages it looks like it
> could work fine, since new pages always start on the
> inactive file list.
>
> However, for anonymous pages I could see your patch
> leading to problems, because all anonymous pages start
> on the active list. With a scan interval of 1
> centisecond, that means there would be a limit of 3200
> pages, or about 12MB of anonymous memory, that can be
> moved to the inactive list per second.
>

Good point.

> I have seen systems with single SATA disks push out
> several times that to swap per second, which matters
> when someone starts up a program that is just too big
> to fit in memory and requires that something is pushed
> out.
>
> That would reduce the size of the inactive list to
> zero, reducing our page replacement to a slow FIFO
> at best, causing false OOM kills at worst.
>
> Staying with a default of 0 would of course not do
> anything, which would make merging the code not too
> useful.
>
> I believe we absolutely need to preserve the ability
> to evict pages quickly, when new pages are brought
> into memory or allocated quickly.
>

Agree.

Instead of doing one scan of SWAP_CLUSTER_MAX pages per vmscan_interval,
we could do one "full" scan per vmscan_interval. You could do the full
scan all at once, or scan SWAP_CLUSTER_MAX pages per pass until you've
scanned the whole list.

Pseudo code:

	/* Once per vmscan_interval, arm a full scan of the active list. */
	if (zone->to_scan[file] == 0 && !list_scanned_recently(zone, file))
		zone->to_scan[file] = list_get_size(zone, file);

	/* Work off the armed scan in SWAP_CLUSTER_MAX-sized chunks. */
	if (zone->to_scan[file]) {
		shrink_active_list(nr_to_scan, zone, sc, priority, file);
		zone->to_scan[file] -= min(zone->to_scan[file], nr_to_scan);
	}

> However, speed limits are probably a very good idea
> once a cache has been reduced to a smaller size, or
> when most IO bypasses the reclaim-speed-limited cache.
>
> --
> All rights reversed

2010-11-09 02:49:51

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

> On 11/03/2010 06:40 PM, Mandeep Singh Baines wrote:
>
> > I've created a patch which takes a slightly different approach.
> > Instead of limiting how fast pages get reclaimed, the patch limits
> > how fast the active list gets scanned. This should result in the
> > active list being a better measure of the working set. I've seen
> > fairly good results with this patch and a scan interval of 1
> > centisecond. I see no thrashing when the scan interval is non-zero.
> >
> > I've made it a tunable because I don't know what to set the scan
> > interval to. The final patch could set the value based on HZ and some
> > other system parameters. Maybe relate it to sched_period?
>
> I like your approach. For file pages it looks like it
> could work fine, since new pages always start on the
> inactive file list.
>
> However, for anonymous pages I could see your patch
> leading to problems, because all anonymous pages start
> on the active list. With a scan interval of 1
> centisecond, that means there would be a limit of 3200
> pages, or about 12MB of anonymous memory, that can be
> moved to the inactive list per second.
>
> I have seen systems with single SATA disks push out
> several times that to swap per second, which matters
> when someone starts up a program that is just too big
> to fit in memory and requires that something is pushed
> out.
>
> That would reduce the size of the inactive list to
> zero, reducing our page replacement to a slow FIFO
> at best, causing false OOM kills at worst.
>
> Staying with a default of 0 would of course not do
> anything, which would make merging the code not too
> useful.
>
> I believe we absolutely need to preserve the ability
> to evict pages quickly, when new pages are brought
> into memory or allocated quickly.
>
> However, speed limits are probably a very good idea
> once a cache has been reduced to a smaller size, or
> when most IO bypasses the reclaim-speed-limited cache.

Yeah.

But I doubt a fixed rate limit is a good thing. In the movie-playing case
(aka the streaming I/O case), we don't want any throttling, I think.
Also, I don't like the jiffies dependency. CPU hardware improvements will
naturally break such heuristics.


BTW, congestion_wait() already has a jiffies dependency today, but I think
we should kill such strange timeouts eventually.

2010-11-09 02:53:15

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

> > I don't think current VM behavior has a problem.
> > The problem is that you use up more memory than you physically have.
> > As system memory without swap is low, the VM doesn't have many choices.
> > It ends up evicting your working set to meet the user's requests. That's
> > a very natural result for a greedy user.
> >
> > Rather than an OOM notifier, what we need is a memory notifier.
> > AFAIR, some years ago KOSAKI tried a similar thing:
> > http://lwn.net/Articles/268732/
>
> Thanks! This is perfect. I wonder why it's not merged. Was a different
> solution eventually implemented? Is there another way of doing the
> same thing?

memcg now has a memory threshold notification feature, and many people are
using it. If you think notification fits your case, could you please try
that feature first? If it doesn't fit your case, give us the feedback and
we can probably extend it.
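For reference, registering such a threshold goes through eventfd plus
cgroup.event_control; a sketch, assuming a memcg mounted at /cgroup/foo and
an (arbitrary) 30MB threshold:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(void)
    {
            int efd = eventfd(0, 0);
            int ufd = open("/cgroup/foo/memory.usage_in_bytes", O_RDONLY);
            int cfd = open("/cgroup/foo/cgroup.event_control", O_WRONLY);
            char buf[64];
            uint64_t ticks;

            if (efd < 0 || ufd < 0 || cfd < 0)
                    return 1;

            /* Register: "<eventfd> <usage fd> <threshold in bytes>" */
            int len = snprintf(buf, sizeof(buf), "%d %d %llu",
                               efd, ufd, 30ULL * 1024 * 1024);
            if (write(cfd, buf, len) < 0)
                    return 1;

            /* Blocks until usage crosses the threshold (either way). */
            read(efd, &ticks, sizeof(ticks));
            fprintf(stderr, "memcg threshold crossed\n");
            return 0;
    }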

Thanks.