LinuxLists.cc - [PATCH v2] vmscan: limit concurrent reclaimers in shrink

2009-12-11 21:47:10

Subject: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

Under very heavy multi-process workloads, like AIM7, the VM can
get into trouble in a variety of ways. The trouble start when
there are hundreds, or even thousands of processes active in the
page reclaim code.

Not only can the system suffer enormous slowdowns because of
lock contention (and conditional reschedules) between thousands
of processes in the page reclaim code, but each process will try
to free up to SWAP_CLUSTER_MAX pages, even when the system already
has lots of memory free.

It should be possible to avoid both of those issues at once, by
simply limiting how many processes are active in the page reclaim
code simultaneously.

If too many processes are active doing page reclaim in one zone,
simply go to sleep in shrink_zone().

On wakeup, check whether enough memory has been freed already
before jumping into the page reclaim code ourselves. We want
to use the same threshold here that is used in the page allocator
for deciding whether or not to call the page reclaim code in the
first place, otherwise some unlucky processes could end up freeing
memory for the rest of the system.

Reported-by: Larry Woodman <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>

---
v2:
- fix typos in sysctl.c and vm.txt
- move the code in sysctl.c out from under the ifdef
- only __GFP_FS|__GFP_IO tasks can wait

Documentation/sysctl/vm.txt | 18 ++++++++++++++
include/linux/mmzone.h | 4 +++
include/linux/swap.h | 1 +
kernel/sysctl.c | 7 +++++
mm/page_alloc.c | 3 ++
mm/vmscan.c | 40 +++++++++++++++++++++++++++++++++
6 files changed, 73 insertions(+), 0 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index fc5790d..8bd1a96 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -32,6 +32,7 @@ Currently, these files are in /proc/sys/vm:
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
+- max_zone_concurrent_reclaimers
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
@@ -278,6 +279,23 @@ The default value is 65536.

=============================================================

+max_zone_concurrent_reclaimers:
+
+The number of processes that are allowed to simultaneously reclaim
+memory from a particular memory zone.
+
+With certain workloads, hundreds of processes end up in the page
+reclaim code simultaneously. This can cause large slowdowns due
+to lock contention, freeing of way too much memory and occasionally
+false OOM kills.
+
+To avoid these problems, only allow a smaller number of processes
+to reclaim pages from each memory zone simultaneously.
+
+The default value is 8.
+
+=============================================================
+
memory_failure_early_kill:

Control how to kill processes when uncorrected memory error (typically
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 30fe668..ed614b8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -345,6 +345,10 @@ struct zone {
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];

+ /* Number of processes running page reclaim code on this zone. */
+ atomic_t concurrent_reclaimers;
+ wait_queue_head_t reclaim_wait;
+
/*
* prev_priority holds the scanning priority for this zone. It is
* defined as the scanning priority at which we achieved our reclaim
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a2602a8..661eec7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,6 +254,7 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern long vm_total_pages;
+extern int max_zone_concurrent_reclaimers;

#ifdef CONFIG_NUMA
extern int zone_reclaim_mode;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6ff0ae6..4ec17ed 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1271,6 +1271,13 @@ static struct ctl_table vm_table[] = {
.extra2 = &one,
},
#endif
+ {
+ .procname = "max_zone_concurrent_reclaimers",
+ .data = &max_zone_concurrent_reclaimers,
+ .maxlen = sizeof(max_zone_concurrent_reclaimers),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },

/*
* NOTE: do not add new entries to this table unless you have read
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 11ae66e..ca9cae1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3852,6 +3852,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,

zone->prev_priority = DEF_PRIORITY;

+ atomic_set(&zone->concurrent_reclaimers, 0);
+ init_waitqueue_head(&zone->reclaim_wait);
+
zone_pcp_init(zone);
for_each_lru(l) {
INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2bbee91..ecfe28c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -40,6 +40,7 @@
#include <linux/memcontrol.h>
#include <linux/delayacct.h>
#include <linux/sysctl.h>
+#include <linux/wait.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -129,6 +130,17 @@ struct scan_control {
int vm_swappiness = 60;
long vm_total_pages; /* The total number of pages which the VM controls */

+/*
+ * Maximum number of processes concurrently running the page
+ * reclaim code in a memory zone. Having too many processes
+ * just results in them burning CPU time waiting for locks,
+ * so we're better off limiting page reclaim to a sane number
+ * of processes at a time. We do this per zone so local node
+ * reclaim on one NUMA node will not block other nodes from
+ * making progress.
+ */
+int max_zone_concurrent_reclaimers = 8;
+
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);

@@ -1600,6 +1612,31 @@ static void shrink_zone(int priority, struct zone *zone,
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
int noswap = 0;

+ if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
+ max_zone_concurrent_reclaimers &&
+ (sc->gfp_mask & (__GFP_IO|__GFP_FS)) ==
+ (__GFP_IO|__GFP_FS)) {
+ /*
+ * Do not add to the lock contention if this zone has
+ * enough processes doing page reclaim already, since
+ * we would just make things slower.
+ */
+ sleep_on(&zone->reclaim_wait);
+
+ /*
+ * If other processes freed enough memory while we waited,
+ * break out of the loop and go back to the allocator.
+ */
+ if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
+ 0, 0)) {
+ wake_up(&zone->reclaim_wait);
+ sc->nr_reclaimed += nr_to_reclaim;
+ return;
+ }
+ }
+
+ atomic_inc(&zone->concurrent_reclaimers);
+
/* If we have no swap space, do not bother scanning anon pages. */
if (!sc->may_swap || (nr_swap_pages <= 0)) {
noswap = 1;
@@ -1655,6 +1692,9 @@ static void shrink_zone(int priority, struct zone *zone,
shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);

throttle_vm_writeout(sc->gfp_mask);
+
+ atomic_dec(&zone->concurrent_reclaimers);
+ wake_up(&zone->reclaim_wait);
}

/*

2009-12-14 00:15:27

by Minchan Kim

[permalink] [raw]

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

Hi, Rik.

On Sat, Dec 12, 2009 at 6:46 AM, Rik van Riel <[email protected]> wrote:
> Under very heavy multi-process workloads, like AIM7, the VM can
> get into trouble in a variety of ways. The trouble start when
> there are hundreds, or even thousands of processes active in the
> page reclaim code.
>
> Not only can the system suffer enormous slowdowns because of
> lock contention (and conditional reschedules) between thousands
> of processes in the page reclaim code, but each process will try
> to free up to SWAP_CLUSTER_MAX pages, even when the system already
> has lots of memory free.
>
> It should be possible to avoid both of those issues at once, by
> simply limiting how many processes are active in the page reclaim
> code simultaneously.
>
> If too many processes are active doing page reclaim in one zone,
> simply go to sleep in shrink_zone().
>
> On wakeup, check whether enough memory has been freed already
> before jumping into the page reclaim code ourselves. We want
> to use the same threshold here that is used in the page allocator
> for deciding whether or not to call the page reclaim code in the
> first place, otherwise some unlucky processes could end up freeing
> memory for the rest of the system.

I am worried about one.

Now, we can put too many processes reclaim_wait with NR_UNINTERRUBTIBLE state.
If OOM happens, OOM will kill many innocent processes since
uninterruptible task
can't handle kill signal until the processes free from reclaim_wait list.

I think reclaim_wait list staying time might be long if VM pressure is heavy.
Is this a exaggeration?

If it is serious problem, how about this?

We add new PF_RECLAIM_BLOCK flag and don't pick the process
in select_bad_process.

--
Kind regards,
Minchan Kim

2009-12-14 04:10:05

by Rik van Riel

[permalink] [raw]

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

On 12/13/2009 07:14 PM, Minchan Kim wrote:

> On Sat, Dec 12, 2009 at 6:46 AM, Rik van Riel<[email protected]> wrote:

>> If too many processes are active doing page reclaim in one zone,
>> simply go to sleep in shrink_zone().

> I am worried about one.
>
> Now, we can put too many processes reclaim_wait with NR_UNINTERRUBTIBLE state.
> If OOM happens, OOM will kill many innocent processes since
> uninterruptible task
> can't handle kill signal until the processes free from reclaim_wait list.
>
> I think reclaim_wait list staying time might be long if VM pressure is heavy.
> Is this a exaggeration?
>
> If it is serious problem, how about this?
>
> We add new PF_RECLAIM_BLOCK flag and don't pick the process
> in select_bad_process.

A simpler solution may be to use sleep_on_interruptible, and
simply have the process continue into shrink_zone() if it
gets a signal.

--
All rights reversed.

2009-12-14 04:20:06

by Minchan Kim

[permalink] [raw]

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

On Mon, Dec 14, 2009 at 1:09 PM, Rik van Riel <[email protected]> wrote:
> On 12/13/2009 07:14 PM, Minchan Kim wrote:
>
>> On Sat, Dec 12, 2009 at 6:46 AM, Rik van Riel<[email protected]> wrote:
>
>>> If too many processes are active doing page reclaim in one zone,
>>> simply go to sleep in shrink_zone().
>
>> I am worried about one.
>>
>> Now, we can put too many processes reclaim_wait with NR_UNINTERRUBTIBLE
>> state.
>> If OOM happens, OOM will kill many innocent processes since
>> uninterruptible task
>> can't handle kill signal until the processes free from reclaim_wait list.
>>
>> I think reclaim_wait list staying time might be long if VM pressure is
>> heavy.
>> Is this a exaggeration?
>>
>> If it is serious problem, how about this?
>>
>> We add new PF_RECLAIM_BLOCK flag and don't pick the process
>> in select_bad_process.
>
> A simpler solution may be to use sleep_on_interruptible, and
> simply have the process continue into shrink_zone() if it
> gets a signal.

I thought it but I was not sure.
Okay. If it is possible, It' more simple.
Could you repost patch with that?

Sorry but I have one requesting.

===

+The default value is 8.
+
+=============================================================

I like this. but why do you select default value as constant 8?
Do you have any reason?

I think it would be better to select the number proportional to NR_CPU.
ex) NR_CPU * 2 or something.

Otherwise looks good to me.

Pessimistically, I assume that the pageout code spends maybe
10% of its time on locking (we have seen far, far worse than
this with thousands of processes in the pageout code). That
means if we have more than 10 threads in the pageout code,
we could end up spending more time on locking and less doing
real work - slowing everybody down.

I rounded it down to the closest power of 2 to come up with
an arbitrary number that looked safe :)
===

We discussed above.
I want to add your desciption into changelog.
That's because after long time, We don't know why we select '8' as
default value.
Your desciption in changelog will explain it to follow-up people. :)

Sorry for bothering you.

> --
> All rights reversed.
>

--
Kind regards,
Minchan Kim

2009-12-14 04:30:30

by Rik van Riel

[permalink] [raw]

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

On 12/13/2009 11:19 PM, Minchan Kim wrote:
> On Mon, Dec 14, 2009 at 1:09 PM, Rik van Riel<[email protected]> wrote:

>> A simpler solution may be to use sleep_on_interruptible, and
>> simply have the process continue into shrink_zone() if it
>> gets a signal.
>
> I thought it but I was not sure.
> Okay. If it is possible, It' more simple.
> Could you repost patch with that?

Sure, not a problem.

> +The default value is 8.
> +
> +=============================================================
>
>
> I like this. but why do you select default value as constant 8?
> Do you have any reason?
>
> I think it would be better to select the number proportional to NR_CPU.
> ex) NR_CPU * 2 or something.
>
> Otherwise looks good to me.
>
>
> Pessimistically, I assume that the pageout code spends maybe
> 10% of its time on locking (we have seen far, far worse than
> this with thousands of processes in the pageout code). That
> means if we have more than 10 threads in the pageout code,
> we could end up spending more time on locking and less doing
> real work - slowing everybody down.
>
> I rounded it down to the closest power of 2 to come up with
> an arbitrary number that looked safe :)
> ===
>
> We discussed above.
> I want to add your desciption into changelog.

The thing is, I don't know if 8 is the best value for
the default, which is a reason I made it tunable in
the first place.

There are a lot of assumptions in my reasoning, and
they may be wrong. I suspect that documenting something
wrong is probably worse than letting people wonder out
the default (and maybe finding a better one).

--
All rights reversed.

2009-12-14 05:01:01

by Minchan Kim

[permalink] [raw]

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

On Mon, Dec 14, 2009 at 1:29 PM, Rik van Riel <[email protected]> wrote:
> On 12/13/2009 11:19 PM, Minchan Kim wrote:
>>
>> On Mon, Dec 14, 2009 at 1:09 PM, Rik van Riel<[email protected]> wrote:
>
>>> A simpler solution may be to use sleep_on_interruptible, and
>>> simply have the process continue into shrink_zone() if it
>>> gets a signal.
>>
>> I thought it but I was not sure.
>> Okay. If it is possible, It' more simple.
>> Could you repost patch with that?
>
> Sure, not a problem.
>
>> +The default value is 8.
>> +
>> +=============================================================
>>
>>
>> I like this. but why do you select default value as constant 8?
>> Do you have any reason?
>>
>> I think it would be better to select the number proportional to
>> NR_CPU.
>> ex) NR_CPU * 2 or something.
>>
>> Otherwise looks good to me.
>>
>>
>> Pessimistically, I assume that the pageout code spends maybe
>> 10% of its time on locking (we have seen far, far worse than
>> this with thousands of processes in the pageout code). That
>> means if we have more than 10 threads in the pageout code,
>> we could end up spending more time on locking and less doing
>> real work - slowing everybody down.
>>
>> I rounded it down to the closest power of 2 to come up with
>> an arbitrary number that looked safe :)
>> ===
>>
>> We discussed above.
>> I want to add your desciption into changelog.
>
> The thing is, I don't know if 8 is the best value for
> the default, which is a reason I made it tunable in
> the first place.
>
> There are a lot of assumptions in my reasoning, and
> they may be wrong. I suspect that documenting something
> wrong is probably worse than letting people wonder out
> the default (and maybe finding a better one).

Indeed. But whenever I see magic values in kernel, I have a question
about that.
Actually I used to doubt the value because I guess
"that value was determined by server or desktop experiments.
so probably it don't proper small system."
At least if we put the logical why which might be wrong,
later people can think that value is not proper any more now or his
system(ex, small or super computer and so on) by based on our
description.
so they can improve it.

I think it isn't important your reasoning is right or wrong.
Most important thing is which logical reason determines that value.

I want to not bother you if you mind my suggestion.
Pz, think it was just nitpick. :)

> --
> All rights reversed.
>

--
Kind regards,
Minchan Kim

2009-12-14 12:22:33

Subject: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

Subject: [cleanup][PATCH 1/8] vmscan: Make shrink_zone_begin/end helper function

Subject: [PATCH 2/8] Mark sleep_on as deprecated

Subject: [PATCH 3/8] Don't use sleep_on()

Subject: [PATCH 4/8] Use prepare_to_wait_exclusive() instead prepare_to_wait()

Subject: [PATCH 5/8] Use io_schedule() instead schedule()

Subject: [PATCH 6/8] Stop reclaim quickly when the task reclaimed enough lots pages

Subject: [PATCH 7/8] Use TASK_KILLABLE instead TASK_UNINTERRUPTIBLE

Subject: [PATCH 8/8] mm: Give up allocation if the task have fatal signal

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

Subject: Re: [PATCH 2/8] Mark sleep_on as deprecated

Subject: Re: [PATCH 4/8] Use prepare_to_wait_exclusive() instead prepare_to_wait()

Subject: Re: [cleanup][PATCH 1/8] vmscan: Make shrink_zone_begin/end helper function

Subject: Re: [PATCH 2/8] Mark sleep_on as deprecated

Subject: Re: [PATCH 3/8] Don't use sleep_on()

Subject: Re: [PATCH 5/8] Use io_schedule() instead schedule()

Subject: Re: [PATCH 6/8] Stop reclaim quickly when the task reclaimed enough lots pages

Subject: Re: [PATCH 7/8] Use TASK_KILLABLE instead TASK_UNINTERRUPTIBLE

Subject: Re: [PATCH 8/8] mm: Give up allocation if the task have fatal signal

Subject: Re: [PATCH 2/8] Mark sleep_on as deprecated

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

Subject: Re: [cleanup][PATCH 1/8] vmscan: Make shrink_zone_begin/end helper function

Subject: Re: [PATCH 2/8] Mark sleep_on as deprecated

Subject: Re: [PATCH 3/8] Don't use sleep_on()

Subject: Re: [PATCH 4/8] Use prepare_to_wait_exclusive() instead prepare_to_wait()

Subject: Re: [PATCH 5/8] Use io_schedule() instead schedule()

Subject: Re: [PATCH 6/8] Stop reclaim quickly when the task reclaimed enough lots pages

Subject: Re: [PATCH 7/8] Use TASK_KILLABLE instead TASK_UNINTERRUPTIBLE

Subject: Re: [PATCH 8/8] mm: Give up allocation if the task have fatal signal

Subject: Re: [PATCH 6/8] Stop reclaim quickly when the task reclaimed enough lots pages

Subject: Re: [PATCH 6/8] Stop reclaim quickly when the task reclaimed enough lots pages

Subject: Re: [PATCH 4/8] Use prepare_to_wait_exclusive() instead prepare_to_wait()

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

Subject: Re: [PATCH 8/8] mm: Give up allocation if the task have fatal signal

Subject: Re: [PATCH 5/8] Use io_schedule() instead schedule()

Subject: Re: [PATCH 8/8] mm: Give up allocation if the task have fatal signal

Subject: Re: [PATCH 5/8] Use io_schedule() instead schedule()

Subject: Re: [PATCH 8/8] mm: Give up allocation if the task have fatal signal

Subject: Re: [PATCH 4/8] Use prepare_to_wait_exclusive() instead prepare_to_wait()

Subject: Re: [PATCH 4/8] Use prepare_to_wait_exclusive() instead prepare_to_wait()

Subject: Re: [PATCH 4/8] Use prepare_to_wait_exclusive() instead prepare_to_wait()

Subject: Re: [PATCH 4/8] Use prepare_to_wait_exclusive() instead prepare_to_wait()

Subject: Re: [PATCH 4/8] Use prepare_to_wait_exclusive() instead prepare_to_wait()

Subject: Re: [PATCH 4/8] Use prepare_to_wait_exclusive() instead prepare_to_wait()

Subject: Re: [PATCH 4/8] Use prepare_to_wait_exclusive() instead prepare_to_wait()

Subject: Re: [PATCH 4/8] Use prepare_to_wait_exclusive() instead prepare_to_wait()

Subject: Re: [PATCH 4/8] Use prepare_to_wait_exclusive() instead prepare_to_wait()

Subject: Re: [PATCH 4/8] Use prepare_to_wait_exclusive() instead prepare_to_wait()

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone