2009-12-10 23:56:34

by Rik van Riel

Subject: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

Under very heavy multi-process workloads, like AIM7, the VM can
get into trouble in a variety of ways. The trouble starts when
there are hundreds, or even thousands of processes active in the
page reclaim code.

Not only can the system suffer enormous slowdowns because of
lock contention (and conditional reschedules) between thousands
of processes in the page reclaim code, but each process will try
to free up to SWAP_CLUSTER_MAX pages, even when the system already
has lots of memory free. In Larry's case, this resulted in over
6000 processes fighting over locks in the page reclaim code, even
though the system already had 1.5GB of free memory.

It should be possible to avoid both of those issues at once, by
simply limiting how many processes are active in the page reclaim
code simultaneously.

If too many processes are active doing page reclaim in one zone,
simply go to sleep in shrink_zone().

On wakeup, check whether enough memory has been freed already
before jumping into the page reclaim code ourselves. We want
to use the same threshold here that is used in the page allocator
for deciding whether or not to call the page reclaim code in the
first place, otherwise some unlucky processes could end up freeing
memory for the rest of the system.

Reported-by: Larry Woodman <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>

---
This patch is against today's MMOTM tree. It has only been compile tested;
I do not have an AIM7 system standing by.

Larry, does this fix your issue?

Documentation/sysctl/vm.txt | 18 ++++++++++++++++++
include/linux/mmzone.h | 4 ++++
include/linux/swap.h | 1 +
kernel/sysctl.c | 7 +++++++
mm/page_alloc.c | 3 +++
mm/vmscan.c | 38 ++++++++++++++++++++++++++++++++++++++
6 files changed, 71 insertions(+)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index fc5790d..5cf766f 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -32,6 +32,7 @@ Currently, these files are in /proc/sys/vm:
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
+- max_zone_concurrent_reclaim
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
@@ -278,6 +279,23 @@ The default value is 65536.

=============================================================

+max_zone_concurrent_reclaim:
+
+The number of processes that are allowed to simultaneously reclaim
+memory from a particular memory zone.
+
+With certain workloads, hundreds of processes end up in the page
+reclaim code simultaneously. This can cause large slowdowns due
+to lock contention, freeing of way too much memory and occasionally
+false OOM kills.
+
+To avoid these problems, only allow a smaller number of processes
+to reclaim pages from each memory zone simultaneously.
+
+The default value is 8.
+
+=============================================================
+
memory_failure_early_kill:

Control how to kill processes when uncorrected memory error (typically
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 30fe668..ed614b8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -345,6 +345,10 @@ struct zone {
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];

+ /* Number of processes running page reclaim code on this zone. */
+ atomic_t concurrent_reclaimers;
+ wait_queue_head_t reclaim_wait;
+
/*
* prev_priority holds the scanning priority for this zone. It is
* defined as the scanning priority at which we achieved our reclaim
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a2602a8..661eec7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,6 +254,7 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern long vm_total_pages;
+extern int max_zone_concurrent_reclaimers;

#ifdef CONFIG_NUMA
extern int zone_reclaim_mode;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6ff0ae6..89b919c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1270,6 +1270,13 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one,
},
+ {
+ .procname = "max_zone_concurrent_reclaimers",
+ .data = &max_zone_concurrent_reclaimers,
+ .maxlen = sizeof(max_zone_concurrent_reclaimers),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
#endif

/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 11ae66e..ca9cae1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3852,6 +3852,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,

zone->prev_priority = DEF_PRIORITY;

+ atomic_set(&zone->concurrent_reclaimers, 0);
+ init_waitqueue_head(&zone->reclaim_wait);
+
zone_pcp_init(zone);
for_each_lru(l) {
INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2bbee91..cf3ef29 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -40,6 +40,7 @@
#include <linux/memcontrol.h>
#include <linux/delayacct.h>
#include <linux/sysctl.h>
+#include <linux/wait.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -129,6 +130,17 @@ struct scan_control {
int vm_swappiness = 60;
long vm_total_pages; /* The total number of pages which the VM controls */

+/*
+ * Maximum number of processes concurrently running the page
+ * reclaim code in a memory zone. Having too many processes
+ * just results in them burning CPU time waiting for locks,
+ * so we're better off limiting page reclaim to a sane number
+ * of processes at a time. We do this per zone so local node
+ * reclaim on one NUMA node will not block other nodes from
+ * making progress.
+ */
+int max_zone_concurrent_reclaimers = 8;
+
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);

@@ -1600,6 +1612,29 @@ static void shrink_zone(int priority, struct zone *zone,
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
int noswap = 0;

+ if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
+ max_zone_concurrent_reclaimers) {
+ /*
+ * Do not add to the lock contention if this zone has
+ * enough processes doing page reclaim already, since
+ * we would just make things slower.
+ */
+ sleep_on(&zone->reclaim_wait);
+
+ /*
+ * If other processes freed enough memory while we waited,
+ * break out of the loop and go back to the allocator.
+ */
+ if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
+ 0, 0)) {
+ wake_up(&zone->reclaim_wait);
+ sc->nr_reclaimed += nr_to_reclaim;
+ return;
+ }
+ }
+
+ atomic_inc(&zone->concurrent_reclaimers);
+
/* If we have no swap space, do not bother scanning anon pages. */
if (!sc->may_swap || (nr_swap_pages <= 0)) {
noswap = 1;
@@ -1655,6 +1690,9 @@ static void shrink_zone(int priority, struct zone *zone,
shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);

throttle_vm_writeout(sc->gfp_mask);
+
+ atomic_dec(&zone->concurrent_reclaimers);
+ wake_up(&zone->reclaim_wait);
}

/*


2009-12-11 02:03:48

by Minchan Kim

Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

Hi, Rik.


On Fri, Dec 11, 2009 at 8:56 AM, Rik van Riel <[email protected]> wrote:
> Under very heavy multi-process workloads, like AIM7, the VM can
> get into trouble in a variety of ways.  The trouble starts when
> there are hundreds, or even thousands of processes active in the
> page reclaim code.
>
> [...]
>
> +max_zone_concurrent_reclaim:
> +
> +The number of processes that are allowed to simultaneously reclaim
> +memory from a particular memory zone.
> +
> +With certain workloads, hundreds of processes end up in the page
> +reclaim code simultaneously.  This can cause large slowdowns due
> +to lock contention, freeing of way too much memory and occasionally
> +false OOM kills.
> +
> +To avoid these problems, only allow a smaller number of processes
> +to reclaim pages from each memory zone simultaneously.
> +
> +The default value is 8.
> +
> +=============================================================

I like this, but why do you select the default value as a constant 8?
Do you have any reason?

I think it would be better to select a number proportional to NR_CPU,
e.g. NR_CPU * 2 or something.

Otherwise looks good to me.

Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2009-12-11 03:19:56

by Rik van Riel

Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

On 12/10/2009 09:03 PM, Minchan Kim wrote:

>> +The default value is 8.
>> +
>> +=============================================================
>
> I like this, but why do you select the default value as a constant 8?
> Do you have any reason?
>
> I think it would be better to select a number proportional to NR_CPU,
> e.g. NR_CPU * 2 or something.
>
> Otherwise looks good to me.

Pessimistically, I assume that the pageout code spends maybe
10% of its time on locking (we have seen far, far worse than
this with thousands of processes in the pageout code). That
means if we have more than 10 threads in the pageout code,
we could end up spending more time on locking and less doing
real work - slowing everybody down.

I rounded it down to the closest power of 2 to come up with
an arbitrary number that looked safe :)

However, this number is per zone - I imagine that really large
systems will have multiple memory zones, so they can run with
more than 8 processes in the pageout code simultaneously.
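
Spelling that estimate out: with f the assumed fraction of reclaim time
spent on locking (a pessimistic guess, not a measurement), the locks
saturate once N concurrent reclaimers satisfy N * f >= 1:

    N * f <= 1  =>  N <= 1/f = 1/0.1 = 10

and rounding 10 down to the nearest power of two gives 8.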

> Reviewed-by: Minchan Kim<[email protected]>

Thank you.

--
All rights reversed.

2009-12-11 03:43:47

by Minchan Kim

Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

On Fri, Dec 11, 2009 at 12:19 PM, Rik van Riel <[email protected]> wrote:
> On 12/10/2009 09:03 PM, Minchan Kim wrote:
>
>>> +The default value is 8.
>>
>> I like this, but why do you select the default value as a constant 8?
>> Do you have any reason?
>>
>> I think it would be better to select a number proportional to NR_CPU,
>> e.g. NR_CPU * 2 or something.
>>
>> Otherwise looks good to me.
>
> Pessimistically, I assume that the pageout code spends maybe
> 10% of its time on locking (we have seen far, far worse than
> this with thousands of processes in the pageout code).  That
> means if we have more than 10 threads in the pageout code,
> we could end up spending more time on locking and less doing
> real work - slowing everybody down.

Thanks for giving me this precious information. :)

We always have questions about magic values that come with no comment.

Actually, I don't want to add another magic value without a comment,
so I would like to add your good explanation to the changelog or as a
comment in the code.

>
> I rounded it down to the closest power of 2 to come up with
> an arbitrary number that looked safe :)
>
> However, this number is per zone - I imagine that really large
> systems will have multiple memory zones, so they can run with
> more than 8 processes in the pageout code simultaneously.
>
>> Reviewed-by: Minchan Kim<[email protected]>
>
> Thank you.

Thanks for the quick reply. :)




--
Kind regards,
Minchan Kim

2009-12-11 11:54:38

by Larry Woodman

Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

Rik van Riel wrote:
> Under very heavy multi-process workloads, like AIM7, the VM can
> get into trouble in a variety of ways. The trouble starts when
> there are hundreds, or even thousands of processes active in the
> page reclaim code.
>
> [...]
>
FYI everyone, I put together a similar patch (it doesn't block explicitly;
it relies on vm_throttle) for RHEL5 (the 2.6.18-based kernel) and it fixes
hangs due to several CPUs spinning on zone->lru_lock. This might eliminate
the need to add complexity to the reclaim code. It is impossible to make
that code parallel enough to allow hundreds or even thousands of processes
in it at the same time.

I will backport this patch and test it, as well as testing it on the
latest kernel.

Larry

2009-12-11 12:13:10

by Larry Woodman

Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

Minchan Kim wrote:
>
> I like this, but why do you select the default value as a constant 8?
> Do you have any reason?
>
> I think it would be better to select a number proportional to NR_CPU,
> e.g. NR_CPU * 2 or something.
>
> Otherwise looks good to me.
>
> Reviewed-by: Minchan Kim <[email protected]>
>
>
This is a per-zone count, so perhaps a reasonable default is the number
of CPUs on the NUMA node that the zone resides on?
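
A hypothetical sketch of that default (zone->max_concurrent_reclaimers
is an invented field; nr_cpus_node() and zone_to_nid() are existing
helpers):

/*
 * Sketch only: size each zone's reclaimer limit by the CPU count of
 * the node the zone lives on, instead of a flat 8.
 */
static void init_zone_reclaim_limit(struct zone *zone)
{
	int cpus = nr_cpus_node(zone_to_nid(zone));

	/* At least one reclaimer per zone, even on a tiny node. */
	zone->max_concurrent_reclaimers = max(1, cpus);
}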

2009-12-11 13:41:28

by Minchan Kim

Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

Hi, Larry.

On Fri, Dec 11, 2009 at 9:07 PM, Larry Woodman <[email protected]> wrote:
> Minchan Kim wrote:
>> [...]
>
> This is a per-zone count, so perhaps a reasonable default is the number
> of CPUs on the NUMA node that the zone resides on?

For example, assume one CPU per node.
That means your default value is 1.
On that CPU, process A tries to reclaim the HIGH zone.
Process B wants to reclaim the NORMAL zone.
But process B can't enter the reclaim path, since the throttle default
value is 1. Even kswapd can't reclaim.

I think it's a really aggressive throttling approach, although it would
solve your problem.

I have another idea; a rough sketch follows below.

We make the default value rather big and provide a latency value as a
knob. So at first, many processes can enter the reclaim path. When the
shrinking time exceeds our knob (e.g. some HZ), we can decrease the
number of concurrent reclaim processes. If the shrink time is still long
although we do that, we can decrease the value again. When the shrink
time is fast, we can allow more processes to enter the reclaim path by
increasing the number.

It's like the old pdflush mechanism, but more complex than Rik's approach.
If Rik's approach solves this problem well, my approach is rather
overkill, I think.
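
Roughly like this sketch (reclaim_latency_jiffies and
zone->reclaim_limit are invented names, not from Rik's patch):

/*
 * Sketch of the feedback idea: tighten the per-zone limit when a
 * reclaim pass runs longer than a latency knob, and relax it again
 * when reclaim is fast.
 */
static void adjust_reclaim_limit(struct zone *zone, unsigned long start)
{
	unsigned long elapsed = jiffies - start;

	if (elapsed > reclaim_latency_jiffies) {
		/* Shrinking is slow: admit fewer concurrent reclaimers. */
		if (zone->reclaim_limit > 1)
			zone->reclaim_limit--;
	} else if (zone->reclaim_limit < max_zone_concurrent_reclaimers) {
		/* Shrinking is fast: let more processes back in. */
		zone->reclaim_limit++;
	}
}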

I am looking forward to Rik's approach working well.


--
Kind regards,
Minchan Kim

2009-12-11 13:48:33

by Rik van Riel

Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

On 12/11/2009 07:07 AM, Larry Woodman wrote:
> Minchan Kim wrote:
>> [...]
> This is a per-zone count, so perhaps a reasonable default is the number
> of CPUs on the NUMA node that the zone resides on?

One reason I made it tunable is so people can easily test
what a good value would be :)

--
All rights reversed.

2009-12-11 13:52:30

by Rik van Riel

Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

On 12/11/2009 08:41 AM, Minchan Kim wrote:
> Hi, Larry.
>
> On Fri, Dec 11, 2009 at 9:07 PM, Larry Woodman<[email protected]> wrote:
>> Minchan Kim wrote:
>>> [...]
>>
>> This is a per-zone count, so perhaps a reasonable default is the number
>> of CPUs on the NUMA node that the zone resides on?
>
> For example, assume one CPU per node.
> That means your default value is 1.
> On that CPU, process A tries to reclaim the HIGH zone.
> Process B wants to reclaim the NORMAL zone.
> But process B can't enter the reclaim path, since the throttle default
> value is 1. Even kswapd can't reclaim.

1) the value is per zone, so process B can go ahead

2) kswapd is always exempt from this limit, since
there is only 1 kswapd per node anyway

--
All rights reversed.

2009-12-11 14:08:40

by Minchan Kim

Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

On Fri, Dec 11, 2009 at 10:51 PM, Rik van Riel <[email protected]> wrote:
> On 12/11/2009 08:41 AM, Minchan Kim wrote:
>>
>> Hi, Larry.
>>
>> On Fri, Dec 11, 2009 at 9:07 PM, Larry Woodman<[email protected]>
>>  wrote:
>>>
>>> Minchan Kim wrote:
>>>> [...]
>>>
>>> This is a per-zone count, so perhaps a reasonable default is the
>>> number of CPUs on the NUMA node that the zone resides on?
>>
>> For example, assume one CPU per node.
>> That means your default value is 1.
>> On that CPU, process A tries to reclaim the HIGH zone.
>> Process B wants to reclaim the NORMAL zone.
>> But process B can't enter the reclaim path, since the throttle default
>> value is 1. Even kswapd can't reclaim.
>
> 1) the value is per zone, so process B can go ahead

Sorry, I misunderstood Larry's point.
I thought Larry mentioned a global limit, not a per-zone one.

> 2) kswapd is always exempt from this limit, since
>   there is only 1 kswapd per node anyway

Larry could test with Rik's patch to find out what a good default value is.
If it proves that the number of CPUs on the node is the proper default,
we can change the default value to that.





--
Kind regards,
Minchan Kim

2009-12-11 21:25:36

by Rik van Riel

Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

On 12/10/2009 09:03 PM, Minchan Kim wrote:
> On Fri, Dec 11, 2009 at 8:56 AM, Rik van Riel<[email protected]> wrote:
>> Under very heavy multi-process workloads, like AIM7, the VM can
>> get into trouble in a variety of ways. The trouble start when
>> there are hundreds, or even thousands of processes active in the
>> page reclaim code.

> Otherwise looks good to me.
>
> Reviewed-by: Minchan Kim<[email protected]>

OK, we found three issues with my patch :)

1) there is a typo in sysctl.c

2) there is another typo in Documentation/sysctl/vm.txt

3) the code in vmscan.c has a bug, where tasks without
__GFP_IO or __GFP_FS can end up waiting for tasks
with __GFP_IO or __GFP_FS, leading to a deadlock

I will fix these issues and send out a new patch.
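
For issue 3, the fix will probably take a shape like this sketch (not
the final patch):

/*
 * Only tasks that may do both IO and FS operations are allowed to
 * sleep waiting for other reclaimers.  A !__GFP_FS or !__GFP_IO
 * allocator may hold locks that the reclaimers it would wait for
 * need, so it has to press on instead of risking a deadlock.
 */
static bool may_wait_for_reclaimers(struct scan_control *sc)
{
	const gfp_t required = __GFP_IO | __GFP_FS;

	if (current_is_kswapd())
		return false;	/* kswapd is exempt from the limit */

	return (sc->gfp_mask & required) == required;
}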

--
All rights reversed.

2009-12-14 13:08:13

by Andi Kleen

Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

Rik van Riel <[email protected]> writes:

> +max_zone_concurrent_reclaim:
> +
> [...]
>
> +The default value is 8.

I don't like the hardcoded number. Is the same number good for a 128MB
embedded system as for a 1TB server? Seems doubtful.

This should be perhaps scaled with memory size and number of CPUs?

> +/*
> + * Maximum number of processes concurrently running the page
> + * reclaim code in a memory zone. Having too many processes
> + * just results in them burning CPU time waiting for locks,
> + * so we're better off limiting page reclaim to a sane number
> + * of processes at a time. We do this per zone so local node
> + * reclaim on one NUMA node will not block other nodes from
> + * making progress.
> + */
> +int max_zone_concurrent_reclaimers = 8;

__read_mostly
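
That is, a sketch of the suggested annotation, which keeps the rarely
written tunable out of frequently written cachelines:

int max_zone_concurrent_reclaimers __read_mostly = 8;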

> +
> static LIST_HEAD(shrinker_list);
> static DECLARE_RWSEM(shrinker_rwsem);
>
> @@ -1600,6 +1612,29 @@ static void shrink_zone(int priority, struct zone *zone,
> struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> int noswap = 0;
>
> + if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
> + max_zone_concurrent_reclaimers) {
> + /*
> + * Do not add to the lock contention if this zone has
> + * enough processes doing page reclaim already, since
> + * we would just make things slower.
> + */
> + sleep_on(&zone->reclaim_wait);

wait_event()? sleep_on is a really deprecated racy interface.

This would still badly thunder the herd if not enough memory is freed,
won't it? It would be better to only wake up a single process if memory
got freed.

How about doing a wakeup of one thread for each page freed?
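
A sketch of that combination (illustrative only, not the posted patch):

/*
 * An exclusive, race-free wait replaces sleep_on(), and the waker
 * wakes exactly one sleeper per freed page instead of the whole herd.
 */
DEFINE_WAIT(wait);

for (;;) {
	prepare_to_wait_exclusive(&zone->reclaim_wait, &wait,
				  TASK_UNINTERRUPTIBLE);
	if (atomic_read(&zone->concurrent_reclaimers) <
				max_zone_concurrent_reclaimers ||
	    zone_watermark_ok(zone, sc->order,
			      low_wmark_pages(zone), 0, 0))
		break;
	schedule();
}
finish_wait(&zone->reclaim_wait, &wait);

/* On the freeing side, one wakeup per freed page: */
if (waitqueue_active(&zone->reclaim_wait))
	wake_up_nr(&zone->reclaim_wait, 1);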


-Andi
--
[email protected] -- Speaking for myself only.

2009-12-14 13:15:08

by Christoph Hellwig

Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

On Thu, Dec 10, 2009 at 06:56:26PM -0500, Rik van Riel wrote:
> Under very heavy multi-process workloads, like AIM7, the VM can
> get into trouble in a variety of ways. The trouble starts when
> there are hundreds, or even thousands of processes active in the
> page reclaim code.
>
> Not only can the system suffer enormous slowdowns because of
> lock contention (and conditional reschedules) between thousands
> of processes in the page reclaim code, but each process will try
> to free up to SWAP_CLUSTER_MAX pages, even when the system already
> has lots of memory free. In Larry's case, this resulted in over
> 6000 processes fighting over locks in the page reclaim code, even
> though the system already had 1.5GB of free memory.
>
> It should be possible to avoid both of those issues at once, by
> simply limiting how many processes are active in the page reclaim
> code simultaneously.
>

This sounds like a very good argument against using direct reclaim at
all. It reminds me a bit of the issue we had in XFS with lots of processes
pushing the AIL and causing massive slowdowns due to lock contention
and cacheline bouncing. Moving all the AIL pushing into a dedicated
thread solved that nicely. In the VM we already have that dedicated
per-node kswapd thread, so moving as much work as possible off to it
should be equivalent.

Of course any of this kind of tuning really requires a lot of testing
and benchmarking to verify those assumptions.

2009-12-14 14:20:10

by Larry Woodman

Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

On Mon, 2009-12-14 at 08:14 -0500, Christoph Hellwig wrote:
> On Thu, Dec 10, 2009 at 06:56:26PM -0500, Rik van Riel wrote:
> > [...]
>
> This sounds like a very good argument against using direct reclaim at
> all. It reminds me a bit of the issue we had in XFS with lots of processes
> pushing the AIL and causing massive slowdowns due to lock contention
> and cacheline bouncing. Moving all the AIL pushing into a dedicated
> thread solved that nicely. In the VM we already have that dedicated
> per-node kswapd thread, so moving as much work as possible off to it
> should be equivalent.

Some of the new systems have 16 CPUs per node.


2009-12-14 14:21:14

by Larry Woodman

Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

On Mon, 2009-12-14 at 14:08 +0100, Andi Kleen wrote:
> Rik van Riel <[email protected]> writes:
>
> > +max_zone_concurrent_reclaim:
> > [...]
> >
> > +The default value is 8.
>
> I don't like the hardcoded number. Is the same number good for a 128MB
> embedded system as for a 1TB server? Seems doubtful.
>
> This should be perhaps scaled with memory size and number of CPUs?

Remember, this is a per-zone number.


2009-12-14 14:40:54

by Rik van Riel

Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

On 12/14/2009 08:08 AM, Andi Kleen wrote:
> Rik van Riel<[email protected]> writes:
>
>> +max_zone_concurrent_reclaim:
>> [...]
>>
>> +The default value is 8.
>
> I don't like the hardcoded number. Is the same number good for a 128MB
> embedded system as for a 1TB server? Seems doubtful.
>
> This should be perhaps scaled with memory size and number of CPUs?

The limit is per _zone_, so the number of concurrent reclaimers
is automatically scaled by the number of memory zones in the
system.

Scaling up the per-zone value as well looks like it could lead
to the kind of lock contention we are aiming to avoid in the
first place.

--
All rights reversed.

2009-12-14 14:52:37

by Rik van Riel

Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

On 12/14/2009 08:14 AM, Christoph Hellwig wrote:
> On Thu, Dec 10, 2009 at 06:56:26PM -0500, Rik van Riel wrote:

>> It should be possible to avoid both of those issues at once, by
>> simply limiting how many processes are active in the page reclaim
>> code simultaneously.
>>
>
> This sounds like a very good argument against using direct reclaim at
> all.

Not completely possible. Processes that call try_to_free_pages
without __GFP_FS or __GFP_IO in their gfp_flags may be holding
some kind of lock that kswapd could end up waiting on.

That means those tasks cannot wait on kswapd, because a deadlock
could happen.

Having said that, maybe we can get away with making direct
reclaim a limited subset of what it does today. Kswapd could
be the only process scanning and unmapping ptes, submitting
IOs and scanning the active list. Direct reclaim could limit
itself to reclaiming clean inactive pages and sleeping in
congestion_wait().
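
As an illustrative sketch of that split (shrink_inactive_clean() is an
invented name; congestion_wait() is the existing interface):

/*
 * Direct reclaim only takes already-clean inactive pages and then
 * throttles, leaving pte unmapping, IO submission and active list
 * scanning to kswapd.
 */
static unsigned long limited_direct_reclaim(struct zone *zone,
					    struct scan_control *sc)
{
	unsigned long reclaimed;

	/* Grab only clean, immediately freeable inactive pages. */
	reclaimed = shrink_inactive_clean(zone, sc);

	/* Still short?  Throttle and let kswapd do the dirty work. */
	if (reclaimed < SWAP_CLUSTER_MAX)
		congestion_wait(BLK_RW_ASYNC, HZ / 10);

	return reclaimed;
}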

--
All rights reversed.

2009-12-14 16:19:52

by Andi Kleen

Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

On Mon, Dec 14, 2009 at 09:23:19AM -0500, Larry Woodman wrote:
> On Mon, 2009-12-14 at 14:08 +0100, Andi Kleen wrote:
> > Rik van Riel <[email protected]> writes:
> >
> > > +max_zone_concurrent_reclaim:
> > > [...]
> > >
> > > +The default value is 8.
> >
> > I don't like the hardcoded number. Is the same number good for a 128MB
> > embedded system as for a 1TB server? Seems doubtful.
> >
> > This should be perhaps scaled with memory size and number of CPUs?
>
> Remember, this is a per-zone number.

A zone could be 64MB or 32GB. And the system could have 1 or 1024 CPUs.

-Andi

--
[email protected] -- Speaking for myself only.