2008-02-19 05:47:39

by KOSAKI Motohiro

[permalink] [raw]
Subject: [RFC][PATCH] the proposal of improve page reclaim by throttle


background
========================================
The current VM implementation doesn't limit the number of parallel reclaimers.
Under a heavy workload, this causes two bad things:
- heavy lock contention
- unnecessary swap out

About 2 months ago, KAMEZAWA Hiroyuki proposed a page reclaim
throttle patch and explained how it improves reclaim time.
http://marc.info/?l=linux-mm&m=119667465917215&w=2

Unfortunately, it only works for memory cgroup reclaim.
Today I implemented it again to support global reclaim and measured it.


test machine, method and result
==================================================
<test machine>
CPU: IA64 x8
MEM: 8GB
SWAP: 2GB

<test method>
I got hackbench from
http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c

$ /usr/bin/time hackbench 120 process 1000

These parameters (120 groups x 40 tasks = 4800 tasks) consume all physical
memory and 1GB of swap space in my test environment.

<test result (average of 3 measurements)>

before:
hackbench result: 282.30
/usr/bin/time result
user: 14.16
sys: 1248.47
elapse: 432.93
major fault: 29026
max parallel reclaim tasks: 1298
max consumption time of
try_to_free_pages(): 70394

after:
hackbench result: 30.36
/usr/bin/time result
user: 14.26
sys: 294.44
elapse: 118.01
major fault: 3064
max parallel reclaim tasks: 4
max consumption time of
try_to_free_pages(): 12234


conclusion
=========================================
This patch improves 4 things:
1. reduces unnecessary swapping
(see the major faults above; about 90% reduced)
2. improves throughput
(see the hackbench result above; about 90% reduced)
3. improves interactive performance
(see the max consumption time of try_to_free_pages() above;
about 80% reduced)
4. reduces lock contention
(see the sys time above; about 80% reduced)


Now, we get about a 1000% hackbench performance improvement :)



future work
==========================================================
- more discussion with memory controller guys.



Signed-off-by: KOSAKI Motohiro <[email protected]>
CC: KAMEZAWA Hiroyuki <[email protected]>
CC: Balbir Singh <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Lee Schermerhorn <[email protected]>

---
include/linux/nodemask.h | 1
mm/vmscan.c | 49 +++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 48 insertions(+), 2 deletions(-)

Index: b/include/linux/nodemask.h
===================================================================
--- a/include/linux/nodemask.h 2008-02-19 13:58:05.000000000 +0900
+++ b/include/linux/nodemask.h 2008-02-19 13:58:23.000000000 +0900
@@ -431,6 +431,7 @@ static inline int num_node_state(enum no

#define num_online_nodes() num_node_state(N_ONLINE)
#define num_possible_nodes() num_node_state(N_POSSIBLE)
+#define num_highmem_nodes() num_node_state(N_HIGH_MEMORY)
#define node_online(node) node_state((node), N_ONLINE)
#define node_possible(node) node_state((node), N_POSSIBLE)

Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c 2008-02-19 13:58:05.000000000 +0900
+++ b/mm/vmscan.c 2008-02-19 14:04:06.000000000 +0900
@@ -127,6 +127,11 @@ long vm_total_pages; /* The total number
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);

+static atomic_t nr_reclaimers = ATOMIC_INIT(0);
+static DECLARE_WAIT_QUEUE_HEAD(reclaim_throttle_waitq);
+#define RECLAIM_LIMIT (2 * num_highmem_nodes())
+
+
#ifdef CONFIG_CGROUP_MEM_CONT
#define scan_global_lru(sc) (!(sc)->mem_cgroup)
#else
@@ -1421,6 +1426,46 @@ out:
return ret;
}

+static unsigned long try_to_free_pages_throttled(struct zone **zones,
+ int order,
+ gfp_t gfp_mask,
+ struct scan_control *sc)
+{
+ unsigned long nr_reclaimed = 0;
+ unsigned long start_time;
+ int i;
+
+ start_time = jiffies;
+
+ wait_event(reclaim_throttle_waitq,
+ atomic_add_unless(&nr_reclaimers, 1, RECLAIM_LIMIT));
+
+ /* more reclaim until needed? */
+ if (unlikely(time_after(jiffies, start_time + HZ))) {
+ for (i = 0; zones[i] != NULL; i++) {
+ struct zone *zone = zones[i];
+ int classzone_idx = zone_idx(zones[0]);
+
+ if (!populated_zone(zone))
+ continue;
+
+ if (zone_watermark_ok(zone, order, 4*zone->pages_high,
+ classzone_idx, 0)) {
+ nr_reclaimed = 1;
+ goto out;
+ }
+ }
+ }
+
+ nr_reclaimed = do_try_to_free_pages(zones, gfp_mask, sc);
+
+out:
+ atomic_dec(&nr_reclaimers);
+ wake_up_all(&reclaim_throttle_waitq);
+
+ return nr_reclaimed;
+}
+
unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
{
struct scan_control sc = {
@@ -1434,7 +1479,7 @@ unsigned long try_to_free_pages(struct z
.isolate_pages = isolate_pages_global,
};

- return do_try_to_free_pages(zones, gfp_mask, &sc);
+ return try_to_free_pages_throttled(zones, order, gfp_mask, &sc);
}

#ifdef CONFIG_CGROUP_MEM_CONT
@@ -1456,7 +1501,7 @@ unsigned long try_to_free_mem_cgroup_pag
int target_zone = gfp_zone(GFP_HIGHUSER_MOVABLE);

zones = NODE_DATA(numa_node_id())->node_zonelists[target_zone].zones;
- if (do_try_to_free_pages(zones, sc.gfp_mask, &sc))
+ if (try_to_free_pages_throttled(zones, 0, sc.gfp_mask, &sc))
return 1;
return 0;
}
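
The heart of the patch is the wait_event()/atomic_add_unless() pair: together
they act as a counting gate that admits at most RECLAIM_LIMIT tasks into
direct reclaim at once. The snippet below only restates that gating pattern,
pulled out of the patch for readability; the helper names
reclaim_throttle_enter()/reclaim_throttle_exit() do not appear in the patch
itself.

#include <linux/wait.h>
#include <asm/atomic.h>

static atomic_t nr_reclaimers = ATOMIC_INIT(0);
static DECLARE_WAIT_QUEUE_HEAD(reclaim_throttle_waitq);

static void reclaim_throttle_enter(int limit)
{
        /*
         * atomic_add_unless() increments the counter and returns non-zero
         * only while it has not yet reached 'limit', so wait_event()
         * sleeps exactly while 'limit' reclaimers are already inside.
         */
        wait_event(reclaim_throttle_waitq,
                   atomic_add_unless(&nr_reclaimers, 1, limit));
}

static void reclaim_throttle_exit(void)
{
        /* leave the gate and wake anyone waiting to enter */
        atomic_dec(&nr_reclaimers);
        wake_up_all(&reclaim_throttle_waitq);
}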


2008-02-19 07:10:36

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

Hi Nick,

> Yeah this is definitely needed and a nice result.
>
> I'm worried about a) placing a global limit on parallelism, and b)
> placing a limit on parallelism at all.

Sorry, I don't understand yet.
Are a) and b) related?

>
> I think it should maybe be a per-zone thing...
>
> What happens if you make it a per-zone mutex, and allow just a single
> process to reclaim pages from a given zone at a time? I guess that is
> going to slow down throughput a little bit in some cases though...

That makes sense.

OK.
I'll repost after 2-3 days.

Thanks.

- kosaki

2008-02-19 08:04:01

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

On Tuesday 19 February 2008 16:44, KOSAKI Motohiro wrote:
> background
> ========================================
> The current VM implementation doesn't limit the number of parallel reclaimers.
> Under a heavy workload, this causes two bad things:
> - heavy lock contention
> - unnecessary swap out
>
> About 2 months ago, KAMEZAWA Hiroyuki proposed a page reclaim
> throttle patch and explained how it improves reclaim time.
> http://marc.info/?l=linux-mm&m=119667465917215&w=2
>
> Unfortunately, it only works for memory cgroup reclaim.
> Today I implemented it again to support global reclaim and measured it.

Hi,

Yeah this is definitely needed and a nice result.

I'm worried about a) placing a global limit on parallelism, and b)
placing a limit on parallelism at all.

I think it should maybe be a per-zone thing...

What happens if you make it a per-zone mutex, and allow just a single
process to reclaim pages from a given zone at a time? I guess that is
going to slow down throughput a little bit in some cases though...
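
A per-zone throttle along the lines suggested above could look roughly like
the sketch below. This is only an illustration of the idea under discussion,
not code from any posted patch: the reclaim_mutex field is a hypothetical
addition to struct zone (initialized with mutex_init() at zone setup), and
shrink_zone_pages() is a made-up stand-in for the per-zone shrinking done
inside do_try_to_free_pages().

#include <linux/mutex.h>
#include <linux/mmzone.h>
#include <linux/gfp.h>

/* hypothetical stand-in for the per-zone work in do_try_to_free_pages() */
static unsigned long shrink_zone_pages(struct zone *zone, int order,
                                       gfp_t gfp_mask);

static unsigned long reclaim_zones_serialized(struct zone **zones,
                                              int order, gfp_t gfp_mask)
{
        unsigned long nr_reclaimed = 0;
        int i;

        for (i = 0; zones[i] != NULL; i++) {
                struct zone *zone = zones[i];

                if (!populated_zone(zone))
                        continue;

                /*
                 * Allow only one reclaimer per zone.  mutex_trylock()
                 * lets other tasks skip a zone that is already being
                 * reclaimed instead of piling up behind it, which limits
                 * the throughput cost of full serialization.
                 */
                if (!mutex_trylock(&zone->reclaim_mutex))
                        continue;

                nr_reclaimed += shrink_zone_pages(zone, order, gfp_mask);

                mutex_unlock(&zone->reclaim_mutex);
        }

        return nr_reclaimed;
}

Whether a waiting mutex_lock() or a skipping mutex_trylock() is the right
behaviour is exactly the throughput trade-off mentioned above.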

2008-02-19 13:33:14

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

On Tue, 19 Feb 2008 17:34:59 +1100
Nick Piggin <[email protected]> wrote:

> On Tuesday 19 February 2008 16:44, KOSAKI Motohiro wrote:
> > background
> > ========================================
> > The current VM implementation doesn't limit the number of parallel reclaimers.
> > Under a heavy workload, this causes two bad things:
> > - heavy lock contention
> > - unnecessary swap out

> I think it should maybe be a per-zone thing...
>
> What happens if you make it a per-zone mutex, and allow just a single
> process to reclaim pages from a given zone at a time? I guess that is
> going to slow down throughput a little bit in some cases though...

I agree, doing things per zone will probably work better, because
that way one process can do page reclaim on every NUMA node at
the same time.

--
All rights reversed.

2008-02-20 08:56:18

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

Hi, KOSAKI.

I am very interested in your patch, so I want to test it with the exact
same method you used.
I will test it in an embedded environment (ARM920T, 32M RAM) and on my
desktop machine (Core2Duo 2.2GHz, 2GB RAM).

I guess this patch won't be very effective in an embedded environment,
since many embedded boards have just one processor and no swap device.

What I want to know is whether this patch causes a regression on UP
systems with no swap device, like embedded systems.
I don't think I can measure those fields with just top or freemem,
because top and freemem won't work well if the system is under heavy
page reclaim and swapping overhead.
So, how do I measure the following fields as you did?

* elapse (what do you mean by it?)
* major fault
* max parallel reclaim tasks:
* max consumption time of
try_to_free_pages():

If you have a patch for testing, please send it to me.

On Feb 19, 2008 2:44 PM, KOSAKI Motohiro <[email protected]> wrote:
> background
> ========================================
> The current VM implementation doesn't limit the number of parallel reclaimers.
> Under a heavy workload, this causes two bad things:
> - heavy lock contention
> - unnecessary swap out
>
> About 2 months ago, KAMEZAWA Hiroyuki proposed a page reclaim
> throttle patch and explained how it improves reclaim time.
> http://marc.info/?l=linux-mm&m=119667465917215&w=2
>
> Unfortunately, it only works for memory cgroup reclaim.
> Today I implemented it again to support global reclaim and measured it.



--
Thanks,
barrios

2008-02-20 09:27:09

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

Hi Kim-san

Did you adjust the hackbench parameters?
My parameters are tuned for my test machine (8GB memory);
if left unchanged, it probably won't work due to lack of memory.

> I am very interested in your patch, so I want to test it with the exact
> same method you used.
> I will test it in an embedded environment (ARM920T, 32M RAM) and on my
> desktop machine (Core2Duo 2.2GHz, 2GB RAM).

Hmm, I don't have an embedded test machine, but I can test on a desktop.
I will test it around the weekend.
If you don't mind, could you please send me your .config file
and tell me your test kernel version?

Thanks, interesting report.


> I guess this patch won't be very effective in an embedded environment,
> since many embedded boards have just one processor and no swap device.

Reclaim conflicts rarely happen on UP,
so I expect no improvement from my patch there.

But (of course) I will fix any regression.

> So, how do I measure the following fields as you did?
>
> * elapse (what do you mean by it?)
> * major fault

The /usr/bin/time command outputs those.


> * max parallel reclaim tasks:
> * max consumption time of
> try_to_free_pages():

Sorry, I had inserted debug code into my patch at that time.


2008-02-20 09:49:19

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

On Feb 20, 2008 6:24 PM, KOSAKI Motohiro <[email protected]> wrote:
> Hi Kim-san
>
> Did you adjust the hackbench parameters?
> My parameters are tuned for my test machine (8GB memory);
> if left unchanged, it probably won't work due to lack of memory.

I already adjusted it. :-)
But on my desktop, I couldn't get swap usage above half
(my swap device is 512M), because my kernel practically hung
before much swapping happened.
Perhaps it wasn't actually hung; however, even after waiting a very
long time, my box didn't respond at all.
I will keep trying.

> > I am very interested in your patch, so I want to test it with the exact
> > same method you used.
> > I will test it in an embedded environment (ARM920T, 32M RAM) and on my
> > desktop machine (Core2Duo 2.2GHz, 2GB RAM).
>
> Hmm, I don't have an embedded test machine, but I can test on a desktop.
> I will test it around the weekend.
> If you don't mind, could you please send me your .config file
> and tell me your test kernel version?

I mean I will test your patch myself,
because I already have an embedded board and a desktop.

> Thanks, interesting report.
>
>
> > I guess this patch won't be very effective in an embedded environment,
> > since many embedded boards have just one processor and no swap device.
>
> Reclaim conflicts rarely happen on UP,
> so I expect no improvement from my patch there.

I agree with you.

> But (of course) I will fix any regression.

I didn't say your patch had a regression;
I just mean that I am concerned about it.
Many VM developers work on server environments
and don't run performance tests on embedded systems,
yet such patches get merged into mainline.

That is what I am concerned about.

> > So, how do I measure the following fields as you did?
> >
> > * elapse (what do you mean by it?)
> > * major fault
>
> The /usr/bin/time command outputs those.
>
>
> > * max parallel reclaim tasks:
> > * max consumption time of
> > try_to_free_pages():
>
> Sorry, I had inserted debug code into my patch at that time.
>

Could you send me that debug code?
If you send it to me, I will test it in my environments (ARM920T, Core2Duo)
and report the results.

--
Thanks,
barrios

2008-02-20 10:15:47

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

Hi

> > > * max parallel reclaim tasks:
> > > * max consumption time of
> > > try_to_free_pages():
> >
> > Sorry, I had inserted debug code into my patch at that time.
>
> Could you send me that debug code?
> If you send it to me, I will test it in my environments (ARM920T, Core2Duo)
> and report the results.

I've attached it,
but it is very messy ;-)

usage:
./benchloop.sh

sample output
=========================================================
max reclaim 2
Running with 120*40 (== 4800) tasks.
Time: 34.177
14.17user 284.38system 1:43.85elapsed 287%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (3813major+148922minor)pagefaults 0swaps
max prepare time: 4599 0
max reclaim time: 2350 5781
total
8271
max reclaimer
4
max overkill
62131
max saved overkill
9740


"max reclaimer" corresponds to the max parallel reclaim tasks.
"total" corresponds to the max consumption time of try_to_free_pages().
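
Instrumentation for those two numbers can be quite small. The sketch below is
only a guess at the general shape and is not the attached debug patch itself:
an atomic counter tracks how many tasks are currently inside
try_to_free_pages(), a high-water mark records the peak, and the longest
single call is remembered in jiffies.

#include <asm/atomic.h>
#include <linux/jiffies.h>

static atomic_t cur_reclaimers = ATOMIC_INIT(0);
static int max_reclaimers;                      /* peak parallel reclaimers */
static unsigned long max_reclaim_jiffies;       /* longest single call */

/* call on entry to try_to_free_pages() */
static inline unsigned long reclaim_stat_enter(void)
{
        int cur = atomic_inc_return(&cur_reclaimers);

        if (cur > max_reclaimers)       /* racy, but good enough for debug */
                max_reclaimers = cur;
        return jiffies;
}

/* call on exit, with the value returned by reclaim_stat_enter() */
static inline void reclaim_stat_exit(unsigned long start)
{
        unsigned long took = jiffies - start;

        if (took > max_reclaim_jiffies)
                max_reclaim_jiffies = took;
        atomic_dec(&cur_reclaimers);
}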

Thanks


Attachments:
reclaim-throttle-3.patch (6.94 kB)
benchloop.sh (1.48 kB)
Download all attachments

2008-02-21 09:39:22

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

I missed the CCs, so I am resending.


First of all, I tried testing it on an embedded board.

---
<test machine>
CPU: 200MHz(ARM926EJ-S)
MEM: 32M
SWAP: none
KERNEL : 2.6.25-rc1

<test 1> - NO SWAP

before :

Running with 5*40 (== 200) tasks.

Time: 12.591
Command being timed: "./hackbench.arm 5 process 100"
User time (seconds): 0.78
System time(seconds): 13.39
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0m 14.22s
Major (requiring I/O) page faults: 20
max parallel reclaim tasks: 30
max consumption time of
try_to_free_pages(): 789

after:

Running with 5*40 (== 200) tasks.
Time: 11.535
Command being timed: "./hackbench.arm 5 process 100"
User time (seconds): 0.69
System time (seconds): 12.42
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0m 13.16s
Major (requiring I/O) page faults: 18
max parallel reclaim tasks: 4
max consumption time of
try_to_free_pages(): 740

<test 2> - SWAP
before:
Running with 6*40 (== 240) tasks.
Time: 121.686
Command being timed: "./hackbench.arm 6 process 100"
User time (seconds): 1.89
System time (seconds): 44.95
Percent of CPU this job got: 37%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2m 3.79s
Major (requiring I/O) page faults: 230
max parallel reclaim tasks: 56
max consumption time of
try_to_free_pages(): 10811


after :
Running with 6*40 (== 240) tasks.
Time: 67.757
Command being timed: "./hackbench.arm 6 process 100"
User time (seconds): 1.56
System time (seconds): 35.41
Percent of CPU this job got: 52%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1m 9.87s
Major (requiring I/O) page faults: 16
max parallel reclaim tasks: 4
max consumption time of
try_to_free_pages(): 6419

<test 3> - NO SWAP

before:

' OOM killer kill hackbench!!!'

after :
Time: 16.578
Command being timed: "./hackbench.arm 6 process 100"
User time (seconds): 0.71
System time (seconds): 17.92
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0m 18.69s
Major (requiring I/O) page faults: 22
max parallel reclaim tasks: 4
max consumption time of
try_to_free_pages(): 1785

===============================

It was a very interesting result.
On the embedded system, your patch improves performance a little in the
no-swap case (the normal case for embedded systems).
But more importantly, the OOM killer triggered when I created 240
processes on the vanilla kernel without a swap device.
When I applied your patch, it worked very well without any OOM.

I think that's because the zone's pages_scanned grew to six times the
number of LRU pages; as a result, OOM happened.

So I think your patch also helps on embedded systems.

In the cases where OOM didn't occur, reclaim performance without a swap
device was better than with one.
Now I think we need to improve the reclaim path for embedded systems
(UP and no swap).




--
Thanks,
barrios

2008-02-21 09:53:22

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

KOSAKI Motohiro wrote:
> background
> ========================================
> The current VM implementation doesn't limit the number of parallel reclaimers.
> Under a heavy workload, this causes two bad things:
> - heavy lock contention
> - unnecessary swap out
>
> About 2 months ago, KAMEZAWA Hiroyuki proposed a page reclaim
> throttle patch and explained how it improves reclaim time.
> http://marc.info/?l=linux-mm&m=119667465917215&w=2
>
> Unfortunately, it only works for memory cgroup reclaim.
> Today I implemented it again to support global reclaim and measured it.
>

Hi, Kosaki,

It's good to keep the main reclaim code and the memory controller reclaim in
sync, so this is a nice effort.

> @@ -1456,7 +1501,7 @@ unsigned long try_to_free_mem_cgroup_pag
> int target_zone = gfp_zone(GFP_HIGHUSER_MOVABLE);
>
> zones = NODE_DATA(numa_node_id())->node_zonelists[target_zone].zones;
> - if (do_try_to_free_pages(zones, sc.gfp_mask, &sc))
> + if (try_to_free_pages_throttled(zones, 0, sc.gfp_mask, &sc))
> return 1;
> return 0;
> }
>

try_to_free_pages_throttled() checks zone_watermark_ok(), which will not work
in the case where we are reclaiming from a cgroup that is over its limit. We need
a different check, to see whether the mem_cgroup is still over its limit or not.
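
One possible shape for that check, sketched only to illustrate the point:
keep the watermark test for global reclaim and use a cgroup-limit test
otherwise. scan_global_lru() already exists in the patched vmscan.c;
mem_cgroup_over_limit() is a hypothetical helper (no such function exists)
that would ask the memory controller whether the group is still above its
limit.

/* hypothetical early-exit test for try_to_free_pages_throttled(),
 * meant to live in mm/vmscan.c where struct scan_control is defined */
static int reclaimed_enough(struct scan_control *sc, struct zone *zone,
                            int order, int classzone_idx)
{
        if (scan_global_lru(sc))
                /* global reclaim: the zone watermark test from the patch */
                return zone_watermark_ok(zone, order, 4 * zone->pages_high,
                                         classzone_idx, 0);

        /* cgroup reclaim: stop early only once the group is back under
         * its limit (mem_cgroup_over_limit() is made up for this sketch) */
        return !mem_cgroup_over_limit(sc->mem_cgroup);
}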

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-02-21 10:55:51

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

Hi Kim-san,

Thank you very much.
By the way, what is the difference between <test 1> and <test 2>?

> It was a very interesting result.
> On the embedded system, your patch improves performance a little in the
> no-swap case (the normal case for embedded systems).
> But more importantly, the OOM killer triggered when I created 240
> processes on the vanilla kernel without a swap device.
> When I applied your patch, it worked very well without any OOM.

Wow, that is a very interesting result!
I am very happy.

> I think that's because the zone's pages_scanned grew to six times the
> number of LRU pages; as a result, OOM happened.

Please repost the question with a changed subject line.
I don't know the reason for the vanilla kernel's behavior, sorry.

2008-02-21 11:01:35

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

Hi Balbir-san

> It's good to keep the main reclaim code and the memory controller reclaim in
> sync, so this is a nice effort.

Thank you.
I will repost the next version (addressing Nick's comments) in a few days.

> > @@ -1456,7 +1501,7 @@ unsigned long try_to_free_mem_cgroup_pag
> > int target_zone = gfp_zone(GFP_HIGHUSER_MOVABLE);
> >
> > zones = NODE_DATA(numa_node_id())->node_zonelists[target_zone].zones;
> > - if (do_try_to_free_pages(zones, sc.gfp_mask, &sc))
> > + if (try_to_free_pages_throttled(zones, 0, sc.gfp_mask, &sc))
> > return 1;
> > return 0;
> > }
>
> try_to_free_pages_throttled() checks zone_watermark_ok(), which will not work
> in the case where we are reclaiming from a cgroup that is over its limit. We need
> a different check, to see whether the mem_cgroup is still over its limit or not.

That makes sense.

Unfortunately, I don't know the memory cgroup code very well.
What function should I use instead?

2008-02-21 11:07:27

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

KOSAKI Motohiro wrote:
> Hi balbir-san
>
>> It's good to keep the main reclaim code and the memory controller reclaim in
>> sync, so this is a nice effort.
>
> Thank you.
> I will repost the next version (addressing Nick's comments) in a few days.
>
>> > @@ -1456,7 +1501,7 @@ unsigned long try_to_free_mem_cgroup_pag
>> > int target_zone = gfp_zone(GFP_HIGHUSER_MOVABLE);
>> >
>> > zones = NODE_DATA(numa_node_id())->node_zonelists[target_zone].zones;
>> > - if (do_try_to_free_pages(zones, sc.gfp_mask, &sc))
>> > + if (try_to_free_pages_throttled(zones, 0, sc.gfp_mask, &sc))
>> > return 1;
>> > return 0;
>> > }
>>
>> try_to_free_pages_throttled() checks zone_watermark_ok(), which will not work
>> in the case where we are reclaiming from a cgroup that is over its limit. We need
>> a different check, to see whether the mem_cgroup is still over its limit or not.
>
> That makes sense.
>
> Unfortunately, I don't know the memory cgroup code very well.
> What function should I use instead?

One option could be that once the memory controller has this feature, we'll need
no changes in try_to_free_mem_cgroup_pages.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-02-21 12:29:42

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

On Thu, Feb 21, 2008 at 7:55 PM, KOSAKI Motohiro
<[email protected]> wrote:
> Hi Kim-san,
>
> Thank you very much.
> By the way, what is the difference between <test 1> and <test 2>?

<test 1> had no swap device, with 200 hackbench tasks.
<test 2> had a 32M swap device, with 240 hackbench tasks.
If <test 2> is run with no swap device and without your patch, it gets killed by the OOM killer.

<test 1> - NO SWAP
Running with 5*40 (== 200) tasks.
...
<test 2> - SWAP
Running with 6*40 (== 240) tasks.
...

>
> > It was a very interesting result.
> > On the embedded system, your patch improves performance a little in the
> > no-swap case (the normal case for embedded systems).
> > But more importantly, the OOM killer triggered when I created 240
> > processes on the vanilla kernel without a swap device.
> > When I applied your patch, it worked very well without any OOM.
>
> Wow, that is a very interesting result!
> I am very happy.
>
>
> > I think that's because the zone's pages_scanned grew to six times the
> > number of LRU pages; as a result, OOM happened.
>
> Please repost the question with a changed subject line.
> I don't know the reason for the vanilla kernel's behavior, sorry.

Normally, embedded Linux has only one zone (DMA).

Without your patch, several processes can reclaim memory in parallel,
so the DMA zone's pages_scanned suddenly increases a lot. Because
embedded Linux has no swap device, the kernel can't stop scanning the
LRU list until it finds a page cache page. So once zone->pages_scanned
grows to more than six times the number of LRU pages, the kernel marks
the zone unreclaimable, and as a result the OOM killer triggers.
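
For reference, the heuristic described above is roughly the following
(a paraphrase of that era's mm/vmscan.c, not an exact quote); with many
uncoordinated reclaimers all bumping pages_scanned, this threshold is reached
much sooner, and once every zone is flagged the allocator falls back to the
OOM killer.

        /* inside the reclaim path, per zone: */
        unsigned long lru_pages = zone_page_state(zone, NR_ACTIVE) +
                                  zone_page_state(zone, NR_INACTIVE);

        if (zone->pages_scanned >= lru_pages * 6)
                zone->all_unreclaimable = 1;    /* zone skipped from now on */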

--
Thanks,
barrios

2008-02-21 12:41:28

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

> > Please repost the question with a changed subject line.
> > I don't know the reason for the vanilla kernel's behavior, sorry.
>
> Normally, embedded Linux has only one zone (DMA).
>
> Without your patch, several processes can reclaim memory in parallel,
> so the DMA zone's pages_scanned suddenly increases a lot. Because
> embedded Linux has no swap device, the kernel can't stop scanning the
> LRU list until it finds a page cache page. So once zone->pages_scanned
> grows to more than six times the number of LRU pages, the kernel marks
> the zone unreclaimable, and as a result the OOM killer triggers.

Sorry, my last mail was confusing.
If you want to discuss a vanilla kernel bug, you should post it in a separate thread;
otherwise, your mail will only be read by a few people.

Thanks.