2009-11-01 15:10:10

by KOSAKI Motohiro

Subject: [PATCHv2 1/5] vmscan: separate sc.swap_cluster_max and sc.nr_to_reclaim

Currently, sc.swap_cluster_max has two meanings:

1) the reclaim batch size, passed as isolate_lru_pages()'s argument
2) the reclaim bail-out threshold

The two meanings are pretty unrelated, so let's separate them.
This patch doesn't change any behavior.
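To illustrate the split (a toy sketch in plain C, not the kernel code;
reclaim_batch() is a stand-in for isolate_lru_pages() + shrink_page_list()):

/* Stand-in for one isolate+reclaim pass; illustration only. */
static unsigned long reclaim_batch(unsigned long batch)
{
	return batch;
}

struct scan_control_model {
	unsigned long swap_cluster_max;	/* batch size per isolation pass */
	unsigned long nr_to_reclaim;	/* bail out once this many pages are freed */
};

static unsigned long shrink_model(struct scan_control_model *sc)
{
	unsigned long nr_reclaimed = 0;

	/* the batch size and the stop condition are now independent knobs */
	while (nr_reclaimed < sc->nr_to_reclaim)
		nr_reclaimed += reclaim_batch(sc->swap_cluster_max);

	return nr_reclaimed;
}

int main(void)
{
	struct scan_control_model sc = {
		.swap_cluster_max = 32,
		.nr_to_reclaim = 128,
	};
	/* frees 128 pages in 4 batches of 32 */
	return (int)(shrink_model(&sc) != 128);
}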

Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
mm/vmscan.c | 21 +++++++++++++++------
1 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f805958..6a3eb9f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -55,6 +55,9 @@ struct scan_control {
/* Number of pages freed so far during a call to shrink_zones() */
unsigned long nr_reclaimed;

+ /* How many pages shrink_list() should reclaim */
+ unsigned long nr_to_reclaim;
+
/* This context's GFP mask */
gfp_t gfp_mask;

@@ -1585,6 +1588,7 @@ static void shrink_zone(int priority, struct zone *zone,
enum lru_list l;
unsigned long nr_reclaimed = sc->nr_reclaimed;
unsigned long swap_cluster_max = sc->swap_cluster_max;
+ unsigned long nr_to_reclaim = sc->nr_to_reclaim;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
int noswap = 0;

@@ -1634,8 +1638,7 @@ static void shrink_zone(int priority, struct zone *zone,
* with multiple processes reclaiming pages, the total
* freeing target can get unreasonably large.
*/
- if (nr_reclaimed > swap_cluster_max &&
- priority < DEF_PRIORITY && !current_is_kswapd())
+ if (nr_reclaimed > nr_to_reclaim && priority < DEF_PRIORITY)
break;
}

@@ -1733,6 +1736,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
struct zoneref *z;
struct zone *zone;
enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
+ unsigned long writeback_threshold;

delayacct_freepages_start();

@@ -1768,7 +1772,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
}
}
total_scanned += sc->nr_scanned;
- if (sc->nr_reclaimed >= sc->swap_cluster_max) {
+ if (sc->nr_reclaimed >= sc->nr_to_reclaim) {
ret = sc->nr_reclaimed;
goto out;
}
@@ -1780,8 +1784,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
* that's undesirable in laptop mode, where we *want* lumpy
* writeout. So in laptop mode, write out the whole world.
*/
- if (total_scanned > sc->swap_cluster_max +
- sc->swap_cluster_max / 2) {
+ writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
+ if (total_scanned > writeback_threshold) {
wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
sc->may_writepage = 1;
}
@@ -1827,6 +1831,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.gfp_mask = gfp_mask,
.may_writepage = !laptop_mode,
.swap_cluster_max = SWAP_CLUSTER_MAX,
+ .nr_to_reclaim = SWAP_CLUSTER_MAX,
.may_unmap = 1,
.may_swap = 1,
.swappiness = vm_swappiness,
@@ -1885,6 +1890,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
.may_unmap = 1,
.may_swap = !noswap,
.swap_cluster_max = SWAP_CLUSTER_MAX,
+ .nr_to_reclaim = SWAP_CLUSTER_MAX,
.swappiness = swappiness,
.order = 0,
.mem_cgroup = mem_cont,
@@ -1932,6 +1938,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
.may_unmap = 1,
.may_swap = 1,
.swap_cluster_max = SWAP_CLUSTER_MAX,
+ .nr_to_reclaim = ULONG_MAX,
.swappiness = vm_swappiness,
.order = order,
.mem_cgroup = NULL,
@@ -2549,7 +2556,9 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
.may_swap = 1,
.swap_cluster_max = max_t(unsigned long, nr_pages,
- SWAP_CLUSTER_MAX),
+ SWAP_CLUSTER_MAX),
+ .nr_to_reclaim = max_t(unsigned long, nr_pages,
+ SWAP_CLUSTER_MAX),
.gfp_mask = gfp_mask,
.swappiness = vm_swappiness,
.order = order,
--
1.6.2.5



2009-11-01 15:10:36

by KOSAKI Motohiro

Subject: [PATCHv2 2/5] vmscan: Kill hibernation specific reclaim logic and unify it

shrink_all_zones() was introduced by commit d6277db4ab (swsusp: rework
memory shrinker) for hibernation performance improvement, and
sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
memory for suspend).

commit a06fe4d307 said

Without the patch:
Freed 14600 pages in 1749 jiffies = 32.61 MB/s (Anomolous!)
Freed 88563 pages in 14719 jiffies = 23.50 MB/s
Freed 205734 pages in 32389 jiffies = 24.81 MB/s

With the patch:
Freed 68252 pages in 496 jiffies = 537.52 MB/s
Freed 116464 pages in 569 jiffies = 798.54 MB/s
Freed 209699 pages in 705 jiffies = 1161.89 MB/s

At that time, the patch was well worth it. However, modern hardware trends
and recent VM improvements have eroded its value. For several reasons,
I think we should remove shrink_all_zones() entirely.

detail:

1) In the old days, shrink_zone()'s slowness was mainly caused by pointless
   io-throttling even when there was no I/O congestion,
   but the current shrink_zone() is sane, not slow.

2) shrink_all_zones() tries to shrink all pages at a time, but that doesn't
   work well on NUMA systems.
   example)
   A system has 4GB of memory, each node has 2GB, and hibernation needs 1GB.

   optimal)
	steal 500MB from each node.
   shrink_all_zones)
	steal 1GB from node-0.

   Oh, the cache balancing logic is broken. ;)
   Unfortunately, desktop systems have moved to NUMA nowadays.
   (Side note: if hibernation requires 2GB, shrink_all_zones() can never
   succeed on the above machine.)

3) If a node has several pages with I/O in flight, shrink_all_zones()
   produces a pretty bad result.

   scenario) hibernation needs 1GB

   1) shrink_all_zones() tries to reclaim 1GB from Node-0
   2) but it only reclaims 990MB
   3) stupidly, shrink_all_zones() then tries to reclaim 1GB from Node-1
   4) it reclaims 990MB

   Oh well, it reclaimed almost twice as much as required (see the toy
   model sketched below).
   On the other hand, the current shrink_zone() has sane bail-out logic,
   so it doesn't do overkill reclaim; we no longer carry shrink_all_zones()'s
   risk.

4) The split-LRU VM always keeps the active/inactive ratio very carefully.
   Shrinking only the inactive list breaks that assumption and adds an
   unnecessary OOM risk. It is obviously suboptimal.

Thus, this patch changes shrink_all_memory() into just a wrapper around
do_try_to_free_pages(). That brings good reviewability and debuggability,
and solves the above problems.
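To make point 3 concrete, here is a toy model in plain C (illustration only,
using the numbers from the scenario above):

#include <stdio.h>

/* At most 'avail' pages can actually be freed on a node. */
static unsigned long reclaim_node(unsigned long want, unsigned long avail)
{
	return want < avail ? want : avail;
}

int main(void)
{
	unsigned long target = 1000;			/* "1GB" */
	unsigned long node_avail[2] = { 990, 990 };	/* "990MB" per node */
	unsigned long old = 0, unified = 0;
	int n;

	/* shrink_all_zones() style: retry the full target on every node */
	for (n = 0; n < 2; n++)
		old += reclaim_node(target, node_avail[n]);

	/* unified style: one global target with a bail-out check */
	for (n = 0; n < 2 && unified < target; n++)
		unified += reclaim_node(target - unified, node_avail[n]);

	/* prints: old: 1980, unified: 1000 (target 1000) */
	printf("old: %lu, unified: %lu (target %lu)\n", old, unified, target);
	return 0;
}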

Side note: unifying the reclaim logic has two good side effects.
- It fixes a recursive reclaim bug in shrink_all_memory():
  it forgot to set PF_MEMALLOC, meaning the system could get stuck in a deadlock.
- shrink_all_memory() is now lockdep-aware, which brings good debuggability.

Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rik van Riel <[email protected]>
---
mm/vmscan.c | 156 +++++++++++------------------------------------------------
1 files changed, 28 insertions(+), 128 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6a3eb9f..9862c04 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -58,6 +58,8 @@ struct scan_control {
/* How many pages shrink_list() should reclaim */
unsigned long nr_to_reclaim;

+ unsigned long hibernation_mode;
+
/* This context's GFP mask */
gfp_t gfp_mask;

@@ -1748,7 +1750,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
if (scanning_global_lru(sc)) {
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {

- if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
+ if (!sc->hibernation_mode &&
+ !cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;

lru_pages += zone_reclaimable_pages(zone);
@@ -1791,7 +1794,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
}

/* Take a nap, wait for some writeback to complete */
- if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
+ if (!sc->hibernation_mode && sc->nr_scanned &&
+ priority < DEF_PRIORITY - 2)
congestion_wait(BLK_RW_ASYNC, HZ/10);
}
/* top priority shrink_zones still had more to do? don't OOM, then */
@@ -2262,148 +2266,44 @@ unsigned long zone_reclaimable_pages(struct zone *zone)

#ifdef CONFIG_HIBERNATION
/*
- * Helper function for shrink_all_memory(). Tries to reclaim 'nr_pages' pages
- * from LRU lists system-wide, for given pass and priority.
- *
- * For pass > 3 we also try to shrink the LRU lists that contain a few pages
- */
-static void shrink_all_zones(unsigned long nr_pages, int prio,
- int pass, struct scan_control *sc)
-{
- struct zone *zone;
- unsigned long nr_reclaimed = 0;
- struct zone_reclaim_stat *reclaim_stat;
-
- for_each_populated_zone(zone) {
- enum lru_list l;
-
- if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY)
- continue;
-
- for_each_evictable_lru(l) {
- enum zone_stat_item ls = NR_LRU_BASE + l;
- unsigned long lru_pages = zone_page_state(zone, ls);
-
- /* For pass = 0, we don't shrink the active list */
- if (pass == 0 && (l == LRU_ACTIVE_ANON ||
- l == LRU_ACTIVE_FILE))
- continue;
-
- reclaim_stat = get_reclaim_stat(zone, sc);
- reclaim_stat->nr_saved_scan[l] +=
- (lru_pages >> prio) + 1;
- if (reclaim_stat->nr_saved_scan[l]
- >= nr_pages || pass > 3) {
- unsigned long nr_to_scan;
-
- reclaim_stat->nr_saved_scan[l] = 0;
- nr_to_scan = min(nr_pages, lru_pages);
- nr_reclaimed += shrink_list(l, nr_to_scan, zone,
- sc, prio);
- if (nr_reclaimed >= nr_pages) {
- sc->nr_reclaimed += nr_reclaimed;
- return;
- }
- }
- }
- }
- sc->nr_reclaimed += nr_reclaimed;
-}
-
-/*
- * Try to free `nr_pages' of memory, system-wide, and return the number of
+ * Try to free `nr_to_reclaim' of memory, system-wide, and return the number of
* freed pages.
*
* Rather than trying to age LRUs the aim is to preserve the overall
* LRU order by reclaiming preferentially
* inactive > active > active referenced > active mapped
*/
-unsigned long shrink_all_memory(unsigned long nr_pages)
+unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
{
- unsigned long lru_pages, nr_slab;
- int pass;
struct reclaim_state reclaim_state;
struct scan_control sc = {
- .gfp_mask = GFP_KERNEL,
- .may_unmap = 0,
+ .gfp_mask = GFP_HIGHUSER_MOVABLE,
+ .may_swap = 1,
+ .may_unmap = 1,
.may_writepage = 1,
+ .swap_cluster_max = SWAP_CLUSTER_MAX,
+ .nr_to_reclaim = nr_to_reclaim,
+ .hibernation_mode = 1,
+ .swappiness = vm_swappiness,
+ .order = 0,
.isolate_pages = isolate_pages_global,
- .nr_reclaimed = 0,
};
+ struct zonelist * zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
+ struct task_struct *p = current;
+ unsigned long nr_reclaimed;

- current->reclaim_state = &reclaim_state;
-
- lru_pages = global_reclaimable_pages();
- nr_slab = global_page_state(NR_SLAB_RECLAIMABLE);
- /* If slab caches are huge, it's better to hit them first */
- while (nr_slab >= lru_pages) {
- reclaim_state.reclaimed_slab = 0;
- shrink_slab(nr_pages, sc.gfp_mask, lru_pages);
- if (!reclaim_state.reclaimed_slab)
- break;
-
- sc.nr_reclaimed += reclaim_state.reclaimed_slab;
- if (sc.nr_reclaimed >= nr_pages)
- goto out;
-
- nr_slab -= reclaim_state.reclaimed_slab;
- }
-
- /*
- * We try to shrink LRUs in 5 passes:
- * 0 = Reclaim from inactive_list only
- * 1 = Reclaim from active list but don't reclaim mapped
- * 2 = 2nd pass of type 1
- * 3 = Reclaim mapped (normal reclaim)
- * 4 = 2nd pass of type 3
- */
- for (pass = 0; pass < 5; pass++) {
- int prio;
-
- /* Force reclaiming mapped pages in the passes #3 and #4 */
- if (pass > 2)
- sc.may_unmap = 1;
-
- for (prio = DEF_PRIORITY; prio >= 0; prio--) {
- unsigned long nr_to_scan = nr_pages - sc.nr_reclaimed;
-
- sc.nr_scanned = 0;
- sc.swap_cluster_max = nr_to_scan;
- shrink_all_zones(nr_to_scan, prio, pass, &sc);
- if (sc.nr_reclaimed >= nr_pages)
- goto out;
-
- reclaim_state.reclaimed_slab = 0;
- shrink_slab(sc.nr_scanned, sc.gfp_mask,
- global_reclaimable_pages());
- sc.nr_reclaimed += reclaim_state.reclaimed_slab;
- if (sc.nr_reclaimed >= nr_pages)
- goto out;
-
- if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
- congestion_wait(BLK_RW_ASYNC, HZ / 10);
- }
- }
-
- /*
- * If sc.nr_reclaimed = 0, we could not shrink LRUs, but there may be
- * something in slab caches
- */
- if (!sc.nr_reclaimed) {
- do {
- reclaim_state.reclaimed_slab = 0;
- shrink_slab(nr_pages, sc.gfp_mask,
- global_reclaimable_pages());
- sc.nr_reclaimed += reclaim_state.reclaimed_slab;
- } while (sc.nr_reclaimed < nr_pages &&
- reclaim_state.reclaimed_slab > 0);
- }
+ p->flags |= PF_MEMALLOC;
+ lockdep_set_current_reclaim_state(sc.gfp_mask);
+ reclaim_state.reclaimed_slab = 0;
+ p->reclaim_state = &reclaim_state;

+ nr_reclaimed = do_try_to_free_pages(zonelist, &sc);

-out:
- current->reclaim_state = NULL;
+ p->reclaim_state = NULL;
+ lockdep_clear_current_reclaim_state();
+ p->flags &= ~PF_MEMALLOC;

- return sc.nr_reclaimed;
+ return nr_reclaimed;
}
#endif /* CONFIG_HIBERNATION */

--
1.6.2.5


2009-11-01 15:12:19

by KOSAKI Motohiro

Subject: [PATCHv2 3/5] vmscan: Stop zone_reclaim()'s wrong swap_cluster_max usage

In the old days, we didn't have sc.nr_to_reclaim, and that invited
sc.swap_cluster_max misuse.

A huge sc.swap_cluster_max can cause unnecessary OOM with
no performance benefit.

Now we can remove this dangerous usage, as the illustration below shows.
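For illustration (a toy program, not kernel code): here is what the old code
asked isolate_lru_pages() to take in one batch for a large-order allocation,
versus the new split:

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL
#define max_t(type, a, b) ((type)(a) > (type)(b) ? (type)(a) : (type)(b))

int main(void)
{
	unsigned int order = 9;			/* e.g. a 2MB huge page */
	unsigned long nr_pages = 1UL << order;	/* 512 pages */

	/* before: the isolation batch scaled with the allocation size */
	printf("old batch: %lu pages\n",
	       max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX));

	/* after: the batch stays SWAP_CLUSTER_MAX; only the bail-out
	 * target (nr_to_reclaim) scales with the allocation */
	printf("new batch: %lu pages, target: %lu pages\n",
	       SWAP_CLUSTER_MAX,
	       max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX));
	return 0;
}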

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9862c04..e463d48 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2455,8 +2455,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
.may_swap = 1,
- .swap_cluster_max = max_t(unsigned long, nr_pages,
- SWAP_CLUSTER_MAX),
+ .swap_cluster_max = SWAP_CLUSTER_MAX,
.nr_to_reclaim = max_t(unsigned long, nr_pages,
SWAP_CLUSTER_MAX),
.gfp_mask = gfp_mask,
--
1.6.2.5


2009-11-01 15:13:08

by KOSAKI Motohiro

Subject: [PATCHv2 4/5] vmscan: Kill sc.swap_cluster_max

Now all callers set swap_cluster_max = SWAP_CLUSTER_MAX,
so we can remove the field entirely.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 26 ++++++--------------------
1 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e463d48..7bdf4f0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -71,12 +71,6 @@ struct scan_control {
/* Can pages be swapped as part of reclaim? */
int may_swap;

- /* This context's SWAP_CLUSTER_MAX. If freeing memory for
- * suspend, we effectively ignore SWAP_CLUSTER_MAX.
- * In this context, it doesn't matter that we scan the
- * whole list at once. */
- int swap_cluster_max;
-
int swappiness;

int all_unreclaimable;
@@ -1127,7 +1121,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
unsigned long nr_anon;
unsigned long nr_file;

- nr_taken = sc->isolate_pages(sc->swap_cluster_max,
+ nr_taken = sc->isolate_pages(SWAP_CLUSTER_MAX,
&page_list, &nr_scan, sc->order, mode,
zone, sc->mem_cgroup, 0, file);

@@ -1562,15 +1556,14 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
* until we collected @swap_cluster_max pages to scan.
*/
static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
- unsigned long *nr_saved_scan,
- unsigned long swap_cluster_max)
+ unsigned long *nr_saved_scan)
{
unsigned long nr;

*nr_saved_scan += nr_to_scan;
nr = *nr_saved_scan;

- if (nr >= swap_cluster_max)
+ if (nr >= SWAP_CLUSTER_MAX)
*nr_saved_scan = 0;
else
nr = 0;
@@ -1589,7 +1582,6 @@ static void shrink_zone(int priority, struct zone *zone,
unsigned long percent[2]; /* anon @ 0; file @ 1 */
enum lru_list l;
unsigned long nr_reclaimed = sc->nr_reclaimed;
- unsigned long swap_cluster_max = sc->swap_cluster_max;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
int noswap = 0;
@@ -1612,8 +1604,7 @@ static void shrink_zone(int priority, struct zone *zone,
scan = (scan * percent[file]) / 100;
}
nr[l] = nr_scan_try_batch(scan,
- &reclaim_stat->nr_saved_scan[l],
- swap_cluster_max);
+ &reclaim_stat->nr_saved_scan[l]);
}

trace_printk("sz p=%d %d:%s %lu:%lu [%lu %lu %lu %lu] \n",
@@ -1625,7 +1616,8 @@ static void shrink_zone(int priority, struct zone *zone,
nr[LRU_INACTIVE_FILE]) {
for_each_evictable_lru(l) {
if (nr[l]) {
- nr_to_scan = min(nr[l], swap_cluster_max);
+ nr_to_scan = min_t(unsigned long,
+ nr[l], SWAP_CLUSTER_MAX);
nr[l] -= nr_to_scan;

nr_reclaimed += shrink_list(l, nr_to_scan,
@@ -1834,7 +1826,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
struct scan_control sc = {
.gfp_mask = gfp_mask,
.may_writepage = !laptop_mode,
- .swap_cluster_max = SWAP_CLUSTER_MAX,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
.may_unmap = 1,
.may_swap = 1,
@@ -1859,7 +1850,6 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
.may_writepage = !laptop_mode,
.may_unmap = 1,
.may_swap = !noswap,
- .swap_cluster_max = SWAP_CLUSTER_MAX,
.swappiness = swappiness,
.order = 0,
.mem_cgroup = mem,
@@ -1893,7 +1883,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
.may_writepage = !laptop_mode,
.may_unmap = 1,
.may_swap = !noswap,
- .swap_cluster_max = SWAP_CLUSTER_MAX,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
.swappiness = swappiness,
.order = 0,
@@ -1941,7 +1930,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
.gfp_mask = GFP_KERNEL,
.may_unmap = 1,
.may_swap = 1,
- .swap_cluster_max = SWAP_CLUSTER_MAX,
.nr_to_reclaim = ULONG_MAX,
.swappiness = vm_swappiness,
.order = order,
@@ -2281,7 +2269,6 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
.may_swap = 1,
.may_unmap = 1,
.may_writepage = 1,
- .swap_cluster_max = SWAP_CLUSTER_MAX,
.nr_to_reclaim = nr_to_reclaim,
.hibernation_mode = 1,
.swappiness = vm_swappiness,
@@ -2455,7 +2442,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
.may_swap = 1,
- .swap_cluster_max = SWAP_CLUSTER_MAX,
.nr_to_reclaim = max_t(unsigned long, nr_pages,
SWAP_CLUSTER_MAX),
.gfp_mask = gfp_mask,
--
1.6.2.5


2009-11-01 15:13:21

by Rik van Riel

Subject: Re: [PATCHv2 2/5] vmscan: Kill hibernation specific reclaim logic and unify it

On 11/01/2009 10:09 AM, KOSAKI Motohiro wrote:

> Signed-off-by: KOSAKI Motohiro<[email protected]>
> Cc: Rafael J. Wysocki<[email protected]>
> Cc: Rik van Riel<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed.

2009-11-01 15:13:51

by KOSAKI Motohiro

Subject: [PATCHv2 5/5][nit fix] vmscan: Make reclaim bail-out consistent between do_try_to_free_pages and shrink_zone

Fix a small inconsistency between ">" and ">=".

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7bdf4f0..e6ea011 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1632,7 +1632,7 @@ static void shrink_zone(int priority, struct zone *zone,
* with multiple processes reclaiming pages, the total
* freeing target can get unreasonably large.
*/
- if (nr_reclaimed > nr_to_reclaim && priority < DEF_PRIORITY)
+ if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
break;
}

--
1.6.2.5


2009-11-01 17:52:22

by Rik van Riel

Subject: Re: [PATCHv2 3/5] vmscan: Stop zone_reclaim()'s wrong swap_cluster_max usage

On 11/01/2009 10:11 AM, KOSAKI Motohiro wrote:
> In the old days, we didn't have sc.nr_to_reclaim, and that invited
> sc.swap_cluster_max misuse.
>
> A huge sc.swap_cluster_max can cause unnecessary OOM with
> no performance benefit.
>
> Now we can remove this dangerous usage.
>
> Signed-off-by: KOSAKI Motohiro<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed.

2009-11-01 17:56:58

by Rik van Riel

Subject: Re: [PATCHv2 4/5] vmscan: Kill sc.swap_cluster_max

On 11/01/2009 10:12 AM, KOSAKI Motohiro wrote:
> Now all callers set swap_cluster_max = SWAP_CLUSTER_MAX,
> so we can remove the field entirely.
>
> Signed-off-by: KOSAKI Motohiro<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed.

2009-11-01 17:58:30

by Rik van Riel

Subject: Re: [PATCHv2 5/5][nit fix] vmscan: Make reclaim bail-out consistent between do_try_to_free_pages and shrink_zone

On 11/01/2009 10:13 AM, KOSAKI Motohiro wrote:
> Fix a small inconsistency between ">" and ">=".

Nice catch.

> Signed-off-by: KOSAKI Motohiro<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed.

2009-11-01 21:36:50

by Rafael J. Wysocki

Subject: Re: [PATCHv2 2/5] vmscan: Kill hibernation specific reclaim logic and unify it

On Sunday 01 November 2009, KOSAKI Motohiro wrote:
> shrink_all_zones() was introduced by commit d6277db4ab (swsusp: rework
> memory shrinker) for hibernation performance improvement, and
> sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
> memory for suspend).
>
> commit a06fe4d307 said
>
> Without the patch:
> Freed 14600 pages in 1749 jiffies = 32.61 MB/s (Anomolous!)
> Freed 88563 pages in 14719 jiffies = 23.50 MB/s
> Freed 205734 pages in 32389 jiffies = 24.81 MB/s
>
> With the patch:
> Freed 68252 pages in 496 jiffies = 537.52 MB/s
> Freed 116464 pages in 569 jiffies = 798.54 MB/s
> Freed 209699 pages in 705 jiffies = 1161.89 MB/s
>
> At that time, the patch was well worth it. However, modern hardware trends
> and recent VM improvements have eroded its value. For several reasons,
> I think we should remove shrink_all_zones() entirely.
>
> detail:
>
> 1) In the old days, shrink_zone()'s slowness was mainly caused by pointless
> io-throttling even when there was no I/O congestion,
> but the current shrink_zone() is sane, not slow.
>
> 2) shrink_all_zones() tries to shrink all pages at a time, but that doesn't
> work well on NUMA systems.
> example)
> A system has 4GB of memory, each node has 2GB, and hibernation needs 1GB.
>
> optimal)
> steal 500MB from each node.
> shrink_all_zones)
> steal 1GB from node-0.
>
> Oh, the cache balancing logic is broken. ;)
> Unfortunately, desktop systems have moved to NUMA nowadays.
> (Side note: if hibernation requires 2GB, shrink_all_zones() can never
> succeed on the above machine.)
>
> 3) If a node has several pages with I/O in flight, shrink_all_zones()
> produces a pretty bad result.
>
> scenario) hibernation needs 1GB
>
> 1) shrink_all_zones() tries to reclaim 1GB from Node-0
> 2) but it only reclaims 990MB
> 3) stupidly, shrink_all_zones() then tries to reclaim 1GB from Node-1
> 4) it reclaims 990MB
>
> Oh well, it reclaimed almost twice as much as required.
> On the other hand, the current shrink_zone() has sane bail-out logic,
> so it doesn't do overkill reclaim; we no longer carry shrink_all_zones()'s
> risk.
>
> 4) The split-LRU VM always keeps the active/inactive ratio very carefully.
> Shrinking only the inactive list breaks that assumption and adds an
> unnecessary OOM risk. It is obviously suboptimal.
>
> Thus, this patch changes shrink_all_memory() into just a wrapper around
> do_try_to_free_pages(). That brings good reviewability and debuggability,
> and solves the above problems.
>
> Side note: unifying the reclaim logic has two good side effects.
> - It fixes a recursive reclaim bug in shrink_all_memory():
> it forgot to set PF_MEMALLOC, meaning the system could get stuck in a deadlock.
> - shrink_all_memory() is now lockdep-aware, which brings good debuggability.

As I said previously, I don't really see a reason to keep shrink_all_memory().

Do you think that removing it will result in performance degradation?

Rafael

2009-11-01 22:00:28

by Nigel Cunningham

[permalink] [raw]
Subject: Re: [PATCHv2 2/5] vmscan: Kill hibernation specific reclaim logic and unify it

Hi.

KOSAKI Motohiro wrote:
> shrink_all_zones() was introduced by commit d6277db4ab (swsusp: rework
> memory shrinker) for hibernation performance improvement, and
> sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
> memory for suspend).
>
> commit a06fe4d307 said
>
> Without the patch:
> Freed 14600 pages in 1749 jiffies = 32.61 MB/s (Anomolous!)
> Freed 88563 pages in 14719 jiffies = 23.50 MB/s
> Freed 205734 pages in 32389 jiffies = 24.81 MB/s
>
> With the patch:
> Freed 68252 pages in 496 jiffies = 537.52 MB/s
> Freed 116464 pages in 569 jiffies = 798.54 MB/s
> Freed 209699 pages in 705 jiffies = 1161.89 MB/s
>
> At that time, the patch was well worth it. However, modern hardware trends
> and recent VM improvements have eroded its value. For several reasons,
> I think we should remove shrink_all_zones() entirely.
>
> detail:
>
> 1) In the old days, shrink_zone()'s slowness was mainly caused by pointless
> io-throttling even when there was no I/O congestion,
> but the current shrink_zone() is sane, not slow.
>
> 2) shrink_all_zones() tries to shrink all pages at a time, but that doesn't
> work well on NUMA systems.
> example)
> A system has 4GB of memory, each node has 2GB, and hibernation needs 1GB.
>
> optimal)
> steal 500MB from each node.
> shrink_all_zones)
> steal 1GB from node-0.

I haven't given much thought to numa awareness in hibernate code, but I
can say that the shrink_all_memory interface is woefully inadequate as
far as zone awareness goes. Since lowmem needs to be atomically restored
before we can restore highmem, we really need to be able to ask for a
particular number of pages of a particular zone type to be freed.

> Oh, the cache balancing logic is broken. ;)
> Unfortunately, desktop systems have moved to NUMA nowadays.
> (Side note: if hibernation requires 2GB, shrink_all_zones() can never
> succeed on the above machine.)
>
> 3) If a node has several pages with I/O in flight, shrink_all_zones()
> produces a pretty bad result.
>
> scenario) hibernation needs 1GB
>
> 1) shrink_all_zones() tries to reclaim 1GB from Node-0
> 2) but it only reclaims 990MB
> 3) stupidly, shrink_all_zones() then tries to reclaim 1GB from Node-1
> 4) it reclaims 990MB
>
> Oh well, it reclaimed almost twice as much as required.
> On the other hand, the current shrink_zone() has sane bail-out logic,
> so it doesn't do overkill reclaim; we no longer carry shrink_all_zones()'s risk.

Yes, this is bad.

> 4) The split-LRU VM always keeps the active/inactive ratio very carefully.
> Shrinking only the inactive list breaks that assumption and adds an
> unnecessary OOM risk. It is obviously suboptimal.

I don't follow your logic here. Without being a mm expert, I'd imagine
that it shouldn't matter that much if most of the inactive list gets freed.

Regards,

Nigel

2009-11-02 00:37:50

by Minchan Kim

Subject: Re: [PATCHv2 1/5] vmscan: separate sc.swap_cluster_max and sc.nr_to_reclaim

Hi, Kosaki.

On Mon, 2 Nov 2009 00:08:44 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> Currently, sc.swap_cluster_max has two meanings:
>
> 1) the reclaim batch size, passed as isolate_lru_pages()'s argument
> 2) the reclaim bail-out threshold
>
> The two meanings are pretty unrelated, so let's separate them.
> This patch doesn't change any behavior.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> ---
> mm/vmscan.c | 21 +++++++++++++++------
> 1 files changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f805958..6a3eb9f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -55,6 +55,9 @@ struct scan_control {
> /* Number of pages freed so far during a call to shrink_zones() */
> unsigned long nr_reclaimed;
>
> + /* How many pages shrink_list() should reclaim */
> + unsigned long nr_to_reclaim;

If you're splitting the meanings of swap_cluster_max,
how about renaming 'swap_cluster_max' too?

Its remaining meaning is just 'batch size'. ;)
I hope we can change it while we're at it.

> +
> /* This context's GFP mask */
> gfp_t gfp_mask;
>
> @@ -1585,6 +1588,7 @@ static void shrink_zone(int priority, struct zone *zone,
> enum lru_list l;
> unsigned long nr_reclaimed = sc->nr_reclaimed;
> unsigned long swap_cluster_max = sc->swap_cluster_max;
> + unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> int noswap = 0;
>
> @@ -1634,8 +1638,7 @@ static void shrink_zone(int priority, struct zone *zone,
> * with multiple processes reclaiming pages, the total
> * freeing target can get unreasonably large.
> */

How about adding the following comment, or a better one:
'kswapd doesn't bail out because it has a 'MAX' nr_to_reclaim'.

> - if (nr_reclaimed > swap_cluster_max &&
> - priority < DEF_PRIORITY && !current_is_kswapd())
> + if (nr_reclaimed > nr_to_reclaim && priority < DEF_PRIORITY)
> break;
> }
>
> @@ -1733,6 +1736,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> struct zoneref *z;
> struct zone *zone;
> enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
> + unsigned long writeback_threshold;
>
> delayacct_freepages_start();
>
> @@ -1768,7 +1772,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> }
> }
> total_scanned += sc->nr_scanned;
> - if (sc->nr_reclaimed >= sc->swap_cluster_max) {
> + if (sc->nr_reclaimed >= sc->nr_to_reclaim) {
> ret = sc->nr_reclaimed;
> goto out;
> }
> @@ -1780,8 +1784,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> * that's undesirable in laptop mode, where we *want* lumpy
> * writeout. So in laptop mode, write out the whole world.
> */
> - if (total_scanned > sc->swap_cluster_max +
> - sc->swap_cluster_max / 2) {
> + writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
> + if (total_scanned > writeback_threshold) {
> wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
> sc->may_writepage = 1;
> }
> @@ -1827,6 +1831,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> .gfp_mask = gfp_mask,
> .may_writepage = !laptop_mode,
> .swap_cluster_max = SWAP_CLUSTER_MAX,
> + .nr_to_reclaim = SWAP_CLUSTER_MAX,
> .may_unmap = 1,
> .may_swap = 1,
> .swappiness = vm_swappiness,
> @@ -1885,6 +1890,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> .may_unmap = 1,
> .may_swap = !noswap,
> .swap_cluster_max = SWAP_CLUSTER_MAX,
> + .nr_to_reclaim = SWAP_CLUSTER_MAX,
> .swappiness = swappiness,
> .order = 0,
> .mem_cgroup = mem_cont,
> @@ -1932,6 +1938,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> .may_unmap = 1,
> .may_swap = 1,
>
> .swap_cluster_max = SWAP_CLUSTER_MAX,

Or add a comment here:

'kswapd doesn't want to bail out while reclaiming.'

> + .nr_to_reclaim = ULONG_MAX,
> .swappiness = vm_swappiness,
> .order = order,
> .mem_cgroup = NULL,
> @@ -2549,7 +2556,9 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
> .may_swap = 1,
> .swap_cluster_max = max_t(unsigned long, nr_pages,
> - SWAP_CLUSTER_MAX),
> + SWAP_CLUSTER_MAX),
> + .nr_to_reclaim = max_t(unsigned long, nr_pages,
> + SWAP_CLUSTER_MAX),
> .gfp_mask = gfp_mask,
> .swappiness = vm_swappiness,
> .order = order,
> --
> 1.6.2.5
>
>
>


--
Kind regards,
Minchan Kim

2009-11-02 00:42:45

by Minchan Kim

Subject: Re: [PATCHv2 3/5] vmscan: Stop zone_reclaim()'s wrong swap_cluster_max usage

On Mon, 2 Nov 2009 00:11:02 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> In the old days, we didn't have sc.nr_to_reclaim, and that invited
> sc.swap_cluster_max misuse.
>
> A huge sc.swap_cluster_max can cause unnecessary OOM with
> no performance benefit.
>
> Now we can remove this dangerous usage.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2009-11-02 00:49:03

by Minchan Kim

Subject: Re: [PATCHv2 4/5] vmscan: Kill sc.swap_cluster_max

On Mon, 2 Nov 2009 00:12:02 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> Now all callers set swap_cluster_max = SWAP_CLUSTER_MAX,
> so we can remove the field entirely.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>


--
Kind regards,
Minchan Kim

2009-11-02 00:48:13

by Minchan Kim

Subject: Re: [PATCHv2 1/5] vmscan: separate sc.swap_cluster_max and sc.nr_to_reclaim

On Mon, Nov 2, 2009 at 9:35 AM, Minchan Kim <[email protected]> wrote:
> Hi, Kosaki.
>
> On Mon, 2 Nov 2009 00:08:44 +0900 (JST)
> KOSAKI Motohiro <[email protected]> wrote:
>
>> Currently, sc.swap_cluster_max has two meanings:
>>
>> 1) the reclaim batch size, passed as isolate_lru_pages()'s argument
>> 2) the reclaim bail-out threshold
>>
>> The two meanings are pretty unrelated, so let's separate them.
>> This patch doesn't change any behavior.
>>
>> Signed-off-by: KOSAKI Motohiro <[email protected]>
>> Cc: Rafael J. Wysocki <[email protected]>
>> Reviewed-by: Rik van Riel <[email protected]>
>> ---
>> mm/vmscan.c | 21 +++++++++++++++------
>> 1 files changed, 15 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index f805958..6a3eb9f 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -55,6 +55,9 @@ struct scan_control {
>> /* Number of pages freed so far during a call to shrink_zones() */
>> unsigned long nr_reclaimed;
>>
>> + /* How many pages shrink_list() should reclaim */
>> + unsigned long nr_to_reclaim;
>
> If you try to divide meaning of swap_cluster_max,
> How about changing 'swap_cluster_max', too?
>
> It has a meaning which represents 'batch size'. ;)
> I hope we change it in this chance.

I see your 4th patch, 'Kill sc.swap_cluster_max', now.
It looks good to me. Forget this comment. :)


--
Kind regards,
Minchan Kim

2009-11-02 00:50:55

by Minchan Kim

Subject: Re: [PATCHv2 5/5][nit fix] vmscan: Make reclaim bail-out consistent between do_try_to_free_pages and shrink_zone

On Mon, 2 Nov 2009 00:13:04 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> Fix a small inconsistency between ">" and ">=".
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2009-11-02 15:36:17

by KOSAKI Motohiro

Subject: Re: [PATCHv2 2/5] vmscan: Kill hibernation specific reclaim logic and unify it

> > Thus, this patch changes shrink_all_memory() into just a wrapper around
> > do_try_to_free_pages(). That brings good reviewability and debuggability,
> > and solves the above problems.
> >
> > Side note: unifying the reclaim logic has two good side effects.
> > - It fixes a recursive reclaim bug in shrink_all_memory():
> > it forgot to set PF_MEMALLOC, meaning the system could get stuck in a deadlock.
> > - shrink_all_memory() is now lockdep-aware, which brings good debuggability.
>
> As I said previously, I don't really see a reason to keep shrink_all_memory().
>
> Do you think that removing it will result in performance degradation?

Hmm...
Probably I misunderstood you. I thought you were suggesting killing all the
hibernation-specific reclaim code, which I did. There is no performance
degradation (at least, I didn't observe any).

But if you are hoping to kill the shrink_all_memory() function itself, the
short answer is: it's impossible.

The current VM reclaim code needs some preparation by the caller, and that
preparation exists in both alloc_pages_slowpath() and try_to_free_pages().
We can't omit it.

Please see the following shrink_all_memory() code. It's pretty small; it only
has a little vmscan preparation. I don't think it is hard to maintain.


=====================================================
unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
{
	struct reclaim_state reclaim_state;
	struct scan_control sc = {
		.gfp_mask = GFP_HIGHUSER_MOVABLE,
		.may_swap = 1,
		.may_unmap = 1,
		.may_writepage = 1,
		.nr_to_reclaim = nr_to_reclaim,
		.hibernation_mode = 1,
		.swappiness = vm_swappiness,
		.order = 0,
		.isolate_pages = isolate_pages_global,
	};
	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
	struct task_struct *p = current;
	unsigned long nr_reclaimed;

	p->flags |= PF_MEMALLOC;
	lockdep_set_current_reclaim_state(sc.gfp_mask);
	reclaim_state.reclaimed_slab = 0;
	p->reclaim_state = &reclaim_state;

	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);

	p->reclaim_state = NULL;
	lockdep_clear_current_reclaim_state();
	p->flags &= ~PF_MEMALLOC;

	return nr_reclaimed;
}

2009-11-02 15:36:29

by KOSAKI Motohiro

Subject: Re: [PATCHv2 1/5] vmscan: separate sc.swap_cluster_max and sc.nr_to_reclaim

Hi

> > @@ -1932,6 +1938,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> > .may_unmap = 1,
> > .may_swap = 1,
> >
> .swap_cluster_max = SWAP_CLUSTER_MAX,
>
> Or add a comment here:
>
> 'kswapd doesn't want to bail out while reclaiming.'

OK, reasonable.
How about this?



---
mm/vmscan.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7fb3435..84e4da0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1930,6 +1930,10 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
.gfp_mask = GFP_KERNEL,
.may_unmap = 1,
.may_swap = 1,
+ /*
+ * kswapd doesn't want to bail out while reclaiming, because
+ * we want to put equal scanning pressure on each zone.
+ */
.nr_to_reclaim = ULONG_MAX,
.swappiness = vm_swappiness,
.order = order,
--
1.6.2.5



2009-11-02 15:36:36

by KOSAKI Motohiro

Subject: Re: [PATCHv2 2/5] vmscan: Kill hibernation specific reclaim logic and unify it

Hi

Thank you for the reviewing :)

> > 2) shrink_all_zones() tries to shrink all pages at a time, but that doesn't
> > work well on NUMA systems.
> > example)
> > A system has 4GB of memory, each node has 2GB, and hibernation needs 1GB.
> >
> > optimal)
> > steal 500MB from each node.
> > shrink_all_zones)
> > steal 1GB from node-0.
>
> I haven't given much thought to numa awareness in hibernate code, but I
> can say that the shrink_all_memory interface is woefully inadequate as
> far as zone awareness goes. Since lowmem needs to be atomically restored
> before we can restore highmem, we really need to be able to ask for a
> particular number of pages of a particular zone type to be freed.

Honestly, I am not a suspend/hibernation expert. Can I ask why the caller
needs to know the per-zone number of freed pages? If hibernation doesn't need
highmem, the following incremental patch prevents highmem reclaim entirely.
Is that enough?


---
mm/vmscan.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e6ea011..7fb3435 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2265,7 +2265,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
{
struct reclaim_state reclaim_state;
struct scan_control sc = {
- .gfp_mask = GFP_HIGHUSER_MOVABLE,
+ .gfp_mask = GFP_KERNEL,
.may_swap = 1,
.may_unmap = 1,
.may_writepage = 1,
--
1.6.2.5


2009-11-02 19:02:08

by Rafael J. Wysocki

Subject: Re: [PATCHv2 2/5] vmscan: Kill hibernation specific reclaim logic and unify it

On Monday 02 November 2009, KOSAKI Motohiro wrote:
> > > Thus, this patch changes shrink_all_memory() into just a wrapper around
> > > do_try_to_free_pages(). That brings good reviewability and debuggability,
> > > and solves the above problems.
> > >
> > > Side note: unifying the reclaim logic has two good side effects.
> > > - It fixes a recursive reclaim bug in shrink_all_memory():
> > > it forgot to set PF_MEMALLOC, meaning the system could get stuck in a deadlock.
> > > - shrink_all_memory() is now lockdep-aware, which brings good debuggability.
> >
> > As I said previously, I don't really see a reason to keep shrink_all_memory().
> >
> > Do you think that removing it will result in performance degradation?
>
> Hmm...
> Probably I misunderstood you. I thought you were suggesting killing all the
> hibernation-specific reclaim code, which I did. There is no performance
> degradation (at least, I didn't observe any).
>
> But if you are hoping to kill the shrink_all_memory() function itself, the
> short answer is: it's impossible.
>
> The current VM reclaim code needs some preparation by the caller, and that
> preparation exists in both alloc_pages_slowpath() and try_to_free_pages().
> We can't omit it.

Well, my grepping for 'shrink_all_memory' throughout the entire kernel source
code seems to indicate that hibernate_preallocate_memory() is the only current
user of it. I may be wrong, but I doubt it, unless some new users have been
added since 2.6.31.

In case I'm not wrong, it should be safe to drop it from
hibernate_preallocate_memory(), because it's there for performance reasons
only. Now, since hibernate_preallocate_memory() appears to be the only user of
it, it should be safe to drop it entirely.

> Please see the following shrink_all_memory() code. It's pretty small; it only
> has a little vmscan preparation. I don't think it is hard to maintain.

No, it's not, but I'm really not sure it's worth keeping.

Thanks,
Rafael


> =====================================================
> unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
> {
> 	struct reclaim_state reclaim_state;
> 	struct scan_control sc = {
> 		.gfp_mask = GFP_HIGHUSER_MOVABLE,
> 		.may_swap = 1,
> 		.may_unmap = 1,
> 		.may_writepage = 1,
> 		.nr_to_reclaim = nr_to_reclaim,
> 		.hibernation_mode = 1,
> 		.swappiness = vm_swappiness,
> 		.order = 0,
> 		.isolate_pages = isolate_pages_global,
> 	};
> 	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
> 	struct task_struct *p = current;
> 	unsigned long nr_reclaimed;
>
> 	p->flags |= PF_MEMALLOC;
> 	lockdep_set_current_reclaim_state(sc.gfp_mask);
> 	reclaim_state.reclaimed_slab = 0;
> 	p->reclaim_state = &reclaim_state;
>
> 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
>
> 	p->reclaim_state = NULL;
> 	lockdep_clear_current_reclaim_state();
> 	p->flags &= ~PF_MEMALLOC;
>
> 	return nr_reclaimed;
> }

2009-11-02 19:03:23

by Rafael J. Wysocki

Subject: Re: [PATCHv2 2/5] vmscan: Kill hibernation specific reclaim logic and unify it

On Monday 02 November 2009, KOSAKI Motohiro wrote:
> Hi
>
> Thank you for the reviewing :)
>
> > > 2) shrink_all_zones() tries to shrink all pages at a time, but that doesn't
> > > work well on NUMA systems.
> > > example)
> > > A system has 4GB of memory, each node has 2GB, and hibernation needs 1GB.
> > >
> > > optimal)
> > > steal 500MB from each node.
> > > shrink_all_zones)
> > > steal 1GB from node-0.
> >
> > I haven't given much thought to numa awareness in hibernate code, but I
> > can say that the shrink_all_memory interface is woefully inadequate as
> > far as zone awareness goes. Since lowmem needs to be atomically restored
> > before we can restore highmem, we really need to be able to ask for a
> > particular number of pages of a particular zone type to be freed.
>
> Honestly, I am not a suspend/hibernation expert. Can I ask why the caller
> needs to know the per-zone number of freed pages? If hibernation doesn't
> need highmem,

It does need highmem. At least the mainline version does.

> the following incremental patch prevents highmem reclaim entirely. Is that enough?

Thanks,
Rafael


> ---
> mm/vmscan.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e6ea011..7fb3435 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2265,7 +2265,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
> {
> struct reclaim_state reclaim_state;
> struct scan_control sc = {
> - .gfp_mask = GFP_HIGHUSER_MOVABLE,
> + .gfp_mask = GFP_KERNEL,
> .may_swap = 1,
> .may_unmap = 1,
> .may_writepage = 1,

2009-11-02 21:18:38

by Nigel Cunningham

Subject: Re: [PATCHv2 2/5] vmscan: Kill hibernation specific reclaim logic and unify it

Hi.

KOSAKI Motohiro wrote:
>> I haven't given much thought to numa awareness in hibernate code, but I
>> can say that the shrink_all_memory interface is woefully inadequate as
>> far as zone awareness goes. Since lowmem needs to be atomically restored
>> before we can restore highmem, we really need to be able to ask for a
>> particular number of pages of a particular zone type to be freed.
>
> Honestly, I am not a suspend/hibernation expert. Can I ask why the caller
> needs to know the per-zone number of freed pages? If hibernation doesn't need
> highmem, the following incremental patch prevents highmem reclaim entirely.
> Is that enough?

(Disclaimer: I don't think about highmem a lot any more, and might have
forgotten some of the details, or swsusp's algorithms might have
changed. Rafael might need to correct some of this...)

Imagine that you have a system with 1000 pages of lowmem and 5000 pages
of highmem. Of these, 950 lowmem pages are in use and 500 highmem pages
are in use.

In order to be able to save an image, we need to be able to do an
atomic copy of those lowmem pages.

You might think that we could just copy everything into the spare
highmem pages, but we can't because mapping and unmapping the highmem
pages as we copy the data will leave us with an inconsistent copy.
Depending on the configuration, it might (for example) have one page -
say one in the pagetables - reflecting one page being kmapped and another
page - containing the variables that record which kmap slots are used,
for example - recording a different page being kmapped.

What we do, then, is seek to atomically copy the lowmem pages to lowmem.
That requires, however, that we have at least half of the lowmem pages
free. So, then, we need a function that lets us free lowmem pages only.

I hope that makes it clearer.
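To put numbers on it (a trivial illustration in C, using the example above):

#include <stdio.h>

/* An atomic copy of lowmem into lowmem needs at least as many free
 * lowmem pages as used ones, i.e. used <= total / 2. */
int main(void)
{
	unsigned long lowmem_total = 1000, lowmem_used = 950;
	unsigned long must_free = 0;

	if (lowmem_used > lowmem_total / 2)
		must_free = lowmem_used - lowmem_total / 2;

	printf("must free at least %lu lowmem pages\n", must_free); /* 450 */
	return 0;
}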

> ---
> mm/vmscan.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e6ea011..7fb3435 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2265,7 +2265,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
> {
> struct reclaim_state reclaim_state;
> struct scan_control sc = {
> - .gfp_mask = GFP_HIGHUSER_MOVABLE,
> + .gfp_mask = GFP_KERNEL,
> .may_swap = 1,
> .may_unmap = 1,
> .may_writepage = 1,

I don't think so. I think what we really need is:

shrink_memory_type(gfp_mask, pages_needed)

That is, a function that would let us say "Free 489 pages of lowmem" or
"Free 983 pages of highmem" or "Free 340 pages of any kind of memory".
(The latter might be used if we just want to free some pages because the
image as it stands is too big for the storage available.)
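A rough sketch of what I mean (hypothetical; the function name and shape are
invented here, built on the fields the new shrink_all_memory() already uses):

/* Hypothetical, for illustration only.  PF_MEMALLOC/reclaim_state setup
 * as in shrink_all_memory() is omitted for brevity. */
unsigned long shrink_memory_type(gfp_t gfp_mask, unsigned long pages_needed)
{
	struct scan_control sc = {
		.gfp_mask = gfp_mask,	/* GFP_KERNEL: lowmem only;
					 * GFP_HIGHUSER_MOVABLE: any zone */
		.nr_to_reclaim = pages_needed,
		.may_swap = 1,
		.may_unmap = 1,
		.may_writepage = 1,
		.hibernation_mode = 1,
		.swappiness = vm_swappiness,
		.order = 0,
		.isolate_pages = isolate_pages_global,
	};
	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);

	return do_try_to_free_pages(zonelist, &sc);
}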

Regards,

Nigel

2009-11-02 23:37:12

by Minchan Kim

Subject: Re: [PATCHv2 1/5] vmscan: separate sc.swap_cluster_max and sc.nr_to_reclaim

On Tue, 3 Nov 2009 00:35:30 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> Hi
>
> > > @@ -1932,6 +1938,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> > > .may_unmap = 1,
> > > .may_swap = 1,
> > >
> > .swap_cluster_max = SWAP_CLUSTER_MAX,
> >
> > Or add a comment here:
> >
> > 'kswapd doesn't want to bail out while reclaiming.'
>
> OK, reasonable.
> How about this?
>
>
>
> ---
> mm/vmscan.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7fb3435..84e4da0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1930,6 +1930,10 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> .gfp_mask = GFP_KERNEL,
> .may_unmap = 1,
> .may_swap = 1,
> + /*
> + * kswapd doesn't want to bail out while reclaiming, because
> + * we want to put equal scanning pressure on each zone.
> + */
> .nr_to_reclaim = ULONG_MAX,
> .swappiness = vm_swappiness,
> .order = order,
> --
> 1.6.2.5
>


Reviewed-by: Minchan Kim <[email protected]>
It looks better than mine.
Thanks, Kosaki. :)



--
Kind regards,
Minchan Kim

2009-11-03 11:28:47

by Rafael J. Wysocki

Subject: Re: [PATCHv2 2/5] vmscan: Kill hibernation specific reclaim logic and unify it

On Monday 02 November 2009, Nigel Cunningham wrote:
> Hi.

Hi,

> KOSAKI Motohiro wrote:
> >> I haven't given much thought to numa awareness in hibernate code, but I
> >> can say that the shrink_all_memory interface is woefully inadequate as
> >> far as zone awareness goes. Since lowmem needs to be atomically restored
> >> before we can restore highmem, we really need to be able to ask for a
> >> particular number of pages of a particular zone type to be freed.
> >
> > Honestly, I am not a suspend/hibernation expert. Can I ask why the caller
> > needs to know the per-zone number of freed pages? If hibernation doesn't need
> > highmem, the following incremental patch prevents highmem reclaim entirely.
> > Is that enough?
>
> (Disclaimer: I don't think about highmem a lot any more, and might have
> forgotten some of the details, or swsusp's algorithms might have
> changed. Rafael might need to correct some of this...)
>
> Imagine that you have a system with 1000 pages of lowmem and 5000 pages
> of highmem. Of these, 950 lowmem pages are in use and 500 highmem pages
> are in use.
>
> In order to to be able to save an image, we need to be able to do an
> atomic copy of those lowmem pages.
>
> You might think that we could just copy everything into the spare
> highmem pages, but we can't because mapping and unmapping the highmem
> pages as we copy the data will leave us with an inconsistent copy.

This isn't the case any more for the mainline hibernate code. We use highmem
for storing image data as well as lowmem.

Thanks,
Rafael

2009-11-03 14:01:23

by KOSAKI Motohiro

Subject: Re: [PATCHv2 2/5] vmscan: Kill hibernation specific reclaim logic and unify it

> On Monday 02 November 2009, KOSAKI Motohiro wrote:
> > > > Thus, this patch changes shrink_all_memory() into just a wrapper around
> > > > do_try_to_free_pages(). That brings good reviewability and debuggability,
> > > > and solves the above problems.
> > > >
> > > > Side note: unifying the reclaim logic has two good side effects.
> > > > - It fixes a recursive reclaim bug in shrink_all_memory():
> > > > it forgot to set PF_MEMALLOC, meaning the system could get stuck in a deadlock.
> > > > - shrink_all_memory() is now lockdep-aware, which brings good debuggability.
> > >
> > > As I said previously, I don't really see a reason to keep shrink_all_memory().
> > >
> > > Do you think that removing it will result in performance degradation?
> >
> > Hmm...
> > Probably I misunderstood you. I thought you were suggesting killing all the
> > hibernation-specific reclaim code, which I did. There is no performance
> > degradation (at least, I didn't observe any).
> >
> > But if you are hoping to kill the shrink_all_memory() function itself, the
> > short answer is: it's impossible.
> >
> > The current VM reclaim code needs some preparation by the caller, and that
> > preparation exists in both alloc_pages_slowpath() and try_to_free_pages().
> > We can't omit it.
>
> Well, my grepping for 'shrink_all_memory' throughout the entire kernel source
> code seems to indicate that hibernate_preallocate_memory() is the only current
> user of it. I may be wrong, but I doubt it, unless some new users have been
> added since 2.6.31.
>
> In case I'm not wrong, it should be safe to drop it from
> hibernate_preallocate_memory(), because it's there for performance reasons
> only. Now, since hibernate_preallocate_memory() appears to be the only user of
> it, it should be safe to drop it entirely.

Hmmm...
I tried dropping shrink_all_memory() today, but I got bad results.

In 3 test runs, the results were:

2 times: kernel hang-up ;)
1 time: success, but about 100x slower than with shrink_all_memory().


Did you try to drop it yourself on your machine? Did that succeed?



> > Please see the following shrink_all_memory() code. It's pretty small; it only
> > has a little vmscan preparation. I don't think it is hard to maintain.
>
> No, it's not, but I'm really not sure it's worth keeping.


2009-11-03 14:01:26

by KOSAKI Motohiro

Subject: Re: [PATCHv2 2/5] vmscan: Kill hibernation specific reclaim logic and unify it

> Hi.
>
> KOSAKI Motohiro wrote:
> >> I haven't given much thought to numa awareness in hibernate code, but I
> >> can say that the shrink_all_memory interface is woefully inadequate as
> >> far as zone awareness goes. Since lowmem needs to be atomically restored
> >> before we can restore highmem, we really need to be able to ask for a
> >> particular number of pages of a particular zone type to be freed.
> >
> > Honestly, I am not a suspend/hibernation expert. Can I ask why the caller
> > needs to know the per-zone number of freed pages? If hibernation doesn't need
> > highmem, the following incremental patch prevents highmem reclaim entirely.
> > Is that enough?
>
> (Disclaimer: I don't think about highmem a lot any more, and might have
> forgotten some of the details, or swsusp's algorithms might have
> changed. Rafael might need to correct some of this...)
>
> Imagine that you have a system with 1000 pages of lowmem and 5000 pages
> of highmem. Of these, 950 lowmem pages are in use and 500 highmem pages
> are in use.
>
> In order to be able to save an image, we need to be able to do an
> atomic copy of those lowmem pages.
>
> You might think that we could just copy everything into the spare
> highmem pages, but we can't because mapping and unmapping the highmem
> pages as we copy the data will leave us with an inconsistent copy.
> Depending on the configuration, it might (for example) have one page -
> say one in the pagetables - reflecting one page being kmapped and another
> page - containing the variables that record which kmap slots are used,
> for example - recording a different page being kmapped.
>
> What we do, then, is seek to atomically copy the lowmem pages to lowmem.
> That requires, however, that we have at least half of the lowmem pages
> free. So, then, we need a function that lets us free lowmem pages only.
>
> I hope that makes it clearer.
>
> > ---
> > mm/vmscan.c | 2 +-
> > 1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index e6ea011..7fb3435 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2265,7 +2265,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
> > {
> > struct reclaim_state reclaim_state;
> > struct scan_control sc = {
> > - .gfp_mask = GFP_HIGHUSER_MOVABLE,
> > + .gfp_mask = GFP_KERNEL,
> > .may_swap = 1,
> > .may_unmap = 1,
> > .may_writepage = 1,
>
> I don't think so. I think what we really need is:
>
> shrink_memory_type(gfp_mask, pages_needed)
>
> That is, a function that would let us say "Free 489 pages of lowmem" or
> "Free 983 pages of highmem" or "Free 340 pages of any kind of memory".
> (The later might be used if we just want to free some pages because the
> image as it stands is too big for the storage available).

I can add a gfp_mask argument to shrink_all_memory(); that's easy (see the
sketch below). But I obviously need help from the PM folks to make the change
meaningful: I'm not sure how we should calculate the free memory required for
hibernation.

Side note: the current reclaim logic can't do "Free 983 pages of highmem",
but I doubt that's really necessary. I guess we only need "reclaim lowmem"
and/or "reclaim any kind of memory".


2009-11-03 21:11:08

by Nigel Cunningham

Subject: Re: [PATCHv2 2/5] vmscan: Kill hibernation specific reclaim logic and unify it

Hi Rafael.

Rafael J. Wysocki wrote:
> On Monday 02 November 2009, Nigel Cunningham wrote:
>> Hi.
>
> Hi,
>
>> KOSAKI Motohiro wrote:
>>>> I haven't given much thought to numa awareness in hibernate code, but I
>>>> can say that the shrink_all_memory interface is woefully inadequate as
>>>> far as zone awareness goes. Since lowmem needs to be atomically restored
>>>> before we can restore highmem, we really need to be able to ask for a
>>>> particular number of pages of a particular zone type to be freed.
>>> Honestly, I am not a suspend/hibernation expert. Can I ask why the caller
>>> needs to know the per-zone number of freed pages? If hibernation doesn't need
>>> highmem, the following incremental patch prevents highmem reclaim entirely.
>>> Is that enough?
>> (Disclaimer: I don't think about highmem a lot any more, and might have
>> forgotten some of the details, or swsusp's algorithms might have
>> changed. Rafael might need to correct some of this...)
>>
>> Imagine that you have a system with 1000 pages of lowmem and 5000 pages
>> of highmem. Of these, 950 lowmem pages are in use and 500 highmem pages
>> are in use.
>>
>> In order to be able to save an image, we need to be able to do an
>> atomic copy of those lowmem pages.
>>
>> You might think that we could just copy everything into the spare
>> highmem pages, but we can't because mapping and unmapping the highmem
>> pages as we copy the data will leave us with an inconsistent copy.
>
> This isn't the case any more for the mainline hibernate code. We use highmem
> for storing image data as well as lowmem.

Highmem for storing copies of lowmem pages?

Regards,

Nigel

2009-11-03 21:50:14

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCHv2 2/5] vmscan: Kill hibernation specific reclaim logic and unify it

On Tuesday 03 November 2009, KOSAKI Motohiro wrote:
> > On Monday 02 November 2009, KOSAKI Motohiro wrote:
> > > > > Then, this patch changes shrink_all_memory() into a mere wrapper around
> > > > > do_try_to_free_pages(). That brings good reviewability and debuggability, and solves
> > > > > the above problems.
> > > > >
> > > > > Side note: the reclaim logic unification has two good side effects.
> > > > > - It fixes a recursive reclaim bug in shrink_all_memory():
> > > > > it forgot to use PF_MEMALLOC, which meant the system could get stuck in a deadlock.
> > > > > - shrink_all_memory() now has lockdep awareness, which brings good debuggability.
> > > >
> > > > As I said previously, I don't really see a reason to keep shrink_all_memory().
> > > >
> > > > Do you think that removing it will result in performance degradation?
> > >
> > > Hmm...
> > > Probably I misunderstood your point. I thought you were suggesting killing
> > > all the hibernation-specific reclaim code. I did that. There's no performance regression
> > > (at least, I didn't observe one).
> > >
> > > But if you hope to kill the shrink_all_memory() function itself, the short answer is
> > > that it's impossible.
> > >
> > > The current VM reclaim code needs some preparation by the caller, which exists in
> > > both alloc_pages_slowpath() and try_to_free_pages(). We can't omit that preparation.
> >
> > Well, my grepping for 'shrink_all_memory' throughout the entire kernel source
> > code seems to indicate that hibernate_preallocate_memory() is the only current
> > user of it. I may be wrong, but I doubt it, unless some new users have been
> > added since 2.6.31.
> >
> > In case I'm not wrong, it should be safe to drop it from
> > hibernate_preallocate_memory(), because it's there for performance reasons
> > only. Now, since hibernate_preallocate_memory() appears to be the only user of
> > it, it should be safe to drop it entirely.
>
> Hmmm...
> I tried dropping shrink_all_memory() today, but I got bad results.
>
> In three test runs, the results were:
>
> 2 times: kernel hang-up ;)
> 1 time: success, but about 100x slower than with shrink_all_memory().
>
>
> Did you try dropping it yourself on your machine? Was it successful?

Generally, yes, but the performance was hit really badly.

So, the conclusion is that we need shrink_all_memory() for things to work,
which is kind of interesting.

In that case, please feel free to add Acked-by: Rafael J. Wysocki <[email protected]>
to the patch.

Thanks,
Rafael
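
(For reference, a sketch of the guard pattern behind the two side effects
quoted above - reconstructed for illustration, not quoted from the patch.
PF_MEMALLOC stops the allocator from re-entering direct reclaim from the
reclaiming task, which is the recursion/deadlock KOSAKI mentions; the lockdep
calls provide the added debuggability.)

unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
{
	struct reclaim_state reclaim_state;
	struct scan_control sc = {
		.gfp_mask	= GFP_HIGHUSER_MOVABLE,
		.may_swap	= 1,
		.may_unmap	= 1,
		.may_writepage	= 1,
		.nr_to_reclaim	= nr_to_reclaim,
		.swap_cluster_max = SWAP_CLUSTER_MAX,
		.swappiness	= vm_swappiness,
	};
	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
	unsigned long nr_reclaimed;

	current->flags |= PF_MEMALLOC;			/* forbid recursive reclaim */
	lockdep_set_current_reclaim_state(sc.gfp_mask);	/* lockdep awareness */
	reclaim_state.reclaimed_slab = 0;
	current->reclaim_state = &reclaim_state;

	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);

	current->reclaim_state = NULL;
	lockdep_clear_current_reclaim_state();
	current->flags &= ~PF_MEMALLOC;

	return nr_reclaimed;
}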

2009-11-03 21:50:52

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCHv2 2/5] vmscan: Kill hibernation specific reclaim logic and unify it

On Tuesday 03 November 2009, KOSAKI Motohiro wrote:
> > Hi.
> >
> > KOSAKI Motohiro wrote:
> > >> I haven't given much thought to numa awareness in hibernate code, but I
> > >> can say that the shrink_all_memory interface is woefully inadequate as
> > >> far as zone awareness goes. Since lowmem needs to be atomically restored
> > >> before we can restore highmem, we really need to be able to ask for a
> > >> particular number of pages of a particular zone type to be freed.
> > >
> > > Honestly, I am not a suspend/hibernation expert. Can I ask why the caller needs to know
> > > the per-zone number of freed pages? If hibernation doesn't need highmem, the
> > > following incremental patch prevents highmem reclaim entirely. Is that enough?
> >
> > (Disclaimer: I don't think about highmem a lot any more, and might have
> > forgotten some of the details, or swsusp's algorithms might have
> > changed. Rafael might need to correct some of this...)
> >
> > Imagine that you have a system with 1000 pages of lowmem and 5000 pages
> > of highmem. Of these, 950 lowmem pages are in use and 500 highmem pages
> > are in use.
> >
> > In order to be able to save an image, we need to be able to do an
> > atomic copy of those lowmem pages.
> >
> > You might think that we could just copy everything into the spare
> > highmem pages, but we can't because mapping and unmapping the highmem
> > pages as we copy the data will leave us with an inconsistent copy.
> > Depending on the configuration, the copy might (for example) have one page -
> > say, one in the pagetables - reflecting one page being kmapped, while another
> > page - say, the one containing the variables that record which kmap slots are
> > in use - records a different page being kmapped.
> >
> > What we do, then, is seek to atomically copy the lowmem pages to lowmem.
> > That requires, however, that we have at least half of the lowmem pages
> > free. So, then, we need a function that lets us free lowmem pages only.
> >
> > I hope that makes it clearer.
> >
> > > ---
> > > mm/vmscan.c | 2 +-
> > > 1 files changed, 1 insertions(+), 1 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index e6ea011..7fb3435 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2265,7 +2265,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
> > > {
> > > struct reclaim_state reclaim_state;
> > > struct scan_control sc = {
> > > - .gfp_mask = GFP_HIGHUSER_MOVABLE,
> > > + .gfp_mask = GFP_KERNEL,
> > > .may_swap = 1,
> > > .may_unmap = 1,
> > > .may_writepage = 1,
> >
> > I don't think so. I think what we really need is:
> >
> > shrink_memory_type(gfp_mask, pages_needed)
> >
> > That is, a function that would let us say "Free 489 pages of lowmem" or
> > "Free 983 pages of highmem" or "Free 340 pages of any kind of memory".
> > (The latter might be used if we just want to free some pages because the
> > image as it stands is too big for the storage available).
>
> I can add a gfp_mask argument to shrink_all_memory(); that's easy.
> But I obviously need help from the PM folks to make the change meaningful. I'm not sure
> how we should calculate the required free memory for hibernation.
>
> Side note: the current reclaim logic can't do "Free 983 pages of highmem",
> but I doubt that's really necessary. I guess we only need "reclaim lowmem" and/or
> "reclaim any kind of memory".

The latter.

Thanks,
Rafael
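
(Working through Nigel's numbers from the quote above: 1000 lowmem pages, 950
in use. An atomic lowmem-to-lowmem copy needs one free lowmem page per in-use
lowmem page, so at most 500 lowmem pages may remain in use - roughly "at
least half free". Reclaim therefore has to free at least 450 lowmem pages
before the snapshot, and freeing highmem doesn't help at all, which is why
"reclaim lowmem" is the other mode needed besides "reclaim any kind of
memory".)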

2009-11-03 21:58:40

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCHv2 2/5] vmscan: Kill hibernation specific reclaim logic and unify it

On Tuesday 03 November 2009, Nigel Cunningham wrote:
> Hi Rafael.
>
> Rafael J. Wysocki wrote:
> > On Monday 02 November 2009, Nigel Cunningham wrote:
> >> Hi.
> >
> > Hi,
> >
> >> KOSAKI Motohiro wrote:
> >>>> I haven't given much thought to numa awareness in hibernate code, but I
> >>>> can say that the shrink_all_memory interface is woefully inadequate as
> >>>> far as zone awareness goes. Since lowmem needs to be atomically restored
> >>>> before we can restore highmem, we really need to be able to ask for a
> >>>> particular number of pages of a particular zone type to be freed.
> >>> Honestly, I am not a suspend/hibernation expert. Can I ask why the caller needs to know
> >>> the per-zone number of freed pages? If hibernation doesn't need highmem, the
> >>> following incremental patch prevents highmem reclaim entirely. Is that enough?
> >> (Disclaimer: I don't think about highmem a lot any more, and might have
> >> forgotten some of the details, or swsusp's algorithms might have
> >> changed. Rafael might need to correct some of this...)
> >>
> >> Imagine that you have a system with 1000 pages of lowmem and 5000 pages
> >> of highmem. Of these, 950 lowmem pages are in use and 500 highmem pages
> >> are in use.
> >>
> >> In order to be able to save an image, we need to be able to do an
> >> atomic copy of those lowmem pages.
> >>
> >> You might think that we could just copy everything into the spare
> >> highmem pages, but we can't because mapping and unmapping the highmem
> >> pages as we copy the data will leave us with an inconsistent copy.
> >
> > This isn't the case any more for the mainline hibernate code. We use highmem
> > for storing image data as well as lowmem.
>
> Highmem for storing copies of lowmem pages?

It is possible in theory, but I don't think it happens in practice given the
way in which the memory is freed. Still, copy_data_page() takes this
possibility into account.

Thanks,
Rafael

2009-11-12 23:21:25

by Pavel Machek

[permalink] [raw]
Subject: using highmem for atomic copy of lowmem was Re: [PATCHv2 2/5] vmscan: Kill hibernation specific reclaim logic and unify it

Hi!

> > >> (Disclaimer: I don't think about highmem a lot any more, and might have
> > >> forgotten some of the details, or swsusp's algorithms might have
> > >> changed. Rafael might need to correct some of this...)
> > >>
> > >> Imagine that you have a system with 1000 pages of lowmem and 5000 pages
> > >> of highmem. Of these, 950 lowmem pages are in use and 500 highmem pages
> > >> are in use.
> > >>
> > >> In order to be able to save an image, we need to be able to do an
> > >> atomic copy of those lowmem pages.
> > >>
> > >> You might think that we could just copy everything into the spare
> > >> highmem pages, but we can't because mapping and unmapping the highmem
> > >> pages as we copy the data will leave us with an inconsistent copy.
> > >
> > > This isn't the case any more for the mainline hibernate code. We use highmem
> > > for storing image data as well as lowmem.
> >
> > Highmem for storing copies of lowmem pages?
>
> It is possible in theory, but I don't think it happens in practice given the
> way in which the memory is freed. Still, copy_data_page() takes this
> possibility into account.

Yes, it does, but I wonder if it can ever work...?

copy_data_page() takes great care not to modify any memory -- like
using a handmade loop instead of memcpy() -- yet it uses kmap_atomic()
and friends.

If the kmap_atomic()+kunmap_atomic() pair is guaranteed not to change any
memory, that's probably safe, but...

(Actually, plain memcpy() should be usable for copying highmem pages;
those should not contain task_structs or anything else touched by
memcpy()).
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
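
(For readers following the thread: the "handmade loop" is, roughly, a
word-at-a-time copy done under kmap_atomic(). A simplified sketch modelled on
do_copy_page()/copy_data_page() in kernel/power/snapshot.c - a
reconstruction, so treat the details as illustrative:)

/*
 * Copy a page long-by-long instead of via memcpy(), so that apart from
 * the destination page no memory is modified while the snapshot is
 * being taken.
 */
static void do_copy_page_sketch(long *dst, long *src)
{
	int n;

	for (n = PAGE_SIZE / sizeof(long); n; n--)
		*dst++ = *src++;
}

/* Sketch: copy via temporary atomic kmaps, as Pavel describes. */
static void copy_page_via_kmap_sketch(struct page *d_page, struct page *s_page)
{
	long *dst = kmap_atomic(d_page, KM_USER0);
	long *src = kmap_atomic(s_page, KM_USER1);

	do_copy_page_sketch(dst, src);

	kunmap_atomic(src, KM_USER1);
	kunmap_atomic(dst, KM_USER0);
}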

2009-11-12 23:32:09

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: using highmem for atomic copy of lowmem was Re: [PATCHv2 2/5] vmscan: Kill hibernation specific reclaim logic and unify it

On Thursday 12 November 2009, Pavel Machek wrote:
> Hi!
>
> > > >> (Disclaimer: I don't think about highmem a lot any more, and might have
> > > >> forgotten some of the details, or swsusp's algorithms might have
> > > >> changed. Rafael might need to correct some of this...)
> > > >>
> > > >> Imagine that you have a system with 1000 pages of lowmem and 5000 pages
> > > >> of highmem. Of these, 950 lowmem pages are in use and 500 highmem pages
> > > >> are in use.
> > > >>
> > > >> In order to be able to save an image, we need to be able to do an
> > > >> atomic copy of those lowmem pages.
> > > >>
> > > >> You might think that we could just copy everything into the spare
> > > >> highmem pages, but we can't because mapping and unmapping the highmem
> > > >> pages as we copy the data will leave us with an inconsistent copy.
> > > >
> > > > This isn't the case any more for the mainline hibernate code. We use highmem
> > > > for storing image data as well as lowmem.
> > >
> > > Highmem for storing copies of lowmem pages?
> >
> > It is possible in theory, but I don't think it happens in practice given the
> > way in which the memory is freed. Still, copy_data_page() takes this
> > possibility into account.
>
> Yes, it does, but I wonder if it can ever work...?
>
> copy_data_page() takes great care not to modify any memory -- like
> using a handmade loop instead of memcpy() -- yet it uses kmap_atomic()
> and friends.
>
> If the kmap_atomic()+kunmap_atomic() pair is guaranteed not to change any
> memory, that's probably safe, but...

It would only be unsafe if the page being copied were changed at the same time,
but the code is designed to avoid that.

Thanks,
Rafael