2016-04-25 05:21:12

by Joonsoo Kim

Subject: [PATCH v2 0/6] Introduce ZONE_CMA

From: Joonsoo Kim <[email protected]>

Hello,

Changes from v1
o Separate some patches which deserve to be submitted independently
o Modify description to reflect current kernel state
(e.g. high-order watermark problem disappeared by Mel's work)
o Don't increase SECTION_SIZE_BITS to make room in page flags
(detailed reason is in the patch that adds ZONE_CMA)
o Adjust ZONE_CMA population code

This series tries to solve problems of the current CMA implementation.

CMA was introduced to provide physically contiguous pages at runtime
without an exclusively reserved memory area. But the current implementation
works like the previous reserved memory approach, because freepages
in the CMA region are used only if there is no movable freepage. In other
words, freepages in the CMA region are used only as a fallback. By the
time freepages in the CMA region are used as a fallback, kswapd is
easily woken up, since there are no unmovable and reclaimable
freepages left either. Once kswapd starts to reclaim memory, fallback
allocation to MIGRATE_CMA doesn't occur any more, since movable freepages
are already refilled by kswapd, and most of the freepages in the CMA
region are left free. This situation looks like the exclusively reserved
memory case.

In my experiment, I found that if the system has 1024 MB of memory and
512 MB is reserved for CMA, kswapd is mostly woken up when roughly 512 MB
of free memory is left. The detailed reason is that, to keep enough free
memory for unmovable and reclaimable allocations, kswapd uses the equation
below when calculating free memory, and it easily goes under the watermark.

Free memory for unmovable and reclaimable = Free total - Free CMA pages

This is derived from the property of CMA freepages that they
can't be used for unmovable and reclaimable allocations.

Anyway, in this case, kswapd is woken up when (FreeTotal - FreeCMA)
is lower than the low watermark and tries to free memory until
(FreeTotal - FreeCMA) is higher than the high watermark. As a result,
FreeTotal consistently hovers around the 512 MB boundary, which
means that we can't utilize the full memory capacity.
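
As a rough illustration of the watermark arithmetic above, here is a
minimal user-space sketch. It is not kernel code: the function names and
the numbers used with them are made up for the example, and real
watermark checks involve more terms than this.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: free pages that unmovable and reclaimable
 * requests can actually use, i.e. FreeTotal - FreeCMA.
 */
long usable_free_pages(long free_total, long free_cma)
{
	assert(free_cma >= 0 && free_cma <= free_total);
	return free_total - free_cma;
}

/* kswapd is woken when the usable amount drops below the low watermark. */
bool kswapd_should_wake(long free_total, long free_cma, long low_wmark)
{
	return usable_free_pages(free_total, free_cma) < low_wmark;
}

/* kswapd keeps reclaiming until the usable amount exceeds the high watermark. */
bool kswapd_can_stop(long free_total, long free_cma, long high_wmark)
{
	return usable_free_pages(free_total, free_cma) > high_wmark;
}
```

With 512 MB of free CMA pages and an assumed low watermark of 16 MB,
kswapd_should_wake(520, 512, 16) is true even though 520 MB is free in
total, which is the behavior described above.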

To fix this problem, I submitted some patches [1] about 10 months ago,
but found some more problems that had to be fixed first.
That approach requires many hooks in the allocator hotpath, so some
developers don't like it. Instead, some of them suggested a different
approach [2] to fix all the problems related to CMA, that is, introducing
a new zone to deal with free CMA pages. I agree that it is the best way
to go, so I implement it here. Although the properties of ZONE_MOVABLE and
ZONE_CMA are similar, I decided to add a new zone rather than piggyback on
ZONE_MOVABLE since they have some differences. First, reserved CMA pages
should not be offlined. If freepages for CMA were managed by ZONE_MOVABLE,
we would need to keep the MIGRATE_CMA migratetype and insert many hooks
into the memory hotplug code to distinguish hotpluggable memory from
memory reserved for CMA in the same zone. That would make the memory
hotplug code, which is already complicated, even more complicated.
Second, cma_alloc() can be called more frequently than memory hotplug
operations, and we may need to control the allocation rate of ZONE_CMA
to optimize latency in the future. In that case, the separate zone
approach is easier to modify. Third, I'd like to see statistics for CMA
separately. Sometimes we need to debug why cma_alloc() failed, and
separate statistics would be more helpful in that situation.

Anyway, this patchset solves four problems related to the CMA implementation.

1) Utilization problem
As mentioned above, we can't utilize the full memory capacity due to the
limitation of CMA freepages and the fallback policy. This patchset
implements a new zone for CMA and uses it for GFP_HIGHUSER_MOVABLE
requests. This type of allocation is used for page cache and anonymous
pages, which account for most memory usage in the normal case, so we can
utilize the full memory capacity. Below is the experiment result for this
problem.

8 CPUs, 1024 MB, VIRTUAL MACHINE
make -j16

<Before this series>
CMA reserve:    0 MB    512 MB
Elapsed-time:   92.4    186.5
pswpin:         82      18647
pswpout:        160     69839

<After this series>
CMA reserve:    0 MB    512 MB
Elapsed-time:   93.1    93.4
pswpin:         84      46
pswpout:        183     92

FYI, there is another attempt [3] on lkml that tries to solve this problem.
And, as far as I know, Qualcomm also has an out-of-tree solution for this
problem.

2) Reclaim problem
Currently, there is no logic to distinguish CMA pages in the reclaim path.
If reclaim is initiated for an unmovable or reclaimable allocation,
reclaiming CMA pages doesn't help to satisfy the request, so reclaiming
them is just a waste. By managing CMA pages in the new zone, we can
skip reclaiming ZONE_CMA completely when it is unnecessary.

3) Atomic allocation failure problem
Kswapd isn't woken up to reclaim pages when the allocation request is of
movable type and there are enough free pages in the CMA region. After a
bunch of consecutive movable allocation requests, free pages in the
ordinary region (not the CMA region) would be exhausted without waking up
kswapd. If an atomic unmovable allocation comes at that time, it can't
succeed since there are not enough pages in the ordinary region. This
problem was reported by Aneesh [4] and can be solved by this patchset.

4) Inefficient compaction
A usual high-order allocation request is of unmovable type and cannot
be serviced from the CMA area. In compaction, the migration scanner
doesn't distinguish migratable pages in the CMA area and migrates them
anyway. In this case, even if we make a high-order page in that region,
it cannot be used due to the type mismatch. This patchset solves this
problem by separating CMA pages from the ordinary zones.

I passed boot tests on x86_64, x86_32, arm and arm64. I did some stress
tests on x86_64 and x86_32 and found no problems. Feel free to try it
and please give me feedback. :)

This patchset is based on linux-next-20160413.

Thanks.

[1] https://lkml.org/lkml/2014/5/28/64
[2] https://lkml.org/lkml/2014/11/4/55
[3] https://lkml.org/lkml/2014/10/15/623
[4] http://www.spinics.net/lists/linux-mm/msg100562.html

Joonsoo Kim (6):
mm/page_alloc: recalculate some of zone threshold when on/offline
memory
mm/cma: introduce new zone, ZONE_CMA
mm/cma: populate ZONE_CMA
mm/cma: remove ALLOC_CMA
mm/cma: remove MIGRATE_CMA
mm/cma: remove per zone CMA stat

arch/x86/mm/highmem_32.c | 8 ++
fs/proc/meminfo.c | 2 +-
include/linux/cma.h | 6 +
include/linux/gfp.h | 32 +++---
include/linux/memory_hotplug.h | 3 -
include/linux/mempolicy.h | 2 +-
include/linux/mmzone.h | 54 +++++----
include/linux/vm_event_item.h | 10 +-
include/linux/vmstat.h | 8 --
include/trace/events/compaction.h | 10 +-
kernel/power/snapshot.c | 8 ++
mm/cma.c | 58 +++++++++-
mm/compaction.c | 10 +-
mm/hugetlb.c | 2 +-
mm/internal.h | 6 +-
mm/memory_hotplug.c | 3 +
mm/page_alloc.c | 236 ++++++++++++++++++++++----------------
mm/page_isolation.c | 5 +-
mm/vmstat.c | 15 ++-
19 files changed, 303 insertions(+), 175 deletions(-)

--
1.9.1


2016-04-25 05:21:17

by Joonsoo Kim

Subject: [PATCH v2 1/6] mm/page_alloc: recalculate some of zone threshold when on/offline memory

From: Joonsoo Kim <[email protected]>

Some zone thresholds depend on the number of managed pages in the zone.
When memory goes on/offline, that number can change and we need to
adjust the thresholds.

This patch adds recalculation at the appropriate places and cleans up
the related functions for better maintenance.
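
To make the dependency concrete, here is a small user-space sketch of
the recalculation. struct zone_sketch and recalc_zone_thresholds() are
simplified stand-ins invented for this example, and the ratios are
passed as plain parameters rather than read from the sysctl variables.

```c
#include <assert.h>

/* Simplified stand-in for the kernel's struct zone (illustration only). */
struct zone_sketch {
	unsigned long managed_pages;
	unsigned long min_unmapped_pages;
	unsigned long min_slab_pages;
};

/* Recompute both ratio-based thresholds from the current managed_pages. */
void recalc_zone_thresholds(struct zone_sketch *zone,
			    unsigned int unmapped_ratio,
			    unsigned int slab_ratio)
{
	assert(unmapped_ratio <= 100 && slab_ratio <= 100);
	zone->min_unmapped_pages = zone->managed_pages * unmapped_ratio / 100;
	zone->min_slab_pages = zone->managed_pages * slab_ratio / 100;
}
```

If managed_pages changes (for example after memory goes offline) and the
function is not called again, both thresholds keep describing the old
zone size, which is the staleness this patch addresses.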

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_alloc.c | 36 +++++++++++++++++++++++++++++-------
1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 71fa015..ffa93e0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4633,6 +4633,8 @@ int local_memory_node(int node)
}
#endif

+static void setup_min_unmapped_ratio(struct zone *zone);
+static void setup_min_slab_ratio(struct zone *zone);
#else /* CONFIG_NUMA */

static void set_zonelist_order(void)
@@ -5747,9 +5749,8 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
zone->managed_pages = is_highmem_idx(j) ? realsize : freesize;
#ifdef CONFIG_NUMA
zone->node = nid;
- zone->min_unmapped_pages = (freesize*sysctl_min_unmapped_ratio)
- / 100;
- zone->min_slab_pages = (freesize * sysctl_min_slab_ratio) / 100;
+ setup_min_unmapped_ratio(zone);
+ setup_min_slab_ratio(zone);
#endif
zone->name = zone_names[j];
spin_lock_init(&zone->lock);
@@ -6655,6 +6656,7 @@ int __meminit init_per_zone_wmark_min(void)
{
unsigned long lowmem_kbytes;
int new_min_free_kbytes;
+ struct zone *zone;

lowmem_kbytes = nr_free_buffer_pages() * (PAGE_SIZE >> 10);
new_min_free_kbytes = int_sqrt(lowmem_kbytes * 16);
@@ -6672,6 +6674,14 @@ int __meminit init_per_zone_wmark_min(void)
setup_per_zone_wmarks();
refresh_zone_stat_thresholds();
setup_per_zone_lowmem_reserve();
+
+ for_each_zone(zone) {
+#ifdef CONFIG_NUMA
+ setup_min_unmapped_ratio(zone);
+ setup_min_slab_ratio(zone);
+#endif
+ }
+
return 0;
}
module_init(init_per_zone_wmark_min)
@@ -6713,6 +6723,12 @@ int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
}

#ifdef CONFIG_NUMA
+static void setup_min_unmapped_ratio(struct zone *zone)
+{
+ zone->min_unmapped_pages = (zone->managed_pages *
+ sysctl_min_unmapped_ratio) / 100;
+}
+
int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
{
@@ -6724,11 +6740,17 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
return rc;

for_each_zone(zone)
- zone->min_unmapped_pages = (zone->managed_pages *
- sysctl_min_unmapped_ratio) / 100;
+ setup_min_unmapped_ratio(zone);
+
return 0;
}

+static void setup_min_slab_ratio(struct zone *zone)
+{
+ zone->min_slab_pages = (zone->managed_pages *
+ sysctl_min_slab_ratio) / 100;
+}
+
int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
{
@@ -6740,8 +6762,8 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *table, int write,
return rc;

for_each_zone(zone)
- zone->min_slab_pages = (zone->managed_pages *
- sysctl_min_slab_ratio) / 100;
+ setup_min_slab_ratio(zone);
+
return 0;
}
#endif
--
1.9.1

2016-04-25 05:21:31

by Joonsoo Kim

Subject: [PATCH v2 3/6] mm/cma: populate ZONE_CMA

From: Joonsoo Kim <[email protected]>

Until now, reserved pages for CMA have been managed in the ordinary zones
that their pfns belong to. This approach has numerous problems
and fixing them isn't easy. (They are mentioned in the previous patch.)
To fix this situation, ZONE_CMA was introduced in the previous patch, but
it is not yet populated. This patch implements population of ZONE_CMA
by stealing reserved pages from the ordinary zones.

Unlike the previous implementation, where a kernel allocation request with
__GFP_MOVABLE could be serviced from the CMA region, only allocation
requests with GFP_HIGHUSER_MOVABLE can be serviced from the CMA region in
the new approach. This is an inevitable design decision when using the zone
implementation because ZONE_CMA could contain highmem. Due to this
decision, ZONE_CMA will work like ZONE_HIGHMEM or ZONE_MOVABLE.

I don't think it would be a problem because most file cache pages
and anonymous pages are requested with GFP_HIGHUSER_MOVABLE. This is
supported by the fact that there are many systems with ZONE_HIGHMEM and
they work fine. A notable disadvantage is that we cannot use these pages
for blockdev file cache pages, because such a request usually has
__GFP_MOVABLE but not __GFP_HIGHMEM and __GFP_USER. But, in this case,
there are pros and cons. In my experience, blockdev file cache pages are
one of the top reasons that cause cma_alloc() to fail temporarily. So we
get a stronger guarantee of cma_alloc() success by excluding that case.

The implementation itself is very easy to understand. Steal the pages when
the CMA area is initialized and recalculate various per-zone stats and
thresholds.
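
One step of the population is finding the pfn range that covers all CMA
areas, so that ZONE_CMA's span can be limited to it. A minimal user-space
sketch of that step follows; the struct and function names are invented
for the example, and the sketch starts the minimum at ULONG_MAX, which is
a choice of the sketch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Simplified stand-in for struct cma: pfn range [base_pfn, base_pfn + count). */
struct cma_area_sketch {
	unsigned long base_pfn;
	unsigned long count;
};

struct pfn_span {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
};

/* Return the smallest pfn span that covers every area; n must be > 0. */
struct pfn_span cma_covering_span(const struct cma_area_sketch *areas, size_t n)
{
	struct pfn_span span = { ULONG_MAX, 0 };
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++) {
		unsigned long end = areas[i].base_pfn + areas[i].count;

		if (areas[i].base_pfn < span.start_pfn)
			span.start_pfn = areas[i].base_pfn;
		if (end > span.end_pfn)
			span.end_pfn = end;
	}
	return span;
}
```

The resulting span may still contain pages that are not CMA pages when
the areas are not adjacent; it only bounds where ZONE_CMA pages can be.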

Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/memory_hotplug.h | 3 ---
mm/cma.c | 41 +++++++++++++++++++++++++++++++++++++++++
mm/internal.h | 3 +++
mm/page_alloc.c | 26 ++++++++++++++++++++++++--
4 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 20d8a5d..260c741 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -198,9 +198,6 @@ void put_online_mems(void);
void mem_hotplug_begin(void);
void mem_hotplug_done(void);

-extern void set_zone_contiguous(struct zone *zone);
-extern void clear_zone_contiguous(struct zone *zone);
-
#else /* ! CONFIG_MEMORY_HOTPLUG */
/*
* Stub functions for when hotplug is off
diff --git a/mm/cma.c b/mm/cma.c
index ea506eb..8684f50 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -38,6 +38,7 @@
#include <trace/events/cma.h>

#include "cma.h"
+#include "internal.h"

struct cma cma_areas[MAX_CMA_AREAS];
unsigned cma_area_count;
@@ -145,6 +146,11 @@ err:
static int __init cma_init_reserved_areas(void)
{
int i;
+ struct zone *zone;
+ unsigned long start_pfn = UINT_MAX, end_pfn = 0;
+
+ if (!cma_area_count)
+ return 0;

for (i = 0; i < cma_area_count; i++) {
int ret = cma_activate_area(&cma_areas[i]);
@@ -153,6 +159,41 @@ static int __init cma_init_reserved_areas(void)
return ret;
}

+ for (i = 0; i < cma_area_count; i++) {
+ if (start_pfn > cma_areas[i].base_pfn)
+ start_pfn = cma_areas[i].base_pfn;
+ if (end_pfn < cma_areas[i].base_pfn + cma_areas[i].count)
+ end_pfn = cma_areas[i].base_pfn + cma_areas[i].count;
+ }
+
+ for_each_populated_zone(zone) {
+ if (!is_zone_cma(zone))
+ continue;
+
+ /* ZONE_CMA doesn't need to exceed CMA region */
+ zone->zone_start_pfn = max(zone->zone_start_pfn, start_pfn);
+ zone->spanned_pages = min(zone_end_pfn(zone), end_pfn) -
+ zone->zone_start_pfn;
+ }
+
+ /*
+ * Reserved pages for ZONE_CMA are now activated and this would change
+ * ZONE_CMA's managed page counter and other zone's present counter.
+ * We need to re-calculate various zone information that depends on
+ * this initialization.
+ */
+ build_all_zonelists(NULL, NULL);
+ for_each_populated_zone(zone) {
+ zone_pcp_update(zone);
+ set_zone_contiguous(zone);
+ }
+
+ /*
+ * We need to re-init per zone wmark by calling
+ * init_per_zone_wmark_min() but doesn't call here because it is
+ * registered on module_init and it will be called later than us.
+ */
+
return 0;
}
core_initcall(cma_init_reserved_areas);
diff --git a/mm/internal.h b/mm/internal.h
index e30f40e..64e3131 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -156,6 +156,9 @@ extern void __free_pages_bootmem(struct page *page, unsigned long pfn,
extern void prep_compound_page(struct page *page, unsigned int order);
extern int user_min_free_kbytes;

+extern void set_zone_contiguous(struct zone *zone);
+extern void clear_zone_contiguous(struct zone *zone);
+
#if defined CONFIG_COMPACTION || defined CONFIG_CMA

/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 987a87c..0a6a195 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1408,16 +1408,38 @@ void __init page_alloc_init_late(void)
}

#ifdef CONFIG_CMA
+static void __init adjust_present_page_count(struct page *page, long count)
+{
+ struct zone *zone = page_zone(page);
+
+ /* We don't need to hold a lock since it is boot-up process */
+ zone->present_pages += count;
+}
+
/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
void __init init_cma_reserved_pageblock(struct page *page)
{
unsigned i = pageblock_nr_pages;
+ unsigned long pfn = page_to_pfn(page);
struct page *p = page;
+ int nid = page_to_nid(page);
+
+ /*
+ * ZONE_CMA will steal present pages from other zones by changing
+ * page links so page_zone() is changed. Before that,
+ * we need to adjust previous zone's page count first.
+ */
+ adjust_present_page_count(page, -pageblock_nr_pages);

do {
__ClearPageReserved(p);
set_page_count(p, 0);
- } while (++p, --i);
+
+ /* Steal pages from other zones */
+ set_page_links(p, ZONE_CMA, nid, pfn);
+ } while (++p, ++pfn, --i);
+
+ adjust_present_page_count(page, pageblock_nr_pages);

set_pageblock_migratetype(page, MIGRATE_CMA);

@@ -7396,7 +7418,7 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages)
}
#endif

-#ifdef CONFIG_MEMORY_HOTPLUG
+#if defined CONFIG_MEMORY_HOTPLUG || defined CONFIG_CMA
/*
* The zone indicated has a new number of managed_pages; batch sizes and percpu
* page high values need to be recalulated.
--
1.9.1

2016-04-25 05:21:22

by Joonsoo Kim

Subject: [PATCH v2 2/6] mm/cma: introduce new zone, ZONE_CMA

From: Joonsoo Kim <[email protected]>

Attached cover-letter:

This series tries to solve problems of the current CMA implementation.

CMA was introduced to provide physically contiguous pages at runtime
without an exclusively reserved memory area. But the current implementation
works like the previous reserved memory approach, because freepages
in the CMA region are used only if there is no movable freepage. In other
words, freepages in the CMA region are used only as a fallback. By the
time freepages in the CMA region are used as a fallback, kswapd is
easily woken up, since there are no unmovable and reclaimable
freepages left either. Once kswapd starts to reclaim memory, fallback
allocation to MIGRATE_CMA doesn't occur any more, since movable freepages
are already refilled by kswapd, and most of the freepages in the CMA
region are left free. This situation looks like the exclusively reserved
memory case.

In my experiment, I found that if the system has 1024 MB of memory and
512 MB is reserved for CMA, kswapd is mostly woken up when roughly 512 MB
of free memory is left. The detailed reason is that, to keep enough free
memory for unmovable and reclaimable allocations, kswapd uses the equation
below when calculating free memory, and it easily goes under the watermark.

Free memory for unmovable and reclaimable = Free total - Free CMA pages

This is derived from the property of CMA freepages that they
can't be used for unmovable and reclaimable allocations.

Anyway, in this case, kswapd is woken up when (FreeTotal - FreeCMA)
is lower than the low watermark and tries to free memory until
(FreeTotal - FreeCMA) is higher than the high watermark. As a result,
FreeTotal consistently hovers around the 512 MB boundary, which
means that we can't utilize the full memory capacity.

To fix this problem, I submitted some patches [1] about 10 months ago,
but found some more problems that had to be fixed first.
That approach requires many hooks in the allocator hotpath, so some
developers don't like it. Instead, some of them suggested a different
approach [2] to fix all the problems related to CMA, that is, introducing
a new zone to deal with free CMA pages. I agree that it is the best way
to go, so I implement it here. Although the properties of ZONE_MOVABLE and
ZONE_CMA are similar, I decided to add a new zone rather than piggyback on
ZONE_MOVABLE since they have some differences. First, reserved CMA pages
should not be offlined. If freepages for CMA were managed by ZONE_MOVABLE,
we would need to keep the MIGRATE_CMA migratetype and insert many hooks
into the memory hotplug code to distinguish hotpluggable memory from
memory reserved for CMA in the same zone. That would make the memory
hotplug code, which is already complicated, even more complicated.
Second, cma_alloc() can be called more frequently than memory hotplug
operations, and we may need to control the allocation rate of ZONE_CMA
to optimize latency in the future. In that case, the separate zone
approach is easier to modify. Third, I'd like to see statistics for CMA
separately. Sometimes we need to debug why cma_alloc() failed, and
separate statistics would be more helpful in that situation.

Anyway, this patchset solves four problems related to the CMA implementation.

1) Utilization problem
As mentioned above, we can't utilize the full memory capacity due to the
limitation of CMA freepages and the fallback policy. This patchset
implements a new zone for CMA and uses it for GFP_HIGHUSER_MOVABLE
requests. This type of allocation is used for page cache and anonymous
pages, which account for most memory usage in the normal case, so we can
utilize the full memory capacity. Below is the experiment result for this
problem.

8 CPUs, 1024 MB, VIRTUAL MACHINE
make -j16

<Before this series>
CMA reserve:    0 MB    512 MB
Elapsed-time:   92.4    186.5
pswpin:         82      18647
pswpout:        160     69839

<After this series>
CMA reserve:    0 MB    512 MB
Elapsed-time:   93.1    93.4
pswpin:         84      46
pswpout:        183     92

FYI, there is another attempt [3] on lkml that tries to solve this problem.
And, as far as I know, Qualcomm also has an out-of-tree solution for this
problem.

2) Reclaim problem
Currently, there is no logic to distinguish CMA pages in the reclaim path.
If reclaim is initiated for an unmovable or reclaimable allocation,
reclaiming CMA pages doesn't help to satisfy the request, so reclaiming
them is just a waste. By managing CMA pages in the new zone, we can
skip reclaiming ZONE_CMA completely when it is unnecessary.

3) Atomic allocation failure problem
Kswapd isn't woken up to reclaim pages when the allocation request is of
movable type and there are enough free pages in the CMA region. After a
bunch of consecutive movable allocation requests, free pages in the
ordinary region (not the CMA region) would be exhausted without waking up
kswapd. If an atomic unmovable allocation comes at that time, it can't
succeed since there are not enough pages in the ordinary region. This
problem was reported by Aneesh [4] and can be solved by this patchset.

4) Inefficient compaction
A usual high-order allocation request is of unmovable type and cannot
be serviced from the CMA area. In compaction, the migration scanner
doesn't distinguish migratable pages in the CMA area and migrates them
anyway. In this case, even if we make a high-order page in that region,
it cannot be used due to the type mismatch. This patchset solves this
problem by separating CMA pages from the ordinary zones.

[1] https://lkml.org/lkml/2014/5/28/64
[2] https://lkml.org/lkml/2014/11/4/55
[3] https://lkml.org/lkml/2014/10/15/623
[4] http://www.spinics.net/lists/linux-mm/msg100562.html
[5] https://lkml.org/lkml/2014/5/30/320

For this patch:

Currently, reserved pages for CMA are managed together with normal pages.
To distinguish them, we use a migratetype, MIGRATE_CMA, and
do special handling for this migratetype. But it turns out that
there are too many problems with this approach, and fixing all of them
needs many more hooks in the page allocation and reclaim paths, so
some developers have expressed their discomfort and the problems with CMA
haven't been fixed for a long time.

To end this situation and fix the CMA problems, this patch implements
ZONE_CMA. Reserved pages for CMA will be managed in this new zone. This
approach will remove all existing hooks for MIGRATE_CMA and many
problems related to the CMA implementation will be solved.

This patch only adds the basic infrastructure of ZONE_CMA. In the following
patch, ZONE_CMA is actually populated and used.

Adding a new zone could cause two possible problems. One is the overflow
of page flags and the other is the GFP_ZONE_TABLE issue.

The following is the page-flags layout described in page-flags-layout.h.

1. No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |             ... | FLAGS |
2.      " plus space for last_cpupid: |       NODE     | ZONE | LAST_CPUPID ... | FLAGS |
3. classic sparse with space for node:| SECTION | NODE | ZONE |             ... | FLAGS |
4.      " plus space for last_cpupid: | SECTION | NODE | ZONE | LAST_CPUPID ... | FLAGS |
5. classic sparse no space for node:  | SECTION |     ZONE    | ... | FLAGS |

There is no problem in the #1 and #2 configurations on 64-bit systems. There
is enough room even for an extremely large x86_64 system. A 32-bit system
would not have many nodes, so it would have no problem either.
Systems with the #3, #4 and #5 configurations could be affected by this zone
addition, but thanks to the recent THP rework, which freed one page flag,
the problem surface would be small. In some configurations a problem is
still possible, but it highly depends on the individual configuration,
so the impact cannot be easily estimated. I guess that a usual system
with CONFIG_CMA would not be affected. If there is a problem,
we can adjust the section width or node width for that architecture.
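
The page-flags concern comes down to how many bits the ZONE field needs.
A small sketch of that arithmetic follows; zone_field_bits() is a name
made up for this example and only computes ceil(log2(nr_zones)), it is
not the kernel's macro.

```c
#include <assert.h>

/* Bits needed to encode zone indices 0..nr_zones-1. */
unsigned int zone_field_bits(unsigned int nr_zones)
{
	unsigned int bits = 0;

	assert(nr_zones >= 1);
	while ((1u << bits) < nr_zones)
		bits++;
	return bits;
}
```

A configuration that had four zones needs two bits, and going to five
zones needs three, so the extra bit is only required when the zone count
crosses a power of two.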

Currently, GFP_ZONE_TABLE is a 32-bit value so that 32-bit bit operations
can be used on 32-bit systems. If we add one more zone, it will be 48 bits
wide and 32-bit bit operations are no longer possible. Although this
causes slight overhead, there is no other way, so this patch relaxes
GFP_ZONE_TABLE's 32-bit limitation. 32-bit systems with CONFIG_CMA will be
affected by this change, but the impact would be marginal.
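
The 48-bit figure comes from 16 table entries times 3 bits per entry.
Below is a user-space sketch of such a packed table. The zone numbering,
the F_* names and zone_for_flags() are assumptions of this sketch for a
32-bit system with five zones; only the flag bit values mirror the low
four gfp bits.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative zone numbering; not copied from mmzone.h. */
enum zone_sketch { Z_DMA, Z_NORMAL, Z_HIGHMEM, Z_MOVABLE, Z_CMA };

/* Low gfp bits in the roles of DMA, HIGHMEM, DMA32 and MOVABLE. */
#define F_DMA      0x1u
#define F_HIGHMEM  0x2u
#define F_DMA32    0x4u
#define F_MOVABLE  0x8u

#define ZSHIFT 3	/* five zones need 3 bits per entry */

/* 16 entries * 3 bits = 48 bits, so each entry is widened to 64 bits. */
#define ENTRY(zone, flags) ((uint64_t)(zone) << ((flags) * ZSHIFT))

static const uint64_t zone_table =
	  ENTRY(Z_NORMAL, 0)
	| ENTRY(Z_DMA, F_DMA)
	| ENTRY(Z_HIGHMEM, F_HIGHMEM)
	| ENTRY(Z_NORMAL, F_DMA32)	/* no DMA32 zone in this sketch */
	| ENTRY(Z_NORMAL, F_MOVABLE)
	| ENTRY(Z_DMA, F_MOVABLE | F_DMA)
	| ENTRY(Z_CMA, F_MOVABLE | F_HIGHMEM)
	| ENTRY(Z_NORMAL, F_MOVABLE | F_DMA32);

/* Look up the zone for the low four gfp bits. */
enum zone_sketch zone_for_flags(unsigned int flags)
{
	assert(flags < 16);
	return (enum zone_sketch)((zone_table >> (flags * ZSHIFT)) &
				  ((1u << ZSHIFT) - 1));
}
```

The entry for F_MOVABLE | F_DMA32 sits at bit 36, which is why a 32-bit
unsigned long cannot hold the table once the shift becomes 3. Only the
MOVABLE plus HIGHMEM combination maps to the CMA zone here.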

Note that there are many checkpatch warnings, but I think that the current
code is better for readability than fixing them up.

Signed-off-by: Joonsoo Kim <[email protected]>
---
arch/x86/mm/highmem_32.c | 8 +++++
include/linux/gfp.h | 29 +++++++++++-------
include/linux/mempolicy.h | 2 +-
include/linux/mmzone.h | 31 ++++++++++++++++++-
include/linux/vm_event_item.h | 10 ++++++-
include/trace/events/compaction.h | 10 ++++++-
kernel/power/snapshot.c | 8 +++++
mm/memory_hotplug.c | 3 ++
mm/page_alloc.c | 63 +++++++++++++++++++++++++++++++++------
mm/vmstat.c | 9 +++++-
10 files changed, 148 insertions(+), 25 deletions(-)

diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c
index a6d7392..a7fcb12 100644
--- a/arch/x86/mm/highmem_32.c
+++ b/arch/x86/mm/highmem_32.c
@@ -120,6 +120,14 @@ void __init set_highmem_pages_init(void)
if (!is_highmem(zone))
continue;

+ /*
+ * ZONE_CMA is a special zone that should not be
+ * participated in initialization because it's pages
+ * would be initialized by initialization of other zones.
+ */
+ if (is_zone_cma(zone))
+ continue;
+
zone_start_pfn = zone->zone_start_pfn;
zone_end_pfn = zone_start_pfn + zone->spanned_pages;

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a..4d6c008 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -301,6 +301,12 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
#define OPT_ZONE_DMA32 ZONE_NORMAL
#endif

+#ifdef CONFIG_CMA
+#define OPT_ZONE_CMA ZONE_CMA
+#else
+#define OPT_ZONE_CMA ZONE_MOVABLE
+#endif
+
/*
* GFP_ZONE_TABLE is a word size bitstring that is used for looking up the
* zone to use given the lowest 4 bits of gfp_t. Entries are ZONE_SHIFT long
@@ -331,7 +337,6 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
* 0xe => BAD (MOVABLE+DMA32+HIGHMEM)
* 0xf => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
*
- * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms.
*/

#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4
@@ -341,19 +346,21 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
#define GFP_ZONES_SHIFT ZONES_SHIFT
#endif

-#if 16 * GFP_ZONES_SHIFT > BITS_PER_LONG
-#error GFP_ZONES_SHIFT too large to create GFP_ZONE_TABLE integer
+#if !defined(CONFIG_64BITS) && GFP_ZONES_SHIFT > 2
+#define GFP_ZONE_TABLE_CAST unsigned long long
+#else
+#define GFP_ZONE_TABLE_CAST unsigned long
#endif

#define GFP_ZONE_TABLE ( \
- (ZONE_NORMAL << 0 * GFP_ZONES_SHIFT) \
- | (OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT) \
- | (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT) \
- | (OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT) \
- | (ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT) \
- | (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT) \
- | (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT)\
- | (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT)\
+ ((GFP_ZONE_TABLE_CAST) ZONE_NORMAL << 0 * GFP_ZONES_SHIFT) \
+ | ((GFP_ZONE_TABLE_CAST) OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT) \
+ | ((GFP_ZONE_TABLE_CAST) OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT) \
+ | ((GFP_ZONE_TABLE_CAST) OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT) \
+ | ((GFP_ZONE_TABLE_CAST) ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT) \
+ | ((GFP_ZONE_TABLE_CAST) OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT) \
+ | ((GFP_ZONE_TABLE_CAST) OPT_ZONE_CMA << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT) \
+ | ((GFP_ZONE_TABLE_CAST) OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT) \
)

/*
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 4429d25..c4cc86e 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -157,7 +157,7 @@ extern enum zone_type policy_zone;

static inline void check_highest_zone(enum zone_type k)
{
- if (k > policy_zone && k != ZONE_MOVABLE)
+ if (k > policy_zone && k != ZONE_MOVABLE && !is_zone_cma_idx(k))
policy_zone = k;
}

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f4ae0abb..5c97ba9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -322,6 +322,9 @@ enum zone_type {
ZONE_HIGHMEM,
#endif
ZONE_MOVABLE,
+#ifdef CONFIG_CMA
+ ZONE_CMA,
+#endif
#ifdef CONFIG_ZONE_DEVICE
ZONE_DEVICE,
#endif
@@ -812,11 +815,37 @@ static inline int zone_movable_is_highmem(void)
}
#endif

+static inline int is_zone_cma_idx(enum zone_type idx)
+{
+#ifdef CONFIG_CMA
+ return idx == ZONE_CMA;
+#else
+ return 0;
+#endif
+}
+
+static inline int is_zone_cma(struct zone *zone)
+{
+ int zone_idx = zone_idx(zone);
+
+ return is_zone_cma_idx(zone_idx);
+}
+
+static inline int zone_cma_is_highmem(void)
+{
+#ifdef CONFIG_HIGHMEM
+ return 1;
+#else
+ return 0;
+#endif
+}
+
static inline int is_highmem_idx(enum zone_type idx)
{
#ifdef CONFIG_HIGHMEM
return (idx == ZONE_HIGHMEM ||
- (idx == ZONE_MOVABLE && zone_movable_is_highmem()));
+ (idx == ZONE_MOVABLE && zone_movable_is_highmem()) ||
+ (is_zone_cma_idx(idx) && zone_cma_is_highmem()));
#else
return 0;
#endif
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 9ec2940..8e25ba5 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -19,7 +19,15 @@
#define HIGHMEM_ZONE(xx)
#endif

-#define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, HIGHMEM_ZONE(xx) xx##_MOVABLE
+#ifdef CONFIG_CMA
+#define MOVABLE_ZONE(xx) xx##_MOVABLE,
+#define CMA_ZONE(xx) xx##_CMA
+#else
+#define MOVABLE_ZONE(xx) xx##_MOVABLE
+#define CMA_ZONE(xx)
+#endif
+
+#define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, HIGHMEM_ZONE(xx) MOVABLE_ZONE(xx) CMA_ZONE(xx)

enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index 36e2d6f..9d3b254 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -38,12 +38,20 @@
#define IFDEF_ZONE_HIGHMEM(X)
#endif

+#ifdef CONFIG_CMA
+#define IFDEF_ZONE_CMA(X, Y, Z) X Z
+#else
+#define IFDEF_ZONE_CMA(X, Y, Z) Y
+#endif
+
#define ZONE_TYPE \
IFDEF_ZONE_DMA( EM (ZONE_DMA, "DMA")) \
IFDEF_ZONE_DMA32( EM (ZONE_DMA32, "DMA32")) \
EM (ZONE_NORMAL, "Normal") \
IFDEF_ZONE_HIGHMEM( EM (ZONE_HIGHMEM,"HighMem")) \
- EMe(ZONE_MOVABLE,"Movable")
+ IFDEF_ZONE_CMA( EM (ZONE_MOVABLE,"Movable"), \
+ EMe(ZONE_MOVABLE,"Movable"), \
+ EMe(ZONE_CMA, "CMA"))

/*
* First define the enums in the above macros to be exported to userspace
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 3a97060..e8a7d8f 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1042,6 +1042,14 @@ unsigned int snapshot_additional_pages(struct zone *zone)
{
unsigned int rtree, nodes;

+ /*
+ * Estimation of needed pages for ZONE_CMA is already considered
+ * when calculating other zones since span of ZONE_CMA is subset
+ * of other zones.
+ */
+ if (is_zone_cma(zone))
+ return 0;
+
rtree = nodes = DIV_ROUND_UP(zone->spanned_pages, BM_BITS_PER_BLOCK);
rtree += DIV_ROUND_UP(rtree * sizeof(struct rtree_node),
LINKED_PAGE_DATA_SIZE);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index caf2a14..354fa9c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1808,6 +1808,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
if (zone_idx(zone) <= ZONE_NORMAL && !can_offline_normal(zone, nr_pages))
return -EINVAL;

+ if (is_zone_cma(zone))
+ return -EINVAL;
+
/* set above range as isolated */
ret = start_isolate_page_range(start_pfn, end_pfn,
MIGRATE_MOVABLE, true);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ffa93e0..987a87c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -202,6 +202,9 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
32,
#endif
32,
+#ifdef CONFIG_CMA
+ 32,
+#endif
};

EXPORT_SYMBOL(totalram_pages);
@@ -218,6 +221,9 @@ static char * const zone_names[MAX_NR_ZONES] = {
"HighMem",
#endif
"Movable",
+#ifdef CONFIG_CMA
+ "CMA",
+#endif
#ifdef CONFIG_ZONE_DEVICE
"Device",
#endif
@@ -4896,6 +4902,15 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
struct memblock_region *r = NULL, *tmp;
#endif

+ /*
+ * Physical pages for ZONE_CMA belong to other zones now. They
+ * are initialized when corresponding zone is initialized and they
+ * will be moved to ZONE_CMA later. Zone information will also be
+ * adjusted later.
+ */
+ if (is_zone_cma_idx(zone))
+ return;
+
if (highest_memmap_pfn < end_pfn - 1)
highest_memmap_pfn = end_pfn - 1;

@@ -5332,7 +5347,7 @@ static void __init find_usable_zone_for_movable(void)
{
int zone_index;
for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) {
- if (zone_index == ZONE_MOVABLE)
+ if (zone_index == ZONE_MOVABLE || is_zone_cma_idx(zone_index))
continue;

if (arch_zone_highest_possible_pfn[zone_index] >
@@ -5541,6 +5556,8 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
unsigned long *zholes_size)
{
unsigned long realtotalpages = 0, totalpages = 0;
+ unsigned long zone_cma_start_pfn = ULONG_MAX;
+ unsigned long zone_cma_end_pfn = 0;
enum zone_type i;

for (i = 0; i < MAX_NR_ZONES; i++) {
@@ -5548,6 +5565,13 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
unsigned long zone_start_pfn, zone_end_pfn;
unsigned long size, real_size;

+ if (is_zone_cma_idx(i)) {
+ zone->zone_start_pfn = zone_cma_start_pfn;
+ size = zone_cma_end_pfn - zone_cma_start_pfn;
+ real_size = 0;
+ goto init_zone;
+ }
+
size = zone_spanned_pages_in_node(pgdat->node_id, i,
node_start_pfn,
node_end_pfn,
@@ -5557,13 +5581,23 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
real_size = size - zone_absent_pages_in_node(pgdat->node_id, i,
node_start_pfn, node_end_pfn,
zholes_size);
- if (size)
+ if (size) {
zone->zone_start_pfn = zone_start_pfn;
- else
+ if (zone_cma_start_pfn > zone_start_pfn)
+ zone_cma_start_pfn = zone_start_pfn;
+ if (zone_cma_end_pfn < zone_start_pfn + size)
+ zone_cma_end_pfn = zone_start_pfn + size;
+ } else
zone->zone_start_pfn = 0;
+
+init_zone:
zone->spanned_pages = size;
zone->present_pages = real_size;

+ /* Avoid over-counting the node span */
+ if (is_zone_cma_idx(i))
+ size = 0;
+
totalpages += size;
realtotalpages += real_size;
}
@@ -5705,6 +5739,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
struct zone *zone = pgdat->node_zones + j;
unsigned long size, realsize, freesize, memmap_pages;
unsigned long zone_start_pfn = zone->zone_start_pfn;
+ bool zone_kernel = !is_highmem_idx(j) && !is_zone_cma_idx(j);

size = zone->spanned_pages;
realsize = freesize = zone->present_pages;
@@ -5715,7 +5750,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
* and per-cpu initialisations
*/
memmap_pages = calc_memmap_size(size, realsize);
- if (!is_highmem_idx(j)) {
+ if (zone_kernel) {
if (freesize >= memmap_pages) {
freesize -= memmap_pages;
if (memmap_pages)
@@ -5734,7 +5769,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
zone_names[0], dma_reserve);
}

- if (!is_highmem_idx(j))
+ if (zone_kernel)
nr_kernel_pages += freesize;
/* Charge for highmem memmap if there are enough kernel pages */
else if (nr_kernel_pages > memmap_pages * 2)
@@ -5746,7 +5781,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
* when the bootmem allocator frees pages into the buddy system.
* And all highmem pages will be managed by the buddy system.
*/
- zone->managed_pages = is_highmem_idx(j) ? realsize : freesize;
+ zone->managed_pages = zone_kernel ? freesize : realsize;
#ifdef CONFIG_NUMA
zone->node = nid;
setup_min_unmapped_ratio(zone);
@@ -5763,7 +5798,12 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);

lruvec_init(&zone->lruvec);
- if (!size)
+
+ /*
+ * ZONE_CMA should be initialized even if it has no present
+ * page now since pages will be moved to the zone later.
+ */
+ if (!size && !is_zone_cma_idx(j))
continue;

set_pageblock_order();
@@ -6217,7 +6257,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
arch_zone_lowest_possible_pfn[0] = find_min_pfn_with_active_regions();
arch_zone_highest_possible_pfn[0] = max_zone_pfn[0];
for (i = 1; i < MAX_NR_ZONES; i++) {
- if (i == ZONE_MOVABLE)
+ if (i == ZONE_MOVABLE || is_zone_cma_idx(i))
continue;
arch_zone_lowest_possible_pfn[i] =
arch_zone_highest_possible_pfn[i-1];
@@ -6234,7 +6274,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
/* Print out the zone ranges */
pr_info("Zone ranges:\n");
for (i = 0; i < MAX_NR_ZONES; i++) {
- if (i == ZONE_MOVABLE)
+ if (i == ZONE_MOVABLE || is_zone_cma_idx(i))
continue;
pr_info(" %-8s ", zone_names[i]);
if (arch_zone_lowest_possible_pfn[i] ==
@@ -7048,6 +7088,11 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
*/
if (zone_idx(zone) == ZONE_MOVABLE)
return false;
+
+ /* ZONE_CMA never contains unmovable pages */
+ if (is_zone_cma(zone))
+ return false;
+
mt = get_pageblock_migratetype(page);
if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
return false;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 070fd90..e8c46ad 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -710,8 +710,15 @@ int fragmentation_index(struct zone *zone, unsigned int order)
#define TEXT_FOR_HIGHMEM(xx)
#endif

+#ifdef CONFIG_CMA
+#define TEXT_FOR_CMA(xx) xx "_cma",
+#else
+#define TEXT_FOR_CMA(xx)
+#endif
+
#define TEXTS_FOR_ZONES(xx) TEXT_FOR_DMA(xx) TEXT_FOR_DMA32(xx) xx "_normal", \
- TEXT_FOR_HIGHMEM(xx) xx "_movable",
+ TEXT_FOR_HIGHMEM(xx) xx "_movable", \
+ TEXT_FOR_CMA(xx)

const char * const vmstat_text[] = {
/* enum zone_stat_item countes */
--
1.9.1
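
The span handling in calculate_node_totalpages() above can be sketched in isolation as follows. This is only an illustration: struct zone_span and both helper names are hypothetical stand-ins, not kernel types, and the guard for a node with no populated zone is an addition of the sketch rather than something the patch does.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the span fields of struct zone. */
struct zone_span {
	unsigned long start_pfn;
	unsigned long spanned_pages;
};

/*
 * Span of a ZONE_CMA-like zone: the smallest PFN range that covers every
 * populated zone of the node. Zones with spanned_pages == 0 are skipped.
 */
static struct zone_span cma_span_from_zones(const struct zone_span *zones,
					    size_t nr)
{
	unsigned long start = ULONG_MAX, end = 0;
	struct zone_span cma = { 0, 0 };
	size_t i;

	for (i = 0; i < nr; i++) {
		unsigned long zone_end;

		if (!zones[i].spanned_pages)
			continue;

		zone_end = zones[i].start_pfn + zones[i].spanned_pages;
		if (zones[i].start_pfn < start)
			start = zones[i].start_pfn;
		if (zone_end > end)
			end = zone_end;
	}

	/* Sketch-only guard: leave the span empty if no zone is populated. */
	if (end > start) {
		cma.start_pfn = start;
		cma.spanned_pages = end - start;
	}
	return cma;
}

/*
 * Node total: only the ordinary zones are summed. The CMA span overlaps
 * them, so adding it as well would count the same PFNs twice.
 */
static unsigned long node_spanned_pages(const struct zone_span *zones,
					size_t nr)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		total += zones[i].spanned_pages;
	return total;
}
```

The CMA span is deliberately left out of the node total, which mirrors the "size = 0" assignment for ZONE_CMA in the hunk above.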

2016-04-25 05:21:32

by Joonsoo Kim

Subject: [PATCH v2 4/6] mm/cma: remove ALLOC_CMA

From: Joonsoo Kim <[email protected]>

Now all reserved pages for the CMA region belong to ZONE_CMA, which
only serves GFP_HIGHUSER_MOVABLE requests. Therefore, we don't need to
consider ALLOC_CMA at all.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/internal.h | 3 +--
mm/page_alloc.c | 18 ++----------------
2 files changed, 3 insertions(+), 18 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 64e3131..a25d45b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -478,8 +478,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
-#define ALLOC_CMA 0x80 /* allow allocations from CMA areas */
-#define ALLOC_FAIR 0x100 /* fair zone allocation */
+#define ALLOC_FAIR 0x80 /* fair zone allocation */

enum ttu_flags;
struct tlbflush_unmap_batch;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0a6a195..69546b7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2582,12 +2582,6 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
else
min -= min / 4;

-#ifdef CONFIG_CMA
- /* If allocation can't use CMA areas don't use free CMA pages */
- if (!(alloc_flags & ALLOC_CMA))
- free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
-
/*
* Check watermarks for an order-0 allocation request. If these
* are not met, then a high-order request also cannot go ahead
@@ -2617,10 +2611,8 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
}

#ifdef CONFIG_CMA
- if ((alloc_flags & ALLOC_CMA) &&
- !list_empty(&area->free_list[MIGRATE_CMA])) {
+ if (!list_empty(&area->free_list[MIGRATE_CMA]))
return true;
- }
#endif
}
return false;
@@ -3217,10 +3209,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
unlikely(test_thread_flag(TIF_MEMDIE))))
alloc_flags |= ALLOC_NO_WATERMARKS;
}
-#ifdef CONFIG_CMA
- if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
- alloc_flags |= ALLOC_CMA;
-#endif
+
return alloc_flags;
}

@@ -3573,9 +3562,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
if (unlikely(!zonelist->_zonerefs->zone))
return NULL;

- if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
- alloc_flags |= ALLOC_CMA;
-
retry_cpuset:
cpuset_mems_cookie = read_mems_allowed_begin();

--
1.9.1
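
The effect of this patch on the order-0 watermark check can be sketched as below. This is a simplified illustration, not the kernel function: the names are hypothetical, SKETCH_ALLOC_CMA merely mirrors the removed flag, and the lowmem_reserve term and the high-order loop of __zone_watermark_ok() are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_ALLOC_CMA 0x80	/* mirrors the removed ALLOC_CMA flag */

/*
 * Before: free CMA pages share a zone with ordinary pages, so they have
 * to be hidden from requests that are not allowed to use them.
 */
static bool watermark_ok_before(long free_pages, long free_cma, long min,
				int alloc_flags)
{
	if (!(alloc_flags & SKETCH_ALLOC_CMA))
		free_pages -= free_cma;
	return free_pages > min;
}

/*
 * After: CMA pages live in their own zone, so the free count of any
 * other zone contains no CMA pages and needs no correction.
 */
static bool watermark_ok_after(long free_pages, long min)
{
	return free_pages > min;
}
```

With 1000 free pages of which 900 are CMA and a minimum of 200, the old check fails for a non-movable request and passes for a movable one. After the series, the same zone would simply report 100 free pages and the check needs no flag.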

2016-04-25 05:21:39

by Joonsoo Kim

Subject: [PATCH v2 6/6] mm/cma: remove per zone CMA stat

From: Joonsoo Kim <[email protected]>

Now all reserved pages for the CMA region belong to ZONE_CMA, so we
don't need to maintain CMA stats in other zones. Remove them.

Signed-off-by: Joonsoo Kim <[email protected]>
---
fs/proc/meminfo.c | 2 +-
include/linux/cma.h | 6 ++++++
include/linux/mmzone.h | 1 -
mm/cma.c | 15 +++++++++++++++
mm/page_alloc.c | 5 ++---
mm/vmstat.c | 1 -
6 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index ae5cc52..51449d0 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -172,7 +172,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#endif
#ifdef CONFIG_CMA
, K(totalcma_pages)
- , K(global_page_state(NR_FREE_CMA_PAGES))
+ , K(cma_get_free())
#endif
);

diff --git a/include/linux/cma.h b/include/linux/cma.h
index 29f9e77..816290c 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -28,4 +28,10 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
struct cma **res_cma);
extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align);
extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count);
+
+#ifdef CONFIG_CMA
+extern unsigned long cma_get_free(void);
+#else
+static inline unsigned long cma_get_free(void) { return 0; }
+#endif
#endif
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 75b41c5..3996a7c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -140,7 +140,6 @@ enum zone_stat_item {
NR_SHMEM_HUGEPAGES, /* transparent shmem huge pages */
NR_SHMEM_PMDMAPPED, /* shmem huge pages currently mapped hugely */
NR_SHMEM_FREEHOLES, /* unused memory of high-order allocations */
- NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };

/*
diff --git a/mm/cma.c b/mm/cma.c
index bd436e4..6dbddf2 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -54,6 +54,21 @@ unsigned long cma_get_size(const struct cma *cma)
return cma->count << PAGE_SHIFT;
}

+unsigned long cma_get_free(void)
+{
+ struct zone *zone;
+ unsigned long freecma = 0;
+
+ for_each_populated_zone(zone) {
+ if (!is_zone_cma(zone))
+ continue;
+
+ freecma += zone_page_state(zone, NR_FREE_PAGES);
+ }
+
+ return freecma;
+}
+
static unsigned long cma_bitmap_aligned_mask(const struct cma *cma,
int align_order)
{
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 51b2b0c..570edad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -63,6 +63,7 @@
#include <linux/sched/rt.h>
#include <linux/page_owner.h>
#include <linux/kthread.h>
+#include <linux/cma.h>

#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -4107,7 +4108,7 @@ void show_free_areas(unsigned int filter)
global_page_state(NR_SHMEM_FREEHOLES),
global_page_state(NR_FREE_PAGES),
free_pcp,
- global_page_state(NR_FREE_CMA_PAGES));
+ cma_get_free());

for_each_populated_zone(zone) {
int i;
@@ -4150,7 +4151,6 @@ void show_free_areas(unsigned int filter)
" bounce:%lukB"
" free_pcp:%lukB"
" local_pcp:%ukB"
- " free_cma:%lukB"
" writeback_tmp:%lukB"
" pages_scanned:%lu"
" all_unreclaimable? %s"
@@ -4188,7 +4188,6 @@ void show_free_areas(unsigned int filter)
K(zone_page_state(zone, NR_BOUNCE)),
K(free_pcp),
K(this_cpu_read(zone->pageset->pcp.count)),
- K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
K(zone_page_state(zone, NR_PAGES_SCANNED)),
(!zone_reclaimable(zone) ? "yes" : "no")
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 39a0c3c..81acdae 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -766,7 +766,6 @@ const char * const vmstat_text[] = {
"nr_shmem_hugepages",
"nr_shmem_pmdmapped",
"nr_shmem_freeholes",
- "nr_free_cma",

/* enum writeback_stat_item counters */
"nr_dirty_threshold",
--
1.9.1
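
What cma_get_free() computes can be sketched outside the kernel like this. struct zone_sketch and the helper names are hypothetical; the populated check stands in for for_each_populated_zone(), and the kB conversion assumes 4 KiB pages (PAGE_SHIFT of 12), as the K() macro in fs/proc/meminfo.c would do on such a configuration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical, simplified stand-in for struct zone. */
struct zone_sketch {
	bool is_cma;			/* what is_zone_cma() would report */
	unsigned long present_pages;	/* 0 means the zone is not populated */
	unsigned long nr_free_pages;	/* the zone's NR_FREE_PAGES counter */
};

/* Sum the free pages of every populated CMA zone. */
static unsigned long cma_free_sketch(const struct zone_sketch *zones, size_t nr)
{
	unsigned long freecma = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (!zones[i].present_pages)	/* skip unpopulated zones */
			continue;
		if (!zones[i].is_cma)
			continue;
		freecma += zones[i].nr_free_pages;
	}
	return freecma;
}

/* Convert a page count to kB for reporting. */
static unsigned long pages_to_kb(unsigned long pages)
{
	return pages << (SKETCH_PAGE_SHIFT - 10);
}
```

Because every free page in a CMA zone is a free CMA page, no dedicated NR_FREE_CMA_PAGES counter is needed; the per-zone free count already carries the information.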

2016-04-25 05:21:58

by Joonsoo Kim

Subject: [PATCH v2 5/6] mm/cma: remove MIGRATE_CMA

From: Joonsoo Kim <[email protected]>

Now all reserved pages for the CMA region belong to ZONE_CMA, and the
zone contains no other type of pages. Therefore, we don't need
MIGRATE_CMA to distinguish CMA pages from ordinary pages and handle
them differently. Remove MIGRATE_CMA.

Unfortunately, this patch makes the free CMA counter incorrect because
it is only updated while pages are on the MIGRATE_CMA free list. This
will be fixed by the next patch. I could squash the next patch into this
one, but that would make the change complicated and hard to review, so
I keep them separate.

Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/gfp.h | 3 +-
include/linux/mmzone.h | 22 ------------
include/linux/vmstat.h | 8 -----
mm/cma.c | 2 +-
mm/compaction.c | 10 ++----
mm/hugetlb.c | 2 +-
mm/page_alloc.c | 90 ++++++++++++++------------------------------------
mm/page_isolation.c | 5 ++-
mm/vmstat.c | 5 +--
9 files changed, 32 insertions(+), 115 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4d6c008..1a3b869 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -559,8 +559,7 @@ static inline bool pm_suspended_storage(void)

#if (defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || defined(CONFIG_CMA)
/* The below functions must be run on a range from a single zone. */
-extern int alloc_contig_range(unsigned long start, unsigned long end,
- unsigned migratetype);
+extern int alloc_contig_range(unsigned long start, unsigned long end);
extern void free_contig_range(unsigned long pfn, unsigned nr_pages);
#endif

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5c97ba9..75b41c5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -41,22 +41,6 @@ enum {
MIGRATE_RECLAIMABLE,
MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
-#ifdef CONFIG_CMA
- /*
- * MIGRATE_CMA migration type is designed to mimic the way
- * ZONE_MOVABLE works. Only movable pages can be allocated
- * from MIGRATE_CMA pageblocks and page allocator never
- * implicitly change migration type of MIGRATE_CMA pageblock.
- *
- * The way to use it is to change migratetype of a range of
- * pageblocks to MIGRATE_CMA which can be done by
- * __free_pageblock_cma() function. What is important though
- * is that a range of pageblocks must be aligned to
- * MAX_ORDER_NR_PAGES should biggest page be bigger then
- * a single pageblock.
- */
- MIGRATE_CMA,
-#endif
#ifdef CONFIG_MEMORY_ISOLATION
MIGRATE_ISOLATE, /* can't allocate from here */
#endif
@@ -66,12 +50,6 @@ enum {
/* In mm/page_alloc.c; keep in sync also with show_migration_types() there */
extern char * const migratetype_names[MIGRATE_TYPES];

-#ifdef CONFIG_CMA
-# define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
-#else
-# define is_migrate_cma(migratetype) false
-#endif
-
#define for_each_migratetype_order(order, type) \
for (order = 0; order < MAX_ORDER; order++) \
for (type = 0; type < MIGRATE_TYPES; type++)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 02fce41..6ddf080 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -260,14 +260,6 @@ static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }
#endif /* CONFIG_SMP */

-static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
- int migratetype)
-{
- __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
- if (is_migrate_cma(migratetype))
- __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
-}
-
extern const char * const vmstat_text[];

#endif /* _LINUX_VMSTAT_H */
diff --git a/mm/cma.c b/mm/cma.c
index 8684f50..bd436e4 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -444,7 +444,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align)

pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
mutex_lock(&cma_mutex);
- ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA);
+ ret = alloc_contig_range(pfn, pfn + count);
mutex_unlock(&cma_mutex);
if (ret == 0) {
page = pfn_to_page(pfn);
diff --git a/mm/compaction.c b/mm/compaction.c
index 315e5d5..91e0969 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -76,7 +76,7 @@ static void map_pages(struct list_head *list)

static inline bool migrate_async_suitable(int migratetype)
{
- return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
+ return migratetype == MIGRATE_MOVABLE;
}

#ifdef CONFIG_COMPACTION
@@ -965,7 +965,7 @@ static bool suitable_migration_target(struct page *page)
return false;
}

- /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
+ /* If the block is MIGRATE_MOVABLE, allow migration */
if (migrate_async_suitable(get_pageblock_migratetype(page)))
return true;

@@ -1329,12 +1329,6 @@ static enum compact_result __compact_finished(struct zone *zone, struct compact_
if (!list_empty(&area->free_list[migratetype]))
return COMPACT_PARTIAL;

-#ifdef CONFIG_CMA
- /* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
- if (migratetype == MIGRATE_MOVABLE &&
- !list_empty(&area->free_list[MIGRATE_CMA]))
- return COMPACT_PARTIAL;
-#endif
/*
* Job done if allocation would steal freepages from
* other migratetype buddy lists.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 949d806..b7ca28f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1028,7 +1028,7 @@ static int __alloc_gigantic_page(unsigned long start_pfn,
unsigned long nr_pages)
{
unsigned long end_pfn = start_pfn + nr_pages;
- return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
+ return alloc_contig_range(start_pfn, end_pfn);
}

static bool pfn_range_valid_gigantic(struct zone *z,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 69546b7..51b2b0c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -124,8 +124,8 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
* put on a pcplist. Used to avoid the pageblock migratetype lookup when
* freeing from pcplists in most cases, at the cost of possibly becoming stale.
* Also the migratetype set in the page does not necessarily match the pcplist
- * index, e.g. page might have MIGRATE_CMA set but be on a pcplist with any
- * other index - this ensures that it will be put on the correct CMA freelist.
+ * index, e.g. page might have MIGRATE_MOVABLE set but be on a pcplist with any
+ * other index - this ensures that it will be put on the correct freelist.
*/
static inline int get_pcppage_migratetype(struct page *page)
{
@@ -234,9 +234,6 @@ char * const migratetype_names[MIGRATE_TYPES] = {
"Movable",
"Reclaimable",
"HighAtomic",
-#ifdef CONFIG_CMA
- "CMA",
-#endif
#ifdef CONFIG_MEMORY_ISOLATION
"Isolate",
#endif
@@ -580,7 +577,7 @@ static inline void set_page_guard(struct zone *zone, struct page *page,
INIT_LIST_HEAD(&page->lru);
set_page_private(page, order);
/* Guard pages are not available for any usage */
- __mod_zone_freepage_state(zone, -(1 << order), migratetype);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));
}

static inline void clear_page_guard(struct zone *zone, struct page *page,
@@ -596,7 +593,7 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,

set_page_private(page, 0);
if (!is_migrate_isolate(migratetype))
- __mod_zone_freepage_state(zone, (1 << order), migratetype);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, (1 << order));
}
#else
struct page_ext_operations debug_guardpage_ops = { NULL, };
@@ -707,7 +704,7 @@ static inline void __free_one_page(struct page *page,

VM_BUG_ON(migratetype == -1);
if (likely(!is_migrate_isolate(migratetype)))
- __mod_zone_freepage_state(zone, 1 << order, migratetype);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);

page_idx = pfn & ((1 << MAX_ORDER) - 1);

@@ -1416,7 +1413,7 @@ static void __init adjust_present_page_count(struct page *page, long count)
zone->present_pages += count;
}

-/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
+/* Free whole pageblock and set its migration type to MIGRATE_MOVABLE. */
void __init init_cma_reserved_pageblock(struct page *page)
{
unsigned i = pageblock_nr_pages;
@@ -1441,7 +1438,7 @@ void __init init_cma_reserved_pageblock(struct page *page)

adjust_present_page_count(page, pageblock_nr_pages);

- set_pageblock_migratetype(page, MIGRATE_CMA);
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);

if (pageblock_order >= MAX_ORDER) {
i = pageblock_nr_pages;
@@ -1627,25 +1624,11 @@ static int fallbacks[MIGRATE_TYPES][4] = {
[MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
[MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
-#ifdef CONFIG_CMA
- [MIGRATE_CMA] = { MIGRATE_TYPES }, /* Never used */
-#endif
#ifdef CONFIG_MEMORY_ISOLATION
[MIGRATE_ISOLATE] = { MIGRATE_TYPES }, /* Never used */
#endif
};

-#ifdef CONFIG_CMA
-static struct page *__rmqueue_cma_fallback(struct zone *zone,
- unsigned int order)
-{
- return __rmqueue_smallest(zone, order, MIGRATE_CMA);
-}
-#else
-static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
- unsigned int order) { return NULL; }
-#endif
-
/*
* Move the free pages in a range to the free lists of the requested type.
* Note that start_page and end_pages are not aligned on a pageblock
@@ -1850,7 +1833,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
/* Yoink! */
mt = get_pageblock_migratetype(page);
if (mt != MIGRATE_HIGHATOMIC &&
- !is_migrate_isolate(mt) && !is_migrate_cma(mt)) {
+ !is_migrate_isolate(mt)) {
zone->nr_reserved_highatomic += pageblock_nr_pages;
set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
@@ -1953,9 +1936,7 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
/*
* The pcppage_migratetype may differ from pageblock's
* migratetype depending on the decisions in
- * find_suitable_fallback(). This is OK as long as it does not
- * differ for MIGRATE_CMA pageblocks. Those can be used as
- * fallback only via special __rmqueue_cma_fallback() function
+ * find_suitable_fallback(). This is OK.
*/
set_pcppage_migratetype(page, start_migratetype);

@@ -1978,13 +1959,8 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
struct page *page;

page = __rmqueue_smallest(zone, order, migratetype);
- if (unlikely(!page)) {
- if (migratetype == MIGRATE_MOVABLE)
- page = __rmqueue_cma_fallback(zone, order);
-
- if (!page)
- page = __rmqueue_fallback(zone, order, migratetype);
- }
+ if (unlikely(!page))
+ page = __rmqueue_fallback(zone, order, migratetype);

trace_mm_page_alloc_zone_locked(page, order, migratetype);
return page;
@@ -2021,9 +1997,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
else
list_add_tail(&page->lru, list);
list = &page->lru;
- if (is_migrate_cma(get_pcppage_migratetype(page)))
- __mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
- -(1 << order));
}
__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
spin_unlock(&zone->lock);
@@ -2321,7 +2294,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
return 0;

- __mod_zone_freepage_state(zone, -(1UL << order), mt);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
}

/* Remove page from free list */
@@ -2336,7 +2309,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
struct page *endpage = page + (1 << order) - 1;
for (; page < endpage; page += pageblock_nr_pages) {
int mt = get_pageblock_migratetype(page);
- if (!is_migrate_isolate(mt) && !is_migrate_cma(mt))
+ if (!is_migrate_isolate(mt))
set_pageblock_migratetype(page,
MIGRATE_MOVABLE);
}
@@ -2391,8 +2364,7 @@ alloc_pages_zone(struct zone *zone, unsigned int order, int migratetype)
if (!page)
return NULL;

- __mod_zone_freepage_state(zone, -(1 << order),
- get_pcppage_migratetype(page));
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));

set_page_owner(page, order, __GFP_MOVABLE);
set_page_refcounted(page);
@@ -2453,8 +2425,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
spin_unlock(&zone->lock);
if (!page)
goto failed;
- __mod_zone_freepage_state(zone, -(1 << order),
- get_pcppage_migratetype(page));
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));
}

__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
@@ -2609,11 +2580,6 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
if (!list_empty(&area->free_list[mt]))
return true;
}
-
-#ifdef CONFIG_CMA
- if (!list_empty(&area->free_list[MIGRATE_CMA]))
- return true;
-#endif
}
return false;
}
@@ -4073,9 +4039,6 @@ static void show_migration_types(unsigned char type)
[MIGRATE_MOVABLE] = 'M',
[MIGRATE_RECLAIMABLE] = 'E',
[MIGRATE_HIGHATOMIC] = 'H',
-#ifdef CONFIG_CMA
- [MIGRATE_CMA] = 'C',
-#endif
#ifdef CONFIG_MEMORY_ISOLATION
[MIGRATE_ISOLATE] = 'I',
#endif
@@ -7102,7 +7065,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
return false;

mt = get_pageblock_migratetype(page);
- if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
+ if (mt == MIGRATE_MOVABLE)
return false;

pfn = page_to_pfn(page);
@@ -7250,15 +7213,11 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
* alloc_contig_range() -- tries to allocate given range of pages
* @start: start PFN to allocate
* @end: one-past-the-last PFN to allocate
- * @migratetype: migratetype of the underlaying pageblocks (either
- * #MIGRATE_MOVABLE or #MIGRATE_CMA). All pageblocks
- * in range must have the same migratetype and it must
- * be either of the two.
*
* The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
* aligned, however it's the caller's responsibility to guarantee that
* we are the only thread that changes migrate type of pageblocks the
- * pages fall in.
+ * pages fall in and it should be MIGRATE_MOVABLE.
*
* The PFN range must belong to a single zone.
*
@@ -7266,8 +7225,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
* pages which PFN is in [start, end) are allocated for the caller and
* need to be freed with free_contig_range().
*/
-int alloc_contig_range(unsigned long start, unsigned long end,
- unsigned migratetype)
+int alloc_contig_range(unsigned long start, unsigned long end)
{
unsigned long outer_start, outer_end;
unsigned int order;
@@ -7300,14 +7258,14 @@ int alloc_contig_range(unsigned long start, unsigned long end,
* allocator removing them from the buddy system. This way
* page allocator will never consider using them.
*
- * This lets us mark the pageblocks back as
- * MIGRATE_CMA/MIGRATE_MOVABLE so that free pages in the
- * aligned range but not in the unaligned, original range are
- * put back to page allocator so that buddy can use them.
+ * This lets us mark the pageblocks back as MIGRATE_MOVABLE
+ * so that free pages in the aligned range but not in the
+ * unaligned, original range are put back to page allocator
+ * so that buddy can use them.
*/

ret = start_isolate_page_range(pfn_max_align_down(start),
- pfn_max_align_up(end), migratetype,
+ pfn_max_align_up(end), MIGRATE_MOVABLE,
false);
if (ret)
return ret;
@@ -7386,7 +7344,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,

done:
undo_isolate_page_range(pfn_max_align_down(start),
- pfn_max_align_up(end), migratetype);
+ pfn_max_align_up(end), MIGRATE_MOVABLE);
return ret;
}

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 612122b..5708649 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -61,13 +61,12 @@ static int set_migratetype_isolate(struct page *page,
out:
if (!ret) {
unsigned long nr_pages;
- int migratetype = get_pageblock_migratetype(page);

set_pageblock_migratetype(page, MIGRATE_ISOLATE);
zone->nr_isolate_pageblock++;
nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);

- __mod_zone_freepage_state(zone, -nr_pages, migratetype);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -nr_pages);
}

spin_unlock_irqrestore(&zone->lock, flags);
@@ -122,7 +121,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
*/
if (!isolated_page) {
nr_pages = move_freepages_block(zone, page, migratetype);
- __mod_zone_freepage_state(zone, nr_pages, migratetype);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
}
set_pageblock_migratetype(page, migratetype);
zone->nr_isolate_pageblock--;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index e8c46ad..39a0c3c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1104,10 +1104,7 @@ static void pagetypeinfo_showmixedcount_print(struct seq_file *m,

page_mt = gfpflags_to_migratetype(page_ext->gfp_mask);
if (pageblock_mt != page_mt) {
- if (is_migrate_cma(pageblock_mt))
- count[MIGRATE_MOVABLE]++;
- else
- count[pageblock_mt]++;
+ count[pageblock_mt]++;

pfn = block_end_pfn;
break;
--
1.9.1
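
The change to the allocation order in __rmqueue() can be sketched roughly as follows. Everything here is hypothetical illustration code: it only tracks whether a free list is empty, ignores page orders and pageblock stealing, and keeps an unused CMA slot in the "after" case purely so both functions can share one array.

```c
#include <assert.h>

enum {
	MT_UNMOVABLE,
	MT_MOVABLE,
	MT_RECLAIMABLE,
	MT_CMA,		/* stands in for the removed MIGRATE_CMA list */
	MT_NR_LISTS,
	MT_NONE = -1	/* no list could serve the request */
};

/* Same fallback order as the fallbacks[] table in mm/page_alloc.c. */
static const int sketch_fallbacks[3][2] = {
	[MT_UNMOVABLE]   = { MT_RECLAIMABLE, MT_MOVABLE },
	[MT_MOVABLE]     = { MT_RECLAIMABLE, MT_UNMOVABLE },
	[MT_RECLAIMABLE] = { MT_UNMOVABLE, MT_MOVABLE },
};

/* Return the first non-empty fallback list for mt, or MT_NONE. */
static int try_fallbacks(const unsigned long *nr_free, int mt)
{
	int i;

	for (i = 0; i < 2; i++) {
		int fb = sketch_fallbacks[mt][i];

		if (nr_free[fb])
			return fb;
	}
	return MT_NONE;
}

/* Before: movable requests try the CMA list ahead of the other fallbacks. */
static int rmqueue_source_before(const unsigned long *nr_free, int mt)
{
	if (nr_free[mt])
		return mt;
	if (mt == MT_MOVABLE && nr_free[MT_CMA])
		return MT_CMA;
	return try_fallbacks(nr_free, mt);
}

/* After: there is no CMA list in the zone, only the ordinary fallbacks. */
static int rmqueue_source_after(const unsigned long *nr_free, int mt)
{
	if (nr_free[mt])
		return mt;
	return try_fallbacks(nr_free, mt);
}
```

Both functions expect mt to be one of the three ordinary types. In the "after" case a movable request reaches CMA memory through the zonelist instead, by allocating from ZONE_CMA itself.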

2016-04-25 05:33:24

by Joonsoo Kim

Subject: Re: [PATCH v2 0/6] Introduce ZONE_CMA

On Mon, Apr 25, 2016 at 02:21:04PM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> Hello,
>
> Changes from v1
> o Separate some patches which deserve to submit independently
> o Modify description to reflect current kernel state
> (e.g. high-order watermark problem disappeared by Mel's work)
> o Don't increase SECTION_SIZE_BITS to make a room in page flags
> (detailed reason is on the patch that adds ZONE_CMA)
> o Adjust ZONE_CMA population code
>
> This series tries to solve problems of the current CMA implementation.
>
> CMA was introduced to provide physically contiguous pages at runtime
> without an exclusively reserved memory area. But the current implementation
> works like the previous reserved-memory approach, because freepages
> on the CMA region are used only if there is no movable freepage. In other
> words, freepages on the CMA region are only used as a fallback. In the
> situation where freepages on the CMA region are used as a fallback, kswapd
> is easily woken up since there are no unmovable or reclaimable
> freepages either. Once kswapd starts to reclaim memory, fallback allocation
> to MIGRATE_CMA no longer occurs since movable freepages have
> already been refilled by kswapd, and most freepages on CMA are left
> free. This situation looks like the exclusively reserved memory case.
>
> In my experiment, I found that if the system has 1024 MB of memory and
> 512 MB is reserved for CMA, kswapd is mostly woken up when roughly 512 MB
> of free memory is left. The detailed reason is that, to keep enough free
> memory for unmovable and reclaimable allocations, kswapd uses the equation
> below when calculating free memory, and the result easily goes under the
> watermark.
>
> Free memory for unmovable and reclaimable = Free total - Free CMA pages
>
> This is derived from the property that a CMA freepage
> can't be used for unmovable or reclaimable allocations.
>
> Anyway, in this case, kswapd is woken up when (FreeTotal - FreeCMA)
> is lower than the low watermark and tries to free memory until
> (FreeTotal - FreeCMA) is higher than the high watermark. As a result,
> FreeTotal consistently hovers around the 512 MB boundary, which
> means that we can't utilize the full memory capacity.
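
To make the arithmetic above concrete, here is a rough sketch of the wake-up and sleep decision as described. It is not kernel code: the function names are made up, and any watermark values fed into it are purely illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Free memory that unmovable and reclaimable allocations can really use. */
static long usable_free(long free_total, long free_cma)
{
	return free_total - free_cma;
}

/*
 * Next state of kswapd given its current state. It wakes up when the
 * usable free memory drops below the low watermark and keeps reclaiming
 * until the usable free memory is higher than the high watermark.
 */
static bool kswapd_active_next(bool active, long free_total, long free_cma,
			       long low_wmark, long high_wmark)
{
	long usable = usable_free(free_total, free_cma);

	if (!active)
		return usable < low_wmark;
	return usable <= high_wmark;
}
```

With all values in MB, 512 MB of free CMA memory, and assumed watermarks of 8 and 12, kswapd wakes at FreeTotal = 516 and only stops once FreeTotal exceeds 524, so FreeTotal never moves far from the 512 MB mark.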
>
> To fix this problem, I submitted some patches [1] about 10 months ago,
> but found some more problems that had to be fixed before solving this one.
> That approach requires many hooks in the allocator hotpath, so some
> developers don't like it. Instead, some of them suggested a different
> approach [2] to fix all the problems related to CMA, that is, introducing
> a new zone to deal with free CMA pages. I agree that it is the best way to
> go, so I implement it here. Although the properties of ZONE_MOVABLE and
> ZONE_CMA are similar, I decided to add a new zone rather than piggyback on
> ZONE_MOVABLE since they have some differences. First, reserved CMA pages
> should not be offlined. If freepages for CMA were managed by ZONE_MOVABLE,
> we would need to keep the MIGRATE_CMA migratetype and insert many hooks
> into the memory hotplug code to distinguish hotpluggable memory from
> memory reserved for CMA in the same zone. That would make the already
> complicated memory hotplug code even more complicated. Second, cma_alloc()
> can be called more frequently than memory hotplug operations, and we may
> need to control the allocation rate of ZONE_CMA to optimize latency in the
> future. In that case, the separate-zone approach is easy to modify. Third,
> I'd like to see statistics for CMA separately. Sometimes we need to debug
> why cma_alloc() failed, and separate statistics would be more helpful
> in that situation.
>
> Anyway, this patchset solves four problems related to CMA implementation.
>
> 1) Utilization problem
> As mentioned above, we can't utilize full memory capacity due to the
> limitation of CMA freepage and fallback policy. This patchset implements
> a new zone for CMA and uses it for GFP_HIGHUSER_MOVABLE request. This
> typed allocation is used for page cache and anonymous pages which
> occupies most of memory usage in normal case so we can utilize full
> memory capacity. Below is the experiment result about this problem.
>
> 8 CPUs, 1024 MB, VIRTUAL MACHINE
> make -j16
>
> <Before this series>
> CMA reserve: 0 MB 512 MB
> Elapsed-time: 92.4 186.5
> pswpin: 82 18647
> pswpout: 160 69839
>
> <After this series>
> CMA reserve: 0 MB 512 MB
> Elapsed-time: 93.1 93.4
> pswpin: 84 46
> pswpout: 183 92
>
> FYI, there is another attempt [3] trying to solve this problem in lkml.
> And, as far as I know, Qualcomm also has out-of-tree solution for this
> problem.
>
> 2) Reclaim problem
> Currently, there is no logic to distinguish CMA pages in reclaim path.
> If reclaim is initiated for unmovable and reclaimable allocation,
> reclaiming CMA pages doesn't help to satisfy the request and reclaiming
> CMA page is just waste. By managing CMA pages in the new zone, we can
> skip to reclaim ZONE_CMA completely if it is unnecessary.
>
> 3) Atomic allocation failure problem
> Kswapd isn't started to reclaim pages when allocation request is movable
> type and there is enough free page in the CMA region. After bunch of
> consecutive movable allocation requests, free pages in ordinary region
> (not CMA region) would be exhausted without waking up kswapd. At that time,
> if atomic unmovable allocation comes, it can't be successful since there
> is not enough page in ordinary region. This problem is reported
> by Aneesh [4] and can be solved by this patchset.
>
> 4) Inefficiently work of compaction
> Usual high-order allocation request is unmovable type and it cannot
> be serviced from CMA area. In compaction, migration scanner doesn't
> distinguish migratable pages on the CMA area and do migration.
> In this case, even if we make high-order page on that region, it
> cannot be used due to type mismatch. This patch will solve this problem
> by separating CMA pages from ordinary zones.
>
> I passed boot test on x86_64, x86_32, arm and arm64. I did some stress
> tests on x86_64 and x86_32 and there is no problem. Feel free to enjoy
> and please give me a feedback. :)
>
> This patchset is based on linux-next-20160413.
>
> Thanks.
>
> [1] https://lkml.org/lkml/2014/5/28/64
> [2] https://lkml.org/lkml/2014/11/4/55
> [3] https://lkml.org/lkml/2014/10/15/623
> [4] http://www.spinics.net/lists/linux-mm/msg100562.html
>
> Joonsoo Kim (6):
> mm/page_alloc: recalculate some of zone threshold when on/offline
> memory
> mm/cma: introduce new zone, ZONE_CMA
> mm/cma: populate ZONE_CMA
> mm/cma: remove ALLOC_CMA
> mm/cma: remove MIGRATE_CMA
> mm/cma: remove per zone CMA stat
>
> arch/x86/mm/highmem_32.c | 8 ++
> fs/proc/meminfo.c | 2 +-
> include/linux/cma.h | 6 +
> include/linux/gfp.h | 32 +++---
> include/linux/memory_hotplug.h | 3 -
> include/linux/mempolicy.h | 2 +-
> include/linux/mmzone.h | 54 +++++----
> include/linux/vm_event_item.h | 10 +-
> include/linux/vmstat.h | 8 --
> include/trace/events/compaction.h | 10 +-
> kernel/power/snapshot.c | 8 ++
> mm/cma.c | 58 +++++++++-
> mm/compaction.c | 10 +-
> mm/hugetlb.c | 2 +-
> mm/internal.h | 6 +-
> mm/memory_hotplug.c | 3 +
> mm/page_alloc.c | 236 ++++++++++++++++++++++----------------
> mm/page_isolation.c | 5 +-
> mm/vmstat.c | 15 ++-
> 19 files changed, 303 insertions(+), 175 deletions(-)

Hello, Mel and Aneesh.

I read the LSF/MM summary on LWN.net and Rik's summary e-mail, and it
looks like there is disagreement about the ZONE_CMA approach. I'd like
to discuss it in this mail.

I'd like to object to Aneesh's statement that using ZONE_CMA just
replaces one set of problems with another (as mentioned on LWN.net).
The fact that pages under I/O cannot be moved is a problem for every
CMA approach. It is a separate issue and should not affect the decision
on ZONE_CMA. It could be solved by migrating pages before I/O and
pinning. Also, mlocked pages in the CMA area can be moved. THP pages
aren't moved in the current implementation, but that can be solved by
the linear/lumpy reclaim mentioned in that discussion. This, too, is a
problem for all the approaches, so it should not affect the decision on
ZONE_CMA.

What we should consider is the best approach to solving the other issues
that come from the fact that pages with different characteristics are
in the same zone. One of the problems is the watermark check. Much of
the MM logic relies on the watermark check to make decisions. If a zone
holds free pages whose characteristics are not compatible with other
migratetypes, the result of the watermark check can be misleading. We
distinguished allocation types and adjusted the watermark check through
the ALLOC_CMA flag, but using it is not simple and is fragile. Consider
the compaction code. Before entering compaction, it checks that there
are enough order-0 freepages in the zone for compaction to work. In this
case, we could count CMA freepages even if alloc_flags doesn't have
ALLOC_CMA, because compaction can use CMA freepages as free pages. But,
in reality, we missed it. We might fix those cases one by one, but that
seems really error-prone to me. Until recently, there was a high-order
freepage counting problem, too. It partially disappeared with Mel's
MIGRATE_HIGHATOMIC work, but it still remains for ALLOC_HARDER requests.
(We cannot know how many high-order freepages are in the normal area and
how many are in the CMA area.) That problem also shows that keeping
incompatible types of pages in the same zone is a very fragile design.
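
To make the watermark issue concrete, here is a minimal sketch of the
ALLOC_CMA style adjustment described above. It is a toy model, not the
kernel's watermark code: struct toy_zone, toy_watermark_ok() and the
can_use_cma parameter are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a zone's free-page counters; not the kernel's struct zone. */
struct toy_zone {
	unsigned long free_total;	/* all free pages, CMA included */
	unsigned long free_cma;		/* free pages on CMA pageblocks */
	unsigned long watermark;	/* pages that must remain free */
};

/*
 * Return true if the zone passes the watermark for this request.
 * can_use_cma stands in for the ALLOC_CMA flag: only requests that may
 * use CMA pages are allowed to count free CMA pages as usable.
 */
static bool toy_watermark_ok(const struct toy_zone *z, bool can_use_cma)
{
	unsigned long usable = z->free_total;

	assert(z->free_cma <= z->free_total);

	/* Free CMA pages are invisible to requests that cannot use them. */
	if (!can_use_cma)
		usable -= z->free_cma;

	return usable > z->watermark;
}
```

With free_total = 1000, free_cma = 900 and watermark = 200, the same zone
passes the check for a request that may use CMA and fails it for one that
may not. Every caller has to pass the right flag for its own situation,
which is the fragility I am talking about.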

ZONE_CMA separates those pages into a new zone, so the problems
mentioned above do not arise. There are other issues with freepage
utilization and reclaim efficiency in the current kernel code, and those
could also be solved by the ZONE_CMA approach. Without a new zone, we
need more hooks in core MM code, which is uncomfortable for developers
who don't use CMA. A new zone would be a necessary evil in this
situation.

I don't think another migratetype, sticky MIGRATE_MOVABLE,
is the right solution here. We already use MIGRATE_CMA for that purpose,
and it has proved not to work well.

If someone still disagrees with the ZONE_CMA approach, please let me know
in more detail what the problem with this approach is and how to solve
these real problems.

Thanks.

2016-04-26 09:38:48

by Rui Teng

Subject: Re: [PATCH v2 2/6] mm/cma: introduce new zone, ZONE_CMA

On 4/25/16 1:21 PM, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> Attached cover-letter:
>
> This series try to solve problems of current CMA implementation.
>
> CMA is introduced to provide physically contiguous pages at runtime
> without exclusive reserved memory area. But, current implementation
> works like as previous reserved memory approach, because freepages
> on CMA region are used only if there is no movable freepage. In other
> words, freepages on CMA region are only used as fallback. In that
> situation where freepages on CMA region are used as fallback, kswapd
> would be woken up easily since there is no unmovable and reclaimable
> freepage, too. If kswapd starts to reclaim memory, fallback allocation
> to MIGRATE_CMA doesn't occur any more since movable freepages are
> already refilled by kswapd and then most of freepage on CMA are left
> to be in free. This situation looks like exclusive reserved memory case.
>
> In my experiment, I found that if system memory has 1024 MB memory and
> 512 MB is reserved for CMA, kswapd is mostly woken up when roughly 512 MB
> free memory is left. Detailed reason is that for keeping enough free
> memory for unmovable and reclaimable allocation, kswapd uses below
> equation when calculating free memory and it easily go under the watermark.
>
> Free memory for unmovable and reclaimable = Free total - Free CMA pages
>
> This is derivated from the property of CMA freepage that CMA freepage
> can't be used for unmovable and reclaimable allocation.
>
> Anyway, in this case, kswapd are woken up when (FreeTotal - FreeCMA)
> is lower than low watermark and tries to make free memory until
> (FreeTotal - FreeCMA) is higher than high watermark. That results
> in that FreeTotal is moving around 512MB boundary consistently. It
> then means that we can't utilize full memory capacity.
>
> To fix this problem, I submitted some patches [1] about 10 months ago,
> but, found some more problems to be fixed before solving this problem.
> It requires many hooks in allocator hotpath so some developers doesn't
> like it. Instead, some of them suggest different approach [2] to fix
> all the problems related to CMA, that is, introducing a new zone to deal
> with free CMA pages. I agree that it is the best way to go so implement
> here. Although properties of ZONE_MOVABLE and ZONE_CMA is similar, I
> decide to add a new zone rather than piggyback on ZONE_MOVABLE since
> they have some differences. First, reserved CMA pages should not be
> offlined. If freepage for CMA is managed by ZONE_MOVABLE, we need to keep
> MIGRATE_CMA migratetype and insert many hooks on memory hotplug code
> to distiguish hotpluggable memory and reserved memory for CMA in the same
> zone. It would make memory hotplug code which is already complicated
> more complicated. Second, cma_alloc() can be called more frequently
> than memory hotplug operation and possibly we need to control
> allocation rate of ZONE_CMA to optimize latency in the future.
> In this case, separate zone approach is easy to modify. Third, I'd
> like to see statistics for CMA, separately. Sometimes, we need to debug
> why cma_alloc() is failed and separate statistics would be more helpful
> in this situtaion.
>
> Anyway, this patchset solves four problems related to CMA implementation.
>
> 1) Utilization problem
> As mentioned above, we can't utilize full memory capacity due to the
> limitation of CMA freepage and fallback policy. This patchset implements
> a new zone for CMA and uses it for GFP_HIGHUSER_MOVABLE request. This
> typed allocation is used for page cache and anonymous pages which
> occupies most of memory usage in normal case so we can utilize full
> memory capacity. Below is the experiment result about this problem.
>
> 8 CPUs, 1024 MB, VIRTUAL MACHINE
> make -j16
>
> <Before this series>
> CMA reserve: 0 MB 512 MB
> Elapsed-time: 92.4 186.5
> pswpin: 82 18647
> pswpout: 160 69839
>
> <After this series>
> CMA reserve: 0 MB 512 MB
> Elapsed-time: 93.1 93.4
> pswpin: 84 46
> pswpout: 183 92
>
> FYI, there is another attempt [3] trying to solve this problem in lkml.
> And, as far as I know, Qualcomm also has out-of-tree solution for this
> problem.
>
> 2) Reclaim problem
> Currently, there is no logic to distinguish CMA pages in reclaim path.
> If reclaim is initiated for unmovable and reclaimable allocation,
> reclaiming CMA pages doesn't help to satisfy the request and reclaiming
> CMA page is just waste. By managing CMA pages in the new zone, we can
> skip to reclaim ZONE_CMA completely if it is unnecessary.
>
> 3) Atomic allocation failure problem
> Kswapd isn't started to reclaim pages when allocation request is movable
> type and there is enough free page in the CMA region. After bunch of
> consecutive movable allocation requests, free pages in ordinary region
> (not CMA region) would be exhausted without waking up kswapd. At that time,
> if atomic unmovable allocation comes, it can't be successful since there
> is not enough page in ordinary region. This problem is reported
> by Aneesh [4] and can be solved by this patchset.
>
> 4) Inefficiently work of compaction
> Usual high-order allocation request is unmovable type and it cannot
> be serviced from CMA area. In compaction, migration scanner doesn't
> distinguish migratable pages on the CMA area and do migration.
> In this case, even if we make high-order page on that region, it
> cannot be used due to type mismatch. This patch will solve this problem
> by separating CMA pages from ordinary zones.
>
> [1] https://lkml.org/lkml/2014/5/28/64
> [2] https://lkml.org/lkml/2014/11/4/55
> [3] https://lkml.org/lkml/2014/10/15/623
> [4] http://www.spinics.net/lists/linux-mm/msg100562.html
> [5] https://lkml.org/lkml/2014/5/30/320
>
> For this patch:
>
> Currently, reserved pages for CMA are managed together with normal pages.
> To distinguish them, we used migratetype, MIGRATE_CMA, and
> do special handlings for this migratetype. But, it turns out that
> there are too many problems with this approach and to fix all of them
> needs many more hooks to page allocation and reclaim path so
> some developers express their discomfort and problems on CMA aren't fixed
> for a long time.
>
> To terminate this situation and fix CMA problems, this patch implements
> ZONE_CMA. Reserved pages for CMA will be managed in this new zone. This
> approach will remove all exisiting hooks for MIGRATE_CMA and many
> problems related to CMA implementation will be solved.
>
> This patch only add basic infrastructure of ZONE_CMA. In the following
> patch, ZONE_CMA is actually populated and used.
>
> Adding a new zone could cause two possible problems. One is the overflow
> of page flags and the other is GFP_ZONES_TABLE issue.
>
> Following is page-flags layout described in page-flags-layout.h.
>
> 1. No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
> 2. " plus space for last_cpupid: | NODE | ZONE | LAST_CPUPID ... | FLAGS |
> 3. classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
> 4. " plus space for last_cpupid: | SECTION | NODE | ZONE | LAST_CPUPID ... | FLAGS |
> 5. classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
>
> There is no problem in #1, #2 configurations for 64-bit system. There are
> enough room even for extremiely large x86_64 system. 32-bit system would
> not have many nodes so it would have no problem, too.
> System with #3, #4, #5 configurations could be affected by this zone
> addition, but, thanks to recent THP rework which reduce one page flag,
> problem surface would be small. In some configurations, problem is
> still possible, but, it highly depends on individual configuration
> so impact cannot be easily estimated. I guess that usual system
> with CONFIG_CMA would not be affected. If there is a problem,
> we can adjust section width or node width for that architecture.
>
> Currently, GFP_ZONES_TABLE is 32-bit value for 32-bit bit operation
> in the 32-bit system. If we add one more zone, it will be 48-bit and
> 32-bit bit operation cannot be possible. Although it will cause slight
> overhead, there is no other way so this patch relax GFP_ZONES_TABLE's
> 32-bit limitation. 32-bit System with CONFIG_CMA will be affected by
> this change but it would be marginal.
>
> Note that there are many checkpatch warnings but I think that current
> code is better for readability than fixing them up.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> arch/x86/mm/highmem_32.c | 8 +++++
> include/linux/gfp.h | 29 +++++++++++-------
> include/linux/mempolicy.h | 2 +-
> include/linux/mmzone.h | 31 ++++++++++++++++++-
> include/linux/vm_event_item.h | 10 ++++++-
> include/trace/events/compaction.h | 10 ++++++-
> kernel/power/snapshot.c | 8 +++++
> mm/memory_hotplug.c | 3 ++
> mm/page_alloc.c | 63 +++++++++++++++++++++++++++++++++------
> mm/vmstat.c | 9 +++++-
> 10 files changed, 148 insertions(+), 25 deletions(-)
>
> diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c
> index a6d7392..a7fcb12 100644
> --- a/arch/x86/mm/highmem_32.c
> +++ b/arch/x86/mm/highmem_32.c
> @@ -120,6 +120,14 @@ void __init set_highmem_pages_init(void)
> if (!is_highmem(zone))
> continue;
>
> + /*
> + * ZONE_CMA is a special zone that should not be
> + * participated in initialization because it's pages
> + * would be initialized by initialization of other zones.
> + */
> + if (is_zone_cma(zone))
> + continue;
> +
> zone_start_pfn = zone->zone_start_pfn;
> zone_end_pfn = zone_start_pfn + zone->spanned_pages;
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 570383a..4d6c008 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -301,6 +301,12 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
> #define OPT_ZONE_DMA32 ZONE_NORMAL
> #endif
>
> +#ifdef CONFIG_CMA
> +#define OPT_ZONE_CMA ZONE_CMA
> +#else
> +#define OPT_ZONE_CMA ZONE_MOVABLE
> +#endif
> +
> /*
> * GFP_ZONE_TABLE is a word size bitstring that is used for looking up the
> * zone to use given the lowest 4 bits of gfp_t. Entries are ZONE_SHIFT long
> @@ -331,7 +337,6 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
> * 0xe => BAD (MOVABLE+DMA32+HIGHMEM)
> * 0xf => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
> *
> - * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms.
> */
>
> #if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4
> @@ -341,19 +346,21 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
> #define GFP_ZONES_SHIFT ZONES_SHIFT
> #endif
>
> -#if 16 * GFP_ZONES_SHIFT > BITS_PER_LONG
> -#error GFP_ZONES_SHIFT too large to create GFP_ZONE_TABLE integer
> +#if !defined(CONFIG_64BITS) && GFP_ZONES_SHIFT > 2
> +#define GFP_ZONE_TABLE_CAST unsigned long long
> +#else
> +#define GFP_ZONE_TABLE_CAST unsigned long
> #endif
>
> #define GFP_ZONE_TABLE ( \
> - (ZONE_NORMAL << 0 * GFP_ZONES_SHIFT) \
> - | (OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT) \
> - | (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT) \
> - | (OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT) \
> - | (ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT) \
> - | (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT) \
> - | (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT)\
> - | (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT)\
> + ((GFP_ZONE_TABLE_CAST) ZONE_NORMAL << 0 * GFP_ZONES_SHIFT) \
> + | ((GFP_ZONE_TABLE_CAST) OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT) \
> + | ((GFP_ZONE_TABLE_CAST) OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT) \
> + | ((GFP_ZONE_TABLE_CAST) OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT) \
> + | ((GFP_ZONE_TABLE_CAST) ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT) \
> + | ((GFP_ZONE_TABLE_CAST) OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT) \
> + | ((GFP_ZONE_TABLE_CAST) OPT_ZONE_CMA << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT) \
> + | ((GFP_ZONE_TABLE_CAST) OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT) \
> )
>
> /*
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 4429d25..c4cc86e 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -157,7 +157,7 @@ extern enum zone_type policy_zone;
>
> static inline void check_highest_zone(enum zone_type k)
> {
> - if (k > policy_zone && k != ZONE_MOVABLE)
> + if (k > policy_zone && k != ZONE_MOVABLE && !is_zone_cma_idx(k))
> policy_zone = k;
> }
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index f4ae0abb..5c97ba9 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -322,6 +322,9 @@ enum zone_type {
> ZONE_HIGHMEM,
> #endif
> ZONE_MOVABLE,
> +#ifdef CONFIG_CMA
> + ZONE_CMA,
> +#endif
> #ifdef CONFIG_ZONE_DEVICE
> ZONE_DEVICE,
> #endif
> @@ -812,11 +815,37 @@ static inline int zone_movable_is_highmem(void)
> }
> #endif
>
> +static inline int is_zone_cma_idx(enum zone_type idx)
> +{
> +#ifdef CONFIG_CMA
> + return idx == ZONE_CMA;
> +#else
> + return 0;
> +#endif
> +}
> +
> +static inline int is_zone_cma(struct zone *zone)
> +{
> + int zone_idx = zone_idx(zone);
> +
> + return is_zone_cma_idx(zone_idx);
> +}
> +
> +static inline int zone_cma_is_highmem(void)
> +{
> +#ifdef CONFIG_HIGHMEM

Does this also need to check CONFIG_CMA here?

> + return 1;
> +#else
> + return 0;
> +#endif
> +}
> +
> static inline int is_highmem_idx(enum zone_type idx)
> {
> #ifdef CONFIG_HIGHMEM
> return (idx == ZONE_HIGHMEM ||
> - (idx == ZONE_MOVABLE && zone_movable_is_highmem()));
> + (idx == ZONE_MOVABLE && zone_movable_is_highmem()) ||
> + (is_zone_cma_idx(idx) && zone_cma_is_highmem()));

When CONFIG_HIGHMEM is defined, zone_cma_is_highmem() always returns 1.
I think it is not necessary to call the function here, or even to define
it.
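
To illustrate the suggestion, a simplified is_highmem_idx() could look
like the sketch below. It is a self-contained toy model: the TOY_
prefixed macros, the enum and toy_movable_zone are stand-ins assumed for
illustration, not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for CONFIG_HIGHMEM and CONFIG_CMA (assumptions). */
#define TOY_CONFIG_HIGHMEM 1
#define TOY_CONFIG_CMA 1

enum toy_zone_type {
	TOY_ZONE_NORMAL,
	TOY_ZONE_HIGHMEM,
	TOY_ZONE_MOVABLE,
	TOY_ZONE_CMA,
	TOY_MAX_NR_ZONES
};

/* Models which zone ZONE_MOVABLE takes its pages from. */
static enum toy_zone_type toy_movable_zone = TOY_ZONE_HIGHMEM;

static bool toy_is_zone_cma_idx(enum toy_zone_type idx)
{
#if TOY_CONFIG_CMA
	return idx == TOY_ZONE_CMA;
#else
	(void)idx;
	return false;
#endif
}

static bool toy_is_highmem_idx(enum toy_zone_type idx)
{
	assert(idx < TOY_MAX_NR_ZONES);
#if TOY_CONFIG_HIGHMEM
	/* No separate zone_cma_is_highmem() helper: it would return 1 here. */
	return idx == TOY_ZONE_HIGHMEM ||
	       (idx == TOY_ZONE_MOVABLE &&
		toy_movable_zone == TOY_ZONE_HIGHMEM) ||
	       toy_is_zone_cma_idx(idx);
#else
	(void)idx;
	return false;
#endif
}
```

In this model the CMA index is treated as highmem whenever the highmem
option is enabled, which is the same result as the quoted hunk but with
one helper fewer.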

> #else
> return 0;
> #endif
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 9ec2940..8e25ba5 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -19,7 +19,15 @@
> #define HIGHMEM_ZONE(xx)
> #endif
>
> -#define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, HIGHMEM_ZONE(xx) xx##_MOVABLE
> +#ifdef CONFIG_CMA
> +#define MOVABLE_ZONE(xx) xx##_MOVABLE,
> +#define CMA_ZONE(xx) xx##_CMA
> +#else
> +#define MOVABLE_ZONE(xx) xx##_MOVABLE
> +#define CMA_ZONE(xx)
> +#endif
> +
> +#define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, HIGHMEM_ZONE(xx) MOVABLE_ZONE(xx) CMA_ZONE(xx)
>
> enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> FOR_ALL_ZONES(PGALLOC),
> diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
> index 36e2d6f..9d3b254 100644
> --- a/include/trace/events/compaction.h
> +++ b/include/trace/events/compaction.h
> @@ -38,12 +38,20 @@
> #define IFDEF_ZONE_HIGHMEM(X)
> #endif
>
> +#ifdef CONFIG_CMA
> +#define IFDEF_ZONE_CMA(X, Y, Z) X Z
> +#else
> +#define IFDEF_ZONE_CMA(X, Y, Z) Y
> +#endif
> +
> #define ZONE_TYPE \
> IFDEF_ZONE_DMA( EM (ZONE_DMA, "DMA")) \
> IFDEF_ZONE_DMA32( EM (ZONE_DMA32, "DMA32")) \
> EM (ZONE_NORMAL, "Normal") \
> IFDEF_ZONE_HIGHMEM( EM (ZONE_HIGHMEM,"HighMem")) \
> - EMe(ZONE_MOVABLE,"Movable")
> + IFDEF_ZONE_CMA( EM (ZONE_MOVABLE,"Movable"), \
> + EMe(ZONE_MOVABLE,"Movable"), \
> + EMe(ZONE_CMA, "CMA"))
>
> /*
> * First define the enums in the above macros to be exported to userspace
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index 3a97060..e8a7d8f 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -1042,6 +1042,14 @@ unsigned int snapshot_additional_pages(struct zone *zone)
> {
> unsigned int rtree, nodes;
>
> + /*
> + * Estimation of needed pages for ZONE_CMA is already considered
> + * when calculating other zones since span of ZONE_CMA is subset
> + * of other zones.
> + */
> + if (is_zone_cma(zone))
> + return 0;
> +
> rtree = nodes = DIV_ROUND_UP(zone->spanned_pages, BM_BITS_PER_BLOCK);
> rtree += DIV_ROUND_UP(rtree * sizeof(struct rtree_node),
> LINKED_PAGE_DATA_SIZE);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index caf2a14..354fa9c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1808,6 +1808,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
> if (zone_idx(zone) <= ZONE_NORMAL && !can_offline_normal(zone, nr_pages))
> return -EINVAL;
>
> + if (is_zone_cma(zone))
> + return -EINVAL;
> +
> /* set above range as isolated */
> ret = start_isolate_page_range(start_pfn, end_pfn,
> MIGRATE_MOVABLE, true);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ffa93e0..987a87c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -202,6 +202,9 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
> 32,
> #endif
> 32,
> +#ifdef CONFIG_CMA
> + 32,
> +#endif
> };
>
> EXPORT_SYMBOL(totalram_pages);
> @@ -218,6 +221,9 @@ static char * const zone_names[MAX_NR_ZONES] = {
> "HighMem",
> #endif
> "Movable",
> +#ifdef CONFIG_CMA
> + "CMA",
> +#endif
> #ifdef CONFIG_ZONE_DEVICE
> "Device",
> #endif
> @@ -4896,6 +4902,15 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> struct memblock_region *r = NULL, *tmp;
> #endif
>
> + /*
> + * Physical pages for ZONE_CMA are belong to other zones now. They
> + * are initialized when corresponding zone is initialized and they
> + * will be moved to ZONE_CMA later. Zone information will also be
> + * adjusted later.
> + */
> + if (is_zone_cma_idx(zone))
> + return;
> +
> if (highest_memmap_pfn < end_pfn - 1)
> highest_memmap_pfn = end_pfn - 1;
>
> @@ -5332,7 +5347,7 @@ static void __init find_usable_zone_for_movable(void)
> {
> int zone_index;
> for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) {
> - if (zone_index == ZONE_MOVABLE)
> + if (zone_index == ZONE_MOVABLE || is_zone_cma_idx(zone_index))
> continue;
>
> if (arch_zone_highest_possible_pfn[zone_index] >
> @@ -5541,6 +5556,8 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
> unsigned long *zholes_size)
> {
> unsigned long realtotalpages = 0, totalpages = 0;
> + unsigned long zone_cma_start_pfn = UINT_MAX;
> + unsigned long zone_cma_end_pfn = 0;
> enum zone_type i;
>
> for (i = 0; i < MAX_NR_ZONES; i++) {
> @@ -5548,6 +5565,13 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
> unsigned long zone_start_pfn, zone_end_pfn;
> unsigned long size, real_size;
>
> + if (is_zone_cma_idx(i)) {
> + zone->zone_start_pfn = zone_cma_start_pfn;
> + size = zone_cma_end_pfn - zone_cma_start_pfn;
> + real_size = 0;
> + goto init_zone;
> + }
> +
> size = zone_spanned_pages_in_node(pgdat->node_id, i,
> node_start_pfn,
> node_end_pfn,
> @@ -5557,13 +5581,23 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
> real_size = size - zone_absent_pages_in_node(pgdat->node_id, i,
> node_start_pfn, node_end_pfn,
> zholes_size);
> - if (size)
> + if (size) {
> zone->zone_start_pfn = zone_start_pfn;
> - else
> + if (zone_cma_start_pfn > zone_start_pfn)
> + zone_cma_start_pfn = zone_start_pfn;
> + if (zone_cma_end_pfn < zone_start_pfn + size)
> + zone_cma_end_pfn = zone_start_pfn + size;
> + } else
> zone->zone_start_pfn = 0;
> +
> +init_zone:
> zone->spanned_pages = size;
> zone->present_pages = real_size;
>
> + /* Prevent to over-count node span */
> + if (is_zone_cma_idx(i))
> + size = 0;
> +
> totalpages += size;
> realtotalpages += real_size;
> }
> @@ -5705,6 +5739,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
> struct zone *zone = pgdat->node_zones + j;
> unsigned long size, realsize, freesize, memmap_pages;
> unsigned long zone_start_pfn = zone->zone_start_pfn;
> + bool zone_kernel = !is_highmem_idx(j) && !is_zone_cma_idx(j);
>
> size = zone->spanned_pages;
> realsize = freesize = zone->present_pages;
> @@ -5715,7 +5750,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
> * and per-cpu initialisations
> */
> memmap_pages = calc_memmap_size(size, realsize);
> - if (!is_highmem_idx(j)) {
> + if (zone_kernel) {
> if (freesize >= memmap_pages) {
> freesize -= memmap_pages;
> if (memmap_pages)
> @@ -5734,7 +5769,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
> zone_names[0], dma_reserve);
> }
>
> - if (!is_highmem_idx(j))
> + if (zone_kernel)
> nr_kernel_pages += freesize;
> /* Charge for highmem memmap if there are enough kernel pages */
> else if (nr_kernel_pages > memmap_pages * 2)
> @@ -5746,7 +5781,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
> * when the bootmem allocator frees pages into the buddy system.
> * And all highmem pages will be managed by the buddy system.
> */
> - zone->managed_pages = is_highmem_idx(j) ? realsize : freesize;
> + zone->managed_pages = zone_kernel ? freesize : realsize;
> #ifdef CONFIG_NUMA
> zone->node = nid;
> setup_min_unmapped_ratio(zone);
> @@ -5763,7 +5798,12 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
> mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
>
> lruvec_init(&zone->lruvec);
> - if (!size)
> +
> + /*
> + * ZONE_CMA should be initialized even if it has no present
> + * page now since pages will be moved to the zone later.
> + */
> + if (!size && !is_zone_cma_idx(j))
> continue;
>
> set_pageblock_order();
> @@ -6217,7 +6257,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
> arch_zone_lowest_possible_pfn[0] = find_min_pfn_with_active_regions();
> arch_zone_highest_possible_pfn[0] = max_zone_pfn[0];
> for (i = 1; i < MAX_NR_ZONES; i++) {
> - if (i == ZONE_MOVABLE)
> + if (i == ZONE_MOVABLE || is_zone_cma_idx(i))
> continue;
> arch_zone_lowest_possible_pfn[i] =
> arch_zone_highest_possible_pfn[i-1];
> @@ -6234,7 +6274,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
> /* Print out the zone ranges */
> pr_info("Zone ranges:\n");
> for (i = 0; i < MAX_NR_ZONES; i++) {
> - if (i == ZONE_MOVABLE)
> + if (i == ZONE_MOVABLE || is_zone_cma_idx(i))
> continue;
> pr_info(" %-8s ", zone_names[i]);
> if (arch_zone_lowest_possible_pfn[i] ==
> @@ -7048,6 +7088,11 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
> */
> if (zone_idx(zone) == ZONE_MOVABLE)
> return false;
> +
> + /* ZONE_CMA never contains unmovable pages */
> + if (is_zone_cma(zone))
> + return false;
> +
> mt = get_pageblock_migratetype(page);
> if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
> return false;
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 070fd90..e8c46ad 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -710,8 +710,15 @@ int fragmentation_index(struct zone *zone, unsigned int order)
> #define TEXT_FOR_HIGHMEM(xx)
> #endif
>
> +#ifdef CONFIG_CMA
> +#define TEXT_FOR_CMA(xx) xx "_cma",
> +#else
> +#define TEXT_FOR_CMA(xx)
> +#endif
> +
> #define TEXTS_FOR_ZONES(xx) TEXT_FOR_DMA(xx) TEXT_FOR_DMA32(xx) xx "_normal", \
> - TEXT_FOR_HIGHMEM(xx) xx "_movable",
> + TEXT_FOR_HIGHMEM(xx) xx "_movable", \
> + TEXT_FOR_CMA(xx)
>
> const char * const vmstat_text[] = {
> /* enum zone_stat_item countes */
>

2016-04-28 07:46:48

by Rui Teng

Subject: Re: [PATCH v2 1/6] mm/page_alloc: recalculate some of zone threshold when on/offline memory

On 4/25/16 1:21 PM, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> Some of zone threshold depends on number of managed pages in the zone.
> When memory is going on/offline, it can be changed and we need to
> adjust them.
>
> This patch add recalculation to appropriate places and clean-up
> related function for better maintanance.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> mm/page_alloc.c | 36 +++++++++++++++++++++++++++++-------
> 1 file changed, 29 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 71fa015..ffa93e0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4633,6 +4633,8 @@ int local_memory_node(int node)
> }
> #endif
>
> +static void setup_min_unmapped_ratio(struct zone *zone);
> +static void setup_min_slab_ratio(struct zone *zone);
> #else /* CONFIG_NUMA */
>
> static void set_zonelist_order(void)
> @@ -5747,9 +5749,8 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
> zone->managed_pages = is_highmem_idx(j) ? realsize : freesize;
> #ifdef CONFIG_NUMA
> zone->node = nid;
> - zone->min_unmapped_pages = (freesize*sysctl_min_unmapped_ratio)
> - / 100;
> - zone->min_slab_pages = (freesize * sysctl_min_slab_ratio) / 100;
> + setup_min_unmapped_ratio(zone);
> + setup_min_slab_ratio(zone);

The original logic uses freesize to calculate
zone->min_unmapped_pages and zone->min_slab_pages here,
but the new functions use zone->managed_pages.
Do you mean the original logic is wrong, or that managed_pages will
always equal freesize when CONFIG_NUMA is defined?
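To make the question concrete, here is a minimal sketch of the two calculations. It only models the quoted hunk, where managed_pages is set to realsize for highmem zones and to freesize otherwise. The struct and the toy_* helpers are hypothetical names for illustration, not kernel API, and the page counts in the usage note are made up.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the sizes seen in free_area_init_core(). */
struct toy_zone_sizes {
	unsigned long realsize;	/* pages present in the zone */
	unsigned long freesize;	/* realsize minus memmap and reserved pages */
	bool highmem;		/* models is_highmem_idx(j) */
};

/* Mirrors: managed_pages = is_highmem_idx(j) ? realsize : freesize */
unsigned long toy_managed_pages(const struct toy_zone_sizes *s)
{
	return s->highmem ? s->realsize : s->freesize;
}

/* Mirrors the "(pages * ratio) / 100" form used for both thresholds. */
unsigned long toy_min_pages(unsigned long pages, unsigned int ratio_percent)
{
	assert(ratio_percent <= 100);
	return pages * ratio_percent / 100;
}
```

Under this model the old and new results agree for every non-highmem zone. They can differ only for a highmem zone, for example realsize 1000, freesize 900 and a ratio of 5 give 45 pages from freesize but 50 pages from managed_pages.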

> #endif
> zone->name = zone_names[j];
> spin_lock_init(&zone->lock);
> @@ -6655,6 +6656,7 @@ int __meminit init_per_zone_wmark_min(void)
> {
> unsigned long lowmem_kbytes;
> int new_min_free_kbytes;
> + struct zone *zone;
>
> lowmem_kbytes = nr_free_buffer_pages() * (PAGE_SIZE >> 10);
> new_min_free_kbytes = int_sqrt(lowmem_kbytes * 16);
> @@ -6672,6 +6674,14 @@ int __meminit init_per_zone_wmark_min(void)
> setup_per_zone_wmarks();
> refresh_zone_stat_thresholds();
> setup_per_zone_lowmem_reserve();
> +
> + for_each_zone(zone) {
> +#ifdef CONFIG_NUMA
> + setup_min_unmapped_ratio(zone);
> + setup_min_slab_ratio(zone);
> +#endif
> + }
> +
> return 0;
> }
> module_init(init_per_zone_wmark_min)
> @@ -6713,6 +6723,12 @@ int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
> }
>
> #ifdef CONFIG_NUMA
> +static void setup_min_unmapped_ratio(struct zone *zone)
> +{
> + zone->min_unmapped_pages = (zone->managed_pages *
> + sysctl_min_unmapped_ratio) / 100;
> +}
> +
> int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
> void __user *buffer, size_t *length, loff_t *ppos)
> {
> @@ -6724,11 +6740,17 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
> return rc;
>
> for_each_zone(zone)
> - zone->min_unmapped_pages = (zone->managed_pages *
> - sysctl_min_unmapped_ratio) / 100;
> + setup_min_unmapped_ratio(zone);
> +
> return 0;
> }
>
> +static void setup_min_slab_ratio(struct zone *zone)
> +{
> + zone->min_slab_pages = (zone->managed_pages *
> + sysctl_min_slab_ratio) / 100;
> +}
> +
> int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *table, int write,
> void __user *buffer, size_t *length, loff_t *ppos)
> {
> @@ -6740,8 +6762,8 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *table, int write,
> return rc;
>
> for_each_zone(zone)
> - zone->min_slab_pages = (zone->managed_pages *
> - sysctl_min_slab_ratio) / 100;
> + setup_min_slab_ratio(zone);
> +
> return 0;
> }
> #endif
>

2016-04-28 10:41:51

by Mel Gorman

Subject: Re: [PATCH v2 0/6] Introduce ZONE_CMA

On Mon, Apr 25, 2016 at 02:36:54PM +0900, Joonsoo Kim wrote:
> > Hello,
> >
> > Changes from v1
> > o Separate some patches which deserve to submit independently
> > o Modify description to reflect current kernel state
> > (e.g. high-order watermark problem disappeared by Mel's work)
> > o Don't increase SECTION_SIZE_BITS to make a room in page flags
> > (detailed reason is on the patch that adds ZONE_CMA)
> > o Adjust ZONE_CMA population code
> >
> > This series try to solve problems of current CMA implementation.
> >
> > CMA is introduced to provide physically contiguous pages at runtime
> > without exclusive reserved memory area. But, current implementation
> > works like as previous reserved memory approach, because freepages
> > on CMA region are used only if there is no movable freepage. In other
> > words, freepages on CMA region are only used as fallback. In that
> > situation where freepages on CMA region are used as fallback, kswapd
> > would be woken up easily since there is no unmovable and reclaimable
> > freepage, too. If kswapd starts to reclaim memory, fallback allocation
> > to MIGRATE_CMA doesn't occur any more since movable freepages are
> > already refilled by kswapd and then most of freepage on CMA are left
> > to be in free. This situation looks like exclusive reserved memory case.
> >

My understanding is that this was intentional. One of the original design
requirements was that CMA have a high likelihood of allocation success for
devices if it was necessary as an allocation failure was very visible to
the user. It does not *have* to be treated as a reserve because Movable
allocations could try CMA first but it increases allocation latency for
devices that require it and it gets worse if those pages are pinned.

> > In my experiment, I found that if system memory has 1024 MB memory and
> > 512 MB is reserved for CMA, kswapd is mostly woken up when roughly 512 MB
> > free memory is left. Detailed reason is that for keeping enough free
> > memory for unmovable and reclaimable allocation, kswapd uses below
> > equation when calculating free memory and it easily go under the watermark.
> >
> > Free memory for unmovable and reclaimable = Free total - Free CMA pages
> >
> > This is derivated from the property of CMA freepage that CMA freepage
> > can't be used for unmovable and reclaimable allocation.
> >

Yes and also keeping it lightly utilised to reduce CMA allocation
latency and probability of failure.

> > Anyway, in this case, kswapd are woken up when (FreeTotal - FreeCMA)
> > is lower than low watermark and tries to make free memory until
> > (FreeTotal - FreeCMA) is higher than high watermark. That results
> > in that FreeTotal is moving around 512MB boundary consistently. It
> > then means that we can't utilize full memory capacity.
> >
> > To fix this problem, I submitted some patches [1] about 10 months ago,
> > but, found some more problems to be fixed before solving this problem.
> > It requires many hooks in allocator hotpath so some developers doesn't
> > like it. Instead, some of them suggest different approach [2] to fix
> > all the problems related to CMA, that is, introducing a new zone to deal
> > with free CMA pages. I agree that it is the best way to go so implement
> > here. Although properties of ZONE_MOVABLE and ZONE_CMA is similar,

One of the issues I mentioned at LSF/MM is that I consider ZONE_MOVABLE
to be a mistake. Zones are meant to be about addressing limitations and
both ZONE_MOVABLE and ZONE_CMA violate that. When ZONE_MOVABLE was
introduced, it was intended for use with dynamically resizing the
hugetlbfs pool. It was competing with fragmentation avoidance at the
time and the community could not decide which approach was better so
both ended up being merged as they had different advantages and
disadvantages.

Now, ZONE_MOVABLE is being abused -- memory hotplug was a particular mistake
and I don't want to see CMA fall down the same hole. Both CMA and memory
hotplug would benefit from the notion of having "sticky" MIGRATE_MOVABLE
pageblocks that are never used for UNMOVABLE and RECLAIMABLE fallbacks.
It costs to detect that in the slow path but zones cause their own problems.

> > I
> > decide to add a new zone rather than piggyback on ZONE_MOVABLE since
> > they have some differences. First, reserved CMA pages should not be
> > offlined. If freepage for CMA is managed by ZONE_MOVABLE, we need to keep
> > MIGRATE_CMA migratetype and insert many hooks on memory hotplug code
> > to distiguish hotpluggable memory and reserved memory for CMA in the same
> > zone.

Or treat both as "sticky" MIGRATE_MOVABLE.

> > It would make memory hotplug code which is already complicated
> > more complicated. Second, cma_alloc() can be called more frequently
> > than memory hotplug operation and possibly we need to control
> > allocation rate of ZONE_CMA to optimize latency in the future.
> > In this case, separate zone approach is easy to modify. Third, I'd
> > like to see statistics for CMA, separately. Sometimes, we need to debug
> > why cma_alloc() is failed and separate statistics would be more helpful
> > in this situtaion.
> >
> > Anyway, this patchset solves four problems related to CMA implementation.
> >
> > 1) Utilization problem
> > As mentioned above, we can't utilize full memory capacity due to the
> > limitation of CMA freepage and fallback policy. This patchset implements
> > a new zone for CMA and uses it for GFP_HIGHUSER_MOVABLE request. This
> > typed allocation is used for page cache and anonymous pages which
> > occupies most of memory usage in normal case so we can utilize full
> > memory capacity. Below is the experiment result about this problem.
> >

A zone is not necessary for that. Currently a zone would have a side-benefit
from the fair zone allocation policy because it would interleave between
ZONE_CMA and ZONE_MOVABLE. However, the intention is to remove that policy
by moving LRUs to the node. Once that happens, the interleaving benefit
is lost and you're back to square one.

There may be some justification for interleaving *only* between MOVABLE and
MOVABLE_STICKY for CMA allocations and hiding that behind both a CONFIG_CMA
guard *and* a check for whether a CMA region exists. It'd still need
something like the BATCH vmstat but it would only be updated when CMA is
active and hide it from the fast paths.

> > 2) Reclaim problem
> > Currently, there is no logic to distinguish CMA pages in reclaim path.
> > If reclaim is initiated for unmovable and reclaimable allocation,
> > reclaiming CMA pages doesn't help to satisfy the request and reclaiming
> > CMA page is just waste. By managing CMA pages in the new zone, we can
> > skip to reclaim ZONE_CMA completely if it is unnecessary.
> >

This problem will recur with node-lru. That said, CMA reclaim
currently depends on random reclaim followed by compaction. This is
both slow and inefficient. CMA and alloc_contig_range() should strongly
consider isolating pages with a PFN walk of the CMA regions and directly
reclaiming those pages. Those pages may need to be refaulted in but the
priority is for the allocation to succeed. That would side-step the issues
with kswapd scanning the wrong zones.

> > 3) Atomic allocation failure problem
> > Kswapd isn't started to reclaim pages when allocation request is movable
> > type and there is enough free page in the CMA region. After bunch of
> > consecutive movable allocation requests, free pages in ordinary region
> > (not CMA region) would be exhausted without waking up kswapd. At that time,
> > if atomic unmovable allocation comes, it can't be successful since there
> > is not enough page in ordinary region. This problem is reported
> > by Aneesh [4] and can be solved by this patchset.
> >

Not necessarily, as kswapd currently would still reclaim from the lower
zones unnecessarily. Again, targeting the pages required for the CMA
allocation would side-step the issue. The actual core code of lumpy reclaim
was quite small.

> > 4) Inefficiently work of compaction
> > Usual high-order allocation request is unmovable type and it cannot
> > be serviced from CMA area. In compaction, migration scanner doesn't
> > distinguish migratable pages on the CMA area and do migration.
> > In this case, even if we make high-order page on that region, it
> > cannot be used due to type mismatch. This patch will solve this problem
> > by separating CMA pages from ordinary zones.
> >

Compaction problems are actually compounded by introducing ZONE_CMA as
it only compacts within the zone. Compaction would need to know how to
compact within a node to address the introduction of ZONE_CMA.

> I read the summary of the LSF/MM in LWN.net and Rik's summary e-mail
> and it looks like there is a disagreement on ZONE_CMA approach and
> I'd like to talk about it in this mail.
>
> I'd like to object Aneesh's statement that using ZONE_CMA just replaces
> one set of problems with another (mentioned in LWN.net).

These are the problems I have as well. Even with current code, reclaim
is balanced between zones and introducing a new one compounds the
problem. With node-lru, it is similarly complicated.

> The fact
> that pages under I/O cannot be moved is also the problem of all CMA
> approaches. It is just separate issue and should not affect the decision
> on ZONE_CMA.

While it's a separate issue, it's also an important one. A linear reclaim
of a CMA region would at least be able to clearly identify pages that are
pinned in that region. Introducing the zone does not help the problem.

> What we should consider is what is the best approach to solve other issues
> that comes from the fact that pages with different characteristic are
> in the same zone. One of the problem is a watermark check. Many MM logic
> based on watermark check to decide something. If there are free pages with
> different characteristic that is not compatible with other migratetypes,
> output of watermark check would cause the problem. We distinguished
> allocation type and adjusted watermark check through ALLOC_CMA flag but
> using it is not so simple and is fragile. Consider about the compaction
> code. There is a checks that there are enough order-0 freepage in the zone
> to check that compaction could work, before entering the compaction.

Which would be addressed by linear reclaiming instead as the contiguous
region is what CMA requires. Compaction was intended to be best effort for
THP. Potentially right now, CMA has to loop constantly reclaiming pages
and hoping compaction works and that can over-reclaim if there are pinned
pages and *still* fail.

> In this case, we could add up CMA freepages even if alloc_flags doesn't
> have ALLOC_CMA because we can utilize CMA freepages as a freepage.
> But, in reality, we missed it. We might fix those cases one by one but
> it's seems to be really error-prone to me. Until recent date, high order
> freepage counting problem was there, too. It partially disappeared
> by Mel's MIGRATE_HIGHATOMIC work, but still remain for ALLOC_HARDER
> request. (We cannot know how many high order freepages are on normal area
> and CMA area.) That problem also shows that it's very fragile design
> that non-compatible types of pages are in the same zone.
>
> ZONE_CMA separates those pages to a new zone so there is no problem
> mentioned in the above. There are other issues about freepage utilization
> and reclaim efficiency in current kernel code and those also could be solved
> by ZONE_CMA approach. Without a new zone, it need more hooks on core MM code
> and it is uncomfortable to the developers who doesn't use the CMA.
> New zone would be a necessary evil in this situation.
>
> I don't think another migratetype, sticky MIGRATE_MOVABLE,
> is the right solution here. We already uses MIGRATE_CMA for that purpose
> and it is proved that it doesn't work well.
>

Partially because the watermark checks do not always do the best thing,
kswapd does not always reclaim the correct pages and compaction does not
always work. A zone alleviates the watermark check but not the reclaim or
compaction problems while introducing issues with balancing zone reclaim,
having additional free lists and getting impacted later when the fair zone
allocation policy is removed. It would be very difficult to convince me
that ZONE_CMA is the way forward when I already think that ZONE_MOVABLE
was a mistake.

What I was proposing at LSF/MM was the following:

1. Create the notion of a sticky MIGRATE_MOVABLE type.
UNMOVABLE and RECLAIMABLE cannot fall back to these regions. If a sticky
region exists then the fallback code will need additional checks
in the slow path. This is slow but it's the cost of protection.

2. Express MIGRATE_CMA in terms of sticky MIGRATE_MOVABLE

3. Use linear reclaim instead of reclaim/compaction in alloc_contig_range
Reclaim/compaction was intended for THP which does not care about zones,
only node locality. It potentially over-reclaims in the CMA-allocation
case. If reclaim aggression is increased then it could potentially
reclaim the entire system before failing. By using linear reclaim, a
failure pass will scan just CMA, reclaim everything there and fail. On
success, it may still over-reclaim if it finds pinned pages but in the
ideal case, it reclaims exactly what is required for the allocation to
succeed. Some pages may need to be refaulted but that is likely cheaper
than multiple reclaim/compaction cycles that eventually fail anyway

The core of how linear reclaim used to work is still visible in commit
c53919adc045bf803252e912f23028a68525753d in the isolate_lru_pages
function although I would not suggest reintroducing it there and instead
do something similar in alloc_contig_range.

Potentially, over-reclaim could be avoided by isolating the full range
first and if one isolation fails then putback all the pages and restart
the scan after the pinned page.

4. Interleave MOVABLE and sticky MOVABLE if desired
This would be in the fallback paths only and be specific to CMA. This
would alleviate the utilisation problems while not impacting the fast
paths for everyone else. Functionally it would be similar to the fair
zone allocation policy which is currently in the fast path and scheduled
for removal.

5. For kernelcore=, create sticky MIGRATE_MOVABLE blocks instead of
ZONE_MOVABLE

6. For memory hot-add, create sticky MIGRATE_MOVABLE blocks instead of
adding pages to ZONE_MOVABLE

7. Delete ZONE_MOVABLE

8. Optionally migrate pages about to be pinned from sticky MIGRATE_MOVABLE

This would benefit both CMA and memory hot-remove. It would be a policy
choice on whether a failed migration allows the allocation to succeed
or not.
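As an illustration of the "putback and restart the scan after the pinned page" idea in point 3, here is a toy sketch. It assumes a plain array of per-page pinned flags instead of real struct page state and does no reclaim or migration. toy_find_unpinned_range() is a hypothetical helper, not an existing kernel function, and it ignores the alignment constraints a real CMA allocation would have.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Find the first window of `want` consecutive pages in pinned[0..n)
 * that contains no pinned page. Returns the start index or -1.
 */
long toy_find_unpinned_range(const bool *pinned, size_t n, size_t want)
{
	size_t start = 0;

	assert(want > 0);
	while (start + want <= n) {
		size_t i;
		bool hit_pinned = false;

		/* Try to "isolate" the full window first. */
		for (i = start; i < start + want; i++) {
			if (pinned[i]) {
				hit_pinned = true;
				break;
			}
		}
		if (!hit_pinned)
			return (long)start;

		/* Give the window up and restart just after the pinned page. */
		start = i + 1;
	}
	return -1;
}
```

Because nothing is touched until a whole window is known to be free of pinned pages, a failed attempt in this model costs a scan but reclaims nothing, which is the over-reclaim property the proposal is after.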


--
Mel Gorman
SUSE Labs

2016-04-29 06:51:39

by Joonsoo Kim

Subject: Re: [PATCH v2 0/6] Introduce ZONE_CMA

Hello, Mel.

IIUC, you may have missed that alloc_contig_range() currently does linear
reclaim/migration. Your comments are largely based on this
misunderstanding, so please keep that in mind when reading this
reply.

On Thu, Apr 28, 2016 at 11:39:27AM +0100, Mel Gorman wrote:
> On Mon, Apr 25, 2016 at 02:36:54PM +0900, Joonsoo Kim wrote:
> > > Hello,
> > >
> > > Changes from v1
> > > o Separate some patches which deserve to submit independently
> > > o Modify description to reflect current kernel state
> > > (e.g. high-order watermark problem disappeared by Mel's work)
> > > o Don't increase SECTION_SIZE_BITS to make a room in page flags
> > > (detailed reason is on the patch that adds ZONE_CMA)
> > > o Adjust ZONE_CMA population code
> > >
> > > This series try to solve problems of current CMA implementation.
> > >
> > > CMA is introduced to provide physically contiguous pages at runtime
> > > without exclusive reserved memory area. But, current implementation
> > > works like as previous reserved memory approach, because freepages
> > > on CMA region are used only if there is no movable freepage. In other
> > > words, freepages on CMA region are only used as fallback. In that
> > > situation where freepages on CMA region are used as fallback, kswapd
> > > would be woken up easily since there is no unmovable and reclaimable
> > > freepage, too. If kswapd starts to reclaim memory, fallback allocation
> > > to MIGRATE_CMA doesn't occur any more since movable freepages are
> > > already refilled by kswapd and then most of freepage on CMA are left
> > > to be in free. This situation looks like exclusive reserved memory case.
> > >
>
> My understanding is that this was intentional. One of the original design
> requirements was that CMA have a high likelihood of allocation success for
> devices if it was necessary as an allocation failure was very visible to
> the user. It does not *have* to be treated as a reserve because Movable
> allocations could try CMA first but it increases allocation latency for
> devices that require it and it gets worse if those pages are pinned.

I know that it was a design decision made at a time when CMA wasn't
actively used. It was due to lack of experience and now the situation is
quite different. Most embedded systems use CMA with their own
adaptations because utilization is too low. Low utilization makes the
system much slower, and this is more likely than the case where device
memory is required. Given that they adapt their logic to utilize CMA
much more and sacrifice latency, I think that the previous design
decision is wrong and we should go another way.

>
> > > In my experiment, I found that if system memory has 1024 MB memory and
> > > 512 MB is reserved for CMA, kswapd is mostly woken up when roughly 512 MB
> > > free memory is left. Detailed reason is that for keeping enough free
> > > memory for unmovable and reclaimable allocation, kswapd uses below
> > > equation when calculating free memory and it easily go under the watermark.
> > >
> > > Free memory for unmovable and reclaimable = Free total - Free CMA pages
> > >
> > > This is derivated from the property of CMA freepage that CMA freepage
> > > can't be used for unmovable and reclaimable allocation.
> > >
>
> Yes and also keeping it lightly utilised to reduce CMA allocation
> latency and probability of failure.

In my experience with CMA, most unacceptable failures (taking more
than 3 sec) come from blockdev pagecache. It's not even simple to
check what is going on there when a failure happens. ZONE_CMA uses
a different approach: it only takes requests with
GFP_HIGHUSER_MOVABLE, so blockdev pagecache cannot get in and the
probability of failure is much reduced.

> > > Anyway, in this case, kswapd are woken up when (FreeTotal - FreeCMA)
> > > is lower than low watermark and tries to make free memory until
> > > (FreeTotal - FreeCMA) is higher than high watermark. That results
> > > in that FreeTotal is moving around 512MB boundary consistently. It
> > > then means that we can't utilize full memory capacity.
> > >
> > > To fix this problem, I submitted some patches [1] about 10 months ago,
> > > but, found some more problems to be fixed before solving this problem.
> > > It requires many hooks in allocator hotpath so some developers doesn't
> > > like it. Instead, some of them suggest different approach [2] to fix
> > > all the problems related to CMA, that is, introducing a new zone to deal
> > > with free CMA pages. I agree that it is the best way to go so implement
> > > here. Although properties of ZONE_MOVABLE and ZONE_CMA is similar,
>
> One of the issues I mentioned at LSF/MM is that I consider ZONE_MOVABLE
> to be a mistake. Zones are meant to be about addressing limitations and
> both ZONE_MOVABLE and ZONE_CMA violate that. When ZONE_MOVABLE was
> introduced, it was intended for use with dynamically resizing the
> hugetlbfs pool. It was competing with fragmentation avoidance at the
> time and the community could not decide which approach was better so
> both ended up being merged as they had different advantages and
> disadvantages.
>
> Now, ZONE_MOVABLE is being abused -- memory hotplug was a particular mistake
> and I don't want to see CMA fall down the same hole. Both CMA and memory
> hotplug would benefit from the notion of having "sticky" MIGRATE_MOVABLE
> pageblocks that are never used for UNMOVABLE and RECLAIMABLE fallbacks.
> It costs to detect that in the slow path but zones cause their own problems.

Please elaborate on the concrete reasons why you think ZONE_MOVABLE
is a mistake. Simply saying that zones are meant to be about addressing
limitations doesn't make sense. Moreover, I think that the original
purpose of zones can change if needed. They were introduced for that
purpose but a lot of time has passed. We have different requirements now
and a zone is suitable for handling them. And, if we think about
addressing limitations more generally, they are one case of memory with
different characteristics. Zones were introduced to handle that
situation and that's what the new CMA implementation needs.

> > > I
> > > decide to add a new zone rather than piggyback on ZONE_MOVABLE since
> > > they have some differences. First, reserved CMA pages should not be
> > > offlined. If freepage for CMA is managed by ZONE_MOVABLE, we need to keep
> > > MIGRATE_CMA migratetype and insert many hooks on memory hotplug code
> > > to distiguish hotpluggable memory and reserved memory for CMA in the same
> > > zone.
>
> Or treat both as "sticky" MIGRATE_MOVABLE.
>
> > > It would make memory hotplug code which is already complicated
> > > more complicated. Second, cma_alloc() can be called more frequently
> > > than memory hotplug operation and possibly we need to control
> > > allocation rate of ZONE_CMA to optimize latency in the future.
> > > In this case, separate zone approach is easy to modify. Third, I'd
> > > like to see statistics for CMA, separately. Sometimes, we need to debug
> > > why cma_alloc() is failed and separate statistics would be more helpful
> > > in this situtaion.
> > >
> > > Anyway, this patchset solves four problems related to CMA implementation.
> > >
> > > 1) Utilization problem
> > > As mentioned above, we can't utilize full memory capacity due to the
> > > limitation of CMA freepage and fallback policy. This patchset implements
> > > a new zone for CMA and uses it for GFP_HIGHUSER_MOVABLE request. This
> > > typed allocation is used for page cache and anonymous pages which
> > > occupies most of memory usage in normal case so we can utilize full
> > > memory capacity. Below is the experiment result about this problem.
> > >
>
> A zone is not necessary for that. Currently a zone would have a side-benefit

Agreed that a zone isn't necessary for it. I also tried the approach of
interleaving within a normal zone about two years ago but, because
there were more remaining problems and it needed more hooks in many
places, people didn't like it. You can see the implementation at the link below.

https://lkml.org/lkml/2014/5/28/64


> from the fair zone allocation policy because it would interleave between
> ZONE_CMA and ZONE_MOVABLE. However, the intention is to remove that policy
> by moving LRUs to the node. Once that happens, the interleaving benefit
> is lost and you're back to square one.
>
> There may be some justification for interleaving *only* between MOVABLE and
> MOVABLE_STICKY for CMA allocations and hiding that behind both a CONFIG_CMA
> guard *and* a check if there a CMA region exists. It'd still need
> something like the BATCH vmstat but it would only be updated when CMA is
> active and hide it from the fast paths.

And, please don't focus on the interleaving allocation problem. The main
problem I'd like to solve is the utilization problem, and interleaving
is an optional benefit from the fair zone policy. Some CMA adaptations I
know of don't use interleaving at all. So, even if your node LRU work
removes the fair zone policy, it would not be a critical problem for ZONE_CMA.
We can implement it if desired.

>
> > > 2) Reclaim problem
> > > Currently, there is no logic to distinguish CMA pages in reclaim path.
> > > If reclaim is initiated for unmovable and reclaimable allocation,
> > > reclaiming CMA pages doesn't help to satisfy the request and reclaiming
> > > CMA page is just waste. By managing CMA pages in the new zone, we can
> > > skip to reclaim ZONE_CMA completely if it is unnecessary.
> > >
>
> This problem will recur with node-lru. However, that said CMA reclaim is

Yes, it will. But it will also happen on ZONE_HIGHMEM if we use
node-lru. What's your plan to handle this issue? Because you cannot
remove ZONE_HIGHMEM, you need to handle it properly, and that approach
would naturally work well for ZONE_CMA as well.

> currently depending on randomly reclaiming followed by compaction. This is
> both slow and inefficient. CMA and alloc_contig_range() should strongly
> consider isolation pages with a PFN walk of the CMA regions and directly
> reclaiming those pages. Those pages may need to be refaulted in but the
> priority is for the allocation to succeed. That would side-step the issues
> with kswapd scanning the wrong zones.

I think that you are confused here. alloc_contig_range() already uses
a PFN walk on the CMA regions and directly reclaims/migrates those
pages. There is no random reclaim/migration here.

> > > 3) Atomic allocation failure problem
> > > Kswapd isn't started to reclaim pages when allocation request is movable
> > > type and there is enough free page in the CMA region. After bunch of
> > > consecutive movable allocation requests, free pages in ordinary region
> > > (not CMA region) would be exhausted without waking up kswapd. At that time,
> > > if atomic unmovable allocation comes, it can't be successful since there
> > > is not enough page in ordinary region. This problem is reported
> > > by Aneesh [4] and can be solved by this patchset.
> > >
>
> Not necessarily as kswapd curently would still reclaim from the lower
> zones unnecessarily. Again, targetting the pages required for the CMA
> allocation would side-step the issue. The actual core code of lumpy reclaim
> was quite small.

You may be confused here, too. Or maybe I misunderstand what you are
saying, so please correct me if I'm missing something. First, let me
clarify the problem.

It's not an issue when calling alloc_contig_range(). It is an issue
in how MM manages (allocates/reclaims) pages in the CMA area while they
are not being used by a device.

The problem is that kswapd isn't woken up properly due to an inaccurate
watermark check. When we do the watermark check, the number of freepages
varies depending on the allocation type. A non-movable allocation subtracts
the number of CMA freepages from the total freepages. In other words, we add
the number of CMA freepages to the number of normal freepages when a movable
allocation is requested. The problem comes from here. While we handle
a bunch of movable allocations, we can't notice that there are not enough
freepages in the normal area because we add up the CMA freepages and the
watermark looks safe. In this case, kswapd would not be woken up. If an
atomic allocation suddenly comes in this situation, freepages in the normal
area could be low and the atomic allocation can fail.
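The asymmetry described above can be shown with a small model. This is only a sketch under simplifying assumptions: a single zone, one low watermark, no lowmem reserve and no per-order checks. struct toy_zone and the toy_* functions are hypothetical names, not the kernel's watermark code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a zone that mixes normal and CMA free pages. */
struct toy_zone {
	unsigned long free_total;	/* all free pages, CMA included */
	unsigned long free_cma;		/* free pages on CMA pageblocks */
	unsigned long low_wmark;	/* kswapd is woken at or below this */
};

/* Free pages an allocation may count: only movable requests see CMA pages. */
unsigned long toy_usable_free(const struct toy_zone *z, bool movable)
{
	assert(z->free_cma <= z->free_total);
	return movable ? z->free_total : z->free_total - z->free_cma;
}

/* True when the check passes, i.e. nothing asks kswapd to run. */
bool toy_watermark_ok(const struct toy_zone *z, bool movable)
{
	return toy_usable_free(z, movable) > z->low_wmark;
}
```

With made-up numbers such as 1000 free pages of which 990 are CMA and a low watermark of 100, every movable request passes the check while only 10 pages are actually available to an unmovable atomic request. As long as only movable requests arrive, nothing in this model wakes kswapd.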

This is a really good example of the problems that come from having
different types of pages in a single zone. As I mentioned in other places,
it's a really error-prone design and handling it case by case is very fragile.
The long-lasting problems with CMA are caused by this design
decision and we need to change it.

> > > 4) Inefficiently work of compaction
> > > Usual high-order allocation request is unmovable type and it cannot
> > > be serviced from CMA area. In compaction, migration scanner doesn't
> > > distinguish migratable pages on the CMA area and do migration.
> > > In this case, even if we make high-order page on that region, it
> > > cannot be used due to type mismatch. This patch will solve this problem
> > > by separating CMA pages from ordinary zones.
> > >
>
> Compaction problems are actually compounded by introducing ZONE_CMA as
> it only compacts within the zone. Compaction would need to know how to
> compact within a node to address the introduction of ZONE_CMA.
>

I'm not sure what you'd like to say here. What I meant is the following
situation. Each capital letter is a pageblock type (C: MIGRATE_CMA,
M: MIGRATE_MOVABLE). F means a pageblock freed by compaction.

CCCCCMMMM

If compaction is invoked to make an unmovable high-order freepage,
compaction would start to work. It empties the front part of the zone and
migrates the pages to the rear part of the zone. High-order freepages are
made in the front part of the zone and it may look like the following.

FFFCCMMMM

But they are on CMA pageblocks so they cannot be used to satisfy an unmovable
high-order allocation. Freepages on a CMA pageblock aren't allocated for
unmovable allocation requests. That is what I'd like to say here.
We can fix it by adding corner case handling in some places but
this kind of corner case handling is not what I want. It's not
maintainable. It comes from the current design decision (MIGRATE_CMA) and
we need to change the situation.
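The FFFCCMMMM picture can be restated as a toy count of which freed pageblocks a request may actually use. The enum, struct and function names below are hypothetical and only model pageblock granularity. Real compaction and the real allocator are far more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_migratetype { TOY_MOVABLE, TOY_CMA };

struct toy_pageblock {
	enum toy_migratetype mt;	/* pageblock type, unchanged by compaction */
	bool free;			/* true once compaction has emptied it */
};

/* Count the free pageblocks a request is allowed to allocate from. */
size_t toy_usable_free_blocks(const struct toy_pageblock *blocks, size_t n,
			      bool unmovable_request)
{
	size_t count = 0;

	assert(blocks != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!blocks[i].free)
			continue;
		/* Free pages on CMA pageblocks never serve unmovable requests. */
		if (unmovable_request && blocks[i].mt == TOY_CMA)
			continue;
		count++;
	}
	return count;
}
```

For the layout above, with three freed blocks that are all of type C, this count is 3 for a movable request and 0 for an unmovable one, so in this model the compaction work was wasted for the request that triggered it.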

And there is no compaction problem on ZONE_CMA because we can
compact it just like the others. No new problem is added by
ZONE_CMA. If you are talking about the problem of the migration destination
page when calling alloc_contig_range(), there is no problem there, either.
alloc_contig_range() uses alloc_migrate_target(), which is also used by
memory-hotplug, and it doesn't limit the allocation's target zone. No
problem is introduced by ZONE_CMA.

> > I read the summary of the LSF/MM in LWN.net and Rik's summary e-mail
> > and it looks like there is a disagreement on ZONE_CMA approach and
> > I'd like to talk about it in this mail.
> >
> > I'd like to object Aneesh's statement that using ZONE_CMA just replaces
> > one set of problems with another (mentioned in LWN.net).
>
> These are the problems I have as well. Even with current code, reclaim
> is balanced between zones and introducing a new one compounds the
> problem. With node-lru, it is similarly complicated.

I believe your node-lru work will solve the balancing problem for
ZONE_HIGHMEM. ZONE_CMA can piggyback on that solution since it has a
limitation similar to ZONE_HIGHMEM's, so I think that there is no
problem at all.

And I guess that another design, like your sticky MOVABLE types, will
eventually need similar exception handling, and it would be more
complex than adding a new zone because zones exist to handle memory
with different characteristics but migratetypes don't. At least
currently, no migratetype except MIGRATE_CMA is incompatible with the
others. That is the root of the problem.

> > The fact
> > that pages under I/O cannot be moved is also the problem of all CMA
> > approaches. It is just separate issue and should not affect the decision
> > on ZONE_CMA.
>
> While it's a separate issue, it's also an important one. A linear reclaim
> of a CMA region would at least be able to clearly identify pages that are
> pinned in that region. Introducing the zone does not help the problem.

I know that it's important. ZONE_CMA may not help with the problem, but
that is also true for the other approaches. They don't help with the
problem, either. That's why I said it is a separate issue. I'm not sure
what you mean by linear reclaim here, but a separate zone will make it
easy to adopt a different reclaim algorithm if needed.

Although it's a separate issue, I should mention one thing. Related to
the I/O pinning issue, ZONE_CMA doesn't get blockdev allocation
requests, so the I/O pinning problem is much reduced.

> > What we should consider is what is the best approach to solve other issues
> > that comes from the fact that pages with different characteristic are
> > in the same zone. One of the problem is a watermark check. Many MM logic
> > based on watermark check to decide something. If there are free pages with
> > different characteristic that is not compatible with other migratetypes,
> > output of watermark check would cause the problem. We distinguished
> > allocation type and adjusted watermark check through ALLOC_CMA flag but
> > using it is not so simple and is fragile. Consider about the compaction
> > code. There is a checks that there are enough order-0 freepage in the zone
> > to check that compaction could work, before entering the compaction.
>
> Which would be addressed by linear reclaiming instead as the contiguous
> region is what CMA requires. Compaction was intended to be best effort for
> THP. Potentially right now, CMA has to loop constantly reclaiming pages
> and hoping compaction works and that can over-reclaim if there are pinned
> pages and *still* fail.

I don't get it here. What I'd like to point out is the difficulty of
corner case handling with the current MIGRATE_CMA approach. These corner
cases come from the design decision of MIGRATE_CMA. It hasn't worked
well and has caused many problems until now.

> > In this case, we could add up CMA freepages even if alloc_flags doesn't
> > have ALLOC_CMA because we can utilize CMA freepages as a freepage.
> > But, in reality, we missed it. We might fix those cases one by one but
> > it's seems to be really error-prone to me. Until recent date, high order
> > freepage counting problem was there, too. It partially disappeared
> > by Mel's MIGRATE_HIGHATOMIC work, but still remain for ALLOC_HARDER
> > request. (We cannot know how many high order freepages are on normal area
> > and CMA area.) That problem also shows that it's very fragile design
> > that non-compatible types of pages are in the same zone.
> >
> > ZONE_CMA separates those pages to a new zone so there is no problem
> > mentioned in the above. There are other issues about freepage utilization
> > and reclaim efficiency in current kernel code and those also could be solved
> > by ZONE_CMA approach. Without a new zone, it need more hooks on core MM code
> > and it is uncomfortable to the developers who doesn't use the CMA.
> > New zone would be a necessary evil in this situation.
> >
> > I don't think another migratetype, sticky MIGRATE_MOVABLE,
> > is the right solution here. We already uses MIGRATE_CMA for that purpose
> > and it is proved that it doesn't work well.
> >
>
> Partially because the watermark checks do not always do the best thing,
> kswapd does not always reclaim the correct pages and compaction does not
> always work. A zone alleviates the watermark check but not the reclaim or
> compaction problems while introducing issues with balancing zone reclaim,
> having additional free lists and getting impacted later when the fair zone
> allocation policy is removed. It would be very difficult to convince me
> that ZONE_CMA is the way forward when I already think that ZONE_MOVABLE
> was a mistake.

I don't get what you mean by the reclaim and compaction problems here.
ZONE_CMA will solve many of the problems mentioned above and gives us a
robust implementation. I admit that a reclaim balancing problem is added
if your node-lru work is merged, but it is not a new issue.
ZONE_HIGHMEM has the same problem and you need to consider and solve
it. ZONE_CMA can benefit from that solution, so it would not be a show
stopper for the ZONE_CMA design.

Moreover, in the worst case, you can just manage the pages on ZONE_CMA
with the global node lru. That is the same as the case where CMA pages
are in another zone such as ZONE_NORMAL. It's not worse than the current
implementation in terms of reclaim efficiency.

> What I was proposing at LSF/MM was the following;
>
> 1. Create the notion of a sticky MIGRATE_MOVABLE type.
> UNMOVABLE and RECLAIMABLE cannot fallback to these regions. If a sticky
> region exists then the fallback code will need additional checks
> in the slow path. This is slow but it's the cost of protection

First of all, I can't understand what the difference is between ZONE_CMA
and sticky MIGRATE_MOVABLE. We already did this with MIGRATE_CMA and it
has proved to be an error-prone design. #1 seems to be the core concept
of sticky MIGRATE_MOVABLE and I cannot see the difference between them.
Please elaborate on the difference.

> 2. Express MIGRATE_CMA in terms of sticky MIGRATE_MOVABLE
>
> 3. Use linear reclaim instead of reclaim/compaction in alloc_contig_range
> Reclaim/compaction was intended for THP which does not care about zones,
> only node locality. It potentially over-reclaims in the CMA-allocation
> case. If reclaim/aggression is increased then it could potentially
> reclaim the entire system before failing. By using linear reclaim, a
> failure pass will scan just CMA, reclaim everything there and fail. On
> success, it may still over-reclaim if it finds pinned pages but in the
> ideal case, it reclaims exactly what is required for the allocation to
> succeed. Some pages may need to be refaulted but that is likely cheaper
> than multiple reclaim/compaction cycles that eventually fail anyway
>
> The core of how linear reclaim used to work is still visible in commit
> c53919adc045bf803252e912f23028a68525753d in the isolate_lru_pages
> function although I would not suggest reintroducing it there and instead
> do something similar in alloc_contig_range.
>
> Potentially, over-reclaim could be avoided by isolating the full range
> first and if one isolation fails then putback all the pages and restart
> the scan after the pinned page.

alloc_contig_range() already uses linear reclaim/migration.
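
Roughly, the idea is a single pass over the requested pfn range. A
minimal user-space sketch of that idea follows. It is a simplified
model, not the kernel implementation: enum page_state and
contig_range_model() are hypothetical names used only here. Movable
pages are treated as migrated away, and a pinned page makes the pass
fail with -EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy page states for the model. Not kernel code. */
enum page_state { PAGE_FREE, PAGE_MOVABLE, PAGE_PINNED };

/*
 * Walk [start, end) in pfn order. A movable page is "migrated" out of
 * the range (marked free here); a pinned page aborts the walk.
 * Returns 0 if the whole range ends up free, or -EBUSY with the
 * offending pfn stored in *failed_pfn. Pages outside the range are
 * never touched.
 */
static int contig_range_model(enum page_state *pages, size_t start,
			      size_t end, size_t *failed_pfn)
{
	assert(start <= end);

	for (size_t pfn = start; pfn < end; pfn++) {
		switch (pages[pfn]) {
		case PAGE_FREE:
			break;
		case PAGE_MOVABLE:
			pages[pfn] = PAGE_FREE;
			break;
		case PAGE_PINNED:
			*failed_pfn = pfn;
			return -EBUSY;
		}
	}
	return 0;
}
```

In this model the work is bounded by the size of the requested range,
and a failure reports exactly which pfn was pinned.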

> 4. Interleave MOVABLE and sticky MOVABLE if desired
> This would be in the fallback paths only and be specific to CMA. This
> would alleviate the utilisation problems while not impacting the fast
> paths for everyone else. Functionally it would be similar to the fair
> zone allocation policy which is currently in the fast path and scheduled
> for removal.

Interleaving can be implemented in either case if desired. At least for
now, thanks to the fair zone policy, ZONE_CMA would get the benefit of
interleaving. If your node-lru work removes the fair zone policy, it
would become a problem, but it is also a problem for the other approach.
We can solve it with a hook on the allocation path, so it's not a weak
point of ZONE_CMA.

> 5. For kernelcore=, create stick MIGRATE_MOVABLE blocks instead of
> ZONE_MOVABLE
>
> 6. For memory hot-add, create sticky MIGRATE_MOVABLE blocks instead of
> adding pages to ZONE_MOVABLE
>
> 7. Delete ZONE_MOVABLE
>
> 8. Optionally migrate pages about to be pinned from sticky MIGRATE_MOVABLE
>
> This would benefit both CMA and memory hot-remove. It would be a policy
> choice on whether a failed migration allows the allocation to succeed
> or not.

Separate issue. It can be implemented regardless of CMA design decision.

Overall, I do not see any advantage in the sticky MIGRATE_MOVABLE
design, at least for now. The main reason is that ZONE_CMA is introduced
to replace MIGRATE_CMA, which is conceptually the same as your sticky
MIGRATE_MOVABLE proposal. It doesn't solve any of the issues mentioned
here, and we should not repeat the same mistake.

If I'm missing something, please let me know.

Thanks.

2016-04-29 06:57:08

by Joonsoo Kim

Subject: Re: [PATCH v2 1/6] mm/page_alloc: recalculate some of zone threshold when on/offline memory

On Thu, Apr 28, 2016 at 03:46:33PM +0800, Rui Teng wrote:
> On 4/25/16 1:21 PM, [email protected] wrote:
> >From: Joonsoo Kim <[email protected]>
> >
> >Some of zone threshold depends on number of managed pages in the zone.
> >When memory is going on/offline, it can be changed and we need to
> >adjust them.
> >
> >This patch add recalculation to appropriate places and clean-up
> >related function for better maintanance.
> >
> >Signed-off-by: Joonsoo Kim <[email protected]>
> >---
> > mm/page_alloc.c | 36 +++++++++++++++++++++++++++++-------
> > 1 file changed, 29 insertions(+), 7 deletions(-)
> >
> >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >index 71fa015..ffa93e0 100644
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -4633,6 +4633,8 @@ int local_memory_node(int node)
> > }
> > #endif
> >
> >+static void setup_min_unmapped_ratio(struct zone *zone);
> >+static void setup_min_slab_ratio(struct zone *zone);
> > #else /* CONFIG_NUMA */
> >
> > static void set_zonelist_order(void)
> >@@ -5747,9 +5749,8 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
> > zone->managed_pages = is_highmem_idx(j) ? realsize : freesize;
> > #ifdef CONFIG_NUMA
> > zone->node = nid;
> >- zone->min_unmapped_pages = (freesize*sysctl_min_unmapped_ratio)
> >- / 100;
> >- zone->min_slab_pages = (freesize * sysctl_min_slab_ratio) / 100;
> >+ setup_min_unmapped_ratio(zone);
> >+ setup_min_slab_ratio(zone);
>
> The original logic use freesize to calculate the
> zone->min_unmapped_pages and zone->min_slab_pages here.
> But the new function will use zone->managed_pages.
> Do you mean the original logic is wrong, or the managed_pages will
> always be freesize when CONFIG_NUMA defined?

managed_pages will always be freesize, so there is no problem.
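
As a rough illustration, here is a user-space sketch comparing the two
forms. struct zone_model, old_min_unmapped_pages() and the ratio values
are assumptions for the example. The helper names follow the patch, but
their bodies here are modelled on the removed lines and are not copied
from the patch.

```c
#include <assert.h>

/* Stand-ins for the sysctl values; the numbers are only illustrative. */
static int sysctl_min_unmapped_ratio = 1;
static int sysctl_min_slab_ratio = 5;

/* Only the fields this example needs. Not the kernel's struct zone. */
struct zone_model {
	unsigned long managed_pages;
	unsigned long min_unmapped_pages;
	unsigned long min_slab_pages;
};

/* New form: derive the threshold from zone->managed_pages. */
static void setup_min_unmapped_ratio(struct zone_model *zone)
{
	assert(sysctl_min_unmapped_ratio >= 0 &&
	       sysctl_min_unmapped_ratio <= 100);
	zone->min_unmapped_pages =
		(zone->managed_pages * sysctl_min_unmapped_ratio) / 100;
}

static void setup_min_slab_ratio(struct zone_model *zone)
{
	assert(sysctl_min_slab_ratio >= 0 && sysctl_min_slab_ratio <= 100);
	zone->min_slab_pages =
		(zone->managed_pages * sysctl_min_slab_ratio) / 100;
}

/* Old open-coded form from free_area_init_core(), based on freesize. */
static unsigned long old_min_unmapped_pages(unsigned long freesize)
{
	return (freesize * sysctl_min_unmapped_ratio) / 100;
}
```

As long as managed_pages equals freesize, both forms give the same
threshold, and the helpers can be called again whenever managed_pages
changes on memory on/offline.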

Thanks.

2016-04-29 07:04:19

by Joonsoo Kim

Subject: Re: [PATCH v2 2/6] mm/cma: introduce new zone, ZONE_CMA

On Tue, Apr 26, 2016 at 05:38:18PM +0800, Rui Teng wrote:
> On 4/25/16 1:21 PM, [email protected] wrote:
> >From: Joonsoo Kim <[email protected]>
> >
> >Attached cover-letter:
> >
> >This series try to solve problems of current CMA implementation.
> >
> >CMA is introduced to provide physically contiguous pages at runtime
> >without exclusive reserved memory area. But, current implementation
> >works like as previous reserved memory approach, because freepages
> >on CMA region are used only if there is no movable freepage. In other
> >words, freepages on CMA region are only used as fallback. In that
> >situation where freepages on CMA region are used as fallback, kswapd
> >would be woken up easily since there is no unmovable and reclaimable
> >freepage, too. If kswapd starts to reclaim memory, fallback allocation
> >to MIGRATE_CMA doesn't occur any more since movable freepages are
> >already refilled by kswapd and then most of freepage on CMA are left
> >to be in free. This situation looks like exclusive reserved memory case.
> >
> >In my experiment, I found that if system memory has 1024 MB memory and
> >512 MB is reserved for CMA, kswapd is mostly woken up when roughly 512 MB
> >free memory is left. Detailed reason is that for keeping enough free
> >memory for unmovable and reclaimable allocation, kswapd uses below
> >equation when calculating free memory and it easily go under the watermark.
> >
> >Free memory for unmovable and reclaimable = Free total - Free CMA pages
> >
> >This is derivated from the property of CMA freepage that CMA freepage
> >can't be used for unmovable and reclaimable allocation.
> >
> >Anyway, in this case, kswapd are woken up when (FreeTotal - FreeCMA)
> >is lower than low watermark and tries to make free memory until
> >(FreeTotal - FreeCMA) is higher than high watermark. That results
> >in that FreeTotal is moving around 512MB boundary consistently. It
> >then means that we can't utilize full memory capacity.
> >
> >To fix this problem, I submitted some patches [1] about 10 months ago,
> >but, found some more problems to be fixed before solving this problem.
> >It requires many hooks in allocator hotpath so some developers doesn't
> >like it. Instead, some of them suggest different approach [2] to fix
> >all the problems related to CMA, that is, introducing a new zone to deal
> >with free CMA pages. I agree that it is the best way to go so implement
> >here. Although properties of ZONE_MOVABLE and ZONE_CMA is similar, I
> >decide to add a new zone rather than piggyback on ZONE_MOVABLE since
> >they have some differences. First, reserved CMA pages should not be
> >offlined. If freepage for CMA is managed by ZONE_MOVABLE, we need to keep
> >MIGRATE_CMA migratetype and insert many hooks on memory hotplug code
> >to distiguish hotpluggable memory and reserved memory for CMA in the same
> >zone. It would make memory hotplug code which is already complicated
> >more complicated. Second, cma_alloc() can be called more frequently
> >than memory hotplug operation and possibly we need to control
> >allocation rate of ZONE_CMA to optimize latency in the future.
> >In this case, separate zone approach is easy to modify. Third, I'd
> >like to see statistics for CMA, separately. Sometimes, we need to debug
> >why cma_alloc() is failed and separate statistics would be more helpful
> >in this situtaion.
> >
> >Anyway, this patchset solves four problems related to CMA implementation.
> >
> >1) Utilization problem
> >As mentioned above, we can't utilize full memory capacity due to the
> >limitation of CMA freepage and fallback policy. This patchset implements
> >a new zone for CMA and uses it for GFP_HIGHUSER_MOVABLE request. This
> >typed allocation is used for page cache and anonymous pages which
> >occupies most of memory usage in normal case so we can utilize full
> >memory capacity. Below is the experiment result about this problem.
> >
> >8 CPUs, 1024 MB, VIRTUAL MACHINE
> >make -j16
> >
> ><Before this series>
> >CMA reserve: 0 MB 512 MB
> >Elapsed-time: 92.4 186.5
> >pswpin: 82 18647
> >pswpout: 160 69839
> >
> ><After this series>
> >CMA reserve: 0 MB 512 MB
> >Elapsed-time: 93.1 93.4
> >pswpin: 84 46
> >pswpout: 183 92
> >
> >FYI, there is another attempt [3] trying to solve this problem in lkml.
> >And, as far as I know, Qualcomm also has out-of-tree solution for this
> >problem.
> >
> >2) Reclaim problem
> >Currently, there is no logic to distinguish CMA pages in reclaim path.
> >If reclaim is initiated for unmovable and reclaimable allocation,
> >reclaiming CMA pages doesn't help to satisfy the request and reclaiming
> >CMA page is just waste. By managing CMA pages in the new zone, we can
> >skip to reclaim ZONE_CMA completely if it is unnecessary.
> >
> >3) Atomic allocation failure problem
> >Kswapd isn't started to reclaim pages when allocation request is movable
> >type and there is enough free page in the CMA region. After bunch of
> >consecutive movable allocation requests, free pages in ordinary region
> >(not CMA region) would be exhausted without waking up kswapd. At that time,
> >if atomic unmovable allocation comes, it can't be successful since there
> >is not enough page in ordinary region. This problem is reported
> >by Aneesh [4] and can be solved by this patchset.
> >
> >4) Inefficiently work of compaction
> >Usual high-order allocation request is unmovable type and it cannot
> >be serviced from CMA area. In compaction, migration scanner doesn't
> >distinguish migratable pages on the CMA area and do migration.
> >In this case, even if we make high-order page on that region, it
> >cannot be used due to type mismatch. This patch will solve this problem
> >by separating CMA pages from ordinary zones.
> >
> >[1] https://lkml.org/lkml/2014/5/28/64
> >[2] https://lkml.org/lkml/2014/11/4/55
> >[3] https://lkml.org/lkml/2014/10/15/623
> >[4] http://www.spinics.net/lists/linux-mm/msg100562.html
> >[5] https://lkml.org/lkml/2014/5/30/320
> >
> >For this patch:
> >
> >Currently, reserved pages for CMA are managed together with normal pages.
> >To distinguish them, we used migratetype, MIGRATE_CMA, and
> >do special handlings for this migratetype. But, it turns out that
> >there are too many problems with this approach and to fix all of them
> >needs many more hooks to page allocation and reclaim path so
> >some developers express their discomfort and problems on CMA aren't fixed
> >for a long time.
> >
> >To terminate this situation and fix CMA problems, this patch implements
> >ZONE_CMA. Reserved pages for CMA will be managed in this new zone. This
> >approach will remove all exisiting hooks for MIGRATE_CMA and many
> >problems related to CMA implementation will be solved.
> >
> >This patch only add basic infrastructure of ZONE_CMA. In the following
> >patch, ZONE_CMA is actually populated and used.
> >
> >Adding a new zone could cause two possible problems. One is the overflow
> >of page flags and the other is GFP_ZONES_TABLE issue.
> >
> >Following is page-flags layout described in page-flags-layout.h.
> >
> >1. No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
> >2. " plus space for last_cpupid: | NODE | ZONE | LAST_CPUPID ... | FLAGS |
> >3. classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
> >4. " plus space for last_cpupid: | SECTION | NODE | ZONE | LAST_CPUPID ... | FLAGS |
> >5. classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
> >
> >There is no problem in #1, #2 configurations for 64-bit system. There are
> >enough room even for extremiely large x86_64 system. 32-bit system would
> >not have many nodes so it would have no problem, too.
> >System with #3, #4, #5 configurations could be affected by this zone
> >addition, but, thanks to recent THP rework which reduce one page flag,
> >problem surface would be small. In some configurations, problem is
> >still possible, but, it highly depends on individual configuration
> >so impact cannot be easily estimated. I guess that usual system
> >with CONFIG_CMA would not be affected. If there is a problem,
> >we can adjust section width or node width for that architecture.
> >
> >Currently, GFP_ZONES_TABLE is 32-bit value for 32-bit bit operation
> >in the 32-bit system. If we add one more zone, it will be 48-bit and
> >32-bit bit operation cannot be possible. Although it will cause slight
> >overhead, there is no other way so this patch relax GFP_ZONES_TABLE's
> >32-bit limitation. 32-bit System with CONFIG_CMA will be affected by
> >this change but it would be marginal.
> >
> >Note that there are many checkpatch warnings but I think that current
> >code is better for readability than fixing them up.
> >
> >Signed-off-by: Joonsoo Kim <[email protected]>
> >---
> > arch/x86/mm/highmem_32.c | 8 +++++
> > include/linux/gfp.h | 29 +++++++++++-------
> > include/linux/mempolicy.h | 2 +-
> > include/linux/mmzone.h | 31 ++++++++++++++++++-
> > include/linux/vm_event_item.h | 10 ++++++-
> > include/trace/events/compaction.h | 10 ++++++-
> > kernel/power/snapshot.c | 8 +++++
> > mm/memory_hotplug.c | 3 ++
> > mm/page_alloc.c | 63 +++++++++++++++++++++++++++++++++------
> > mm/vmstat.c | 9 +++++-
> > 10 files changed, 148 insertions(+), 25 deletions(-)
> >
> >diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c
> >index a6d7392..a7fcb12 100644
> >--- a/arch/x86/mm/highmem_32.c
> >+++ b/arch/x86/mm/highmem_32.c
> >@@ -120,6 +120,14 @@ void __init set_highmem_pages_init(void)
> > if (!is_highmem(zone))
> > continue;
> >
> >+ /*
> >+ * ZONE_CMA is a special zone that should not be
> >+ * participated in initialization because it's pages
> >+ * would be initialized by initialization of other zones.
> >+ */
> >+ if (is_zone_cma(zone))
> >+ continue;
> >+
> > zone_start_pfn = zone->zone_start_pfn;
> > zone_end_pfn = zone_start_pfn + zone->spanned_pages;
> >
> >diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> >index 570383a..4d6c008 100644
> >--- a/include/linux/gfp.h
> >+++ b/include/linux/gfp.h
> >@@ -301,6 +301,12 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
> > #define OPT_ZONE_DMA32 ZONE_NORMAL
> > #endif
> >
> >+#ifdef CONFIG_CMA
> >+#define OPT_ZONE_CMA ZONE_CMA
> >+#else
> >+#define OPT_ZONE_CMA ZONE_MOVABLE
> >+#endif
> >+
> > /*
> > * GFP_ZONE_TABLE is a word size bitstring that is used for looking up the
> > * zone to use given the lowest 4 bits of gfp_t. Entries are ZONE_SHIFT long
> >@@ -331,7 +337,6 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
> > * 0xe => BAD (MOVABLE+DMA32+HIGHMEM)
> > * 0xf => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
> > *
> >- * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms.
> > */
> >
> > #if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4
> >@@ -341,19 +346,21 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
> > #define GFP_ZONES_SHIFT ZONES_SHIFT
> > #endif
> >
> >-#if 16 * GFP_ZONES_SHIFT > BITS_PER_LONG
> >-#error GFP_ZONES_SHIFT too large to create GFP_ZONE_TABLE integer
> >+#if !defined(CONFIG_64BITS) && GFP_ZONES_SHIFT > 2
> >+#define GFP_ZONE_TABLE_CAST unsigned long long
> >+#else
> >+#define GFP_ZONE_TABLE_CAST unsigned long
> > #endif
> >
> > #define GFP_ZONE_TABLE ( \
> >- (ZONE_NORMAL << 0 * GFP_ZONES_SHIFT) \
> >- | (OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT) \
> >- | (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT) \
> >- | (OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT) \
> >- | (ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT) \
> >- | (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT) \
> >- | (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT)\
> >- | (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT)\
> >+ ((GFP_ZONE_TABLE_CAST) ZONE_NORMAL << 0 * GFP_ZONES_SHIFT) \
> >+ | ((GFP_ZONE_TABLE_CAST) OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT) \
> >+ | ((GFP_ZONE_TABLE_CAST) OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT) \
> >+ | ((GFP_ZONE_TABLE_CAST) OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT) \
> >+ | ((GFP_ZONE_TABLE_CAST) ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT) \
> >+ | ((GFP_ZONE_TABLE_CAST) OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT) \
> >+ | ((GFP_ZONE_TABLE_CAST) OPT_ZONE_CMA << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT) \
> >+ | ((GFP_ZONE_TABLE_CAST) OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT) \
> > )
> >
> > /*
> >diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> >index 4429d25..c4cc86e 100644
> >--- a/include/linux/mempolicy.h
> >+++ b/include/linux/mempolicy.h
> >@@ -157,7 +157,7 @@ extern enum zone_type policy_zone;
> >
> > static inline void check_highest_zone(enum zone_type k)
> > {
> >- if (k > policy_zone && k != ZONE_MOVABLE)
> >+ if (k > policy_zone && k != ZONE_MOVABLE && !is_zone_cma_idx(k))
> > policy_zone = k;
> > }
> >
> >diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >index f4ae0abb..5c97ba9 100644
> >--- a/include/linux/mmzone.h
> >+++ b/include/linux/mmzone.h
> >@@ -322,6 +322,9 @@ enum zone_type {
> > ZONE_HIGHMEM,
> > #endif
> > ZONE_MOVABLE,
> >+#ifdef CONFIG_CMA
> >+ ZONE_CMA,
> >+#endif
> > #ifdef CONFIG_ZONE_DEVICE
> > ZONE_DEVICE,
> > #endif
> >@@ -812,11 +815,37 @@ static inline int zone_movable_is_highmem(void)
> > }
> > #endif
> >
> >+static inline int is_zone_cma_idx(enum zone_type idx)
> >+{
> >+#ifdef CONFIG_CMA
> >+ return idx == ZONE_CMA;
> >+#else
> >+ return 0;
> >+#endif
> >+}
> >+
> >+static inline int is_zone_cma(struct zone *zone)
> >+{
> >+ int zone_idx = zone_idx(zone);
> >+
> >+ return is_zone_cma_idx(zone_idx);
> >+}
> >+
> >+static inline int zone_cma_is_highmem(void)
> >+{
> >+#ifdef CONFIG_HIGHMEM
>
> Whether it needs to check the CONFIG_CMA here also?

It's not necessary because zone_cma_is_highmem() is only called after
checking whether the zone is CMA or not.

>
> >+ return 1;
> >+#else
> >+ return 0;
> >+#endif
> >+}
> >+
> > static inline int is_highmem_idx(enum zone_type idx)
> > {
> > #ifdef CONFIG_HIGHMEM
> > return (idx == ZONE_HIGHMEM ||
> >- (idx == ZONE_MOVABLE && zone_movable_is_highmem()));
> >+ (idx == ZONE_MOVABLE && zone_movable_is_highmem()) ||
> >+ (is_zone_cma_idx(idx) && zone_cma_is_highmem()));
>
> When CONFIG_HIGHMEM defined, zone_cma_is_highmem() will always return 1.
> I think it is not necessary to call the function here, and even define
> it.

We can remove it. But I'd like to keep it because we can later do
something similar to ZONE_MOVABLE, which doesn't set the highmem bit
unconditionally but decides by checking the memory map.
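
A small user-space sketch of the check, assuming a build with both
CONFIG_HIGHMEM and CONFIG_CMA enabled. enum zone_type_model and the two
flags are hypothetical stand-ins. In particular, cma_in_highmem models a
possible future where the answer comes from the memory map; in this
patch the function simply returns 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced zone list for the model. Not the kernel's enum zone_type. */
enum zone_type_model { ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE, ZONE_CMA };

/* Hypothetical flags that boot code would set from the memory map. */
static bool movable_in_highmem;		/* false unless set */
static bool cma_in_highmem = true;	/* patch behaviour: always true */

static bool is_zone_cma_idx(enum zone_type_model idx)
{
	return idx == ZONE_CMA;
}

static bool zone_movable_is_highmem(void)
{
	return movable_in_highmem;
}

static bool zone_cma_is_highmem(void)
{
	return cma_in_highmem;
}

static bool is_highmem_idx(enum zone_type_model idx)
{
	assert(idx <= ZONE_CMA);

	/*
	 * && short-circuits, so zone_cma_is_highmem() only runs once
	 * is_zone_cma_idx() has already returned true.
	 */
	return idx == ZONE_HIGHMEM ||
	       (idx == ZONE_MOVABLE && zone_movable_is_highmem()) ||
	       (is_zone_cma_idx(idx) && zone_cma_is_highmem());
}
```

Keeping the helper means only its body has to change if ZONE_CMA later
stops being treated as highmem unconditionally.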

Thanks.

2016-04-29 09:29:12

by Mel Gorman

Subject: Re: [PATCH v2 0/6] Introduce ZONE_CMA

On Fri, Apr 29, 2016 at 03:51:45PM +0900, Joonsoo Kim wrote:
> Hello, Mel.
>
> IIUC, you may miss that alloc_contig_range() currently does linear
> reclaim/migration. Your comment is largely based on this
> misunderstanding so please keep it in your mind when reading the
> reply.
>

Ok, you're right, but if anything this weighs more heavily *against* the zone.
If linear reclaim is not able to work at the moment due to pinned pages
then how is a zone going to help in the slightest?

If pages cannot be directly reclaimed then no amount of moving the pages
into a separate zone and teaching kswapd new tricks is going to change
whether those pages can be reclaimed or not. At best, it alters the
timing of when problems occur.

If this is partially about kswapd waking up to reclaim pages suitable for
atomic allocations then the classzone_idx handling of kswapd needs to be
improved. It was very haphazard although improved slightly recently. The
node-lru series attempts to improve it further.

> On Thu, Apr 28, 2016 at 11:39:27AM +0100, Mel Gorman wrote:
> > On Mon, Apr 25, 2016 at 02:36:54PM +0900, Joonsoo Kim wrote:
> > > > Hello,
> > > >
> > > > Changes from v1
> > > > o Separate some patches which deserve to submit independently
> > > > o Modify description to reflect current kernel state
> > > > (e.g. high-order watermark problem disappeared by Mel's work)
> > > > o Don't increase SECTION_SIZE_BITS to make a room in page flags
> > > > (detailed reason is on the patch that adds ZONE_CMA)
> > > > o Adjust ZONE_CMA population code
> > > >
> > > > This series try to solve problems of current CMA implementation.
> > > >
> > > > CMA is introduced to provide physically contiguous pages at runtime
> > > > without exclusive reserved memory area. But, current implementation
> > > > works like as previous reserved memory approach, because freepages
> > > > on CMA region are used only if there is no movable freepage. In other
> > > > words, freepages on CMA region are only used as fallback. In that
> > > > situation where freepages on CMA region are used as fallback, kswapd
> > > > would be woken up easily since there is no unmovable and reclaimable
> > > > freepage, too. If kswapd starts to reclaim memory, fallback allocation
> > > > to MIGRATE_CMA doesn't occur any more since movable freepages are
> > > > already refilled by kswapd and then most of freepage on CMA are left
> > > > to be in free. This situation looks like exclusive reserved memory case.
> > > >
> >
> > My understanding is that this was intentional. One of the original design
> > requirements was that CMA have a high likelihood of allocation success for
> > devices if it was necessary as an allocation failure was very visible to
> > the user. It does not *have* to be treated as a reserve because Movable
> > allocations could try CMA first but it increases allocation latency for
> > devices that require it and it gets worse if those pages are pinned.
>
> I know that it was design decision at that time when CMA isn't
> actively used. It is due to lack of experience and now situation is
> quite different. Most of embedded systems uses CMA with their own
> adaptation because utilization is too low. It makes system much
> slower and this is more likely than the case that device memory is
> required. Given the fact that they adapt their logic to utilize CMA
> much more and sacrifice latency, I think that previous design
> decision is wrong and we should go another way.
>

Then interleave between CMA and !CMA regions in the slow path for movable
allocations. Moving to a zone now will temporarily work and fail again
when the fair zone allocation policy is removed.

> >
> > > > In my experiment, I found that if system memory has 1024 MB memory and
> > > > 512 MB is reserved for CMA, kswapd is mostly woken up when roughly 512 MB
> > > > free memory is left. Detailed reason is that for keeping enough free
> > > > memory for unmovable and reclaimable allocation, kswapd uses below
> > > > equation when calculating free memory and it easily go under the watermark.
> > > >
> > > > Free memory for unmovable and reclaimable = Free total - Free CMA pages
> > > >
> > > > This is derivated from the property of CMA freepage that CMA freepage
> > > > can't be used for unmovable and reclaimable allocation.
> > > >
> >
> > Yes and also keeping it lightly utilised to reduce CMA allocation
> > latency and probability of failure.
>
> As my experience about CMA, most of unacceptable failure (takes more
> than 3 sec) comes from blockdev pagecache. Even, it's not simple to
> check what is going on there when failure happen. ZONE_CMA uses
> different approach that it only takes the request with
> GFP_HIGHUSER_MOVABLE so blockdev pagecache cannot get in and
> probability of failure is much reduced.
>

If ZONE_CMA is protected from blockdev allocations then it's altering the
problem in a different way. The utilisation of ZONE_CMA for such allocations
will be lower and while this may side-step some pinning issues it may be
the case that ZONE_CMA is underutilised depending on what the workload is.

> > > > Anyway, in this case, kswapd are woken up when (FreeTotal - FreeCMA)
> > > > is lower than low watermark and tries to make free memory until
> > > > (FreeTotal - FreeCMA) is higher than high watermark. That results
> > > > in that FreeTotal is moving around 512MB boundary consistently. It
> > > > then means that we can't utilize full memory capacity.
> > > >
> > > > To fix this problem, I submitted some patches [1] about 10 months ago,
> > > > but, found some more problems to be fixed before solving this problem.
> > > > It requires many hooks in allocator hotpath so some developers doesn't
> > > > like it. Instead, some of them suggest different approach [2] to fix
> > > > all the problems related to CMA, that is, introducing a new zone to deal
> > > > with free CMA pages. I agree that it is the best way to go so implement
> > > > here. Although properties of ZONE_MOVABLE and ZONE_CMA is similar,
> >
> > One of the issues I mentioned at LSF/MM is that I consider ZONE_MOVABLE
> > to be a mistake. Zones are meant to be about addressing limitations and
> > both ZONE_MOVABLE and ZONE_CMA violate that. When ZONE_MOVABLE was
> > introduced, it was intended for use with dynamically resizing the
> > hugetlbfs pool. It was competing with fragmentation avoidance at the
> > time and the community could not decide which approach was better so
> > both ended up being merged as they had different advantages and
> > disadvantages.
> >
> > Now, ZONE_MOVABLE is being abused -- memory hotplug was a particular mistake
> > and I don't want to see CMA fall down the same hole. Both CMA and memory
> > hotplug would benefit from the notion of having "sticky" MIGRATE_MOVABLE
> > pageblocks that are never used for UNMOVABLE and RECLAIMABLE fallbacks.
> > It costs to detect that in the slow path but zones cause their own problems.
>
> Please elaborate more concrete reasons that you think why ZONE_MOVABLE
> is a mistake. Simply saying that zones are meant to be about address
> limitations doesn't make sense. Moreover, I think that this original
> purpose of zone could be changed if needed. It was introduced for that
> purpose but time goes by a lot. We have different requirement now and
> zone is suitable to handle this new requirement. And, if we think
> address limitation more generally, it can be considered as different
> characteristic memory problem. Zone is introduced to handle this
> situation and that's what new CMA implementation needs.
>

ZONE_MOVABLE is a mistake because it reintroduces a variation of
lowmem/highmem problems when aggressively used. Memory hotplug is a good
example when memory is only added to the movable zone. All kernel allocations
and page tables then use a limited amount of memory triggering premature
reclaim. In extreme cases, the allocations simply fail as no !ZONE_MOVABLE is
available or can be reclaimed even though plenty of memory is free overall.

A slightly different problem is page age inversion. Movable allocations
in ZONE_MOVABLE get artificial protection versus pages in lower zones
when zones are imbalanced. kswapd reclaims from lowest to highest zones
whereas allocations go from higher zones to lower zones. Under memory
pressure, newer pages from lower zones are potentially reclaimed before old
pages in higher zones. This is highly workload-dependent and it's mitigated
somewhat by fair zone interleaving but it's an issue.

> > > > 1) Utilization problem
> > > > As mentioned above, we can't utilize full memory capacity due to the
> > > > limitation of CMA freepage and fallback policy. This patchset implements
> > > > a new zone for CMA and uses it for GFP_HIGHUSER_MOVABLE request. This
> > > > typed allocation is used for page cache and anonymous pages which
> > > > occupies most of memory usage in normal case so we can utilize full
> > > > memory capacity. Below is the experiment result about this problem.
> > > >
> >
> > A zone is not necessary for that. Currently a zone would have a side-benefit
>
> Agreed that a zone isn't necessary for it. I also tried the approach
> interleaving within a normal zone about two years ago, and, because
> there are more remaining problems and it needs more hooks in many
> places, people don't like it. You can see the implementation at the link below.
>
> https://lkml.org/lkml/2014/5/28/64
>
>
> > from the fair zone allocation policy because it would interleave between
> > ZONE_CMA and ZONE_MOVABLE. However, the intention is to remove that policy
> > by moving LRUs to the node. Once that happens, the interleaving benefit
> > is lost and you're back to square one.
> >
> > There may be some justification for interleaving *only* between MOVABLE and
> > MOVABLE_STICKY for CMA allocations and hiding that behind both a CONFIG_CMA
> > guard *and* a check whether a CMA region exists. It'd still need
> > something like the BATCH vmstat but it would only be updated when CMA is
> > active and hide it from the fast paths.
>
> And, please don't focus on interleaving allocation problem. Main
> problem I'd like to solve is the utilization problem and interleaving
> is optional benefit from fair zone policy.

The stated issue is that "we can't utilize full memory capacity due to
the limitation of CMA freepage and fallback policy". Adding a zone still
has a fallback policy that forbids UNMOVABLE and RECLAIMABLE allocations
(or if it doesn't, it completely breaks the concept of CMA). If you intend
to restrict which MOVABLE allocations use ZONE_CMA then the utilisation
problem still exists.

> > > > 2) Reclaim problem
> > > > Currently, there is no logic to distinguish CMA pages in reclaim path.
> > > > If reclaim is initiated for unmovable and reclaimable allocation,
> > > > reclaiming CMA pages doesn't help to satisfy the request and reclaiming
> > > > CMA page is just waste. By managing CMA pages in the new zone, we can
> > > > skip to reclaim ZONE_CMA completely if it is unnecessary.
> > > >
> >
> > This problem will recur with node-lru. However, that said CMA reclaim is
>
> Yes, it will. But, it will also happen on ZONE_HIGHMEM if we use
> node-lru. What's your plan to handle this issue? Because you cannot
> remove ZONE_HIGHMEM, you need to handle it properly and that way would
> naturally work well for ZONE_CMA as well.
>

For highmem configurations, allocation requests that require lower zones skip
the highmem pages. This means that configurations with a large highmem/lowmem
ratio will have to scan and skip more pages, at the cost of higher CPU usage.
The thinking behind it is that configurations with a large highmem/lowmem
ratio are already sub-optimal. It made some sense when 32-bit CPUs dominated
in large memory configurations but not 10+ years later.
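
To make the scanning cost concrete, here is a rough sketch of a shared
node LRU being scanned on behalf of a request limited to lower zones.
It is a toy model, not the node-lru code: the enum, struct and function
names are all made up for illustration.

```c
#include <assert.h>

/* Hypothetical zone indexes, ordered from lowest to highest. */
enum zone_idx { Z_DMA, Z_NORMAL, Z_HIGHMEM };

struct scan_stats {
	int scanned;	/* pages looked at on the node LRU */
	int skipped;	/* pages above the request's highest usable zone */
};

/*
 * Walk one node-wide LRU, modelled as an array holding the zone of each
 * page. Pages in zones above classzone_idx are of no use to the request
 * and are skipped, but they still cost a scan step.
 */
struct scan_stats scan_node_lru(const enum zone_idx *page_zone, int nr_pages,
				enum zone_idx classzone_idx)
{
	struct scan_stats s = { 0, 0 };

	for (int i = 0; i < nr_pages; i++) {
		s.scanned++;
		if (page_zone[i] > classzone_idx)
			s.skipped++;
	}
	return s;
}

/* A lowmem request on a highmem-heavy LRU skips most of what it scans. */
void scan_example(void)
{
	enum zone_idx lru[] = { Z_HIGHMEM, Z_HIGHMEM, Z_NORMAL, Z_HIGHMEM };
	struct scan_stats s = scan_node_lru(lru, 4, Z_NORMAL);

	assert(s.scanned == 4);
	assert(s.skipped == 3);
}
```

The larger the share of pages above classzone_idx, the more of the scan
is wasted on skipping, which is the CPU cost referred to above.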

> > currently depending on randomly reclaiming followed by compaction. This is
> > both slow and inefficient. CMA and alloc_contig_range() should strongly
> > consider isolating pages with a PFN walk of the CMA regions and directly
> > reclaiming those pages. Those pages may need to be refaulted in but the
> > priority is for the allocation to succeed. That would side-step the issues
> > with kswapd scanning the wrong zones.
>
> I think that you are confused now. alloc_contig_range() already uses
> a PFN walk on the CMA regions and directly reclaims/migrates those
> pages. There is no random reclaim/migration here.
>

Fine, then what exactly does a zone solve? Because if you linearly scan
a CMA region and that fails then how does putting those pages into a
separate zone fix the scanning?

> > > > 3) Atomic allocation failure problem
> > > > Kswapd isn't started to reclaim pages when allocation request is movable
> > > > type and there is enough free page in the CMA region. After bunch of
> > > > consecutive movable allocation requests, free pages in ordinary region
> > > > (not CMA region) would be exhausted without waking up kswapd. At that time,
> > > > if atomic unmovable allocation comes, it can't be successful since there
> > > > is not enough page in ordinary region. This problem is reported
> > > > by Aneesh [4] and can be solved by this patchset.
> > > >
> >
> > Not necessarily as kswapd currently would still reclaim from the lower
> > zones unnecessarily. Again, targeting the pages required for the CMA
> > allocation would side-step the issue. The actual core code of lumpy reclaim
> > was quite small.
>
> You may be confused here, too. Maybe, I misunderstand what you say
> here so please correct me if I'm missing something. First, let me
> clarify the problem.
>
> It's not the issue when calling alloc_contig_range(). It is the issue
> when MM manages (allocate/reclaim) pages on CMA area when they are not
> used by device now.
>
> The problem is that kswapd isn't woken up properly due to non-accurate
> watermark check.

Then fix the classzone_idx handling in kswapd and in the watermark check so
that it wakes up and reclaims. The handling of classzone_idx in kswapd is
haphazard. It has improved a little recently and the node-lru series tries
to clean it up a bit more.

> When we do watermark check, number of freepage is
> varied depending on allocation type. Non-movable allocation subtracts
> number of CMA freepages from total freepages. In other words, we adds
> number of CMA freepages to number of normal freepages when movable
> allocation is requested. Problem comes from here. While we handle
> bunch of movable allocation, we can't notice that there is not enough
> freepage in normal area because we adds up number of CMA freepages and
> watermark looks safe. In this case, kswapd would not be woken up. If
> atomic allocation suddenly comes in this situation, freepage in normal
> area could be low and atomic allocation can fail.
>

And moving to a zone doesn't necessarily fix that. If the CMA zone is
preserved as long as possible, the requests fill the lower zone and the
atomic allocation fails. If the policy is to use the CMA region first then
it does not matter if it's in a separate zone or not as the region used
for atomic allocations is untouched as long as possible.
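
For reference, the watermark behaviour under discussion can be sketched
roughly as below. This is a simplified model under stated assumptions,
not the kernel's __zone_watermark_ok(); the struct, the function names
and the numbers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified per-zone counters in pages (not struct zone). */
struct zone_counts {
	long free_total;	/* all free pages */
	long free_cma;		/* free pages on MIGRATE_CMA pageblocks */
	long low_wmark;		/* low watermark */
};

/*
 * Free pages a request may actually use. Movable requests can take CMA
 * free pages; unmovable and reclaimable requests cannot, so the CMA
 * free pages are subtracted for them.
 */
long usable_free(const struct zone_counts *z, bool movable)
{
	return movable ? z->free_total : z->free_total - z->free_cma;
}

/* Returns true if the check passes, i.e. kswapd would not be woken. */
bool watermark_ok(const struct zone_counts *z, bool movable)
{
	return usable_free(z, movable) > z->low_wmark;
}

/* The movable check passes while the non-CMA area is nearly empty. */
void watermark_example(void)
{
	struct zone_counts z = {
		.free_total = 600, .free_cma = 512, .low_wmark = 100,
	};

	assert(watermark_ok(&z, true));		/* 600 > 100 */
	assert(!watermark_ok(&z, false));	/* 600 - 512 = 88 */
}
```

In this model a run of movable requests never trips the check, so
nothing wakes kswapd before an unmovable atomic request arrives and
finds only 88 usable pages.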

> This is a really good example that comes from the fact that different
> types of pages are in a single zone. As I mentioned in other places,
> it's really error-prone design and handling it case by case is very fragile.
> Current long lasting problems about CMA is caused by this design
> decision and we need to change it.
>
> > > > 4) Inefficiently work of compaction
> > > > Usual high-order allocation request is unmovable type and it cannot
> > > > be serviced from CMA area. In compaction, migration scanner doesn't
> > > > distinguish migratable pages on the CMA area and do migration.
> > > > In this case, even if we make high-order page on that region, it
> > > > cannot be used due to type mismatch. This patch will solve this problem
> > > > by separating CMA pages from ordinary zones.
> > > >
> >
> > Compaction problems are actually compounded by introducing ZONE_CMA as
> > it only compacts within the zone. Compaction would need to know how to
> > compact within a node to address the introduction of ZONE_CMA.
> >
>
> I'm not sure what you'd like to say here. What I meant is following
> situation. Capital letter means pageblock type. (C:MIGRATE_CMA,
> M:MIGRATE_MOVABLE). F means freed pageblock due to compaction.
>
> CCCCCMMMM
>
> If compaction is invoked to make unmovable high order freepage,
> compaction would start to work. It empties front part of the zone and
> migrate them to rear part of the zone. High order freepages are made
> on front part of the zone and it may look like as following.
>
> FFFCCMMMM
>
> But, they are on CMA pageblock so it cannot be used to satisfy unmovable
> high order allocation. Freepages on CMA pageblock isn't allocated for
> unmovable allocation request. That is what I'd like to say here.
> We can fix it by adding corner case handling at some place but
> this kind of corner case handling is not what I want. It's not
> maintainable. It comes from current design decision (MIGRATE_CMA) and
> we need to change the situation.
>

The corner case may be necessary to skip MIGRATE_CMA pageblocks during
compaction if the caller is !MOVABLE and !CMA. I know it's a corner case but
it would alleviate this concern without creating new zones. Similar logic
could then be used by memory hotplug so it can get away from ZONE_MOVABLE.
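
A minimal sketch of that corner case, using the C/M layout from the
quoted example. The enum and function names are invented for
illustration and do not correspond to the real compaction code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request classes; not the kernel's migratetype enum. */
enum req_type { REQ_UNMOVABLE, REQ_RECLAIMABLE, REQ_MOVABLE, REQ_CMA };

/*
 * Pageblocks are given as characters, as in the quoted diagram:
 * 'C' = MIGRATE_CMA, 'M' = MIGRATE_MOVABLE.
 * Free pages created on a CMA pageblock are useless to a caller that is
 * neither MOVABLE nor CMA, so such a caller skips those pageblocks.
 */
bool skip_pageblock(char block, enum req_type req)
{
	return block == 'C' && req != REQ_MOVABLE && req != REQ_CMA;
}

/* Number of pageblocks the migration scanner would still consider. */
int pageblocks_considered(const char *layout, enum req_type req)
{
	int n = 0;

	for (; *layout; layout++)
		if (!skip_pageblock(*layout, req))
			n++;
	return n;
}

/* With the layout CCCCCMMMM, an unmovable caller only works on the M blocks. */
void compaction_example(void)
{
	assert(pageblocks_considered("CCCCCMMMM", REQ_UNMOVABLE) == 4);
	assert(pageblocks_considered("CCCCCMMMM", REQ_MOVABLE) == 9);
}
```

With this check the unmovable caller never empties the leading C
pageblocks, so the FFFCCMMMM outcome from the example does not arise.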

> > > The fact
> > > that pages under I/O cannot be moved is also the problem of all CMA
> > > approaches. It is just separate issue and should not affect the decision
> > > on ZONE_CMA.
> >
> > While it's a separate issue, it's also an important one. A linear reclaim
> > of a CMA region would at least be able to clearly identify pages that are
> > pinned in that region. Introducing the zone does not help the problem.
>
> I know that it's important. ZONE_CMA may not help the problem but
> it is also true for other approaches. They also don't help the
> problem. It's why I said it is a separate issue. I'm not sure what you
> mean as linear reclaim here but separate zone will make easy to adapt
> different reclaim algorithm if needed.
>
> Although it's a separate issue, I should mention one thing. Related to
> the I/O pinning issue, ZONE_CMA doesn't get blockdev allocation requests so
> the I/O pinning problem is much reduced.
>

This is not super-clear from the patch. blockdev is using GFP_USER so it
already should not be classed as MOVABLE. I could easily be looking in
the wrong place or have missed which allocation path sets __GFP_MOVABLE.
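
To spell out the flag relationship: GFP_HIGHUSER_MOVABLE is GFP_HIGHUSER
plus __GFP_MOVABLE, while GFP_USER carries no __GFP_MOVABLE. The sketch
below models only that relationship. The TOY_ bit values and the helper
are made up; the real definitions are in include/linux/gfp.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy bit values for illustration only, not the kernel's. */
#define TOY_GFP_HIGHMEM	0x1u
#define TOY_GFP_MOVABLE	0x2u
#define TOY_GFP_BASE	0x4u	/* stands in for the reclaim/IO/FS bits */

#define TOY_GFP_USER			(TOY_GFP_BASE)
#define TOY_GFP_HIGHUSER		(TOY_GFP_USER | TOY_GFP_HIGHMEM)
#define TOY_GFP_HIGHUSER_MOVABLE	(TOY_GFP_HIGHUSER | TOY_GFP_MOVABLE)

/* A request counts as movable only if the movable bit is set. */
bool request_is_movable(unsigned int gfp)
{
	return (gfp & TOY_GFP_MOVABLE) != 0;
}

/* A GFP_USER-style mask is not movable, the HIGHUSER_MOVABLE one is. */
void gfp_example(void)
{
	assert(!request_is_movable(TOY_GFP_USER));
	assert(request_is_movable(TOY_GFP_HIGHUSER_MOVABLE));
}
```

Under this model a GFP_USER request would not be eligible for a zone
that only accepts GFP_HIGHUSER_MOVABLE, with or without ZONE_CMA.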

> > What I was proposing at LSF/MM was the following;
> >
> > 1. Create the notion of a sticky MIGRATE_MOVABLE type.
> > UNMOVABLE and RECLAIMABLE cannot fallback to these regions. If a sticky
> > region exists then the fallback code will need additional checks
> > in the slow path. This is slow but it's the cost of protection
>
> First of all, I can't understand what is different between ZONE_CMA
> and sticky MIGRATE_MOVABLE. We already did it for MIGRATE_CMA and it
> is proved as error-prone design. #1 seems to be core concept of sticky
> MIGRATE_MOVABLE and I cannot understand difference between them.
> Please elaborate more on difference between them.
>

o Sticky MIGRATE_MOVABLE does not worry about potential zones overlapping.
o It does not further blur what a zone is meant to be for -- address
  limitations, which ZONE_MOVABLE also violates.
o It does not require a separate zone with potential page age inversion
  issues.
o It does not require additional memory footprint for per-cpu allocation,
  page lock waitqueues and accounting.
o In the current reclaim implementation, it does not require zone
  balancing tricks, although that concern goes away with node-lru.

Some of this overlaps with the problems ZONE_MOVABLE has.
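
The fallback rule from item 1 of the proposal can be sketched as
follows. This is only an illustration under simplifying assumptions:
the enum and function are hypothetical, and every fallback that does
not involve a sticky pageblock is treated as allowed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical types; MT_MOVABLE_STICKY does not exist in the kernel. */
enum mtype { MT_UNMOVABLE, MT_RECLAIMABLE, MT_MOVABLE, MT_MOVABLE_STICKY };

/*
 * May a request of type 'request' take pages from a pageblock of type
 * 'block'? Sticky movable pageblocks only ever serve movable requests;
 * UNMOVABLE and RECLAIMABLE cannot fall back to them.
 */
bool may_use_pageblock(enum mtype request, enum mtype block)
{
	if (block == MT_MOVABLE_STICKY)
		return request == MT_MOVABLE || request == MT_MOVABLE_STICKY;
	return true;	/* simplification: other fallbacks always allowed */
}

/* Only the sticky blocks are protected; plain MOVABLE blocks are not. */
void sticky_example(void)
{
	assert(!may_use_pageblock(MT_UNMOVABLE, MT_MOVABLE_STICKY));
	assert(!may_use_pageblock(MT_RECLAIMABLE, MT_MOVABLE_STICKY));
	assert(may_use_pageblock(MT_MOVABLE, MT_MOVABLE_STICKY));
	assert(may_use_pageblock(MT_UNMOVABLE, MT_MOVABLE));
}
```

The extra comparison would sit in the fallback slow path, which is the
cost of protection mentioned in the proposal.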

> > 4. Interleave MOVABLE and sticky MOVABLE if desired
> > This would be in the fallback paths only and be specific to CMA. This
> > would alleviate the utilisation problems while not impacting the fast
> > paths for everyone else. Functionally it would be similar to the fair
> > zone allocation policy which is currently in the fast path and scheduled
> > for removal.
>
> Interleaving can be implemented in any case if desired. At least now, thanks
> to zone fair policy, ZONE_CMA would get benefit of interleaving.

For maybe one release if things go according to plan.

> Overall, I do not see any advantage of sticky MIGRATE_MOVABLE design
> at least now. Main reason is that ZONE_CMA is introduced to replace
> MIGRATE_CMA which is conceptually same with your sticky
> MIGRATE_MOVABLE proposal. It doesn't solve any issues mentioned here
> and we should not repeat same mistake again.
>

ZONE_MOVABLE in itself was a mistake, particularly when it was used for
memory hotplug "guaranteeing" that memory could be removed. I'm not going to
outright NAK your series but I won't ACK it either. Zones come with their own
class of problems and I suspect that CMA will still be having discussions on
utilisation and reliability problems in the future even if the zone is added.

--
Mel Gorman
SUSE Labs