From: Joonsoo Kim <[email protected]>
Changes from v6
o Rebase on next-20170405
o Add a fix for lowmem mapping on ARM (last patch)
o Re-organize the cover letter
Changes from v5
o Rebase on next-20161013
o Cosmetic change on patch 1
o Optimize the span of ZONE_CMA on multiple-node systems
Changes from v4
o Rebase on next-20160825
o Add general fix patch for lowmem reserve
o Fix lowmem reserve ratio
o Fix zone span optimization per Vlastimil
o Fix pageset initialization
o Change invocation timing on cma_init_reserved_areas()
Changes from v3
o Rebase on next-20160805
o Split first patch per Vlastimil
o Remove useless function parameter per Vlastimil
o Add code comment per Vlastimil
o Add following description on cover-letter
Changes from v2
o Rebase on next-20160525
o No other changes except following description
Changes from v1
o Separate out some patches that deserve to be submitted independently
o Modify description to reflect current kernel state
(e.g. high-order watermark problem disappeared by Mel's work)
o Don't increase SECTION_SIZE_BITS to make room in page flags
(detailed reason is on the patch that adds ZONE_CMA)
o Adjust ZONE_CMA population code
Hello,
This is the 7th version of the ZONE_CMA patchset. One patch is added
to fix a potential problem on ARM. The other changes are just due to the rebase.
This patchset has a long history and has received several reviews. This
cover letter summarizes them along with my opinion. Since the content
order can be confusing, here is a simple index. If you want to understand
the history properly, please read the parts in reverse order.
PART 1. Strong points of the zone approach
PART 2. Summary in LSF/MM 2016 discussion
PART 3. Original motivation of this patchset
***** PART 1 *****
CMA has many problems, which I describe at the bottom of this cover
letter. These problems come from the constraint that CMA memory must
always remain migratable for device usage. I think that introducing
a new zone is the best approach to solve them. Here are the reasons.
Zones were introduced to solve issues caused by H/W addressing
limitations, and the MM subsystem is implemented to work efficiently
with them: the allocation and reclaim logic takes these limitations
into account throughout. What this patchset does is introduce a new
zone and extend the zone concept slightly: a zone can have not only
a H/W addressing limitation but also a S/W limitation that guarantees
page migration. This concept originates from ZONE_MOVABLE, which has
worked well for a long time, so ZONE_CMA should not be considered
special at this point.
Mel's major concern is that a zone with a S/W limitation, such as
ZONE_MOVABLE, causes a highmem/lowmem problem: some memory cannot be
used for kernel allocations because of the zone's limitation. That
breaks LRU ordering and makes it hard to find kernel-usable memory
under memory pressure.
However, the important point is that this problem does not come from
the implementation detail (zone vs. migratetype). Even if we implement
it with a migratetype instead of ZONE_MOVABLE, that memory still cannot
be used for kernel allocations because it must stay migratable, so LRU
ordering is broken either way. We cannot avoid the problem in any case.
Therefore, we should focus on which solution is easier to maintain and
less intrusive to the MM subsystem.
From this viewpoint, I think the zone approach is better. As mentioned
earlier, the MM subsystem already has plenty of infrastructure to deal
with a zone's H/W addressing limitation; adding a S/W limitation to the
zone concept and adding a new zone doesn't change anything, and it works
by itself. This patchset removes many hooks related to CMA area
management in MM while solving the problems, whereas the migratetype
approach would require even more hooks to solve them.
Although Mel withdrew from the review, Vlastimil expressed agreement
with this new zone approach [6].
"I realize I differ here from much more experienced mm guys, and will
probably deservingly regret it later on, but I think that the ZONE_CMA
approach could work indeed better than current MIGRATE_CMA pageblocks."
If anyone has a different opinion, please let me know.
Thanks.
***** PART 2 *****
There was a discussion with Mel [5] after LSF/MM 2016. I could summarise
it to help with the merge decision, but it is better to read it yourself,
since any summary of mine would be biased towards me. Still, if anyone
would like a summary, I will write one. :)
Anyway, Mel's position on this patchset seems to be neutral. He said:
"I'm not going to outright NAK your series but I won't ACK it either"
We can fix the problems with either approach, but I would prefer the
new zone approach because it is less error-prone: it reduces some
corner-case handling now and removes the need for further corner-case
handling to fix the remaining problems.
Note that our company is already using ZONE_CMA and has seen no problems.
If anyone has a different opinion, please let me know and let's discuss
together.
Andrew, if there is something to do for merge, please let me know.
***** PART 3 *****
This series tries to solve problems in the current CMA implementation.
CMA was introduced to provide physically contiguous pages at runtime
without an exclusively reserved memory area. However, the current
implementation behaves much like the old reserved-memory approach,
because free pages in the CMA region are used only when there is no
movable free page left. In other words, CMA free pages are used only as
a fallback. In that situation, kswapd is woken up easily since there is
no unmovable or reclaimable free page either. Once kswapd starts to
reclaim memory, fallback allocation to MIGRATE_CMA no longer occurs
because kswapd has already refilled the movable free pages, so most of
the CMA free pages are left unused. This behaves just like the
exclusively reserved memory case.
In my experiment, I found that on a system with 1024 MB of memory and
512 MB reserved for CMA, kswapd is mostly woken up when roughly 512 MB
of free memory is left. The detailed reason is that, to keep enough free
memory for unmovable and reclaimable allocations, kswapd uses the
equation below when calculating free memory, and that value easily drops
under the watermark.
Free memory for unmovable and reclaimable = Free total - Free CMA pages
This follows from the property that CMA free pages cannot be used for
unmovable and reclaimable allocations.
In this case, kswapd is woken up when (FreeTotal - FreeCMA) drops below
the low watermark and keeps reclaiming until (FreeTotal - FreeCMA) rises
above the high watermark. As a result, FreeTotal constantly hovers
around the 512 MB boundary, which means we can't utilize the full memory
capacity.
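As a rough illustration of the mechanism, here is a minimal userspace
sketch, not the kernel code; it mirrors the ALLOC_CMA handling in
__zone_watermark_ok() that this series later removes, with made-up
numbers in MB:

#include <stdbool.h>
#include <stdio.h>

/* Simplified: if the allocation can't use CMA, free CMA pages don't count. */
static bool watermark_ok(long free_total, long free_cma,
			 long watermark, bool can_use_cma)
{
	long free = free_total;

	if (!can_use_cma)
		free -= free_cma;

	return free > watermark;
}

int main(void)
{
	/* 1024 MB system, 512 MB CMA reserve, ~520 MB free in total. */
	long free_total = 520, free_cma = 510, low_wmark = 8;

	/* Unmovable/reclaimable request: only ~10 MB counts, so kswapd wakes. */
	printf("unmovable request ok: %d\n",
	       watermark_ok(free_total, free_cma, low_wmark, false));
	/* Movable request: all ~520 MB counts, no wakeup needed. */
	printf("movable request ok:   %d\n",
	       watermark_ok(free_total, free_cma, low_wmark, true));
	return 0;
}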
To fix this problem, I submitted some patches [1] about 10 months ago,
but found more problems that needed to be fixed first. That approach
requires many hooks in the allocator hotpath, so some developers didn't
like it. Instead, some of them suggested a different approach [2] to fix
all the CMA-related problems: introducing a new zone to deal with free
CMA pages. I agree that it is the best way to go, so it is implemented
here. Although the properties of ZONE_MOVABLE and ZONE_CMA are similar,
I decided to add a new zone rather than piggyback on ZONE_MOVABLE since
they have some differences. First, reserved CMA pages should not be
offlined. If CMA free pages were managed by ZONE_MOVABLE, we would need
to keep the MIGRATE_CMA migratetype and insert many hooks into the
memory hotplug code to distinguish hotpluggable memory from reserved CMA
memory in the same zone, making the already complicated memory hotplug
code even more complicated. Second, cma_alloc() can be called much more
frequently than memory hotplug operations, and we may need to control
the allocation rate of ZONE_CMA to optimize latency in the future; a
separate zone is easier to modify for that. Third, I'd like to see
separate statistics for CMA. Sometimes we need to debug why cma_alloc()
fails, and separate statistics are more helpful in that situation.
This patchset solves four problems related to the CMA implementation.
1) Utilization problem
As mentioned above, we can't utilize the full memory capacity due to the
CMA free page limitation and the fallback policy. This patchset
implements a new zone for CMA and uses it for GFP_HIGHUSER_MOVABLE
requests. This type of allocation covers page cache and anonymous pages,
which account for most memory usage in the normal case, so we can
utilize the full memory capacity. Below is the experiment result for
this problem.
8 CPUs, 1024 MB, VIRTUAL MACHINE
make -j16
<Before this series>
CMA reserve: 0 MB 512 MB
Elapsed-time: 92.4 186.5
pswpin: 82 18647
pswpout: 160 69839
<After this series>
CMA reserve: 0 MB 512 MB
Elapsed-time: 93.1 93.4
pswpin: 84 46
pswpout: 183 92
FYI, there is another attempt [3] on lkml trying to solve this problem,
and, as far as I know, Qualcomm also has an out-of-tree solution for it.
2) Reclaim problem
Currently, the reclaim path has no logic to distinguish CMA pages. If
reclaim is initiated for an unmovable or reclaimable allocation,
reclaiming CMA pages doesn't help satisfy the request; it is simply
wasted work. By managing CMA pages in the new zone, we can skip
reclaiming ZONE_CMA entirely when it is unnecessary.
3) Atomic allocation failure problem
Kswapd isn't woken up to reclaim pages when an allocation request is of
movable type and there are enough free pages in the CMA region. After a
bunch of consecutive movable allocation requests, free pages in the
ordinary (non-CMA) region can be exhausted without ever waking kswapd.
If an atomic unmovable allocation arrives at that point, it cannot
succeed since there are not enough pages in the ordinary region. This
problem was reported by Aneesh [4] and is solved by this patchset.
4) Inefficient compaction
The usual high-order allocation request is of unmovable type and cannot
be serviced from the CMA area. During compaction, the migration scanner
doesn't distinguish migratable pages in the CMA area and migrates them
anyway, so even if we build a high-order page in that region, it cannot
be used due to the type mismatch. This patchset solves the problem by
separating CMA pages from the ordinary zones.
Boot testing passed on x86_64, x86_32, arm and arm64. I also ran some
stress tests on x86_64 and x86_32 without problems. Feel free to try it
out, and please give me feedback. :)
Thanks.
[1] https://lkml.org/lkml/2014/5/28/64
[2] https://lkml.org/lkml/2014/11/4/55
[3] https://lkml.org/lkml/2014/10/15/623
[4] http://www.spinics.net/lists/linux-mm/msg100562.html
[5] https://lkml.kernel.org/r/20160425053653.GA25662@js1304-P5Q-DELUXE
[6] https://lkml.kernel.org/r/[email protected]
Joonsoo Kim (7):
mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request
mm/cma: introduce new zone, ZONE_CMA
mm/cma: populate ZONE_CMA
mm/cma: remove ALLOC_CMA
mm/cma: remove MIGRATE_CMA
mm/cma: remove per zone CMA stat
ARM: CMA: avoid re-mapping CMA region if CONFIG_HIGHMEM
arch/arm/mm/dma-mapping.c | 7 +-
arch/powerpc/mm/mmu_context_iommu.c | 2 +-
arch/x86/mm/highmem_32.c | 8 ++
fs/proc/meminfo.c | 2 +-
include/linux/cma.h | 7 ++
include/linux/gfp.h | 32 +++---
include/linux/memory_hotplug.h | 3 -
include/linux/mempolicy.h | 2 +-
include/linux/mm.h | 1 +
include/linux/mmzone.h | 60 +++++-----
include/linux/page-isolation.h | 5 +-
include/linux/vm_event_item.h | 10 +-
include/linux/vmstat.h | 8 --
include/trace/events/mmflags.h | 10 +-
kernel/power/snapshot.c | 8 ++
mm/cma.c | 78 +++++++++++--
mm/compaction.c | 12 +-
mm/hugetlb.c | 3 +-
mm/internal.h | 4 +-
mm/memory_hotplug.c | 7 +-
mm/page_alloc.c | 220 ++++++++++++++++++------------------
mm/page_isolation.c | 15 +--
mm/page_owner.c | 6 +-
mm/usercopy.c | 4 +-
mm/vmstat.c | 10 +-
25 files changed, 310 insertions(+), 214 deletions(-)
--
2.7.4
From: Joonsoo Kim <[email protected]>
Attached cover-letter:
This series tries to solve problems in the current CMA implementation.
CMA was introduced to provide physically contiguous pages at runtime
without an exclusively reserved memory area. However, the current
implementation behaves much like the old reserved-memory approach,
because free pages in the CMA region are used only when there is no
movable free page left. In other words, CMA free pages are used only as
a fallback. In that situation, kswapd is woken up easily since there is
no unmovable or reclaimable free page either. Once kswapd starts to
reclaim memory, fallback allocation to MIGRATE_CMA no longer occurs
because kswapd has already refilled the movable free pages, so most of
the CMA free pages are left unused. This behaves just like the
exclusively reserved memory case.
In my experiment, I found that on a system with 1024 MB of memory and
512 MB reserved for CMA, kswapd is mostly woken up when roughly 512 MB
of free memory is left. The detailed reason is that, to keep enough free
memory for unmovable and reclaimable allocations, kswapd uses the
equation below when calculating free memory, and that value easily drops
under the watermark.
Free memory for unmovable and reclaimable = Free total - Free CMA pages
This follows from the property that CMA free pages cannot be used for
unmovable and reclaimable allocations.
In this case, kswapd is woken up when (FreeTotal - FreeCMA) drops below
the low watermark and keeps reclaiming until (FreeTotal - FreeCMA) rises
above the high watermark. As a result, FreeTotal constantly hovers
around the 512 MB boundary, which means we can't utilize the full memory
capacity.
To fix this problem, I submitted some patches [1] about 10 months ago,
but found more problems that needed to be fixed first. That approach
requires many hooks in the allocator hotpath, so some developers didn't
like it. Instead, some of them suggested a different approach [2] to fix
all the CMA-related problems: introducing a new zone to deal with free
CMA pages. I agree that it is the best way to go, so it is implemented
here. Although the properties of ZONE_MOVABLE and ZONE_CMA are similar,
I decided to add a new zone rather than piggyback on ZONE_MOVABLE since
they have some differences. First, reserved CMA pages should not be
offlined. If CMA free pages were managed by ZONE_MOVABLE, we would need
to keep the MIGRATE_CMA migratetype and insert many hooks into the
memory hotplug code to distinguish hotpluggable memory from reserved CMA
memory in the same zone, making the already complicated memory hotplug
code even more complicated. Second, cma_alloc() can be called much more
frequently than memory hotplug operations, and we may need to control
the allocation rate of ZONE_CMA to optimize latency in the future; a
separate zone is easier to modify for that. Third, I'd like to see
separate statistics for CMA. Sometimes we need to debug why cma_alloc()
fails, and separate statistics are more helpful in that situation.
This patchset solves four problems related to the CMA implementation.
1) Utilization problem
As mentioned above, we can't utilize the full memory capacity due to the
CMA free page limitation and the fallback policy. This patchset
implements a new zone for CMA and uses it for GFP_HIGHUSER_MOVABLE
requests. This type of allocation covers page cache and anonymous pages,
which account for most memory usage in the normal case, so we can
utilize the full memory capacity. Below is the experiment result for
this problem.
8 CPUs, 1024 MB, VIRTUAL MACHINE
make -j16
<Before this series>
CMA reserve: 0 MB 512 MB
Elapsed-time: 92.4 186.5
pswpin: 82 18647
pswpout: 160 69839
<After this series>
CMA reserve: 0 MB 512 MB
Elapsed-time: 93.1 93.4
pswpin: 84 46
pswpout: 183 92
FYI, there is another attempt [3] on lkml trying to solve this problem,
and, as far as I know, Qualcomm also has an out-of-tree solution for it.
2) Reclaim problem
Currently, the reclaim path has no logic to distinguish CMA pages. If
reclaim is initiated for an unmovable or reclaimable allocation,
reclaiming CMA pages doesn't help satisfy the request; it is simply
wasted work. By managing CMA pages in the new zone, we can skip
reclaiming ZONE_CMA entirely when it is unnecessary.
3) Atomic allocation failure problem
Kswapd isn't woken up to reclaim pages when an allocation request is of
movable type and there are enough free pages in the CMA region. After a
bunch of consecutive movable allocation requests, free pages in the
ordinary (non-CMA) region can be exhausted without ever waking kswapd.
If an atomic unmovable allocation arrives at that point, it cannot
succeed since there are not enough pages in the ordinary region. This
problem was reported by Aneesh [4] and is solved by this patchset.
4) Inefficient compaction
The usual high-order allocation request is of unmovable type and cannot
be serviced from the CMA area. During compaction, the migration scanner
doesn't distinguish migratable pages in the CMA area and migrates them
anyway, so even if we build a high-order page in that region, it cannot
be used due to the type mismatch. This patchset solves the problem by
separating CMA pages from the ordinary zones.
[1] https://lkml.org/lkml/2014/5/28/64
[2] https://lkml.org/lkml/2014/11/4/55
[3] https://lkml.org/lkml/2014/10/15/623
[4] http://www.spinics.net/lists/linux-mm/msg100562.html
[5] https://lkml.org/lkml/2014/5/30/320
For this patch:
Currently, reserved pages for CMA are managed together with normal
pages. To distinguish them, we use a migratetype, MIGRATE_CMA, and apply
special handling for it. But it turns out that this approach has too
many problems, and fixing all of them would require many more hooks in
the page allocation and reclaim paths, so some developers have expressed
their discomfort and the CMA problems have gone unfixed for a long time.
To end this situation and fix the CMA problems, this patch implements
ZONE_CMA. Reserved pages for CMA will be managed in this new zone. This
approach removes all existing hooks for MIGRATE_CMA and solves many
problems in the CMA implementation.
This patch only adds the basic infrastructure for ZONE_CMA. In the
following patch, ZONE_CMA is actually populated and used.
Adding a new zone could cause two problems: overflow of the page flags
field and the GFP_ZONE_TABLE issue.
The following is the page-flags layout described in page-flags-layout.h:
1. No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
2. " plus space for last_cpupid: | NODE | ZONE | LAST_CPUPID ... | FLAGS |
3. classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
4. " plus space for last_cpupid: | SECTION | NODE | ZONE | LAST_CPUPID ... | FLAGS |
5. classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
There is no problem with configurations #1 and #2 on 64-bit systems;
there is enough room even for extremely large x86_64 systems. A 32-bit
system would not have many nodes, so it would have no problem either.
Systems with configurations #3, #4 and #5 could be affected by this zone
addition, but thanks to the recent THP rework, which freed up one page
flag, the problem surface is small. In some configurations a problem is
still possible, but it highly depends on the individual configuration,
so the impact cannot be easily estimated. I guess that a usual system
with CONFIG_CMA would not be affected. If there is a problem, we can
adjust the section width or node width for that architecture.
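For reference, page-flags-layout.h derives the zone field width from
MAX_NR_ZONES roughly as below (paraphrased, not the exact upstream
text); the point is that adding ZONE_CMA can push MAX_NR_ZONES past 4
and so cost one more bit in some configurations:

#if MAX_NR_ZONES < 2
#define ZONES_SHIFT 0
#elif MAX_NR_ZONES <= 2
#define ZONES_SHIFT 1
#elif MAX_NR_ZONES <= 4
#define ZONES_SHIFT 2	/* e.g. DMA, DMA32, Normal, Movable */
#elif MAX_NR_ZONES <= 8
#define ZONES_SHIFT 3	/* one more bit once ZONE_CMA and/or ZONE_DEVICE exist */
#endif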
Currently, GFP_ZONE_TABLE is a 32-bit value so that 32-bit systems can
use a 32-bit bit operation. If we add one more zone, the table grows to
48 bits and a 32-bit operation is no longer possible. Although this
causes slight overhead, there is no other way, so this patch relaxes
GFP_ZONE_TABLE's 32-bit limitation. A 32-bit system with CONFIG_CMA is
affected by this change, but the impact should be marginal.
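For context, gfp_zone() resolves the zone by a shift-and-mask over this
table, which is why the table needs 16 * GFP_ZONES_SHIFT bits (48 once
GFP_ZONES_SHIFT becomes 3). A simplified sketch of the lookup, omitting
the bad-combination check:

static inline enum zone_type gfp_zone(gfp_t flags)
{
	int bit = (__force int) (flags & GFP_ZONEMASK);

	/*
	 * Each of the 16 zone-modifier combinations encodes a zone in
	 * GFP_ZONES_SHIFT bits of GFP_ZONE_TABLE.
	 */
	return (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
	       ((1 << GFP_ZONES_SHIFT) - 1);
}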
Note that there are many checkpatch warnings, but I think the current
code is more readable than it would be after fixing them up.
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
arch/x86/mm/highmem_32.c | 8 ++++++
include/linux/gfp.h | 30 ++++++++++++---------
include/linux/mempolicy.h | 2 +-
include/linux/mmzone.h | 31 ++++++++++++++++++++-
include/linux/vm_event_item.h | 10 ++++++-
include/trace/events/mmflags.h | 10 ++++++-
kernel/power/snapshot.c | 8 ++++++
mm/page_alloc.c | 61 ++++++++++++++++++++++++++++++++++--------
mm/vmstat.c | 9 ++++++-
9 files changed, 141 insertions(+), 28 deletions(-)
diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c
index 6d18b70..52a14da 100644
--- a/arch/x86/mm/highmem_32.c
+++ b/arch/x86/mm/highmem_32.c
@@ -120,6 +120,14 @@ void __init set_highmem_pages_init(void)
if (!is_highmem(zone))
continue;
+ /*
+ * ZONE_CMA is a special zone that should not take
+ * part in initialization because its pages are
+ * initialized by the initialization of other zones.
+ */
+ if (is_zone_cma(zone))
+ continue;
+
zone_start_pfn = zone->zone_start_pfn;
zone_end_pfn = zone_start_pfn + zone->spanned_pages;
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 2b1a44f5..c2ed2eb 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -311,6 +311,12 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
#define OPT_ZONE_DMA32 ZONE_NORMAL
#endif
+#ifdef CONFIG_CMA
+#define OPT_ZONE_CMA ZONE_CMA
+#else
+#define OPT_ZONE_CMA ZONE_MOVABLE
+#endif
+
/*
* GFP_ZONE_TABLE is a word size bitstring that is used for looking up the
* zone to use given the lowest 4 bits of gfp_t. Entries are GFP_ZONES_SHIFT
@@ -340,8 +346,6 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
* 0xd => BAD (MOVABLE+DMA32+DMA)
* 0xe => BAD (MOVABLE+DMA32+HIGHMEM)
* 0xf => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
- *
- * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms.
*/
#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4
@@ -351,19 +355,21 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
#define GFP_ZONES_SHIFT ZONES_SHIFT
#endif
-#if 16 * GFP_ZONES_SHIFT > BITS_PER_LONG
-#error GFP_ZONES_SHIFT too large to create GFP_ZONE_TABLE integer
+#if !defined(CONFIG_64BITS) && GFP_ZONES_SHIFT > 2
+typedef unsigned long long GFP_ZONE_TABLE_TYPE;
+#else
+typedef unsigned long GFP_ZONE_TABLE_TYPE;
#endif
#define GFP_ZONE_TABLE ( \
- (ZONE_NORMAL << 0 * GFP_ZONES_SHIFT) \
- | (OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT) \
- | (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT) \
- | (OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT) \
- | (ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT) \
- | (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT) \
- | (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT)\
- | (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT)\
+ ((GFP_ZONE_TABLE_TYPE) ZONE_NORMAL << 0 * GFP_ZONES_SHIFT) \
+ | ((GFP_ZONE_TABLE_TYPE) OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT) \
+ | ((GFP_ZONE_TABLE_TYPE) OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT) \
+ | ((GFP_ZONE_TABLE_TYPE) OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT) \
+ | ((GFP_ZONE_TABLE_TYPE) ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT) \
+ | ((GFP_ZONE_TABLE_TYPE) OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT) \
+ | ((GFP_ZONE_TABLE_TYPE) OPT_ZONE_CMA << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT) \
+ | ((GFP_ZONE_TABLE_TYPE) OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT) \
)
/*
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5f4d828..a43adb5 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -158,7 +158,7 @@ extern enum zone_type policy_zone;
static inline void check_highest_zone(enum zone_type k)
{
- if (k > policy_zone && k != ZONE_MOVABLE)
+ if (k > policy_zone && k != ZONE_MOVABLE && !is_zone_cma_idx(k))
policy_zone = k;
}
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 96194bf..74eda07 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -338,6 +338,9 @@ enum zone_type {
ZONE_HIGHMEM,
#endif
ZONE_MOVABLE,
+#ifdef CONFIG_CMA
+ ZONE_CMA,
+#endif
#ifdef CONFIG_ZONE_DEVICE
ZONE_DEVICE,
#endif
@@ -838,11 +841,37 @@ static inline int zone_movable_is_highmem(void)
}
#endif
+static inline int is_zone_cma_idx(enum zone_type idx)
+{
+#ifdef CONFIG_CMA
+ return idx == ZONE_CMA;
+#else
+ return 0;
+#endif
+}
+
+static inline int is_zone_cma(struct zone *zone)
+{
+ int zone_idx = zone_idx(zone);
+
+ return is_zone_cma_idx(zone_idx);
+}
+
+static inline int zone_cma_is_highmem(void)
+{
+#ifdef CONFIG_HIGHMEM
+ return 1;
+#else
+ return 0;
+#endif
+}
+
static inline int is_highmem_idx(enum zone_type idx)
{
#ifdef CONFIG_HIGHMEM
return (idx == ZONE_HIGHMEM ||
- (idx == ZONE_MOVABLE && zone_movable_is_highmem()));
+ (idx == ZONE_MOVABLE && zone_movable_is_highmem()) ||
+ (is_zone_cma_idx(idx) && zone_cma_is_highmem()));
#else
return 0;
#endif
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index d84ae90..228f3df 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -19,7 +19,15 @@
#define HIGHMEM_ZONE(xx)
#endif
-#define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, HIGHMEM_ZONE(xx) xx##_MOVABLE
+#ifdef CONFIG_CMA
+#define MOVABLE_ZONE(xx) xx##_MOVABLE,
+#define CMA_ZONE(xx) xx##_CMA
+#else
+#define MOVABLE_ZONE(xx) xx##_MOVABLE
+#define CMA_ZONE(xx)
+#endif
+
+#define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, HIGHMEM_ZONE(xx) MOVABLE_ZONE(xx) CMA_ZONE(xx)
enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 304ff94..7c32ba6 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -232,12 +232,20 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" ) \
#define IFDEF_ZONE_HIGHMEM(X)
#endif
+#ifdef CONFIG_CMA
+#define IFDEF_ZONE_CMA(X, Y, Z) X Z
+#else
+#define IFDEF_ZONE_CMA(X, Y, Z) Y
+#endif
+
#define ZONE_TYPE \
IFDEF_ZONE_DMA( EM (ZONE_DMA, "DMA")) \
IFDEF_ZONE_DMA32( EM (ZONE_DMA32, "DMA32")) \
EM (ZONE_NORMAL, "Normal") \
IFDEF_ZONE_HIGHMEM( EM (ZONE_HIGHMEM,"HighMem")) \
- EMe(ZONE_MOVABLE,"Movable")
+ IFDEF_ZONE_CMA( EM (ZONE_MOVABLE,"Movable"), \
+ EMe(ZONE_MOVABLE,"Movable"), \
+ EMe(ZONE_CMA, "CMA"))
#define LRU_NAMES \
EM (LRU_INACTIVE_ANON, "inactive_anon") \
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 3b1e0f3..6971c23 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1170,6 +1170,14 @@ unsigned int snapshot_additional_pages(struct zone *zone)
{
unsigned int rtree, nodes;
+ /*
+ * Estimation of needed pages for ZONE_CMA is already considered
+ * when calculating other zones since span of ZONE_CMA is subset
+ * of other zones.
+ */
+ if (is_zone_cma(zone))
+ return 0;
+
rtree = nodes = DIV_ROUND_UP(zone->spanned_pages, BM_BITS_PER_BLOCK);
rtree += DIV_ROUND_UP(rtree * sizeof(struct rtree_node),
LINKED_PAGE_DATA_SIZE);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 60ffa4e..26d86c3b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -215,6 +215,9 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES] = {
[ZONE_HIGHMEM] = INT_MAX,
#endif
[ZONE_MOVABLE] = INT_MAX,
+#ifdef CONFIG_CMA
+ [ZONE_CMA] = INT_MAX,
+#endif
};
EXPORT_SYMBOL(totalram_pages);
@@ -231,6 +234,9 @@ static char * const zone_names[MAX_NR_ZONES] = {
"HighMem",
#endif
"Movable",
+#ifdef CONFIG_CMA
+ "CMA",
+#endif
#ifdef CONFIG_ZONE_DEVICE
"Device",
#endif
@@ -5253,6 +5259,15 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
struct memblock_region *r = NULL, *tmp;
#endif
+ /*
+ * Physical pages for ZONE_CMA belong to other zones now. They
+ * are initialized when the corresponding zone is initialized and they
+ * will be moved to ZONE_CMA later. Zone information will also be
+ * adjusted later.
+ */
+ if (is_zone_cma_idx(zone))
+ return;
+
if (highest_memmap_pfn < end_pfn - 1)
highest_memmap_pfn = end_pfn - 1;
@@ -5649,7 +5664,7 @@ static void __init find_usable_zone_for_movable(void)
{
int zone_index;
for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) {
- if (zone_index == ZONE_MOVABLE)
+ if (zone_index == ZONE_MOVABLE || is_zone_cma_idx(zone_index))
continue;
if (arch_zone_highest_possible_pfn[zone_index] >
@@ -5864,6 +5879,8 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
unsigned long *zholes_size)
{
unsigned long realtotalpages = 0, totalpages = 0;
+ unsigned long zone_cma_start_pfn = UINT_MAX;
+ unsigned long zone_cma_end_pfn = 0;
enum zone_type i;
for (i = 0; i < MAX_NR_ZONES; i++) {
@@ -5871,6 +5888,13 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
unsigned long zone_start_pfn, zone_end_pfn;
unsigned long size, real_size;
+ if (is_zone_cma_idx(i)) {
+ zone->zone_start_pfn = zone_cma_start_pfn;
+ size = zone_cma_end_pfn - zone_cma_start_pfn;
+ real_size = 0;
+ goto init_zone;
+ }
+
size = zone_spanned_pages_in_node(pgdat->node_id, i,
node_start_pfn,
node_end_pfn,
@@ -5880,13 +5904,23 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
real_size = size - zone_absent_pages_in_node(pgdat->node_id, i,
node_start_pfn, node_end_pfn,
zholes_size);
- if (size)
+ if (size) {
zone->zone_start_pfn = zone_start_pfn;
- else
+ if (zone_cma_start_pfn > zone_start_pfn)
+ zone_cma_start_pfn = zone_start_pfn;
+ if (zone_cma_end_pfn < zone_start_pfn + size)
+ zone_cma_end_pfn = zone_start_pfn + size;
+ } else
zone->zone_start_pfn = 0;
+
+init_zone:
zone->spanned_pages = size;
zone->present_pages = real_size;
+ /* Prevent over-counting the node span */
+ if (is_zone_cma_idx(i))
+ size = 0;
+
totalpages += size;
realtotalpages += real_size;
}
@@ -6030,6 +6064,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
struct zone *zone = pgdat->node_zones + j;
unsigned long size, realsize, freesize, memmap_pages;
unsigned long zone_start_pfn = zone->zone_start_pfn;
+ bool zone_kernel = !is_highmem_idx(j) && !is_zone_cma_idx(j);
size = zone->spanned_pages;
realsize = freesize = zone->present_pages;
@@ -6040,7 +6075,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
* and per-cpu initialisations
*/
memmap_pages = calc_memmap_size(size, realsize);
- if (!is_highmem_idx(j)) {
+ if (zone_kernel) {
if (freesize >= memmap_pages) {
freesize -= memmap_pages;
if (memmap_pages)
@@ -6059,7 +6094,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
zone_names[0], dma_reserve);
}
- if (!is_highmem_idx(j))
+ if (zone_kernel)
nr_kernel_pages += freesize;
/* Charge for highmem memmap if there are enough kernel pages */
else if (nr_kernel_pages > memmap_pages * 2)
@@ -6071,7 +6106,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
* when the bootmem allocator frees pages into the buddy system.
* And all highmem pages will be managed by the buddy system.
*/
- zone->managed_pages = is_highmem_idx(j) ? realsize : freesize;
+ zone->managed_pages = zone_kernel ? freesize : realsize;
#ifdef CONFIG_NUMA
zone->node = nid;
#endif
@@ -6081,7 +6116,11 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
zone_seqlock_init(zone);
zone_pcp_init(zone);
- if (!size)
+ /*
+ * ZONE_CMA should be initialized even if it has no present
+ * page now since pages will be moved to the zone later.
+ */
+ if (!size && !is_zone_cma_idx(j))
continue;
set_pageblock_order();
@@ -6537,7 +6576,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
start_pfn = find_min_pfn_with_active_regions();
for (i = 0; i < MAX_NR_ZONES; i++) {
- if (i == ZONE_MOVABLE)
+ if (i == ZONE_MOVABLE || is_zone_cma_idx(i))
continue;
end_pfn = max(max_zone_pfn[i], start_pfn);
@@ -6554,7 +6593,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
/* Print out the zone ranges */
pr_info("Zone ranges:\n");
for (i = 0; i < MAX_NR_ZONES; i++) {
- if (i == ZONE_MOVABLE)
+ if (i == ZONE_MOVABLE || is_zone_cma_idx(i))
continue;
pr_info(" %-8s ", zone_names[i]);
if (arch_zone_lowest_possible_pfn[i] ==
@@ -7318,9 +7357,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
/*
* For avoiding noise data, lru_add_drain_all() should be called
- * If ZONE_MOVABLE, the zone never contains unmovable pages
+ * If ZONE_(MOVABLE|CMA), the zone never contains unmovable pages
*/
- if (zone_idx(zone) == ZONE_MOVABLE)
+ if (zone_idx(zone) == ZONE_MOVABLE || is_zone_cma(zone))
return false;
mt = get_pageblock_migratetype(page);
if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 757be83..3c3aac2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -915,8 +915,15 @@ int fragmentation_index(struct zone *zone, unsigned int order)
#define TEXT_FOR_HIGHMEM(xx)
#endif
+#ifdef CONFIG_CMA
+#define TEXT_FOR_CMA(xx) xx "_cma",
+#else
+#define TEXT_FOR_CMA(xx)
+#endif
+
#define TEXTS_FOR_ZONES(xx) TEXT_FOR_DMA(xx) TEXT_FOR_DMA32(xx) xx "_normal", \
- TEXT_FOR_HIGHMEM(xx) xx "_movable",
+ TEXT_FOR_HIGHMEM(xx) xx "_movable", \
+ TEXT_FOR_CMA(xx)
const char * const vmstat_text[] = {
/* enum zone_stat_item countes */
--
2.7.4
From: Joonsoo Kim <[email protected]>
Until now, reserved pages for CMA have been managed in the ordinary
zones that their pfns belong to. This approach has numerous problems and
fixing them isn't easy (as mentioned in the previous patch). To fix this
situation, ZONE_CMA was introduced in the previous patch but is not yet
populated. This patch implements the population of ZONE_CMA by stealing
reserved pages from the ordinary zones.
Unlike the previous implementation, where any kernel allocation request
with __GFP_MOVABLE could be serviced from the CMA region, in the new
approach only requests with GFP_HIGHUSER_MOVABLE can be serviced from
it. This is an unavoidable design decision when using the zone
implementation, because ZONE_CMA could contain highmem. Due to this
decision, ZONE_CMA works much like ZONE_HIGHMEM or ZONE_MOVABLE. I don't
think this is a problem because most file cache pages and anonymous
pages are requested with GFP_HIGHUSER_MOVABLE; the fact that many
systems with ZONE_HIGHMEM work fine supports this. A notable
disadvantage is that we cannot use these pages for the blockdev file
cache, because those allocations usually carry __GFP_MOVABLE but not
__GFP_HIGHMEM and __GFP_USER. But there are pros and cons here: in my
experience, blockdev file cache pages are one of the top reasons that
cma_alloc() fails temporarily, so by excluding that case we get a better
guarantee of cma_alloc() success.
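To make the consequence concrete, here is how a few common gfp masks
resolve with the GFP_ZONE_TABLE entries from the previous patch
(illustrative comment only, assuming CONFIG_CMA so that OPT_ZONE_CMA is
ZONE_CMA):

/*
 * gfp_zone(GFP_HIGHUSER_MOVABLE)        -> ZONE_CMA
 *	(__GFP_HIGHMEM and __GFP_MOVABLE both set)
 * gfp_zone(GFP_KERNEL | __GFP_MOVABLE)  -> ZONE_NORMAL
 *	(movable but not highmem; this is why the blockdev page cache
 *	 described above stays out of ZONE_CMA)
 * gfp_zone(GFP_KERNEL)                  -> ZONE_NORMAL
 */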
The implementation itself is easy to understand: steal the pages when a
CMA area is initialized and recalculate the various per-zone
stats/thresholds.
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/memory_hotplug.h | 3 ---
include/linux/mm.h | 1 +
mm/cma.c | 60 ++++++++++++++++++++++++++++++++++++++----
mm/internal.h | 3 +++
mm/page_alloc.c | 29 +++++++++++++++++---
5 files changed, 84 insertions(+), 12 deletions(-)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index e60f203..d730ce9 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -229,9 +229,6 @@ void put_online_mems(void);
void mem_hotplug_begin(void);
void mem_hotplug_done(void);
-extern void set_zone_contiguous(struct zone *zone);
-extern void clear_zone_contiguous(struct zone *zone);
-
#else /* ! CONFIG_MEMORY_HOTPLUG */
/*
* Stub functions for when hotplug is off
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 022423c..1390abe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2031,6 +2031,7 @@ extern void setup_per_cpu_pageset(void);
extern void zone_pcp_update(struct zone *zone);
extern void zone_pcp_reset(struct zone *zone);
+extern void setup_zone_pageset(struct zone *zone);
/* page_alloc.c */
extern int min_free_kbytes;
diff --git a/mm/cma.c b/mm/cma.c
index a6033e3..6d8bd300 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -38,6 +38,7 @@
#include <trace/events/cma.h>
#include "cma.h"
+#include "internal.h"
struct cma cma_areas[MAX_CMA_AREAS];
unsigned cma_area_count;
@@ -116,10 +117,9 @@ static int __init cma_activate_area(struct cma *cma)
for (j = pageblock_nr_pages; j; --j, pfn++) {
WARN_ON_ONCE(!pfn_valid(pfn));
/*
- * alloc_contig_range requires the pfn range
- * specified to be in the same zone. Make this
- * simple by forcing the entire CMA resv range
- * to be in the same zone.
+ * In init_cma_reserved_pageblock(), present_pages is
+ * adjusted with assumption that all pages come from
+ * a single zone. It could be fixed but not yet done.
*/
if (page_zone(pfn_to_page(pfn)) != zone)
goto err;
@@ -145,6 +145,34 @@ static int __init cma_activate_area(struct cma *cma)
static int __init cma_init_reserved_areas(void)
{
int i;
+ struct zone *zone;
+ pg_data_t *pgdat;
+
+ if (!cma_area_count)
+ return 0;
+
+ for_each_online_pgdat(pgdat) {
+ unsigned long start_pfn = UINT_MAX, end_pfn = 0;
+
+ for (i = 0; i < cma_area_count; i++) {
+ if (pfn_to_nid(cma_areas[i].base_pfn) !=
+ pgdat->node_id)
+ continue;
+
+ start_pfn = min(start_pfn, cma_areas[i].base_pfn);
+ end_pfn = max(end_pfn, cma_areas[i].base_pfn +
+ cma_areas[i].count);
+ }
+
+ if (!end_pfn)
+ continue;
+
+ zone = &pgdat->node_zones[ZONE_CMA];
+
+ /* ZONE_CMA doesn't need to exceed CMA region */
+ zone->zone_start_pfn = start_pfn;
+ zone->spanned_pages = end_pfn - start_pfn;
+ }
for (i = 0; i < cma_area_count; i++) {
int ret = cma_activate_area(&cma_areas[i]);
@@ -153,9 +181,31 @@ static int __init cma_init_reserved_areas(void)
return ret;
}
+ /*
+ * Reserved pages for ZONE_CMA are now activated and this would change
+ * ZONE_CMA's managed page counter and other zone's present counter.
+ * We need to re-calculate various zone information that depends on
+ * this initialization.
+ */
+ build_all_zonelists(NULL, NULL);
+ for_each_populated_zone(zone) {
+ if (is_zone_cma(zone))
+ setup_zone_pageset(zone);
+ else
+ zone_pcp_update(zone);
+
+ set_zone_contiguous(zone);
+ }
+
+ /*
+ * We need to re-init the per-zone wmark by calling
+ * init_per_zone_wmark_min(), but we don't call it here because it is
+ * registered at core_initcall and will be called later than us.
+ */
+
return 0;
}
-core_initcall(cma_init_reserved_areas);
+pure_initcall(cma_init_reserved_areas);
/**
* cma_init_reserved_mem() - create custom contiguous area from reserved memory
diff --git a/mm/internal.h b/mm/internal.h
index 0e4f558..ecc69a4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -168,6 +168,9 @@ extern void post_alloc_hook(struct page *page, unsigned int order,
gfp_t gfp_flags);
extern int user_min_free_kbytes;
+extern void set_zone_contiguous(struct zone *zone);
+extern void clear_zone_contiguous(struct zone *zone);
+
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 26d86c3b..760f518 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1583,16 +1583,38 @@ void __init page_alloc_init_late(void)
}
#ifdef CONFIG_CMA
+static void __init adjust_present_page_count(struct page *page, long count)
+{
+ struct zone *zone = page_zone(page);
+
+ /* We don't need to hold a lock since it is boot-up process */
+ zone->present_pages += count;
+}
+
/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
void __init init_cma_reserved_pageblock(struct page *page)
{
unsigned i = pageblock_nr_pages;
+ unsigned long pfn = page_to_pfn(page);
struct page *p = page;
+ int nid = page_to_nid(page);
+
+ /*
+ * ZONE_CMA will steal present pages from other zones by changing
+ * page links so page_zone() is changed. Before that,
+ * we need to adjust previous zone's page count first.
+ */
+ adjust_present_page_count(page, -pageblock_nr_pages);
do {
__ClearPageReserved(p);
set_page_count(p, 0);
- } while (++p, --i);
+
+ /* Steal pages from other zones */
+ set_page_links(p, ZONE_CMA, nid, pfn);
+ } while (++p, ++pfn, --i);
+
+ adjust_present_page_count(page, pageblock_nr_pages);
set_pageblock_migratetype(page, MIGRATE_CMA);
@@ -5124,7 +5146,6 @@ static void build_zonelists(pg_data_t *pgdat)
*/
static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
-static void setup_zone_pageset(struct zone *zone);
/*
* Global mutex to protect against size modification of zonelists
@@ -5497,7 +5518,7 @@ static void __meminit zone_pageset_init(struct zone *zone, int cpu)
pageset_set_high_and_batch(zone, pcp);
}
-static void __meminit setup_zone_pageset(struct zone *zone)
+void __meminit setup_zone_pageset(struct zone *zone)
{
int cpu;
zone->pageset = alloc_percpu(struct per_cpu_pageset);
@@ -7669,7 +7690,7 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages)
}
#endif
-#ifdef CONFIG_MEMORY_HOTPLUG
+#if defined CONFIG_MEMORY_HOTPLUG || defined CONFIG_CMA
/*
* The zone indicated has a new number of managed_pages; batch sizes and percpu
* page high values need to be recalulated.
--
2.7.4
From: Joonsoo Kim <[email protected]>
Now all reserved pages for the CMA region belong to ZONE_CMA, which
only serves GFP_HIGHUSER_MOVABLE requests. Therefore, we don't need to
consider ALLOC_CMA at all.
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/compaction.c | 4 +---
mm/internal.h | 1 -
mm/page_alloc.c | 28 +++-------------------------
3 files changed, 4 insertions(+), 29 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 613c59e..80b1424 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1427,14 +1427,12 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
* if compaction succeeds.
* For costly orders, we require low watermark instead of min for
* compaction to proceed to increase its chances.
- * ALLOC_CMA is used, as pages in CMA pageblocks are considered
- * suitable migration targets
*/
watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
low_wmark_pages(zone) : min_wmark_pages(zone);
watermark += compact_gap(order);
if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
- ALLOC_CMA, wmark_target))
+ 0, wmark_target))
return COMPACT_SKIPPED;
return COMPACT_CONTINUE;
diff --git a/mm/internal.h b/mm/internal.h
index ecc69a4..08b19b7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -486,7 +486,6 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
-#define ALLOC_CMA 0x80 /* allow allocations from CMA areas */
enum ttu_flags;
struct tlbflush_unmap_batch;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 760f518..18f16bf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2664,7 +2664,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
* exists.
*/
watermark = min_wmark_pages(zone) + (1UL << order);
- if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+ if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
return 0;
__mod_zone_freepage_state(zone, -(1UL << order), mt);
@@ -2931,12 +2931,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
else
min -= min / 4;
-#ifdef CONFIG_CMA
- /* If allocation can't use CMA areas don't use free CMA pages */
- if (!(alloc_flags & ALLOC_CMA))
- free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
-
/*
* Check watermarks for an order-0 allocation request. If these
* are not met, then a high-order request also cannot go ahead
@@ -2966,10 +2960,8 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
}
#ifdef CONFIG_CMA
- if ((alloc_flags & ALLOC_CMA) &&
- !list_empty(&area->free_list[MIGRATE_CMA])) {
+ if (!list_empty(&area->free_list[MIGRATE_CMA]))
return true;
- }
#endif
}
return false;
@@ -2986,13 +2978,6 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
unsigned long mark, int classzone_idx, unsigned int alloc_flags)
{
long free_pages = zone_page_state(z, NR_FREE_PAGES);
- long cma_pages = 0;
-
-#ifdef CONFIG_CMA
- /* If allocation can't use CMA areas don't use free CMA pages */
- if (!(alloc_flags & ALLOC_CMA))
- cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
/*
* Fast check for order-0 only. If this fails then the reserves
@@ -3001,7 +2986,7 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
* the caller is !atomic then it'll uselessly search the free
* list. That corner case is then slower but it is harmless.
*/
- if (!order && (free_pages - cma_pages) > mark + z->lowmem_reserve[classzone_idx])
+ if (!order && free_pages > mark + z->lowmem_reserve[classzone_idx])
return true;
return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
@@ -3572,10 +3557,6 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
} else if (unlikely(rt_task(current)) && !in_interrupt())
alloc_flags |= ALLOC_HARDER;
-#ifdef CONFIG_CMA
- if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
- alloc_flags |= ALLOC_CMA;
-#endif
return alloc_flags;
}
@@ -3997,9 +3978,6 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
if (should_fail_alloc_page(gfp_mask, order))
return false;
- if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE)
- *alloc_flags |= ALLOC_CMA;
-
return true;
}
--
2.7.4
From: Joonsoo Kim <[email protected]>
The CMA region is now managed by a separate zone, ZONE_CMA, to fix many
MM-related problems. In this implementation, it is possible that
ZONE_CMA contains two CMA regions, one in lowmem and one in highmem. To
handle this case properly, ZONE_CMA is considered as highmem.
In dma_contiguous_remap(), the mapping for a CMA region in lowmem is
cleared and remapped for DMA, but in the new CMA implementation this
remap isn't needed since the region is considered as highmem. Moreover,
the remap should not be allowed since it would cause cache problems. So,
this patch disables it.
Signed-off-by: Joonsoo Kim <[email protected]>
---
arch/arm/mm/dma-mapping.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 475811f..377053a 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -506,7 +506,12 @@ void __init dma_contiguous_remap(void)
flush_tlb_kernel_range(__phys_to_virt(start),
__phys_to_virt(end));
- iotable_init(&map, 1);
+ /*
+ * For a highmem system, all the memory in the CMA region will be
+ * considered as highmem, therefore re-mapping isn't required.
+ */
+ if (!IS_ENABLED(CONFIG_HIGHMEM))
+ iotable_init(&map, 1);
}
}
--
2.7.4
From: Joonsoo Kim <[email protected]>
Now all reserved pages for the CMA region belong to ZONE_CMA, and there
is no other type of page there. Therefore, we don't need MIGRATE_CMA to
distinguish CMA pages from ordinary pages and handle them differently.
Remove MIGRATE_CMA.
Unfortunately, this patch makes the free CMA counter incorrect, because
we count it only while pages are in MIGRATE_CMA. That will be fixed by
the next patch. I could squash the next patch into this one, but that
would make the changes complicated and hard to review, so I keep it
separate.
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
arch/powerpc/mm/mmu_context_iommu.c | 2 +-
include/linux/gfp.h | 2 +-
include/linux/mmzone.h | 26 +----------
include/linux/page-isolation.h | 5 +--
include/linux/vmstat.h | 8 ----
mm/cma.c | 3 +-
mm/compaction.c | 8 +---
mm/hugetlb.c | 3 +-
mm/memory_hotplug.c | 7 ++-
mm/page_alloc.c | 86 ++++++++++---------------------------
mm/page_isolation.c | 15 +++----
mm/page_owner.c | 6 +--
mm/usercopy.c | 4 +-
13 files changed, 43 insertions(+), 132 deletions(-)
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index fc67bd7..330c495 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -184,7 +184,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
* of the CMA zone if possible. NOTE: faulting in + migration
* can be expensive. Batching can be considered later
*/
- if (is_migrate_cma_page(page)) {
+ if (is_zone_cma(page_zone(page))) {
if (mm_iommu_move_page_from_cma(page))
goto populate;
if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index c2ed2eb..15987cc 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -563,7 +563,7 @@ static inline bool pm_suspended_storage(void)
#if (defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || defined(CONFIG_CMA)
/* The below functions must be run on a range from a single zone. */
extern int alloc_contig_range(unsigned long start, unsigned long end,
- unsigned migratetype, gfp_t gfp_mask);
+ gfp_t gfp_mask);
extern void free_contig_range(unsigned long pfn, unsigned nr_pages);
#endif
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 74eda07..efb69b1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -41,22 +41,6 @@ enum migratetype {
MIGRATE_RECLAIMABLE,
MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
-#ifdef CONFIG_CMA
- /*
- * MIGRATE_CMA migration type is designed to mimic the way
- * ZONE_MOVABLE works. Only movable pages can be allocated
- * from MIGRATE_CMA pageblocks and page allocator never
- * implicitly change migration type of MIGRATE_CMA pageblock.
- *
- * The way to use it is to change migratetype of a range of
- * pageblocks to MIGRATE_CMA which can be done by
- * __free_pageblock_cma() function. What is important though
- * is that a range of pageblocks must be aligned to
- * MAX_ORDER_NR_PAGES should biggest page be bigger then
- * a single pageblock.
- */
- MIGRATE_CMA,
-#endif
#ifdef CONFIG_MEMORY_ISOLATION
MIGRATE_ISOLATE, /* can't allocate from here */
#endif
@@ -66,17 +50,9 @@ enum migratetype {
/* In mm/page_alloc.c; keep in sync also with show_migration_types() there */
extern char * const migratetype_names[MIGRATE_TYPES];
-#ifdef CONFIG_CMA
-# define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
-# define is_migrate_cma_page(_page) (get_pageblock_migratetype(_page) == MIGRATE_CMA)
-#else
-# define is_migrate_cma(migratetype) false
-# define is_migrate_cma_page(_page) false
-#endif
-
static inline bool is_migrate_movable(int mt)
{
- return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
+ return mt == MIGRATE_MOVABLE;
}
#define for_each_migratetype_order(order, type) \
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index d4cd201..67735f2 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -46,15 +46,14 @@ int move_freepages_block(struct zone *zone, struct page *page,
*/
int
start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
- unsigned migratetype, bool skip_hwpoisoned_pages);
+ bool skip_hwpoisoned_pages);
/*
* Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
* target range is [start_pfn, end_pfn)
*/
int
-undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
- unsigned migratetype);
+undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn);
/*
* Test all pages in [start_pfn, end_pfn) are isolated or not.
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 6137719..ac6db88 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -341,14 +341,6 @@ static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }
#endif /* CONFIG_SMP */
-static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
- int migratetype)
-{
- __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
- if (is_migrate_cma(migratetype))
- __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
-}
-
extern const char * const vmstat_text[];
#endif /* _LINUX_VMSTAT_H */
diff --git a/mm/cma.c b/mm/cma.c
index 6d8bd300..91dd85a 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -479,8 +479,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
mutex_lock(&cma_mutex);
- ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
- gfp_mask);
+ ret = alloc_contig_range(pfn, pfn + count, gfp_mask);
mutex_unlock(&cma_mutex);
if (ret == 0) {
page = pfn_to_page(pfn);
diff --git a/mm/compaction.c b/mm/compaction.c
index 80b1424..f6ae10f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1017,7 +1017,7 @@ static bool suitable_migration_target(struct compact_control *cc,
if (cc->ignore_block_suitable)
return true;
- /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
+ /* If the block is MIGRATE_MOVABLE, allow migration */
if (is_migrate_movable(get_pageblock_migratetype(page)))
return true;
@@ -1338,12 +1338,6 @@ static enum compact_result __compact_finished(struct zone *zone,
if (!list_empty(&area->free_list[migratetype]))
return COMPACT_SUCCESS;
-#ifdef CONFIG_CMA
- /* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
- if (migratetype == MIGRATE_MOVABLE &&
- !list_empty(&area->free_list[MIGRATE_CMA]))
- return COMPACT_SUCCESS;
-#endif
/*
* Job done if allocation would steal freepages from
* other migratetype buddy lists.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e582887..d26c837 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1053,8 +1053,7 @@ static int __alloc_gigantic_page(unsigned long start_pfn,
unsigned long nr_pages)
{
unsigned long end_pfn = start_pfn + nr_pages;
- return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
- GFP_KERNEL);
+ return alloc_contig_range(start_pfn, end_pfn, GFP_KERNEL);
}
static bool pfn_range_valid_gigantic(struct zone *z,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 76d4745..c48c36f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1897,8 +1897,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
return -EINVAL;
/* set above range as isolated */
- ret = start_isolate_page_range(start_pfn, end_pfn,
- MIGRATE_MOVABLE, true);
+ ret = start_isolate_page_range(start_pfn, end_pfn, true);
if (ret)
return ret;
@@ -1968,7 +1967,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
We cannot do rollback at this point. */
offline_isolated_pages(start_pfn, end_pfn);
/* reset pagetype flags and makes migrate type to be MOVABLE */
- undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
+ undo_isolate_page_range(start_pfn, end_pfn);
/* removal success */
adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
zone->present_pages -= offlined_pages;
@@ -2005,7 +2004,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
((unsigned long long) end_pfn << PAGE_SHIFT) - 1);
memory_notify(MEM_CANCEL_OFFLINE, &arg);
/* pushback to free area */
- undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
+ undo_isolate_page_range(start_pfn, end_pfn);
return ret;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 18f16bf..33a1b69 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -136,8 +136,8 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
* put on a pcplist. Used to avoid the pageblock migratetype lookup when
* freeing from pcplists in most cases, at the cost of possibly becoming stale.
* Also the migratetype set in the page does not necessarily match the pcplist
- * index, e.g. page might have MIGRATE_CMA set but be on a pcplist with any
- * other index - this ensures that it will be put on the correct CMA freelist.
+ * index, e.g. page might have MIGRATE_MOVABLE set but be on a pcplist with any
+ * other index - this ensures that it will be put on the correct freelist.
*/
static inline int get_pcppage_migratetype(struct page *page)
{
@@ -247,9 +247,6 @@ char * const migratetype_names[MIGRATE_TYPES] = {
"Movable",
"Reclaimable",
"HighAtomic",
-#ifdef CONFIG_CMA
- "CMA",
-#endif
#ifdef CONFIG_MEMORY_ISOLATION
"Isolate",
#endif
@@ -681,7 +678,7 @@ static inline bool set_page_guard(struct zone *zone, struct page *page,
INIT_LIST_HEAD(&page->lru);
set_page_private(page, order);
/* Guard pages are not available for any usage */
- __mod_zone_freepage_state(zone, -(1 << order), migratetype);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));
return true;
}
@@ -702,7 +699,7 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
set_page_private(page, 0);
if (!is_migrate_isolate(migratetype))
- __mod_zone_freepage_state(zone, (1 << order), migratetype);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, (1 << order));
}
#else
struct page_ext_operations debug_guardpage_ops;
@@ -809,7 +806,7 @@ static inline void __free_one_page(struct page *page,
VM_BUG_ON(migratetype == -1);
if (likely(!is_migrate_isolate(migratetype)))
- __mod_zone_freepage_state(zone, 1 << order, migratetype);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
VM_BUG_ON_PAGE(bad_range(zone, page), page);
@@ -1591,7 +1588,7 @@ static void __init adjust_present_page_count(struct page *page, long count)
zone->present_pages += count;
}
-/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
+/* Free whole pageblock and set its migration type to MIGRATE_MOVABLE. */
void __init init_cma_reserved_pageblock(struct page *page)
{
unsigned i = pageblock_nr_pages;
@@ -1616,7 +1613,7 @@ void __init init_cma_reserved_pageblock(struct page *page)
adjust_present_page_count(page, pageblock_nr_pages);
- set_pageblock_migratetype(page, MIGRATE_CMA);
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
if (pageblock_order >= MAX_ORDER) {
i = pageblock_nr_pages;
@@ -1836,25 +1833,11 @@ static int fallbacks[MIGRATE_TYPES][4] = {
[MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
[MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
-#ifdef CONFIG_CMA
- [MIGRATE_CMA] = { MIGRATE_TYPES }, /* Never used */
-#endif
#ifdef CONFIG_MEMORY_ISOLATION
[MIGRATE_ISOLATE] = { MIGRATE_TYPES }, /* Never used */
#endif
};
-#ifdef CONFIG_CMA
-static struct page *__rmqueue_cma_fallback(struct zone *zone,
- unsigned int order)
-{
- return __rmqueue_smallest(zone, order, MIGRATE_CMA);
-}
-#else
-static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
- unsigned int order) { return NULL; }
-#endif
-
/*
* Move the free pages in a range to the free lists of the requested type.
* Note that start_page and end_pages are not aligned on a pageblock
@@ -2122,8 +2105,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
/* Yoink! */
mt = get_pageblock_migratetype(page);
- if (!is_migrate_highatomic(mt) && !is_migrate_isolate(mt)
- && !is_migrate_cma(mt)) {
+ if (!is_migrate_highatomic(mt) && !is_migrate_isolate(mt)) {
zone->nr_reserved_highatomic += pageblock_nr_pages;
set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
@@ -2267,13 +2249,8 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
retry:
page = __rmqueue_smallest(zone, order, migratetype);
- if (unlikely(!page)) {
- if (migratetype == MIGRATE_MOVABLE)
- page = __rmqueue_cma_fallback(zone, order);
-
- if (!page && __rmqueue_fallback(zone, order, migratetype))
- goto retry;
- }
+ if (unlikely(!page) && __rmqueue_fallback(zone, order, migratetype))
+ goto retry;
trace_mm_page_alloc_zone_locked(page, order, migratetype);
return page;
@@ -2315,9 +2292,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
list_add_tail(&page->lru, list);
list = &page->lru;
alloced++;
- if (is_migrate_cma(get_pcppage_migratetype(page)))
- __mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
- -(1 << order));
}
/*
@@ -2667,7 +2641,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
return 0;
- __mod_zone_freepage_state(zone, -(1UL << order), mt);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
}
/* Remove page from free list */
@@ -2683,8 +2657,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
struct page *endpage = page + (1 << order) - 1;
for (; page < endpage; page += pageblock_nr_pages) {
int mt = get_pageblock_migratetype(page);
- if (!is_migrate_isolate(mt) && !is_migrate_cma(mt)
- && !is_migrate_highatomic(mt))
+ if (!is_migrate_isolate(mt) &&
+ !is_migrate_highatomic(mt))
set_pageblock_migratetype(page,
MIGRATE_MOVABLE);
}
@@ -2807,8 +2781,7 @@ struct page *rmqueue(struct zone *preferred_zone,
spin_unlock(&zone->lock);
if (!page)
goto failed;
- __mod_zone_freepage_state(zone, -(1 << order),
- get_pcppage_migratetype(page));
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
zone_statistics(preferred_zone, zone);
@@ -2958,11 +2931,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
if (!list_empty(&area->free_list[mt]))
return true;
}
-
-#ifdef CONFIG_CMA
- if (!list_empty(&area->free_list[MIGRATE_CMA]))
- return true;
-#endif
}
return false;
}
@@ -4469,9 +4437,6 @@ static void show_migration_types(unsigned char type)
[MIGRATE_MOVABLE] = 'M',
[MIGRATE_RECLAIMABLE] = 'E',
[MIGRATE_HIGHATOMIC] = 'H',
-#ifdef CONFIG_CMA
- [MIGRATE_CMA] = 'C',
-#endif
#ifdef CONFIG_MEMORY_ISOLATION
[MIGRATE_ISOLATE] = 'I',
#endif
@@ -7361,7 +7326,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
if (zone_idx(zone) == ZONE_MOVABLE || is_zone_cma(zone))
return false;
mt = get_pageblock_migratetype(page);
- if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
+ if (mt == MIGRATE_MOVABLE)
return false;
pfn = page_to_pfn(page);
@@ -7512,16 +7477,12 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
* alloc_contig_range() -- tries to allocate given range of pages
* @start: start PFN to allocate
* @end: one-past-the-last PFN to allocate
- * @migratetype: migratetype of the underlaying pageblocks (either
- * #MIGRATE_MOVABLE or #MIGRATE_CMA). All pageblocks
- * in range must have the same migratetype and it must
- * be either of the two.
* @gfp_mask: GFP mask to use during compaction
*
* The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
* aligned, however it's the caller's responsibility to guarantee that
* we are the only thread that changes migrate type of pageblocks the
- * pages fall in.
+ * pages fall in and it should be MIGRATE_MOVABLE.
*
* The PFN range must belong to a single zone.
*
@@ -7530,7 +7491,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
* need to be freed with free_contig_range().
*/
int alloc_contig_range(unsigned long start, unsigned long end,
- unsigned migratetype, gfp_t gfp_mask)
+ gfp_t gfp_mask)
{
unsigned long outer_start, outer_end;
unsigned int order;
@@ -7564,15 +7525,14 @@ int alloc_contig_range(unsigned long start, unsigned long end,
* allocator removing them from the buddy system. This way
* page allocator will never consider using them.
*
- * This lets us mark the pageblocks back as
- * MIGRATE_CMA/MIGRATE_MOVABLE so that free pages in the
- * aligned range but not in the unaligned, original range are
- * put back to page allocator so that buddy can use them.
+ * This lets us mark the pageblocks back as MIGRATE_MOVABLE
+ * so that free pages in the aligned range but not in the
+ * unaligned, original range are put back to page allocator
+ * so that buddy can use them.
*/
ret = start_isolate_page_range(pfn_max_align_down(start),
- pfn_max_align_up(end), migratetype,
- false);
+ pfn_max_align_up(end), false);
if (ret)
return ret;
@@ -7650,7 +7610,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
done:
undo_isolate_page_range(pfn_max_align_down(start),
- pfn_max_align_up(end), migratetype);
+ pfn_max_align_up(end));
return ret;
}
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 5092e4e..312f2f6 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -62,14 +62,13 @@ static int set_migratetype_isolate(struct page *page,
out:
if (!ret) {
unsigned long nr_pages;
- int migratetype = get_pageblock_migratetype(page);
set_pageblock_migratetype(page, MIGRATE_ISOLATE);
zone->nr_isolate_pageblock++;
nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE,
NULL);
- __mod_zone_freepage_state(zone, -nr_pages, migratetype);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -nr_pages);
}
spin_unlock_irqrestore(&zone->lock, flags);
@@ -122,7 +121,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
*/
if (!isolated_page) {
nr_pages = move_freepages_block(zone, page, migratetype, NULL);
- __mod_zone_freepage_state(zone, nr_pages, migratetype);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
}
set_pageblock_migratetype(page, migratetype);
zone->nr_isolate_pageblock--;
@@ -151,7 +150,6 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
* to be MIGRATE_ISOLATE.
* @start_pfn: The lower PFN of the range to be isolated.
* @end_pfn: The upper PFN of the range to be isolated.
- * @migratetype: migrate type to set in error recovery.
*
* Making page-allocation-type to be MIGRATE_ISOLATE means free pages in
* the range will never be allocated. Any free pages and pages freed in the
@@ -161,7 +159,7 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
* Returns 0 on success and -EBUSY if any part of range cannot be isolated.
*/
int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
- unsigned migratetype, bool skip_hwpoisoned_pages)
+ bool skip_hwpoisoned_pages)
{
unsigned long pfn;
unsigned long undo_pfn;
@@ -185,7 +183,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
for (pfn = start_pfn;
pfn < undo_pfn;
pfn += pageblock_nr_pages)
- unset_migratetype_isolate(pfn_to_page(pfn), migratetype);
+ unset_migratetype_isolate(pfn_to_page(pfn), MIGRATE_MOVABLE);
return -EBUSY;
}
@@ -193,8 +191,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
/*
* Make isolated pages available again.
*/
-int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
- unsigned migratetype)
+int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn)
{
unsigned long pfn;
struct page *page;
@@ -208,7 +205,7 @@ int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
page = __first_valid_page(pfn, pageblock_nr_pages);
if (!page || !is_migrate_isolate_page(page))
continue;
- unset_migratetype_isolate(page, migratetype);
+ unset_migratetype_isolate(page, MIGRATE_MOVABLE);
}
return 0;
}
diff --git a/mm/page_owner.c b/mm/page_owner.c
index c3cee24..4016815 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -299,11 +299,7 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
page_mt = gfpflags_to_migratetype(
page_owner->gfp_mask);
if (pageblock_mt != page_mt) {
- if (is_migrate_cma(pageblock_mt))
- count[MIGRATE_MOVABLE]++;
- else
- count[pageblock_mt]++;
-
+ count[pageblock_mt]++;
pfn = block_end_pfn;
break;
}
diff --git a/mm/usercopy.c b/mm/usercopy.c
index a9852b2..f0a2c8f 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -179,7 +179,7 @@ static inline const char *check_page_span(const void *ptr, unsigned long n,
* several independently allocated pages.
*/
is_reserved = PageReserved(page);
- is_cma = is_migrate_cma_page(page);
+ is_cma = is_zone_cma(page_zone(page));
if (!is_reserved && !is_cma)
return "<spans multiple pages>";
@@ -187,7 +187,7 @@ static inline const char *check_page_span(const void *ptr, unsigned long n,
page = virt_to_head_page(ptr);
if (is_reserved && !PageReserved(page))
return "<spans Reserved and non-Reserved pages>";
- if (is_cma && !is_migrate_cma_page(page))
+ if (is_cma && !is_zone_cma(page_zone(page)))
return "<spans CMA and non-CMA pages>";
}
#endif
--
2.7.4
From: Joonsoo Kim <[email protected]>
Now, all reserved pages for a CMA region belong to ZONE_CMA,
so we don't need to maintain the CMA stat in the other zones. Remove it.
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
fs/proc/meminfo.c | 2 +-
include/linux/cma.h | 7 +++++++
include/linux/mmzone.h | 1 -
mm/cma.c | 15 +++++++++++++++
mm/page_alloc.c | 7 +++----
mm/vmstat.c | 1 -
6 files changed, 26 insertions(+), 7 deletions(-)
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 8a42849..0ca6f38 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -151,7 +151,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#ifdef CONFIG_CMA
show_val_kb(m, "CmaTotal: ", totalcma_pages);
show_val_kb(m, "CmaFree: ",
- global_page_state(NR_FREE_CMA_PAGES));
+ cma_get_free());
#endif
hugetlb_report_meminfo(m);
diff --git a/include/linux/cma.h b/include/linux/cma.h
index 03f32d0..2433d5e 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -32,4 +32,11 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
gfp_t gfp_mask);
extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count);
+
+#ifdef CONFIG_CMA
+extern unsigned long cma_get_free(void);
+#else
+static inline unsigned long cma_get_free(void) { return 0; }
+#endif
+
#endif
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index efb69b1..5e1d8ba 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -118,7 +118,6 @@ enum zone_stat_item {
NUMA_LOCAL, /* allocation from local node */
NUMA_OTHER, /* allocation from other node */
#endif
- NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };
enum node_stat_item {
diff --git a/mm/cma.c b/mm/cma.c
index 91dd85a..adfda1c 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -54,6 +54,21 @@ unsigned long cma_get_size(const struct cma *cma)
return cma->count << PAGE_SHIFT;
}
+unsigned long cma_get_free(void)
+{
+ struct zone *zone;
+ unsigned long freecma = 0;
+
+ for_each_populated_zone(zone) {
+ if (!is_zone_cma(zone))
+ continue;
+
+ freecma += zone_page_state(zone, NR_FREE_PAGES);
+ }
+
+ return freecma;
+}
+
static unsigned long cma_bitmap_aligned_mask(const struct cma *cma,
int align_order)
{
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 33a1b69..7d61469 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -66,6 +66,7 @@
#include <linux/kthread.h>
#include <linux/memcontrol.h>
#include <linux/ftrace.h>
+#include <linux/cma.h>
#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -4502,7 +4503,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
global_page_state(NR_BOUNCE),
global_page_state(NR_FREE_PAGES),
free_pcp,
- global_page_state(NR_FREE_CMA_PAGES));
+ cma_get_free());
for_each_online_pgdat(pgdat) {
if (show_mem_node_skip(filter, pgdat->node_id, nodemask))
@@ -4586,7 +4587,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
" bounce:%lukB"
" free_pcp:%lukB"
" local_pcp:%ukB"
- " free_cma:%lukB"
"\n",
zone->name,
K(zone_page_state(zone, NR_FREE_PAGES)),
@@ -4608,8 +4608,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
K(zone_page_state(zone, NR_PAGETABLE)),
K(zone_page_state(zone, NR_BOUNCE)),
K(free_pcp),
- K(this_cpu_read(zone->pageset->pcp.count)),
- K(zone_page_state(zone, NR_FREE_CMA_PAGES)));
+ K(this_cpu_read(zone->pageset->pcp.count)));
printk("lowmem_reserve[]:");
for (i = 0; i < MAX_NR_ZONES; i++)
printk(KERN_CONT " %ld", zone->lowmem_reserve[i]);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3c3aac2..9467b56 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -951,7 +951,6 @@ const char * const vmstat_text[] = {
"numa_local",
"numa_other",
#endif
- "nr_free_cma",
/* Node-based counters */
"nr_inactive_anon",
--
2.7.4
From: Joonsoo Kim <[email protected]>
Freepages on ZONE_HIGHMEM cannot be used for kernel memory, so reserving
them isn't that important. However, when ZONE_MOVABLE is used, this
reservation could theoretically decrease the memory usable for
GFP_HIGHUSER_MOVABLE allocation requests, which are mainly used for page
cache and anon page allocation. So, fix it.
Also, defining the sysctl_lowmem_reserve_ratio array with MAX_NR_ZONES - 1
entries makes the code complex. For example, on a highmem system, the
following reserve ratio is applied to the *NORMAL* zone, which could easily
mislead people.
#ifdef CONFIG_HIGHMEM
32
#endif
This patch also fixes that situation by defining the
sysctl_lowmem_reserve_ratio array with MAX_NR_ZONES entries and placing
the "#ifdef"s in the right places.
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mmzone.h | 2 +-
mm/page_alloc.c | 11 ++++++-----
2 files changed, 7 insertions(+), 6 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ebaccd4..96194bf 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -869,7 +869,7 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
-extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
+extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 32b31d6..60ffa4e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -203,17 +203,18 @@ static void __free_pages_ok(struct page *page, unsigned int order);
* TBD: should special case ZONE_DMA32 machines here - in those we normally
* don't need any ZONE_NORMAL reservation
*/
-int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
+int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES] = {
#ifdef CONFIG_ZONE_DMA
- 256,
+ [ZONE_DMA] = 256,
#endif
#ifdef CONFIG_ZONE_DMA32
- 256,
+ [ZONE_DMA32] = 256,
#endif
+ [ZONE_NORMAL] = 32,
#ifdef CONFIG_HIGHMEM
- 32,
+ [ZONE_HIGHMEM] = INT_MAX,
#endif
- 32,
+ [ZONE_MOVABLE] = INT_MAX,
};
EXPORT_SYMBOL(totalram_pages);
--
2.7.4
Hi,
I didn't get to read through the patches yet but the cover letter didn't
really help me to understand the basic concepts to have a good starting
point before diving into implementation details. It contains a lot of
history remarks which is not bad but IMHO too excessive here. I would
appreciate the following information (some of that is already provided
in the cover but could benefit from some rewording/text reorganization).
- what is ZONE_CMA and how it is configured (from admin POV)
- how does ZONE_CMA compare to other zones
- who is allowed to allocate from this zone and what are the
guarantees/requirements for successful allocation
- how does the zone compare to a preallocate allocation pool
- how is ZONE_CMA balanced/reclaimed due to internal memory pressure
(from CMA users)
- is this zone reclaimable for the global memory reclaim
- why this was/is controversial
--
Michal Hocko
SUSE Labs
On Tue, Apr 11, 2017 at 08:15:20PM +0200, Michal Hocko wrote:
> Hi,
> I didn't get to read though patches yet but the cover letter didn't
> really help me to understand the basic concepts to have a good starting
> point before diving into implementation details. It contains a lot of
> history remarks which is not bad but IMHO too excessive here. I would
> appreciate the following information (some of that is already provided
> in the cover but could benefit from some rewording/text reorganization).
>
> - what is ZONE_CMA and how it is configured (from admin POV)
> - how does ZONE_CMA compare to other zones
> - who is allowed to allocate from this zone and what are the
> guarantees/requirements for successful allocation
> - how does the zone compare to a preallocate allocation pool
> - how is ZONE_CMA balanced/reclaimed due to internal memory pressure
> (from CMA users)
> - is this zone reclaimable for the global memory reclaim
> - why this was/is controversial
Hello,
I hope that the following summary helps you understand this patchset.
I skip some basic things about CMA. I will attach this description to
the cover letter if a re-spin is needed.
1. What is ZONE_CMA
ZONE_CMA is a newly introduced zone that manages freepages in CMA areas.
Previously, freepages in CMA areas lived in an ordinary zone and were
managed/distinguished by the special migratetype, MIGRATE_CMA.
However, this causes too many subtle problems; fixing all of them
seems impossible and too intrusive to the MM subsystem.
Therefore, a different solution was requested and this is the outcome of
that request. Problem details are described in PART 3.
There is no change from the admin POV; it is just an implementation detail.
If the kernel is configured to use CMA, CMA memory is managed by MM as before
except that its pages now belong to a separate zone, ZONE_CMA.
2. How does ZONE_CMA compare to other zones
ZONE_CMA is conceptually the same as ZONE_MOVABLE. There is a software
constraint to guarantee the success of future allocation requests from
the device. If the device requests a specific range of memory in a CMA
area at runtime, any page there that was allocated by the MM will be
migrated to another page and the range will be returned to the device.
To guarantee this, ZONE_CMA only accepts allocation requests with GFP_MOVABLE.
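As a rough conceptual sketch of that restriction (this is not the patch's
actual code; the helper name is made up, and the real patchset presumably
wires the equivalent logic into gfp_zone()/zonelist construction):

	/*
	 * Illustrative only: ZONE_CMA sits above ZONE_MOVABLE as the highest
	 * zone index, and only movable requests are allowed to reach it.
	 */
	static inline enum zone_type highest_allowed_zone(gfp_t gfp_mask)
	{
		if (gfp_mask & __GFP_MOVABLE)
			return ZONE_CMA;	/* movable requests may use CMA pages */
		if (gfp_mask & __GFP_HIGHMEM)
			return ZONE_HIGHMEM;	/* highmem is unusable for kernel memory */
		return ZONE_NORMAL;		/* kernel allocations stay in lowmem zones */
	}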
The other important point about ZONE_CMA is that its span may overlap
with other zones. This is not new to the MM subsystem and
the MM subsystem has enough logic to handle such a situation,
so there should be no problem.
Everything else is completely the same as for other zones. From the MM POV,
there is no difference in the allocation process except that the zone only
accepts GFP_MOVABLE requests. In reclaim, pages allocated by the MM are
reclaimed by the MM's usual policy. So, no difference.
This 'no difference' is a strong point of this approach. ZONE_CMA is
handled naturally by the MM subsystem, unlike before (when special handling
was required for MIGRATE_CMA).
3. Controversial Point
The major concern from Mel is that the zone concept is abused. Zones were
originally introduced to solve some issues due to H/W addressing limitations.
However, since the age of ZONE_MOVABLE, zones have also been used to solve
issues due to S/W limitations. This S/W limitation causes the highmem/lowmem
problem, that is, some of the memory cannot be used for kernel memory and
LRU ordering is easily broken. My major objection to this point is that
the problem isn't related to an implementation detail such as a zone.
The problems simply come from the S/W limitation that we cannot use this
memory for kernel memory, in order to guarantee offlining the memory
(ZONE_MOVABLE) or allocation from the device (ZONE_CMA) in the future.
See PART 1 for more information.
Thanks.
On Tue, Apr 11, 2017 at 12:17:13PM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> Changed from v6
> o Rebase on next-20170405
> o Add a fix for lowmem mapping on ARM (last patch)
Hello, Russell and Will.
In this 7th version of the patchset, I have added a new patch for ARM.
Could you review it?
Thanks.
On Wed 12-04-17 10:35:06, Joonsoo Kim wrote:
> On Tue, Apr 11, 2017 at 08:15:20PM +0200, Michal Hocko wrote:
> > Hi,
> > I didn't get to read though patches yet but the cover letter didn't
> > really help me to understand the basic concepts to have a good starting
> > point before diving into implementation details. It contains a lot of
> > history remarks which is not bad but IMHO too excessive here. I would
> > appreciate the following information (some of that is already provided
> > in the cover but could benefit from some rewording/text reorganization).
> >
> > - what is ZONE_CMA and how it is configured (from admin POV)
> > - how does ZONE_CMA compare to other zones
> > - who is allowed to allocate from this zone and what are the
> > guarantees/requirements for successful allocation
> > - how does the zone compare to a preallocate allocation pool
> > - how is ZONE_CMA balanced/reclaimed due to internal memory pressure
> > (from CMA users)
> > - is this zone reclaimable for the global memory reclaim
> > - why this was/is controversial
>
> Hello,
>
> I hope that following summary helps you to understand this patchset.
> I skip some basic things about CMA. I will attach this description to
> the cover-letter if re-spin is needed.
I believe that sorting out these questions is more important than what
you have in the current cover letter. Andrew tends to fold the cover
into the first patch so I think you should update.
> 2. How does ZONE_CMA compare to other zones
>
> ZONE_CMA is conceptually the same with ZONE_MOVABLE. There is a software
> constraint to guarantee the success of future allocation request from
> the device. If the device requests the specific range of the memory in CMA
> area at the runtime, page that allocated by MM will be migrated to
> the other page and it will be returned to the device. To guarantee it,
> ZONE_CMA only takes the allocation request with GFP_MOVABLE.
The immediate follow up question is. Why cannot we reuse ZONE_MOVABLE
for that purpose?
> The other important point about ZONE_CMA is that span of ZONE_CMA would be
> overlapped with the other zone. This is not new to MM subsystem and
> MM subsystem has enough logic to handle such situation
> so there would be no problem.
I am not really sure this is actually true. Zones are disjoint from the
early beginning. I remember that we had something like numa nodes
interleaving but that is such a rare configuration that I wouldn't be
surprised if it wasn't very well tested and actually broken in some
subtle ways.
There are many page_zone(page) != zone checks sprinkled in the code but
I do not see anything consistent there. Similarly pageblock_pfn_to_page
is only used by compaction but there are other pfn walkers which do
ad-hoc checking. I was staring into that code these days due to my
hotplug patches.
That being said, I think that interleaving zones are an interesting
concept but I would be rather nervous to consider this as working
currently without a deeper review.
> Other things are completely the same with other zones. For MM POV, there is
> no difference in allocation process except that it only takes
> GFP_MOVABLE request. In reclaim, pages that are allocated by MM will
> be reclaimed by the same policy of the MM. So, no difference.
OK, so essentially this is yet another "highmem" zone. We already know
that only GFP_MOVABLE are allowed to fallback to ZONE_CMA but do CMA
allocations fallback to other zones and punch new holes? In which zone
order?
> This 'no difference' is a strong point of this approach. ZONE_CMA is
> naturally handled by MM subsystem unlike as before (special handling is
> required for MIGRATE_CMA).
>
> 3. Controversial Point
>
> Major concern from Mel is that zone concept is abused. ZONE is originally
> introduced to solve some issues due to H/W addressing limitation.
Yes, very much agreed on that. You basically want to punch holes into
other zones to guarantee allocation progress. Marking those holes
with a special migrate type sounds quite natural but I will have to study
the current code some more to see whether issues you mention are
inherently unfixable. This might very well turn out to be the case.
> However, from the age of ZONE_MOVABLE, ZONE is used to solve the issues
> due to S/W limitation.
copying ZONE_MOVABLE pattern doesn't sound all that great to me to be
honest.
> This S/W limitation causes highmem/lowmem problem
> that is some of memory cannot be usable for kernel memory and LRU ordering
> would be broken easily. My major objection to this point is that
> this problem isn't related to implementation detail like as ZONE.
yes, agreement on that.
> Problems just comes from S/W limitation that we cannot use this memory
> for kernel memory to guarantee offlining the memory (ZONE_MOVABLE) or
> allocation from the device (ZONE_CMA) in the future. See PART 1 for
> more information.
--
Michal Hocko
SUSE Labs
On Thu, Apr 13, 2017 at 01:56:15PM +0200, Michal Hocko wrote:
> On Wed 12-04-17 10:35:06, Joonsoo Kim wrote:
> > On Tue, Apr 11, 2017 at 08:15:20PM +0200, Michal Hocko wrote:
> > > Hi,
> > > I didn't get to read though patches yet but the cover letter didn't
> > > really help me to understand the basic concepts to have a good starting
> > > point before diving into implementation details. It contains a lot of
> > > history remarks which is not bad but IMHO too excessive here. I would
> > > appreciate the following information (some of that is already provided
> > > in the cover but could benefit from some rewording/text reorganization).
> > >
> > > - what is ZONE_CMA and how it is configured (from admin POV)
> > > - how does ZONE_CMA compare to other zones
> > > - who is allowed to allocate from this zone and what are the
> > > guarantees/requirements for successful allocation
> > > - how does the zone compare to a preallocate allocation pool
> > > - how is ZONE_CMA balanced/reclaimed due to internal memory pressure
> > > (from CMA users)
> > > - is this zone reclaimable for the global memory reclaim
> > > - why this was/is controversial
> >
> > Hello,
> >
> > I hope that following summary helps you to understand this patchset.
> > I skip some basic things about CMA. I will attach this description to
> > the cover-letter if re-spin is needed.
>
> I believe that sorting out these questions is more important than what
> you have in the current cover letter. Andrew tends to fold the cover
> into the first patch so I think you should update.
Okay.
> > 2. How does ZONE_CMA compare to other zones
> >
> > ZONE_CMA is conceptually the same with ZONE_MOVABLE. There is a software
> > constraint to guarantee the success of future allocation request from
> > the device. If the device requests the specific range of the memory in CMA
> > area at the runtime, page that allocated by MM will be migrated to
> > the other page and it will be returned to the device. To guarantee it,
> > ZONE_CMA only takes the allocation request with GFP_MOVABLE.
>
> The immediate follow up question is. Why cannot we reuse ZONE_MOVABLE
> for that purpose?
I could make CMA reuse ZONE_MOVABLE but I don't want to. The reasons
are:
1. If ZONE_MOVABLE has two different types of memory, hotpluggable and
CMA, it may need special handling for each type. This would lead to a new
migratetype again (to distinguish them) and would be error-prone. I
don't want that.
2. CMA users want to see usage statistics separately, since CMA often causes
problems and a separate stat helps to debug them.
> > The other important point about ZONE_CMA is that span of ZONE_CMA would be
> > overlapped with the other zone. This is not new to MM subsystem and
> > MM subsystem has enough logic to handle such situation
> > so there would be no problem.
>
> I am not really sure this is actually true. Zones are disjoint from the
> early beginning. I remember that we had something like numa nodes
> interleaving but that is such a rare configuration that I wouldn't be
> surprised if it wasn't very well tested and actually broken in some
> subtle ways.
I agree with your concern; however, if something is broken for them, that
just shows that we need to fix it. MM should handle this situation
since we already know that such architectures exist.
>
> There are many page_zone(page) != zone checks sprinkled in the code but
> I do not see anything consistent there. Similarly pageblock_pfn_to_page
> is only used by compaction but there are other pfn walkers which do
> ad-hoc checking. I was staring into that code these days due to my
> hotplug patches.
>
> That being said, I think that interleaving zones are an interesting
> concept but I would be rather nervous to consider this as working
> currently without a deeper review.
I have tried to audit all the pfn walkers before and have added the
above-mentioned check. Perhaps I missed something, but I believe not
that much. Our production systems already use ZONE_CMA and I haven't
received a report of such a problem.
>
> > Other things are completely the same with other zones. For MM POV, there is
> > no difference in allocation process except that it only takes
> > GFP_MOVABLE request. In reclaim, pages that are allocated by MM will
> > be reclaimed by the same policy of the MM. So, no difference.
>
> OK, so essentially this is yet another "highmem" zone. We already know
> that only GFP_MOVABLE are allowed to fallback to ZONE_CMA but do CMA
> allocations fallback to other zones and punch new holes? In which zone
> order?
Hmm... I don't understand your question. Could you elaborate it more?
> > This 'no difference' is a strong point of this approach. ZONE_CMA is
> > naturally handled by MM subsystem unlike as before (special handling is
> > required for MIGRATE_CMA).
> >
> > 3. Controversial Point
> >
> > Major concern from Mel is that zone concept is abused. ZONE is originally
> > introduced to solve some issues due to H/W addressing limitation.
>
> Yes, very much agreed on that. You basically want to punch holes into
> other zones to guarantee an allocation progress. Marking those wholes
> with special migrate type sounds quite natural but I will have to study
> the current code some more to see whether issues you mention are
> inherently unfixable. This might very well turn out to be the case.
At a glance, a special migratetype sounds natural. I thought so too. However,
it's not natural from the implementation POV. A zone consists of the same type
of memory (by definition?) and the MM subsystem is implemented with that
assumption. If different types of memory share the same zone, it easily
causes problems, and the CMA problems are such a case.
Thanks.
Hi Joonsoo,
On Tue, Apr 11, 2017 at 12:17:14PM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> Freepage on ZONE_HIGHMEM doesn't work for kernel memory so it's not that
> important to reserve. When ZONE_MOVABLE is used, this problem would
> theorectically cause to decrease usable memory for GFP_HIGHUSER_MOVABLE
> allocation request which is mainly used for page cache and anon page
> allocation. So, fix it.
>
> And, defining sysctl_lowmem_reserve_ratio array by MAX_NR_ZONES - 1 size
> makes code complex. For example, if there is highmem system, following
> reserve ratio is activated for *NORMAL ZONE* which would be easyily
> misleading people.
>
> #ifdef CONFIG_HIGHMEM
> 32
> #endif
>
> This patch also fix this situation by defining sysctl_lowmem_reserve_ratio
> array by MAX_NR_ZONES and place "#ifdef" to right place.
>
> Reviewed-by: Aneesh Kumar K.V <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> include/linux/mmzone.h | 2 +-
> mm/page_alloc.c | 11 ++++++-----
> 2 files changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index ebaccd4..96194bf 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -869,7 +869,7 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
> void __user *, size_t *, loff_t *);
> int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
> void __user *, size_t *, loff_t *);
> -extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
> +extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
> int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
> void __user *, size_t *, loff_t *);
> int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 32b31d6..60ffa4e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -203,17 +203,18 @@ static void __free_pages_ok(struct page *page, unsigned int order);
> * TBD: should special case ZONE_DMA32 machines here - in those we normally
> * don't need any ZONE_NORMAL reservation
> */
> -int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
> +int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES] = {
> #ifdef CONFIG_ZONE_DMA
> - 256,
> + [ZONE_DMA] = 256,
> #endif
> #ifdef CONFIG_ZONE_DMA32
> - 256,
> + [ZONE_DMA32] = 256,
> #endif
> + [ZONE_NORMAL] = 32,
> #ifdef CONFIG_HIGHMEM
> - 32,
> + [ZONE_HIGHMEM] = INT_MAX,
> #endif
> - 32,
> + [ZONE_MOVABLE] = INT_MAX,
> };
We need to update lowmem_reserve_ratio in Documentation/sysctl/vm.txt.
And to me, INT_MAX is rather awkward.
# cat /proc/sys/vm/lowmem_reserve_ratio
256 256 32 2147483647 2147483647
What do you think about using 0 or -1 as a special value
instead of 2147483647?
Anyway, it could be a separate patch regardless of ZONE_CMA,
so I hope Andrew merges this patch regardless of the other patches
in this patchset.
Thanks.
On Mon, Apr 17, 2017 at 04:38:08PM +0900, Minchan Kim wrote:
> Hi Joonsoo,
>
> On Tue, Apr 11, 2017 at 12:17:14PM +0900, [email protected] wrote:
[...]
>
> We need to update lowmem_reserve_ratio in Documentation/sysctl/vm.txt.
Okay!
> And to me, INT_MAX is rather awkward.
I also think so.
> # cat /proc/sys/vm/lowmem_reserve_ratio
> 256 256 32 2147483647 2147483647
>
> What do you think about to use 0 or -1 as special meaning
> instead 2147483647?
I thought about it but dropped it. In setup_per_zone_lowmem_reserve(),
there is code that adjusts the value to 1 if it is less than 1.
There might be someone who (ab)uses this adjustment, so it's safer to use
INT_MAX.
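For reference, the adjustment I mean looks roughly like this inside
setup_per_zone_lowmem_reserve() (paraphrased, not an exact quote of the
current source):

	/* ratios below 1 are silently clamped before the reserve is computed */
	if (sysctl_lowmem_reserve_ratio[idx] < 1)
		sysctl_lowmem_reserve_ratio[idx] = 1;

	lower_zone = pgdat->node_zones + idx;
	lower_zone->lowmem_reserve[j] = managed_pages /
					sysctl_lowmem_reserve_ratio[idx];

So a ratio of 0 or -1 would end up behaving like 1 (reserve everything),
while INT_MAX makes the reserve effectively zero, which is what we want
for HIGHMEM/MOVABLE here.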
> Anyway, it could be separate patch regardless of zone_cma
> so I hope Andrew to merge this patch regardless of other patches
> in this patchset.
Okay. I will send an updated version soon.
Thanks.
On Mon, Apr 17, 2017 at 11:02:12AM +0900, Joonsoo Kim wrote:
> On Thu, Apr 13, 2017 at 01:56:15PM +0200, Michal Hocko wrote:
> > On Wed 12-04-17 10:35:06, Joonsoo Kim wrote:
[...]
Hello, Michal.
If you don't have any more questions, I will send the next version with
an updated cover letter.
Thanks.
On Fri 21-04-17 10:35:03, Joonsoo Kim wrote:
[...]
> Hello, Michal.
>
> If you don't have any more question, I will send next version with
> updated cover-letter.
I am sorry but I am busy as hell this week and didn't get to your email
yet. I will try as soon as possible.
--
Michal Hocko
SUSE Labs
On 2017/4/11 11:17, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
[...]
> My patchset can remove many hooks related to CMA area management in MM
> while solving the problems. More hooks are required to solve the problems
> if we choose MIGRATETYPE approach.
>
Agreed, there are already too many hooks and they are a pain to
maintain/bugfix. It looks better if we choose this ZONE_CMA approach.
--
Regards,
Bob Liu
On Mon 17-04-17 11:02:12, Joonsoo Kim wrote:
> On Thu, Apr 13, 2017 at 01:56:15PM +0200, Michal Hocko wrote:
> > On Wed 12-04-17 10:35:06, Joonsoo Kim wrote:
[...]
> > > ZONE_CMA is conceptually the same with ZONE_MOVABLE. There is a software
> > > constraint to guarantee the success of future allocation request from
> > > the device. If the device requests the specific range of the memory in CMA
> > > area at the runtime, page that allocated by MM will be migrated to
> > > the other page and it will be returned to the device. To guarantee it,
> > > ZONE_CMA only takes the allocation request with GFP_MOVABLE.
> >
> > The immediate follow up question is. Why cannot we reuse ZONE_MOVABLE
> > for that purpose?
>
> I can make CMA reuses the ZONE_MOVABLE but I don't want it. Reasons
> are that
>
> 1. If ZONE_MOVABLE has two different types of memory, hotpluggable and
> CMA, it may need special handling for each type. This would lead to a new
> migratetype again (to distinguish them) and easy to be error-prone. I
> don't want that case.
Hmm, I see your motivation. I believe that we could find a way
around this. Anyway, movable zones are quite special and configuring
overlapping CMA and hotplug movable regions could be refused. So I am
not even sure this is a real problem in practice.
> 2. CMA users want to see usage stat separately since CMA often causes
> the problems and separate stat would helps to debug it.
That could be solved by a per-zone/node counter.
Anyway, these reasons should be mentioned as well. Adding a new zone is
not for free. For most common configurations where we have ZONE_DMA,
ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE all the 3 bits are already
consumed so a new zone will need a new one AFAICS.
[...]
> > > Other things are completely the same with other zones. For MM POV, there is
> > > no difference in allocation process except that it only takes
> > > GFP_MOVABLE request. In reclaim, pages that are allocated by MM will
> > > be reclaimed by the same policy of the MM. So, no difference.
> >
> > OK, so essentially this is yet another "highmem" zone. We already know
> > that only GFP_MOVABLE are allowed to fallback to ZONE_CMA but do CMA
> > allocations fallback to other zones and punch new holes? In which zone
> > order?
>
> Hmm... I don't understand your question. Could you elaborate it more?
Well, my question was about the zone fallback chain. MOVABLE allocation
can fallback to lower zones and also to the ZONE_CMA with your patch. If
there is a CMA allocation it doesn't fall back to any other zone - in
other words no new holes are punched to other zones. Is this correct?
> > > This 'no difference' is a strong point of this approach. ZONE_CMA is
> > > naturally handled by MM subsystem unlike as before (special handling is
> > > required for MIGRATE_CMA).
> > >
> > > 3. Controversial Point
> > >
> > > Major concern from Mel is that zone concept is abused. ZONE is originally
> > > introduced to solve some issues due to H/W addressing limitation.
> >
> > Yes, very much agreed on that. You basically want to punch holes into
> > other zones to guarantee an allocation progress. Marking those wholes
> > with special migrate type sounds quite natural but I will have to study
> > the current code some more to see whether issues you mention are
> > inherently unfixable. This might very well turn out to be the case.
>
> At a glance, special migratetype sound natural. I also did. However,
> it's not natural in implementation POV. Zone consists of the same type
> of memory (by definition ?) and MM subsystem is implemented with that
> assumption. If difference type of memory shares the same zone, it easily
> causes the problem and CMA problems are the such case.
But this is not any different from the highmem vs. lowmem problems we
already have, no? I have looked at your example in the cover where you
mention utilization and the reclaim problems. With the node reclaim we
will have pages from all zones on the same LRU(s). isolate_lru_pages
will skip those from ZONE_CMA because their zone_idx is higher than
gfp_idx(GFP_KERNEL). The same could be achieved by an explicit check for
the pageblock migrate type. So the zone doesn't really help much. Or is
there some aspect that I am missing?
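The skip I have in mind is roughly this check in isolate_lru_pages() in
mm/vmscan.c as of these kernels (quoted from memory, so treat it as
illustrative rather than exact):

	/* pages from zones above the reclaim target are skipped, not isolated */
	if (page_zonenum(page) > sc->reclaim_idx) {
		list_move(&page->lru, &pages_skipped);
		nr_skipped[page_zonenum(page)]++;
		continue;
	}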
Another worry I would have with the zone approach is that there is a
risk of reintroducing issues we used to have with small zones in the
past. Just consider that the CMA zone may get depleted by CMA users almost
completely. Then that zone will not get balanced with only a few pages left.
wakeup_kswapd/pgdat_balanced already have measures to prevent such wake
ups but I cannot say I would be sure everything will work smoothly.
I have glanced through the cumulative diff and to be honest I am not
really sure the result is a great simplification in the end. There is
still quite a lot of special casing. It is true that the page allocator
path is cleaned up and some CMA specific checks are moved away. This is
definitely good to see but I am not convinced that the new zone is
really justified. Only very little from the zone infrastructure is used
in the end AFAICS. Is there any specific usecase which cannot be solved
with the pageblock while it could be with the zone approach? That would
be a strong argument to chose one over another.
Please do _not_ take this as a NAK from me. At least not at this time. I
am still trying to understand all the consequences but my intuition
tells me that building on top of highmem like approach will turn out to
be problematic in future (as we have already seen with the highmem and
movable zones) so this needs a very prudent consideration.
--
Michal Hocko
SUSE Labs
On Mon, Apr 24, 2017 at 03:09:36PM +0200, Michal Hocko wrote:
> On Mon 17-04-17 11:02:12, Joonsoo Kim wrote:
> > On Thu, Apr 13, 2017 at 01:56:15PM +0200, Michal Hocko wrote:
> > > On Wed 12-04-17 10:35:06, Joonsoo Kim wrote:
> [...]
> > > > ZONE_CMA is conceptually the same with ZONE_MOVABLE. There is a software
> > > > constraint to guarantee the success of future allocation request from
> > > > the device. If the device requests the specific range of the memory in CMA
> > > > area at the runtime, page that allocated by MM will be migrated to
> > > > the other page and it will be returned to the device. To guarantee it,
> > > > ZONE_CMA only takes the allocation request with GFP_MOVABLE.
> > >
> > > The immediate follow up question is. Why cannot we reuse ZONE_MOVABLE
> > > for that purpose?
> >
> > I can make CMA reuses the ZONE_MOVABLE but I don't want it. Reasons
> > are that
> >
> > 1. If ZONE_MOVABLE has two different types of memory, hotpluggable and
> > CMA, it may need special handling for each type. This would lead to a new
> > migratetype again (to distinguish them) and easy to be error-prone. I
> > don't want that case.
>
> Hmm, I see your motivation. I believe that we could find a way
> around this. Anyway, movable zones are quite special and configuring
> overlapping CMA and hotplug movable regions could be refused. So I am
> not even sure this is a real problem in practice.
>
> > 2. CMA users want to see usage stat separately since CMA often causes
> > the problems and separate stat would helps to debug it.
>
> That could be solved by a per-zone/node counter.
>
> Anyway, these reasons should be mentioned as well. Adding a new zone is
Okay.
> not for free. For most common configurations where we have ZONE_DMA,
> ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE all the 3 bits are already
> consumed so a new zone will need a new one AFAICS.
Yes, it requires one more bit for a new zone and it's handled by the patch.
>
> [...]
> > > > Other things are completely the same with other zones. For MM POV, there is
> > > > no difference in allocation process except that it only takes
> > > > GFP_MOVABLE request. In reclaim, pages that are allocated by MM will
> > > > be reclaimed by the same policy of the MM. So, no difference.
> > >
> > > OK, so essentially this is yet another "highmem" zone. We already know
> > > that only GFP_MOVABLE are allowed to fallback to ZONE_CMA but do CMA
> > > allocations fallback to other zones and punch new holes? In which zone
> > > order?
> >
> > Hmm... I don't understand your question. Could you elaborate it more?
>
> Well, my question was about the zone fallback chain. MOVABLE allocation
> can fallback to lower zones and also to the ZONE_CMA with your patch. If
> there is a CMA allocation it doesn't fall back to any other zone - in
> other words no new holes are punched to other zones. Is this correct?
Hmm... I still don't get the meaning of "no new holes are punched to
other zones". I'll try to answer based on my current understanding of
your question.
A MOVABLE allocation will fall back in the following sequence.
ZONE_CMA -> ZONE_MOVABLE -> ZONE_HIGHMEM -> ZONE_NORMAL -> ...
I don't understand what you mean by a CMA allocation. From the MM's
point of view, there is no CMA allocation; it is just a MOVABLE allocation.
From the device's point of view, there is a CMA allocation. It is a
range-specific allocation, so it must succeed for the requested range.
No fallback is allowed in this case.
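In code form, roughly (just a sketch of the intent, not the actual
allocator; try_zone_alloc() is a made-up placeholder and the real code
walks a per-node zonelist rather than raw zone indices):

#include <linux/gfp.h>
#include <linux/mmzone.h>

/* Made-up placeholder standing in for the watermark check + rmqueue. */
static struct page *try_zone_alloc(struct zone *zone, unsigned int order);

/*
 * Rough sketch: a movable allocation starts at the highest zone index
 * its gfp mask allows (ZONE_CMA with this patchset) and walks down to
 * lower zones until one of them can satisfy the request.
 */
static struct page *movable_alloc_sketch(pg_data_t *pgdat, unsigned int order)
{
	enum zone_type idx = gfp_zone(GFP_HIGHUSER_MOVABLE);

	for (;;) {
		struct zone *zone = &pgdat->node_zones[idx];
		struct page *page;

		if (populated_zone(zone)) {
			page = try_zone_alloc(zone, order);
			if (page)
				return page;
		}
		if (idx == 0)
			break;
		idx--;
	}
	return NULL;
}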
> > > > This 'no difference' is a strong point of this approach. ZONE_CMA is
> > > > naturally handled by MM subsystem unlike as before (special handling is
> > > > required for MIGRATE_CMA).
> > > >
> > > > 3. Controversial Point
> > > >
> > > > Major concern from Mel is that zone concept is abused. ZONE is originally
> > > > introduced to solve some issues due to H/W addressing limitation.
> > >
> > > Yes, very much agreed on that. You basically want to punch holes into
> > > other zones to guarantee an allocation progress. Marking those wholes
> > > with special migrate type sounds quite natural but I will have to study
> > > the current code some more to see whether issues you mention are
> > > inherently unfixable. This might very well turn out to be the case.
> >
> > At a glance, special migratetype sound natural. I also did. However,
> > it's not natural in implementation POV. Zone consists of the same type
> > of memory (by definition ?) and MM subsystem is implemented with that
> > assumption. If difference type of memory shares the same zone, it easily
> > causes the problem and CMA problems are the such case.
>
> But this is not any different from the highmem vs. lowmem problems we
> already have, no? I have looked at your example in the cover where you
> mention utilization and the reclaim problems. With the node reclaim we
> will have pages from all zones on the same LRU(s). isolate_lru_pages
> will skip those from ZONE_CMA because their zone_idx is higher than
> gfp_idx(GFP_KERNEL). The same could be achieved by an explicit check for
> the pageblock migrate type. So the zone doesn't really help much. Or is
> there some aspect that I am missing?
Your understanding is correct. It could be achieved by an explicit
check for the migratetype, and that is the main reason we should avoid
such an approach.
With the ZONE approach, all of this is handled naturally. We don't need
any explicit checks anywhere; we already have code that skips reclaiming
such pages by checking the zone_idx.
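Roughly, the existing filter amounts to this (a simplified paraphrase,
not a literal copy of isolate_lru_pages()):

#include <linux/mm.h>
#include <linux/mmzone.h>

/*
 * Simplified paraphrase: reclaim already refuses to isolate a page
 * whose zone index is above the index the allocation context may use.
 * Since ZONE_CMA sits above gfp_zone(GFP_KERNEL), CMA pages are kept
 * out of unmovable reclaim without any migratetype lookup.
 */
static bool reclaim_may_isolate(struct page *page, enum zone_type reclaim_idx)
{
	return page_zonenum(page) <= reclaim_idx;
}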
However, with the MIGRATETYPE approach, none of this can be done
naturally. We need extra checks in all the places (allocator fast
path, reclaim path, compaction, etc.). It is really error-prone and
has already caused many problems. Performance-wise, this approach is
also bad since it requires checking the migratetype for each page.
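For comparison, the kind of per-page test the MIGRATETYPE approach has
to scatter around looks roughly like this (illustration only, not code
from any series):

#include <linux/mmzone.h>
#include <linux/pageblock-flags.h>

/*
 * Illustration only: excluding CMA pages with MIGRATE_CMA means an
 * explicit pageblock bitmap lookup wherever such pages must be
 * filtered (allocator fast path, reclaim, compaction, ...).
 */
static bool page_in_cma_block(struct page *page)
{
#ifdef CONFIG_CMA
	return is_migrate_cma(get_pageblock_migratetype(page));
#else
	return false;
#endif
}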
Moreover, even if we add extra checks, things cannot easily be made
perfect. See 3) Atomic allocation failure problem. It's an inherent
problem if we have different types of memory in a single zone.
We could possibly make things perfect even with the MIGRATETYPE
approach, but it would require more checks in the hotpath than we have
now. That is expensive and undesirable, and it will make future
maintenance of the MM code much more difficult.
This is why I prefer the ZONE approach.
>
> Another worry I would have with the zone approach is that there is a
> risk of reintroducing issues we used to have with small zones in the
> past. Just consider that the CMA area will get depleted by CMA users
> almost completely. Then that zone will not get balanced with only a few
> pages. wakeup_kswapd/pgdat_balanced already has measures to prevent
> unnecessary wake ups, but I cannot say I am confident everything will
> work smoothly.
If there is a small zone problem, it should be fixed in any case.
There are many workloads that allocate memory almost completely
and don't return it to the page allocator.
> I have glanced through the cumulative diff and to be honest I am not
> really sure the result is a great simplification in the end. There is
> still quite a lot of special casing. It is true that the page allocator
The special casing is mostly for initialization. We cannot avoid it
since CMA isn't normal memory. We have just two choices:
1) ZONE: special casing in the initialization phase
2) MIGRATETYPE: special casing at runtime
And I choose 1).
> path is cleaned up and some CMA specific checks are moved away. This is
> definitely good to see but I am not convinced that the new zone is
> really justified. Only very little from the zone infrastructure is used
> in the end AFAICS. Is there any specific usecase which cannot be solved
> with the pageblock while it could be with the zone approach? That would
> be a strong argument to choose one over the other.
As I mentioned above, the atomic allocation failure problem is somewhat
inherent. Another one is described in Vlastimil's reply in another
thread.
lkml.kernel.org/r/[email protected]
>
> Please do _not_ take this as a NAK from me, at least not at this time. I
> am still trying to understand all the consequences, but my intuition
> tells me that building on top of a highmem-like approach will turn out to
> be problematic in the future (as we have already seen with the highmem and
> movable zones), so this needs very prudent consideration.
I can understand that you are cautious about this issue. However, this
has taken more than two years, and many people have already said that
the ZONE approach is the way to go.
As I said before, some problems are due to the S/W limitation of CMA
memory, not to the implementation. Even if we choose the MIGRATETYPE
approach, some of the problems you have seen with the highmem and
movable zones will still exist. So, that should not be a criterion for
the implementation decision.
Thanks.
On Tue 25-04-17 12:42:57, Joonsoo Kim wrote:
> On Mon, Apr 24, 2017 at 03:09:36PM +0200, Michal Hocko wrote:
> > On Mon 17-04-17 11:02:12, Joonsoo Kim wrote:
> > > On Thu, Apr 13, 2017 at 01:56:15PM +0200, Michal Hocko wrote:
> > > > On Wed 12-04-17 10:35:06, Joonsoo Kim wrote:
[...]
> > not for free. For most common configurations where we have ZONE_DMA,
> > ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE all the 3 bits are already
> > consumed so a new zone will need a new one AFAICS.
>
> Yes, it requires one more bit for a new zone and it's handled by the patch.
I am pretty sure that you are aware that consuming new page flag bits
is usually a no-go and something we try to avoid as much as possible
because we are in a great shortage there. So there really has to be a
_strong_ reason if we go that way. My current understanding is that the
whole zone concept is more about a more convenient implementation rather
than a fundamental change which will solve unsolvable problems with the
current approach. More on that below.
[...]
> A MOVABLE allocation will fall back in the following sequence.
>
> ZONE_CMA -> ZONE_MOVABLE -> ZONE_HIGHMEM -> ZONE_NORMAL -> ...
>
> I don't understand what you mean by a CMA allocation. From the MM's
> point of view, there is no CMA allocation; it is just a MOVABLE allocation.
>
> From the device's point of view, there is a CMA allocation. It is a
> range-specific allocation, so it must succeed for the requested range.
> No fallback is allowed in this case.
OK, that answers my question, I guess... My main confusion comes from
__alloc_gigantic_page, which shares alloc_contig_range with the cma
allocation. But from what you wrote above and my quick glance over the
code, __alloc_gigantic_page simply changes the migrate type of the pfn
range and it doesn't move it to the zone CMA. Right?
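To spell out my mental model (a sketch written from memory, so the
argument lists and GFP mask may not match -next exactly; dev_cma,
start_pfn, nr_pages and align are just placeholders):

#include <linux/cma.h>
#include <linux/gfp.h>
#include <linux/mmzone.h>

/* Sketch from memory; the arguments are illustrative placeholders. */
static void contig_alloc_sketch(struct cma *dev_cma, unsigned long start_pfn,
				unsigned long nr_pages, unsigned int align)
{
	struct page *page;
	int ret;

	/*
	 * Gigantic page path: migrate whatever occupies a suitable pfn
	 * range.  The range keeps its zone and its MIGRATE_MOVABLE
	 * pageblocks; nothing is moved into ZONE_CMA.
	 */
	ret = alloc_contig_range(start_pfn, start_pfn + nr_pages,
				 MIGRATE_MOVABLE, GFP_KERNEL);

	/*
	 * CMA path: the device allocates from its pre-reserved area, so
	 * the pfn range is fixed up front and no zone fallback exists.
	 */
	page = cma_alloc(dev_cma, nr_pages, align);

	(void)ret;
	(void)page;
}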
[...]
> > > At a glance, special migratetype sound natural. I also did. However,
> > > it's not natural in implementation POV. Zone consists of the same type
> > > of memory (by definition ?) and MM subsystem is implemented with that
> > > assumption. If difference type of memory shares the same zone, it easily
> > > causes the problem and CMA problems are the such case.
> >
> > But this is not any different from the highmem vs. lowmem problems we
> > already have, no? I have looked at your example in the cover where you
> > mention utilization and the reclaim problems. With the node reclaim we
> > will have pages from all zones on the same LRU(s). isolate_lru_pages
> > will skip those from ZONE_CMA because their zone_idx is higher than
> > gfp_idx(GFP_KERNEL). The same could be achieved by an explicit check for
> > the pageblock migrate type. So the zone doesn't really help much. Or is
> > there some aspect that I am missing?
>
> Your understanding is correct. It could be achieved by an explicit
> check for the migratetype, and that is the main reason we should avoid
> such an approach.
>
> With the ZONE approach, all of this is handled naturally. We don't
> need any explicit checks anywhere; we already have code that skips
> reclaiming such pages by checking the zone_idx.
Yes, and as we have to filter pages anyway, doing so for cma blocks
doesn't sound overly burdensome from the maintenance point of view.
> However, with the MIGRATETYPE approach, none of this can be done
> naturally. We need extra checks in all the places (allocator fast
> path, reclaim path, compaction, etc.). It is really error-prone and
> has already caused many problems. Performance-wise, this approach is
> also bad since it requires checking the migratetype for each page.
>
> Moreover, even if we add extra checks, things cannot easily be made
> perfect.
I see this point and I agree that using a specific zone might be a
_nicer_ solution in the end, but you have to consider other aspects as
well. The main one I am worried about is long-term maintainability.
We are really out of page flags and consuming one for a rather specific
usecase is not good. Look at ZONE_DMA. I am pretty sure that almost
no sane HW needs the 16MB zone anymore, yet we have a hard time getting
rid of it, so we have that memory lying around unused all the time
and blocking one page flag bit. CMA falls into a similar category
AFAIU. I wouldn't be all that surprised if future HW no longer needs CMA
allocations in a few years, yet we will have to fight to get rid of it
like we do with ZONE_DMA. And not only that: we will also have to fight
to find page flags for other, more general usecases in the meantime.
> See 3) Atomic allocation failure problem. It's an inherent problem
> if we have different types of memory in a single zone.
> We could possibly make things perfect even with the MIGRATETYPE
> approach, but it would require more checks in the hotpath than we have
> now. That is expensive and undesirable, and it will make future
> maintenance of the MM code much more difficult.
I believe that the overhead in the hot path is not such a big deal. We
have the means to make it zero via jump labels when CMA is not used. I
assume that the vast majority of systems will not use CMA, and those
systems which do use CMA should be able to cope with some slight
overhead IMHO.
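Something along these lines is what I have in mind: a minimal sketch
assuming a made-up cma_in_use static key that CMA initialization would
switch on once the first area is activated.

#include <linux/jump_label.h>
#include <linux/mmzone.h>
#include <linux/vmstat.h>

/* Made-up key; it would be enabled once a CMA area is reserved. */
static DEFINE_STATIC_KEY_FALSE(cma_in_use);

static inline long nr_free_cma(struct zone *zone)
{
	/* Patched out to a plain fall-through on systems without CMA. */
	if (static_branch_unlikely(&cma_in_use))
		return zone_page_state(zone, NR_FREE_CMA_PAGES);
	return 0;
}

cma_init_reserved_areas(), or wherever the first area is activated,
would presumably do a one-time static_branch_enable(&cma_in_use).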
I agree that the code maintenance cost is not free, and that is a valid
concern. CMA maintenance will not be free in either case, though (if
for nothing else, then for the page flags space mentioned above). Let's
see what this means for mm/page_alloc.c:
mm/page_alloc.c | 220 ++++++++++++++++++++++++++++----------------------------
1 file changed, 109 insertions(+), 111 deletions(-)
Not very convincing at first glance, but this can be quite misleading,
as you have already mentioned, because you have moved a lot of code to
the init path. So let's just focus on the allocator hot paths:
@@ -800,7 +805,7 @@ static inline void __free_one_page(struct page *page,
VM_BUG_ON(migratetype == -1);
if (likely(!is_migrate_isolate(migratetype)))
- __mod_zone_freepage_state(zone, 1 << order, migratetype);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
VM_BUG_ON_PAGE(bad_range(zone, page), page);
@@ -1804,25 +1831,11 @@ static int fallbacks[MIGRATE_TYPES][4] = {
[MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
[MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
-#ifdef CONFIG_CMA
- [MIGRATE_CMA] = { MIGRATE_TYPES }, /* Never used */
-#endif
#ifdef CONFIG_MEMORY_ISOLATION
[MIGRATE_ISOLATE] = { MIGRATE_TYPES }, /* Never used */
#endif
};
-#ifdef CONFIG_CMA
-static struct page *__rmqueue_cma_fallback(struct zone *zone,
- unsigned int order)
-{
- return __rmqueue_smallest(zone, order, MIGRATE_CMA);
-}
-#else
-static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
- unsigned int order) { return NULL; }
-#endif
-
/*
* Move the free pages in a range to the free lists of the requested type.
* Note that start_page and end_pages are not aligned on a pageblock
@@ -2090,8 +2103,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
/* Yoink! */
mt = get_pageblock_migratetype(page);
- if (!is_migrate_highatomic(mt) && !is_migrate_isolate(mt)
- && !is_migrate_cma(mt)) {
+ if (!is_migrate_highatomic(mt) && !is_migrate_isolate(mt)) {
zone->nr_reserved_highatomic += pageblock_nr_pages;
set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
@@ -2235,13 +2247,8 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
retry:
page = __rmqueue_smallest(zone, order, migratetype);
- if (unlikely(!page)) {
- if (migratetype == MIGRATE_MOVABLE)
- page = __rmqueue_cma_fallback(zone, order);
-
- if (!page && __rmqueue_fallback(zone, order, migratetype))
- goto retry;
- }
+ if (unlikely(!page) && __rmqueue_fallback(zone, order, migratetype))
+ goto retry;
trace_mm_page_alloc_zone_locked(page, order, migratetype);
return page;
@@ -2283,9 +2290,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
list_add_tail(&page->lru, list);
list = &page->lru;
alloced++;
- if (is_migrate_cma(get_pcppage_migratetype(page)))
- __mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
- -(1 << order));
}
/*
@@ -2636,10 +2640,10 @@ int __isolate_free_page(struct page *page, unsigned int order)
* exists.
*/
watermark = min_wmark_pages(zone) + (1UL << order);
- if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+ if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
return 0;
- __mod_zone_freepage_state(zone, -(1UL << order), mt);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
}
/* Remove page from free list */
@@ -2655,8 +2659,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
struct page *endpage = page + (1 << order) - 1;
for (; page < endpage; page += pageblock_nr_pages) {
int mt = get_pageblock_migratetype(page);
- if (!is_migrate_isolate(mt) && !is_migrate_cma(mt)
- && !is_migrate_highatomic(mt))
+ if (!is_migrate_isolate(mt) &&
+ !is_migrate_highatomic(mt))
set_pageblock_migratetype(page,
MIGRATE_MOVABLE);
}
@@ -2783,8 +2787,7 @@ struct page *rmqueue(struct zone *preferred_zone,
spin_unlock(&zone->lock);
if (!page)
goto failed;
- __mod_zone_freepage_state(zone, -(1 << order),
- get_pcppage_migratetype(page));
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
zone_statistics(preferred_zone, zone);
@@ -2907,12 +2910,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
else
min -= min / 4;
-#ifdef CONFIG_CMA
- /* If allocation can't use CMA areas don't use free CMA pages */
- if (!(alloc_flags & ALLOC_CMA))
- free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
-
/*
* Check watermarks for an order-0 allocation request. If these
* are not met, then a high-order request also cannot go ahead
@@ -2940,13 +2937,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
if (!list_empty(&area->free_list[mt]))
return true;
}
-
-#ifdef CONFIG_CMA
- if ((alloc_flags & ALLOC_CMA) &&
- !list_empty(&area->free_list[MIGRATE_CMA])) {
- return true;
- }
-#endif
}
return false;
}
@@ -2962,13 +2952,6 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
unsigned long mark, int classzone_idx, unsigned int alloc_flags)
{
long free_pages = zone_page_state(z, NR_FREE_PAGES);
- long cma_pages = 0;
-
-#ifdef CONFIG_CMA
- /* If allocation can't use CMA areas don't use free CMA pages */
- if (!(alloc_flags & ALLOC_CMA))
- cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
/*
* Fast check for order-0 only. If this fails then the reserves
@@ -2977,7 +2960,7 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
* the caller is !atomic then it'll uselessly search the free
* list. That corner case is then slower but it is harmless.
*/
- if (!order && (free_pages - cma_pages) > mark + z->lowmem_reserve[classzone_idx])
+ if (!order && free_pages > mark + z->lowmem_reserve[classzone_idx])
return true;
return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
@@ -3547,10 +3530,6 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
} else if (unlikely(rt_task(current)) && !in_interrupt())
alloc_flags |= ALLOC_HARDER;
-#ifdef CONFIG_CMA
- if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
- alloc_flags |= ALLOC_CMA;
-#endif
return alloc_flags;
}
@@ -3972,9 +3951,6 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
if (should_fail_alloc_page(gfp_mask, order))
return false;
- if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE)
- *alloc_flags |= ALLOC_CMA;
-
return true;
}
This looks like a nice clean up. Those ifdefs are ugly as hell. One
could argue that some of that could be cleaned up by simply adding some
helpers (with a jump label to reduce the overhead), though. But is this
really a strong enough reason to bring the whole zone in? I am not really
convinced, to be honest.
[...]
> > Please do _not_ take this as a NAK from me, at least not at this time. I
> > am still trying to understand all the consequences, but my intuition
> > tells me that building on top of a highmem-like approach will turn out to
> > be problematic in the future (as we have already seen with the highmem and
> > movable zones), so this needs very prudent consideration.
>
> I can understand that you are cautious about this issue. However, this
> has taken more than two years, and many people have already said that
> the ZONE approach is the way to go.
I can see a single Acked-by and one Reviewed-by. It would be much more
convincing to see much larger support. Don't get me wrong, I am not
trying to undermine the feedback so far, but we should be clear about one
thing. CMA is mostly motivated by the industry, which tries to overcome
HW limitations which can change very easily in the future. I would rather
see a good enough solution for something like that than a nicer solution
which pushes an additional burden onto more general usecases.
That being said, I would like to see a much larger consensus in the MM
community before a new zone is merged. I remain very skeptical that this
is the right direction, though.
--
Michal Hocko
SUSE Labs
On 27/04/17 18:06, Michal Hocko wrote:
> On Tue 25-04-17 12:42:57, Joonsoo Kim wrote:
[...]
>> Yes, it requires one more bit for a new zone and it's handled by the patch.
>
> I am pretty sure that you are aware that consuming new page flag bits
> is usually a no-go and something we try to avoid as much as possible
> because we are in a great shortage there. So there really has to be a
> _strong_ reason if we go that way. My current understanding is that the
> whole zone concept is more about a more convenient implementation rather
> than a fundamental change which will solve unsolvable problems with the
> current approach. More on that below.
Since I am in a similar situation, I think it's better if I join this
conversation instead of going through the same in a separate thread.
In this regard, I have a few observations (are they correct?):
* not everyone seems to be interested in having all the current
zones active simultaneously
* some zones are not even meaningful on certain architectures or
platforms
* some architectures/platforms that are 64 bits would have no penalty
in dealing with a larger data type.
So I wonder, would anybody be against this:
* within the 32bits constraint, define some optional zones
* decouple the specific position of a bit from the zone it represents;
iow: if the zone is enabled, ensure that it gets a bit in the mask,
but do not make promises about which one it is, provided that the
corresponding macros work properly
* ensure that if one selects more optional zones than there are bits
available (in the case of a 32bits mask), an error is produced at
compile time
* if one is happy to have a 64bits type, allow for as many zones as
it's possible to fit, or anyway more than what is possible with
the 32 bit mask.
I think I can re-factor the code so that there is no runtime performance
degradation, if there is no immediate objection to what I described. Or
maybe I failed to notice some obvious pitfall?
From what I see, there seems to be a lot of interest in using functions
like kmalloc/vmalloc with the ability to specify pseudo-custom areas to
tap into.
Why not, as long as those who do not need it are not negatively impacted?
I understand that if the association between bits and zones is fixed,
then suddenly bits become very precious stuff, but if they could be used
in a more efficient way, then maybe they could be used more liberally.
The alternative is to keep getting requests for new zones and turning
them away because they do not clear the bar of being extremely critical,
even if they would indeed simplify people's lives.
The change shouldn't be too ugly if I do something along the lines of
the pseudo code below.
Note: the #ifdefs would be mainly concentrated in the declaration part.
enum gfp_zone_shift {
#if IS_ENABLED(CONFIG_ZONE_DMA)
/*I haven't checked if this is the correct name, but it gives the idea*/
ZONE_DMA_SHIFT = 0,
#endif
#if IS_ENABLED(CONFIG_ZONE_HIGHMEM)
ZONE_HIGHMEM_SHIFT,
#endif
#if IS_ENABLED(CONFIG_ZONE_DMA32)
ZONE_DMA32_SHIFT,
#endif
#if IS_ENABLED(CONFIG_ZONE_xxx)
ZONE_xxx,
#endif
NON_OPTIONAL_ZONE_SHIFT,
...
USED_ZONES_NUMBER,
ZONE_MOVABLE_SHIFT = USED_ZONES_NUMBER,
...
};
#if USED_ZONES_NUMBER < MAX_ZONES_32BITS
typedef uint32_t gfp_zones_t;
#elif IS_ENABLED(CONFIG_ZONES_64BITS)
typedef uint64_t gfp_zones_t;
#else
#error
#endif
The type should be adjusted in other places where it is used, but I
didn't find too many occurrences.
#define __ZONE_DMA \
(((gfp_zones_t)IS_ENABLED(CONFIG_ZONE_DMA)) << \
(ZONE_DMA_SHIFT - 0))
[rinse and repeat]
Code referring to these optional zones can be sandboxed in
#if IS_ENABLED(CONFIG_ZONE_DMA)
static inline void do_something_dma(void)
{
	/* ... */
}
#else
#define do_something_dma()
#endif
Or equivalent, effectively removing many #ifdefs from the main code of
functions like those called by kmalloc.
So, would this approach stand a chance?
thanks, igor
I didn't read this thoroughly yet because I will be travelling shortly,
but this point alone just made me ask, because it seems there is some
misunderstanding.
On Fri 28-04-17 11:04:27, Igor Stoppa wrote:
[...]
> * if one is happy to have a 64bits type, allow for as many zones as
> it's possible to fit, or anyway more than what is possible with
> the 32 bit mask.
zones are currently placed in struct page::flags. And that already is
64b size on 64b arches. And we do not really have any spare room there.
We encode page flags, zone id, numa_nid/sparse section_nr there. How can
you add more without enlarging the struct page itself or using external
means to store the same information (page_ext comes to mind)? Even if
the latter were possible, note that page_zone() is used in many
performance sensitive paths and making it perform well with special
casing would be far from trivial.
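For reference, the lookup is essentially this (paraphrased from
include/linux/mm.h, with the config-dependent shift/mask plumbing
omitted):

#include <linux/mm.h>
#include <linux/mmzone.h>

/*
 * Paraphrase of the current lookup: the zone id is a tiny bitfield
 * packed into page->flags next to the node/section id and the page
 * flag bits, so every extra zone bit competes with those users.
 */
static inline enum zone_type page_zonenum_sketch(const struct page *page)
{
	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}

static inline struct zone *page_zone_sketch(const struct page *page)
{
	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum_sketch(page)];
}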
--
Michal Hocko
SUSE Labs
On 28/04/17 11:36, Michal Hocko wrote:
> I didn't read this thoroughly yet because I will be travelling shortly,
ok, thanks for bearing with me =)
> but
> this point alone just made me ask, because it seems there is some
> misunderstanding
It is possible; so far I have made some changes, but I have not
completed the whole conversion.
> On Fri 28-04-17 11:04:27, Igor Stoppa wrote:
> [...]
>> * if one is happy to have a 64bits type, allow for as many zones as
>> it's possible to fit, or anyway more than what is possible with
>> the 32 bit mask.
>
> zones are currently placed in struct page::flags. And that already is
> 64b size on 64b arches.
Ok, the issues I had so far were related to the enum for zones being
treated as 32b.
> And we do not really have any spare room there.
> We encode page flags, zone id, numa_nid/sparse section_nr there. How can
> you add more without enlarging the struct page itself or using external
> means to store the same information (page_ext comes to mind)?
Then I'll be conservative and assume I can't, unless I can prove otherwise.
There is still the possibility I mentioned of loosely coupling DMA,
DMA32 and HIGHMEM with the bits currently reserved for them, right?
If my system doesn't use those zones as such, because it doesn't
have/need them, those bits are wasted for me. Otoh someone else is
probably not interested in what I'm after but needs one or more of those
zones.
Making the meaning of the bits configurable should still be a viable
option. It doesn't alter their number, just their purpose on a specific
build.
> Even if
> the latter were possible, note that page_zone() is used in many
> performance sensitive paths and making it perform well with special
> casing would be far from trivial.
If the solution I propose is acceptable, I'm willing to bite the bullet
and go for implementing the conversion.
In my case I really would like to be able to use kmalloc, because it
would provide an easy path to also convert other portions of the kernel,
besides SE Linux.
I suspect I would encounter overall far less resistance if the type of
change I propose is limited to:
s/GFP_KERNEL/GFP_LOCKABLE/
And if I can guarantee that GFP_LOCKABLE falls back to GFP_KERNEL when
the "lockable" feature is not enabled.
--
thanks, igor