This is essentially a resubmission of v3, rebased and with a
rewritten cover letter that aims to clarify the submission based
on feedback and follow-on discussion. The individual patches
have not materially changed.
The Linux Memory Management system (MM) has long supported the
concept of movable memory. It takes advantage of address
abstraction to allow the data held in physical memory to be
moved to a different physical address or other form of storage
without the user of the abstracted (i.e. virtual) address
needing to be aware. This is generally the foundation of user
space memory and the basic service the kernel provides to
applications.
On the other hand, the kernel itself is generally not tolerant
of the movement of data that it accesses, so most of its usage
is of unmovable memory. It may be useful to understand that this
terminology is relative to the kernel's perspective and that
what the kernel considers unmovable memory may in fact be moved
by a hypervisor that hosts the kernel, but an additional address
abstraction must exist to keep the kernel unaware of such
movement.
The MM supports the conversion of free memory between MOVABLE
and UNMOVABLE (and other) migration types to allow better
sharing of memory resources. More recently, the MM introduced
"movablecore" memory that should never be made UNMOVABLE. As an
implementation detail "movablecore" memory introduced the
ZONE_MOVABLE zone to manage this type of memory and significant
progress has been made to ensure the movability of memory in
this zone with the few exceptions now documented in
include/linux/mmzone.h.
"Movablecore" memory can support multiple use cases including
dynamic allocation of hugetlbfs pages, but an imbalance of
"movablecore" memory and kernel memory can lead to serious
consequences for kernel operation, which is why the kernel
parameter includes the warning "the administrator must be
careful that the amount of memory usable for all allocations is
not too small."
Designated Movable Blocks represent a generic extension of the
"movablecore" concept to allow specific blocks of memory to be
designated part of the "movablecore" to provide support for
additional use cases. For example, it could have been (and could
still be) used to support memory hot unplugging. A very
similar concept was proposed in [1] for that purpose, and
revised in [2], but ultimately a more use-case-specific
implementation of the movable_node parameter was accepted. That
implementation depends on NUMA, ACPI, and SRAT tables, which
narrows its usefulness. Designated Movable Blocks allow the
same type of discontiguous and non-monotonic configuration of
ZONE_MOVABLE on systems whether or not they support NUMA, ACPI,
or SRAT tables. Specifically, this feature is desired by users of
the arm64 Android GKI common kernel on Broadcom SoCs where NUMA
is not available. These patches make minimal additions to
existing code to offer a controllable "movablecore" feature to
those systems.
Like all "movablecore" memory, Designated Movable Blocks are not
created by default. They are only created when specified, and
the warning on the "movablecore" kernel parameter remains just
as relevant.
The key feature of "movablecore" memory is that any allocations
of it by the kernel page allocator must be movable, and this has
the follow-on effect that GFP_MOVABLE allocation requests look to
"movablecore" memory first. This prioritizes the use of
"movablecore" memory by user processes, though the kernel can
conceivably use the memory as long as movability can be
preserved.
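To illustrate (this snippet is purely hypothetical and not part
of these patches), a movable request from kernel code is simply
an ordinary page allocation made with the movable GFP flags,
which the allocator may then satisfy from "movablecore" memory:

#include <linux/gfp.h>

/*
 * Illustrative only: a movable, highmem-capable allocation that
 * the page allocator may satisfy from ZONE_MOVABLE
 * ("movablecore") memory when it is available.
 */
static struct page *example_movable_alloc(void)
{
        return alloc_pages(GFP_HIGHUSER_MOVABLE, 0);
}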
One use case of interest to customers of Broadcom SoCs with
multiple memory controllers is improved memory bandwidth
utilization for multi-threaded, user space dominant workloads.
Designated Movable Blocks can be located on each memory
controller and the page_alloc.shuffle=1 kernel parameter can be
applied to provide a simplistic software-based memory channel
interleaving of accesses from user space across the multiple
memory controllers. Experiments using this approach with a dummy
workload [3] on a BCM7278 dual memory controller system with 1GB
of RAM on each controller (i.e. 2GB total RAM) and using the
kernel parameters "movablecore=300M@0x60000000,300M@0x320000000
page_alloc.shuffle=1" showed a more than 20% performance
improvement over a system without this feature using either
"movablecore=600M" or no "movablecore" kernel parameter.
Another use case of interest is to add broader support for the
"reusable" parameter for reserved-memory device tree nodes. The
Designated Movable Block extension of movablecore would allow
designation of the location as well as ownership of the block.
A device driver that owns a reusable reserved-memory node would
own the underlying portion of a Designated Movable Block and
could reclaim memory from the OS on demand for exclusive use by
the device, in a manner similar to memory hot unplugging. The
existing alloc/free_contig_range functions could be used to
support this or a different API could be developed. This use
case is mentioned for consideration, but an implementation is
not part of this submission.
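As a rough sketch of that direction (hypothetical and not part of
this series; the function names and pfn bookkeeping are invented
for illustration), a driver could reclaim and later return a
pageblock-aligned range within its Designated Movable Block using
the existing interfaces:

#include <linux/gfp.h>

/*
 * Hypothetical sketch only: reclaim a pageblock-aligned pfn range
 * that lies within a driver-owned Designated Movable Block, and
 * later return it to the page allocator. Error handling and the
 * bookkeeping of dmb_start_pfn/dmb_end_pfn are assumed to be
 * driver specific.
 */
static int example_dmb_reclaim(unsigned long dmb_start_pfn,
                               unsigned long dmb_end_pfn)
{
        /* migrate any movable users out of the range and claim it */
        return alloc_contig_range(dmb_start_pfn, dmb_end_pfn,
                                  MIGRATE_MOVABLE, GFP_KERNEL);
}

static void example_dmb_release(unsigned long dmb_start_pfn,
                                unsigned long dmb_end_pfn)
{
        /* give the pages back for movable allocations */
        free_contig_range(dmb_start_pfn, dmb_end_pfn - dmb_start_pfn);
}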
There have also been efforts to reduce the amount of memory
CMA holds in reserve (e.g. [4]). Adding the ability to place a
CMA pool in a Designated Movable Block could offer an option to
improve memory utilization when increased allocation latency can
be tolerated, but again such an implementation is not part of
this submission.
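For a sense of what that might look like (again, purely a
hypothetical sketch and not an implementation; whether
cma_init_reserved_mem() can be reused unchanged for DMB-backed
ranges is an open question), a CMA area could be registered over
a reserved range that has also been designated movable:

#include <linux/cma.h>
#include <linux/dmb.h>
#include <linux/memblock.h>

static struct cma *example_cma;

/*
 * Hypothetical sketch only: declare a CMA pool over a
 * pageblock-aligned range that is also a Designated Movable Block.
 */
static int __init example_cma_in_dmb(phys_addr_t base, phys_addr_t size)
{
        int ret;

        ret = memblock_reserve(base, size);     /* reserve the range */
        if (ret)
                return ret;

        ret = dmb_reserve(base, size, NULL);    /* designate it movable */
        if (ret)
                return ret;

        /* register the same range as a CMA area (order_per_bit = 0) */
        return cma_init_reserved_mem(base, size, 0, "example",
                                     &example_cma);
}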
Changes in v4:
- rewrote the cover letter in an attempt to provide clarity
and encourage review.
- rebased to akpm-mm/master (i.e. Linux 6.3-rc1).
Changes in v3:
- removed OTHER OPPORTUNITIES and NOTES from the cover letter.
- prevent the creation of empty zones instead of adding extra
info to zoneinfo.
- size the ZONE_MOVABLE span to the minimum necessary to cover
pages within the zone to be more intuitive.
- removed "real" from variable names that were consolidated.
- rebased to akpm-mm/master (i.e. Linux 6.1-rc1).
Changes in v2:
- first three commits upstreamed separately.
- commits 04-06 submitted separately.
- Corrected errors "Reported-by: kernel test robot <[email protected]>"
- Deferred commits after 15 to simplify review of the base
functionality.
- minor reorganization of commit 13.
v3: https://lore.kernel.org/lkml/[email protected]/
v2: https://lore.kernel.org/linux-mm/[email protected]/
v1: https://lore.kernel.org/linux-mm/[email protected]/
[1] https://lwn.net/Articles/543790/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/lkml/[email protected]/
[4] https://lore.kernel.org/linux-mm/[email protected]/
Doug Berger (9):
lib/show_mem.c: display MovableOnly
mm/page_alloc: calculate node_spanned_pages from pfns
mm/page_alloc: prevent creation of empty zones
mm/page_alloc.c: allow oversized movablecore
mm/page_alloc: introduce init_reserved_pageblock()
memblock: introduce MEMBLOCK_MOVABLE flag
mm/dmb: Introduce Designated Movable Blocks
mm/page_alloc: make alloc_contig_pages DMB aware
mm/page_alloc: allow base for movablecore
.../admin-guide/kernel-parameters.txt | 14 +-
include/linux/dmb.h | 29 +++
include/linux/gfp.h | 5 +-
include/linux/memblock.h | 8 +
lib/show_mem.c | 2 +-
mm/Kconfig | 12 ++
mm/Makefile | 1 +
mm/cma.c | 15 +-
mm/dmb.c | 91 +++++++++
mm/memblock.c | 30 ++-
mm/page_alloc.c | 188 +++++++++++++-----
11 files changed, 338 insertions(+), 57 deletions(-)
create mode 100644 include/linux/dmb.h
create mode 100644 mm/dmb.c
--
2.34.1
The comment for commit c78e93630d15 ("mm: do not walk all of
system memory during show_mem") indicates it "also corrects the
reporting of HighMem as HighMem/MovableOnly as ZONE_MOVABLE has
similar problems to HighMem with respect to lowmem/highmem
exhaustion."
Presuming the similar problems are with regard to the general
exclusion of kernel allocations from either zone, I believe it
makes sense to include all ZONE_MOVABLE memory even on systems
without HighMem.
To the extent that this was the intent of the original commit, I
have included a "Fixes" tag, but it seems unnecessary to submit
this to linux-stable.
Fixes: c78e93630d15 ("mm: do not walk all of system memory during show_mem")
Signed-off-by: Doug Berger <[email protected]>
---
lib/show_mem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/show_mem.c b/lib/show_mem.c
index 0d7585cde2a6..6a632b0c35c5 100644
--- a/lib/show_mem.c
+++ b/lib/show_mem.c
@@ -27,7 +27,7 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
total += zone->present_pages;
reserved += zone->present_pages - zone_managed_pages(zone);
- if (is_highmem_idx(zoneid))
+ if (zoneid == ZONE_MOVABLE || is_highmem_idx(zoneid))
highmem += zone->present_pages;
}
}
--
2.34.1
Since the start and end pfns of the node are passed as arguments
to calculate_node_totalpages(), they might as well be used to
specify the node_spanned_pages value for the node rather than
accumulating the spans of member zones.
This prevents the need for additional adjustments if zones are
allowed to overlap.
The realtotalpages name is reverted to just totalpages to reduce
the burden of supporting multiple realities.
Signed-off-by: Doug Berger <[email protected]>
---
mm/page_alloc.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ac1fc986af44..b1952f86ab6d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7586,7 +7586,7 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
unsigned long node_start_pfn,
unsigned long node_end_pfn)
{
- unsigned long realtotalpages = 0, totalpages = 0;
+ unsigned long totalpages = 0;
enum zone_type i;
for (i = 0; i < MAX_NR_ZONES; i++) {
@@ -7617,13 +7617,12 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
zone->present_early_pages = real_size;
#endif
- totalpages += size;
- realtotalpages += real_size;
+ totalpages += real_size;
}
- pgdat->node_spanned_pages = totalpages;
- pgdat->node_present_pages = realtotalpages;
- pr_debug("On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages);
+ pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;
+ pgdat->node_present_pages = totalpages;
+ pr_debug("On node %d totalpages: %lu\n", pgdat->node_id, totalpages);
}
#ifndef CONFIG_SPARSEMEM
--
2.34.1
If none of the pages that a zone spans are present, then its
start pfn and span should be zeroed to prevent initialization.
This prevents the creation of an empty zone if all of its pages
are moved to a zone that would overlap it.
The real_size name is reverted to just size to reduce the burden
of supporting multiple realities.
Signed-off-by: Doug Berger <[email protected]>
---
mm/page_alloc.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b1952f86ab6d..827b4bfef625 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7592,8 +7592,7 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
for (i = 0; i < MAX_NR_ZONES; i++) {
struct zone *zone = pgdat->node_zones + i;
unsigned long zone_start_pfn, zone_end_pfn;
- unsigned long spanned, absent;
- unsigned long size, real_size;
+ unsigned long spanned, absent, size;
spanned = zone_spanned_pages_in_node(pgdat->node_id, i,
node_start_pfn,
@@ -7604,20 +7603,21 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
node_start_pfn,
node_end_pfn);
- size = spanned;
- real_size = size - absent;
+ size = spanned - absent;
- if (size)
+ if (size) {
zone->zone_start_pfn = zone_start_pfn;
- else
+ } else {
+ spanned = 0;
zone->zone_start_pfn = 0;
- zone->spanned_pages = size;
- zone->present_pages = real_size;
+ }
+ zone->spanned_pages = spanned;
+ zone->present_pages = size;
#if defined(CONFIG_MEMORY_HOTPLUG)
- zone->present_early_pages = real_size;
+ zone->present_early_pages = size;
#endif
- totalpages += real_size;
+ totalpages += size;
}
pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;
--
2.34.1
Now that the error in computation of corepages has been corrected
by commit 9fd745d450e7 ("mm: fix overflow in
find_zone_movable_pfns_for_nodes()"), oversized specifications of
movablecore will result in a zero value for required_kernelcore if
kernelcore is not also specified.
It is unintuitive for such a request to lead to no ZONE_MOVABLE
memory when the kernel parameters are clearly requesting some.
The current behavior when requesting an oversized kernelcore is to
classify all of the pages in movable_zone as kernelcore. The new
behavior when requesting an oversized movablecore (when not also
specifying kernelcore) is to similarly classify all of the pages
in movable_zone as movablecore.
Signed-off-by: Doug Berger <[email protected]>
---
mm/page_alloc.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 827b4bfef625..e574c6a79e2f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8166,13 +8166,13 @@ static void __init find_zone_movable_pfns_for_nodes(void)
corepages = totalpages - required_movablecore;
required_kernelcore = max(required_kernelcore, corepages);
+ } else if (!required_kernelcore) {
+ /* If kernelcore was not specified, there is no ZONE_MOVABLE */
+ goto out;
}
- /*
- * If kernelcore was not specified or kernelcore size is larger
- * than totalpages, there is no ZONE_MOVABLE.
- */
- if (!required_kernelcore || required_kernelcore >= totalpages)
+ /* If kernelcore size exceeds totalpages, there is no ZONE_MOVABLE */
+ if (required_kernelcore >= totalpages)
goto out;
/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
--
2.34.1
Most of the implementation of init_cma_reserved_pageblock() is
common to the initialization of any reserved pageblock for use
by the page allocator.
This commit breaks that functionality out into the new common
function init_reserved_pageblock() for use by code other than
CMA. The CMA-specific code is relocated from page_alloc to the
point where init_cma_reserved_pageblock() was invoked, and the
new function is used there instead. The error path is also
updated to use the function to operate on pageblocks rather
than pages.
Signed-off-by: Doug Berger <[email protected]>
---
include/linux/gfp.h | 5 +----
mm/cma.c | 15 +++++++++++----
mm/page_alloc.c | 8 ++------
3 files changed, 14 insertions(+), 14 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 65a78773dcca..a7892b3c436b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -361,9 +361,6 @@ extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
#endif
void free_contig_range(unsigned long pfn, unsigned long nr_pages);
-#ifdef CONFIG_CMA
-/* CMA stuff */
-extern void init_cma_reserved_pageblock(struct page *page);
-#endif
+extern void init_reserved_pageblock(struct page *page);
#endif /* __LINUX_GFP_H */
diff --git a/mm/cma.c b/mm/cma.c
index a7263aa02c92..cc462df68781 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -31,6 +31,7 @@
#include <linux/highmem.h>
#include <linux/io.h>
#include <linux/kmemleak.h>
+#include <linux/page-isolation.h>
#include <trace/events/cma.h>
#include "cma.h"
@@ -116,8 +117,13 @@ static void __init cma_activate_area(struct cma *cma)
}
for (pfn = base_pfn; pfn < base_pfn + cma->count;
- pfn += pageblock_nr_pages)
- init_cma_reserved_pageblock(pfn_to_page(pfn));
+ pfn += pageblock_nr_pages) {
+ struct page *page = pfn_to_page(pfn);
+
+ set_pageblock_migratetype(page, MIGRATE_CMA);
+ init_reserved_pageblock(page);
+ page_zone(page)->cma_pages += pageblock_nr_pages;
+ }
spin_lock_init(&cma->lock);
@@ -133,8 +139,9 @@ static void __init cma_activate_area(struct cma *cma)
out_error:
/* Expose all pages to the buddy, they are useless for CMA. */
if (!cma->reserve_pages_on_error) {
- for (pfn = base_pfn; pfn < base_pfn + cma->count; pfn++)
- free_reserved_page(pfn_to_page(pfn));
+ for (pfn = base_pfn; pfn < base_pfn + cma->count;
+ pfn += pageblock_nr_pages)
+ init_reserved_pageblock(pfn_to_page(pfn));
}
totalcma_pages -= cma->count;
cma->count = 0;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e574c6a79e2f..da1af678995b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2308,9 +2308,8 @@ void __init page_alloc_init_late(void)
set_zone_contiguous(zone);
}
-#ifdef CONFIG_CMA
-/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
-void __init init_cma_reserved_pageblock(struct page *page)
+/* Free whole pageblock */
+void __init init_reserved_pageblock(struct page *page)
{
unsigned i = pageblock_nr_pages;
struct page *p = page;
@@ -2320,14 +2319,11 @@ void __init init_cma_reserved_pageblock(struct page *page)
set_page_count(p, 0);
} while (++p, --i);
- set_pageblock_migratetype(page, MIGRATE_CMA);
set_page_refcounted(page);
__free_pages(page, pageblock_order);
adjust_managed_page_count(page, pageblock_nr_pages);
- page_zone(page)->cma_pages += pageblock_nr_pages;
}
-#endif
/*
* The order of subdivision here is critical for the IO subsystem.
--
2.34.1
The MEMBLOCK_MOVABLE flag is introduced to designate a memblock
as only supporting movable allocations by the page allocator.
Signed-off-by: Doug Berger <[email protected]>
---
include/linux/memblock.h | 8 ++++++++
mm/memblock.c | 24 ++++++++++++++++++++++++
2 files changed, 32 insertions(+)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 50ad19662a32..8eb3ca32dfa7 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -47,6 +47,7 @@ enum memblock_flags {
MEMBLOCK_MIRROR = 0x2, /* mirrored region */
MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
+ MEMBLOCK_MOVABLE = 0x10, /* designated movable block */
};
/**
@@ -125,6 +126,8 @@ int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
+int memblock_mark_movable(phys_addr_t base, phys_addr_t size);
+int memblock_clear_movable(phys_addr_t base, phys_addr_t size);
void memblock_free_all(void);
void memblock_free(void *ptr, size_t size);
@@ -265,6 +268,11 @@ static inline bool memblock_is_driver_managed(struct memblock_region *m)
return m->flags & MEMBLOCK_DRIVER_MANAGED;
}
+static inline bool memblock_is_movable(struct memblock_region *m)
+{
+ return m->flags & MEMBLOCK_MOVABLE;
+}
+
int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
unsigned long *end_pfn);
void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
diff --git a/mm/memblock.c b/mm/memblock.c
index 25fd0626a9e7..794a099ec3e2 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -992,6 +992,30 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP);
}
+/**
+ * memblock_mark_movable - Mark designated movable block with MEMBLOCK_MOVABLE.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_mark_movable(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_setclr_flag(base, size, 1, MEMBLOCK_MOVABLE);
+}
+
+/**
+ * memblock_clear_movable - Clear flag MEMBLOCK_MOVABLE for a specified region.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_clear_movable(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_setclr_flag(base, size, 0, MEMBLOCK_MOVABLE);
+}
+
static bool should_skip_region(struct memblock_type *type,
struct memblock_region *m,
int nid, int flags)
--
2.34.1
Designated Movable Blocks are blocks of memory that are composed
of one or more adjacent memblocks that have the MEMBLOCK_MOVABLE
designation. These blocks must be reserved before receiving that
designation and will be located in the ZONE_MOVABLE zone rather
than any other zone that may span them.
Signed-off-by: Doug Berger <[email protected]>
---
include/linux/dmb.h | 29 ++++++++++++++
mm/Kconfig | 12 ++++++
mm/Makefile | 1 +
mm/dmb.c | 91 +++++++++++++++++++++++++++++++++++++++++++
mm/memblock.c | 6 ++-
mm/page_alloc.c | 95 ++++++++++++++++++++++++++++++++++++++-------
6 files changed, 220 insertions(+), 14 deletions(-)
create mode 100644 include/linux/dmb.h
create mode 100644 mm/dmb.c
diff --git a/include/linux/dmb.h b/include/linux/dmb.h
new file mode 100644
index 000000000000..fa2976c0fa21
--- /dev/null
+++ b/include/linux/dmb.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __DMB_H__
+#define __DMB_H__
+
+#include <linux/memblock.h>
+
+/*
+ * the buddy -- especially pageblock merging and alloc_contig_range()
+ * -- can deal with only some pageblocks of a higher-order page being
+ * MIGRATE_MOVABLE, we can use pageblock_nr_pages.
+ */
+#define DMB_MIN_ALIGNMENT_PAGES pageblock_nr_pages
+#define DMB_MIN_ALIGNMENT_BYTES (PAGE_SIZE * DMB_MIN_ALIGNMENT_PAGES)
+
+enum {
+ DMB_DISJOINT = 0,
+ DMB_INTERSECTS,
+ DMB_MIXED,
+};
+
+struct dmb;
+
+extern int dmb_intersects(unsigned long spfn, unsigned long epfn);
+
+extern int dmb_reserve(phys_addr_t base, phys_addr_t size,
+ struct dmb **res_dmb);
+extern void dmb_init_region(struct memblock_region *region);
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 4751031f3f05..85ac5f136487 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -913,6 +913,18 @@ config CMA_AREAS
If unsure, leave the default value "7" in UMA and "19" in NUMA.
+config DMB_COUNT
+ int "Maximum count of Designated Movable Blocks"
+ default 19 if NUMA
+ default 7
+ help
+ Designated Movable Blocks are blocks of memory that can be used
+ by the page allocator exclusively for movable pages. They are
+ managed in ZONE_MOVABLE but may overlap with other zones. This
+ parameter sets the maximum number of DMBs in the system.
+
+ If unsure, leave the default value "7" in UMA and "19" in NUMA.
+
config MEM_SOFT_DIRTY
bool "Track memory changes"
depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY && PROC_FS
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..824be8fb11cd 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -67,6 +67,7 @@ obj-y += page-alloc.o
obj-y += init-mm.o
obj-y += memblock.o
obj-y += $(memory-hotplug-y)
+obj-y += dmb.o
ifdef CONFIG_MMU
obj-$(CONFIG_ADVISE_SYSCALLS) += madvise.o
diff --git a/mm/dmb.c b/mm/dmb.c
new file mode 100644
index 000000000000..f6c4e2662e0f
--- /dev/null
+++ b/mm/dmb.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Designated Movable Block
+ */
+
+#define pr_fmt(fmt) "dmb: " fmt
+
+#include <linux/dmb.h>
+
+struct dmb {
+ unsigned long start_pfn;
+ unsigned long end_pfn;
+};
+
+static struct dmb dmb_areas[CONFIG_DMB_COUNT];
+static unsigned int dmb_area_count;
+
+int dmb_intersects(unsigned long spfn, unsigned long epfn)
+{
+ int i;
+ struct dmb *dmb;
+
+ if (spfn >= epfn)
+ return DMB_DISJOINT;
+
+ for (i = 0; i < dmb_area_count; i++) {
+ dmb = &dmb_areas[i];
+ if (spfn >= dmb->end_pfn)
+ continue;
+ if (epfn <= dmb->start_pfn)
+ return DMB_DISJOINT;
+ if (spfn >= dmb->start_pfn && epfn <= dmb->end_pfn)
+ return DMB_INTERSECTS;
+ else
+ return DMB_MIXED;
+ }
+
+ return DMB_DISJOINT;
+}
+EXPORT_SYMBOL(dmb_intersects);
+
+int __init dmb_reserve(phys_addr_t base, phys_addr_t size,
+ struct dmb **res_dmb)
+{
+ struct dmb *dmb;
+
+ /* Sanity checks */
+ if (!size || !memblock_is_region_reserved(base, size))
+ return -EINVAL;
+
+ /* ensure minimal alignment required by mm core */
+ if (!IS_ALIGNED(base | size, DMB_MIN_ALIGNMENT_BYTES))
+ return -EINVAL;
+
+ if (dmb_area_count == ARRAY_SIZE(dmb_areas)) {
+ pr_warn("Not enough slots for DMB reserved regions!\n");
+ return -ENOSPC;
+ }
+
+ /*
+ * Each reserved area must be initialised later, when more kernel
+ * subsystems (like slab allocator) are available.
+ */
+ dmb = &dmb_areas[dmb_area_count++];
+
+ dmb->start_pfn = PFN_DOWN(base);
+ dmb->end_pfn = PFN_DOWN(base + size);
+ if (res_dmb)
+ *res_dmb = dmb;
+
+ memblock_mark_movable(base, size);
+ return 0;
+}
+
+void __init dmb_init_region(struct memblock_region *region)
+{
+ unsigned long pfn;
+ int i;
+
+ for (pfn = memblock_region_memory_base_pfn(region);
+ pfn < memblock_region_memory_end_pfn(region);
+ pfn += pageblock_nr_pages) {
+ struct page *page = pfn_to_page(pfn);
+
+ for (i = 0; i < pageblock_nr_pages; i++)
+ set_page_zone(page + i, ZONE_MOVABLE);
+
+ /* free reserved pageblocks to page allocator */
+ init_reserved_pageblock(page);
+ }
+}
diff --git a/mm/memblock.c b/mm/memblock.c
index 794a099ec3e2..3db06288a5c0 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -16,6 +16,7 @@
#include <linux/kmemleak.h>
#include <linux/seq_file.h>
#include <linux/memblock.h>
+#include <linux/dmb.h>
#include <asm/sections.h>
#include <linux/io.h>
@@ -2103,13 +2104,16 @@ static void __init memmap_init_reserved_pages(void)
for_each_reserved_mem_range(i, &start, &end)
reserve_bootmem_region(start, end);
- /* and also treat struct pages for the NOMAP regions as PageReserved */
for_each_mem_region(region) {
+ /* treat struct pages for the NOMAP regions as PageReserved */
if (memblock_is_nomap(region)) {
start = region->base;
end = start + region->size;
reserve_bootmem_region(start, end);
}
+ /* move Designated Movable Block pages to ZONE_MOVABLE */
+ if (memblock_is_movable(region))
+ dmb_init_region(region);
}
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index da1af678995b..26846a9a9fc4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -76,6 +76,7 @@
#include <linux/khugepaged.h>
#include <linux/buffer_head.h>
#include <linux/delayacct.h>
+#include <linux/dmb.h>
#include <asm/sections.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -414,6 +415,8 @@ static unsigned long required_kernelcore __initdata;
static unsigned long required_kernelcore_percent __initdata;
static unsigned long required_movablecore __initdata;
static unsigned long required_movablecore_percent __initdata;
+static unsigned long min_dmb_pfn[MAX_NUMNODES] __initdata;
+static unsigned long max_dmb_pfn[MAX_NUMNODES] __initdata;
static unsigned long zone_movable_pfn[MAX_NUMNODES] __initdata;
bool mirrored_kernelcore __initdata_memblock;
@@ -2171,7 +2174,7 @@ static int __init deferred_init_memmap(void *data)
}
zone_empty:
/* Sanity check that the next zone really is unpopulated */
- WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
+ WARN_ON(++zid < ZONE_MOVABLE && populated_zone(++zone));
pr_info("node %d deferred pages initialised in %ums\n",
pgdat->node_id, jiffies_to_msecs(jiffies - start));
@@ -7022,6 +7025,10 @@ static void __init memmap_init_zone_range(struct zone *zone,
unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
int nid = zone_to_nid(zone), zone_id = zone_idx(zone);
+ /* Skip overlap of ZONE_MOVABLE */
+ if (zone_id == ZONE_MOVABLE && zone_start_pfn < *hole_pfn)
+ zone_start_pfn = *hole_pfn;
+
start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);
@@ -7482,6 +7489,12 @@ static unsigned long __init zone_spanned_pages_in_node(int nid,
node_start_pfn, node_end_pfn,
zone_start_pfn, zone_end_pfn);
+ if (zone_type == ZONE_MOVABLE && max_dmb_pfn[nid]) {
+ if (*zone_start_pfn == *zone_end_pfn)
+ *zone_end_pfn = max_dmb_pfn[nid];
+ *zone_start_pfn = min(*zone_start_pfn, min_dmb_pfn[nid]);
+ }
+
/* Check that this node has pages within the zone's required range */
if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
return 0;
@@ -7550,12 +7563,21 @@ static unsigned long __init zone_absent_pages_in_node(int nid,
&zone_start_pfn, &zone_end_pfn);
nr_absent = __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
+ if (zone_type == ZONE_MOVABLE && max_dmb_pfn[nid]) {
+ if (zone_start_pfn == zone_end_pfn)
+ zone_end_pfn = max_dmb_pfn[nid];
+ else
+ zone_end_pfn = zone_movable_pfn[nid];
+ zone_start_pfn = min(zone_start_pfn, min_dmb_pfn[nid]);
+ nr_absent += zone_end_pfn - zone_start_pfn;
+ }
+
/*
* ZONE_MOVABLE handling.
- * Treat pages to be ZONE_MOVABLE in ZONE_NORMAL as absent pages
+ * Treat pages to be ZONE_MOVABLE in other zones as absent pages
* and vice versa.
*/
- if (mirrored_kernelcore && zone_movable_pfn[nid]) {
+ if (zone_movable_pfn[nid]) {
unsigned long start_pfn, end_pfn;
struct memblock_region *r;
@@ -7565,6 +7587,19 @@ static unsigned long __init zone_absent_pages_in_node(int nid,
end_pfn = clamp(memblock_region_memory_end_pfn(r),
zone_start_pfn, zone_end_pfn);
+ if (memblock_is_movable(r)) {
+ if (zone_type != ZONE_MOVABLE) {
+ nr_absent += end_pfn - start_pfn;
+ continue;
+ }
+
+ nr_absent -= end_pfn - start_pfn;
+ continue;
+ }
+
+ if (!mirrored_kernelcore)
+ continue;
+
if (zone_type == ZONE_MOVABLE &&
memblock_is_mirror(r))
nr_absent += end_pfn - start_pfn;
@@ -7584,18 +7619,27 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
{
unsigned long totalpages = 0;
enum zone_type i;
+ int nid = pgdat->node_id;
+
+ /*
+ * If Designated Movable Blocks are defined on this node, ensure that
+ * zone_movable_pfn is also defined for this node.
+ */
+ if (max_dmb_pfn[nid] && !zone_movable_pfn[nid])
+ zone_movable_pfn[nid] = min(node_end_pfn,
+ arch_zone_highest_possible_pfn[movable_zone]);
for (i = 0; i < MAX_NR_ZONES; i++) {
struct zone *zone = pgdat->node_zones + i;
unsigned long zone_start_pfn, zone_end_pfn;
unsigned long spanned, absent, size;
- spanned = zone_spanned_pages_in_node(pgdat->node_id, i,
+ spanned = zone_spanned_pages_in_node(nid, i,
node_start_pfn,
node_end_pfn,
&zone_start_pfn,
&zone_end_pfn);
- absent = zone_absent_pages_in_node(pgdat->node_id, i,
+ absent = zone_absent_pages_in_node(nid, i,
node_start_pfn,
node_end_pfn);
@@ -8047,15 +8091,27 @@ unsigned long __init node_map_pfn_alignment(void)
static unsigned long __init early_calculate_totalpages(void)
{
unsigned long totalpages = 0;
- unsigned long start_pfn, end_pfn;
- int i, nid;
+ struct memblock_region *r;
- for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
- unsigned long pages = end_pfn - start_pfn;
+ for_each_mem_region(r) {
+ unsigned long start_pfn, end_pfn, pages;
+ int nid;
+
+ nid = memblock_get_region_node(r);
+ start_pfn = memblock_region_memory_base_pfn(r);
+ end_pfn = memblock_region_memory_end_pfn(r);
- totalpages += pages;
- if (pages)
+ pages = end_pfn - start_pfn;
+ if (pages) {
+ totalpages += pages;
node_set_state(nid, N_MEMORY);
+ if (memblock_is_movable(r)) {
+ if (start_pfn < min_dmb_pfn[nid])
+ min_dmb_pfn[nid] = start_pfn;
+ if (end_pfn > max_dmb_pfn[nid])
+ max_dmb_pfn[nid] = end_pfn;
+ }
+ }
}
return totalpages;
}
@@ -8068,7 +8124,7 @@ static unsigned long __init early_calculate_totalpages(void)
*/
static void __init find_zone_movable_pfns_for_nodes(void)
{
- int i, nid;
+ int nid;
unsigned long usable_startpfn;
unsigned long kernelcore_node, kernelcore_remaining;
/* save the state before borrow the nodemask */
@@ -8196,13 +8252,24 @@ static void __init find_zone_movable_pfns_for_nodes(void)
kernelcore_remaining = kernelcore_node;
/* Go through each range of PFNs within this node */
- for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
+ for_each_mem_region(r) {
unsigned long size_pages;
+ if (memblock_get_region_node(r) != nid)
+ continue;
+
+ start_pfn = memblock_region_memory_base_pfn(r);
+ end_pfn = memblock_region_memory_end_pfn(r);
start_pfn = max(start_pfn, zone_movable_pfn[nid]);
if (start_pfn >= end_pfn)
continue;
+ /* Skip over Designated Movable Blocks */
+ if (memblock_is_movable(r)) {
+ zone_movable_pfn[nid] = end_pfn;
+ continue;
+ }
+
/* Account for what is only usable for kernelcore */
if (start_pfn < usable_startpfn) {
unsigned long kernel_pages;
@@ -8351,6 +8418,8 @@ void __init free_area_init(unsigned long *max_zone_pfn)
}
/* Find the PFNs that ZONE_MOVABLE begins at in each node */
+ memset(min_dmb_pfn, 0xff, sizeof(min_dmb_pfn));
+ memset(max_dmb_pfn, 0, sizeof(max_dmb_pfn));
memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
find_zone_movable_pfns_for_nodes();
--
2.34.1
Designated Movable Blocks are skipped when attempting to allocate
contiguous pages. Doing per-page validation across all spanned
pages within a zone can be especially inefficient when Designated
Movable Blocks create large overlaps between zones. Use
dmb_intersects() within pfn_range_valid_contig() as an early
check to signal that the range is not valid.
The zone_movable_pfn array, which represents the start of the
non-overlapped portion of ZONE_MOVABLE on each node, is now
preserved for use at runtime to skip over any DMB-only portion of
the zone.
Signed-off-by: Doug Berger <[email protected]>
---
mm/page_alloc.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 26846a9a9fc4..d4358d19d5a1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -417,7 +417,7 @@ static unsigned long required_movablecore __initdata;
static unsigned long required_movablecore_percent __initdata;
static unsigned long min_dmb_pfn[MAX_NUMNODES] __initdata;
static unsigned long max_dmb_pfn[MAX_NUMNODES] __initdata;
-static unsigned long zone_movable_pfn[MAX_NUMNODES] __initdata;
+static unsigned long zone_movable_pfn[MAX_NUMNODES];
bool mirrored_kernelcore __initdata_memblock;
/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
@@ -9503,6 +9503,9 @@ static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
unsigned long i, end_pfn = start_pfn + nr_pages;
struct page *page;
+ if (dmb_intersects(start_pfn, end_pfn))
+ return false;
+
for (i = start_pfn; i < end_pfn; i++) {
page = pfn_to_online_page(i);
if (!page)
@@ -9559,7 +9562,10 @@ struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
gfp_zone(gfp_mask), nodemask) {
spin_lock_irqsave(&zone->lock, flags);
- pfn = ALIGN(zone->zone_start_pfn, nr_pages);
+ if (zone_idx(zone) == ZONE_MOVABLE && zone_movable_pfn[nid])
+ pfn = ALIGN(zone_movable_pfn[nid], nr_pages);
+ else
+ pfn = ALIGN(zone->zone_start_pfn, nr_pages);
while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
if (pfn_range_valid_contig(zone, pfn, nr_pages)) {
/*
--
2.34.1
A Designated Movable Block can be created by including the base
address of the block when specifying a movablecore range on the
kernel command line.
Signed-off-by: Doug Berger <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 14 ++++++-
mm/page_alloc.c | 38 ++++++++++++++++---
2 files changed, 45 insertions(+), 7 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6221a1d057dd..5e3bf6e0a264 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3353,7 +3353,7 @@
reporting absolute coordinates, such as tablets
movablecore= [KNL,X86,IA-64,PPC]
- Format: nn[KMGTPE] | nn%
+ Format: nn[KMGTPE] | nn[KMGTPE]@ss[KMGTPE] | nn%
This parameter is the complement to kernelcore=, it
specifies the amount of memory used for migratable
allocations. If both kernelcore and movablecore is
@@ -3363,6 +3363,18 @@
that the amount of memory usable for all allocations
is not too small.
+ If @ss[KMGTPE] is included, memory within the region
+ from ss to ss+nn will be designated as a movable block
+ and included in ZONE_MOVABLE. Designated Movable Blocks
+ must be aligned to pageblock_order. Designated Movable
+ Blocks take priority over values of kernelcore= and are
+ considered part of any memory specified by more general
+ movablecore= values.
+ Multiple Designated Movable Blocks may be specified,
+ comma delimited.
+ Example:
+ movablecore=100M@2G,100M@3G,1G@1024G
+
movable_node [KNL] Boot-time switch to make hotplugable memory
NUMA nodes to be movable. This means that the memory
of such nodes will be usable only for movable
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d4358d19d5a1..cb3c55acf7de 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8504,9 +8504,9 @@ void __init free_area_init(unsigned long *max_zone_pfn)
}
static int __init cmdline_parse_core(char *p, unsigned long *core,
- unsigned long *percent)
+ unsigned long *percent, bool movable)
{
- unsigned long long coremem;
+ unsigned long long coremem, address;
char *endptr;
if (!p)
@@ -8521,6 +8521,17 @@ static int __init cmdline_parse_core(char *p, unsigned long *core,
*percent = coremem;
} else {
coremem = memparse(p, &p);
+ if (movable && *p == '@') {
+ address = memparse(++p, &p);
+ if (*p != '\0' ||
+ !memblock_is_region_memory(address, coremem) ||
+ memblock_is_region_reserved(address, coremem))
+ return -EINVAL;
+ memblock_reserve(address, coremem);
+ return dmb_reserve(address, coremem, NULL);
+ } else if (*p != '\0') {
+ return -EINVAL;
+ }
/* Paranoid check that UL is enough for the coremem value */
WARN_ON((coremem >> PAGE_SHIFT) > ULONG_MAX);
@@ -8543,17 +8554,32 @@ static int __init cmdline_parse_kernelcore(char *p)
}
return cmdline_parse_core(p, &required_kernelcore,
- &required_kernelcore_percent);
+ &required_kernelcore_percent, false);
}
/*
* movablecore=size sets the amount of memory for use for allocations that
- * can be reclaimed or migrated.
+ * can be reclaimed or migrated. movablecore=size@base defines a Designated
+ * Movable Block.
*/
static int __init cmdline_parse_movablecore(char *p)
{
- return cmdline_parse_core(p, &required_movablecore,
- &required_movablecore_percent);
+ int ret = -EINVAL;
+
+ while (p) {
+ char *k = strchr(p, ',');
+
+ if (k)
+ *k++ = 0;
+
+ ret = cmdline_parse_core(p, &required_movablecore,
+ &required_movablecore_percent, true);
+ if (ret)
+ break;
+ p = k;
+ }
+
+ return ret;
}
early_param("kernelcore", cmdline_parse_kernelcore);
--
2.34.1