2022-09-13 19:59:26

by Doug Berger

Subject: [PATCH 00/21] mm: introduce Designated Movable Blocks

MOTIVATION:
Some Broadcom devices (e.g. 7445, 7278) contain multiple memory
controllers with each mapped in a different address range within
a Uniform Memory Architecture. Some users of these systems have
expressed the desire to locate ZONE_MOVABLE memory on each
memory controller to allow user-space-intensive processing to
make better use of the additional memory bandwidth.
Unfortunately, the historical monotonic layout of zones would
mean that if the lowest addressed memory controller contains
ZONE_MOVABLE memory, then all of the memory available from
memory controllers at higher addresses must also be in the
ZONE_MOVABLE zone. This would force all kernel memory accesses
onto the lowest addressed memory controller and significantly
reduce the amount of memory available for non-movable
allocations.

The main objective of this patch set is therefore to allow a
block of memory to be designated as part of the ZONE_MOVABLE
zone, where it will only ever be used by the kernel page
allocator to satisfy requests for movable pages. The term
Designated Movable Block is introduced here to represent such a
block. The favored implementation extends the 'movablecore'
kernel parameter to accept an optional base address and to
support multiple blocks. The existing
'movablecore' mechanisms are retained. Other mechanisms based on
device tree are also included in this set.
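
For illustration, with the extended syntax introduced later in this
series, two such blocks could be requested on the kernel command
line as follows (the addresses and sizes are placeholders chosen
only for the example):

    movablecore=256M@0x40000000,256M@0x80000000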

BACKGROUND:
NUMA architectures support distributing movablecore memory
across each node, but it is undesirable to introduce the
overhead and complexities of NUMA on systems that don't have a
Non-Uniform Memory Architecture.

Commit 342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror option")
also depends on zone overlap to support systems with multiple
mirrored ranges.

Commit c6f03e2903c9 ("mm, memory_hotplug: remove zone restrictions")
embraced overlapped zones for memory hotplug.

This patch set follows their lead to allow the ZONE_MOVABLE
zone to overlap other zones while spanning the pages from the
lowest Designated Movable Block to the end of the node.
Designated Movable Blocks are made absent from overlapping zones
and present within the ZONE_MOVABLE zone.

I initially investigated an implementation using a Designated
Movable migrate type in line with comments[1] made by Mel Gorman
regarding a "sticky" MIGRATE_MOVABLE type to avoid using
ZONE_MOVABLE. However, this approach was riskier since it was
much more intrusive in the allocation paths. Ultimately, the
progress made by the memory hotplug folks to expand the
ZONE_MOVABLE functionality convinced me to follow this approach.

OPPORTUNITIES:
There have been many attempts to modify the behavior of the
kernel page allocator's use of CMA regions. This implementation
of Designated Movable Blocks creates an opportunity to repurpose
the CMA allocator to operate on ZONE_MOVABLE memory that the
kernel page allocator can use more aggressively, without
affecting the existing CMA implementation. It is hoped that the
"shared-dmb-pool" approach included here will be useful in cases
where memory sharing is more important than allocation latency.

CMA introduced a paradigm where multiple allocators could
operate on the same region of memory, and that paradigm can be
extended to Designated Movable Blocks as well. I was interested
in using kernel resource management as a mechanism for exposing
Designated Movable Block resources (e.g. /proc/iomem) that would
be used by the kernel page allocator like any other ZONE_MOVABLE
memory, but could be claimed by an alternative allocator (e.g.
CMA). Unfortunately, this becomes complicated because the kernel
resource implementation varies materially across different
architectures, and I do not require this capability, so I have
deferred that work.

The MEMBLOCK_MOVABLE and MEMBLOCK_HOTPLUG flags have a lot in common
and could potentially be consolidated, but I chose to avoid that
here to reduce controversy.

The CMA and DMB alignment constraints are currently the same so
the logic could be simplified, but this implementation keeps
them distinct to facilitate independent evolution of the
implementations if necessary.

COMMITS:
Commits 1-3 represent bug fixes that could have been submitted
separately and should be submitted to linux-stable. They are
included here because of later commit dependencies to facilitate
review of the entire patch set.

Commits 4-6 are enhancements of hugepage migration to support
contiguous allocations (i.e. alloc_contig_range). These are
potentially of value if a non-gigantic hugepage can be
allocated through fallback from MIGRATE_CMA pageblocks or for
the allocation of gigantic pages. Their real value is to support
CMA from Designated Movable Blocks.

Commits 7-15 make up the preferred embodiment of the concept of
Designated Movable Block support.

The remaining commits (i.e. 16-21) are examples of additional
opportunities to use DMBs with other kernel services to achieve
more aggressive sharing of DMB reservations with the kernel
page allocator. It is hoped that they are of value to others,
but they can be reviewed and evaluated separately from the other
commits in this set if there is controversy and/or opportunities
for improvement.

[1] https://lore.kernel.org/lkml/[email protected]/

Doug Berger (21):
mm/page_isolation: protect cma from isolate_single_pageblock
mm/hugetlb: correct max_huge_pages accounting on demote
mm/hugetlb: correct demote page offset logic
mm/hugetlb: refactor alloc_and_dissolve_huge_page
mm/hugetlb: allow migrated hugepage to dissolve when freed
mm/hugetlb: add hugepage isolation support
lib/show_mem.c: display MovableOnly
mm/vmstat: show start_pfn when zone spans pages
mm/page_alloc: calculate node_spanned_pages from pfns
mm/page_alloc.c: allow oversized movablecore
mm/page_alloc: introduce init_reserved_pageblock()
memblock: introduce MEMBLOCK_MOVABLE flag
mm/dmb: Introduce Designated Movable Blocks
mm/page_alloc: make alloc_contig_pages DMB aware
mm/page_alloc: allow base for movablecore
dt-bindings: reserved-memory: introduce designated-movable-block
mm/dmb: introduce rmem designated-movable-block
mm/cma: support CMA in Designated Movable Blocks
dt-bindings: reserved-memory: shared-dma-pool: support DMB
mm/cma: introduce rmem shared-dmb-pool
mm/hugetlb: introduce hugetlb_dmb

.../admin-guide/kernel-parameters.txt | 17 +-
.../designated-movable-block.yaml | 51 ++++
.../reserved-memory/shared-dma-pool.yaml | 8 +
drivers/of/of_reserved_mem.c | 20 +-
include/linux/cma.h | 13 +-
include/linux/dmb.h | 28 +++
include/linux/gfp.h | 5 +-
include/linux/hugetlb.h | 5 +
include/linux/memblock.h | 8 +
kernel/dma/contiguous.c | 33 ++-
lib/show_mem.c | 2 +-
mm/Kconfig | 12 +
mm/Makefile | 1 +
mm/cma.c | 58 +++--
mm/dmb.c | 156 ++++++++++++
mm/hugetlb.c | 194 +++++++++++----
mm/memblock.c | 30 ++-
mm/migrate.c | 1 +
mm/page_alloc.c | 225 +++++++++++++-----
mm/page_isolation.c | 75 +++---
mm/vmstat.c | 5 +
21 files changed, 765 insertions(+), 182 deletions(-)
create mode 100644 Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml
create mode 100644 include/linux/dmb.h
create mode 100644 mm/dmb.c

--
2.25.1


2022-09-13 19:59:42

by Doug Berger

Subject: [PATCH 02/21] mm/hugetlb: correct max_huge_pages accounting on demote

When demoting a hugepage to a smaller order, the number of pages
added to the target hstate will be the size of the large page
divided by the size of the smaller page.
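
For example, assuming a 4KiB base page size, demoting a 1GiB
hugepage into 2MiB hugepages should add 1GiB / 2MiB = 512 pages to
the target hstate rather than pages_per_huge_page(h) = 262144.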

Fixes: 8531fc6f52f5 ("hugetlb: add hugetlb demote page support")
Signed-off-by: Doug Berger <[email protected]>
---
mm/hugetlb.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e070b8593b37..79949893ac12 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3472,7 +3472,8 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
* based on pool changes for the demoted page.
*/
h->max_huge_pages--;
- target_hstate->max_huge_pages += pages_per_huge_page(h);
+ target_hstate->max_huge_pages += pages_per_huge_page(h) /
+ pages_per_huge_page(target_hstate);

return rc;
}
--
2.25.1

2022-09-13 20:00:09

by Doug Berger

Subject: [PATCH 04/21] mm/hugetlb: refactor alloc_and_dissolve_huge_page

The alloc_replacement_page() and replace_hugepage() functions are
created from code in the alloc_and_dissolve_huge_page() function
to allow their reuse by the next commit.

Signed-off-by: Doug Berger <[email protected]>
---
mm/hugetlb.c | 84 +++++++++++++++++++++++++++++++---------------------
1 file changed, 51 insertions(+), 33 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a1d51a1f0404..f232a37df4b6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2709,32 +2709,22 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
}

/*
- * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
- * @h: struct hstate old page belongs to
- * @old_page: Old page to dissolve
- * @list: List to isolate the page in case we need to
- * Returns 0 on success, otherwise negated error.
+ * Before dissolving the page, we need to allocate a new one for the
+ * pool to remain stable. Here, we allocate the page and 'prep' it
+ * by doing everything but actually updating counters and adding to
+ * the pool. This simplifies and let us do most of the processing
+ * under the lock.
*/
-static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
- struct list_head *list)
+static struct page *alloc_replacement_page(struct hstate *h, int nid)
{
gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
- int nid = page_to_nid(old_page);
bool alloc_retry = false;
struct page *new_page;
- int ret = 0;

- /*
- * Before dissolving the page, we need to allocate a new one for the
- * pool to remain stable. Here, we allocate the page and 'prep' it
- * by doing everything but actually updating counters and adding to
- * the pool. This simplifies and let us do most of the processing
- * under the lock.
- */
alloc_retry:
new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL);
if (!new_page)
- return -ENOMEM;
+ return ERR_PTR(-ENOMEM);
/*
* If all goes well, this page will be directly added to the free
* list in the pool. For this the ref count needs to be zero.
@@ -2748,7 +2738,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
SetHPageTemporary(new_page);
if (!put_page_testzero(new_page)) {
if (alloc_retry)
- return -EBUSY;
+ return ERR_PTR(-EBUSY);

alloc_retry = true;
goto alloc_retry;
@@ -2757,6 +2747,48 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,

__prep_new_huge_page(h, new_page);

+ return new_page;
+}
+
+static void replace_hugepage(struct hstate *h, int nid, struct page *old_page,
+ struct page *new_page)
+{
+ lockdep_assert_held(&hugetlb_lock);
+ /*
+ * Ok, old_page is still a genuine free hugepage. Remove it from
+ * the freelist and decrease the counters. These will be
+ * incremented again when calling __prep_account_new_huge_page()
+ * and enqueue_huge_page() for new_page. The counters will remain
+ * stable since this happens under the lock.
+ */
+ remove_hugetlb_page(h, old_page, false);
+
+ /*
+ * Ref count on new page is already zero as it was dropped
+ * earlier. It can be directly added to the pool free list.
+ */
+ __prep_account_new_huge_page(h, nid);
+ enqueue_huge_page(h, new_page);
+}
+
+/*
+ * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
+ * @h: struct hstate old page belongs to
+ * @old_page: Old page to dissolve
+ * @list: List to isolate the page in case we need to
+ * Returns 0 on success, otherwise negated error.
+ */
+static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
+ struct list_head *list)
+{
+ int nid = page_to_nid(old_page);
+ struct page *new_page;
+ int ret = 0;
+
+ new_page = alloc_replacement_page(h, nid);
+ if (IS_ERR(new_page))
+ return PTR_ERR(new_page);
+
retry:
spin_lock_irq(&hugetlb_lock);
if (!PageHuge(old_page)) {
@@ -2783,21 +2815,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
cond_resched();
goto retry;
} else {
- /*
- * Ok, old_page is still a genuine free hugepage. Remove it from
- * the freelist and decrease the counters. These will be
- * incremented again when calling __prep_account_new_huge_page()
- * and enqueue_huge_page() for new_page. The counters will remain
- * stable since this happens under the lock.
- */
- remove_hugetlb_page(h, old_page, false);
-
- /*
- * Ref count on new page is already zero as it was dropped
- * earlier. It can be directly added to the pool free list.
- */
- __prep_account_new_huge_page(h, nid);
- enqueue_huge_page(h, new_page);
+ replace_hugepage(h, nid, old_page, new_page);

/*
* Pages have been replaced, we can safely free the old one.
--
2.25.1

2022-09-13 20:00:09

by Doug Berger

Subject: [PATCH 07/21] lib/show_mem.c: display MovableOnly

The comment for commit c78e93630d15 ("mm: do not walk all of
system memory during show_mem") indicates it "also corrects the
reporting of HighMem as HighMem/MovableOnly as ZONE_MOVABLE has
similar problems to HighMem with respect to lowmem/highmem
exhaustion."

Presuming the similar problems are with regard to the general
exclusion of kernel allocations from either zone, I believe it
makes sense to include all ZONE_MOVABLE memory even on systems
without HighMem.

To the extent that this was the intent of the original commit, I
have included a "Fixes" tag, but it seems unnecessary to submit
it to linux-stable.

Fixes: c78e93630d15 ("mm: do not walk all of system memory during show_mem")
Signed-off-by: Doug Berger <[email protected]>
---
lib/show_mem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/show_mem.c b/lib/show_mem.c
index 1c26c14ffbb9..337c870a5e59 100644
--- a/lib/show_mem.c
+++ b/lib/show_mem.c
@@ -27,7 +27,7 @@ void show_mem(unsigned int filter, nodemask_t *nodemask)
total += zone->present_pages;
reserved += zone->present_pages - zone_managed_pages(zone);

- if (is_highmem_idx(zoneid))
+ if (zoneid == ZONE_MOVABLE || is_highmem_idx(zoneid))
highmem += zone->present_pages;
}
}
--
2.25.1

2022-09-13 20:00:32

by Doug Berger

Subject: [PATCH 09/21] mm/page_alloc: calculate node_spanned_pages from pfns

Since the start and end pfns of the node are passed as arguments
to calculate_node_totalpages(), they might as well be used to
specify the node_spanned_pages value for the node rather than
accumulating the spans of member zones.

This prevents the need for additional adjustments if zones are
allowed to overlap.

Signed-off-by: Doug Berger <[email protected]>
---
mm/page_alloc.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6bf76bbc0308..b6074961fb59 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7452,7 +7452,7 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
unsigned long node_start_pfn,
unsigned long node_end_pfn)
{
- unsigned long realtotalpages = 0, totalpages = 0;
+ unsigned long realtotalpages = 0;
enum zone_type i;

for (i = 0; i < MAX_NR_ZONES; i++) {
@@ -7483,11 +7483,10 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
zone->present_early_pages = real_size;
#endif

- totalpages += size;
realtotalpages += real_size;
}

- pgdat->node_spanned_pages = totalpages;
+ pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;
pgdat->node_present_pages = realtotalpages;
pr_debug("On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages);
}
--
2.25.1

2022-09-13 20:00:33

by Doug Berger

Subject: [PATCH 10/21] mm/page_alloc.c: allow oversized movablecore

Now that the error in computation of corepages has been corrected
by commit 9fd745d450e7 ("mm: fix overflow in
find_zone_movable_pfns_for_nodes()"), oversized specifications of
movablecore will result in a zero value for required_kernelcore if
kernelcore is not also specified.

It is unintuitive for such a request to lead to no ZONE_MOVABLE
memory when the kernel parameters are clearly requesting some.

The current behavior when requesting an oversized kernelcore is to
classify all of the pages in movable_zone as kernelcore. The new
behavior when requesting an oversized movablecore (when not also
specifying kernelcore) is to similarly classify all of the pages
in movable_zone as movablecore.
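
For example, booting a hypothetical 4GiB system with movablecore=8G
(and no kernelcore=) now places all of the movable_zone pages in
ZONE_MOVABLE instead of silently creating no ZONE_MOVABLE at all.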

Signed-off-by: Doug Berger <[email protected]>
---
mm/page_alloc.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b6074961fb59..ad38a81203e5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8041,13 +8041,13 @@ static void __init find_zone_movable_pfns_for_nodes(void)
corepages = totalpages - required_movablecore;

required_kernelcore = max(required_kernelcore, corepages);
+ } else if (!required_kernelcore) {
+ /* If kernelcore was not specified, there is no ZONE_MOVABLE */
+ goto out;
}

- /*
- * If kernelcore was not specified or kernelcore size is larger
- * than totalpages, there is no ZONE_MOVABLE.
- */
- if (!required_kernelcore || required_kernelcore >= totalpages)
+ /* If kernelcore size exceeds totalpages, there is no ZONE_MOVABLE */
+ if (required_kernelcore >= totalpages)
goto out;

/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
--
2.25.1

2022-09-13 20:00:51

by Doug Berger

Subject: [PATCH 06/21] mm/hugetlb: add hugepage isolation support

When a range of pageblocks is isolated, there is at most one
hugepage that has only tail pages overlapping that range (i.e.
a hugepage that overlaps the beginning of the range).

However, that hugepage is the first migration target for an
alloc_contig_range() attempt so it already receives special
attention.

Checking whether the pageblock containing the head of a hugepage
is isolated is an inexpensive way to avoid hugepage allocations
from isolated pageblocks, which makes alloc_contig_range() more
efficient.

Signed-off-by: Doug Berger <[email protected]>
---
mm/hugetlb.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index da80889e1436..2f354423f50f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -33,6 +33,7 @@
#include <linux/migrate.h>
#include <linux/nospec.h>
#include <linux/delayacct.h>
+#include <linux/page-isolation.h>

#include <asm/page.h>
#include <asm/pgalloc.h>
@@ -1135,6 +1136,10 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
if (PageHWPoison(page))
continue;

+ /* Check head pageblock isolation */
+ if (is_migrate_isolate_page(page))
+ continue;
+
list_move(&page->lru, &h->hugepage_activelist);
set_page_refcounted(page);
ClearHPageFreed(page);
--
2.25.1

2022-09-13 20:01:20

by Doug Berger

Subject: [PATCH 16/21] dt-bindings: reserved-memory: introduce designated-movable-block

Introduce designated-movable-block.yaml to document the
devicetree binding for Designated Movable Block children of the
reserved-memory node.

Signed-off-by: Doug Berger <[email protected]>
---
.../designated-movable-block.yaml | 51 +++++++++++++++++++
1 file changed, 51 insertions(+)
create mode 100644 Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml

diff --git a/Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml b/Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml
new file mode 100644
index 000000000000..42f846069a2e
--- /dev/null
+++ b/Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml
@@ -0,0 +1,51 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/reserved-memory/designated-movable-block.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: /reserved-memory Designated Movable Block node binding
+
+maintainers:
+ - [email protected]
+
+allOf:
+ - $ref: "reserved-memory.yaml"
+
+properties:
+ compatible:
+ const: designated-movable-block
+ description:
+ This indicates a region of memory meant to be placed into
+ ZONE_MOVABLE.
+
+unevaluatedProperties: false
+
+required:
+ - compatible
+ - reusable
+
+examples:
+ - |
+ reserved-memory {
+ #address-cells = <0x2>;
+ #size-cells = <0x2>;
+
+ DMB0@10800000 {
+ compatible = "designated-movable-block";
+ reusable;
+ reg = <0x0 0x10800000 0x0 0x2d800000>;
+ };
+
+ DMB1@40000000 {
+ compatible = "designated-movable-block";
+ reusable;
+ reg = <0x0 0x40000000 0x0 0x30000000>;
+ };
+
+ DMB2@80000000 {
+ compatible = "designated-movable-block";
+ reusable;
+ reg = <0x0 0x80000000 0x0 0x2fc00000>;
+ };
+ };
--
2.25.1

2022-09-13 20:01:28

by Doug Berger

Subject: [PATCH 15/21] mm/page_alloc: allow base for movablecore

A Designated Movable Block can be created by including the base
address of the block when specifying a movablecore range on the
kernel command line.

Signed-off-by: Doug Berger <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 14 ++++++-
mm/page_alloc.c | 38 ++++++++++++++++---
2 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 426fa892d311..8141fac7c7cb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3312,7 +3312,7 @@
reporting absolute coordinates, such as tablets

movablecore= [KNL,X86,IA-64,PPC]
- Format: nn[KMGTPE] | nn%
+ Format: nn[KMGTPE] | nn[KMGTPE]@ss[KMGTPE] | nn%
This parameter is the complement to kernelcore=, it
specifies the amount of memory used for migratable
allocations. If both kernelcore and movablecore is
@@ -3322,6 +3322,18 @@
that the amount of memory usable for all allocations
is not too small.

+ If @ss[KMGTPE] is included, memory within the region
+ from ss to ss+nn will be designated as a movable block
+ and included in ZONE_MOVABLE. Designated Movable Blocks
+ must be aligned to pageblock_order. Designated Movable
+ Blocks take priority over values of kernelcore= and are
+ considered part of any memory specified by more general
+ movablecore= values.
+ Multiple Designated Movable Blocks may be specified,
+ comma delimited.
+ Example:
+ movablecore=100M@2G,100M@3G,1G@1024G
+
movable_node [KNL] Boot-time switch to make hotplugable memory
NUMA nodes to be movable. This means that the memory
of such nodes will be usable only for movable
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 69753cc51e19..e38dd1b32771 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8370,9 +8370,9 @@ void __init free_area_init(unsigned long *max_zone_pfn)
}

static int __init cmdline_parse_core(char *p, unsigned long *core,
- unsigned long *percent)
+ unsigned long *percent, bool movable)
{
- unsigned long long coremem;
+ unsigned long long coremem, address;
char *endptr;

if (!p)
@@ -8387,6 +8387,17 @@ static int __init cmdline_parse_core(char *p, unsigned long *core,
*percent = coremem;
} else {
coremem = memparse(p, &p);
+ if (movable && *p == '@') {
+ address = memparse(++p, &p);
+ if (*p != '\0' ||
+ !memblock_is_region_memory(address, coremem) ||
+ memblock_is_region_reserved(address, coremem))
+ return -EINVAL;
+ memblock_reserve(address, coremem);
+ return dmb_reserve(address, coremem, NULL);
+ } else if (*p != '\0') {
+ return -EINVAL;
+ }
/* Paranoid check that UL is enough for the coremem value */
WARN_ON((coremem >> PAGE_SHIFT) > ULONG_MAX);

@@ -8409,17 +8420,32 @@ static int __init cmdline_parse_kernelcore(char *p)
}

return cmdline_parse_core(p, &required_kernelcore,
- &required_kernelcore_percent);
+ &required_kernelcore_percent, false);
}

/*
* movablecore=size sets the amount of memory for use for allocations that
- * can be reclaimed or migrated.
+ * can be reclaimed or migrated. movablecore=size@base defines a Designated
+ * Movable Block.
*/
static int __init cmdline_parse_movablecore(char *p)
{
- return cmdline_parse_core(p, &required_movablecore,
- &required_movablecore_percent);
+ int ret = -EINVAL;
+
+ while (p) {
+ char *k = strchr(p, ',');
+
+ if (k)
+ *k++ = 0;
+
+ ret = cmdline_parse_core(p, &required_movablecore,
+ &required_movablecore_percent, true);
+ if (ret)
+ break;
+ p = k;
+ }
+
+ return ret;
}

early_param("kernelcore", cmdline_parse_kernelcore);
--
2.25.1

2022-09-13 20:08:51

by Doug Berger

Subject: [PATCH 21/21] mm/hugetlb: introduce hugetlb_dmb

If specified on the kernel command line, the hugetlb_dmb parameter
modifies the behavior of the hugetlb_cma parameter to use the
Contiguous Memory Allocator within Designated Movable Blocks for
gigantic page allocation.

This allows the kernel page allocator to use the memory more
aggressively than traditional CMA memory pools at the cost of
potentially increased allocation latency.
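
As an illustrative (untested) example, the following command line
would reserve a 1GiB gigantic page pool through hugetlb_cma while
placing the backing CMA area inside a Designated Movable Block:

    hugetlb_cma=1G hugetlb_dmb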

Signed-off-by: Doug Berger <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 3 +++
mm/hugetlb.c | 16 +++++++++++++---
2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 8141fac7c7cb..b29d1fa253d6 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1732,6 +1732,9 @@
hugepages using the CMA allocator. If enabled, the
boot-time allocation of gigantic hugepages is skipped.

+ hugetlb_dmb [HW,CMA] Causes hugetlb_cma to use Designated Movable
+ Blocks for any CMA areas it reserves.
+
hugetlb_free_vmemmap=
[KNL] Reguires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
enabled.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2f354423f50f..d3fb8b1f443f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -54,6 +54,7 @@ struct hstate hstates[HUGE_MAX_HSTATE];
#ifdef CONFIG_CMA
static struct cma *hugetlb_cma[MAX_NUMNODES];
static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata;
+static bool hugetlb_dmb __initdata;
static bool hugetlb_cma_page(struct page *page, unsigned int order)
{
return cma_pages_valid(hugetlb_cma[page_to_nid(page)], page,
@@ -7321,6 +7322,14 @@ static int __init cmdline_parse_hugetlb_cma(char *p)

early_param("hugetlb_cma", cmdline_parse_hugetlb_cma);

+static int __init cmdline_parse_hugetlb_dmb(char *p)
+{
+ hugetlb_dmb = true;
+ return 0;
+}
+
+early_param("hugetlb_dmb", cmdline_parse_hugetlb_dmb);
+
void __init hugetlb_cma_reserve(int order)
{
unsigned long size, reserved, per_node;
@@ -7396,10 +7405,11 @@ void __init hugetlb_cma_reserve(int order)
* may be returned to CMA allocator in the case of
* huge page demotion.
*/
- res = cma_declare_contiguous_nid(0, size, 0,
+ res = __cma_declare_contiguous_nid(0, size, 0,
PAGE_SIZE << HUGETLB_PAGE_ORDER,
- 0, false, name,
- &hugetlb_cma[nid], nid);
+ 0, false, name,
+ &hugetlb_cma[nid], nid,
+ hugetlb_dmb);
if (res) {
pr_warn("hugetlb_cma: reservation failed: err %d, node %d",
res, nid);
--
2.25.1

2022-09-13 20:13:17

by Doug Berger

Subject: [PATCH 01/21] mm/page_isolation: protect cma from isolate_single_pageblock

The function set_migratetype_isolate() has special handling for
pageblocks of MIGRATE_CMA type that protects them from being
isolated for MIGRATE_MOVABLE requests.

Since isolate_single_pageblock() doesn't receive the migratetype
argument of start_isolate_page_range(), it uses the migratetype
of the pageblock instead of the requested migratetype, which
defeats this MIGRATE_CMA check.

This allows an attempt to create a gigantic page within a CMA
region to change the migratetype of the first and last pageblocks
from MIGRATE_CMA to MIGRATE_MOVABLE when they are restored after
failure, which corrupts the CMA region.

The calls to (un)set_migratetype_isolate() for the first and last
pageblocks of the start_isolate_page_range() are moved back into
that function to allow access to its migratetype argument and make
it easier to see how all of the pageblocks in the range are
isolated.

Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity")
Signed-off-by: Doug Berger <[email protected]>
---
mm/page_isolation.c | 75 +++++++++++++++++++++------------------------
1 file changed, 35 insertions(+), 40 deletions(-)

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 9d73dc38e3d7..8e16aa22cb61 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -286,8 +286,6 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
* @flags: isolation flags
* @gfp_flags: GFP flags used for migrating pages
* @isolate_before: isolate the pageblock before the boundary_pfn
- * @skip_isolation: the flag to skip the pageblock isolation in second
- * isolate_single_pageblock()
*
* Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one
* pageblock. When not all pageblocks within a page are isolated at the same
@@ -302,9 +300,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
* the in-use page then splitting the free page.
*/
static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
- gfp_t gfp_flags, bool isolate_before, bool skip_isolation)
+ gfp_t gfp_flags, bool isolate_before)
{
- unsigned char saved_mt;
unsigned long start_pfn;
unsigned long isolate_pageblock;
unsigned long pfn;
@@ -328,18 +325,6 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
start_pfn = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
zone->zone_start_pfn);

- saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
-
- if (skip_isolation)
- VM_BUG_ON(!is_migrate_isolate(saved_mt));
- else {
- ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
- isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
-
- if (ret)
- return ret;
- }
-
/*
* Bail out early when the to-be-isolated pageblock does not form
* a free or in-use page across boundary_pfn:
@@ -428,7 +413,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
ret = set_migratetype_isolate(page, page_mt,
flags, head_pfn, head_pfn + nr_pages);
if (ret)
- goto failed;
+ return ret;
}

ret = __alloc_contig_migrate_range(&cc, head_pfn,
@@ -443,7 +428,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
unset_migratetype_isolate(page, page_mt);

if (ret)
- goto failed;
+ return -EBUSY;
/*
* reset pfn to the head of the free page, so
* that the free page handling code above can split
@@ -459,24 +444,19 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
while (!PageBuddy(pfn_to_page(outer_pfn))) {
/* stop if we cannot find the free page */
if (++order >= MAX_ORDER)
- goto failed;
+ return -EBUSY;
outer_pfn &= ~0UL << order;
}
pfn = outer_pfn;
continue;
} else
#endif
- goto failed;
+ return -EBUSY;
}

pfn++;
}
return 0;
-failed:
- /* restore the original migratetype */
- if (!skip_isolation)
- unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
- return -EBUSY;
}

/**
@@ -534,21 +514,30 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
unsigned long isolate_start = ALIGN_DOWN(start_pfn, pageblock_nr_pages);
unsigned long isolate_end = ALIGN(end_pfn, pageblock_nr_pages);
int ret;
- bool skip_isolation = false;

/* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
- ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false, skip_isolation);
+ ret = set_migratetype_isolate(pfn_to_page(isolate_start), migratetype,
+ flags, isolate_start, isolate_start + pageblock_nr_pages);
if (ret)
return ret;
-
- if (isolate_start == isolate_end - pageblock_nr_pages)
- skip_isolation = true;
+ ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false);
+ if (ret)
+ goto unset_start_block;

/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
- ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true, skip_isolation);
+ pfn = isolate_end - pageblock_nr_pages;
+ if (isolate_start != pfn) {
+ ret = set_migratetype_isolate(pfn_to_page(pfn), migratetype,
+ flags, pfn, pfn + pageblock_nr_pages);
+ if (ret)
+ goto unset_start_block;
+ }
+ ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true);
if (ret) {
- unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
- return ret;
+ if (isolate_start != pfn)
+ goto unset_end_block;
+ else
+ goto unset_start_block;
}

/* skip isolated pageblocks at the beginning and end */
@@ -557,15 +546,21 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
pfn += pageblock_nr_pages) {
page = __first_valid_page(pfn, pageblock_nr_pages);
if (page && set_migratetype_isolate(page, migratetype, flags,
- start_pfn, end_pfn)) {
- undo_isolate_page_range(isolate_start, pfn, migratetype);
- unset_migratetype_isolate(
- pfn_to_page(isolate_end - pageblock_nr_pages),
- migratetype);
- return -EBUSY;
- }
+ start_pfn, end_pfn))
+ goto unset_isolated_blocks;
}
return 0;
+
+unset_isolated_blocks:
+ ret = -EBUSY;
+ undo_isolate_page_range(isolate_start + pageblock_nr_pages, pfn,
+ migratetype);
+unset_end_block:
+ unset_migratetype_isolate(pfn_to_page(isolate_end - pageblock_nr_pages),
+ migratetype);
+unset_start_block:
+ unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
+ return ret;
}

/*
--
2.25.1

2022-09-13 20:14:36

by Doug Berger

Subject: [PATCH 11/21] mm/page_alloc: introduce init_reserved_pageblock()

Most of the implementation of init_cma_reserved_pageblock() is
common to the initialization of any reserved pageblock for use
by the page allocator.

This commit breaks that functionality out into the new common
function init_reserved_pageblock() for use by code other than
CMA. The CMA-specific code is relocated from page_alloc to the
point where init_cma_reserved_pageblock() was invoked, and the
new function is used there instead. The error path is also
updated to use the function to operate on pageblocks rather
than pages.

Signed-off-by: Doug Berger <[email protected]>
---
include/linux/gfp.h | 5 +----
mm/cma.c | 15 +++++++++++----
mm/page_alloc.c | 8 ++------
3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index f314be58fa77..71ed687be406 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -367,9 +367,6 @@ extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
#endif
void free_contig_range(unsigned long pfn, unsigned long nr_pages);

-#ifdef CONFIG_CMA
-/* CMA stuff */
-extern void init_cma_reserved_pageblock(struct page *page);
-#endif
+extern void init_reserved_pageblock(struct page *page);

#endif /* __LINUX_GFP_H */
diff --git a/mm/cma.c b/mm/cma.c
index 4a978e09547a..6208a3e1cd9d 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -31,6 +31,7 @@
#include <linux/highmem.h>
#include <linux/io.h>
#include <linux/kmemleak.h>
+#include <linux/page-isolation.h>
#include <trace/events/cma.h>

#include "cma.h"
@@ -116,8 +117,13 @@ static void __init cma_activate_area(struct cma *cma)
}

for (pfn = base_pfn; pfn < base_pfn + cma->count;
- pfn += pageblock_nr_pages)
- init_cma_reserved_pageblock(pfn_to_page(pfn));
+ pfn += pageblock_nr_pages) {
+ struct page *page = pfn_to_page(pfn);
+
+ set_pageblock_migratetype(page, MIGRATE_CMA);
+ init_reserved_pageblock(page);
+ page_zone(page)->cma_pages += pageblock_nr_pages;
+ }

spin_lock_init(&cma->lock);

@@ -133,8 +139,9 @@ static void __init cma_activate_area(struct cma *cma)
out_error:
/* Expose all pages to the buddy, they are useless for CMA. */
if (!cma->reserve_pages_on_error) {
- for (pfn = base_pfn; pfn < base_pfn + cma->count; pfn++)
- free_reserved_page(pfn_to_page(pfn));
+ for (pfn = base_pfn; pfn < base_pfn + cma->count;
+ pfn += pageblock_nr_pages)
+ init_reserved_pageblock(pfn_to_page(pfn));
}
totalcma_pages -= cma->count;
cma->count = 0;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ad38a81203e5..1682d8815efa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2302,9 +2302,8 @@ void __init page_alloc_init_late(void)
set_zone_contiguous(zone);
}

-#ifdef CONFIG_CMA
-/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
-void __init init_cma_reserved_pageblock(struct page *page)
+/* Free whole pageblock */
+void __init init_reserved_pageblock(struct page *page)
{
unsigned i = pageblock_nr_pages;
struct page *p = page;
@@ -2314,14 +2313,11 @@ void __init init_cma_reserved_pageblock(struct page *page)
set_page_count(p, 0);
} while (++p, --i);

- set_pageblock_migratetype(page, MIGRATE_CMA);
set_page_refcounted(page);
__free_pages(page, pageblock_order);

adjust_managed_page_count(page, pageblock_nr_pages);
- page_zone(page)->cma_pages += pageblock_nr_pages;
}
-#endif

/*
* The order of subdivision here is critical for the IO subsystem.
--
2.25.1

2022-09-13 20:14:39

by Doug Berger

Subject: [PATCH 08/21] mm/vmstat: show start_pfn when zone spans pages

A zone that overlaps with another zone may span a range of pages
that are not present. In this case, displaying the start_pfn of
the zone allows the zone page range to be identified.

Signed-off-by: Doug Berger <[email protected]>
---
mm/vmstat.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 90af9a8572f5..e2f19f2b7615 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1717,6 +1717,11 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,

/* If unpopulated, no other information is useful */
if (!populated_zone(zone)) {
+ /* Show start_pfn for empty overlapped zones */
+ if (zone->spanned_pages)
+ seq_printf(m,
+ "\n start_pfn: %lu",
+ zone->zone_start_pfn);
seq_putc(m, '\n');
return;
}
--
2.25.1

2022-09-13 20:14:46

by Doug Berger

Subject: [PATCH 17/21] mm/dmb: introduce rmem designated-movable-block

This commit allows Designated Movable Blocks to be created by
including reserved-memory child nodes in the device tree with
the "designated-movable-block" compatible string.

Signed-off-by: Doug Berger <[email protected]>
---
drivers/of/of_reserved_mem.c | 15 ++++++---
mm/dmb.c | 64 ++++++++++++++++++++++++++++++++++++
2 files changed, 74 insertions(+), 5 deletions(-)

diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c
index 65f3b02a0e4e..0eb9e8898d7b 100644
--- a/drivers/of/of_reserved_mem.c
+++ b/drivers/of/of_reserved_mem.c
@@ -23,6 +23,7 @@
#include <linux/memblock.h>
#include <linux/kmemleak.h>
#include <linux/cma.h>
+#include <linux/dmb.h>

#include "of_private.h"

@@ -113,12 +114,16 @@ static int __init __reserved_mem_alloc_size(unsigned long node,

nomap = of_get_flat_dt_prop(node, "no-map", NULL) != NULL;

- /* Need adjust the alignment to satisfy the CMA requirement */
- if (IS_ENABLED(CONFIG_CMA)
- && of_flat_dt_is_compatible(node, "shared-dma-pool")
- && of_get_flat_dt_prop(node, "reusable", NULL)
- && !nomap)
+ if (of_flat_dt_is_compatible(node, "designated-movable-block")) {
+ /* Need adjust the alignment to satisfy the DMB requirement */
+ align = max_t(phys_addr_t, align, DMB_MIN_ALIGNMENT_BYTES);
+ } else if (IS_ENABLED(CONFIG_CMA)
+ && of_flat_dt_is_compatible(node, "shared-dma-pool")
+ && of_get_flat_dt_prop(node, "reusable", NULL)
+ && !nomap) {
+ /* Need adjust the alignment to satisfy the CMA requirement */
align = max_t(phys_addr_t, align, CMA_MIN_ALIGNMENT_BYTES);
+ }

prop = of_get_flat_dt_prop(node, "alloc-ranges", &len);
if (prop) {
diff --git a/mm/dmb.c b/mm/dmb.c
index 9d9fd31089d2..8132d18542a0 100644
--- a/mm/dmb.c
+++ b/mm/dmb.c
@@ -90,3 +90,67 @@ void __init dmb_init_region(struct memblock_region *region)
init_reserved_pageblock(page);
}
}
+
+/*
+ * Support for reserved memory regions defined in device tree
+ */
+#ifdef CONFIG_OF_RESERVED_MEM
+#include <linux/of.h>
+#include <linux/of_fdt.h>
+#include <linux/of_reserved_mem.h>
+
+#undef pr_fmt
+#define pr_fmt(fmt) fmt
+
+static int rmem_dmb_device_init(struct reserved_mem *rmem, struct device *dev)
+{
+ struct dmb *dmb;
+
+ dmb = (struct dmb *)rmem->priv;
+ if (dmb->owner)
+ return -EBUSY;
+
+ dmb->owner = dev;
+ return 0;
+}
+
+static void rmem_dmb_device_release(struct reserved_mem *rmem,
+ struct device *dev)
+{
+ struct dmb *dmb;
+
+ dmb = (struct dmb *)rmem->priv;
+ if (dmb->owner == (void *)dev)
+ dmb->owner = NULL;
+}
+
+static const struct reserved_mem_ops rmem_dmb_ops = {
+ .device_init = rmem_dmb_device_init,
+ .device_release = rmem_dmb_device_release,
+};
+
+static int __init rmem_dmb_setup(struct reserved_mem *rmem)
+{
+ unsigned long node = rmem->fdt_node;
+ struct dmb *dmb;
+ int err;
+
+ if (!of_get_flat_dt_prop(node, "reusable", NULL) ||
+ of_get_flat_dt_prop(node, "no-map", NULL))
+ return -EINVAL;
+
+ err = dmb_reserve(rmem->base, rmem->size, &dmb);
+ if (err) {
+ pr_err("Reserved memory: unable to setup DMB region\n");
+ return err;
+ }
+
+ rmem->priv = dmb;
+ rmem->ops = &rmem_dmb_ops;
+ pr_info("Reserved memory: created DMB at %pa, size %ld MiB\n",
+ &rmem->base, (unsigned long)rmem->size / SZ_1M);
+
+ return 0;
+}
+RESERVEDMEM_OF_DECLARE(dmb, "designated-movable-block", rmem_dmb_setup);
+#endif
--
2.25.1

2022-09-13 20:15:31

by Doug Berger

Subject: [PATCH 13/21] mm/dmb: Introduce Designated Movable Blocks

Designated Movable Blocks are blocks of memory that are composed
of one or more adjacent memblocks that have the MEMBLOCK_MOVABLE
designation. These blocks must be reserved before receiving that
designation and will be located in the ZONE_MOVABLE zone rather
than any other zone that may span them.
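
A minimal sketch of how platform setup code might carve out such a
block with the interface added below (the function name, base
address, and size are illustrative assumptions; the range must be
pageblock aligned and reserved before dmb_reserve() is called):

    #include <linux/dmb.h>
    #include <linux/memblock.h>
    #include <linux/sizes.h>

    static int __init example_reserve_dmb(void)
    {
            phys_addr_t base = 0x40000000;  /* illustrative base */
            phys_addr_t size = SZ_256M;     /* illustrative size */
            struct dmb *dmb;

            /* dmb_reserve() requires an already reserved region */
            memblock_reserve(base, size);
            return dmb_reserve(base, size, &dmb);
    }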

Signed-off-by: Doug Berger <[email protected]>
---
include/linux/dmb.h | 28 ++++++++++++++
mm/Kconfig | 12 ++++++
mm/Makefile | 1 +
mm/dmb.c | 92 +++++++++++++++++++++++++++++++++++++++++++++
mm/memblock.c | 6 ++-
mm/page_alloc.c | 84 ++++++++++++++++++++++++++++++++++-------
6 files changed, 209 insertions(+), 14 deletions(-)
create mode 100644 include/linux/dmb.h
create mode 100644 mm/dmb.c

diff --git a/include/linux/dmb.h b/include/linux/dmb.h
new file mode 100644
index 000000000000..eecc90e7f884
--- /dev/null
+++ b/include/linux/dmb.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __DMB_H__
+#define __DMB_H__
+
+#include <linux/memblock.h>
+
+/*
+ * the buddy -- especially pageblock merging and alloc_contig_range()
+ * -- can deal with only some pageblocks of a higher-order page being
+ * MIGRATE_MOVABLE, we can use pageblock_nr_pages.
+ */
+#define DMB_MIN_ALIGNMENT_PAGES pageblock_nr_pages
+#define DMB_MIN_ALIGNMENT_BYTES (PAGE_SIZE * DMB_MIN_ALIGNMENT_PAGES)
+
+enum {
+ DMB_DISJOINT = 0,
+ DMB_INTERSECTS,
+ DMB_MIXED,
+};
+
+struct dmb;
+
+extern int dmb_reserve(phys_addr_t base, phys_addr_t size,
+ struct dmb **res_dmb);
+extern int dmb_intersects(unsigned long spfn, unsigned long epfn);
+extern void dmb_init_region(struct memblock_region *region);
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 0331f1461f81..7739edde5d4d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -868,6 +868,18 @@ config CMA_AREAS

If unsure, leave the default value "7" in UMA and "19" in NUMA.

+config DMB_COUNT
+ int "Maximum count of Designated Movable Blocks"
+ default 19 if NUMA
+ default 7
+ help
+ Designated Movable Blocks are blocks of memory that can be used
+ by the page allocator exclusively for movable pages. They are
+ managed in ZONE_MOVABLE but may overlap with other zones. This
+ parameter sets the maximum number of DMBs in the system.
+
+ If unsure, leave the default value "7" in UMA and "19" in NUMA.
+
config MEM_SOFT_DIRTY
bool "Track memory changes"
depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY && PROC_FS
diff --git a/mm/Makefile b/mm/Makefile
index 9a564f836403..d0b469a494f2 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -67,6 +67,7 @@ obj-y += page-alloc.o
obj-y += init-mm.o
obj-y += memblock.o
obj-y += $(memory-hotplug-y)
+obj-y += dmb.o

ifdef CONFIG_MMU
obj-$(CONFIG_ADVISE_SYSCALLS) += madvise.o
diff --git a/mm/dmb.c b/mm/dmb.c
new file mode 100644
index 000000000000..9d9fd31089d2
--- /dev/null
+++ b/mm/dmb.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Designated Movable Block
+ */
+
+#define pr_fmt(fmt) "dmb: " fmt
+
+#include <linux/dmb.h>
+
+struct dmb {
+ unsigned long start_pfn;
+ unsigned long end_pfn;
+ void *owner;
+};
+
+static struct dmb dmb_areas[CONFIG_DMB_COUNT];
+static unsigned int dmb_area_count;
+
+int __init dmb_reserve(phys_addr_t base, phys_addr_t size,
+ struct dmb **res_dmb)
+{
+ struct dmb *dmb;
+
+ /* Sanity checks */
+ if (dmb_area_count == ARRAY_SIZE(dmb_areas)) {
+ pr_warn("Not enough slots for DMB reserved regions!\n");
+ return -ENOSPC;
+ }
+
+ if (!size || !memblock_is_region_reserved(base, size))
+ return -EINVAL;
+
+ /* ensure minimal alignment required by mm core */
+ if (!IS_ALIGNED(base | size, DMB_MIN_ALIGNMENT_BYTES))
+ return -EINVAL;
+
+ /*
+ * Each reserved area must be initialised later, when more kernel
+ * subsystems (like slab allocator) are available.
+ */
+ dmb = &dmb_areas[dmb_area_count++];
+
+ dmb->start_pfn = PFN_DOWN(base);
+ dmb->end_pfn = PFN_DOWN(base + size);
+ if (res_dmb)
+ *res_dmb = dmb;
+
+ memblock_mark_movable(base, size);
+ return 0;
+}
+
+int dmb_intersects(unsigned long spfn, unsigned long epfn)
+{
+ int i;
+ struct dmb *dmb;
+
+ if (spfn >= epfn)
+ return DMB_DISJOINT;
+
+ for (i = 0; i < dmb_area_count; i++) {
+ dmb = &dmb_areas[i];
+ if (spfn >= dmb->end_pfn)
+ continue;
+ if (epfn <= dmb->start_pfn)
+ return DMB_DISJOINT;
+ if (spfn >= dmb->start_pfn && epfn <= dmb->end_pfn)
+ return DMB_INTERSECTS;
+ else
+ return DMB_MIXED;
+ }
+
+ return DMB_DISJOINT;
+}
+EXPORT_SYMBOL(dmb_intersects);
+
+void __init dmb_init_region(struct memblock_region *region)
+{
+ unsigned long pfn;
+ int i;
+
+ for (pfn = memblock_region_memory_base_pfn(region);
+ pfn < memblock_region_memory_end_pfn(region);
+ pfn += pageblock_nr_pages) {
+ struct page *page = pfn_to_page(pfn);
+
+ for (i = 0; i < pageblock_nr_pages; i++)
+ set_page_zone(page + i, ZONE_MOVABLE);
+
+ /* free reserved pageblocks to page allocator */
+ init_reserved_pageblock(page);
+ }
+}
diff --git a/mm/memblock.c b/mm/memblock.c
index 5d6a210d98ec..9eb91acdeb75 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -16,6 +16,7 @@
#include <linux/kmemleak.h>
#include <linux/seq_file.h>
#include <linux/memblock.h>
+#include <linux/dmb.h>

#include <asm/sections.h>
#include <linux/io.h>
@@ -2090,13 +2091,16 @@ static void __init memmap_init_reserved_pages(void)
for_each_reserved_mem_range(i, &start, &end)
reserve_bootmem_region(start, end);

- /* and also treat struct pages for the NOMAP regions as PageReserved */
for_each_mem_region(region) {
+ /* treat struct pages for the NOMAP regions as PageReserved */
if (memblock_is_nomap(region)) {
start = region->base;
end = start + region->size;
reserve_bootmem_region(start, end);
}
+ /* move Designated Movable Block pages to ZONE_MOVABLE */
+ if (memblock_is_movable(region))
+ dmb_init_region(region);
}
}

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1682d8815efa..e723094d1e1e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -75,6 +75,7 @@
#include <linux/khugepaged.h>
#include <linux/buffer_head.h>
#include <linux/delayacct.h>
+#include <linux/dmb.h>
#include <asm/sections.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -433,6 +434,7 @@ static unsigned long required_kernelcore __initdata;
static unsigned long required_kernelcore_percent __initdata;
static unsigned long required_movablecore __initdata;
static unsigned long required_movablecore_percent __initdata;
+static unsigned long min_dmb_pfn[MAX_NUMNODES] __initdata;
static unsigned long zone_movable_pfn[MAX_NUMNODES] __initdata;
bool mirrored_kernelcore __initdata_memblock;

@@ -2165,7 +2167,7 @@ static int __init deferred_init_memmap(void *data)
}
zone_empty:
/* Sanity check that the next zone really is unpopulated */
- WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
+ WARN_ON(++zid < ZONE_MOVABLE && populated_zone(++zone));

pr_info("node %d deferred pages initialised in %ums\n",
pgdat->node_id, jiffies_to_msecs(jiffies - start));
@@ -6899,6 +6901,10 @@ static void __init memmap_init_zone_range(struct zone *zone,
unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
int nid = zone_to_nid(zone), zone_id = zone_idx(zone);

+ /* Skip overlap of ZONE_MOVABLE */
+ if (zone_id == ZONE_MOVABLE && zone_start_pfn < *hole_pfn)
+ zone_start_pfn = *hole_pfn;
+
start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);

@@ -7348,6 +7354,9 @@ static unsigned long __init zone_spanned_pages_in_node(int nid,
node_start_pfn, node_end_pfn,
zone_start_pfn, zone_end_pfn);

+ if (zone_type == ZONE_MOVABLE && min_dmb_pfn[nid])
+ *zone_start_pfn = min(*zone_start_pfn, min_dmb_pfn[nid]);
+
/* Check that this node has pages within the zone's required range */
if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
return 0;
@@ -7416,12 +7425,17 @@ static unsigned long __init zone_absent_pages_in_node(int nid,
&zone_start_pfn, &zone_end_pfn);
nr_absent = __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);

+ if (zone_type == ZONE_MOVABLE && min_dmb_pfn[nid]) {
+ zone_start_pfn = min(zone_start_pfn, min_dmb_pfn[nid]);
+ nr_absent += zone_movable_pfn[nid] - zone_start_pfn;
+ }
+
/*
* ZONE_MOVABLE handling.
- * Treat pages to be ZONE_MOVABLE in ZONE_NORMAL as absent pages
+ * Treat pages to be ZONE_MOVABLE in other zones as absent pages
* and vice versa.
*/
- if (mirrored_kernelcore && zone_movable_pfn[nid]) {
+ if (zone_movable_pfn[nid]) {
unsigned long start_pfn, end_pfn;
struct memblock_region *r;

@@ -7431,6 +7445,21 @@ static unsigned long __init zone_absent_pages_in_node(int nid,
end_pfn = clamp(memblock_region_memory_end_pfn(r),
zone_start_pfn, zone_end_pfn);

+ if (memblock_is_movable(r)) {
+ if (zone_type != ZONE_MOVABLE) {
+ nr_absent += end_pfn - start_pfn;
+ continue;
+ }
+
+ end_pfn = min(end_pfn, zone_movable_pfn[nid]);
+ if (start_pfn < zone_movable_pfn[nid])
+ nr_absent -= end_pfn - start_pfn;
+ continue;
+ }
+
+ if (!mirrored_kernelcore)
+ continue;
+
if (zone_type == ZONE_MOVABLE &&
memblock_is_mirror(r))
nr_absent += end_pfn - start_pfn;
@@ -7450,6 +7479,15 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
{
unsigned long realtotalpages = 0;
enum zone_type i;
+ int nid = pgdat->node_id;
+
+ /*
+ * If Designated Movable Blocks are defined on this node, ensure that
+ * zone_movable_pfn is also defined for this node.
+ */
+ if (min_dmb_pfn[nid] && !zone_movable_pfn[nid])
+ zone_movable_pfn[nid] = min(node_end_pfn,
+ arch_zone_highest_possible_pfn[movable_zone]);

for (i = 0; i < MAX_NR_ZONES; i++) {
struct zone *zone = pgdat->node_zones + i;
@@ -7457,12 +7495,12 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
unsigned long spanned, absent;
unsigned long size, real_size;

- spanned = zone_spanned_pages_in_node(pgdat->node_id, i,
+ spanned = zone_spanned_pages_in_node(nid, i,
node_start_pfn,
node_end_pfn,
&zone_start_pfn,
&zone_end_pfn);
- absent = zone_absent_pages_in_node(pgdat->node_id, i,
+ absent = zone_absent_pages_in_node(nid, i,
node_start_pfn,
node_end_pfn);

@@ -7922,15 +7960,23 @@ unsigned long __init find_min_pfn_with_active_regions(void)
static unsigned long __init early_calculate_totalpages(void)
{
unsigned long totalpages = 0;
- unsigned long start_pfn, end_pfn;
- int i, nid;
+ struct memblock_region *r;

- for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
- unsigned long pages = end_pfn - start_pfn;
+ for_each_mem_region(r) {
+ unsigned long start_pfn, end_pfn, pages;
+ int nid;

- totalpages += pages;
- if (pages)
+ nid = memblock_get_region_node(r);
+ start_pfn = memblock_region_memory_base_pfn(r);
+ end_pfn = memblock_region_memory_end_pfn(r);
+
+ pages = end_pfn - start_pfn;
+ if (pages) {
+ totalpages += pages;
node_set_state(nid, N_MEMORY);
+ if (memblock_is_movable(r) && !min_dmb_pfn[nid])
+ min_dmb_pfn[nid] = start_pfn;
+ }
}
return totalpages;
}
@@ -7943,7 +7989,7 @@ static unsigned long __init early_calculate_totalpages(void)
*/
static void __init find_zone_movable_pfns_for_nodes(void)
{
- int i, nid;
+ int nid;
unsigned long usable_startpfn;
unsigned long kernelcore_node, kernelcore_remaining;
/* save the state before borrow the nodemask */
@@ -8071,13 +8117,24 @@ static void __init find_zone_movable_pfns_for_nodes(void)
kernelcore_remaining = kernelcore_node;

/* Go through each range of PFNs within this node */
- for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
+ for_each_mem_region(r) {
unsigned long size_pages;

+ if (memblock_get_region_node(r) != nid)
+ continue;
+
+ start_pfn = memblock_region_memory_base_pfn(r);
+ end_pfn = memblock_region_memory_end_pfn(r);
start_pfn = max(start_pfn, zone_movable_pfn[nid]);
if (start_pfn >= end_pfn)
continue;

+ /* Skip over Designated Movable Blocks */
+ if (memblock_is_movable(r)) {
+ zone_movable_pfn[nid] = end_pfn;
+ continue;
+ }
+
/* Account for what is only usable for kernelcore */
if (start_pfn < usable_startpfn) {
unsigned long kernel_pages;
@@ -8226,6 +8283,7 @@ void __init free_area_init(unsigned long *max_zone_pfn)
}

/* Find the PFNs that ZONE_MOVABLE begins at in each node */
+ memset(min_dmb_pfn, 0, sizeof(min_dmb_pfn));
memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
find_zone_movable_pfns_for_nodes();

--
2.25.1

2022-09-13 20:16:35

by Doug Berger

Subject: [PATCH 12/21] memblock: introduce MEMBLOCK_MOVABLE flag

The MEMBLOCK_MOVABLE flag is introduced to designate a memblock
as only supporting movable allocations by the page allocator.
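
A brief sketch of the intended usage (the function name is a
hypothetical placeholder; later patches in this series apply the
flag from dmb_reserve() and test it during memmap initialization):

    #include <linux/memblock.h>
    #include <linux/printk.h>

    static void __init example_mark_and_scan(phys_addr_t base,
                                             phys_addr_t size)
    {
            struct memblock_region *r;

            /* designate a reserved range as movable-only */
            memblock_mark_movable(base, size);

            /* regions carrying the flag can later be identified */
            for_each_mem_region(r) {
                    if (memblock_is_movable(r))
                            pr_info("movable region at %pa\n", &r->base);
            }
    }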

Signed-off-by: Doug Berger <[email protected]>
---
include/linux/memblock.h | 8 ++++++++
mm/memblock.c | 24 ++++++++++++++++++++++++
2 files changed, 32 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 50ad19662a32..8eb3ca32dfa7 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -47,6 +47,7 @@ enum memblock_flags {
MEMBLOCK_MIRROR = 0x2, /* mirrored region */
MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
+ MEMBLOCK_MOVABLE = 0x10, /* designated movable block */
};

/**
@@ -125,6 +126,8 @@ int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
+int memblock_mark_movable(phys_addr_t base, phys_addr_t size);
+int memblock_clear_movable(phys_addr_t base, phys_addr_t size);

void memblock_free_all(void);
void memblock_free(void *ptr, size_t size);
@@ -265,6 +268,11 @@ static inline bool memblock_is_driver_managed(struct memblock_region *m)
return m->flags & MEMBLOCK_DRIVER_MANAGED;
}

+static inline bool memblock_is_movable(struct memblock_region *m)
+{
+ return m->flags & MEMBLOCK_MOVABLE;
+}
+
int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
unsigned long *end_pfn);
void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
diff --git a/mm/memblock.c b/mm/memblock.c
index b5d3026979fc..5d6a210d98ec 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -979,6 +979,30 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP);
}

+/**
+ * memblock_mark_movable - Mark designated movable block with MEMBLOCK_MOVABLE.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_mark_movable(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_setclr_flag(base, size, 1, MEMBLOCK_MOVABLE);
+}
+
+/**
+ * memblock_clear_movable - Clear flag MEMBLOCK_MOVABLE for a specified region.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_clear_movable(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_setclr_flag(base, size, 0, MEMBLOCK_MOVABLE);
+}
+
static bool should_skip_region(struct memblock_type *type,
struct memblock_region *m,
int nid, int flags)
--
2.25.1

2022-09-13 20:17:46

by Doug Berger

Subject: [PATCH 03/21] mm/hugetlb: correct demote page offset logic

With gigantic pages, it may not be true that struct page structures
are contiguous across the entire gigantic page. The mem_map_offset()
function is used here in place of direct pointer arithmetic to
correct for this.

Fixes: 8531fc6f52f5 ("hugetlb: add hugetlb demote page support")
Signed-off-by: Doug Berger <[email protected]>
---
mm/hugetlb.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 79949893ac12..a1d51a1f0404 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3420,6 +3420,7 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
{
int i, nid = page_to_nid(page);
struct hstate *target_hstate;
+ struct page *subpage;
int rc = 0;

target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
@@ -3453,15 +3454,16 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
mutex_lock(&target_hstate->resize_lock);
for (i = 0; i < pages_per_huge_page(h);
i += pages_per_huge_page(target_hstate)) {
+ subpage = mem_map_offset(page, i);
if (hstate_is_gigantic(target_hstate))
- prep_compound_gigantic_page_for_demote(page + i,
+ prep_compound_gigantic_page_for_demote(subpage,
target_hstate->order);
else
- prep_compound_page(page + i, target_hstate->order);
- set_page_private(page + i, 0);
- set_page_refcounted(page + i);
- prep_new_huge_page(target_hstate, page + i, nid);
- put_page(page + i);
+ prep_compound_page(subpage, target_hstate->order);
+ set_page_private(subpage, 0);
+ set_page_refcounted(subpage);
+ prep_new_huge_page(target_hstate, subpage, nid);
+ put_page(subpage);
}
mutex_unlock(&target_hstate->resize_lock);

--
2.25.1

2022-09-13 20:20:23

by Doug Berger

[permalink] [raw]
Subject: [PATCH 18/21] mm/cma: support CMA in Designated Movable Blocks

This commit allows the page allocator to handle CMA areas that
lie within Designated Movable Blocks differently.

Specifically, their pageblocks are allowed to remain migratetype
MIGRATE_MOVABLE so that the page allocator can use them more
aggressively. This also means the page allocator should not count
these pages in the nr_free_cma metric it uses for managing
MIGRATE_CMA pageblocks.

Accordingly, these areas are removed from the CmaTotal metric
after initialization to avoid confusion.
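
For illustration only, a sketch of how architecture setup code
might reserve a CMA area inside a Designated Movable Block through
the widened interface (the size and names are made-up examples;
the DMB reservation itself happens inside the call per this patch):

    #include <linux/cma.h>
    #include <linux/numa.h>
    #include <linux/sizes.h>

    static struct cma *example_cma;     /* hypothetical area */

    static int __init example_dmb_cma_reserve(void)
    {
        /* base/limit/alignment of 0 request defaults; last arg is in_dmb */
        return __cma_declare_contiguous_nid(0, SZ_512M, 0, 0, 0, false,
                                            "example-dmb", &example_cma,
                                            NUMA_NO_NODE, true);
    }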

Signed-off-by: Doug Berger <[email protected]>
---
include/linux/cma.h | 13 ++++++---
mm/cma.c | 55 +++++++++++++++++++++++++-----------
mm/page_alloc.c | 69 +++++++++++++++++++++++++++++----------------
3 files changed, 92 insertions(+), 45 deletions(-)

diff --git a/include/linux/cma.h b/include/linux/cma.h
index 63873b93deaa..ffbb8ea2c5f8 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -31,11 +31,13 @@ extern phys_addr_t cma_get_base(const struct cma *cma);
extern unsigned long cma_get_size(const struct cma *cma);
extern const char *cma_get_name(const struct cma *cma);

-extern int __init cma_declare_contiguous_nid(phys_addr_t base,
+extern int __init __cma_declare_contiguous_nid(phys_addr_t base,
phys_addr_t size, phys_addr_t limit,
phys_addr_t alignment, unsigned int order_per_bit,
bool fixed, const char *name, struct cma **res_cma,
- int nid);
+ int nid, bool in_dmb);
+#define cma_declare_contiguous_nid(b, s, l, a, o, f, n, r_c, nid) \
+ __cma_declare_contiguous_nid(b, s, l, a, o, f, n, r_c, nid, false)
static inline int __init cma_declare_contiguous(phys_addr_t base,
phys_addr_t size, phys_addr_t limit,
phys_addr_t alignment, unsigned int order_per_bit,
@@ -44,10 +46,13 @@ static inline int __init cma_declare_contiguous(phys_addr_t base,
return cma_declare_contiguous_nid(base, size, limit, alignment,
order_per_bit, fixed, name, res_cma, NUMA_NO_NODE);
}
-extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
+extern int __cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
unsigned int order_per_bit,
const char *name,
- struct cma **res_cma);
+ struct cma **res_cma,
+ bool in_dmb);
+#define cma_init_reserved_mem(base, size, order, name, res_cma) \
+ __cma_init_reserved_mem(base, size, order, name, res_cma, 0)
extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned int align,
bool no_warn);
extern bool cma_pages_valid(struct cma *cma, const struct page *pages, unsigned long count);
diff --git a/mm/cma.c b/mm/cma.c
index 6208a3e1cd9d..4f33cd54db9e 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -33,6 +33,7 @@
#include <linux/kmemleak.h>
#include <linux/page-isolation.h>
#include <trace/events/cma.h>
+#include <linux/dmb.h>

#include "cma.h"

@@ -98,6 +99,10 @@ static void __init cma_activate_area(struct cma *cma)
{
unsigned long base_pfn = cma->base_pfn, pfn;
struct zone *zone;
+ int is_dmb = dmb_intersects(base_pfn, base_pfn + cma->count);
+
+ if (is_dmb == DMB_MIXED)
+ goto out_error;

cma->bitmap = bitmap_zalloc(cma_bitmap_maxno(cma), GFP_KERNEL);
if (!cma->bitmap)
@@ -116,13 +121,17 @@ static void __init cma_activate_area(struct cma *cma)
goto not_in_zone;
}

- for (pfn = base_pfn; pfn < base_pfn + cma->count;
- pfn += pageblock_nr_pages) {
- struct page *page = pfn_to_page(pfn);
+ if (is_dmb == DMB_INTERSECTS) {
+ totalcma_pages -= cma->count;
+ } else {
+ for (pfn = base_pfn; pfn < base_pfn + cma->count;
+ pfn += pageblock_nr_pages) {
+ struct page *page = pfn_to_page(pfn);

- set_pageblock_migratetype(page, MIGRATE_CMA);
- init_reserved_pageblock(page);
- page_zone(page)->cma_pages += pageblock_nr_pages;
+ set_pageblock_migratetype(page, MIGRATE_CMA);
+ init_reserved_pageblock(page);
+ page_zone(page)->cma_pages += pageblock_nr_pages;
+ }
}

spin_lock_init(&cma->lock);
@@ -141,7 +150,8 @@ static void __init cma_activate_area(struct cma *cma)
if (!cma->reserve_pages_on_error) {
for (pfn = base_pfn; pfn < base_pfn + cma->count;
pfn += pageblock_nr_pages)
- init_reserved_pageblock(pfn_to_page(pfn));
+ if (!dmb_intersects(pfn, pfn + pageblock_nr_pages))
+ init_reserved_pageblock(pfn_to_page(pfn));
}
totalcma_pages -= cma->count;
cma->count = 0;
@@ -166,7 +176,7 @@ void __init cma_reserve_pages_on_error(struct cma *cma)
}

/**
- * cma_init_reserved_mem() - create custom contiguous area from reserved memory
+ * __cma_init_reserved_mem() - create custom contiguous area in reserved memory
* @base: Base address of the reserved area
* @size: Size of the reserved area (in bytes),
* @order_per_bit: Order of pages represented by one bit on bitmap.
@@ -174,15 +184,18 @@ void __init cma_reserve_pages_on_error(struct cma *cma)
* the area will be set to "cmaN", where N is a running counter of
* used areas.
* @res_cma: Pointer to store the created cma region.
+ * @in_dmb: Designate the reserved memory as a Designated Movable Block.
*
* This function creates custom contiguous area from already reserved memory.
*/
-int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
- unsigned int order_per_bit,
- const char *name,
- struct cma **res_cma)
+int __init __cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
+ unsigned int order_per_bit,
+ const char *name,
+ struct cma **res_cma,
+ bool in_dmb)
{
struct cma *cma;
+ int err;

/* Sanity checks */
if (cma_area_count == ARRAY_SIZE(cma_areas)) {
@@ -201,6 +214,14 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
if (!IS_ALIGNED(base | size, CMA_MIN_ALIGNMENT_BYTES))
return -EINVAL;

+ if (in_dmb) {
+ err = dmb_reserve(base, size, NULL);
+ if (err) {
+ pr_err("Cannot reserve DMB for CMA!\n");
+ return err;
+ }
+ }
+
/*
* Each reserved area must be initialised later, when more kernel
* subsystems (like slab allocator) are available.
@@ -223,7 +244,7 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
}

/**
- * cma_declare_contiguous_nid() - reserve custom contiguous area
+ * __cma_declare_contiguous_nid() - reserve custom contiguous area
* @base: Base address of the reserved area optional, use 0 for any
* @size: Size of the reserved area (in bytes),
* @limit: End address of the reserved memory (optional, 0 for any).
@@ -233,6 +254,7 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
* @name: The name of the area. See function cma_init_reserved_mem()
* @res_cma: Pointer to store the created cma region.
* @nid: nid of the free area to find, %NUMA_NO_NODE for any node
+ * @in_dmb: Designate the reserved memory as a Designated Movable Block.
*
* This function reserves memory from early allocator. It should be
* called by arch specific code once the early allocator (memblock or bootmem)
@@ -242,11 +264,11 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
* If @fixed is true, reserve contiguous area at exactly @base. If false,
* reserve in range from @base to @limit.
*/
-int __init cma_declare_contiguous_nid(phys_addr_t base,
+int __init __cma_declare_contiguous_nid(phys_addr_t base,
phys_addr_t size, phys_addr_t limit,
phys_addr_t alignment, unsigned int order_per_bit,
bool fixed, const char *name, struct cma **res_cma,
- int nid)
+ int nid, bool in_dmb)
{
phys_addr_t memblock_end = memblock_end_of_DRAM();
phys_addr_t highmem_start;
@@ -374,7 +396,8 @@ int __init cma_declare_contiguous_nid(phys_addr_t base,
base = addr;
}

- ret = cma_init_reserved_mem(base, size, order_per_bit, name, res_cma);
+ ret = __cma_init_reserved_mem(base, size, order_per_bit, name, res_cma,
+ in_dmb);
if (ret)
goto free_mem;

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e38dd1b32771..09d00c178bc8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -9233,29 +9233,8 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
return 0;
}

-/**
- * alloc_contig_range() -- tries to allocate given range of pages
- * @start: start PFN to allocate
- * @end: one-past-the-last PFN to allocate
- * @migratetype: migratetype of the underlying pageblocks (either
- * #MIGRATE_MOVABLE or #MIGRATE_CMA). All pageblocks
- * in range must have the same migratetype and it must
- * be either of the two.
- * @gfp_mask: GFP mask to use during compaction
- *
- * The PFN range does not have to be pageblock aligned. The PFN range must
- * belong to a single zone.
- *
- * The first thing this routine does is attempt to MIGRATE_ISOLATE all
- * pageblocks in the range. Once isolated, the pageblocks should not
- * be modified by others.
- *
- * Return: zero on success or negative error code. On success all
- * pages which PFN is in [start, end) are allocated for the caller and
- * need to be freed with free_contig_range().
- */
-int alloc_contig_range(unsigned long start, unsigned long end,
- unsigned migratetype, gfp_t gfp_mask)
+int _alloc_contig_range(unsigned long start, unsigned long end,
+ unsigned int migratetype, gfp_t gfp_mask)
{
unsigned long outer_start, outer_end;
int order;
@@ -9379,6 +9358,46 @@ int alloc_contig_range(unsigned long start, unsigned long end,
undo_isolate_page_range(start, end, migratetype);
return ret;
}
+
+/**
+ * alloc_contig_range() -- tries to allocate given range of pages
+ * @start: start PFN to allocate
+ * @end: one-past-the-last PFN to allocate
+ * @migratetype: migratetype of the underlying pageblocks (either
+ * #MIGRATE_MOVABLE or #MIGRATE_CMA). All pageblocks
+ * in range must have the same migratetype and it must
+ * be either of the two.
+ * @gfp_mask: GFP mask to use during compaction
+ *
+ * The PFN range does not have to be pageblock aligned. The PFN range must
+ * belong to a single zone.
+ *
+ * The first thing this routine does is attempt to MIGRATE_ISOLATE all
+ * pageblocks in the range. Once isolated, the pageblocks should not
+ * be modified by others.
+ *
+ * Return: zero on success or negative error code. On success all
+ * pages which PFN is in [start, end) are allocated for the caller and
+ * need to be freed with free_contig_range().
+ */
+int alloc_contig_range(unsigned long start, unsigned long end,
+ unsigned int migratetype, gfp_t gfp_mask)
+{
+ switch (dmb_intersects(start, end)) {
+ case DMB_DISJOINT:
+ break;
+ case DMB_INTERSECTS:
+ if (migratetype == MIGRATE_CMA)
+ migratetype = MIGRATE_MOVABLE;
+ else
+ return -EBUSY;
+ break;
+ default:
+ return -EBUSY;
+ }
+
+ return _alloc_contig_range(start, end, migratetype, gfp_mask);
+}
EXPORT_SYMBOL(alloc_contig_range);

static int __alloc_contig_pages(unsigned long start_pfn,
@@ -9386,8 +9405,8 @@ static int __alloc_contig_pages(unsigned long start_pfn,
{
unsigned long end_pfn = start_pfn + nr_pages;

- return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
- gfp_mask);
+ return _alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
+ gfp_mask);
}

static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
--
2.25.1

2022-09-13 20:40:38

by Doug Berger

[permalink] [raw]
Subject: [PATCH 20/21] mm/cma: introduce rmem shared-dmb-pool

A 'shared-dmb-pool' reserved-memory device tree node defines a
Designated Movable Block for use by an associated Contiguous
Memory Allocator.

Devices access the CMA region in the same manner as a 'shared-
dma-pool', but the kernel page allocator is free to use the
memory like any other ZONE_MOVABLE memory.
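
For illustration, a sketch of how a driver might claim such a pool;
apart from the node's compatible string, the flow is the same as for
'shared-dma-pool' (driver and allocation details are made-up examples):

    #include <linux/dma-mapping.h>
    #include <linux/gfp.h>
    #include <linux/of_reserved_mem.h>
    #include <linux/platform_device.h>
    #include <linux/sizes.h>

    static int example_probe(struct platform_device *pdev)
    {
        dma_addr_t dma_handle;
        void *buf;
        int ret;

        /* Attach the reserved-memory region named by "memory-region" */
        ret = of_reserved_mem_device_init(&pdev->dev);
        if (ret)
            return ret;

        /* Allocations are now served from the DMB-backed CMA area */
        buf = dma_alloc_coherent(&pdev->dev, SZ_4M, &dma_handle, GFP_KERNEL);
        if (!buf) {
            of_reserved_mem_device_release(&pdev->dev);
            return -ENOMEM;
        }

        /* ... use buf / dma_handle ... */
        return 0;
    }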

Signed-off-by: Doug Berger <[email protected]>
---
drivers/of/of_reserved_mem.c | 5 +++++
kernel/dma/contiguous.c | 33 ++++++++++++++++++++++++++++-----
2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c
index 0eb9e8898d7b..700c0dc0d3b6 100644
--- a/drivers/of/of_reserved_mem.c
+++ b/drivers/of/of_reserved_mem.c
@@ -123,6 +123,11 @@ static int __init __reserved_mem_alloc_size(unsigned long node,
&& !nomap) {
/* Need adjust the alignment to satisfy the CMA requirement */
align = max_t(phys_addr_t, align, CMA_MIN_ALIGNMENT_BYTES);
+ } else if (IS_ENABLED(CONFIG_CMA)
+ && of_flat_dt_is_compatible(node, "shared-dmb-pool")) {
+ /* Need adjust the alignment to satisfy CMA/DMB requirements */
+ align = max_t(phys_addr_t, align, CMA_MIN_ALIGNMENT_BYTES);
+ align = max_t(phys_addr_t, align, DMB_MIN_ALIGNMENT_BYTES);
}

prop = of_get_flat_dt_prop(node, "alloc-ranges", &len);
diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index 6ea80ae42622..65dda12752a7 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -50,6 +50,7 @@
#include <linux/sizes.h>
#include <linux/dma-map-ops.h>
#include <linux/cma.h>
+#include <linux/dmb.h>

#ifdef CONFIG_CMA_SIZE_MBYTES
#define CMA_SIZE_MBYTES CONFIG_CMA_SIZE_MBYTES
@@ -397,10 +398,11 @@ static const struct reserved_mem_ops rmem_cma_ops = {
.device_release = rmem_cma_device_release,
};

-static int __init rmem_cma_setup(struct reserved_mem *rmem)
+static int __init _rmem_cma_setup(struct reserved_mem *rmem, bool in_dmb)
{
unsigned long node = rmem->fdt_node;
bool default_cma = of_get_flat_dt_prop(node, "linux,cma-default", NULL);
+ phys_addr_t align = CMA_MIN_ALIGNMENT_BYTES;
struct cma *cma;
int err;

@@ -414,16 +416,25 @@ static int __init rmem_cma_setup(struct reserved_mem *rmem)
of_get_flat_dt_prop(node, "no-map", NULL))
return -EINVAL;

- if (!IS_ALIGNED(rmem->base | rmem->size, CMA_MIN_ALIGNMENT_BYTES)) {
+ if (in_dmb) {
+ if (default_cma) {
+ pr_err("Reserved memory: cma-default cannot be DMB\n");
+ return -EINVAL;
+ }
+ align = max_t(phys_addr_t, align, DMB_MIN_ALIGNMENT_BYTES);
+ }
+ if (!IS_ALIGNED(rmem->base | rmem->size, align)) {
pr_err("Reserved memory: incorrect alignment of CMA region\n");
return -EINVAL;
}

- err = cma_init_reserved_mem(rmem->base, rmem->size, 0, rmem->name, &cma);
+ err = __cma_init_reserved_mem(rmem->base, rmem->size, 0, rmem->name,
+ &cma, in_dmb);
if (err) {
pr_err("Reserved memory: unable to setup CMA region\n");
return err;
}
+
/* Architecture specific contiguous memory fixup. */
dma_contiguous_early_fixup(rmem->base, rmem->size);

@@ -433,10 +444,22 @@ static int __init rmem_cma_setup(struct reserved_mem *rmem)
rmem->ops = &rmem_cma_ops;
rmem->priv = cma;

- pr_info("Reserved memory: created CMA memory pool at %pa, size %ld MiB\n",
- &rmem->base, (unsigned long)rmem->size / SZ_1M);
+ pr_info("Reserved memory: created %s memory pool at %pa, size %ld MiB\n",
+ in_dmb ? "DMB" : "CMA", &rmem->base,
+ (unsigned long)rmem->size / SZ_1M);

return 0;
}
+
+static int __init rmem_cma_setup(struct reserved_mem *rmem)
+{
+ return _rmem_cma_setup(rmem, false);
+}
RESERVEDMEM_OF_DECLARE(cma, "shared-dma-pool", rmem_cma_setup);
+
+static int __init rmem_cma_in_dmb_setup(struct reserved_mem *rmem)
+{
+ return _rmem_cma_setup(rmem, true);
+}
+RESERVEDMEM_OF_DECLARE(cma_in_dmb, "shared-dmb-pool", rmem_cma_in_dmb_setup);
#endif
--
2.25.1

2022-09-13 20:48:58

by Doug Berger

[permalink] [raw]
Subject: [PATCH 14/21] mm/page_alloc: make alloc_contig_pages DMB aware

Designated Movable Blocks are skipped when attempting to allocate
contiguous pages. Per-page validation across all spanned pages
within a zone can be especially inefficient when Designated
Movable Blocks create large overlaps between zones, so
dmb_intersects() is used within pfn_range_valid_contig() as an
early check to signal that the range is not valid.

The zone_movable_pfn array, which records the start of the
non-overlapped ZONE_MOVABLE region on each node, is now preserved
for use at runtime to skip over any DMB-only portion of the zone.

Signed-off-by: Doug Berger <[email protected]>
---
mm/page_alloc.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e723094d1e1e..69753cc51e19 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -435,7 +435,7 @@ static unsigned long required_kernelcore_percent __initdata;
static unsigned long required_movablecore __initdata;
static unsigned long required_movablecore_percent __initdata;
static unsigned long min_dmb_pfn[MAX_NUMNODES] __initdata;
-static unsigned long zone_movable_pfn[MAX_NUMNODES] __initdata;
+static unsigned long zone_movable_pfn[MAX_NUMNODES];
bool mirrored_kernelcore __initdata_memblock;

/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
@@ -9370,6 +9370,9 @@ static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
unsigned long i, end_pfn = start_pfn + nr_pages;
struct page *page;

+ if (dmb_intersects(start_pfn, end_pfn))
+ return false;
+
for (i = start_pfn; i < end_pfn; i++) {
page = pfn_to_online_page(i);
if (!page)
@@ -9426,7 +9429,10 @@ struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
gfp_zone(gfp_mask), nodemask) {
spin_lock_irqsave(&zone->lock, flags);

- pfn = ALIGN(zone->zone_start_pfn, nr_pages);
+ if (zone_idx(zone) == ZONE_MOVABLE && zone_movable_pfn[nid])
+ pfn = ALIGN(zone_movable_pfn[nid], nr_pages);
+ else
+ pfn = ALIGN(zone->zone_start_pfn, nr_pages);
while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
if (pfn_range_valid_contig(zone, pfn, nr_pages)) {
/*
--
2.25.1

2022-09-13 20:50:37

by Doug Berger

[permalink] [raw]
Subject: [PATCH 19/21] dt-bindings: reserved-memory: shared-dma-pool: support DMB

The shared-dmb-pool compatible string creates a Designated Movable
Block to contain a shared pool of DMA buffers.

Signed-off-by: Doug Berger <[email protected]>
---
.../bindings/reserved-memory/shared-dma-pool.yaml | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/Documentation/devicetree/bindings/reserved-memory/shared-dma-pool.yaml b/Documentation/devicetree/bindings/reserved-memory/shared-dma-pool.yaml
index 618105f079be..85824fe05ac9 100644
--- a/Documentation/devicetree/bindings/reserved-memory/shared-dma-pool.yaml
+++ b/Documentation/devicetree/bindings/reserved-memory/shared-dma-pool.yaml
@@ -22,6 +22,14 @@ properties:
operating system to instantiate the necessary pool management
subsystem if necessary.

+ - const: shared-dmb-pool
+ description: >
+ This indicates a shared-dma-pool region that is located within
+ a Designated Movable Block. The operating system is free to
+ use unallocated memory for movable allocations in this region.
+ Devices need to be tolerant of allocation latency to use this
+ pool.
+
- const: restricted-dma-pool
description: >
This indicates a region of memory meant to be used as a pool
--
2.25.1

2022-09-13 20:54:26

by Doug Berger

[permalink] [raw]
Subject: [PATCH 05/21] mm/hugetlb: allow migrated hugepage to dissolve when freed

There is no isolation mechanism for hugepages, so a hugepage that
is migrated is returned to its hugepage freelist. This creates
problems for alloc_contig_range() because migrated hugepages can
be allocated as migrate targets for subsequent hugepage migration
attempts.

Even if the migration succeeds, the alloc_contig_range() attempt
will fail because test_pages_isolated() will find that the
now-free hugepages haven't been dissolved.

A subsequent alloc_contig_range() attempt is then necessary for
isolate_migratepages_range() to find the freed hugepage and
dissolve it (assuming it has not been reallocated).

A workqueue is introduced to perform the equivalent of
alloc_and_dissolve_huge_page() for a migrated hugepage when it is
freed, so that its pages can be released to the isolated page
lists of the buddy allocator and the alloc_contig_range() attempt
can succeed.

The HPG_dissolve hugepage flag is introduced to tag migratable
hugepages that should be dissolved when freed.
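
The deferral relies on a common lockless pattern: llist_add()
returns true only when the list was previously empty, so the work
item is scheduled exactly once per batch (the patch itself overlays
the llist_node on page->mapping). A generic sketch of that pattern,
with illustrative names not taken from the patch:

    #include <linux/llist.h>
    #include <linux/slab.h>
    #include <linux/workqueue.h>

    struct deferred_item {
        struct llist_node node;
        /* payload ... */
    };

    static LLIST_HEAD(pending);

    static void drain_workfn(struct work_struct *work)
    {
        struct llist_node *first = llist_del_all(&pending);
        struct deferred_item *item, *tmp;

        llist_for_each_entry_safe(item, tmp, first, node)
            kfree(item);        /* stand-in for the real dissolve work */
    }
    static DECLARE_WORK(drain_work, drain_workfn);

    static void defer_item(struct deferred_item *item)
    {
        /* true only if the list was empty, so the work runs once per batch */
        if (llist_add(&item->node, &pending))
            schedule_work(&drain_work);
    }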

Signed-off-by: Doug Berger <[email protected]>
---
include/linux/hugetlb.h | 5 +++
mm/hugetlb.c | 72 ++++++++++++++++++++++++++++++++++++++---
mm/migrate.c | 1 +
mm/page_alloc.c | 1 +
4 files changed, 75 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3ec981a0d8b3..0e6e21805e51 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -222,6 +222,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,

bool is_hugetlb_entry_migration(pte_t pte);
void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
+void sync_hugetlb_dissolve(void);

#else /* !CONFIG_HUGETLB_PAGE */

@@ -430,6 +431,8 @@ static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,

static inline void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) { }

+static inline void sync_hugetlb_dissolve(void) { }
+
#endif /* !CONFIG_HUGETLB_PAGE */
/*
* hugepages at page global directory. If arch support
@@ -574,6 +577,7 @@ enum hugetlb_page_flags {
HPG_freed,
HPG_vmemmap_optimized,
HPG_raw_hwp_unreliable,
+ HPG_dissolve,
__NR_HPAGEFLAGS,
};

@@ -621,6 +625,7 @@ HPAGEFLAG(Temporary, temporary)
HPAGEFLAG(Freed, freed)
HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
HPAGEFLAG(RawHwpUnreliable, raw_hwp_unreliable)
+HPAGEFLAG(Dissolve, dissolve)

#ifdef CONFIG_HUGETLB_PAGE

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f232a37df4b6..da80889e1436 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1582,6 +1582,10 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
}
}

+static LLIST_HEAD(hpage_dissolvelist);
+static void dissolve_hpage_workfn(struct work_struct *work);
+static DECLARE_WORK(dissolve_hpage_work, dissolve_hpage_workfn);
+
/*
* As update_and_free_page() can be called under any context, so we cannot
* use GFP_KERNEL to allocate vmemmap pages. However, we can defer the
@@ -1628,6 +1632,8 @@ static inline void flush_free_hpage_work(struct hstate *h)
{
if (hugetlb_vmemmap_optimizable(h))
flush_work(&free_hpage_work);
+ if (!hstate_is_gigantic(h))
+ flush_work(&dissolve_hpage_work);
}

static void update_and_free_page(struct hstate *h, struct page *page,
@@ -1679,7 +1685,7 @@ void free_huge_page(struct page *page)
struct hstate *h = page_hstate(page);
int nid = page_to_nid(page);
struct hugepage_subpool *spool = hugetlb_page_subpool(page);
- bool restore_reserve;
+ bool restore_reserve, dissolve;
unsigned long flags;

VM_BUG_ON_PAGE(page_count(page), page);
@@ -1691,6 +1697,8 @@ void free_huge_page(struct page *page)
page->mapping = NULL;
restore_reserve = HPageRestoreReserve(page);
ClearHPageRestoreReserve(page);
+ dissolve = HPageDissolve(page);
+ ClearHPageDissolve(page);

/*
* If HPageRestoreReserve was set on page, page allocation consumed a
@@ -1729,6 +1737,11 @@ void free_huge_page(struct page *page)
remove_hugetlb_page(h, page, true);
spin_unlock_irqrestore(&hugetlb_lock, flags);
update_and_free_page(h, page, true);
+ } else if (dissolve) {
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
+ if (llist_add((struct llist_node *)&page->mapping,
+ &hpage_dissolvelist))
+ schedule_work(&dissolve_hpage_work);
} else {
arch_clear_hugepage_flags(page);
enqueue_huge_page(h, page);
@@ -2771,6 +2784,49 @@ static void replace_hugepage(struct hstate *h, int nid, struct page *old_page,
enqueue_huge_page(h, new_page);
}

+static void dissolve_hpage_workfn(struct work_struct *work)
+{
+ struct llist_node *node;
+
+ node = llist_del_all(&hpage_dissolvelist);
+
+ while (node) {
+ struct page *oldpage, *newpage;
+ struct hstate *h;
+ int nid;
+
+ oldpage = container_of((struct address_space **)node,
+ struct page, mapping);
+ node = node->next;
+ oldpage->mapping = NULL;
+
+ h = page_hstate(oldpage);
+ nid = page_to_nid(oldpage);
+
+ newpage = alloc_replacement_page(h, nid);
+
+ spin_lock_irq(&hugetlb_lock);
+ /* finish freeing oldpage */
+ arch_clear_hugepage_flags(oldpage);
+ enqueue_huge_page(h, oldpage);
+ if (IS_ERR(newpage)) {
+ /* cannot dissolve so just leave free */
+ spin_unlock_irq(&hugetlb_lock);
+ goto next;
+ }
+
+ replace_hugepage(h, nid, oldpage, newpage);
+
+ /*
+ * Pages have been replaced, we can safely free the old one.
+ */
+ spin_unlock_irq(&hugetlb_lock);
+ __update_and_free_page(h, oldpage);
+next:
+ cond_resched();
+ }
+}
+
/*
* alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
* @h: struct hstate old page belongs to
@@ -2803,6 +2859,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
*/
spin_unlock_irq(&hugetlb_lock);
ret = isolate_hugetlb(old_page, list);
+ SetHPageDissolve(old_page);
spin_lock_irq(&hugetlb_lock);
goto free_new;
} else if (!HPageFreed(old_page)) {
@@ -2864,14 +2921,21 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
if (hstate_is_gigantic(h))
return -ENOMEM;

- if (page_count(head) && !isolate_hugetlb(head, list))
+ if (page_count(head) && !isolate_hugetlb(head, list)) {
+ SetHPageDissolve(head);
ret = 0;
- else if (!page_count(head))
+ } else if (!page_count(head)) {
ret = alloc_and_dissolve_huge_page(h, head, list);
-
+ }
return ret;
}

+void sync_hugetlb_dissolve(void)
+{
+ flush_work(&free_hpage_work);
+ flush_work(&dissolve_hpage_work);
+}
+
struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve)
{
diff --git a/mm/migrate.c b/mm/migrate.c
index 6a1597c92261..b6c6123e614c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -141,6 +141,7 @@ void putback_movable_pages(struct list_head *l)

list_for_each_entry_safe(page, page2, l, lru) {
if (unlikely(PageHuge(page))) {
+ ClearHPageDissolve(page);
putback_active_hugepage(page);
continue;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5486d47406e..6bf76bbc0308 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -9235,6 +9235,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
if (ret && ret != -EBUSY)
goto done;
ret = 0;
+ sync_hugetlb_dissolve();

/*
* Pages from [start, end) are within a pageblock_nr_pages
--
2.25.1

2022-09-13 23:45:34

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 03/21] mm/hugetlb: correct demote page offset logic

On Tue, Sep 13, 2022 at 12:54:50PM -0700, Doug Berger wrote:
> With gigantic pages it may not be true that struct page structures
> are contiguous across the entire gigantic page. The mem_map_offset
> function is used here in place of direct pointer arithmetic to
> correct for this.

We're just eliminating mem_map_offset(). Please use nth_page()
instead.

> for (i = 0; i < pages_per_huge_page(h);
> i += pages_per_huge_page(target_hstate)) {
> + subpage = mem_map_offset(page, i);
> if (hstate_is_gigantic(target_hstate))

2022-09-14 00:10:24

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH 01/21] mm/page_isolation: protect cma from isolate_single_pageblock

On 13 Sep 2022, at 15:54, Doug Berger wrote:

> The function set_migratetype_isolate() has special handling for
> pageblocks of MIGRATE_CMA type that protects them from being
> isolated for MIGRATE_MOVABLE requests.
>
> Since isolate_single_pageblock() doesn't receive the migratetype
> argument of start_isolate_page_range() it used the migratetype
> of the pageblock instead of the requested migratetype which
> defeats this MIGRATE_CMA check.
>
> This allows an attempt to create a gigantic page within a CMA
> region to change the migratetype of the first and last pageblocks
> from MIGRATE_CMA to MIGRATE_MOVABLE when they are restored after
> failure, which corrupts the CMA region.
>
> The calls to (un)set_migratetype_isolate() for the first and last
> pageblocks of the start_isolate_page_range() are moved back into
> that function to allow access to its migratetype argument and make
> it easier to see how all of the pageblocks in the range are
> isolated.
>
> Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity")
> Signed-off-by: Doug Berger <[email protected]>
> ---
> mm/page_isolation.c | 75 +++++++++++++++++++++------------------------
> 1 file changed, 35 insertions(+), 40 deletions(-)

Thanks for the fix.

Why not just pass migratetype into isolate_single_pageblock() and use
it when set_migratetype_isolate() is used? That would have much
fewer changes. What is the reason of pulling skip isolation logic out?

Ultimately, I would like to make MIGRATE_ISOLATE a separate bit,
so that migratetype will not be overwritten during page isolation.
Then, set_migratetype_isolate() and start_isolate_page_range()
will not have migratetype to set in error recovery any more.
That is on my TODO.

>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 9d73dc38e3d7..8e16aa22cb61 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -286,8 +286,6 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> * @flags: isolation flags
> * @gfp_flags: GFP flags used for migrating pages
> * @isolate_before: isolate the pageblock before the boundary_pfn
> - * @skip_isolation: the flag to skip the pageblock isolation in second
> - * isolate_single_pageblock()
> *
> * Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one
> * pageblock. When not all pageblocks within a page are isolated at the same
> @@ -302,9 +300,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> * the in-use page then splitting the free page.
> */
> static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
> - gfp_t gfp_flags, bool isolate_before, bool skip_isolation)
> + gfp_t gfp_flags, bool isolate_before)
> {
> - unsigned char saved_mt;
> unsigned long start_pfn;
> unsigned long isolate_pageblock;
> unsigned long pfn;
> @@ -328,18 +325,6 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
> start_pfn = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
> zone->zone_start_pfn);
>
> - saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
> -
> - if (skip_isolation)
> - VM_BUG_ON(!is_migrate_isolate(saved_mt));
> - else {
> - ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
> - isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
> -
> - if (ret)
> - return ret;
> - }
> -
> /*
> * Bail out early when the to-be-isolated pageblock does not form
> * a free or in-use page across boundary_pfn:
> @@ -428,7 +413,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
> ret = set_migratetype_isolate(page, page_mt,
> flags, head_pfn, head_pfn + nr_pages);
> if (ret)
> - goto failed;
> + return ret;
> }
>
> ret = __alloc_contig_migrate_range(&cc, head_pfn,
> @@ -443,7 +428,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
> unset_migratetype_isolate(page, page_mt);
>
> if (ret)
> - goto failed;
> + return -EBUSY;
> /*
> * reset pfn to the head of the free page, so
> * that the free page handling code above can split
> @@ -459,24 +444,19 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
> while (!PageBuddy(pfn_to_page(outer_pfn))) {
> /* stop if we cannot find the free page */
> if (++order >= MAX_ORDER)
> - goto failed;
> + return -EBUSY;
> outer_pfn &= ~0UL << order;
> }
> pfn = outer_pfn;
> continue;
> } else
> #endif
> - goto failed;
> + return -EBUSY;
> }
>
> pfn++;
> }
> return 0;
> -failed:
> - /* restore the original migratetype */
> - if (!skip_isolation)
> - unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
> - return -EBUSY;
> }
>
> /**
> @@ -534,21 +514,30 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> unsigned long isolate_start = ALIGN_DOWN(start_pfn, pageblock_nr_pages);
> unsigned long isolate_end = ALIGN(end_pfn, pageblock_nr_pages);
> int ret;
> - bool skip_isolation = false;
>
> /* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
> - ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false, skip_isolation);
> + ret = set_migratetype_isolate(pfn_to_page(isolate_start), migratetype,
> + flags, isolate_start, isolate_start + pageblock_nr_pages);
> if (ret)
> return ret;
> -
> - if (isolate_start == isolate_end - pageblock_nr_pages)
> - skip_isolation = true;
> + ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false);
> + if (ret)
> + goto unset_start_block;
>
> /* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
> - ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true, skip_isolation);
> + pfn = isolate_end - pageblock_nr_pages;
> + if (isolate_start != pfn) {
> + ret = set_migratetype_isolate(pfn_to_page(pfn), migratetype,
> + flags, pfn, pfn + pageblock_nr_pages);
> + if (ret)
> + goto unset_start_block;
> + }
> + ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true);
> if (ret) {
> - unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
> - return ret;
> + if (isolate_start != pfn)
> + goto unset_end_block;
> + else
> + goto unset_start_block;
> }
>
> /* skip isolated pageblocks at the beginning and end */
> @@ -557,15 +546,21 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> pfn += pageblock_nr_pages) {
> page = __first_valid_page(pfn, pageblock_nr_pages);
> if (page && set_migratetype_isolate(page, migratetype, flags,
> - start_pfn, end_pfn)) {
> - undo_isolate_page_range(isolate_start, pfn, migratetype);
> - unset_migratetype_isolate(
> - pfn_to_page(isolate_end - pageblock_nr_pages),
> - migratetype);
> - return -EBUSY;
> - }
> + start_pfn, end_pfn))
> + goto unset_isolated_blocks;
> }
> return 0;
> +
> +unset_isolated_blocks:
> + ret = -EBUSY;
> + undo_isolate_page_range(isolate_start + pageblock_nr_pages, pfn,
> + migratetype);
> +unset_end_block:
> + unset_migratetype_isolate(pfn_to_page(isolate_end - pageblock_nr_pages),
> + migratetype);
> +unset_start_block:
> + unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
> + return ret;
> }
>
> /*
> --
> 2.25.1


--
Best Regards,
Yan, Zi



2022-09-14 01:13:07

by Doug Berger

[permalink] [raw]
Subject: Re: [PATCH 01/21] mm/page_isolation: protect cma from isolate_single_pageblock

On 9/13/2022 5:02 PM, Zi Yan wrote:
> On 13 Sep 2022, at 15:54, Doug Berger wrote:
>
>> The function set_migratetype_isolate() has special handling for
>> pageblocks of MIGRATE_CMA type that protects them from being
>> isolated for MIGRATE_MOVABLE requests.
>>
>> Since isolate_single_pageblock() doesn't receive the migratetype
>> argument of start_isolate_page_range() it used the migratetype
>> of the pageblock instead of the requested migratetype which
>> defeats this MIGRATE_CMA check.
>>
>> This allows an attempt to create a gigantic page within a CMA
>> region to change the migratetype of the first and last pageblocks
>> from MIGRATE_CMA to MIGRATE_MOVABLE when they are restored after
>> failure, which corrupts the CMA region.
>>
>> The calls to (un)set_migratetype_isolate() for the first and last
>> pageblocks of the start_isolate_page_range() are moved back into
>> that function to allow access to its migratetype argument and make
>> it easier to see how all of the pageblocks in the range are
>> isolated.
>>
>> Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity")
>> Signed-off-by: Doug Berger <[email protected]>
>> ---
>> mm/page_isolation.c | 75 +++++++++++++++++++++------------------------
>> 1 file changed, 35 insertions(+), 40 deletions(-)
>
> Thanks for the fix.
Thanks for the review.

>
> Why not just pass migratetype into isolate_single_pageblock() and use
> it when set_migratetype_isolate() is used? That would have much
> fewer changes. What is the reason of pulling skip isolation logic out?
I found the skip_isolation logic confusing and thought that setting and
restoring the migratetype within the same function and consolidating the
error recovery paths also within that function was easier to understand
and less prone to accidental breakage.

In particular, setting MIGRATE_ISOLATE in isolate_single_pageblock() and
having to remember to unset it in start_isolate_page_range() differently
on different error paths was troublesome for me.

It could certainly be done differently, but this was my preference.

>
> Ultimately, I would like to make MIGRATE_ISOLATE a separate bit,
> so that migratetype will not be overwritten during page isolation.
> Then, set_migratetype_isolate() and start_isolate_page_range()
> will not have migratetype to set in error recovery any more.
> That is on my TODO.
>
>>
>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>> index 9d73dc38e3d7..8e16aa22cb61 100644
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -286,8 +286,6 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>> * @flags: isolation flags
>> * @gfp_flags: GFP flags used for migrating pages
>> * @isolate_before: isolate the pageblock before the boundary_pfn
>> - * @skip_isolation: the flag to skip the pageblock isolation in second
>> - * isolate_single_pageblock()
>> *
>> * Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one
>> * pageblock. When not all pageblocks within a page are isolated at the same
>> @@ -302,9 +300,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>> * the in-use page then splitting the free page.
>> */
>> static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>> - gfp_t gfp_flags, bool isolate_before, bool skip_isolation)
>> + gfp_t gfp_flags, bool isolate_before)
>> {
>> - unsigned char saved_mt;
>> unsigned long start_pfn;
>> unsigned long isolate_pageblock;
>> unsigned long pfn;
>> @@ -328,18 +325,6 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>> start_pfn = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
>> zone->zone_start_pfn);
>>
>> - saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
>> -
>> - if (skip_isolation)
>> - VM_BUG_ON(!is_migrate_isolate(saved_mt));
>> - else {
>> - ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
>> - isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
>> -
>> - if (ret)
>> - return ret;
>> - }
>> -
>> /*
>> * Bail out early when the to-be-isolated pageblock does not form
>> * a free or in-use page across boundary_pfn:
>> @@ -428,7 +413,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>> ret = set_migratetype_isolate(page, page_mt,
>> flags, head_pfn, head_pfn + nr_pages);
>> if (ret)
>> - goto failed;
>> + return ret;
>> }
>>
>> ret = __alloc_contig_migrate_range(&cc, head_pfn,
>> @@ -443,7 +428,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>> unset_migratetype_isolate(page, page_mt);
>>
>> if (ret)
>> - goto failed;
>> + return -EBUSY;
>> /*
>> * reset pfn to the head of the free page, so
>> * that the free page handling code above can split
>> @@ -459,24 +444,19 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>> while (!PageBuddy(pfn_to_page(outer_pfn))) {
>> /* stop if we cannot find the free page */
>> if (++order >= MAX_ORDER)
>> - goto failed;
>> + return -EBUSY;
>> outer_pfn &= ~0UL << order;
>> }
>> pfn = outer_pfn;
>> continue;
>> } else
>> #endif
>> - goto failed;
>> + return -EBUSY;
>> }
>>
>> pfn++;
>> }
>> return 0;
>> -failed:
>> - /* restore the original migratetype */
>> - if (!skip_isolation)
>> - unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
>> - return -EBUSY;
>> }
>>
>> /**
>> @@ -534,21 +514,30 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>> unsigned long isolate_start = ALIGN_DOWN(start_pfn, pageblock_nr_pages);
>> unsigned long isolate_end = ALIGN(end_pfn, pageblock_nr_pages);
>> int ret;
>> - bool skip_isolation = false;
>>
>> /* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
>> - ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false, skip_isolation);
>> + ret = set_migratetype_isolate(pfn_to_page(isolate_start), migratetype,
>> + flags, isolate_start, isolate_start + pageblock_nr_pages);
>> if (ret)
>> return ret;
>> -
>> - if (isolate_start == isolate_end - pageblock_nr_pages)
>> - skip_isolation = true;
>> + ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false);
>> + if (ret)
>> + goto unset_start_block;
>>
>> /* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
>> - ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true, skip_isolation);
>> + pfn = isolate_end - pageblock_nr_pages;
>> + if (isolate_start != pfn) {
>> + ret = set_migratetype_isolate(pfn_to_page(pfn), migratetype,
>> + flags, pfn, pfn + pageblock_nr_pages);
>> + if (ret)
>> + goto unset_start_block;
>> + }
>> + ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true);
>> if (ret) {
>> - unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
>> - return ret;
>> + if (isolate_start != pfn)
>> + goto unset_end_block;
>> + else
>> + goto unset_start_block;
>> }
>>
>> /* skip isolated pageblocks at the beginning and end */
>> @@ -557,15 +546,21 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>> pfn += pageblock_nr_pages) {
>> page = __first_valid_page(pfn, pageblock_nr_pages);
>> if (page && set_migratetype_isolate(page, migratetype, flags,
>> - start_pfn, end_pfn)) {
>> - undo_isolate_page_range(isolate_start, pfn, migratetype);
>> - unset_migratetype_isolate(
>> - pfn_to_page(isolate_end - pageblock_nr_pages),
>> - migratetype);
>> - return -EBUSY;
>> - }
>> + start_pfn, end_pfn))
>> + goto unset_isolated_blocks;
>> }
>> return 0;
>> +
>> +unset_isolated_blocks:
>> + ret = -EBUSY;
>> + undo_isolate_page_range(isolate_start + pageblock_nr_pages, pfn,
>> + migratetype);
>> +unset_end_block:
>> + unset_migratetype_isolate(pfn_to_page(isolate_end - pageblock_nr_pages),
>> + migratetype);
>> +unset_start_block:
>> + unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
>> + return ret;
>> }
>>
>> /*
>> --
>> 2.25.1
>
>
> --
> Best Regards,
> Yan, Zi

2022-09-14 01:39:29

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH 01/21] mm/page_isolation: protect cma from isolate_single_pageblock

On 13 Sep 2022, at 20:59, Doug Berger wrote:

> On 9/13/2022 5:02 PM, Zi Yan wrote:
>> On 13 Sep 2022, at 15:54, Doug Berger wrote:
>>
>>> The function set_migratetype_isolate() has special handling for
>>> pageblocks of MIGRATE_CMA type that protects them from being
>>> isolated for MIGRATE_MOVABLE requests.
>>>
>>> Since isolate_single_pageblock() doesn't receive the migratetype
>>> argument of start_isolate_page_range() it used the migratetype
>>> of the pageblock instead of the requested migratetype which
>>> defeats this MIGRATE_CMA check.
>>>
>>> This allows an attempt to create a gigantic page within a CMA
>>> region to change the migratetype of the first and last pageblocks
>>> from MIGRATE_CMA to MIGRATE_MOVABLE when they are restored after
>>> failure, which corrupts the CMA region.
>>>
>>> The calls to (un)set_migratetype_isolate() for the first and last
>>> pageblocks of the start_isolate_page_range() are moved back into
>>> that function to allow access to its migratetype argument and make
>>> it easier to see how all of the pageblocks in the range are
>>> isolated.
>>>
>>> Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity")
>>> Signed-off-by: Doug Berger <[email protected]>
>>> ---
>>> mm/page_isolation.c | 75 +++++++++++++++++++++------------------------
>>> 1 file changed, 35 insertions(+), 40 deletions(-)
>>
>> Thanks for the fix.
> Thanks for the review.
>
>>
>> Why not just pass migratetype into isolate_single_pageblock() and use
>> it when set_migratetype_isolate() is used? That would have much
>> fewer changes. What is the reason of pulling skip isolation logic out?
> I found the skip_isolation logic confusing and thought that setting and restoring the migratetype within the same function and consolidating the error recovery paths also within that function was easier to understand and less prone to accidental breakage.
>
> In particular, setting MIGRATE_ISOLATE in isolate_single_pageblock() and having to remember to unset it in start_isolate_page_range() differently on different error paths was troublesome for me.

Wouldn't this work as well?

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index c1307d1bea81..a312cabd0d95 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -288,6 +288,7 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
* @isolate_before: isolate the pageblock before the boundary_pfn
* @skip_isolation: the flag to skip the pageblock isolation in second
* isolate_single_pageblock()
+ * @migratetype: Migrate type to set in error recovery.
*
* Free and in-use pages can be as big as MAX_ORDER and contain more than one
* pageblock. When not all pageblocks within a page are isolated at the same
@@ -302,9 +303,9 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
* the in-use page then splitting the free page.
*/
static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
- gfp_t gfp_flags, bool isolate_before, bool skip_isolation)
+ gfp_t gfp_flags, bool isolate_before, bool skip_isolation,
+ int migratetype)
{
- unsigned char saved_mt;
unsigned long start_pfn;
unsigned long isolate_pageblock;
unsigned long pfn;
@@ -328,12 +329,10 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
start_pfn = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
zone->zone_start_pfn);

- saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
-
if (skip_isolation)
- VM_BUG_ON(!is_migrate_isolate(saved_mt));
+ VM_BUG_ON(!is_migrate_isolate(get_pageblock_migratetype(pfn_to_page(isolate_pageblock))));
else {
- ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
+ ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), migratetype, flags,
isolate_pageblock, isolate_pageblock + pageblock_nr_pages);

if (ret)
@@ -475,7 +474,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
failed:
/* restore the original migratetype */
if (!skip_isolation)
- unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
+ unset_migratetype_isolate(pfn_to_page(isolate_pageblock), migratetype);
return -EBUSY;
}

@@ -537,7 +536,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
bool skip_isolation = false;

/* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
- ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false, skip_isolation);
+ ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false,
+ skip_isolation, migratetype);
if (ret)
return ret;

@@ -545,7 +545,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
skip_isolation = true;

/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
- ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true, skip_isolation);
+ ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
+ skip_isolation, migratetype);
if (ret) {
unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
return ret;

>
> It could certainly be done differently, but this was my preference.

A smaller patch can make review easier, right?

>>
>> Ultimately, I would like to make MIGRATE_ISOLATE a separate bit,
>> so that migratetype will not be overwritten during page isolation.
>> Then, set_migratetype_isolate() and start_isolate_page_range()
>> will not have migratetype to set in error recovery any more.
>> That is on my TODO.
>>
>>>
>>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>>> index 9d73dc38e3d7..8e16aa22cb61 100644
>>> --- a/mm/page_isolation.c
>>> +++ b/mm/page_isolation.c
>>> @@ -286,8 +286,6 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>> * @flags: isolation flags
>>> * @gfp_flags: GFP flags used for migrating pages
>>> * @isolate_before: isolate the pageblock before the boundary_pfn
>>> - * @skip_isolation: the flag to skip the pageblock isolation in second
>>> - * isolate_single_pageblock()
>>> *
>>> * Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one
>>> * pageblock. When not all pageblocks within a page are isolated at the same
>>> @@ -302,9 +300,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>> * the in-use page then splitting the free page.
>>> */
>>> static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>> - gfp_t gfp_flags, bool isolate_before, bool skip_isolation)
>>> + gfp_t gfp_flags, bool isolate_before)
>>> {
>>> - unsigned char saved_mt;
>>> unsigned long start_pfn;
>>> unsigned long isolate_pageblock;
>>> unsigned long pfn;
>>> @@ -328,18 +325,6 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>> start_pfn = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
>>> zone->zone_start_pfn);
>>>
>>> - saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
>>> -
>>> - if (skip_isolation)
>>> - VM_BUG_ON(!is_migrate_isolate(saved_mt));
>>> - else {
>>> - ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
>>> - isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
>>> -
>>> - if (ret)
>>> - return ret;
>>> - }
>>> -
>>> /*
>>> * Bail out early when the to-be-isolated pageblock does not form
>>> * a free or in-use page across boundary_pfn:
>>> @@ -428,7 +413,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>> ret = set_migratetype_isolate(page, page_mt,
>>> flags, head_pfn, head_pfn + nr_pages);
>>> if (ret)
>>> - goto failed;
>>> + return ret;
>>> }
>>>
>>> ret = __alloc_contig_migrate_range(&cc, head_pfn,
>>> @@ -443,7 +428,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>> unset_migratetype_isolate(page, page_mt);
>>>
>>> if (ret)
>>> - goto failed;
>>> + return -EBUSY;
>>> /*
>>> * reset pfn to the head of the free page, so
>>> * that the free page handling code above can split
>>> @@ -459,24 +444,19 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>> while (!PageBuddy(pfn_to_page(outer_pfn))) {
>>> /* stop if we cannot find the free page */
>>> if (++order >= MAX_ORDER)
>>> - goto failed;
>>> + return -EBUSY;
>>> outer_pfn &= ~0UL << order;
>>> }
>>> pfn = outer_pfn;
>>> continue;
>>> } else
>>> #endif
>>> - goto failed;
>>> + return -EBUSY;
>>> }
>>>
>>> pfn++;
>>> }
>>> return 0;
>>> -failed:
>>> - /* restore the original migratetype */
>>> - if (!skip_isolation)
>>> - unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
>>> - return -EBUSY;
>>> }
>>>
>>> /**
>>> @@ -534,21 +514,30 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>> unsigned long isolate_start = ALIGN_DOWN(start_pfn, pageblock_nr_pages);
>>> unsigned long isolate_end = ALIGN(end_pfn, pageblock_nr_pages);
>>> int ret;
>>> - bool skip_isolation = false;
>>>
>>> /* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
>>> - ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false, skip_isolation);
>>> + ret = set_migratetype_isolate(pfn_to_page(isolate_start), migratetype,
>>> + flags, isolate_start, isolate_start + pageblock_nr_pages);
>>> if (ret)
>>> return ret;
>>> -
>>> - if (isolate_start == isolate_end - pageblock_nr_pages)
>>> - skip_isolation = true;
>>> + ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false);
>>> + if (ret)
>>> + goto unset_start_block;
>>>
>>> /* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
>>> - ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true, skip_isolation);
>>> + pfn = isolate_end - pageblock_nr_pages;
>>> + if (isolate_start != pfn) {
>>> + ret = set_migratetype_isolate(pfn_to_page(pfn), migratetype,
>>> + flags, pfn, pfn + pageblock_nr_pages);
>>> + if (ret)
>>> + goto unset_start_block;
>>> + }
>>> + ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true);
>>> if (ret) {
>>> - unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
>>> - return ret;
>>> + if (isolate_start != pfn)
>>> + goto unset_end_block;
>>> + else
>>> + goto unset_start_block;
>>> }
>>>
>>> /* skip isolated pageblocks at the beginning and end */
>>> @@ -557,15 +546,21 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>> pfn += pageblock_nr_pages) {
>>> page = __first_valid_page(pfn, pageblock_nr_pages);
>>> if (page && set_migratetype_isolate(page, migratetype, flags,
>>> - start_pfn, end_pfn)) {
>>> - undo_isolate_page_range(isolate_start, pfn, migratetype);
>>> - unset_migratetype_isolate(
>>> - pfn_to_page(isolate_end - pageblock_nr_pages),
>>> - migratetype);
>>> - return -EBUSY;
>>> - }
>>> + start_pfn, end_pfn))
>>> + goto unset_isolated_blocks;
>>> }
>>> return 0;
>>> +
>>> +unset_isolated_blocks:
>>> + ret = -EBUSY;
>>> + undo_isolate_page_range(isolate_start + pageblock_nr_pages, pfn,
>>> + migratetype);
>>> +unset_end_block:
>>> + unset_migratetype_isolate(pfn_to_page(isolate_end - pageblock_nr_pages),
>>> + migratetype);
>>> +unset_start_block:
>>> + unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
>>> + return ret;
>>> }
>>>
>>> /*
>>> --
>>> 2.25.1
>>
>>
>> --
>> Best Regards,
>> Yan, Zi


--
Best Regards,
Yan, Zi



2022-09-14 01:48:28

by Doug Berger

[permalink] [raw]
Subject: Re: [PATCH 03/21] mm/hugetlb: correct demote page offset logic

On 9/13/2022 4:34 PM, Matthew Wilcox wrote:
> On Tue, Sep 13, 2022 at 12:54:50PM -0700, Doug Berger wrote:
>> With gigantic pages it may not be true that struct page structures
>> are contiguous across the entire gigantic page. The mem_map_offset
>> function is used here in place of direct pointer arithmetic to
>> correct for this.
>
> We're just eliminating mem_map_offset(). Please use nth_page()
> instead.
That's good to know. I will include that in v2.

>
>> for (i = 0; i < pages_per_huge_page(h);
>> i += pages_per_huge_page(target_hstate)) {
>> + subpage = mem_map_offset(page, i);
>> if (hstate_is_gigantic(target_hstate))

2022-09-14 02:10:37

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH 01/21] mm/page_isolation: protect cma from isolate_single_pageblock

On 13 Sep 2022, at 21:47, Doug Berger wrote:

> On 9/13/2022 6:09 PM, Zi Yan wrote:
>> On 13 Sep 2022, at 20:59, Doug Berger wrote:
>>
>>> On 9/13/2022 5:02 PM, Zi Yan wrote:
>>>> On 13 Sep 2022, at 15:54, Doug Berger wrote:
>>>>
>>>>> The function set_migratetype_isolate() has special handling for
>>>>> pageblocks of MIGRATE_CMA type that protects them from being
>>>>> isolated for MIGRATE_MOVABLE requests.
>>>>>
>>>>> Since isolate_single_pageblock() doesn't receive the migratetype
>>>>> argument of start_isolate_page_range() it used the migratetype
>>>>> of the pageblock instead of the requested migratetype which
>>>>> defeats this MIGRATE_CMA check.
>>>>>
>>>>> This allows an attempt to create a gigantic page within a CMA
>>>>> region to change the migratetype of the first and last pageblocks
>>>>> from MIGRATE_CMA to MIGRATE_MOVABLE when they are restored after
>>>>> failure, which corrupts the CMA region.
>>>>>
>>>>> The calls to (un)set_migratetype_isolate() for the first and last
>>>>> pageblocks of the start_isolate_page_range() are moved back into
>>>>> that function to allow access to its migratetype argument and make
>>>>> it easier to see how all of the pageblocks in the range are
>>>>> isolated.
>>>>>
>>>>> Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity")
>>>>> Signed-off-by: Doug Berger <[email protected]>
>>>>> ---
>>>>> mm/page_isolation.c | 75 +++++++++++++++++++++------------------------
>>>>> 1 file changed, 35 insertions(+), 40 deletions(-)
>>>>
>>>> Thanks for the fix.
>>> Thanks for the review.
>>>
>>>>
>>>> Why not just pass migratetype into isolate_single_pageblock() and use
>>>> it when set_migratetype_isolate() is used? That would have much
>>>> fewer changes. What is the reason of pulling skip isolation logic out?
>>> I found the skip_isolation logic confusing and thought that setting and restoring the migratetype within the same function and consolidating the error recovery paths also within that function was easier to understand and less prone to accidental breakage.
>>>
>>> In particular, setting MIGRATE_ISOLATE in isolate_single_pageblock() and having to remember to unset it in start_isolate_page_range() differently on different error paths was troublesome for me.
>>
>> Wouldn't this work as well?
>>
>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>> index c1307d1bea81..a312cabd0d95 100644
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -288,6 +288,7 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>> * @isolate_before: isolate the pageblock before the boundary_pfn
>> * @skip_isolation: the flag to skip the pageblock isolation in second
>> * isolate_single_pageblock()
>> + * @migratetype: Migrate type to set in error recovery.
>> *
>> * Free and in-use pages can be as big as MAX_ORDER and contain more than one
>> * pageblock. When not all pageblocks within a page are isolated at the same
>> @@ -302,9 +303,9 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>> * the in-use page then splitting the free page.
>> */
>> static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>> - gfp_t gfp_flags, bool isolate_before, bool skip_isolation)
>> + gfp_t gfp_flags, bool isolate_before, bool skip_isolation,
>> + int migratetype)
>> {
>> - unsigned char saved_mt;
>> unsigned long start_pfn;
>> unsigned long isolate_pageblock;
>> unsigned long pfn;
>> @@ -328,12 +329,10 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>> start_pfn = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
>> zone->zone_start_pfn);
>>
>> - saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
>> -
>> if (skip_isolation)
>> - VM_BUG_ON(!is_migrate_isolate(saved_mt));
>> + VM_BUG_ON(!is_migrate_isolate(get_pageblock_migratetype(pfn_to_page(isolate_pageblock))));
>> else {
>> - ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
>> + ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), migratetype, flags,
>> isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
>>
>> if (ret)
>> @@ -475,7 +474,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>> failed:
>> /* restore the original migratetype */
>> if (!skip_isolation)
>> - unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
>> + unset_migratetype_isolate(pfn_to_page(isolate_pageblock), migratetype);
>> return -EBUSY;
>> }
>>
>> @@ -537,7 +536,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>> bool skip_isolation = false;
>>
>> /* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
>> - ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false, skip_isolation);
>> + ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false,
>> + skip_isolation, migratetype);
>> if (ret)
>> return ret;
>>
>> @@ -545,7 +545,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>> skip_isolation = true;
>>
>> /* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
>> - ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true, skip_isolation);
>> + ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
>> + skip_isolation, migratetype);
>> if (ret) {
>> unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
>> return ret;
>>
> I would expect this to work as well, but it is not my preference.
>
>>>
>>> It could certainly be done differently, but this was my preference.
>>
>> A smaller patch can make review easier, right?
> It certainly can. Especially when it is for code that you are familiar with ;).
>
> I am happy to have you submit a patch to fix this issue and submit it to stable for backporting. Fixing the issue is what's important to me.
>

I can submit the above as a patch. Is there a user-visible issue that
means we need to backport it? Thanks.

>>
>>>>
>>>> Ultimately, I would like to make MIGRATE_ISOLATE a separate bit,
>>>> so that migratetype will not be overwritten during page isolation.
>>>> Then, set_migratetype_isolate() and start_isolate_page_range()
>>>> will not have migratetype to set in error recovery any more.
>>>> That is on my TODO.
>>>>
>>>>>
>>>>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>>>>> index 9d73dc38e3d7..8e16aa22cb61 100644
>>>>> --- a/mm/page_isolation.c
>>>>> +++ b/mm/page_isolation.c
>>>>> @@ -286,8 +286,6 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>>>> * @flags: isolation flags
>>>>> * @gfp_flags: GFP flags used for migrating pages
>>>>> * @isolate_before: isolate the pageblock before the boundary_pfn
>>>>> - * @skip_isolation: the flag to skip the pageblock isolation in second
>>>>> - * isolate_single_pageblock()
>>>>> *
>>>>> * Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one
>>>>> * pageblock. When not all pageblocks within a page are isolated at the same
>>>>> @@ -302,9 +300,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>>>> * the in-use page then splitting the free page.
>>>>> */
>>>>> static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>>> - gfp_t gfp_flags, bool isolate_before, bool skip_isolation)
>>>>> + gfp_t gfp_flags, bool isolate_before)
>>>>> {
>>>>> - unsigned char saved_mt;
>>>>> unsigned long start_pfn;
>>>>> unsigned long isolate_pageblock;
>>>>> unsigned long pfn;
>>>>> @@ -328,18 +325,6 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>>> start_pfn = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
>>>>> zone->zone_start_pfn);
>>>>>
>>>>> - saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
>>>>> -
>>>>> - if (skip_isolation)
>>>>> - VM_BUG_ON(!is_migrate_isolate(saved_mt));
>>>>> - else {
>>>>> - ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
>>>>> - isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
>>>>> -
>>>>> - if (ret)
>>>>> - return ret;
>>>>> - }
>>>>> -
>>>>> /*
>>>>> * Bail out early when the to-be-isolated pageblock does not form
>>>>> * a free or in-use page across boundary_pfn:
>>>>> @@ -428,7 +413,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>>> ret = set_migratetype_isolate(page, page_mt,
>>>>> flags, head_pfn, head_pfn + nr_pages);
>>>>> if (ret)
>>>>> - goto failed;
>>>>> + return ret;
>>>>> }
>>>>>
>>>>> ret = __alloc_contig_migrate_range(&cc, head_pfn,
>>>>> @@ -443,7 +428,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>>> unset_migratetype_isolate(page, page_mt);
>>>>>
>>>>> if (ret)
>>>>> - goto failed;
>>>>> + return -EBUSY;
>>>>> /*
>>>>> * reset pfn to the head of the free page, so
>>>>> * that the free page handling code above can split
>>>>> @@ -459,24 +444,19 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>>> while (!PageBuddy(pfn_to_page(outer_pfn))) {
>>>>> /* stop if we cannot find the free page */
>>>>> if (++order >= MAX_ORDER)
>>>>> - goto failed;
>>>>> + return -EBUSY;
>>>>> outer_pfn &= ~0UL << order;
>>>>> }
>>>>> pfn = outer_pfn;
>>>>> continue;
>>>>> } else
>>>>> #endif
>>>>> - goto failed;
>>>>> + return -EBUSY;
>>>>> }
>>>>>
>>>>> pfn++;
>>>>> }
>>>>> return 0;
>>>>> -failed:
>>>>> - /* restore the original migratetype */
>>>>> - if (!skip_isolation)
>>>>> - unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
>>>>> - return -EBUSY;
>>>>> }
>>>>>
>>>>> /**
>>>>> @@ -534,21 +514,30 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>>>> unsigned long isolate_start = ALIGN_DOWN(start_pfn, pageblock_nr_pages);
>>>>> unsigned long isolate_end = ALIGN(end_pfn, pageblock_nr_pages);
>>>>> int ret;
>>>>> - bool skip_isolation = false;
>>>>>
>>>>> /* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
>>>>> - ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false, skip_isolation);
>>>>> + ret = set_migratetype_isolate(pfn_to_page(isolate_start), migratetype,
>>>>> + flags, isolate_start, isolate_start + pageblock_nr_pages);
>>>>> if (ret)
>>>>> return ret;
>>>>> -
>>>>> - if (isolate_start == isolate_end - pageblock_nr_pages)
>>>>> - skip_isolation = true;
>>>>> + ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false);
>>>>> + if (ret)
>>>>> + goto unset_start_block;
>>>>>
>>>>> /* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
>>>>> - ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true, skip_isolation);
>>>>> + pfn = isolate_end - pageblock_nr_pages;
>>>>> + if (isolate_start != pfn) {
>>>>> + ret = set_migratetype_isolate(pfn_to_page(pfn), migratetype,
>>>>> + flags, pfn, pfn + pageblock_nr_pages);
>>>>> + if (ret)
>>>>> + goto unset_start_block;
>>>>> + }
>>>>> + ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true);
>>>>> if (ret) {
>>>>> - unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
>>>>> - return ret;
>>>>> + if (isolate_start != pfn)
>>>>> + goto unset_end_block;
>>>>> + else
>>>>> + goto unset_start_block;
>>>>> }
>>>>>
>>>>> /* skip isolated pageblocks at the beginning and end */
>>>>> @@ -557,15 +546,21 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>>>> pfn += pageblock_nr_pages) {
>>>>> page = __first_valid_page(pfn, pageblock_nr_pages);
>>>>> if (page && set_migratetype_isolate(page, migratetype, flags,
>>>>> - start_pfn, end_pfn)) {
>>>>> - undo_isolate_page_range(isolate_start, pfn, migratetype);
>>>>> - unset_migratetype_isolate(
>>>>> - pfn_to_page(isolate_end - pageblock_nr_pages),
>>>>> - migratetype);
>>>>> - return -EBUSY;
>>>>> - }
>>>>> + start_pfn, end_pfn))
>>>>> + goto unset_isolated_blocks;
>>>>> }
>>>>> return 0;
>>>>> +
>>>>> +unset_isolated_blocks:
>>>>> + ret = -EBUSY;
>>>>> + undo_isolate_page_range(isolate_start + pageblock_nr_pages, pfn,
>>>>> + migratetype);
>>>>> +unset_end_block:
>>>>> + unset_migratetype_isolate(pfn_to_page(isolate_end - pageblock_nr_pages),
>>>>> + migratetype);
>>>>> +unset_start_block:
>>>>> + unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
>>>>> + return ret;
>>>>> }
>>>>>
>>>>> /*
>>>>> --
>>>>> 2.25.1
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Yan, Zi
>>
>>
>> --
>> Best Regards,
>> Yan, Zi
> Thanks for your efforts to get alloc_contig_range to work at pageblock granularity!
> -Doug


--
Best Regards,
Yan, Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature
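
To make the MIGRATE_CMA protection discussed above concrete: the page
isolation path only allows a CMA pageblock to be isolated when the
request itself comes from a CMA allocation. A simplified paraphrase of
that guard (not the exact kernel code, which lives in the
unmovable-page scan used by set_migratetype_isolate()):

/*
 * 'migratetype' is the type requested by the caller of
 * start_isolate_page_range(), not the pageblock's current type.
 * Passing the pageblock's own type (MIGRATE_CMA for a CMA region),
 * as isolate_single_pageblock() did, makes this check always succeed
 * and so defeats the protection.
 */
static bool cma_block_may_be_isolated(struct page *page, int migratetype)
{
	if (is_migrate_cma_page(page))
		return is_migrate_cma(migratetype);	/* CMA callers only */
	return true;
}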

2022-09-14 02:44:11

by Doug Berger

[permalink] [raw]
Subject: Re: [PATCH 01/21] mm/page_isolation: protect cma from isolate_single_pageblock

On 9/13/2022 6:09 PM, Zi Yan wrote:
> On 13 Sep 2022, at 20:59, Doug Berger wrote:
>
>> On 9/13/2022 5:02 PM, Zi Yan wrote:
>>> On 13 Sep 2022, at 15:54, Doug Berger wrote:
>>>
>>>> The function set_migratetype_isolate() has special handling for
>>>> pageblocks of MIGRATE_CMA type that protects them from being
>>>> isolated for MIGRATE_MOVABLE requests.
>>>>
>>>> Since isolate_single_pageblock() doesn't receive the migratetype
>>>> argument of start_isolate_page_range() it used the migratetype
>>>> of the pageblock instead of the requested migratetype which
>>>> defeats this MIGRATE_CMA check.
>>>>
>>>> This allows an attempt to create a gigantic page within a CMA
>>>> region to change the migratetype of the first and last pageblocks
>>>> from MIGRATE_CMA to MIGRATE_MOVABLE when they are restored after
>>>> failure, which corrupts the CMA region.
>>>>
>>>> The calls to (un)set_migratetype_isolate() for the first and last
>>>> pageblocks of the start_isolate_page_range() are moved back into
>>>> that function to allow access to its migratetype argument and make
>>>> it easier to see how all of the pageblocks in the range are
>>>> isolated.
>>>>
>>>> Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity")
>>>> Signed-off-by: Doug Berger <[email protected]>
>>>> ---
>>>> mm/page_isolation.c | 75 +++++++++++++++++++++------------------------
>>>> 1 file changed, 35 insertions(+), 40 deletions(-)
>>>
>>> Thanks for the fix.
>> Thanks for the review.
>>
>>>
>>> Why not just pass migratetype into isolate_single_pageblock() and use
>>> it when set_migratetype_isolate() is used? That would have much
>>> fewer changes. What is the reason of pulling skip isolation logic out?
>> I found the skip_isolation logic confusing and thought that setting and restoring the migratetype within the same function and consolidating the error recovery paths also within that function was easier to understand and less prone to accidental breakage.
>>
>> In particular, setting MIGRATE_ISOLATE in isolate_single_pageblock() and having to remember to unset it in start_isolate_page_range() differently on different error paths was troublesome for me.
>
> Wouldn't this work as well?
>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index c1307d1bea81..a312cabd0d95 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -288,6 +288,7 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> * @isolate_before: isolate the pageblock before the boundary_pfn
> * @skip_isolation: the flag to skip the pageblock isolation in second
> * isolate_single_pageblock()
> + * @migratetype: Migrate type to set in error recovery.
> *
> * Free and in-use pages can be as big as MAX_ORDER and contain more than one
> * pageblock. When not all pageblocks within a page are isolated at the same
> @@ -302,9 +303,9 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> * the in-use page then splitting the free page.
> */
> static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
> - gfp_t gfp_flags, bool isolate_before, bool skip_isolation)
> + gfp_t gfp_flags, bool isolate_before, bool skip_isolation,
> + int migratetype)
> {
> - unsigned char saved_mt;
> unsigned long start_pfn;
> unsigned long isolate_pageblock;
> unsigned long pfn;
> @@ -328,12 +329,10 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
> start_pfn = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
> zone->zone_start_pfn);
>
> - saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
> -
> if (skip_isolation)
> - VM_BUG_ON(!is_migrate_isolate(saved_mt));
> + VM_BUG_ON(!is_migrate_isolate(get_pageblock_migratetype(pfn_to_page(isolate_pageblock))));
> else {
> - ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
> + ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), migratetype, flags,
> isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
>
> if (ret)
> @@ -475,7 +474,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
> failed:
> /* restore the original migratetype */
> if (!skip_isolation)
> - unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
> + unset_migratetype_isolate(pfn_to_page(isolate_pageblock), migratetype);
> return -EBUSY;
> }
>
> @@ -537,7 +536,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> bool skip_isolation = false;
>
> /* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
> - ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false, skip_isolation);
> + ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false,
> + skip_isolation, migratetype);
> if (ret)
> return ret;
>
> @@ -545,7 +545,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> skip_isolation = true;
>
> /* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
> - ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true, skip_isolation);
> + ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
> + skip_isolation, migratetype);
> if (ret) {
> unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
> return ret;
>
I would expect this to work as well, but it is not my preference.

>>
>> It could certainly be done differently, but this was my preference.
>
> A smaller patch can make review easier, right?
It certainly can. Especially when it is for code that you are familiar
with ;).

I am happy to have you submit a patch to fix this issue and submit it to
stable for backporting. Fixing the issue is what's important to me.

>
>>>
>>> Ultimately, I would like to make MIGRATE_ISOLATE a separate bit,
>>> so that migratetype will not be overwritten during page isolation.
>>> Then, set_migratetype_isolate() and start_isolate_page_range()
>>> will not have migratetype to set in error recovery any more.
>>> That is on my TODO.
>>>
>>>>
>>>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>>>> index 9d73dc38e3d7..8e16aa22cb61 100644
>>>> --- a/mm/page_isolation.c
>>>> +++ b/mm/page_isolation.c
>>>> @@ -286,8 +286,6 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>>> * @flags: isolation flags
>>>> * @gfp_flags: GFP flags used for migrating pages
>>>> * @isolate_before: isolate the pageblock before the boundary_pfn
>>>> - * @skip_isolation: the flag to skip the pageblock isolation in second
>>>> - * isolate_single_pageblock()
>>>> *
>>>> * Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one
>>>> * pageblock. When not all pageblocks within a page are isolated at the same
>>>> @@ -302,9 +300,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>>> * the in-use page then splitting the free page.
>>>> */
>>>> static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>> - gfp_t gfp_flags, bool isolate_before, bool skip_isolation)
>>>> + gfp_t gfp_flags, bool isolate_before)
>>>> {
>>>> - unsigned char saved_mt;
>>>> unsigned long start_pfn;
>>>> unsigned long isolate_pageblock;
>>>> unsigned long pfn;
>>>> @@ -328,18 +325,6 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>> start_pfn = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
>>>> zone->zone_start_pfn);
>>>>
>>>> - saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
>>>> -
>>>> - if (skip_isolation)
>>>> - VM_BUG_ON(!is_migrate_isolate(saved_mt));
>>>> - else {
>>>> - ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
>>>> - isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
>>>> -
>>>> - if (ret)
>>>> - return ret;
>>>> - }
>>>> -
>>>> /*
>>>> * Bail out early when the to-be-isolated pageblock does not form
>>>> * a free or in-use page across boundary_pfn:
>>>> @@ -428,7 +413,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>> ret = set_migratetype_isolate(page, page_mt,
>>>> flags, head_pfn, head_pfn + nr_pages);
>>>> if (ret)
>>>> - goto failed;
>>>> + return ret;
>>>> }
>>>>
>>>> ret = __alloc_contig_migrate_range(&cc, head_pfn,
>>>> @@ -443,7 +428,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>> unset_migratetype_isolate(page, page_mt);
>>>>
>>>> if (ret)
>>>> - goto failed;
>>>> + return -EBUSY;
>>>> /*
>>>> * reset pfn to the head of the free page, so
>>>> * that the free page handling code above can split
>>>> @@ -459,24 +444,19 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>> while (!PageBuddy(pfn_to_page(outer_pfn))) {
>>>> /* stop if we cannot find the free page */
>>>> if (++order >= MAX_ORDER)
>>>> - goto failed;
>>>> + return -EBUSY;
>>>> outer_pfn &= ~0UL << order;
>>>> }
>>>> pfn = outer_pfn;
>>>> continue;
>>>> } else
>>>> #endif
>>>> - goto failed;
>>>> + return -EBUSY;
>>>> }
>>>>
>>>> pfn++;
>>>> }
>>>> return 0;
>>>> -failed:
>>>> - /* restore the original migratetype */
>>>> - if (!skip_isolation)
>>>> - unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
>>>> - return -EBUSY;
>>>> }
>>>>
>>>> /**
>>>> @@ -534,21 +514,30 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>>> unsigned long isolate_start = ALIGN_DOWN(start_pfn, pageblock_nr_pages);
>>>> unsigned long isolate_end = ALIGN(end_pfn, pageblock_nr_pages);
>>>> int ret;
>>>> - bool skip_isolation = false;
>>>>
>>>> /* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
>>>> - ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false, skip_isolation);
>>>> + ret = set_migratetype_isolate(pfn_to_page(isolate_start), migratetype,
>>>> + flags, isolate_start, isolate_start + pageblock_nr_pages);
>>>> if (ret)
>>>> return ret;
>>>> -
>>>> - if (isolate_start == isolate_end - pageblock_nr_pages)
>>>> - skip_isolation = true;
>>>> + ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false);
>>>> + if (ret)
>>>> + goto unset_start_block;
>>>>
>>>> /* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
>>>> - ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true, skip_isolation);
>>>> + pfn = isolate_end - pageblock_nr_pages;
>>>> + if (isolate_start != pfn) {
>>>> + ret = set_migratetype_isolate(pfn_to_page(pfn), migratetype,
>>>> + flags, pfn, pfn + pageblock_nr_pages);
>>>> + if (ret)
>>>> + goto unset_start_block;
>>>> + }
>>>> + ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true);
>>>> if (ret) {
>>>> - unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
>>>> - return ret;
>>>> + if (isolate_start != pfn)
>>>> + goto unset_end_block;
>>>> + else
>>>> + goto unset_start_block;
>>>> }
>>>>
>>>> /* skip isolated pageblocks at the beginning and end */
>>>> @@ -557,15 +546,21 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>>> pfn += pageblock_nr_pages) {
>>>> page = __first_valid_page(pfn, pageblock_nr_pages);
>>>> if (page && set_migratetype_isolate(page, migratetype, flags,
>>>> - start_pfn, end_pfn)) {
>>>> - undo_isolate_page_range(isolate_start, pfn, migratetype);
>>>> - unset_migratetype_isolate(
>>>> - pfn_to_page(isolate_end - pageblock_nr_pages),
>>>> - migratetype);
>>>> - return -EBUSY;
>>>> - }
>>>> + start_pfn, end_pfn))
>>>> + goto unset_isolated_blocks;
>>>> }
>>>> return 0;
>>>> +
>>>> +unset_isolated_blocks:
>>>> + ret = -EBUSY;
>>>> + undo_isolate_page_range(isolate_start + pageblock_nr_pages, pfn,
>>>> + migratetype);
>>>> +unset_end_block:
>>>> + unset_migratetype_isolate(pfn_to_page(isolate_end - pageblock_nr_pages),
>>>> + migratetype);
>>>> +unset_start_block:
>>>> + unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
>>>> + return ret;
>>>> }
>>>>
>>>> /*
>>>> --
>>>> 2.25.1
>>>
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
>
>
> --
> Best Regards,
> Yan, Zi
Thanks for your efforts to get alloc_contig_range to work at pageblock
granularity!
-Doug

2022-09-14 13:48:45

by Rob Herring

[permalink] [raw]
Subject: Re: [PATCH 00/21] mm: introduce Designated Movable Blocks

On Tue, Sep 13, 2022 at 2:57 PM Doug Berger <[email protected]> wrote:
>
> MOTIVATION:
> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory
> controllers with each mapped in a different address range within
> a Uniform Memory Architecture. Some users of these systems have
> expressed the desire to locate ZONE_MOVABLE memory on each
> memory controller to allow user space intensive processing to
> make better use of the additional memory bandwidth.
> Unfortunately, the historical monotonic layout of zones would
> mean that if the lowest addressed memory controller contains
> ZONE_MOVABLE memory then all of the memory available from
> memory controllers at higher addresses must also be in the
> ZONE_MOVABLE zone. This would force all kernel memory accesses
> onto the lowest addressed memory controller and significantly
> reduce the amount of memory available for non-movable
> allocations.

Why are you sending kernel patches to the Devicetree specification list?

Rob

2022-09-14 15:11:04

by Rob Herring (Arm)

[permalink] [raw]
Subject: Re: [PATCH 16/21] dt-bindings: reserved-memory: introduce designated-movable-block

On Tue, Sep 13, 2022 at 12:55:03PM -0700, Doug Berger wrote:
> Introduce designated-movable-block.yaml to document the
> devicetree binding for Designated Movable Block children of the
> reserved-memory node.

What is a Designated Movable Block? This patch needs to stand on its
own.

Why does this belong or need to be in DT?

>
> Signed-off-by: Doug Berger <[email protected]>
> ---
> .../designated-movable-block.yaml | 51 +++++++++++++++++++
> 1 file changed, 51 insertions(+)
> create mode 100644 Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml
>
> diff --git a/Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml b/Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml
> new file mode 100644
> index 000000000000..42f846069a2e
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml
> @@ -0,0 +1,51 @@
> +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/reserved-memory/designated-movable-block.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: /reserved-memory Designated Movable Block node binding
> +
> +maintainers:
> + - [email protected]
> +
> +allOf:
> + - $ref: "reserved-memory.yaml"
> +
> +properties:
> + compatible:
> + const: designated-movable-block
> + description:
> + This indicates a region of memory meant to be placed into
> + ZONE_MOVABLE.

Don't put Linuxisms into bindings.

> +
> +unevaluatedProperties: false
> +
> +required:
> + - compatible
> + - reusable
> +
> +examples:
> + - |
> + reserved-memory {
> + #address-cells = <0x2>;
> + #size-cells = <0x2>;
> +
> + DMB0@10800000 {
> + compatible = "designated-movable-block";
> + reusable;
> + reg = <0x0 0x10800000 0x0 0x2d800000>;
> + };
> +
> + DMB1@40000000 {
> + compatible = "designated-movable-block";
> + reusable;
> + reg = <0x0 0x40000000 0x0 0x30000000>;
> + };
> +
> + DMB2@80000000 {
> + compatible = "designated-movable-block";
> + reusable;
> + reg = <0x0 0x80000000 0x0 0x2fc00000>;
> + };
> + };
> --
> 2.25.1
>
>

2022-09-14 17:04:17

by Doug Berger

[permalink] [raw]
Subject: Re: [PATCH 00/21] mm: introduce Designated Movable Blocks

On 9/14/2022 6:21 AM, Rob Herring wrote:
> On Tue, Sep 13, 2022 at 2:57 PM Doug Berger <[email protected]> wrote:
>>
>> MOTIVATION:
>> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory
>> controllers with each mapped in a different address range within
>> a Uniform Memory Architecture. Some users of these systems have
>> expressed the desire to locate ZONE_MOVABLE memory on each
>> memory controller to allow user space intensive processing to
>> make better use of the additional memory bandwidth.
>> Unfortunately, the historical monotonic layout of zones would
>> mean that if the lowest addressed memory controller contains
>> ZONE_MOVABLE memory then all of the memory available from
>> memory controllers at higher addresses must also be in the
>> ZONE_MOVABLE zone. This would force all kernel memory accesses
>> onto the lowest addressed memory controller and significantly
>> reduce the amount of memory available for non-movable
>> allocations.
>
> Why are you sending kernel patches to the Devicetree specification list?
>
> Rob
My apologies if this is a problem. No offense was intended.

My process has been to run my patches through get_maintainers.pl to get
the list of addresses to copy on submissions and my
0016-dt-bindings-reserved-memory-introduce-designated-mov.patch
solicited the
'- <[email protected]>' address.

My preference when reviewing is to receive an entire patch set to
understand the context of an individual commit, but I can certainly
understand that others may have different preferences.

It was my understanding that the Devicetree specification list was part
of the kernel (e.g. @vger.kernel.org) and would be willing to receive
patches that might be of relevance to it.

I am inexperienced with yaml and devicetree processes in general so I
have tried to lean on the examples of other reserved-memory node
bindings for help.

There is much to learn and I am happy to modify my process to better
accommodate your needs.

Regards,
Doug

2022-09-14 17:32:40

by Doug Berger

[permalink] [raw]
Subject: Re: [PATCH 01/21] mm/page_isolation: protect cma from isolate_single_pageblock

On 9/13/2022 6:53 PM, Zi Yan wrote:
> On 13 Sep 2022, at 21:47, Doug Berger wrote:
>
>> On 9/13/2022 6:09 PM, Zi Yan wrote:
>>> On 13 Sep 2022, at 20:59, Doug Berger wrote:
>>>
>>>> On 9/13/2022 5:02 PM, Zi Yan wrote:
>>>>> On 13 Sep 2022, at 15:54, Doug Berger wrote:
>>>>>
>>>>>> The function set_migratetype_isolate() has special handling for
>>>>>> pageblocks of MIGRATE_CMA type that protects them from being
>>>>>> isolated for MIGRATE_MOVABLE requests.
>>>>>>
>>>>>> Since isolate_single_pageblock() doesn't receive the migratetype
>>>>>> argument of start_isolate_page_range() it used the migratetype
>>>>>> of the pageblock instead of the requested migratetype which
>>>>>> defeats this MIGRATE_CMA check.
>>>>>>
>>>>>> This allows an attempt to create a gigantic page within a CMA
>>>>>> region to change the migratetype of the first and last pageblocks
>>>>>> from MIGRATE_CMA to MIGRATE_MOVABLE when they are restored after
>>>>>> failure, which corrupts the CMA region.
>>>>>>
>>>>>> The calls to (un)set_migratetype_isolate() for the first and last
>>>>>> pageblocks of the start_isolate_page_range() are moved back into
>>>>>> that function to allow access to its migratetype argument and make
>>>>>> it easier to see how all of the pageblocks in the range are
>>>>>> isolated.
>>>>>>
>>>>>> Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity")
>>>>>> Signed-off-by: Doug Berger <[email protected]>
>>>>>> ---
>>>>>> mm/page_isolation.c | 75 +++++++++++++++++++++------------------------
>>>>>> 1 file changed, 35 insertions(+), 40 deletions(-)
>>>>>
>>>>> Thanks for the fix.
>>>> Thanks for the review.
>>>>
>>>>>
>>>>> Why not just pass migratetype into isolate_single_pageblock() and use
>>>>> it when set_migratetype_isolate() is used? That would have much
>>>>> fewer changes. What is the reason of pulling skip isolation logic out?
>>>> I found the skip_isolation logic confusing and thought that setting and restoring the migratetype within the same function and consolidating the error recovery paths also within that function was easier to understand and less prone to accidental breakage.
>>>>
>>>> In particular, setting MIGRATE_ISOLATE in isolate_single_pageblock() and having to remember to unset it in start_isolate_page_range() differently on different error paths was troublesome for me.
>>>
>>> Wouldn't this work as well?
>>>
>>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>>> index c1307d1bea81..a312cabd0d95 100644
>>> --- a/mm/page_isolation.c
>>> +++ b/mm/page_isolation.c
>>> @@ -288,6 +288,7 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>> * @isolate_before: isolate the pageblock before the boundary_pfn
>>> * @skip_isolation: the flag to skip the pageblock isolation in second
>>> * isolate_single_pageblock()
>>> + * @migratetype: Migrate type to set in error recovery.
>>> *
>>> * Free and in-use pages can be as big as MAX_ORDER and contain more than one
>>> * pageblock. When not all pageblocks within a page are isolated at the same
>>> @@ -302,9 +303,9 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>> * the in-use page then splitting the free page.
>>> */
>>> static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>> - gfp_t gfp_flags, bool isolate_before, bool skip_isolation)
>>> + gfp_t gfp_flags, bool isolate_before, bool skip_isolation,
>>> + int migratetype)
>>> {
>>> - unsigned char saved_mt;
>>> unsigned long start_pfn;
>>> unsigned long isolate_pageblock;
>>> unsigned long pfn;
>>> @@ -328,12 +329,10 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>> start_pfn = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
>>> zone->zone_start_pfn);
>>>
>>> - saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
>>> -
>>> if (skip_isolation)
>>> - VM_BUG_ON(!is_migrate_isolate(saved_mt));
>>> + VM_BUG_ON(!is_migrate_isolate(get_pageblock_migratetype(pfn_to_page(isolate_pageblock))));
>>> else {
>>> - ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
>>> + ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), migratetype, flags,
>>> isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
>>>
>>> if (ret)
>>> @@ -475,7 +474,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>> failed:
>>> /* restore the original migratetype */
>>> if (!skip_isolation)
>>> - unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
>>> + unset_migratetype_isolate(pfn_to_page(isolate_pageblock), migratetype);
>>> return -EBUSY;
>>> }
>>>
>>> @@ -537,7 +536,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>> bool skip_isolation = false;
>>>
>>> /* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
>>> - ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false, skip_isolation);
>>> + ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false,
>>> + skip_isolation, migratetype);
>>> if (ret)
>>> return ret;
>>>
>>> @@ -545,7 +545,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>> skip_isolation = true;
>>>
>>> /* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
>>> - ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true, skip_isolation);
>>> + ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
>>> + skip_isolation, migratetype);
>>> if (ret) {
>>> unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
>>> return ret;
>>>
>> I would expect this to work as well, but it is not my preference.
>>
>>>>
>>>> It could certainly be done differently, but this was my preference.
>>>
>>> A smaller patch can make review easier, right?
>> It certainly can. Especially when it is for code that you are familiar with ;).
>>
>> I am happy to have you submit a patch to fix this issue and submit it to stable for backporting. Fixing the issue is what's important to me.
>>
>
> I can submit the above as a patch. Is there a visible userspace issue, so that we need to
> backport it? Thanks.
I did not observe user-visible symptoms, but I did observe the
migratetype corruption itself when allocating gigantic huge pages
through hugetlbfs on a system with CMA regions.

My best guess is that it probably does not create a "functional" problem
since the error would likely be cancelled out by subsequent CMA
allocations restoring the pageblock migratetype. However, in the
meantime the page allocator would handle free pages in those pageblocks
without the MIGRATE_CMA qualifications which might impact driver
performance. There might be other problems of which I am unaware.

The issue currently only exists in the wild in v5.19, so it would be
nice to get it backported there to nip it in the bud.

>
>>>
>>>>>
>>>>> Ultimately, I would like to make MIGRATE_ISOLATE a separate bit,
>>>>> so that migratetype will not be overwritten during page isolation.
>>>>> Then, set_migratetype_isolate() and start_isolate_page_range()
>>>>> will not have migratetype to set in error recovery any more.
>>>>> That is on my TODO.
>>>>>
>>>>>>
>>>>>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>>>>>> index 9d73dc38e3d7..8e16aa22cb61 100644
>>>>>> --- a/mm/page_isolation.c
>>>>>> +++ b/mm/page_isolation.c
>>>>>> @@ -286,8 +286,6 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>>>>> * @flags: isolation flags
>>>>>> * @gfp_flags: GFP flags used for migrating pages
>>>>>> * @isolate_before: isolate the pageblock before the boundary_pfn
>>>>>> - * @skip_isolation: the flag to skip the pageblock isolation in second
>>>>>> - * isolate_single_pageblock()
>>>>>> *
>>>>>> * Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one
>>>>>> * pageblock. When not all pageblocks within a page are isolated at the same
>>>>>> @@ -302,9 +300,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>>>>> * the in-use page then splitting the free page.
>>>>>> */
>>>>>> static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>>>> - gfp_t gfp_flags, bool isolate_before, bool skip_isolation)
>>>>>> + gfp_t gfp_flags, bool isolate_before)
>>>>>> {
>>>>>> - unsigned char saved_mt;
>>>>>> unsigned long start_pfn;
>>>>>> unsigned long isolate_pageblock;
>>>>>> unsigned long pfn;
>>>>>> @@ -328,18 +325,6 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>>>> start_pfn = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
>>>>>> zone->zone_start_pfn);
>>>>>>
>>>>>> - saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
>>>>>> -
>>>>>> - if (skip_isolation)
>>>>>> - VM_BUG_ON(!is_migrate_isolate(saved_mt));
>>>>>> - else {
>>>>>> - ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
>>>>>> - isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
>>>>>> -
>>>>>> - if (ret)
>>>>>> - return ret;
>>>>>> - }
>>>>>> -
>>>>>> /*
>>>>>> * Bail out early when the to-be-isolated pageblock does not form
>>>>>> * a free or in-use page across boundary_pfn:
>>>>>> @@ -428,7 +413,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>>>> ret = set_migratetype_isolate(page, page_mt,
>>>>>> flags, head_pfn, head_pfn + nr_pages);
>>>>>> if (ret)
>>>>>> - goto failed;
>>>>>> + return ret;
>>>>>> }
>>>>>>
>>>>>> ret = __alloc_contig_migrate_range(&cc, head_pfn,
>>>>>> @@ -443,7 +428,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>>>> unset_migratetype_isolate(page, page_mt);
>>>>>>
>>>>>> if (ret)
>>>>>> - goto failed;
>>>>>> + return -EBUSY;
>>>>>> /*
>>>>>> * reset pfn to the head of the free page, so
>>>>>> * that the free page handling code above can split
>>>>>> @@ -459,24 +444,19 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>>>> while (!PageBuddy(pfn_to_page(outer_pfn))) {
>>>>>> /* stop if we cannot find the free page */
>>>>>> if (++order >= MAX_ORDER)
>>>>>> - goto failed;
>>>>>> + return -EBUSY;
>>>>>> outer_pfn &= ~0UL << order;
>>>>>> }
>>>>>> pfn = outer_pfn;
>>>>>> continue;
>>>>>> } else
>>>>>> #endif
>>>>>> - goto failed;
>>>>>> + return -EBUSY;
>>>>>> }
>>>>>>
>>>>>> pfn++;
>>>>>> }
>>>>>> return 0;
>>>>>> -failed:
>>>>>> - /* restore the original migratetype */
>>>>>> - if (!skip_isolation)
>>>>>> - unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
>>>>>> - return -EBUSY;
>>>>>> }
>>>>>>
>>>>>> /**
>>>>>> @@ -534,21 +514,30 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>>>>> unsigned long isolate_start = ALIGN_DOWN(start_pfn, pageblock_nr_pages);
>>>>>> unsigned long isolate_end = ALIGN(end_pfn, pageblock_nr_pages);
>>>>>> int ret;
>>>>>> - bool skip_isolation = false;
>>>>>>
>>>>>> /* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
>>>>>> - ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false, skip_isolation);
>>>>>> + ret = set_migratetype_isolate(pfn_to_page(isolate_start), migratetype,
>>>>>> + flags, isolate_start, isolate_start + pageblock_nr_pages);
>>>>>> if (ret)
>>>>>> return ret;
>>>>>> -
>>>>>> - if (isolate_start == isolate_end - pageblock_nr_pages)
>>>>>> - skip_isolation = true;
>>>>>> + ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false);
>>>>>> + if (ret)
>>>>>> + goto unset_start_block;
>>>>>>
>>>>>> /* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
>>>>>> - ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true, skip_isolation);
>>>>>> + pfn = isolate_end - pageblock_nr_pages;
>>>>>> + if (isolate_start != pfn) {
>>>>>> + ret = set_migratetype_isolate(pfn_to_page(pfn), migratetype,
>>>>>> + flags, pfn, pfn + pageblock_nr_pages);
>>>>>> + if (ret)
>>>>>> + goto unset_start_block;
>>>>>> + }
>>>>>> + ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true);
>>>>>> if (ret) {
>>>>>> - unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
>>>>>> - return ret;
>>>>>> + if (isolate_start != pfn)
>>>>>> + goto unset_end_block;
>>>>>> + else
>>>>>> + goto unset_start_block;
>>>>>> }
>>>>>>
>>>>>> /* skip isolated pageblocks at the beginning and end */
>>>>>> @@ -557,15 +546,21 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>>>>> pfn += pageblock_nr_pages) {
>>>>>> page = __first_valid_page(pfn, pageblock_nr_pages);
>>>>>> if (page && set_migratetype_isolate(page, migratetype, flags,
>>>>>> - start_pfn, end_pfn)) {
>>>>>> - undo_isolate_page_range(isolate_start, pfn, migratetype);
>>>>>> - unset_migratetype_isolate(
>>>>>> - pfn_to_page(isolate_end - pageblock_nr_pages),
>>>>>> - migratetype);
>>>>>> - return -EBUSY;
>>>>>> - }
>>>>>> + start_pfn, end_pfn))
>>>>>> + goto unset_isolated_blocks;
>>>>>> }
>>>>>> return 0;
>>>>>> +
>>>>>> +unset_isolated_blocks:
>>>>>> + ret = -EBUSY;
>>>>>> + undo_isolate_page_range(isolate_start + pageblock_nr_pages, pfn,
>>>>>> + migratetype);
>>>>>> +unset_end_block:
>>>>>> + unset_migratetype_isolate(pfn_to_page(isolate_end - pageblock_nr_pages),
>>>>>> + migratetype);
>>>>>> +unset_start_block:
>>>>>> + unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
>>>>>> + return ret;
>>>>>> }
>>>>>>
>>>>>> /*
>>>>>> --
>>>>>> 2.25.1
>>>>>
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Yan, Zi
>>>
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
>> Thanks for your efforts to get alloc_contig_range to work at pageblock granularity!
>> -Doug
>
>
> --
> Best Regards,
> Yan, Zi
Thanks,
-Doug
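
One way to observe the corruption described above (a suggested check,
not something stated in the thread): on a kernel with CONFIG_CMA, the
"Number of blocks type" table in /proc/pagetypeinfo reports per-zone
pageblock counts by migratetype, so a failed gigantic page allocation
inside a CMA region that trips this bug shows up as the CMA column
shrinking while the Movable column grows by the same amount.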

2022-09-14 17:52:33

by Florian Fainelli

[permalink] [raw]
Subject: Re: [PATCH 02/21] mm/hugetlb: correct max_huge_pages accounting on demote

On 9/14/22 10:23, Mike Kravetz wrote:
> On 09/13/22 12:54, Doug Berger wrote:
>> When demoting a hugepage to a smaller order, the number of pages
>> added to the target hstate will be the size of the large page
>> divided by the size of the smaller page.
>>
>> Fixes: 8531fc6f52f5 ("hugetlb: add hugetlb demote page support")
>> Signed-off-by: Doug Berger <[email protected]>
>> ---
>> mm/hugetlb.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index e070b8593b37..79949893ac12 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3472,7 +3472,8 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
>> * based on pool changes for the demoted page.
>> */
>> h->max_huge_pages--;
>> - target_hstate->max_huge_pages += pages_per_huge_page(h);
>> + target_hstate->max_huge_pages += pages_per_huge_page(h) /
>> + pages_per_huge_page(target_hstate);
>>
>> return rc;
>> }
>
> This has already been fixed here,
>
> https://lore.kernel.org/linux-mm/[email protected]/
>

Could we add the Fixes tag once Miaohe's patch series gets accepted,
since the offending commit is in v5.16 and beyond? Thanks!
--
Florian

2022-09-14 18:00:03

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 02/21] mm/hugetlb: correct max_huge_pages accounting on demote

On 09/13/22 12:54, Doug Berger wrote:
> When demoting a hugepage to a smaller order, the number of pages
> added to the target hstate will be the size of the large page
> divided by the size of the smaller page.
>
> Fixes: 8531fc6f52f5 ("hugetlb: add hugetlb demote page support")
> Signed-off-by: Doug Berger <[email protected]>
> ---
> mm/hugetlb.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e070b8593b37..79949893ac12 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3472,7 +3472,8 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
> * based on pool changes for the demoted page.
> */
> h->max_huge_pages--;
> - target_hstate->max_huge_pages += pages_per_huge_page(h);
> + target_hstate->max_huge_pages += pages_per_huge_page(h) /
> + pages_per_huge_page(target_hstate);
>
> return rc;
> }

This has already been fixed here,

https://lore.kernel.org/linux-mm/[email protected]/

--
Mike Kravetz
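
A worked example of the accounting both of these fixes address: on
x86-64 or arm64 with 4 KiB base pages, demoting one 1 GiB gigantic page
(262144 base pages) into 2 MiB pages (512 base pages each) should add
262144 / 512 = 512 pages to the target hstate, not 262144 as the
unpatched code would.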

2022-09-14 18:00:04

by Doug Berger

[permalink] [raw]
Subject: Re: [PATCH 16/21] dt-bindings: reserved-memory: introduce designated-movable-block

On 9/14/2022 7:55 AM, Rob Herring wrote:
> On Tue, Sep 13, 2022 at 12:55:03PM -0700, Doug Berger wrote:
>> Introduce designated-movable-block.yaml to document the
>> devicetree binding for Designated Movable Block children of the
>> reserved-memory node.
>
> What is a Designated Movable Block? This patch needs to stand on its
> own.
As noted in my reply to your [PATCH 00/21] comment, my intention in
submitting the entire patch set (and specifically [PATCH 00/21]) was to
communicate this context. Now that I understand that only this patch
should have been submitted to the devicetree-spec mailing list, I will
strive harder to make it more self-contained.

>
> Why does this belong or need to be in DT?
While my preferred method of declaring Designated Movable Blocks is
through the movablecore kernel parameter, I can conceive that others may
wish to take advantage of the reserved-memory DT nodes. In particular,
the DT approach has the advantage that a device can claim ownership of
the reserved memory via the device tree, something that has yet to be
implemented for DMBs defined with movablecore.

>
>>
>> Signed-off-by: Doug Berger <[email protected]>
>> ---
>> .../designated-movable-block.yaml | 51 +++++++++++++++++++
>> 1 file changed, 51 insertions(+)
>> create mode 100644 Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml
>>
>> diff --git a/Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml b/Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml
>> new file mode 100644
>> index 000000000000..42f846069a2e
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml
>> @@ -0,0 +1,51 @@
>> +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
>> +%YAML 1.2
>> +---
>> +$id: http://devicetree.org/schemas/reserved-memory/designated-movable-block.yaml#
>> +$schema: http://devicetree.org/meta-schemas/core.yaml#
>> +
>> +title: /reserved-memory Designated Movable Block node binding
>> +
>> +maintainers:
>> + - [email protected]
>> +
>> +allOf:
>> + - $ref: "reserved-memory.yaml"
>> +
>> +properties:
>> + compatible:
>> + const: designated-movable-block
>> + description:
>> + This indicates a region of memory meant to be placed into
>> + ZONE_MOVABLE.
>
> Don't put Linuxisms into bindings.
I will avoid ZONE_MOVABLE if this commit is included in V2 of this patch
set.
>
>> +
>> +unevaluatedProperties: false
>> +
>> +required:
>> + - compatible
>> + - reusable
>> +
>> +examples:
>> + - |
>> + reserved-memory {
>> + #address-cells = <0x2>;
>> + #size-cells = <0x2>;
>> +
>> + DMB0@10800000 {
>> + compatible = "designated-movable-block";
>> + reusable;
>> + reg = <0x0 0x10800000 0x0 0x2d800000>;
>> + };
>> +
>> + DMB1@40000000 {
>> + compatible = "designated-movable-block";
>> + reusable;
>> + reg = <0x0 0x40000000 0x0 0x30000000>;
>> + };
>> +
>> + DMB2@80000000 {
>> + compatible = "designated-movable-block";
>> + reusable;
>> + reg = <0x0 0x80000000 0x0 0x2fc00000>;
>> + };
>> + };
>> --
>> 2.25.1
>>
>>
Thank you for the review!
-Doug

2022-09-14 18:00:59

by Doug Berger

[permalink] [raw]
Subject: Re: [PATCH 02/21] mm/hugetlb: correct max_huge_pages accounting on demote

On 9/14/2022 10:23 AM, Mike Kravetz wrote:
> On 09/13/22 12:54, Doug Berger wrote:
>> When demoting a hugepage to a smaller order, the number of pages
>> added to the target hstate will be the size of the large page
>> divided by the size of the smaller page.
>>
>> Fixes: 8531fc6f52f5 ("hugetlb: add hugetlb demote page support")
>> Signed-off-by: Doug Berger <[email protected]>
>> ---
>> mm/hugetlb.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index e070b8593b37..79949893ac12 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3472,7 +3472,8 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
>> * based on pool changes for the demoted page.
>> */
>> h->max_huge_pages--;
>> - target_hstate->max_huge_pages += pages_per_huge_page(h);
>> + target_hstate->max_huge_pages += pages_per_huge_page(h) /
>> + pages_per_huge_page(target_hstate);
>>
>> return rc;
>> }
>
> This has already been fixed here,
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
Excellent! Thanks for the pointer and sorry for the noise.
-Doug

2022-09-14 18:07:23

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 18/21] mm/cma: support CMA in Designated Movable Blocks

Hi Doug,

I love your patch! Perhaps something to improve:

[auto build test WARNING on robh/for-next]
[also build test WARNING on linus/master v6.0-rc5]
[cannot apply to akpm-mm/mm-everything next-20220914]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Doug-Berger/mm-introduce-Designated-Movable-Blocks/20220914-040216
base: https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git for-next
config: i386-randconfig-a002 (https://download.01.org/0day-ci/archive/20220915/[email protected]/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/635e919c92ca242c4b900bdfc7e21529e76f2f8e
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Doug-Berger/mm-introduce-Designated-Movable-Blocks/20220914-040216
git checkout 635e919c92ca242c4b900bdfc7e21529e76f2f8e
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

>> mm/page_alloc.c:9236:5: warning: no previous prototype for function '_alloc_contig_range' [-Wmissing-prototypes]
int _alloc_contig_range(unsigned long start, unsigned long end,
^
mm/page_alloc.c:9236:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int _alloc_contig_range(unsigned long start, unsigned long end,
^
static
1 warning generated.


vim +/_alloc_contig_range +9236 mm/page_alloc.c

9235
> 9236 int _alloc_contig_range(unsigned long start, unsigned long end,
9237 unsigned int migratetype, gfp_t gfp_mask)
9238 {
9239 unsigned long outer_start, outer_end;
9240 int order;
9241 int ret = 0;
9242
9243 struct compact_control cc = {
9244 .nr_migratepages = 0,
9245 .order = -1,
9246 .zone = page_zone(pfn_to_page(start)),
9247 .mode = MIGRATE_SYNC,
9248 .ignore_skip_hint = true,
9249 .no_set_skip_hint = true,
9250 .gfp_mask = current_gfp_context(gfp_mask),
9251 .alloc_contig = true,
9252 };
9253 INIT_LIST_HEAD(&cc.migratepages);
9254
9255 /*
9256 * What we do here is we mark all pageblocks in range as
9257 * MIGRATE_ISOLATE. Because pageblock and max order pages may
9258 * have different sizes, and due to the way page allocator
9259 * work, start_isolate_page_range() has special handlings for this.
9260 *
9261 * Once the pageblocks are marked as MIGRATE_ISOLATE, we
9262 * migrate the pages from an unaligned range (ie. pages that
9263 * we are interested in). This will put all the pages in
9264 * range back to page allocator as MIGRATE_ISOLATE.
9265 *
9266 * When this is done, we take the pages in range from page
9267 * allocator removing them from the buddy system. This way
9268 * page allocator will never consider using them.
9269 *
9270 * This lets us mark the pageblocks back as
9271 * MIGRATE_CMA/MIGRATE_MOVABLE so that free pages in the
9272 * aligned range but not in the unaligned, original range are
9273 * put back to page allocator so that buddy can use them.
9274 */
9275
9276 ret = start_isolate_page_range(start, end, migratetype, 0, gfp_mask);
9277 if (ret)
9278 goto done;
9279
9280 drain_all_pages(cc.zone);
9281
9282 /*
9283 * In case of -EBUSY, we'd like to know which page causes problem.
9284 * So, just fall through. test_pages_isolated() has a tracepoint
9285 * which will report the busy page.
9286 *
9287 * It is possible that busy pages could become available before
9288 * the call to test_pages_isolated, and the range will actually be
9289 * allocated. So, if we fall through be sure to clear ret so that
9290 * -EBUSY is not accidentally used or returned to caller.
9291 */
9292 ret = __alloc_contig_migrate_range(&cc, start, end);
9293 if (ret && ret != -EBUSY)
9294 goto done;
9295 ret = 0;
9296 sync_hugetlb_dissolve();
9297
9298 /*
9299 * Pages from [start, end) are within a pageblock_nr_pages
9300 * aligned blocks that are marked as MIGRATE_ISOLATE. What's
9301 * more, all pages in [start, end) are free in page allocator.
9302 * What we are going to do is to allocate all pages from
9303 * [start, end) (that is remove them from page allocator).
9304 *
9305 * The only problem is that pages at the beginning and at the
9306 * end of interesting range may be not aligned with pages that
9307 * page allocator holds, ie. they can be part of higher order
9308 * pages. Because of this, we reserve the bigger range and
9309 * once this is done free the pages we are not interested in.
9310 *
9311 * We don't have to hold zone->lock here because the pages are
9312 * isolated thus they won't get removed from buddy.
9313 */
9314
9315 order = 0;
9316 outer_start = start;
9317 while (!PageBuddy(pfn_to_page(outer_start))) {
9318 if (++order >= MAX_ORDER) {
9319 outer_start = start;
9320 break;
9321 }
9322 outer_start &= ~0UL << order;
9323 }
9324
9325 if (outer_start != start) {
9326 order = buddy_order(pfn_to_page(outer_start));
9327
9328 /*
9329 * outer_start page could be small order buddy page and
9330 * it doesn't include start page. Adjust outer_start
9331 * in this case to report failed page properly
9332 * on tracepoint in test_pages_isolated()
9333 */
9334 if (outer_start + (1UL << order) <= start)
9335 outer_start = start;
9336 }
9337
9338 /* Make sure the range is really isolated. */
9339 if (test_pages_isolated(outer_start, end, 0)) {
9340 ret = -EBUSY;
9341 goto done;
9342 }
9343
9344 /* Grab isolated pages from freelists. */
9345 outer_end = isolate_freepages_range(&cc, outer_start, end);
9346 if (!outer_end) {
9347 ret = -EBUSY;
9348 goto done;
9349 }
9350
9351 /* Free head and tail (if any) */
9352 if (start != outer_start)
9353 free_contig_range(outer_start, start - outer_start);
9354 if (end != outer_end)
9355 free_contig_range(end, outer_end - end);
9356
9357 done:
9358 undo_isolate_page_range(start, end, migratetype);
9359 return ret;
9360 }
9361

--
0-DAY CI Kernel Test Service
https://01.org/lkp
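
As a side note on the robot report: assuming _alloc_contig_range() is
meant to stay internal to mm/page_alloc.c (the series itself is not
shown resolving the warning, so this is only a sketch), the usual fix
for -Wmissing-prototypes is internal linkage; if the helper is meant to
be shared instead, a prototype would go in a common header such as
mm/internal.h.

/* Internal linkage gives W=1 builds a matching declaration and
 * silences -Wmissing-prototypes for a symbol used only in this file.
 */
static int _alloc_contig_range(unsigned long start, unsigned long end,
			       unsigned int migratetype, gfp_t gfp_mask)
{
	/* ... body as in the excerpt above ... */
}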

2022-09-14 18:17:19

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 03/21] mm/hugetlb: correct demote page offset logic

On 09/13/22 18:07, Doug Berger wrote:
> On 9/13/2022 4:34 PM, Matthew Wilcox wrote:
> > On Tue, Sep 13, 2022 at 12:54:50PM -0700, Doug Berger wrote:
> > > With gigantic pages it may not be true that struct page structures
> > > are contiguous across the entire gigantic page. The mem_map_offset
> > > function is used here in place of direct pointer arithmetic to
> > > correct for this.
> >
> > We're just eliminating mem_map_offset(). Please use nth_page()
> > instead.
> That's good to know. I will include that in v2.

Thanks Doug and Matthew. I will take a closer look at this series soon.

It seems like this patch is a fix independent of the series. If so, I
would suggest sending separate to make it easy for backports to stable.
--
Mike Kravetz

2022-09-14 18:20:37

by Rob Herring

[permalink] [raw]
Subject: Re: [PATCH 00/21] mm: introduce Designated Movable Blocks

On Wed, Sep 14, 2022 at 11:57 AM Doug Berger <[email protected]> wrote:
>
> On 9/14/2022 6:21 AM, Rob Herring wrote:
> > On Tue, Sep 13, 2022 at 2:57 PM Doug Berger <[email protected]> wrote:
> >>
> >> MOTIVATION:
> >> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory
> >> controllers with each mapped in a different address range within
> >> a Uniform Memory Architecture. Some users of these systems have
> >> expressed the desire to locate ZONE_MOVABLE memory on each
> >> memory controller to allow user space intensive processing to
> >> make better use of the additional memory bandwidth.
> >> Unfortunately, the historical monotonic layout of zones would
> >> mean that if the lowest addressed memory controller contains
> >> ZONE_MOVABLE memory then all of the memory available from
> >> memory controllers at higher addresses must also be in the
> >> ZONE_MOVABLE zone. This would force all kernel memory accesses
> >> onto the lowest addressed memory controller and significantly
> >> reduce the amount of memory available for non-movable
> >> allocations.
> >
> > Why are you sending kernel patches to the Devicetree specification list?
> >
> > Rob
> My apologies if this is a problem. No offense was intended.

None taken. Just trying to keep a low traffic list low traffic.

> My process has been to run my patches through get_maintainers.pl to get
> the list of addresses to copy on submissions and my
> 0016-dt-bindings-reserved-memory-introduce-designated-mov.patch
> solicited the
> '- <[email protected]>' address.

Yeah, I see that now. That needs to be a person for a specific
binding. The only bindings using the list should be targeting the
dtschema repo. (And even those are a person ideally.)

Rob

2022-09-14 18:34:09

by Doug Berger

[permalink] [raw]
Subject: Re: [PATCH 03/21] mm/hugetlb: correct demote page offset logic

On 9/14/2022 10:08 AM, Mike Kravetz wrote:
> On 09/13/22 18:07, Doug Berger wrote:
>> On 9/13/2022 4:34 PM, Matthew Wilcox wrote:
>>> On Tue, Sep 13, 2022 at 12:54:50PM -0700, Doug Berger wrote:
>>>> With gigantic pages it may not be true that struct page structures
>>>> are contiguous across the entire gigantic page. The mem_map_offset
>>>> function is used here in place of direct pointer arithmetic to
>>>> correct for this.
>>>
>>> We're just eliminating mem_map_offset(). Please use nth_page()
>>> instead.
>> That's good to know. I will include that in v2.
>
> Thanks Doug and Matthew. I will take a closer look at this series soon.
>
> It seems like this patch is a fix independent of the series. If so, I
> would suggest sending separate to make it easy for backports to stable.
Yes, as I noted in [PATCH 00/21] the first three patches fit that
description, but I included them here in case someone was brave enough
to attempt to use this patch set. They were in my branch for my own testing.

Full disclosure: an earlier version of this patch set had more complete
support for hugepage isolation, including migrating the isolation state
when demoting a hugepage. That work touched lines in
demote_free_huge_page() and depended on the subpage variable introduced
here.

At this point I will submit a patch for this on its own and will likely
remove the first three commits when submitting V2 of the set.

Thanks for your consideration.
-Doug
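
For context, the substance of the fix under discussion is how the demote
path steps through the struct pages of a gigantic page. A minimal sketch
of the idea follows; the function name and loop body are illustrative
only, not the exact hunk from the patch:

/*
 * Illustrative sketch. With SPARSEMEM and without VMEMMAP, the struct
 * pages backing a gigantic page are not guaranteed to be virtually
 * contiguous, so plain pointer arithmetic ("page + i") can walk off the
 * end of a memmap section. nth_page() steps via the PFN and is correct
 * in either memory model.
 */
static void demote_offset_sketch(struct hstate *h, struct hstate *target,
				 struct page *page)
{
	unsigned long i;

	for (i = 0; i < pages_per_huge_page(h);
	     i += pages_per_huge_page(target)) {
		struct page *subpage = nth_page(page, i);

		/* ... demote the piece of the gigantic page at subpage ... */
	}
}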

2022-09-14 18:36:28

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 18/21] mm/cma: support CMA in Designated Movable Blocks

Hi Doug,

I love your patch! Perhaps something to improve:

[auto build test WARNING on robh/for-next]
[also build test WARNING on linus/master v6.0-rc5]
[cannot apply to akpm-mm/mm-everything next-20220914]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Doug-Berger/mm-introduce-Designated-Movable-Blocks/20220914-040216
base: https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git for-next
config: i386-randconfig-a001
compiler: gcc-11 (Debian 11.3.0-5) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/635e919c92ca242c4b900bdfc7e21529e76f2f8e
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Doug-Berger/mm-introduce-Designated-Movable-Blocks/20220914-040216
git checkout 635e919c92ca242c4b900bdfc7e21529e76f2f8e
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

>> mm/page_alloc.c:9236:5: warning: no previous prototype for '_alloc_contig_range' [-Wmissing-prototypes]
9236 | int _alloc_contig_range(unsigned long start, unsigned long end,
| ^~~~~~~~~~~~~~~~~~~


vim +/_alloc_contig_range +9236 mm/page_alloc.c

9235
> 9236 int _alloc_contig_range(unsigned long start, unsigned long end,
9237 unsigned int migratetype, gfp_t gfp_mask)
9238 {
9239 unsigned long outer_start, outer_end;
9240 int order;
9241 int ret = 0;
9242
9243 struct compact_control cc = {
9244 .nr_migratepages = 0,
9245 .order = -1,
9246 .zone = page_zone(pfn_to_page(start)),
9247 .mode = MIGRATE_SYNC,
9248 .ignore_skip_hint = true,
9249 .no_set_skip_hint = true,
9250 .gfp_mask = current_gfp_context(gfp_mask),
9251 .alloc_contig = true,
9252 };
9253 INIT_LIST_HEAD(&cc.migratepages);
9254
9255 /*
9256 * What we do here is we mark all pageblocks in range as
9257 * MIGRATE_ISOLATE. Because pageblock and max order pages may
9258 * have different sizes, and due to the way page allocator
9259 * work, start_isolate_page_range() has special handlings for this.
9260 *
9261 * Once the pageblocks are marked as MIGRATE_ISOLATE, we
9262 * migrate the pages from an unaligned range (ie. pages that
9263 * we are interested in). This will put all the pages in
9264 * range back to page allocator as MIGRATE_ISOLATE.
9265 *
9266 * When this is done, we take the pages in range from page
9267 * allocator removing them from the buddy system. This way
9268 * page allocator will never consider using them.
9269 *
9270 * This lets us mark the pageblocks back as
9271 * MIGRATE_CMA/MIGRATE_MOVABLE so that free pages in the
9272 * aligned range but not in the unaligned, original range are
9273 * put back to page allocator so that buddy can use them.
9274 */
9275
9276 ret = start_isolate_page_range(start, end, migratetype, 0, gfp_mask);
9277 if (ret)
9278 goto done;
9279
9280 drain_all_pages(cc.zone);
9281
9282 /*
9283 * In case of -EBUSY, we'd like to know which page causes problem.
9284 * So, just fall through. test_pages_isolated() has a tracepoint
9285 * which will report the busy page.
9286 *
9287 * It is possible that busy pages could become available before
9288 * the call to test_pages_isolated, and the range will actually be
9289 * allocated. So, if we fall through be sure to clear ret so that
9290 * -EBUSY is not accidentally used or returned to caller.
9291 */
9292 ret = __alloc_contig_migrate_range(&cc, start, end);
9293 if (ret && ret != -EBUSY)
9294 goto done;
9295 ret = 0;
9296 sync_hugetlb_dissolve();
9297
9298 /*
9299 * Pages from [start, end) are within a pageblock_nr_pages
9300 * aligned blocks that are marked as MIGRATE_ISOLATE. What's
9301 * more, all pages in [start, end) are free in page allocator.
9302 * What we are going to do is to allocate all pages from
9303 * [start, end) (that is remove them from page allocator).
9304 *
9305 * The only problem is that pages at the beginning and at the
9306 * end of interesting range may be not aligned with pages that
9307 * page allocator holds, ie. they can be part of higher order
9308 * pages. Because of this, we reserve the bigger range and
9309 * once this is done free the pages we are not interested in.
9310 *
9311 * We don't have to hold zone->lock here because the pages are
9312 * isolated thus they won't get removed from buddy.
9313 */
9314
9315 order = 0;
9316 outer_start = start;
9317 while (!PageBuddy(pfn_to_page(outer_start))) {
9318 if (++order >= MAX_ORDER) {
9319 outer_start = start;
9320 break;
9321 }
9322 outer_start &= ~0UL << order;
9323 }
9324
9325 if (outer_start != start) {
9326 order = buddy_order(pfn_to_page(outer_start));
9327
9328 /*
9329 * outer_start page could be small order buddy page and
9330 * it doesn't include start page. Adjust outer_start
9331 * in this case to report failed page properly
9332 * on tracepoint in test_pages_isolated()
9333 */
9334 if (outer_start + (1UL << order) <= start)
9335 outer_start = start;
9336 }
9337
9338 /* Make sure the range is really isolated. */
9339 if (test_pages_isolated(outer_start, end, 0)) {
9340 ret = -EBUSY;
9341 goto done;
9342 }
9343
9344 /* Grab isolated pages from freelists. */
9345 outer_end = isolate_freepages_range(&cc, outer_start, end);
9346 if (!outer_end) {
9347 ret = -EBUSY;
9348 goto done;
9349 }
9350
9351 /* Free head and tail (if any) */
9352 if (start != outer_start)
9353 free_contig_range(outer_start, start - outer_start);
9354 if (end != outer_end)
9355 free_contig_range(end, outer_end - end);
9356
9357 done:
9358 undo_isolate_page_range(start, end, migratetype);
9359 return ret;
9360 }
9361

--
0-DAY CI Kernel Test Service
https://01.org/lkp
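
The warning itself is routine: _alloc_contig_range() has external linkage
but no declaration is visible at its definition. Two conventional
remedies, sketched here on the assumption that the helper is only called
from within mm/page_alloc.c (otherwise a declaration belongs in a shared
header such as mm/internal.h); neither is necessarily what the author will
choose for v2:

/* Option 1: keep the helper file-local. */
static int _alloc_contig_range(unsigned long start, unsigned long end,
			       unsigned int migratetype, gfp_t gfp_mask);

/* Option 2: if other translation units need it, declare it once in a
 * header that page_alloc.c includes (e.g. mm/internal.h).
 */
int _alloc_contig_range(unsigned long start, unsigned long end,
			unsigned int migratetype, gfp_t gfp_mask);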



2022-09-14 19:25:33

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 02/21] mm/hugetlb: correct max_huge_pages accounting on demote

On 09/14/22 10:26, Florian Fainelli wrote:
> On 9/14/22 10:23, Mike Kravetz wrote:
> > On 09/13/22 12:54, Doug Berger wrote:
> > > When demoting a hugepage to a smaller order, the number of pages
> > > added to the target hstate will be the size of the large page
> > > divided by the size of the smaller page.
> > >
> > > Fixes: 8531fc6f52f5 ("hugetlb: add hugetlb demote page support")
> > > Signed-off-by: Doug Berger <[email protected]>
> > > ---
> > > mm/hugetlb.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index e070b8593b37..79949893ac12 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -3472,7 +3472,8 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
> > > * based on pool changes for the demoted page.
> > > */
> > > h->max_huge_pages--;
> > > - target_hstate->max_huge_pages += pages_per_huge_page(h);
> > > + target_hstate->max_huge_pages += pages_per_huge_page(h) /
> > > + pages_per_huge_page(target_hstate);
> > > return rc;
> > > }
> >
> > This has already been fixed here,
> >
> > https://lore.kernel.org/linux-mm/[email protected]/
> >
>
> Could we slap the Fixes tag when this Miaohe's patch series gets accepted
> since the offending commit is in v5.16 and beyond. Thanks!

We could. However, this fix/change does not really have any impact on
the way code functions (unless I am mistaken). See my analysis at,

https://lore.kernel.org/linux-mm/YvwfvxXewnZpHQcz@monkey/

--
Mike Kravetz
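
The arithmetic behind the one-line change is easy to check with concrete
numbers; the sizes below assume x86-64 with 4KiB base pages and are given
purely for illustration:

/*
 * Demoting one 1GiB page into 2MiB pages:
 *   pages_per_huge_page(1GiB hstate) = 262144 base pages
 *   pages_per_huge_page(2MiB hstate) =    512 base pages
 *
 * old: target_hstate->max_huge_pages += 262144;        (counts base pages)
 * new: target_hstate->max_huge_pages += 262144 / 512;  (512 two-MiB pages)
 */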

2022-09-14 21:12:42

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 02/21] mm/hugetlb: correct max_huge_pages accounting on demote

On Wed, 14 Sep 2022 10:23:05 -0700 Mike Kravetz <[email protected]> wrote:

> On 09/13/22 12:54, Doug Berger wrote:
> > When demoting a hugepage to a smaller order, the number of pages
> > added to the target hstate will be the size of the large page
> > divided by the size of the smaller page.
> >
> > Fixes: 8531fc6f52f5 ("hugetlb: add hugetlb demote page support")
> > Signed-off-by: Doug Berger <[email protected]>
> > ---
> > mm/hugetlb.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index e070b8593b37..79949893ac12 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -3472,7 +3472,8 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
> > * based on pool changes for the demoted page.
> > */
> > h->max_huge_pages--;
> > - target_hstate->max_huge_pages += pages_per_huge_page(h);
> > + target_hstate->max_huge_pages += pages_per_huge_page(h) /
> > + pages_per_huge_page(target_hstate);
> >
> > return rc;
> > }
>
> This has already been fixed here,
>
> https://lore.kernel.org/linux-mm/[email protected]/

Neither version tells us the user-visible runtime effects of the change :(

2022-09-14 21:48:15

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 02/21] mm/hugetlb: correct max_huge_pages accounting on demote

On 09/14/22 13:58, Andrew Morton wrote:
> On Wed, 14 Sep 2022 10:23:05 -0700 Mike Kravetz <[email protected]> wrote:
>
> > On 09/13/22 12:54, Doug Berger wrote:
> > > When demoting a hugepage to a smaller order, the number of pages
> > > added to the target hstate will be the size of the large page
> > > divided by the size of the smaller page.
> > >
> > > Fixes: 8531fc6f52f5 ("hugetlb: add hugetlb demote page support")
> > > Signed-off-by: Doug Berger <[email protected]>
> > > ---
> > > mm/hugetlb.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index e070b8593b37..79949893ac12 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -3472,7 +3472,8 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
> > > * based on pool changes for the demoted page.
> > > */
> > > h->max_huge_pages--;
> > > - target_hstate->max_huge_pages += pages_per_huge_page(h);
> > > + target_hstate->max_huge_pages += pages_per_huge_page(h) /
> > > + pages_per_huge_page(target_hstate);
> > >
> > > return rc;
> > > }
> >
> > This has already been fixed here,
> >
> > https://lore.kernel.org/linux-mm/[email protected]/
>
> Neither version tells us the user-visible runtime effects of the change :(

Sorry, I should have pushed harder on this with Miaohe's patch.

There are no user-visible runtime effects. In fact, this change really causes
no functional change (unless I am mistaken and Miaohe did not correct me).
max_huge_pages is not used again until it is reset. See my explanation at:
https://lore.kernel.org/linux-mm/YvwfvxXewnZpHQcz@monkey/
--
Mike Kravetz

2022-09-14 22:42:29

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 18/21] mm/cma: support CMA in Designated Movable Blocks

Hi Doug,

I love your patch! Yet something to improve:

[auto build test ERROR on robh/for-next]
[also build test ERROR on linus/master v6.0-rc5]
[cannot apply to akpm-mm/mm-everything next-20220914]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Doug-Berger/mm-introduce-Designated-Movable-Blocks/20220914-040216
base: https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git for-next
config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20220915/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-5) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/635e919c92ca242c4b900bdfc7e21529e76f2f8e
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Doug-Berger/mm-introduce-Designated-Movable-Blocks/20220914-040216
git checkout 635e919c92ca242c4b900bdfc7e21529e76f2f8e
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

mm/page_alloc.c:9236:5: warning: no previous prototype for '_alloc_contig_range' [-Wmissing-prototypes]
9236 | int _alloc_contig_range(unsigned long start, unsigned long end,
| ^~~~~~~~~~~~~~~~~~~
mm/page_alloc.c: In function 'alloc_contig_range':
>> mm/page_alloc.c:9390:36: error: 'MIGRATE_CMA' undeclared (first use in this function); did you mean 'MIGRATE_SYNC'?
9390 | if (migratetype == MIGRATE_CMA)
| ^~~~~~~~~~~
| MIGRATE_SYNC
mm/page_alloc.c:9390:36: note: each undeclared identifier is reported only once for each function it appears in


vim +9390 mm/page_alloc.c

9361
9362 /**
9363 * alloc_contig_range() -- tries to allocate given range of pages
9364 * @start: start PFN to allocate
9365 * @end: one-past-the-last PFN to allocate
9366 * @migratetype: migratetype of the underlying pageblocks (either
9367 * #MIGRATE_MOVABLE or #MIGRATE_CMA). All pageblocks
9368 * in range must have the same migratetype and it must
9369 * be either of the two.
9370 * @gfp_mask: GFP mask to use during compaction
9371 *
9372 * The PFN range does not have to be pageblock aligned. The PFN range must
9373 * belong to a single zone.
9374 *
9375 * The first thing this routine does is attempt to MIGRATE_ISOLATE all
9376 * pageblocks in the range. Once isolated, the pageblocks should not
9377 * be modified by others.
9378 *
9379 * Return: zero on success or negative error code. On success all
9380 * pages which PFN is in [start, end) are allocated for the caller and
9381 * need to be freed with free_contig_range().
9382 */
9383 int alloc_contig_range(unsigned long start, unsigned long end,
9384 unsigned int migratetype, gfp_t gfp_mask)
9385 {
9386 switch (dmb_intersects(start, end)) {
9387 case DMB_DISJOINT:
9388 break;
9389 case DMB_INTERSECTS:
> 9390 if (migratetype == MIGRATE_CMA)
9391 migratetype = MIGRATE_MOVABLE;
9392 else
9393 return -EBUSY;
9394 break;
9395 default:
9396 return -EBUSY;
9397 }
9398
9399 return _alloc_contig_range(start, end, migratetype, gfp_mask);
9400 }
9401 EXPORT_SYMBOL(alloc_contig_range);
9402

--
0-DAY CI Kernel Test Service
https://01.org/lkp
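
The error comes from referencing the MIGRATE_CMA enumerator in a
configuration with CONFIG_CMA=n, where that enumerator does not exist. One
conventional way to keep the check buildable in both configurations, shown
here only as a sketch and not as the author's eventual fix, is to test
through is_migrate_cma(), which is defined to evaluate to false when CMA
is disabled:

	case DMB_INTERSECTS:
		/*
		 * is_migrate_cma() is constant false with CONFIG_CMA=n, so
		 * this compiles everywhere; with CMA enabled it behaves like
		 * the original "migratetype == MIGRATE_CMA" test.
		 */
		if (is_migrate_cma(migratetype))
			migratetype = MIGRATE_MOVABLE;
		else
			return -EBUSY;
		break;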

2022-09-15 02:12:04

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH 03/21] mm/hugetlb: correct demote page offset logic



> On Sep 14, 2022, at 03:54, Doug Berger <[email protected]> wrote:
>
> With gigantic pages it may not be true that struct page structures
> are contiguous across the entire gigantic page. The mem_map_offset
> function is used here in place of direct pointer arithmetic to
> correct for this.
>
> Fixes: 8531fc6f52f5 ("hugetlb: add hugetlb demote page support")
> Signed-off-by: Doug Berger <[email protected]>

With Matthew’s suggestion.

Acked-by: Muchun Song <[email protected]>

Thanks.

2022-09-16 03:43:23

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 01/21] mm/page_isolation: protect cma from isolate_single_pageblock

Hi Doug,

I love your patch! Perhaps something to improve:

[auto build test WARNING on robh/for-next]
[also build test WARNING on linus/master v6.0-rc5]
[cannot apply to akpm-mm/mm-everything next-20220915]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Doug-Berger/mm-introduce-Designated-Movable-Blocks/20220914-040216
base: https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git for-next
config: x86_64-randconfig-a012 (https://download.01.org/0day-ci/archive/20220916/[email protected]/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/10d000298e8a6b50a40ccc90d0d638105255f6e2
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Doug-Berger/mm-introduce-Designated-Movable-Blocks/20220914-040216
git checkout 10d000298e8a6b50a40ccc90d0d638105255f6e2
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

>> mm/page_isolation.c:309:6: warning: unused variable 'ret' [-Wunused-variable]
int ret;
^
1 warning generated.


vim +/ret +309 mm/page_isolation.c

a5d76b54a3f3a4 KAMEZAWA Hiroyuki 2007-10-16 281
b2c9e2fbba3253 Zi Yan 2022-05-12 282 /**
b2c9e2fbba3253 Zi Yan 2022-05-12 283 * isolate_single_pageblock() -- tries to isolate a pageblock that might be
b2c9e2fbba3253 Zi Yan 2022-05-12 284 * within a free or in-use page.
b2c9e2fbba3253 Zi Yan 2022-05-12 285 * @boundary_pfn: pageblock-aligned pfn that a page might cross
88ee134320b831 Zi Yan 2022-05-24 286 * @flags: isolation flags
b2c9e2fbba3253 Zi Yan 2022-05-12 287 * @gfp_flags: GFP flags used for migrating pages
b2c9e2fbba3253 Zi Yan 2022-05-12 288 * @isolate_before: isolate the pageblock before the boundary_pfn
b2c9e2fbba3253 Zi Yan 2022-05-12 289 *
b2c9e2fbba3253 Zi Yan 2022-05-12 290 * Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one
b2c9e2fbba3253 Zi Yan 2022-05-12 291 * pageblock. When not all pageblocks within a page are isolated at the same
b2c9e2fbba3253 Zi Yan 2022-05-12 292 * time, free page accounting can go wrong. For example, in the case of
b2c9e2fbba3253 Zi Yan 2022-05-12 293 * MAX_ORDER-1 = pageblock_order + 1, a MAX_ORDER-1 page has two pagelbocks.
b2c9e2fbba3253 Zi Yan 2022-05-12 294 * [ MAX_ORDER-1 ]
b2c9e2fbba3253 Zi Yan 2022-05-12 295 * [ pageblock0 | pageblock1 ]
b2c9e2fbba3253 Zi Yan 2022-05-12 296 * When either pageblock is isolated, if it is a free page, the page is not
b2c9e2fbba3253 Zi Yan 2022-05-12 297 * split into separate migratetype lists, which is supposed to; if it is an
b2c9e2fbba3253 Zi Yan 2022-05-12 298 * in-use page and freed later, __free_one_page() does not split the free page
b2c9e2fbba3253 Zi Yan 2022-05-12 299 * either. The function handles this by splitting the free page or migrating
b2c9e2fbba3253 Zi Yan 2022-05-12 300 * the in-use page then splitting the free page.
b2c9e2fbba3253 Zi Yan 2022-05-12 301 */
88ee134320b831 Zi Yan 2022-05-24 302 static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
10d000298e8a6b Doug Berger 2022-09-13 303 gfp_t gfp_flags, bool isolate_before)
b2c9e2fbba3253 Zi Yan 2022-05-12 304 {
b2c9e2fbba3253 Zi Yan 2022-05-12 305 unsigned long start_pfn;
b2c9e2fbba3253 Zi Yan 2022-05-12 306 unsigned long isolate_pageblock;
b2c9e2fbba3253 Zi Yan 2022-05-12 307 unsigned long pfn;
b2c9e2fbba3253 Zi Yan 2022-05-12 308 struct zone *zone;
88ee134320b831 Zi Yan 2022-05-24 @309 int ret;
b2c9e2fbba3253 Zi Yan 2022-05-12 310
b2c9e2fbba3253 Zi Yan 2022-05-12 311 VM_BUG_ON(!IS_ALIGNED(boundary_pfn, pageblock_nr_pages));
b2c9e2fbba3253 Zi Yan 2022-05-12 312
b2c9e2fbba3253 Zi Yan 2022-05-12 313 if (isolate_before)
b2c9e2fbba3253 Zi Yan 2022-05-12 314 isolate_pageblock = boundary_pfn - pageblock_nr_pages;
b2c9e2fbba3253 Zi Yan 2022-05-12 315 else
b2c9e2fbba3253 Zi Yan 2022-05-12 316 isolate_pageblock = boundary_pfn;
b2c9e2fbba3253 Zi Yan 2022-05-12 317
b2c9e2fbba3253 Zi Yan 2022-05-12 318 /*
b2c9e2fbba3253 Zi Yan 2022-05-12 319 * scan at the beginning of MAX_ORDER_NR_PAGES aligned range to avoid
b2c9e2fbba3253 Zi Yan 2022-05-12 320 * only isolating a subset of pageblocks from a bigger than pageblock
b2c9e2fbba3253 Zi Yan 2022-05-12 321 * free or in-use page. Also make sure all to-be-isolated pageblocks
b2c9e2fbba3253 Zi Yan 2022-05-12 322 * are within the same zone.
b2c9e2fbba3253 Zi Yan 2022-05-12 323 */
b2c9e2fbba3253 Zi Yan 2022-05-12 324 zone = page_zone(pfn_to_page(isolate_pageblock));
b2c9e2fbba3253 Zi Yan 2022-05-12 325 start_pfn = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
b2c9e2fbba3253 Zi Yan 2022-05-12 326 zone->zone_start_pfn);
b2c9e2fbba3253 Zi Yan 2022-05-12 327
b2c9e2fbba3253 Zi Yan 2022-05-12 328 /*
b2c9e2fbba3253 Zi Yan 2022-05-12 329 * Bail out early when the to-be-isolated pageblock does not form
b2c9e2fbba3253 Zi Yan 2022-05-12 330 * a free or in-use page across boundary_pfn:
b2c9e2fbba3253 Zi Yan 2022-05-12 331 *
b2c9e2fbba3253 Zi Yan 2022-05-12 332 * 1. isolate before boundary_pfn: the page after is not online
b2c9e2fbba3253 Zi Yan 2022-05-12 333 * 2. isolate after boundary_pfn: the page before is not online
b2c9e2fbba3253 Zi Yan 2022-05-12 334 *
b2c9e2fbba3253 Zi Yan 2022-05-12 335 * This also ensures correctness. Without it, when isolate after
b2c9e2fbba3253 Zi Yan 2022-05-12 336 * boundary_pfn and [start_pfn, boundary_pfn) are not online,
b2c9e2fbba3253 Zi Yan 2022-05-12 337 * __first_valid_page() will return unexpected NULL in the for loop
b2c9e2fbba3253 Zi Yan 2022-05-12 338 * below.
b2c9e2fbba3253 Zi Yan 2022-05-12 339 */
b2c9e2fbba3253 Zi Yan 2022-05-12 340 if (isolate_before) {
b2c9e2fbba3253 Zi Yan 2022-05-12 341 if (!pfn_to_online_page(boundary_pfn))
b2c9e2fbba3253 Zi Yan 2022-05-12 342 return 0;
b2c9e2fbba3253 Zi Yan 2022-05-12 343 } else {
b2c9e2fbba3253 Zi Yan 2022-05-12 344 if (!pfn_to_online_page(boundary_pfn - 1))
b2c9e2fbba3253 Zi Yan 2022-05-12 345 return 0;
b2c9e2fbba3253 Zi Yan 2022-05-12 346 }
b2c9e2fbba3253 Zi Yan 2022-05-12 347
b2c9e2fbba3253 Zi Yan 2022-05-12 348 for (pfn = start_pfn; pfn < boundary_pfn;) {
b2c9e2fbba3253 Zi Yan 2022-05-12 349 struct page *page = __first_valid_page(pfn, boundary_pfn - pfn);
b2c9e2fbba3253 Zi Yan 2022-05-12 350
b2c9e2fbba3253 Zi Yan 2022-05-12 351 VM_BUG_ON(!page);
b2c9e2fbba3253 Zi Yan 2022-05-12 352 pfn = page_to_pfn(page);
b2c9e2fbba3253 Zi Yan 2022-05-12 353 /*
b2c9e2fbba3253 Zi Yan 2022-05-12 354 * start_pfn is MAX_ORDER_NR_PAGES aligned, if there is any
b2c9e2fbba3253 Zi Yan 2022-05-12 355 * free pages in [start_pfn, boundary_pfn), its head page will
b2c9e2fbba3253 Zi Yan 2022-05-12 356 * always be in the range.
b2c9e2fbba3253 Zi Yan 2022-05-12 357 */
b2c9e2fbba3253 Zi Yan 2022-05-12 358 if (PageBuddy(page)) {
b2c9e2fbba3253 Zi Yan 2022-05-12 359 int order = buddy_order(page);
b2c9e2fbba3253 Zi Yan 2022-05-12 360
86d28b0709279c Zi Yan 2022-05-26 361 if (pfn + (1UL << order) > boundary_pfn) {
86d28b0709279c Zi Yan 2022-05-26 362 /* free page changed before split, check it again */
86d28b0709279c Zi Yan 2022-05-26 363 if (split_free_page(page, order, boundary_pfn - pfn))
86d28b0709279c Zi Yan 2022-05-26 364 continue;
86d28b0709279c Zi Yan 2022-05-26 365 }
86d28b0709279c Zi Yan 2022-05-26 366
86d28b0709279c Zi Yan 2022-05-26 367 pfn += 1UL << order;
b2c9e2fbba3253 Zi Yan 2022-05-12 368 continue;
b2c9e2fbba3253 Zi Yan 2022-05-12 369 }
b2c9e2fbba3253 Zi Yan 2022-05-12 370 /*
b2c9e2fbba3253 Zi Yan 2022-05-12 371 * migrate compound pages then let the free page handling code
b2c9e2fbba3253 Zi Yan 2022-05-12 372 * above do the rest. If migration is not possible, just fail.
b2c9e2fbba3253 Zi Yan 2022-05-12 373 */
b2c9e2fbba3253 Zi Yan 2022-05-12 374 if (PageCompound(page)) {
b2c9e2fbba3253 Zi Yan 2022-05-12 375 struct page *head = compound_head(page);
b2c9e2fbba3253 Zi Yan 2022-05-12 376 unsigned long head_pfn = page_to_pfn(head);
547be963c99f1e Zi Yan 2022-05-30 377 unsigned long nr_pages = compound_nr(head);
b2c9e2fbba3253 Zi Yan 2022-05-12 378
88ee134320b831 Zi Yan 2022-05-24 379 if (head_pfn + nr_pages <= boundary_pfn) {
b2c9e2fbba3253 Zi Yan 2022-05-12 380 pfn = head_pfn + nr_pages;
b2c9e2fbba3253 Zi Yan 2022-05-12 381 continue;
b2c9e2fbba3253 Zi Yan 2022-05-12 382 }
b2c9e2fbba3253 Zi Yan 2022-05-12 383 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
b2c9e2fbba3253 Zi Yan 2022-05-12 384 /*
b2c9e2fbba3253 Zi Yan 2022-05-12 385 * hugetlb, lru compound (THP), and movable compound pages
b2c9e2fbba3253 Zi Yan 2022-05-12 386 * can be migrated. Otherwise, fail the isolation.
b2c9e2fbba3253 Zi Yan 2022-05-12 387 */
b2c9e2fbba3253 Zi Yan 2022-05-12 388 if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
b2c9e2fbba3253 Zi Yan 2022-05-12 389 int order;
b2c9e2fbba3253 Zi Yan 2022-05-12 390 unsigned long outer_pfn;
88ee134320b831 Zi Yan 2022-05-24 391 int page_mt = get_pageblock_migratetype(page);
88ee134320b831 Zi Yan 2022-05-24 392 bool isolate_page = !is_migrate_isolate_page(page);
b2c9e2fbba3253 Zi Yan 2022-05-12 393 struct compact_control cc = {
b2c9e2fbba3253 Zi Yan 2022-05-12 394 .nr_migratepages = 0,
b2c9e2fbba3253 Zi Yan 2022-05-12 395 .order = -1,
b2c9e2fbba3253 Zi Yan 2022-05-12 396 .zone = page_zone(pfn_to_page(head_pfn)),
b2c9e2fbba3253 Zi Yan 2022-05-12 397 .mode = MIGRATE_SYNC,
b2c9e2fbba3253 Zi Yan 2022-05-12 398 .ignore_skip_hint = true,
b2c9e2fbba3253 Zi Yan 2022-05-12 399 .no_set_skip_hint = true,
b2c9e2fbba3253 Zi Yan 2022-05-12 400 .gfp_mask = gfp_flags,
b2c9e2fbba3253 Zi Yan 2022-05-12 401 .alloc_contig = true,
b2c9e2fbba3253 Zi Yan 2022-05-12 402 };
b2c9e2fbba3253 Zi Yan 2022-05-12 403 INIT_LIST_HEAD(&cc.migratepages);
b2c9e2fbba3253 Zi Yan 2022-05-12 404
88ee134320b831 Zi Yan 2022-05-24 405 /*
88ee134320b831 Zi Yan 2022-05-24 406 * XXX: mark the page as MIGRATE_ISOLATE so that
88ee134320b831 Zi Yan 2022-05-24 407 * no one else can grab the freed page after migration.
88ee134320b831 Zi Yan 2022-05-24 408 * Ideally, the page should be freed as two separate
88ee134320b831 Zi Yan 2022-05-24 409 * pages to be added into separate migratetype free
88ee134320b831 Zi Yan 2022-05-24 410 * lists.
88ee134320b831 Zi Yan 2022-05-24 411 */
88ee134320b831 Zi Yan 2022-05-24 412 if (isolate_page) {
88ee134320b831 Zi Yan 2022-05-24 413 ret = set_migratetype_isolate(page, page_mt,
88ee134320b831 Zi Yan 2022-05-24 414 flags, head_pfn, head_pfn + nr_pages);
88ee134320b831 Zi Yan 2022-05-24 415 if (ret)
10d000298e8a6b Doug Berger 2022-09-13 416 return ret;
88ee134320b831 Zi Yan 2022-05-24 417 }
88ee134320b831 Zi Yan 2022-05-24 418
b2c9e2fbba3253 Zi Yan 2022-05-12 419 ret = __alloc_contig_migrate_range(&cc, head_pfn,
b2c9e2fbba3253 Zi Yan 2022-05-12 420 head_pfn + nr_pages);
b2c9e2fbba3253 Zi Yan 2022-05-12 421
88ee134320b831 Zi Yan 2022-05-24 422 /*
88ee134320b831 Zi Yan 2022-05-24 423 * restore the page's migratetype so that it can
88ee134320b831 Zi Yan 2022-05-24 424 * be split into separate migratetype free lists
88ee134320b831 Zi Yan 2022-05-24 425 * later.
88ee134320b831 Zi Yan 2022-05-24 426 */
88ee134320b831 Zi Yan 2022-05-24 427 if (isolate_page)
88ee134320b831 Zi Yan 2022-05-24 428 unset_migratetype_isolate(page, page_mt);
88ee134320b831 Zi Yan 2022-05-24 429
b2c9e2fbba3253 Zi Yan 2022-05-12 430 if (ret)
10d000298e8a6b Doug Berger 2022-09-13 431 return -EBUSY;
b2c9e2fbba3253 Zi Yan 2022-05-12 432 /*
b2c9e2fbba3253 Zi Yan 2022-05-12 433 * reset pfn to the head of the free page, so
b2c9e2fbba3253 Zi Yan 2022-05-12 434 * that the free page handling code above can split
b2c9e2fbba3253 Zi Yan 2022-05-12 435 * the free page to the right migratetype list.
b2c9e2fbba3253 Zi Yan 2022-05-12 436 *
b2c9e2fbba3253 Zi Yan 2022-05-12 437 * head_pfn is not used here as a hugetlb page order
b2c9e2fbba3253 Zi Yan 2022-05-12 438 * can be bigger than MAX_ORDER-1, but after it is
b2c9e2fbba3253 Zi Yan 2022-05-12 439 * freed, the free page order is not. Use pfn within
b2c9e2fbba3253 Zi Yan 2022-05-12 440 * the range to find the head of the free page.
b2c9e2fbba3253 Zi Yan 2022-05-12 441 */
b2c9e2fbba3253 Zi Yan 2022-05-12 442 order = 0;
b2c9e2fbba3253 Zi Yan 2022-05-12 443 outer_pfn = pfn;
b2c9e2fbba3253 Zi Yan 2022-05-12 444 while (!PageBuddy(pfn_to_page(outer_pfn))) {
88ee134320b831 Zi Yan 2022-05-24 445 /* stop if we cannot find the free page */
88ee134320b831 Zi Yan 2022-05-24 446 if (++order >= MAX_ORDER)
10d000298e8a6b Doug Berger 2022-09-13 447 return -EBUSY;
b2c9e2fbba3253 Zi Yan 2022-05-12 448 outer_pfn &= ~0UL << order;
b2c9e2fbba3253 Zi Yan 2022-05-12 449 }
b2c9e2fbba3253 Zi Yan 2022-05-12 450 pfn = outer_pfn;
b2c9e2fbba3253 Zi Yan 2022-05-12 451 continue;
b2c9e2fbba3253 Zi Yan 2022-05-12 452 } else
b2c9e2fbba3253 Zi Yan 2022-05-12 453 #endif
10d000298e8a6b Doug Berger 2022-09-13 454 return -EBUSY;
b2c9e2fbba3253 Zi Yan 2022-05-12 455 }
b2c9e2fbba3253 Zi Yan 2022-05-12 456
b2c9e2fbba3253 Zi Yan 2022-05-12 457 pfn++;
b2c9e2fbba3253 Zi Yan 2022-05-12 458 }
b2c9e2fbba3253 Zi Yan 2022-05-12 459 return 0;
b2c9e2fbba3253 Zi Yan 2022-05-12 460 }
b2c9e2fbba3253 Zi Yan 2022-05-12 461

--
0-DAY CI Kernel Test Service
https://01.org/lkp
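
This warning appears because, with the early returns added by the patch,
every remaining use of 'ret' sits inside the '#if defined CONFIG_COMPACTION
|| defined CONFIG_CMA' block, so a build with both options disabled is left
with an unused local. A minimal sketch of the usual remedy (illustrative
only, not necessarily the fix that will be submitted) is to declare the
variable next to its only remaining uses:

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
		if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
			int order;
			int ret;	/* moved here from function scope */
			unsigned long outer_pfn;
			/* ... remainder of the block unchanged ... */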

2022-09-18 10:59:02

by Krzysztof Kozlowski

[permalink] [raw]
Subject: Re: [PATCH 16/21] dt-bindings: reserved-memory: introduce designated-movable-block

On 14/09/2022 18:13, Doug Berger wrote:
> On 9/14/2022 7:55 AM, Rob Herring wrote:
>> On Tue, Sep 13, 2022 at 12:55:03PM -0700, Doug Berger wrote:
>>> Introduce designated-movable-block.yaml to document the
>>> devicetree binding for Designated Movable Block children of the
>>> reserved-memory node.
>>
>> What is a Designated Movable Block? This patch needs to stand on its
>> own.
> As noted in my reply to your [PATCH 00/21] comment, my intention in
> submitting the entire patch set (and specifically PATCH 00/21]) was to
> communicate this context. Now that I believe I understand that only this
> patch should have been submitted to the devicetree-spec mailing list, I
> will strive harder to make it more self contained.

The submission of entire thread was ok. What is missing is the
explanation in this commit. This commit must be self-explanatory (e.g.
in explaining "Why are you doing it?"), not rely on other commits for
such explanation.

>
>>
>> Why does this belong or need to be in DT?
> While my preferred method of declaring Designated Movable Blocks is
> through the movablecore kernel parameter, I can conceive that others may
> wish to take advantage of the reserved-memory DT nodes. In particular,
> it has the advantage that a device can claim ownership of the
> reserved-memory via device tree, which is something that has yet to be
> implemented for DMBs defined with movablecore.

Rephrasing the question: why OS memory layout and OS behavior is a
property of hardware (DTS)?



Best regards,
Krzysztof

2022-09-18 11:01:43

by Krzysztof Kozlowski

[permalink] [raw]
Subject: Re: [PATCH 16/21] dt-bindings: reserved-memory: introduce designated-movable-block

On 13/09/2022 20:55, Doug Berger wrote:
> Introduce designated-movable-block.yaml to document the
> devicetree binding for Designated Movable Block children of the
> reserved-memory node.
>
> Signed-off-by: Doug Berger <[email protected]>
> ---
> .../designated-movable-block.yaml | 51 +++++++++++++++++++
> 1 file changed, 51 insertions(+)
> create mode 100644 Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml
>
> diff --git a/Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml b/Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml
> new file mode 100644
> index 000000000000..42f846069a2e
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml
> @@ -0,0 +1,51 @@
> +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/reserved-memory/designated-movable-block.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: /reserved-memory Designated Movable Block node binding

Drop "binding"

> +
> +maintainers:
> + - [email protected]
> +
> +allOf:
> + - $ref: "reserved-memory.yaml"

Skip quotes

> +
> +properties:
> + compatible:
> + const: designated-movable-block
> + description:
> + This indicates a region of memory meant to be placed into
> + ZONE_MOVABLE.
> +
> +unevaluatedProperties: false
> +
> +required:
> + - compatible
> + - reusable
> +
> +examples:
> + - |
> + reserved-memory {

Use 4 spaces for example indentation.

> + #address-cells = <0x2>;
> + #size-cells = <0x2>;
> +
> + DMB0@10800000 {

The convention for node names is to use lowercase and generic node
names, so just "dmb".



Best regards,
Krzysztof

2022-09-18 23:32:28

by Doug Berger

[permalink] [raw]
Subject: Re: [PATCH 16/21] dt-bindings: reserved-memory: introduce designated-movable-block

On 9/18/2022 3:31 AM, Krzysztof Kozlowski wrote:
> On 14/09/2022 18:13, Doug Berger wrote:
>> On 9/14/2022 7:55 AM, Rob Herring wrote:
>>> On Tue, Sep 13, 2022 at 12:55:03PM -0700, Doug Berger wrote:
>>>> Introduce designated-movable-block.yaml to document the
>>>> devicetree binding for Designated Movable Block children of the
>>>> reserved-memory node.
>>>
>>> What is a Designated Movable Block? This patch needs to stand on its
>>> own.
>> As noted in my reply to your [PATCH 00/21] comment, my intention in
>> submitting the entire patch set (and specifically PATCH 00/21]) was to
>> communicate this context. Now that I believe I understand that only this
>> patch should have been submitted to the devicetree-spec mailing list, I
>> will strive harder to make it more self contained.
>
> The submission of entire thread was ok. What is missing is the
> explanation in this commit. This commit must be self-explanatory (e.g.
> in explaining "Why are you doing it?"), not rely on other commits for
> such explanation.
>
>>
>>>
>>> Why does this belong or need to be in DT?
>> While my preferred method of declaring Designated Movable Blocks is
>> through the movablecore kernel parameter, I can conceive that others may
>> wish to take advantage of the reserved-memory DT nodes. In particular,
>> it has the advantage that a device can claim ownership of the
>> reserved-memory via device tree, which is something that has yet to be
>> implemented for DMBs defined with movablecore.
>
> Rephrasing the question: why OS memory layout and OS behavior is a
> property of hardware (DTS)?
I would say the premise is fundamentally the same as the existing
reserved-memory child node.

I've been rethinking how this should be specified. I am now thinking
that it may be better to introduce a new Reserved Memory property that
serves as a modifier to the 'reusable' property. The 'reusable' property
allows the OS to use memory that has been reserved for a device and
therefore requires the device driver to reclaim the memory prior to its
use. However, an OS may have multiple ways of implementing such reuse
and reclamation.

I am considering introducing the vendor specific 'linux,dmb' property
that is dependent on the 'reusable' property to allow both the OS and
the device driver to identify the method used by the Linux OS to support
reuse and reclamation of the reserved-memory child node.

Such a property would remove any need for new compatible strings to the
device tree. Does that approach seem reasonable to you?

>
> Best regards,
> Krzysztof
Thanks again for taking the time,
-Doug

2022-09-19 00:26:52

by Doug Berger

[permalink] [raw]
Subject: Re: [PATCH 16/21] dt-bindings: reserved-memory: introduce designated-movable-block

On 9/18/2022 3:28 AM, Krzysztof Kozlowski wrote:
> On 13/09/2022 20:55, Doug Berger wrote:
>> Introduce designated-movable-block.yaml to document the
>> devicetree binding for Designated Movable Block children of the
>> reserved-memory node.
>>
>> Signed-off-by: Doug Berger <[email protected]>
>> ---
>> .../designated-movable-block.yaml | 51 +++++++++++++++++++
>> 1 file changed, 51 insertions(+)
>> create mode 100644 Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml
>>
>> diff --git a/Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml b/Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml
>> new file mode 100644
>> index 000000000000..42f846069a2e
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/reserved-memory/designated-movable-block.yaml
>> @@ -0,0 +1,51 @@
>> +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
>> +%YAML 1.2
>> +---
>> +$id: http://devicetree.org/schemas/reserved-memory/designated-movable-block.yaml#
>> +$schema: http://devicetree.org/meta-schemas/core.yaml#
>> +
>> +title: /reserved-memory Designated Movable Block node binding
>
> Drop "binding"
>
>> +
>> +maintainers:
>> + - [email protected]
>> +
>> +allOf:
>> + - $ref: "reserved-memory.yaml"
>
> Skip quotes
>
>> +
>> +properties:
>> + compatible:
>> + const: designated-movable-block
>> + description:
>> + This indicates a region of memory meant to be placed into
>> + ZONE_MOVABLE.
>> +
>> +unevaluatedProperties: false
>> +
>> +required:
>> + - compatible
>> + - reusable
>> +
>> +examples:
>> + - |
>> + reserved-memory {
>
> Use 4 spaces for example indentation.
>
>> + #address-cells = <0x2>;
>> + #size-cells = <0x2>;
>> +
>> + DMB0@10800000 {
>
> The convention for node names is to use lowercase and generic node
> names, so just "dmb".
>
>
>
> Best regards,
> Krzysztof
Thanks for taking the time to review and provide feedback on this patch.
-Doug

2022-09-19 09:45:50

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 00/21] mm: introduce Designated Movable Blocks

Hi Dough,

I have some high-level questions.

> MOTIVATION:
> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory
> controllers with each mapped in a different address range within
> a Uniform Memory Architecture. Some users of these systems have

How large are these areas typically?

How large are they in comparison to other memory in the system?

How is this memory currently presented to the system?

> expressed the desire to locate ZONE_MOVABLE memory on each
> memory controller to allow user space intensive processing to
> make better use of the additional memory bandwidth.

Can you share some more how exactly ZONE_MOVABLE would help here to make
better use of the memory bandwidth?

> Unfortunately, the historical monotonic layout of zones would
> mean that if the lowest addressed memory controller contains
> ZONE_MOVABLE memory then all of the memory available from
> memory controllers at higher addresses must also be in the
> ZONE_MOVABLE zone. This would force all kernel memory accesses
> onto the lowest addressed memory controller and significantly
> reduce the amount of memory available for non-movable
> allocations.

We do have code that relies on zones during boot to not overlap within a
single node.

>
> The main objective of this patch set is therefore to allow a
> block of memory to be designated as part of the ZONE_MOVABLE
> zone where it will always only be used by the kernel page
> allocator to satisfy requests for movable pages. The term
> Designated Movable Block is introduced here to represent such a
> block. The favored implementation allows modification of the

Sorry to say, but that term is rather suboptimal to describe what you
are doing here. You simply have some system RAM you'd want to have
managed by ZONE_MOVABLE, no?

> 'movablecore' kernel parameter to allow specification of a base
> address and support for multiple blocks. The existing
> 'movablecore' mechanisms are retained. Other mechanisms based on
> device tree are also included in this set.
>
> BACKGROUND:
> NUMA architectures support distributing movablecore memory
> across each node, but it is undesirable to introduce the
> overhead and complexities of NUMA on systems that don't have a
> Non-Uniform Memory Architecture.

How exactly would that look like? I think I am missing something :)

>
> Commit 342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror option")
> also depends on zone overlap to support sytems with multiple
> mirrored ranges.

IIRC, zones will not overlap within a single node.

>
> Commit c6f03e2903c9 ("mm, memory_hotplug: remove zone restrictions")
> embraced overlapped zones for memory hotplug.

Yes, after boot.

>
> This commit set follows their lead to allow the ZONE_MOVABLE
> zone to overlap other zones while spanning the pages from the
> lowest Designated Movable Block to the end of the node.
> Designated Movable Blocks are made absent from overlapping zones
> and present within the ZONE_MOVABLE zone.
>
> I initially investigated an implementation using a Designated
> Movable migrate type in line with comments[1] made by Mel Gorman
> regarding a "sticky" MIGRATE_MOVABLE type to avoid using
> ZONE_MOVABLE. However, this approach was riskier since it was
> much more instrusive on the allocation paths. Ultimately, the
> progress made by the memory hotplug folks to expand the
> ZONE_MOVABLE functionality convinced me to follow this approach.
>
> OPPORTUNITIES:
> There have been many attempts to modify the behavior of the
> kernel page allocators use of CMA regions. This implementation
> of Designated Movable Blocks creates an opportunity to repurpose
> the CMA allocator to operate on ZONE_MOVABLE memory that the
> kernel page allocator can use more agressively, without
> affecting the existing CMA implementation. It is hoped that the
> "shared-dmb-pool" approach included here will be useful in cases
> where memory sharing is more important than allocation latency.
>
> CMA introduced a paradigm where multiple allocators could
> operate on the same region of memory, and that paradigm can be
> extended to Designated Movable Blocks as well. I was interested
> in using kernel resource management as a mechanism for exposing
> Designated Movable Block resources (e.g. /proc/iomem) that would
> be used by the kernel page allocator like any other ZONE_MOVABLE
> memory, but could be claimed by an alternative allocator (e.g.
> CMA). Unfortunately, this becomes complicated because the kernel
> resource implementation varies materially across different
> architectures and I do not require this capability so I have
> deferred that.

Why can't we simply designate these regions as CMA regions?

Why do we have to start using ZONE_MOVABLE for them?

--
Thanks,

David / dhildenb

2022-09-19 11:14:29

by Krzysztof Kozlowski

[permalink] [raw]
Subject: Re: [PATCH 16/21] dt-bindings: reserved-memory: introduce designated-movable-block

On 19/09/2022 01:12, Doug Berger wrote:
> On 9/18/2022 3:31 AM, Krzysztof Kozlowski wrote:
>> On 14/09/2022 18:13, Doug Berger wrote:
>>> On 9/14/2022 7:55 AM, Rob Herring wrote:
>>>> On Tue, Sep 13, 2022 at 12:55:03PM -0700, Doug Berger wrote:
>>>>> Introduce designated-movable-block.yaml to document the
>>>>> devicetree binding for Designated Movable Block children of the
>>>>> reserved-memory node.
>>>>
>>>> What is a Designated Movable Block? This patch needs to stand on its
>>>> own.
>>> As noted in my reply to your [PATCH 00/21] comment, my intention in
>>> submitting the entire patch set (and specifically PATCH 00/21]) was to
>>> communicate this context. Now that I believe I understand that only this
>>> patch should have been submitted to the devicetree-spec mailing list, I
>>> will strive harder to make it more self contained.
>>
>> The submission of entire thread was ok. What is missing is the
>> explanation in this commit. This commit must be self-explanatory (e.g.
>> in explaining "Why are you doing it?"), not rely on other commits for
>> such explanation.
>>
>>>
>>>>
>>>> Why does this belong or need to be in DT?
>>> While my preferred method of declaring Designated Movable Blocks is
>>> through the movablecore kernel parameter, I can conceive that others may
>>> wish to take advantage of the reserved-memory DT nodes. In particular,
>>> it has the advantage that a device can claim ownership of the
>>> reserved-memory via device tree, which is something that has yet to be
>>> implemented for DMBs defined with movablecore.
>>
>> Rephrasing the question: why OS memory layout and OS behavior is a
>> property of hardware (DTS)?
> I would say the premise is fundamentally the same as the existing
> reserved-memory child node.

I don't think it is fundamentally the same.

The existing reserved-memory node describes memory used by hardware - by
other devices. The OS way of handling this memory - movable, reclaimable
etc - is not part of it.

So no, it is not the same.

>
> I've been rethinking how this should be specified. I am now thinking
> that it may be better to introduce a new Reserved Memory property that
> serves as a modifier to the 'reusable' property. The 'reusable' property
> allows the OS to use memory that has been reserved for a device and
> therefore requires the device driver to reclaim the memory prior to its
> use. However, an OS may have multiple ways of implementing such reuse
> and reclamation.

... and I repeat the question - why OS way of implementing reuse and
reclamation is relevant to DT?

> I am considering introducing the vendor specific 'linux,dmb' property
> that is dependent on the 'reusable' property to allow both the OS and
> the device driver to identify the method used by the Linux OS to support
> reuse and reclamation of the reserved-memory child node.

Sure, but why? Why OS and Linux driver specific pieces should be in DT?
> Such a property would remove any need for new compatible strings to the
> device tree. Does that approach seem reasonable to you?

No, because you did not explain original question. At all.

Best regards,
Krzysztof

2022-09-20 01:30:02

by Doug Berger

[permalink] [raw]
Subject: Re: [PATCH 00/21] mm: introduce Designated Movable Blocks

On 9/19/2022 2:00 AM, David Hildenbrand wrote:
> Hi Dough,
>
> I have some high-level questions.
Thanks for your interest. I will attempt to answer them.

>
>> MOTIVATION:
>> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory
>> controllers with each mapped in a different address range within
>> a Uniform Memory Architecture. Some users of these systems have
>
> How large are these areas typically?
>
> How large are they in comparison to other memory in the system?
>
> How is this memory currently presented to the system?
I'm not certain what is typical because these systems are highly
configurable and Broadcom's customers have different ideas about
application processing.

The 7278 device has four ARMv8 CPU cores in an SMP cluster and two
memory controllers (MEMCs). Each MEMC is capable of controlling up to
8GB of DRAM. An example 7278 system might have 1GB on each controller,
so an arm64 kernel might see 1GB on MEMC0 at 0x40000000-0x7FFFFFFF and
1GB on MEMC1 at 0x300000000-0x33FFFFFFF.

The Designated Movable Block concept introduced here has the potential
to offer useful services to different constituencies. I tried to
highlight this in my V1 patch set with the hope of attracting some
interest, but it can complicate the overall discussion, so I would like
to maybe narrow the discussion here. It may be good to keep them in mind
when assessing the overall value, but perhaps the "other opportunities"
can be covered as a follow on discussion.

The base capability described in commits 7-15 of this V1 patch set is to
allow a 'movablecore' block to be created at a particular base address
rather than solely at the end of addressable memory.

>
>> expressed the desire to locate ZONE_MOVABLE memory on each
>> memory controller to allow user space intensive processing to
>> make better use of the additional memory bandwidth.
>
> Can you share some more how exactly ZONE_MOVABLE would help here to make
> better use of the memory bandwidth?
ZONE_MOVABLE memory is effectively unusable by the kernel. It can be
used by user space applications through both the page allocator and the
Hugetlbfs. If a large 'movablecore' allocation is defined and it can
only be located at the end of addressable memory then it will always be
located on MEMC1 of a 7278 system. This will create a tendency for user
space accesses to consume more bandwidth on the MEMC1 memory controller
and kernel space accesses to consume more bandwidth on MEMC0. A more
even distribution of ZONE_MOVABLE memory between the available memory
controllers in theory makes more memory bandwidth available to user
space intensive loads.

>
>> Unfortunately, the historical monotonic layout of zones would
>> mean that if the lowest addressed memory controller contains
>> ZONE_MOVABLE memory then all of the memory available from
>> memory controllers at higher addresses must also be in the
>> ZONE_MOVABLE zone. This would force all kernel memory accesses
>> onto the lowest addressed memory controller and significantly
>> reduce the amount of memory available for non-movable
>> allocations.
>
> We do have code that relies on zones during boot to not overlap within a
> single node.
I believe my changes address all such reliance, but if you are aware of
something I missed please let me know.

>
>>
>> The main objective of this patch set is therefore to allow a
>> block of memory to be designated as part of the ZONE_MOVABLE
>> zone where it will always only be used by the kernel page
>> allocator to satisfy requests for movable pages. The term
>> Designated Movable Block is introduced here to represent such a
>> block. The favored implementation allows modification of the
>
> Sorry to say, but that term is rather suboptimal to describe what you
> are doing here. You simply have some system RAM you'd want to have
> managed by ZONE_MOVABLE, no?
That may be true, but I found it superior to the 'sticky' movable
terminology put forth by Mel Gorman ;). I'm happy to entertain
alternatives, but they may not be as easy to find as you think.

>
>> 'movablecore' kernel parameter to allow specification of a base
>> address and support for multiple blocks. The existing
>> 'movablecore' mechanisms are retained. Other mechanisms based on
>> device tree are also included in this set.
>>
>> BACKGROUND:
>> NUMA architectures support distributing movablecore memory
>> across each node, but it is undesirable to introduce the
>> overhead and complexities of NUMA on systems that don't have a
>> Non-Uniform Memory Architecture.
>
> How exactly would that look like? I think I am missing something :)
The notion would be to consider each memory controller as a separate
node, but as stated it is not desirable.

>
>>
>> Commit 342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror
>> option")
>> also depends on zone overlap to support sytems with multiple
>> mirrored ranges.
>
> IIRC, zones will not overlap within a single node.
I believe the implementation for kernelcore=mirror allows for the
possibility of multiple non-adjacent mirrored ranges in a single node
and accommodates the zone overlap.

>
>>
>> Commit c6f03e2903c9 ("mm, memory_hotplug: remove zone restrictions")
>> embraced overlapped zones for memory hotplug.
>
> Yes, after boot.
>
>>
>> This commit set follows their lead to allow the ZONE_MOVABLE
>> zone to overlap other zones while spanning the pages from the
>> lowest Designated Movable Block to the end of the node.
>> Designated Movable Blocks are made absent from overlapping zones
>> and present within the ZONE_MOVABLE zone.
>>
>> I initially investigated an implementation using a Designated
>> Movable migrate type in line with comments[1] made by Mel Gorman
>> regarding a "sticky" MIGRATE_MOVABLE type to avoid using
>> ZONE_MOVABLE. However, this approach was riskier since it was
>> much more instrusive on the allocation paths. Ultimately, the
>> progress made by the memory hotplug folks to expand the
>> ZONE_MOVABLE functionality convinced me to follow this approach.
>>
>> OPPORTUNITIES:
>> There have been many attempts to modify the behavior of the
>> kernel page allocators use of CMA regions. This implementation
>> of Designated Movable Blocks creates an opportunity to repurpose
>> the CMA allocator to operate on ZONE_MOVABLE memory that the
>> kernel page allocator can use more agressively, without
>> affecting the existing CMA implementation. It is hoped that the
>> "shared-dmb-pool" approach included here will be useful in cases
>> where memory sharing is more important than allocation latency.
>>
>> CMA introduced a paradigm where multiple allocators could
>> operate on the same region of memory, and that paradigm can be
>> extended to Designated Movable Blocks as well. I was interested
>> in using kernel resource management as a mechanism for exposing
>> Designated Movable Block resources (e.g. /proc/iomem) that would
>> be used by the kernel page allocator like any other ZONE_MOVABLE
>> memory, but could be claimed by an alternative allocator (e.g.
>> CMA). Unfortunately, this becomes complicated because the kernel
>> resource implementation varies materially across different
>> architectures and I do not require this capability so I have
>> deferred that.
>
> Why can't we simply designate these regions as CMA regions?
We and others have encountered significant performance issues when large
CMA regions are used. There are significant restrictions on the page
allocator's use of MIGRATE_CMA pages and the memory subsystem works very
hard to keep about half of the memory in the CMA region free. There have
been attempts to patch the CMA implementation to alter this behavior
(for example, the patch set that Mel's response in [1] was replying to), but there
are users that desire the current behavior.

>
> Why do we have to start using ZONE_MOVABLE for them?
One of the "other opportunities" for Designated Movable Blocks is to
allow CMA to allocate from a DMB as an alternative. This would allow
current users to continue using CMA as they want, but would allow users
(e.g. hugetlb_cma) that are not sensitive to the allocation latency to
let the kernel page allocator make more complete use (i.e. waste less)
of the shared memory. ZONE_MOVABLE pageblocks are always MIGRATE_MOVABLE
so the restrictions placed on MIGRATE_CMA pageblocks are lifted within a
DMB.

>
Thanks for your consideration,
Dough Baker ... I mean Doug Berger :).

2022-09-21 01:08:09

by Doug Berger

[permalink] [raw]
Subject: Re: [PATCH 16/21] dt-bindings: reserved-memory: introduce designated-movable-block

On 9/19/2022 4:03 AM, Krzysztof Kozlowski wrote:
> On 19/09/2022 01:12, Doug Berger wrote:
>> On 9/18/2022 3:31 AM, Krzysztof Kozlowski wrote:
>>> On 14/09/2022 18:13, Doug Berger wrote:
>>>> On 9/14/2022 7:55 AM, Rob Herring wrote:
>>>>> On Tue, Sep 13, 2022 at 12:55:03PM -0700, Doug Berger wrote:
>>>>>> Introduce designated-movable-block.yaml to document the
>>>>>> devicetree binding for Designated Movable Block children of the
>>>>>> reserved-memory node.
>>>>>
>>>>> What is a Designated Movable Block? This patch needs to stand on its
>>>>> own.
>>>> As noted in my reply to your [PATCH 00/21] comment, my intention in
>>>> submitting the entire patch set (and specifically PATCH 00/21]) was to
>>>> communicate this context. Now that I believe I understand that only this
>>>> patch should have been submitted to the devicetree-spec mailing list, I
>>>> will strive harder to make it more self contained.
>>>
>>> The submission of entire thread was ok. What is missing is the
>>> explanation in this commit. This commit must be self-explanatory (e.g.
>>> in explaining "Why are you doing it?"), not rely on other commits for
>>> such explanation.
>>>
>>>>
>>>>>
>>>>> Why does this belong or need to be in DT?
>>>> While my preferred method of declaring Designated Movable Blocks is
>>>> through the movablecore kernel parameter, I can conceive that others may
>>>> wish to take advantage of the reserved-memory DT nodes. In particular,
>>>> it has the advantage that a device can claim ownership of the
>>>> reserved-memory via device tree, which is something that has yet to be
>>>> implemented for DMBs defined with movablecore.
>>>
>>> Rephrasing the question: why OS memory layout and OS behavior is a
>>> property of hardware (DTS)?
>> I would say the premise is fundamentally the same as the existing
>> reserved-memory child node.
>
> I don't think it is fundamentally the same.
>
> The existing reserved-memory node describes memory used by hardware - by
> other devices. The OS way of handling this memory - movable, reclaimable
> etc - is not part of it.
>
> So no, it is not the same.
>
>>
>> I've been rethinking how this should be specified. I am now thinking
>> that it may be better to introduce a new Reserved Memory property that
>> serves as a modifier to the 'reusable' property. The 'reusable' property
>> allows the OS to use memory that has been reserved for a device and
>> therefore requires the device driver to reclaim the memory prior to its
>> use. However, an OS may have multiple ways of implementing such reuse
>> and reclamation.
>
> ... and I repeat the question - why OS way of implementing reuse and
> reclamation is relevant to DT?
>
>> I am considering introducing the vendor specific 'linux,dmb' property
>> that is dependent on the 'reusable' property to allow both the OS and
>> the device driver to identify the method used by the Linux OS to support
>> reuse and reclamation of the reserved-memory child node.
>
> Sure, but why? Why OS and Linux driver specific pieces should be in DT?
>> Such a property would remove any need for new compatible strings to the
>> device tree. Does that approach seem reasonable to you?
>
> No, because you did not explain original question. At all.
I apologize if I have somehow offended you, but please recognize that my
apparent inability to answer your question does not come from an
unwillingness to do so.

I believe an example of the reserved-memory node being used the way you
indicate (though there are other uses) can be expressed with device tree
nodes like these:

reserved-memory {
        #address-cells = <0x1>;
        #size-cells = <0x1>;
        ranges;

        multimedia_reserved: multimedia@80000000 {
                reg = <0x80000000 0x10000000>;
        };
};

decoder@8012000 {
        memory-region = <&multimedia_reserved>;
        /* ... */
};

Here a 256MB chunk of memory is reserved for use by a hardware decoder
as part of rendering a video stream. In this case the memory is reserved
for the exclusive use of the decoder device and its associated device
driver.

The Devicetree Specification includes a property named 'reusable' that
could be applied to the multimedia node to allow the OS to "use the
memory in this region with the limitation that the device driver(s)
owning the region need to be able to reclaim it back". This is a good
idea, because this memory could probably be put to good use when the
decoder is not active. Unfortunately, the methods for reusing this
memory are not defined for Linux so the multimedia reserved memory would
not be reused even though the devicetree indicates that it is allowed.

The notion behind this commit was to introduce the
'designated-movable-block' compatible string that could be added to the
multimedia node to allow the Client Program (i.e. Linux) to select a
device driver that knows how to reclaim reserved memory back from the OS
when it is needed by the decoder device and release it back to the OS
when the decoder no longer needs it. In this way, the purpose of the
multimedia node remains the same (i.e. to reserve memory for use by a
device), but a new compatible string is defined to allow for selection
of an appropriate device driver and allow successful reuse of the memory
for the benefit of the system.

From Rob's feedback it is clear that 'designated-movable-block' is not
an appropriate name, but perhaps 'linux,dmb' would have been. However,
it would be more flexible if a 'linux,dmb' property could be introduced
as a modifier to the existing 'reusable' property, providing a general
mechanism for clarifying how 'reusable' should be supported by the
Client Program and its device drivers.

Such a property is not directly relevant to hardware, but the devicetree
is not wholly concerned with hardware. Reserved memory node children
include support for 'linux,cma-default' and 'linux,dma-default'
properties that signal behavioral intent to the Linux OS. Some aspects
of the devicetree (e.g. the /chosen node and 'reusable' property) are
for the benefit of the Client Program.

>
> Best regards,
> Krzysztof
I hope this is closer to the answer you seek, but I may simply not
understand the question being asked,
-Doug

2022-09-21 06:51:03

by Krzysztof Kozlowski

[permalink] [raw]
Subject: Re: [PATCH 16/21] dt-bindings: reserved-memory: introduce designated-movable-block

On 21/09/2022 02:14, Doug Berger wrote:
> On 9/19/2022 4:03 AM, Krzysztof Kozlowski wrote:
>> On 19/09/2022 01:12, Doug Berger wrote:
>>> On 9/18/2022 3:31 AM, Krzysztof Kozlowski wrote:
>>>> On 14/09/2022 18:13, Doug Berger wrote:
>>>>> On 9/14/2022 7:55 AM, Rob Herring wrote:
>>>>>> On Tue, Sep 13, 2022 at 12:55:03PM -0700, Doug Berger wrote:
>>>>>>> Introduce designated-movable-block.yaml to document the
>>>>>>> devicetree binding for Designated Movable Block children of the
>>>>>>> reserved-memory node.
>>>>>>
>>>>>> What is a Designated Movable Block? This patch needs to stand on its
>>>>>> own.
>>>>> As noted in my reply to your [PATCH 00/21] comment, my intention in
>>>>> submitting the entire patch set (and specifically PATCH 00/21]) was to
>>>>> communicate this context. Now that I believe I understand that only this
>>>>> patch should have been submitted to the devicetree-spec mailing list, I
>>>>> will strive harder to make it more self contained.
>>>>
>>>> The submission of entire thread was ok. What is missing is the
>>>> explanation in this commit. This commit must be self-explanatory (e.g.
>>>> in explaining "Why are you doing it?"), not rely on other commits for
>>>> such explanation.
>>>>
>>>>>
>>>>>>
>>>>>> Why does this belong or need to be in DT?
>>>>> While my preferred method of declaring Designated Movable Blocks is
>>>>> through the movablecore kernel parameter, I can conceive that others may
>>>>> wish to take advantage of the reserved-memory DT nodes. In particular,
>>>>> it has the advantage that a device can claim ownership of the
>>>>> reserved-memory via device tree, which is something that has yet to be
>>>>> implemented for DMBs defined with movablecore.
>>>>
>>>> Rephrasing the question: why OS memory layout and OS behavior is a
>>>> property of hardware (DTS)?
>>> I would say the premise is fundamentally the same as the existing
>>> reserved-memory child node.
>>
>> I don't think it is fundamentally the same.
>>
>> The existing reserved-memory node describes memory used by hardware - by
>> other devices. The OS way of handling this memory - movable, reclaimable
>> etc - is not part of it.
>>
>> So no, it is not the same.
>>
>>>
>>> I've been rethinking how this should be specified. I am now thinking
>>> that it may be better to introduce a new Reserved Memory property that
>>> serves as a modifier to the 'reusable' property. The 'reusable' property
>>> allows the OS to use memory that has been reserved for a device and
>>> therefore requires the device driver to reclaim the memory prior to its
>>> use. However, an OS may have multiple ways of implementing such reuse
>>> and reclamation.
>>
>> ... and I repeat the question - why OS way of implementing reuse and
>> reclamation is relevant to DT?
>>
>>> I am considering introducing the vendor specific 'linux,dmb' property
>>> that is dependent on the 'reusable' property to allow both the OS and
>>> the device driver to identify the method used by the Linux OS to support
>>> reuse and reclamation of the reserved-memory child node.
>>
>> Sure, but why? Why OS and Linux driver specific pieces should be in DT?
>>> Such a property would remove any need for new compatible strings to the
>>> device tree. Does that approach seem reasonable to you?
>>
>> No, because you did not explain original question. At all.
> I apologize if I have somehow offended you, but please recognize that my
> apparent inability to answer your question does not come from an
> unwillingness to do so.
>
> I believe an example of the reserved-memory node being used the way you
> indicate (though there are other uses) can be expressed with device tree
> nodes like these:
>
> reserved-memory {
> #address-cells = <0x1>;
> #size-cells = <0x1>;
> ranges;
>
> multimedia_reserved: multimedia@80000000 {
> reg = <0x80000000 0x10000000>;
> };
> };
>
> decoder@8012000 {
> memory-region = <&multimedia_reserved>;
> /* ... */
> };
>
> Here a 256MB chunk of memory is reserved for use by a hardware decoder
> as part of rendering a video stream. In this case the memory is reserved
> for the exclusive use of the decoder device and its associated device
> driver.
>
> The Devicetree Specification includes a property named 'reusable' that
> could be applied to the multimedia node to allow the OS to "use the
> memory in this region with the limitation that the device driver(s)
> owning the region need to be able to reclaim it back".

Indeed, there is such a property... and it should be used instead. :)

> This is a good
> idea, because this memory could probably be put to good use when the
> decoder is not active. Unfortunately, the methods for reusing this
> memory are not defined for Linux so the multimedia reserved memory would
> not be reused even though the devicetree indicates that it is allowed.

Then the implementation has to be changed, rather than the Devicetree bindings.

>
> The notion behind this commit was to introduce the
> 'designated-movable-block' compatible string that could be added to the
> multimedia node to allow the Client Program (i.e. Linux) to select a
> device driver that knows how to reclaim reserved memory back from the OS
> when it is needed by the decoder device and release it back to the OS
> when the decoder no longer needs it. In this way, the purpose of the
> multimedia node remains the same (i.e. to reserve memory for use by a
> device), but a new compatible string is defined to allow for selection
> of an appropriate device driver and allow successful reuse of the memory
> for the benefit of the system.

We don't need a new compatible for it; we can use that existing property.

>
> From Rob's feedback it is clear that 'designated-movable-block' is not
> an appropriate name, but maybe 'linux,dmb' might have been. However, it
> would be more flexible if a 'linux,dmb' property could be introduced as
> a modifier to the existing 'reusable' property to provide a general
> mechanism for clarifying how 'reusable' should be supported by the
> Client Software and its device drivers.
>
> Such a property is not directly relevant to hardware, but the devicetree
> is not wholly concerned with hardware. Reserved memory node children
> include support for 'linux,cma-default' and 'linux,dma-default'
> properties that signal behavioral intent to the Linux OS. Some aspects
> of the devicetree (e.g. the /chosen node and 'reusable' property) are
> for the benefit of the Client Program.

Fair enough, although there is a difference between a generic property
for reusable/reclaimable memory and a property describing one of Linux's
memory-management zones.

Best regards,
Krzysztof

2022-09-23 12:11:40

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH 00/21] mm: introduce Designated Movable Blocks

Hi Doug,

I only had time to skim through the patches and before diving in I'd like
to clarify a few things.

On Mon, Sep 19, 2022 at 06:03:55PM -0700, Doug Berger wrote:
> On 9/19/2022 2:00 AM, David Hildenbrand wrote:
> >
> > How is this memory currently presented to the system?
>
> The 7278 device has four ARMv8 CPU cores in an SMP cluster and two memory
> controllers (MEMCs). Each MEMC is capable of controlling up to 8GB of DRAM.
> An example 7278 system might have 1GB on each controller, so an arm64 kernel
> might see 1GB on MEMC0 at 0x40000000-0x7FFFFFFF and 1GB on MEMC1 at
> 0x300000000-0x33FFFFFFF.
>
> The base capability described in commits 7-15 of this V1 patch set is to
> allow a 'movablecore' block to be created at a particular base address
> rather than solely at the end of addressable memory.

I think this capability is only useful when there is non-uniform access to
different memory ranges. Otherwise it wouldn't matter where the movable
pages reside. The system you describe looks quite NUMA to me, with two
memory controllers, each for accessing a partial range of the available
memory.

> > > expressed the desire to locate ZONE_MOVABLE memory on each
> > > memory controller to allow user space intensive processing to
> > > make better use of the additional memory bandwidth.
> >
> > Can you share some more how exactly ZONE_MOVABLE would help here to make
> > better use of the memory bandwidth?
>
> ZONE_MOVABLE memory is effectively unusable by the kernel. It can be used by
> user space applications through both the page allocator and the Hugetlbfs.
> If a large 'movablecore' allocation is defined and it can only be located at
> the end of addressable memory then it will always be located on MEMC1 of a
> 7278 system. This will create a tendency for user space accesses to consume
> more bandwidth on the MEMC1 memory controller and kernel space accesses to
> consume more bandwidth on MEMC0. A more even distribution of ZONE_MOVABLE
> memory between the available memory controllers in theory makes more memory
> bandwidth available to user space intensive loads.

The theory makes perfect sense, but is there any practical evidence of
improvement?
Some benchmark results that illustrate the difference would be nice.

> > > BACKGROUND:
> > > NUMA architectures support distributing movablecore memory
> > > across each node, but it is undesirable to introduce the
> > > overhead and complexities of NUMA on systems that don't have a
> > > Non-Uniform Memory Architecture.
> >
> > How exactly would that look like? I think I am missing something :)
>
> The notion would be to consider each memory controller as a separate node,
> but as stated it is not desirable.

Why?

> Thanks for your consideration,
> Dough Baker ... I mean Doug Berger :).

--
Sincerely yours,
Mike.

2022-09-23 22:25:23

by Doug Berger

[permalink] [raw]
Subject: Re: [PATCH 00/21] mm: introduce Designated Movable Blocks

On 9/23/2022 4:19 AM, Mike Rapoport wrote:
> Hi Doug,
>
> I only had time to skim through the patches and before diving in I'd like
> to clarify a few things.
Thanks for taking the time. Any input is appreciated.

>
> On Mon, Sep 19, 2022 at 06:03:55PM -0700, Doug Berger wrote:
>> On 9/19/2022 2:00 AM, David Hildenbrand wrote:
>>>
>>> How is this memory currently presented to the system?
>>
>> The 7278 device has four ARMv8 CPU cores in an SMP cluster and two memory
>> controllers (MEMCs). Each MEMC is capable of controlling up to 8GB of DRAM.
>> An example 7278 system might have 1GB on each controller, so an arm64 kernel
>> might see 1GB on MEMC0 at 0x40000000-0x7FFFFFFF and 1GB on MEMC1 at
>> 0x300000000-0x33FFFFFFF.
>>
>> The base capability described in commits 7-15 of this V1 patch set is to
>> allow a 'movablecore' block to be created at a particular base address
>> rather than solely at the end of addressable memory.
>
> I think this capability is only useful when there is non-uniform access to
> different memory ranges. Otherwise it wouldn't matter where the movable
> pages reside.
I think that is a fair assessment of the described capability. However,
the non-uniform access is a result of the current Linux architecture
rather than the hardware architecture.

> The system you describe looks quite NUMA to me, with two
> memory controllers, each for accessing a partial range of the available
> memory.
NUMA was created to deal with non-uniformity in the hardware
architecture where a CPU and/or other hardware device can make more
efficient use of some nodes than other nodes. NUMA attempts to allocate
from "closer" nodes to improve the operational efficiency of the system.

If we consider how an arm64 architecture Linux kernel will apply zones
to the above example system we find that Linux will place MEMC0 in
ZONE_DMA and MEMC1 in ZONE_NORMAL. This allows both kernel and user
space to compete for bandwidth on MEMC1, but largely excludes user space
from MEMC0. It is possible for user space to get memory from ZONE_DMA
through fallback when ZONE_NORMAL has been consumed, but there is a
pretty clear bias against user space use of MEMC0. This non-uniformity
doesn't come from the bus architecture since each CPU has equal costs to
access MEMC0 and MEMC1. They compete for bandwidth, but there is no
hardware bias for one node over another. Creating ZONE_MOVABLE memory on
MEMC0 can help correct for the Linux bias.

>
>>>> expressed the desire to locate ZONE_MOVABLE memory on each
>>>> memory controller to allow user space intensive processing to
>>>> make better use of the additional memory bandwidth.
>>>
>>> Can you share some more how exactly ZONE_MOVABLE would help here to make
>>> better use of the memory bandwidth?
>>
>> ZONE_MOVABLE memory is effectively unusable by the kernel. It can be used by
>> user space applications through both the page allocator and the Hugetlbfs.
>> If a large 'movablecore' allocation is defined and it can only be located at
>> the end of addressable memory then it will always be located on MEMC1 of a
>> 7278 system. This will create a tendency for user space accesses to consume
>> more bandwidth on the MEMC1 memory controller and kernel space accesses to
>> consume more bandwidth on MEMC0. A more even distribution of ZONE_MOVABLE
>> memory between the available memory controllers in theory makes more memory
>> bandwidth available to user space intensive loads.
>
> The theory makes perfect sense, but is there any practical evidence of
> improvement?
> Some benchmark results that illustrate the difference would be nice.
I agree that benchmark results would be nice. Unfortunately, I am not
part of the constituency that uses these Linux features, so I have no
representative user space workloads to measure. I can only say that I
was asked to implement this capability, this is the approach I took, and
customers of Broadcom are making use of it. I am submitting it upstream
with the hope that its/my sanity can be better reviewed, that it will
not get broken by future changes in the kernel, and that it will be
useful to others.

This "narrow" capability may have limited value to others, but it should
not create issues for those that do not actively wish to use it. I would
hope that makes it easier to review and get accepted.

However, I believe "other opportunities" exist that may have broader
appeal so I have suggested some along with the "narrow" capability to
hopefully give others motivation to consider accepting the narrow
capability and to help shape how these "other capabilities" should be
implemented.

One "other opportunity" that I have realized may be more interesting
than I originally anticipated comes from the recognition that the
Devicetree Specification includes support for Reserved Memory regions
that can contain the 'reusable' property to allow the OS to make use of
the memory. Currently, Linux only takes advantage of that capability for
reserved memory nodes that are compatible with 'shared-dma-pool' where
CMA is used to allow the memory to be used by the OS and by device
drivers. CMA is a great concept, but we have observed shortcomings that
become more apparent as the size of the CMA region grows. Specifically,
the Linux memory management works very hard to keep half of the CMA
memory free. A number of submissions have been made over the years to
alter the CMA implementation to allow more aggressive use of the memory
by the OS, but there are users that desire the current behavior so the
submissions have been rejected.

No other types of reserved memory nodes can take advantage of sharing
the memory with the Linux operating system because there is insufficient
specification of how device drivers can reclaim the reserved memory when
it is needed. The introduction of Designated Movable Block support
provides a mechanism that would allow this capability to be realized.
Because DMBs are in ZONE_MOVABLE their pages are reclaimable, and
because they can be located anywhere they can satisfy DMA constraints of
owning devices. In the simplest case, device drivers can use the
dmb_intersects() function to determine whether their reserved memory
range is within a DMB and can use the alloc_contig_range() function to
reclaim the pages. This simple API could certainly be improved upon
(e.g. the CMA allocator seems like an obvious choice), but it doesn't
need to be defined by me so I would be happy to hear other people's ideas.
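
To make that concrete, below is a minimal sketch of what a DMB-aware
driver could do. It assumes the dmb_intersects() helper from this patch
set (its exact signature may differ) together with the existing
alloc_contig_range()/free_contig_range() interfaces; error handling and
the real driver plumbing are omitted:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/pfn.h>

/* Sketch only: reclaim a driver-owned reserved range that lies in a DMB. */
static int decoder_claim_region(phys_addr_t base, size_t size)
{
	unsigned long start_pfn = PHYS_PFN(base);
	unsigned long end_pfn = PHYS_PFN(base + size);

	/* Only ranges within a Designated Movable Block are handled here. */
	if (!dmb_intersects(start_pfn, end_pfn))
		return -EINVAL;

	/* Migrate movable data off the range and take ownership of its pages. */
	return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
				  GFP_KERNEL);
}

/* Sketch only: give the pages back to the ZONE_MOVABLE free lists. */
static void decoder_release_region(phys_addr_t base, size_t size)
{
	free_contig_range(PHYS_PFN(base), size >> PAGE_SHIFT);
}

Whether something richer than this (e.g. routing it through the CMA
allocator) is preferable is exactly the open question.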

>
>>>> BACKGROUND:
>>>> NUMA architectures support distributing movablecore memory
>>>> across each node, but it is undesirable to introduce the
>>>> overhead and complexities of NUMA on systems that don't have a
>>>> Non-Uniform Memory Architecture.
>>>
>>> How exactly would that look like? I think I am missing something :)
>>
>> The notion would be to consider each memory controller as a separate node,
>> but as stated it is not desirable.
>
> Why?
In my opinion this is an inappropriate application of NUMA because the
hardware does not impose any access non-uniformity to justify the
complexity and overhead associated with NUMA. It would only be
shoe-horned into the implementation to add some logical notion of memory
nodes being associated with memory controllers. I would expect such an
approach to receive a lot of pushback from Android Common Kernel users,
which may not be relevant to everyone but is to many.

Thanks for your consideration,
-Doug

2022-09-29 09:22:19

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 00/21] mm: introduce Designated Movable Blocks

On 20.09.22 03:03, Doug Berger wrote:
> On 9/19/2022 2:00 AM, David Hildenbrand wrote:
>> Hi Dough,
>>
>> I have some high-level questions.
> Thanks for your interest. I will attempt to answer them.
>

Hi Doug,

sorry for the late reply, slowly catching up on mails.

>>
>>> MOTIVATION:
>>> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory
>>> controllers with each mapped in a different address range within
>>> a Uniform Memory Architecture. Some users of these systems have
>>
>> How large are these areas typically?
>>
>> How large are they in comparison to other memory in the system?
>>
>> How is this memory currently presented to the system?
> I'm not certain what is typical because these systems are highly
> configurable and Broadcom's customers have different ideas about
> application processing.
>
> The 7278 device has four ARMv8 CPU cores in an SMP cluster and two
> memory controllers (MEMCs). Each MEMC is capable of controlling up to
> 8GB of DRAM. An example 7278 system might have 1GB on each controller,
> so an arm64 kernel might see 1GB on MEMC0 at 0x40000000-0x7FFFFFFF and
> 1GB on MEMC1 at 0x300000000-0x33FFFFFFF.
>
> The Designated Movable Block concept introduced here has the potential
> to offer useful services to different constituencies. I tried to
> highlight this in my V1 patch set with the hope of attracting some
> interest, but it can complicate the overall discussion, so I would like
> to maybe narrow the discussion here. It may be good to keep them in mind
> when assessing the overall value, but perhaps the "other opportunities"
> can be covered as a follow on discussion.
>
> The base capability described in commits 7-15 of this V1 patch set is to
> allow a 'movablecore' block to be created at a particular base address
> rather than solely at the end of addressable memory.
>

Just so we're on the same page:

Having too much ZONE_MOVABLE memory (ratio compared to !ZONE_MOVABLE
memory) is dangerous. Acceptable ratios highly depend on the target
workload. An extreme example is memory-hungry applications that end up
long-term pinning a lot of memory (e.g., VMs with SR-IOV): we can easily
run out of free memory in the !ZONE_MOVABLE zones and might not want
ZONE_MOVABLE at all.

So whatever we do, this should in general not be the kernel's sole
decision to make this memory special and let ZONE_MOVABLE manage it.

It's the same with CMA. "Heavy" CMA users require special configuration:
hugetlb_cma is one prime example.

>>
>>> expressed the desire to locate ZONE_MOVABLE memory on each
>>> memory controller to allow user space intensive processing to
>>> make better use of the additional memory bandwidth.
>>
>> Can you share some more how exactly ZONE_MOVABLE would help here to make
>> better use of the memory bandwidth?
> ZONE_MOVABLE memory is effectively unusable by the kernel. It can be
> used by user space applications through both the page allocator and the
> Hugetlbfs. If a large 'movablecore' allocation is defined and it can

Hugetlbfs not necessarily by all architectures. Some architectures don't
support placing hugetlb pages on ZONE_MOVABLE (not migratable) and
gigantic pages are special either way.

> only be located at the end of addressable memory then it will always be
> located on MEMC1 of a 7278 system. This will create a tendency for user
> space accesses to consume more bandwidth on the MEMC1 memory controller
> and kernel space accesses to consume more bandwidth on MEMC0. A more
> even distribution of ZONE_MOVABLE memory between the available memory
> controllers in theory makes more memory bandwidth available to user
> space intensive loads.
>

Sorry to be dense, is this also about different memory access latency or
just memory bandwidth?

Do these memory areas have special/different performance
characteristics? Using dedicated/fake NUMA nodes might be more in line
with what CXL and PMEM are up to.

Using ZONE_MOVABLE for that purpose feels a little bit like an abuse of
the mechanism. To be clearer what I mean:

We can place any movable allocations on ZONE_MOVABLE, including kernel
allocations. User space allocations are just one example, and in the
future we'll turn more and more allocations movable to be able to cope
with bigger ZONE_MOVABLE demands due to DAX/CXL. I once looked into
migrating user space page tables, just to give an example.


>>
>>> Unfortunately, the historical monotonic layout of zones would
>>> mean that if the lowest addressed memory controller contains
>>> ZONE_MOVABLE memory then all of the memory available from
>>> memory controllers at higher addresses must also be in the
>>> ZONE_MOVABLE zone. This would force all kernel memory accesses
>>> onto the lowest addressed memory controller and significantly
>>> reduce the amount of memory available for non-movable
>>> allocations.
>>
>> We do have code that relies on zones during boot to not overlap within a
>> single node.
> I believe my changes address all such reliance, but if you are aware of
> something I missed please let me know.
>

One example I'm aware of is drivers/base/memory.c:memory_block_add_nid()
/ early_node_zone_for_memory_block().

If we get it wrong, or actually have memory blocks that span multiple
zones, we can no longer offline these memory blocks. We really wanted to
avoid scanning the memmap for now and it seems to get the job done in
environments we care about.

>>
>>>
>>> The main objective of this patch set is therefore to allow a
>>> block of memory to be designated as part of the ZONE_MOVABLE
>>> zone where it will always only be used by the kernel page
>>> allocator to satisfy requests for movable pages. The term
>>> Designated Movable Block is introduced here to represent such a
>>> block. The favored implementation allows modification of the
>>
>> Sorry to say, but that term is rather suboptimal to describe what you
>> are doing here. You simply have some system RAM you'd want to have
>> managed by ZONE_MOVABLE, no?
> That may be true, but I found it superior to the 'sticky' movable
> terminology put forth by Mel Gorman ;). I'm happy to entertain
> alternatives, but they may not be as easy to find as you think.

Especially the "blocks" part is confusing. Movable pageblocks? Movable
Linux memory blocks?

Note that the sticky movable *pageblocks* were a completely different
concept than simply reusing ZONE_MOVABLE for some memory ranges.

>
>>
>>> 'movablecore' kernel parameter to allow specification of a base
>>> address and support for multiple blocks. The existing
>>> 'movablecore' mechanisms are retained. Other mechanisms based on
>>> device tree are also included in this set.
>>>
>>> BACKGROUND:
>>> NUMA architectures support distributing movablecore memory
>>> across each node, but it is undesirable to introduce the
>>> overhead and complexities of NUMA on systems that don't have a
>>> Non-Uniform Memory Architecture.
>>
>> How exactly would that look like? I think I am missing something :)
> The notion would be to consider each memory controller as a separate
> node, but as stated it is not desirable.
>

Doing it the DAX/CXL way would be to expose these memory ranges as
daxdev instead, and letting the admin decide how to online these memory
ranges when adding them to the buddy via the dax/kmem kernel module.

That could mean that you're booting with memory on MC0 only, and
exposing memory of MC1 via a daxdev, giving the admin the possibility to
decide to which zone the memory should be onlined.

That would avoid most kernel code changes.

>>
>> Why can't we simply designate these regions as CMA regions?
> We and others have encountered significant performance issues when large
> CMA regions are used. There are significant restrictions on the page
> allocator's use of MIGRATE_CMA pages and the memory subsystem works very
> hard to keep about half of the memory in the CMA region free. There have
> been attempts to patch the CMA implementation to alter this behavior
> (for example the set I referenced Mel's response to in [1]), but there
> are users that desire the current behavior.

Optimizing that would be great, eventually making it configurable or
selecting the behavior based on the actual CMA area sizes.

>
>>
>> Why do we have to start using ZONE_MOVABLE for them?
> One of the "other opportunities" for Designated Movable Blocks is to
> allow CMA to allocate from a DMB as an alternative. This would allow
> current users to continue using CMA as they want, but would allow users
> (e.g. hugetlb_cma) that are not sensitive to the allocation latency to
> let the kernel page allocator make more complete use (i.e. waste less)
> of the shared memory. ZONE_MOVABLE pageblocks are always MIGRATE_MOVABLE
> so the restrictions placed on MIGRATE_CMA pageblocks are lifted within a
> DMB.

The whole purpose of ZONE_MOVABLE is that *no* unmovable allocations end
up on it. The biggest difference to CMA is that the CMA *owner* is able
to place unmovable allocations on it.

Using ZONE_MOVABLE for unmovable allocations (hugetlb_cma) is not
acceptable as is.

Using ZONE_MOVABLE in a different context and calling it DMB is very
confusing TBH.

Just a note that I described the idea of a "PREFER_MOVABLE" zone in the
past. In contrast to ZONE_MOVABLE, we cannot run into weird OOM
situations in a ZONE misconfiguration, and we'd end up placing only
movable allocations on it as long as we can. However, especially
gigantic pages could be allocated from it. It sounds kind-of more like
what you want -- and maybe in combination with daxctl to let the user
decide how to online memory ranges.


And just to make it clear again: depending on ZONE_MOVABLE == only user
space allocations is not future proof.

>
>>
> Thanks for your consideration,
> Dough Baker ... I mean Doug Berger :).


:) Thanks Doug!

--
Thanks,

David / dhildenb

2022-10-01 01:00:35

by Doug Berger

[permalink] [raw]
Subject: Re: [PATCH 00/21] mm: introduce Designated Movable Blocks

On 9/29/2022 2:00 AM, David Hildenbrand wrote:
> On 20.09.22 03:03, Doug Berger wrote:
>> On 9/19/2022 2:00 AM, David Hildenbrand wrote:
>>> Hi Dough,
>>>
>>> I have some high-level questions.
>> Thanks for your interest. I will attempt to answer them.
>>
>
> Hi Doug,
>
> sorry for the late reply, slowly catching up on mails.
Thanks for finding the time, and for the thoughtful feedback.

>
>>>
>>>> MOTIVATION:
>>>> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory
>>>> controllers with each mapped in a different address range within
>>>> a Uniform Memory Architecture. Some users of these systems have
>>>
>>> How large are these areas typically?
>>>
>>> How large are they in comparison to other memory in the system?
>>>
>>> How is this memory currently presented to the system?
>> I'm not certain what is typical because these systems are highly
>> configurable and Broadcom's customers have different ideas about
>> application processing.
>>
>> The 7278 device has four ARMv8 CPU cores in an SMP cluster and two
>> memory controllers (MEMCs). Each MEMC is capable of controlling up to
>> 8GB of DRAM. An example 7278 system might have 1GB on each controller,
>> so an arm64 kernel might see 1GB on MEMC0 at 0x40000000-0x7FFFFFFF and
>> 1GB on MEMC1 at 0x300000000-0x33FFFFFFF.
>>
>> The Designated Movable Block concept introduced here has the potential
>> to offer useful services to different constituencies. I tried to
>> highlight this in my V1 patch set with the hope of attracting some
>> interest, but it can complicate the overall discussion, so I would like
>> to maybe narrow the discussion here. It may be good to keep them in mind
>> when assessing the overall value, but perhaps the "other opportunities"
>> can be covered as a follow on discussion.
>>
>> The base capability described in commits 7-15 of this V1 patch set is to
>> allow a 'movablecore' block to be created at a particular base address
>> rather than solely at the end of addressable memory.
>>
>
> Just so we're on the same page:
>
> Having too much ZONE_MOVABLE memory (ratio compared to !ZONE_MOVABLE
> memory) is dangerous. Acceptable ratios highly depend on the target
> workload. An extreme example is memory-hungry applications that end up
> long-term pinning a lot of memory (e.g., VMs with SR-IO): we can run
> easily out of free memory in the !ZONE_MOVABLE zones and might not want
> ZONE_MOVABLE at all.
Definitely. I've had to explain this to application developers myself
:). This is fundamentally why the existing 'movablecore' implementation
is insufficient for multiple memory controllers. Placing any
ZONE_MOVABLE memory on the lower addressed memory controller forces all
of the higher addressed memory controller(s) to only contain
ZONE_MOVABLE memory, which is generally unacceptable for any workload.
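
(For illustration only: with the movablecore extension proposed in this
set, a single Designated Movable Block could be placed at the base of
MEMC1 in the example above with something along the lines of the
following; the size and address here are made up, and the authoritative
syntax is the one documented in the patches themselves.)

    movablecore=256M@0x300000000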

>
> So whatever we do, this should in general not be the kernel sole
> decision to make this memory any special and let ZONE_MOVABLE manage it.
I believe you are stating that Designated Movable Blocks should only be
created as a result of special configuration (e.g. kernel parameters,
devicetree, ...). I would agree with that. Is that what you intended by
this statement, or am I missing something?

>
> It's the same with CMA. "Heavy" CMA users require special configuration:
> hugetlb_cma is one prime example.
>
>>>
>>>> expressed the desire to locate ZONE_MOVABLE memory on each
>>>> memory controller to allow user space intensive processing to
>>>> make better use of the additional memory bandwidth.
>>>
>>> Can you share some more how exactly ZONE_MOVABLE would help here to make
>>> better use of the memory bandwidth?
>> ZONE_MOVABLE memory is effectively unusable by the kernel. It can be
>> used by user space applications through both the page allocator and the
>> Hugetlbfs. If a large 'movablecore' allocation is defined and it can
>
> Hugetlbfs not necessarily by all architectures. Some architectures don't
> support placing hugetlb pages on ZONE_MOVABLE (not migratable) and
> gigantic pages are special either way.
That's true.

>
>> only be located at the end of addressable memory then it will always be
>> located on MEMC1 of a 7278 system. This will create a tendency for user
>> space accesses to consume more bandwidth on the MEMC1 memory controller
>> and kernel space accesses to consume more bandwidth on MEMC0. A more
>> even distribution of ZONE_MOVABLE memory between the available memory
>> controllers in theory makes more memory bandwidth available to user
>> space intensive loads.
>>
>
> Sorry to be dense, is this also about different memory access latency or
> just memory bandwidth?
Broadcom memory controllers do support configurable real-time scheduling
with bandwidth guarantees for different memory clients so I suppose this
is a fair question. However, the expectation here is that the CPUs would
have equivalent access latencies, so it is really just about memory
bandwidth for the CPUs.

>
> Do these memory areas have special/different performance
> characteristics? Using dedicated/fake NUMA nodes might be more in line
> with what CXL and PMEM are up to.
>
> Using ZONE_MOVABLE for that purpose feels a little bit like an abuse of
> the mechanism.
Current usage intends to have equivalent performance from a CPU
perspective. God forbid any Broadcom customers read your questions and
start asking for such capabilities :), but if they do I agree that
ZONE_MOVABLE for that purpose would be harebrained.

> To be clearer what I mean:
>
> We can place any movable allocations on ZONE_MOVABLE, including kernel
> allocations. User space allocations are just one example, and int he
> future we'll turn more and more allocations movable to be able to cope
> with bigger ZONE_MOVABLE demands due to DAX/CXL. I once looked into
> migrating user space page tables, just to give an example.
That's good to know.

>
>
>>>
>>>> Unfortunately, the historical monotonic layout of zones would
>>>> mean that if the lowest addressed memory controller contains
>>>> ZONE_MOVABLE memory then all of the memory available from
>>>> memory controllers at higher addresses must also be in the
>>>> ZONE_MOVABLE zone. This would force all kernel memory accesses
>>>> onto the lowest addressed memory controller and significantly
>>>> reduce the amount of memory available for non-movable
>>>> allocations.
>>>
>>> We do have code that relies on zones during boot to not overlap within a
>>> single node.
>> I believe my changes address all such reliance, but if you are aware of
>> something I missed please let me know.
>>
>
> One example I'm aware of is drivers/base/memory.c:memory_block_add_nid()
> / early_node_zone_for_memory_block().
>
> If we get it wrong, or actually have memory blocks that span multiple
> zones, we can no longer offline these memory blocks. We really wanted to
> avoid scanning the memmap for now and it seems to get the job done in
> environments we care about.
To the extent that this implementation only supports creating Designated
Movable Blocks in boot memory and boot memory does not generally support
offlining, I wouldn't expect this to be an issue. However, if for some
reason offlining boot memory becomes desirable then we should use
dmb_intersects() along with zone_intersects() to take the appropriate
action. Based on the current usage of zone_intersects() I'm not entirely
sure what the correct action should be.

>
>>>
>>>>
>>>> The main objective of this patch set is therefore to allow a
>>>> block of memory to be designated as part of the ZONE_MOVABLE
>>>> zone where it will always only be used by the kernel page
>>>> allocator to satisfy requests for movable pages. The term
>>>> Designated Movable Block is introduced here to represent such a
>>>> block. The favored implementation allows modification of the
>>>
>>> Sorry to say, but that term is rather suboptimal to describe what you
>>> are doing here. You simply have some system RAM you'd want to have
>>> managed by ZONE_MOVABLE, no?
>> That may be true, but I found it superior to the 'sticky' movable
>> terminology put forth by Mel Gorman ;). I'm happy to entertain
>> alternatives, but they may not be as easy to find as you think.
>
> Especially the "blocks" part is confusing. Movable pageblocks? Movable
> Linux memory blocks?
>
> Note that the sticky movable *pageblocks* were a completely different
> concept than simply reusing ZONE_MOVABLE for some memory ranges.
I would say that is open for debate. The implementations would be
"completely different" but the objectives could be quite similar.
There appear to be a number of people that are interested in the concept
of memory that can only contain data that tolerates relocation for
various potentially non-competing reasons.

Fundamentally, the concept of MIGRATE_MOVABLE memory is useful to allow
competing user space processes to share limited physical memory supplied
by the kernel. The data in that memory can be relocated elsewhere by the
kernel when the process that owns it is not executing. This movement is
typically not observable to the owning process which has its own address
space.

The kernel uses MIGRATE_UNMOVABLE memory to protect the integrity of its
address space, but of course what the kernel considers unmovable could
in fact be moved by a hypervisor in a way that is analogous to what the
kernel does for user space.

For maximum flexibility the Linux memory management allows for
converting the migratetype of free memory to help satisfy requests to
allocate pages of memory through a mechanism I will call "fallback". The
concepts of sticky movable pageblocks and ZONE_MOVABLE have the common
objective of preventing the migratetype of pageblocks from getting
converted to anything other than MIGRATE_MOVABLE, and this is what makes
the memory special.
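
As a rough illustration, the fallback ordering is encoded in a table
along these lines (paraphrased from mm/page_alloc.c, where the exact
contents vary across kernel versions); note that nothing ever falls back
*into* MIGRATE_CMA, and ZONE_MOVABLE pageblocks simply never leave
MIGRATE_MOVABLE:

/* Paraphrased sketch of the page allocator's fallback ordering. */
static int fallbacks[MIGRATE_TYPES][3] = {
	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_TYPES },
	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_TYPES },
	/* MIGRATE_CMA and MIGRATE_ISOLATE have no fallback targets. */
};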

I agree with Mel Gorman that zones are meant to be about address induced
limitations, so using a zone for the purpose of breaking the fallback
mechanism of the page allocator is a misuse of the concept. A new
migratetype would be more appropriate for representing this change in
how fallback should apply to the pageblock because the desired behavior
has nothing to do with the address at which the memory is located. It is
entirely reasonable to desire "sticky" movable behavior for memory in
any zone. Such a solution would be directly applicable to our multiple
memory controller use case, and is really how Designated Movable Blocks
should be imagined.

However, I also recognize the efficiency benefits of using a
ZONE_MOVABLE zone to manage the pages that have this "sticky" movable
behavior. Introducing a new sticky MIGRATE_MOVABLE migratetype adds a
new free_list to every free_area which increases the search space and
associated work when trying to allocate a page for all callers.
Introducing ZONE_MOVABLE reduces the search space by providing an early
separation between searches for movable and non-movable allocations. The
classic zone restrictions weren't a good fit for multiple memory
controllers, but those restrictions were lifted to overcome similar
issues with memory_hotplug. It is not that Designated Movable Blocks
want to be in ZONE_MOVABLE, but rather that ZONE_MOVABLE provides a
convenience for managing the page allocator's use of "sticky" movable
memory just like it does for memory hotplug. Dumping the memory in
Designated Movable Blocks into the ZONE_MOVABLE zone allows an existing
mechanism to be reused, reducing the risk of negatively impacting the
page allocator behavior.
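
To be explicit about that cost: every migratetype gets its own free list
in every free_area (roughly as below, abridged from
include/linux/mmzone.h), so each added migratetype widens the search
space in every zone, whereas ZONE_MOVABLE confines the separation to a
single zone:

/* Abridged from include/linux/mmzone.h: one free list per migratetype. */
struct free_area {
	struct list_head	free_list[MIGRATE_TYPES];
	unsigned long		nr_free;
};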

There are some subtle distinctions between Designated Movable Blocks and
the existing ZONE_MOVABLE zone. Because Designated Movable Blocks are
reserved when created they are protected against any early boot time
kernel reservations that might place unmovable allocations in them. The
implementation continues to track the zone_movable_pfn as the start of
the "classic" ZONE_MOVABLE zone on each node. A Designated Movable Block
can overlap any other zone including the "classic" ZONE_MOVABLE zone.

>
>>
>>>
>>>> 'movablecore' kernel parameter to allow specification of a base
>>>> address and support for multiple blocks. The existing
>>>> 'movablecore' mechanisms are retained. Other mechanisms based on
>>>> device tree are also included in this set.
>>>>
>>>> BACKGROUND:
>>>> NUMA architectures support distributing movablecore memory
>>>> across each node, but it is undesirable to introduce the
>>>> overhead and complexities of NUMA on systems that don't have a
>>>> Non-Uniform Memory Architecture.
>>>
>>> How exactly would that look like? I think I am missing something :)
>> The notion would be to consider each memory controller as a separate
>> node, but as stated it is not desirable.
>>
>
> Doing it the DAX/CXL way would be to expose these memory ranges as
> daxdev instead, and letting the admin decide how to online these memory
> ranges when adding them to the buddy via the dax/kmem kernel module.
>
> That could mean that your booting with memory on MC0 only, and expose
> memory of MC1 via a daxdev, giving the admin the possibility do decide
> to which zone the memory should be onlined too.
>
> That would avoid most kernel code changes.
I wasn't familiar with these kernel mechanisms and did enjoy reading
about the somewhat oxymoronic "volatile-use of persistent memory" that
is dax/kmem, but this isn't performance differentiated RAM. It really is
just System RAM so this degree of complexity seems unwarranted.

>
>>>
>>> Why can't we simply designate these regions as CMA regions?
>> We and others have encountered significant performance issues when large
>> CMA regions are used. There are significant restrictions on the page
>> allocator's use of MIGRATE_CMA pages and the memory subsystem works very
>> hard to keep about half of the memory in the CMA region free. There have
>> been attempts to patch the CMA implementation to alter this behavior
>> (for example the set I referenced Mel's response to in [1]), but there
>> are users that desire the current behavior.
>
> Optimizing that would be great, eventually making it configurable or
> selecting the behavior based on the actual CMA area sizes.
>
>>
>>>
>>> Why do we have to start using ZONE_MOVABLE for them?
>> One of the "other opportunities" for Designated Movable Blocks is to
>> allow CMA to allocate from a DMB as an alternative. This would allow
>> current users to continue using CMA as they want, but would allow users
>> (e.g. hugetlb_cma) that are not sensitive to the allocation latency to
>> let the kernel page allocator make more complete use (i.e. waste less)
>> of the shared memory. ZONE_MOVABLE pageblocks are always MIGRATE_MOVABLE
>> so the restrictions placed on MIGRATE_CMA pageblocks are lifted within a
>> DMB.
>
> The whole purpose of ZONE_MOVABLE is that *no* unmovable allocations end
> up on it. The biggest difference to CMA is that the CMA *owner* is able
> to place unmovable allocations on it.
I'm not sure that is a wholly fair characterization (or maybe I just
hope that's the case :). I would agree that the Linux page allocator
can't place any unmovable allocations on it. I expect that people locate
memory in ZONE_MOVABLE for different purposes. For example, the memory
hotplug users ostensibly place memory there so that any data on the
hot-plugged memory can be moved off of it prior to the memory being
hot-unplugged. Unplugging the memory removes the memory from the
ZONE_MOVABLE zone, but it is not materially different from allocating
the memory for a different purpose (perhaps in a different machine).

Conceptually, allowing a CMA allocator to operate on a Designated
Movable Block of memory that it *owns* is also removing that memory from
the ZONE_MOVABLE zone. Issues of ownership should be addressed which is
why these "other opportunities" are being deferred for now, but I do not
believe such use is unreasonable. Again, Designated Movable Blocks are
only allowed in boot memory so there shouldn't be a conflict with memory
hotplug. I believe the same would apply for hugetlb_cma.
>
> Using ZONE_MOVABLE for unmovable allocations (hugetlb_cma) is not
> acceptable as is.
>
> Using ZONE_MOVABLE in different context and calling it DMB is very
> confusing TBH.
Perhaps it is more helpful to think of a Designated Movable Block as a
block of memory whose migratetype is not allowed to be changed from
MIGRATE_MOVABLE (i.e. "sticky" migrate movable). The fact that
ZONE_MOVABLE is being used to achieve that is an implementation detail
for this commit set. In the same way that memory hotplug is the concept
of adding System RAM during run time, but placing it in ZONE_MOVABLE is
an implementation detail to make it easier to unplug.

>
> Just a note that I described the idea of a "PREFER_MOVABLE" zone in the
> past. In contrast to ZONE_MOVABLE, we cannot run into weird OOM
> situations in a ZONE misconfiguration, and we'd end up placing only
> movable allocations on it as long as we can. However, especially
> gigantic pages could be allocated from it. It sounds kind-of more like
> what you want -- and maybe in combination of daxctl to let the user
> decide how to online memory ranges.
Best not let Mel hear you suggesting another zone ;).

>
>
> And just to make it clear again: depending on ZONE_MOVABLE == only user
> space allocations is not future proof.
Understood.

>
>>
>>>
>> Thanks for your consideration,
>> Dough Baker ... I mean Doug Berger :).
>
>
> :) Thanks Doug!
>
Thank you!
-Doug

2022-10-05 18:57:40

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 00/21] mm: introduce Designated Movable Blocks


>> So whatever we do, this should in general not be the kernel sole
>> decision to make this memory any special and let ZONE_MOVABLE manage it.
> I believe you are stating that Designated Movable Blocks should only be
> created as a result of special configuration (e.g. kernel parameters,
> devicetree, ...). I would agree with that. Is that what you intended by
> this statement, or am I missing something?

Essentially, that it should mostly be the decision of an educated admin.

...

>>
>>> only be located at the end of addressable memory then it will always be
>>> located on MEMC1 of a 7278 system. This will create a tendency for user
>>> space accesses to consume more bandwidth on the MEMC1 memory controller
>>> and kernel space accesses to consume more bandwidth on MEMC0. A more
>>> even distribution of ZONE_MOVABLE memory between the available memory
>>> controllers in theory makes more memory bandwidth available to user
>>> space intensive loads.
>>>
>>
>> Sorry to be dense, is this also about different memory access latency or
>> just memory bandwidth?
> Broadcom memory controllers do support configurable real-time scheduling
> with bandwidth guarantees for different memory clients so I suppose this
> is a fair question. However, the expectation here is that the CPUs would
> have equivalent access latencies, so it is really just about memory
> bandwidth for the CPUs.

Okay, thanks for clarifying.

...

>>>>
>>>>> Unfortunately, the historical monotonic layout of zones would
>>>>> mean that if the lowest addressed memory controller contains
>>>>> ZONE_MOVABLE memory then all of the memory available from
>>>>> memory controllers at higher addresses must also be in the
>>>>> ZONE_MOVABLE zone. This would force all kernel memory accesses
>>>>> onto the lowest addressed memory controller and significantly
>>>>> reduce the amount of memory available for non-movable
>>>>> allocations.
>>>>
>>>> We do have code that relies on zones during boot to not overlap within a
>>>> single node.
>>> I believe my changes address all such reliance, but if you are aware of
>>> something I missed please let me know.
>>>
>>
>> One example I'm aware of is drivers/base/memory.c:memory_block_add_nid()
>> / early_node_zone_for_memory_block().
>>
>> If we get it wrong, or actually have memory blocks that span multiple
>> zones, we can no longer offline these memory blocks. We really wanted to
>> avoid scanning the memmap for now and it seems to get the job done in
>> environments we care about.
> To the extent that this implementation only supports creating Designated
> Movable Blocks in boot memory and boot memory does not generally support
> offlining, I wouldn't expect this to be an issue. However, if for some

The sad truth is that boot memory sometimes is supposed to support
offlining -- or people expect it to work to some degree. For example,
with special memblock hacks you can get boot memory into ZONE_MOVABLE to
be able to hot-unplug some NUMA nodes even after a reboot (movable_node
kernel parameter).

There are use cases where you want to offline boot memory to save energy
by disabling complete memory banks -- best effort when not using
ZONE_MOVABLE.

That said, I agree that it's a corner-case use case.

> reason offlining boot memory becomes desirable then we should use
> dmb_intersects() along with zone_intersects() to take the appropriate
> action. Based on the current usage of zone_intersects() I'm not entirely
> sure what the correct action should be.
>
>>
>>>>
>>>>>
>>>>> The main objective of this patch set is therefore to allow a
>>>>> block of memory to be designated as part of the ZONE_MOVABLE
>>>>> zone where it will always only be used by the kernel page
>>>>> allocator to satisfy requests for movable pages. The term
>>>>> Designated Movable Block is introduced here to represent such a
>>>>> block. The favored implementation allows modification of the
>>>>
>>>> Sorry to say, but that term is rather suboptimal to describe what you
>>>> are doing here. You simply have some system RAM you'd want to have
>>>> managed by ZONE_MOVABLE, no?
>>> That may be true, but I found it superior to the 'sticky' movable
>>> terminology put forth by Mel Gorman ;). I'm happy to entertain
>>> alternatives, but they may not be as easy to find as you think.
>>
>> Especially the "blocks" part is confusing. Movable pageblocks? Movable
>> Linux memory blocks?
>>
>> Note that the sticky movable *pageblocks* were a completely different
>> concept than simply reusing ZONE_MOVABLE for some memory ranges.
> I would say that is open for debate. The implementations would be
> "completely different" but the objectives could be quite similar.
> There appear to be a number of people that are interested in the concept
> of memory that can only contain data that tolerates relocation for
> various potentially non-competing reasons.
>
> Fundamentally, the concept of MIGRATE_MOVABLE memory is useful to allow
> competing user space processes to share limited physical memory supplied
> by the kernel. The data in that memory can be relocated elsewhere by the
> kernel when the process that owns it is not executing. This movement is
> typically not observable to the owning process which has its own address
> space.
>
> The kernel uses MIGRATE_UNMOVABLE memory to protect the integrity of its
> address space, but of course what the kernel considers unmovable could
> in fact be moved by a hypervisor in a way that is analogous to what the
> kernel does for user space.
>
> For maximum flexibility the Linux memory management allows for
> converting the migratetype of free memory to help satisfy requests to
> allocate pages of memory through a mechanism I will call "fallback". The
> concepts of sticky movable pageblocks and ZONE_MOVABLE have the common
> objective of preventing the migratetype of pageblocks from getting
> converted to anything other than MIGRATE_MOVABLE, and this is what makes
> the memory special.

Yes, good summary.
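
To make that concrete for anyone following along without the code at
hand: the "fallback" order boils down to a small preference table
consulted by __rmqueue_fallback() in mm/page_alloc.c, roughly the
following (simplified here, not the exact table):

  static int fallbacks[MIGRATE_TYPES][3] = {
          [MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_TYPES },
          [MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
          [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_TYPES },
  };

Both "sticky" movable pageblocks and ZONE_MOVABLE amount to making sure
a movable pageblock can never be stolen through that table.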

>
> I agree with Mel Gorman that zones are meant to be about address induced
> limitations, so using a zone for the purpose of breaking the fallback
> mechanism of the page allocator is a misuse of the concept. A new
> migratetype would be more appropriate for representing this change in
> how fallback should apply to the pageblock because the desired behavior
> has nothing to do with the address at which the memory is located. It is
> entirely reasonable to desire "sticky" movable behavior for memory in
> any zone. Such a solution would be directly applicable to our multiple
> memory controller use case, and is really how Designated Movable Blocks
> should be imagined.

I usually agree with Mel, but not necessarily on the point that it's a
misuse of a concept. It's an extension of an existing concept, which
doesn't imply it's a misuse. Traditionally, it was about address
limitations, yes. Now it's also about allocation types. Sure, there
might be other ways to get it done as well.

I'd compare it to the current use of NUMA nodes: traditionally, it
really used to be actual NUMA nodes. Nowadays, it's a mechanism, for
example, to expose performance-differentiated memory, let applications use
it via mbind() or have the page allocator dynamically migrate hot/cold
pages back and forth according to memory tiering strategies.

>
> However, I also recognize the efficiency benefits of using a
> ZONE_MOVABLE zone to manage the pages that have this "sticky" movable
> behavior. Introducing a new sticky MIGRATE_MOVABLE migratetype adds a
> new free_list to every free_area which increases the search space and
> associated work when trying to allocate a page for all callers.
> Introducing ZONE_MOVABLE reduces the search space by providing an early
> separation between searches for movable and non-movable allocations. The
> classic zone restrictions weren't a good fit for multiple memory
> controllers, but those restrictions were lifted to overcome similar
> issues with memory_hotplug. It is not that Designated Movable Blocks
> want to be in ZONE_MOVABLE, but rather that ZONE_MOVABLE provides a
> convenience for managing the page allocator's use of "sticky" movable
> memory just like it does for memory hotplug. Dumping the memory in
> Designated Movable Blocks into the ZONE_MOVABLE zone allows an existing
> mechanism to be reused, reducing the risk of negatively impacting the
> page allocator behavior.
>
> There are some subtle distinctions between Designated Movable Blocks and
> the existing ZONE_MOVABLE zone. Because Designated Movable Blocks are
> reserved when created they are protected against any early boot time
> kernel reservations that might place unmovable allocations in them. The
> implementation continues to track the zone_movable_pfn as the start of
> the "classic" ZONE_MOVABLE zone on each node. A Designated Movable Block
> can overlap any other zone including the "classic" ZONE_MOVABLE zone.

What exactly do you mean by "overlap" -- I assume you mean that the zone
span will overlap but the memory really "belongs" to ZONE_MOVABLE, as
indicated by its struct page metadata.

>>
>> Doing it the DAX/CXL way would be to expose these memory ranges as
>> daxdev instead, and letting the admin decide how to online these memory
>> ranges when adding them to the buddy via the dax/kmem kernel module.
>>
>> That could mean that you're booting with memory on MC0 only, and
>> exposing memory of MC1 via a daxdev, giving the admin the possibility
>> to decide to which zone the memory should be onlined.
>>
>> That would avoid most kernel code changes.
> I wasn't familiar with these kernel mechanisms and did enjoy reading
> about the somewhat oxymoronic "volatile-use of persistent memory" that
> is dax/kmem, but this isn't performance differentiated RAM. It really is
> just System RAM so this degree of complexity seems unwarranted.

It's an existing mechanism that will get heavily used by CXL -- for all
kinds of memory. I feel like it could solve your use case eventually.

Excluded memory cannot be allocated by the early allocator and you can
online it to ZONE_MOVABLE. It at least seems to roughly do something you
want to achieve. I'd be curious what you can't achieve or what we might
need to make it work.

>>>
>>>>
>>>> Why do we have to start using ZONE_MOVABLE for them?
>>> One of the "other opportunities" for Designated Movable Blocks is to
>>> allow CMA to allocate from a DMB as an alternative. This would allow
>>> current users to continue using CMA as they want, but would allow users
>>> (e.g. hugetlb_cma) that are not sensitive to the allocation latency to
>>> let the kernel page allocator make more complete use (i.e. waste less)
>>> of the shared memory. ZONE_MOVABLE pageblocks are always MIGRATE_MOVABLE
>>> so the restrictions placed on MIGRATE_CMA pageblocks are lifted within a
>>> DMB.
>>
>> The whole purpose of ZONE_MOVABLE is that *no* unmovable allocations end
>> up on it. The biggest difference to CMA is that the CMA *owner* is able
>> to place unmovable allocations on it.
> I'm not sure that is a wholly fair characterization (or maybe I just
> hope that's the case :). I would agree that the Linux page allocator
> can't place any unmovable allocations on it. I expect that people locate
> memory in ZONE_MOVABLE for different purposes. For example, the memory
> hotplug users ostensibly place memory there so that any data on the hot
> plugged memory can be moved off of the memory prior to it being hot
> unplugged. Unplugging the memory removes the memory from the
> ZONE_MOVABLE zone, but it is not materially different from allocating
> the memory for a different purpose (perhaps in a different machine).

Well, memory offlining is the one operation that evacuates memory and
makes sure it cannot be allocated anymore (possibly with the intention
of removing that memory from the system). Sure, you can call it a fake
allocation, but there is a more fundamental difference compared to
random subsystems placing unmovable allocations there.

>
> Conceptually, allowing a CMA allocator to operate on a Designated
> Movable Block of memory that it *owns* is also removing that memory from
> the ZONE_MOVABLE zone. Issues of ownership should be addressed which is
> why these "other opportunities" are being deferred for now, but I do not
> believe such use is unreasonable. Again, Designated Movable Blocks are
> only allowed in boot memory so there shouldn't be a conflict with memory
> hotplug. I believe the same would apply for hugetlb_cma.
>>
>> Using ZONE_MOVABLE for unmovable allocations (hugetlb_cma) is not
>> acceptable as is.
>>
>> Using ZONE_MOVABLE in different context and calling it DMB is very
>> confusing TBH.
> Perhaps it is more helpful to think of a Designated Movable Block as a
> block of memory whose migratetype is not allowed to be changed from
> MIGRATE_MOVABLE (i.e. "sticky" migrate movable). The fact that

I think that such a description might make the feature easier to grasp.
Although I am not sure yet if DMB as proposed is rather a hack to avoid
introducing real sticky movable blocks (sorry, I'm just trying to
connect the dots and there is a lot of complexity involved) or actually
a clean design. Messing with zones and memblock always implies complexity :)

> ZONE_MOVABLE is being used to achieve that is an implementation detail
> for this commit set. In the same way that memory hotplug is the concept
> of adding System RAM during run time, but placing it in ZONE_MOVABLE is
> an implementation detail to make it easier to unplug.

Right, but there we don't play any tricks: it's just ZONE_MOVABLE
without any other metadata pointing out ownership. Maybe that's what you
are trying to describe here: A DMB inside ZONE_MOVABLE implies that
there is another owner and that even memory offlining should fail.

>
>>
>> Just a note that I described the idea of a "PREFER_MOVABLE" zone in the
>> past. In contrast to ZONE_MOVABLE, we cannot run into weird OOM
>> situations due to a zone misconfiguration, and we'd end up placing only
>> movable allocations on it as long as we can. However, especially
>> gigantic pages could be allocated from it. It sounds kind of more like
>> what you want -- maybe in combination with daxctl to let the user
>> decide how to online memory ranges.
> Best not let Mel hear you suggesting another zone ;).

He most probably read it already. ;) I can understand all theoretical
complains about ZONE_MOVABLE, but in the end it has been getting the job
done for years.

>
>>
>>
>> And just to make it clear again: depending on ZONE_MOVABLE == only user
>> space allocations is not future proof.
> Understood.

May I ask what the main purpose/use case of DMB is?

Would it be sufficient, to specify that hugetlb are allocated from a
specific memory area, possible managed by CMA? And then simply providing
the application that cares these hugetlb pages? Would you need something
that is *not* hugetlb?

But even then, how would an application be able to specify that exactly
its allocation will get served from that part of ZONE_MOVABLE? Sure, if
you don't reserve any other hugetlb pages, it's easy.


I'd like to note that if you'd go with (fake) NUMA nodes like PMEM or
CXL you could easily let your application mbind() to that memory and
have it configured.
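
As a rough sketch of what I mean (plain user space against <numaif.h>
from libnuma, assuming the second memory controller shows up as node 1;
error handling trimmed):

  #include <numaif.h>      /* mbind(), MPOL_BIND; link with -lnuma */
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 64UL << 20;                /* 64 MiB */
          unsigned long nodemask = 1UL << 1;      /* node 1 only (assumed) */
          void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (p == MAP_FAILED)
                  return 1;

          /* bind this mapping to the (fake) node backing the second MC */
          if (mbind(p, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0))
                  return 1;

          ((char *)p)[0] = 1;     /* first touch allocates from node 1 */
          return 0;
  }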

--
Thanks,

David / dhildenb

2022-10-12 23:52:27

by Doug Berger

[permalink] [raw]
Subject: Re: [PATCH 00/21] mm: introduce Designated Movable Blocks

Reordered to (hopefully) improve readability.

On 10/5/2022 11:39 AM, David Hildenbrand wrote:
> May I ask what the main purpose/use case of DMB is?
The concept of Designated Movable Blocks was conceived to provide a
common mechanism for different use cases, so identifying the "main" one
is not so easy. Broadly speaking I would say there are two different but
compatible objectives that could be used to categorize use cases.

The narrower objective is the ability to locate some "user space
friendly" memory on each memory controller to make more of the total
memory bandwidth available to user space processes. The ZONE_MOVABLE
zone is considered to be "user space friendly" so locating some of it on
each memory controller would meet this objective. The existing
'movablecore' kernel parameter allows the balance of kernel/movable
memory to be adjusted, but movable memory will always be located on the
highest addressed memory controller. The v2 patch set attempts to focus
explicitly on the use case of adding a base address to the 'movablecore'
kernel parameter to support this objective.
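
For concreteness, on the 1GB MEMC0 + 1GB MEMC1 example referred to later
in this mail, that would look roughly like the following command line
fragment, asking for 256MB of movable memory at the top of MEMC0 (the
exact accepted syntax is as defined by the v2 patches, not this note):

  movablecore=256M@0x70000000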

The other general objective is to facilitate better reuse/sharing of
memory. Broadcom Set-Top Box SoCs include video processing devices that
can require large amounts of memory to perform their functions.
Historically, memory carve-outs have been used to ensure guaranteed
availability of memory to meet the requirements of cable television
customers. The rise of Android TV and Google TV has made the
inefficiency of memory carve-outs unacceptable.

We have tried to meet the reusability objective with a CMA based
implementation, but Broadcom customers were unhappy with the
performance. Efforts to improve the CMA performance led me to Joonsoo's
efforts to do the same and to the "sticky" MIGRATE_MOVABLE proposal from
Mel Gorman that I cited. I began working on an implementation of
Designated Movable Blocks based on that proposal, which could be
characterized as reserving a block of memory, assigning it a new
"sticky" movable migrate type, and modifying the fast and slow path page
allocators to handle the new migrate type so that requests for movable
memory could be satisfied by pages from the blocks while the migrate
type of pages in the blocks could not be changed by "fallback" mechanisms.

Both of these objectives require the ability to specify the location of
a block of memory that can only be used by the Linux kernel page
allocator to satisfy requests for movable memory. The location is
relevant because it may need to be on a specific memory controller or it
may have to satisfy the DMA address range of a specific device. The
movability is relevant because it improves the availability to user
space allocations or it allows the data occupying the memory to be moved
away when the memory is required by the device. The Designated Movable
Block mechanism was designed to satisfy these requirements and was seen
as a common mechanism for both objectives.

While learning more about the page allocator implementation, I realized
that hotplug memory also has these same requirements. The location of
hotplug memory is determined by the system hardware independent of
Linux's zone concepts and the data stored on the memory must be movable
to support the ability to offline the memory before it is unplugged.
This led me to study the hotplug memory implementation to understand how
it satisfies these requirements.

I became aware that the "narrower objective" could conceivably be
satisfied by the hotplug memory capability with a few challenges. First
the size of hotplug memory sections is a bit coarse. The current 128MB
sections on arm64 are not too bad and are far better than the 1GB
sections that were in place when I first looked at it.

For systems that do not support ACPI there is no clear way to specify
hotplug memory regions at boot time. When Linux boots an arm64 kernel
with devicetree the OS attempts to initialize all available memory
described by the devicetree. Typically this boot memory cannot be
unplugged to allow it to be plugged into a different zone. A devicetree
specification of the hardware could intentionally leave holes in its
memory description to allow for runtime plugging of memory into the
holes, but this goes against the spirit of a devicetree description of
the system hardware as it is not representative of what hardware is
actually present. The 'mem=' kernel parameter can be used to prevent
Linux from initializing all of the available memory so that memory could
be hotplugged after boot, but this breaks devicetree mechanisms for
reserving memory from addresses that might only be populated by hotplug
after boot.

It also becomes difficult to manage the selection of zones where memory
is hotplugged. Referring again to the example system with 1GB on MEMC0
and 1GB on MEMC1 we could boot with 'mem=768M' to leave 256MB
unpopulated on MEMC0 and all of the memory (1GB) on MEMC1 unpopulated.
If we set the memory_hotplug module parameter online_policy to
"auto-movable" then adding 256MB at 0x70000000 will put the memory in
ZONE_MOVABLE as desired. However, we might want to hotplug 768MB at
0x300000000 into ZONE_NORMAL and 256MB at 0x330000000 into ZONE_MOVABLE.
The fact that the memory_hotplug parameters are not easily modifiable
from the kernel modules that are necessary to access the memory_hotplug
API makes this a difficult dance. I have experimented with a simple
module exposing hotplug capability to user space and have confirmed as a
proof of concept that user space can adjust the memory_hotplug
parameters and use the module to achieve the desired zone population
with hotplug. The /sys/devices/system/memory/probe control simplifies
this, but is not enabled on arm64 architectures.
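
Roughly, the knobs involved in that dance are the following (sysfs paths
as I understand them; the probe writes are one per block-aligned memory
block and, as noted, not enabled on arm64, and memoryN/memoryM are
placeholders whose numbering depends on the block size):

  # let a plain "online" pick ZONE_MOVABLE where the policy allows it
  echo auto-movable > /sys/module/memory_hotplug/parameters/online_policy
  # probe the ranges left out via 'mem='
  echo 0x70000000  > /sys/devices/system/memory/probe
  echo 0x300000000 > /sys/devices/system/memory/probe
  # or force individual blocks into an explicit zone
  echo online_kernel  > /sys/devices/system/memory/memoryN/state
  echo online_movable > /sys/devices/system/memory/memoryM/state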

In addition, keeping this memory unplugged until after boot means that
the memory cannot be used during boot. Kernel boot time reservations are
a mixed bag. On the one hand they won't land in ZONE_MOVABLE which is
nice, but in this example they land in ZONE_DMA which can be considered
a more valuable resource than ZONE_NORMAL. Both of these issues are not
likely to be of significant consequence, but neither is really desirable.

Finally, just like there are those that may not want to execute a NUMA
kernel (e.g. Android GKI arm64), there may also be those that don't want
to include memory hotplug support in their kernel. These things can
change, but are not always under our control.

If you are aware of solutions to these issues that would make memory
hotplug a more viable solution for us than DMB I would be happy to know
them.

These observations led me to design DMB more as an extension of
'movablecore' than an extension of memory hotplug. However, the
efficiency of using the ZONE_MOVABLE zone to collect and manage "sticky"
movable pages in an address independent way without "fallback" (as is
done by memory hotplug) won me over and I abandoned the idea of
modifying the fast and slow page allocator paths to support a "sticky"
movable migrate type. The implementation of DMB was re-conceived to
preserve the existing 'movablecore' mechanism of creating a dynamic
ZONE_MOVABLE zone that spans from zone_movable_pfn for each node to the
end of memory on the node, and adding the ability to designate blocks of
memory whose pages would be removed from their default zone and placed
in the ZONE_MOVABLE zone. The span of each ZONE_MOVABLE zone was
increased to start at the lowest pfn in the zone on the node and
continue to the end of memory on the node. I also chose not to destroy
zones that became empty after their pages were moved to ZONE_MOVABLE.
These last two decisions were a matter of convenience, but I can see
that they may have created some confusion (based on your questions) so I
am happy to reconsider them.

>
> Would it be sufficient, to specify that hugetlb are allocated from a
> specific memory area, possible managed by CMA? And then simply providing
> the application that cares these hugetlb pages? Would you need something
> that is *not* hugetlb?
>
> But even then, how would an application be able to specify that exactly
> its allocation will get served from that part of ZONE_MOVABLE? Sure, if
> you don't reserve any other hugetlb pages, it's easy.
As noted before I actually have very limited visibility into how the
"narrower objective" is being used by Broadcom customers and how much
benefit it provides. I believe its current use is probably simply
opportunistic, but these kinds of improvements to hugetlb allocation
might be welcomed.

I'd say the hugetlb_cma is similar to what you are describing except
that it is consolidated rather than being distributed across multiple
memory areas. Such changes to add benefit to the "narrower objective"
need not be considered with respect to this patch set. On the other
hand, the reuse objective of Designated Movable Blocks could be very
relevant to hugetlb_cma.
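
(For reference, the consolidated form is roughly what the existing
parameters provide today, e.g. booting with 'hugetlb_cma=1G' and then
allocating the gigantic pages from that CMA area at run time with
something like:

  echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

The per-memory-controller placement is the piece that is missing.)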

>>
>> I agree with Mel Gorman that zones are meant to be about address induced
>> limitations, so using a zone for the purpose of breaking the fallback
>> mechanism of the page allocator is a misuse of the concept. A new
>> migratetype would be more appropriate for representing this change in
>> how fallback should apply to the pageblock because the desired behavior
>> has nothing to do with the address at which the memory is located. It is
>> entirely reasonable to desire "sticky" movable behavior for memory in
>> any zone. Such a solution would be directly applicable to our multiple
>> memory controller use case, and is really how Designated Movable Blocks
>> should be imagined.
>
> I usually agree with Mel, but not necessarily on the point that it's a
> misuse of a concept. It's an extension of an existing concept, which
> doesn't imply it's a misuse. Traditionally, it was about address
> limitations, yes. Now it's also about allocation types. Sure, there
> might be other ways to get it done as well.
Yes, I would also agree that when introduced that was the concept, but
that the extensions made for memory hotplug have enough value to be a
justified extension of the initial concept. That is exactly why I
changed my approach.

>
> I'd compare it to the current use of NUMA nodes: traditionally, it
> really used to be actual NUMA nodes. Nowadays, it's a mechanism, for
> example, to expose performance-differentiated memory, let applications use
> it via mbind() or have the page allocator dynamically migrate hot/cold
> pages back and forth according to memory tiering strategies.
You are helping me gain an appreciation for the current extensions of
the node concept beyond the initial use for NUMA. It does sound useful
for applications that do want to have that finer control over the
resources they use.

However, I still believe there is value in the Designated Movable Block
concept that should be realizable when nodes are not available in the
kernel config. The implementation I am proposing should not incur a cost
for those that don't wish to use it.

>
>>
>> However, I also recognize the efficiency benefits of using a
>> ZONE_MOVABLE zone to manage the pages that have this "sticky" movable
>> behavior. Introducing a new sticky MIGRATE_MOVABLE migratetype adds a
>> new free_list to every free_area which increases the search space and
>> associated work when trying to allocate a page for all callers.
>> Introducing ZONE_MOVABLE reduces the search space by providing an early
>> separation between searches for movable and non-movable allocations. The
>> classic zone restrictions weren't a good fit for multiple memory
>> controllers, but those restrictions were lifted to overcome similar
>> issues with memory_hotplug. It is not that Designated Movable Blocks
>> want to be in ZONE_MOVABLE, but rather that ZONE_MOVABLE provides a
>> convenience for managing the page allocator's use of "sticky" movable
>> memory just like it does for memory hotplug. Dumping the memory in
>> Designated Movable Blocks into the ZONE_MOVABLE zone allows an existing
>> mechanism to be reused, reducing the risk of negatively impacting the
>> page allocator behavior.
>>
>> There are some subtle distinctions between Designated Movable Blocks and
>> the existing ZONE_MOVABLE zone. Because Designated Movable Blocks are
>> reserved when created they are protected against any early boot time
>> kernel reservations that might place unmovable allocations in them. The
>> implementation continues to track the zone_movable_pfn as the start of
>> the "classic" ZONE_MOVABLE zone on each node. A Designated Movable Block
>> can overlap any other zone including the "classic" ZONE_MOVABLE zone.
>
> What exactly do you mean by "overlap" -- I assume you mean that the zone
> span will overlap but the memory really "belongs" to ZONE_MOVABLE, as
> indicated by its struct page metadata.
Yes. If the pages of a DMB are within the span of a zone I am saying it
overlaps that zone. The pages will only be "present" in the ZONE_MOVABLE
zone.
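
In code terms the distinction is roughly the following (using existing
struct zone fields; an illustration, not code from the patches):

  /* a DMB pfn can fall inside the *span* of, say, ZONE_NORMAL ... */
  static bool pfn_in_zone_span(struct zone *zone, unsigned long pfn)
  {
          return pfn >= zone->zone_start_pfn &&
                 pfn <  zone->zone_start_pfn + zone->spanned_pages;
  }

  /*
   * ... while its struct page reports page_zonenum(pfn_to_page(pfn)) ==
   * ZONE_MOVABLE and the page is accounted only in ZONE_MOVABLE's
   * present_pages.
   */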

>>>>
>>>>>
>>>>> Why do we have to start using ZONE_MOVABLE for them?
>>>> One of the "other opportunities" for Designated Movable Blocks is to
>>>> allow CMA to allocate from a DMB as an alternative. This would allow
>>>> current users to continue using CMA as they want, but would allow users
>>>> (e.g. hugetlb_cma) that are not sensitive to the allocation latency to
>>>> let the kernel page allocator make more complete use (i.e. waste less)
>>>> of the shared memory. ZONE_MOVABLE pageblocks are always
>>>> MIGRATE_MOVABLE
>>>> so the restrictions placed on MIGRATE_CMA pageblocks are lifted
>>>> within a
>>>> DMB.
>>>
>>> The whole purpose of ZONE_MOVABLE is that *no* unmovable allocations end
>>> up on it. The biggest difference to CMA is that the CMA *owner* is able
>>> to place unmovable allocations on it.
>> I'm not sure that is a wholly fair characterization (or maybe I just
>> hope that's the case :). I would agree that the Linux page allocator
>> can't place any unmovable allocations on it. I expect that people locate
>> memory in ZONE_MOVABLE for different purposes. For example, the memory
>> hotplug users ostensibly place memory there so that any data on the hot
>> plugged memory can be moved off of the memory prior to it being hot
>> unplugged. Unplugging the memory removes the memory from the
>> ZONE_MOVABLE zone, but it is not materially different from allocating
>> the memory for a different purpose (perhaps in a different machine).
>
> Well, memory offlining is the one operation that evacuates memory and
> makes sure it cannot be allocated anymore (possibly with the intention
> of removing that memory from the system). Sure, you can call it a fake
> allocation, but there is a more fundamental difference compared to
> random subsystems placing unmovable allocations there.
For the record, I am not offended by your use of the word "random" in
that statement. I was once informed I unintentionally offended someone
by using the term "arbitrary" in a similar way ;).

Any such unmovable allocation should be made with intent and with
authority to do so. The memory hotunplug is an example (perhaps a
singular one) of a subsystem that can do so with intent and authority.
Randomness plays no role.

"Ownership" of a DMB would imply authority and such an owner should be
presumed to be acting with intent. So the mechanics of ownership and
methods should be formalized before the general objective of reuse of
DMBs for non-movable purposes (e.g. hugetlb_cma, device driver, ...) is
allowed. This is why that objective has been deferred with the hope that
users that may have an interest in this objective can propose their
favored mechanism.

The "narrower objective" expressed in my v2 submission (i.e. movablecore
with base address) does not make any non-movable allocations so explicit
ownership is not necessary. Maybe whoever provided the 'movablecore'
parameter is the implied owner, but it doesn't much matter in this case.
Conceptually such a DMB could be hotunplugged, but that would be unexpected.

>
>>
>> Conceptually, allowing a CMA allocator to operate on a Designated
>> Movable Block of memory that it *owns* is also removing that memory from
>> the ZONE_MOVABLE zone. Issues of ownership should be addressed which is
>> why these "other opportunities" are being deferred for now, but I do not
>> believe such use is unreasonable. Again, Designated Movable Blocks are
>> only allowed in boot memory so there shouldn't be a conflict with memory
>> hotplug. I believe the same would apply for hugetlb_cma.
>>>
>>> Using ZONE_MOVABLE for unmovable allocations (hugetlb_cma) is not
>>> acceptable as is.
>>>
>>> Using ZONE_MOVABLE in different context and calling it DMB is very
>>> confusing TBH.
>> Perhaps it is more helpful to think of a Designated Movable Block as a
>> block of memory whose migratetype is not allowed to be changed from
>> MIGRATE_MOVABLE (i.e. "sticky" migrate movable). The fact that
>
> I think that such a description might make the feature easier to grasp.
> Although I am not sure yet if DMB as proposed is rather a hack to avoid
> introducing real sticky movable blocks (sorry, I'm just trying to
> connect the dots and there is a lot of complexity involved) or actually
> a clean design. Messing with zones and memblock always implies
> complexity :)
I very much appreciate your efforts to make sense of this. I am not
certain whether that OR is INCLUSIVE or EXCLUSIVE. I would say that the
implementation attempts to reuse the clean design of ZONE_MOVABLE (as
extended by memory hotplug) to provide the management of "sticky"
movable blocks that may overlap/overlay other zones. Doing so makes it
unnecessary to provide an otherwise redundant implementation of "sticky"
movable blocks that would likely degrade the performance of page
allocations from zones other than ZONE_MOVABLE, even when no "sticky"
movable blocks exist in the system.
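
For reference, the structure such a redundant implementation would have
had to grow is the per-order free_area (paraphrased from
include/linux/mmzone.h); one more migratetype means one more list to
maintain and search in every free_area of every zone, whether or not any
"sticky" movable blocks exist:

  struct free_area {
          struct list_head        free_list[MIGRATE_TYPES];
          unsigned long           nr_free;
  };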

>
>> ZONE_MOVABLE is being used to achieve that is an implementation detail
>> for this commit set. In the same way that memory hotplug is the concept
>> of adding System RAM during run time, but placing it in ZONE_MOVABLE is
>> an implementation detail to make it easier to unplug.
>
> Right, but there we don't play any tricks: it's just ZONE_MOVABLE
> without any other metadata pointing out ownership. Maybe that's what you
> are trying to describe here: A DMB inside ZONE_MOVABLE implies that
> there is another owner and that even memory offlining should fail.
Now why didn't I just say that in the first place :). The general
objective of reuse is inspired by CMA which has implied/explicit
ownership and as noted above DMB needs ownership to meet this objective
as well.

Thanks for your patience and helping me attempt to communicate this more
clearly.
-Doug