2022-09-28 23:01:37

by Doug Berger

Subject: [PATCH v2 0/9] mm: introduce Designated Movable Blocks

MOTIVATION:
Some Broadcom devices (e.g. 7445, 7278) contain multiple memory
controllers with each mapped in a different address range within
a Uniform Memory Architecture. Some users of these systems have
expressed the desire to locate ZONE_MOVABLE memory on each
memory controller to allow user space intensive processing to
make better use of the additional memory bandwidth.
Unfortunately, the historical monotonic layout of zones would
mean that if the lowest addressed memory controller contains
ZONE_MOVABLE memory then all of the memory available from
memory controllers at higher addresses must also be in the
ZONE_MOVABLE zone. This would force all kernel memory accesses
onto the lowest addressed memory controller and significantly
reduce the amount of memory available for non-movable
allocations.

The main objective of this patch set is therefore to allow a
block of memory to be designated as part of the ZONE_MOVABLE
zone where the kernel page allocator will only ever use it to
satisfy requests for movable pages. The term Designated Movable
Block is introduced here to represent such a block. The favored
implementation extends the 'movablecore' kernel parameter to
allow specification of a base address and to support multiple
blocks. The existing 'movablecore' mechanisms are retained.
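
For example (illustrative values): 'movablecore=256M@0x70000000'
designates a single 256MB block with base address 0x70000000,
while 'movablecore=256M@0x70000000,512M' requests 512MB of
movablecore in total, 256MB satisfied by the Designated Movable
Block and the remainder by the classic mechanism.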

BACKGROUND:
NUMA architectures support distributing movablecore memory
across each node, but it is undesirable to introduce the
overhead and complexities of NUMA on systems that don't have a
Non-Uniform Memory Architecture.

Commit 342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror option")
also depends on zone overlap to support systems with multiple
mirrored ranges.

Commit c6f03e2903c9 ("mm, memory_hotplug: remove zone restrictions")
embraced overlapped zones for memory hotplug.

This patch set follows their lead to allow the ZONE_MOVABLE
zone to overlap other zones while spanning the pages from the
lowest Designated Movable Block to the end of the node.
Designated Movable Blocks are made absent from overlapping zones
and present within the ZONE_MOVABLE zone.
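
For example, on a system with 1GB at 0x40000000-0x7fffffff and
1GB at 0x300000000-0x33fffffff, 'movablecore=256M@0x70000000'
would produce roughly the following layout (a sketch assuming an
arm64 kernel where the low range falls in ZONE_DMA):

ZONE_DMA      0x40000000-0x7fffffff (DMB pages absent)
ZONE_NORMAL   0x300000000-0x33fffffff
ZONE_MOVABLE  0x70000000-0x33fffffff (only DMB pages present)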

I initially investigated an implementation using a Designated
Movable migrate type in line with comments[1] made by Mel Gorman
regarding a "sticky" MIGRATE_MOVABLE type to avoid using
ZONE_MOVABLE. However, this approach was riskier since it was
much more intrusive on the allocation paths. Ultimately, the
progress made by the memory hotplug folks to expand the
ZONE_MOVABLE functionality convinced me to follow this approach.

OTHER OPPORTUNITIES:
CMA introduced a paradigm where multiple allocators could
operate on the same region of memory, and that paradigm can be
extended to Designated Movable Blocks as well. I was interested
in using kernel resource management as a mechanism for exposing
Designated Movable Block resources (e.g. /proc/iomem) that would
be used by the kernel page allocator like any other ZONE_MOVABLE
memory, but could be claimed by an alternative allocator (e.g.
CMA). Unfortunately, this becomes complicated because the kernel
resource implementation varies materially across different
architectures, and since I do not require this capability I have
deferred that work.

The Devicetree Specification includes support for specifying
reserved memory regions with a 'reusable' property to allow
sharing of the reserved memory between device drivers and the
OS. This is in line with the paradigm introduced by CMA, but is
currently only used by 'shared-dma-pool' compatible reserved
memory regions. Linux could choose to use Designated Movable
Blocks as the default mechanism for other 'reusable' reserved
memory. Device drivers that own 'reusable' reserved memory could
use the dmb_intersects() function introduced here to determine
whether memory requires reclamation from the OS before use and
could use the alloc/free_contig_range() functions to perform the
reclamation and release of memory needed by the device. The CMA
allocator API could be another candidate for device driver
reclamation, but it is not currently exposed for use by device
drivers in modules.
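
As a concrete (hypothetical) sketch, a driver might reclaim its
'reusable' reserved region like this, assuming dmb_intersects()
takes a pfn range and returns nonzero when the range overlaps a
DMB; claim/release_reserved_region() are illustrative names:

#include <linux/dmb.h>
#include <linux/gfp.h>
#include <linux/pfn.h>

/* Reclaim a 'reusable' reserved region from the OS before the
 * device uses it.
 */
static int claim_reserved_region(phys_addr_t base, phys_addr_t size)
{
	unsigned long spfn = PHYS_PFN(base);
	unsigned long epfn = PHYS_PFN(base + size);

	/* Memory outside any DMB was never shared with the OS. */
	if (!dmb_intersects(spfn, epfn))
		return 0;

	/* Migrate movable allocations out of the range. */
	return alloc_contig_range(spfn, epfn, MIGRATE_MOVABLE, GFP_KERNEL);
}

/* Release the pages back to the OS when the device is done. */
static void release_reserved_region(phys_addr_t base, phys_addr_t size)
{
	free_contig_range(PHYS_PFN(base), size >> PAGE_SHIFT);
}

Note that alloc_contig_range() can fail transiently (e.g. -EBUSY),
so a real driver would need a retry or fallback policy.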

There have been attempts to modify how the kernel page
allocator uses CMA regions (e.g. [1] & [2]). This
implementation of Designated Movable Blocks creates an
opportunity to allow the CMA allocator to operate on
ZONE_MOVABLE memory that the kernel page allocator can use more
aggressively, without affecting users of the existing CMA
implementation. This would be beneficial when memory reuse is
more valuable than the cost of increased latency of CMA
allocations (e.g. hugetlb_cma).

These other opportunities are dependent on the Designated
Movable Block concept introduced here, so I will hold off
submitting any such follow-on proposals until there is movement
on this patch set.

NOTES:
The MEMBLOCK_MOVABLE and MEMBLOCK_HOTPLUG flags have a lot in
common and could potentially be consolidated, but I chose to
avoid that here to reduce controversy.

The CMA and DMB alignment constraints are currently the same so
the logic could be simplified, but this implementation keeps
them distinct to facilitate independent evolution of the
implementations if necessary.

Changes in v2:
- first three commits upstreamed separately [3], [4], and [5].
- commits 04-06 submitted separately [6].
- Corrected errors "Reported-by: kernel test robot <[email protected]>"
- Deferred commits after 15 to simplify review of the base
functionality.
- minor reorganization of commit 13.

v1: https://lore.kernel.org/linux-mm/[email protected]/

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]
[3] https://lore.kernel.org/linux-mm/[email protected]
[4] https://lore.kernel.org/linux-mm/[email protected]
[5] https://lore.kernel.org/linux-mm/[email protected]
[6] https://lore.kernel.org/linux-mm/[email protected]/

Doug Berger (9):
lib/show_mem.c: display MovableOnly
mm/vmstat: show start_pfn when zone spans pages
mm/page_alloc: calculate node_spanned_pages from pfns
mm/page_alloc.c: allow oversized movablecore
mm/page_alloc: introduce init_reserved_pageblock()
memblock: introduce MEMBLOCK_MOVABLE flag
mm/dmb: Introduce Designated Movable Blocks
mm/page_alloc: make alloc_contig_pages DMB aware
mm/page_alloc: allow base for movablecore

.../admin-guide/kernel-parameters.txt | 14 +-
include/linux/dmb.h | 29 ++++
include/linux/gfp.h | 5 +-
include/linux/memblock.h | 8 +
lib/show_mem.c | 2 +-
mm/Kconfig | 12 ++
mm/Makefile | 1 +
mm/cma.c | 15 +-
mm/dmb.c | 91 ++++++++++
mm/memblock.c | 30 +++-
mm/page_alloc.c | 155 ++++++++++++++----
mm/vmstat.c | 5 +
12 files changed, 321 insertions(+), 46 deletions(-)
create mode 100644 include/linux/dmb.h
create mode 100644 mm/dmb.c

--
2.25.1


2022-09-28 23:03:24

by Doug Berger

Subject: [PATCH v2 2/9] mm/vmstat: show start_pfn when zone spans pages

A zone that overlaps with another zone may span a range of pages
that are not present. In this case, displaying the start_pfn of
the zone allows the zone page range to be identified.

Signed-off-by: Doug Berger <[email protected]>
---
mm/vmstat.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 90af9a8572f5..e2f19f2b7615 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1717,6 +1717,11 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,

 	/* If unpopulated, no other information is useful */
 	if (!populated_zone(zone)) {
+		/* Show start_pfn for empty overlapped zones */
+		if (zone->spanned_pages)
+			seq_printf(m,
+				   "\n  start_pfn:           %lu",
+				   zone->zone_start_pfn);
 		seq_putc(m, '\n');
 		return;
 	}
--
2.25.1

2022-09-28 23:20:43

by Doug Berger

Subject: [PATCH v2 6/9] memblock: introduce MEMBLOCK_MOVABLE flag

The MEMBLOCK_MOVABLE flag is introduced to designate a memblock
as only supporting movable allocations by the page allocator.

Signed-off-by: Doug Berger <[email protected]>
---
include/linux/memblock.h | 8 ++++++++
mm/memblock.c | 24 ++++++++++++++++++++++++
2 files changed, 32 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 50ad19662a32..8eb3ca32dfa7 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -47,6 +47,7 @@ enum memblock_flags {
 	MEMBLOCK_MIRROR = 0x2,		/* mirrored region */
 	MEMBLOCK_NOMAP = 0x4,		/* don't add to kernel direct mapping */
 	MEMBLOCK_DRIVER_MANAGED = 0x8,	/* always detected via a driver */
+	MEMBLOCK_MOVABLE = 0x10,	/* designated movable block */
 };
 
 /**
@@ -125,6 +126,8 @@ int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
 int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
 int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
+int memblock_mark_movable(phys_addr_t base, phys_addr_t size);
+int memblock_clear_movable(phys_addr_t base, phys_addr_t size);
 
 void memblock_free_all(void);
 void memblock_free(void *ptr, size_t size);
@@ -265,6 +268,11 @@ static inline bool memblock_is_driver_managed(struct memblock_region *m)
 	return m->flags & MEMBLOCK_DRIVER_MANAGED;
 }
 
+static inline bool memblock_is_movable(struct memblock_region *m)
+{
+	return m->flags & MEMBLOCK_MOVABLE;
+}
+
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
 			    unsigned long *end_pfn);
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
diff --git a/mm/memblock.c b/mm/memblock.c
index b5d3026979fc..5d6a210d98ec 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -979,6 +979,30 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
 	return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP);
 }
 
+/**
+ * memblock_mark_movable - Mark designated movable block with MEMBLOCK_MOVABLE.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_mark_movable(phys_addr_t base, phys_addr_t size)
+{
+	return memblock_setclr_flag(base, size, 1, MEMBLOCK_MOVABLE);
+}
+
+/**
+ * memblock_clear_movable - Clear flag MEMBLOCK_MOVABLE for a specified region.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_clear_movable(phys_addr_t base, phys_addr_t size)
+{
+	return memblock_setclr_flag(base, size, 0, MEMBLOCK_MOVABLE);
+}
+
 static bool should_skip_region(struct memblock_type *type,
 			       struct memblock_region *m,
 			       int nid, int flags)
--
2.25.1

2022-09-29 09:13:58

by David Hildenbrand

Subject: Re: [PATCH v2 2/9] mm/vmstat: show start_pfn when zone spans pages

On 29.09.22 00:32, Doug Berger wrote:
> A zone that overlaps with another zone may span a range of pages
> that are not present. In this case, displaying the start_pfn of
> the zone allows the zone page range to be identified.
>

I don't understand the intention here.

"/* If unpopulated, no other information is useful */"

Why would the start pfn be of any use here?

What is the user visible impact without that change?

> Signed-off-by: Doug Berger <[email protected]>
> ---
> mm/vmstat.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 90af9a8572f5..e2f19f2b7615 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1717,6 +1717,11 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>
>  	/* If unpopulated, no other information is useful */
>  	if (!populated_zone(zone)) {
> +		/* Show start_pfn for empty overlapped zones */
> +		if (zone->spanned_pages)
> +			seq_printf(m,
> +				   "\n  start_pfn:           %lu",
> +				   zone->zone_start_pfn);
>  		seq_putc(m, '\n');
>  		return;
>  	}
--
Thanks,

David / dhildenb

2022-10-01 02:49:47

by Doug Berger

Subject: Re: [PATCH v2 2/9] mm/vmstat: show start_pfn when zone spans pages

On 9/29/2022 1:15 AM, David Hildenbrand wrote:
> On 29.09.22 00:32, Doug Berger wrote:
>> A zone that overlaps with another zone may span a range of pages
>> that are not present. In this case, displaying the start_pfn of
>> the zone allows the zone page range to be identified.
>>
>
> I don't understand the intention here.
>
> "/* If unpopulated, no other information is useful */"
>
> Why would the start pfn be of any use here?
>
> What is the user visible impact without that change?
Yes, this is very subtle. I only caught it while testing some
pathological cases.

If you take the example system:
The 7278 device has four ARMv8 CPU cores in an SMP cluster and two
memory controllers (MEMCs). Each MEMC is capable of controlling up to
8GB of DRAM. An example 7278 system might have 1GB on each controller,
so an arm64 kernel might see 1GB on MEMC0 at 0x40000000-0x7FFFFFFF and
1GB on MEMC1 at 0x300000000-0x33FFFFFFF.

Placing a DMB on MEMC0 with 'movablecore=256M@0x70000000' will lead to
the ZONE_MOVABLE zone spanning from 0x70000000-0x33fffffff and the
ZONE_NORMAL zone spanning from 0x300000000-0x33fffffff.

If instead you specified 'movablecore=256M@0x70000000,512M' you would
get the same ZONE_MOVABLE span, but the ZONE_NORMAL would now span
0x300000000-0x32fffffff. The requested 512M of movablecore would be
divided into a 256MB DMB at 0x70000000 and 256MB of "classic"
movablecore, whose movable zone start would be displayed in the
bootlog as:
[ 0.000000] Movable zone start for each node
[ 0.000000] Node 0: 0x000000330000000

Finally, if you specified the pathological
'movablecore=256M@0x70000000,1G@12G' you would still have the same
ZONE_MOVABLE span, and the ZONE_NORMAL span would go back to
0x300000000-0x33fffffff. However, because the second DMB (1G@12G)
completely overlaps the ZONE_NORMAL there would be no pages present in
ZONE_NORMAL and /proc/zoneinfo would report ZONE_NORMAL 'spanned
262144', but not where those pages are. This commit adds the 'start_pfn'
back to the /proc/zoneinfo for ZONE_NORMAL so the span has context.
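
With this patch, the unpopulated zone's /proc/zoneinfo entry
would read roughly as follows (illustrative excerpt; pfn 3145728
corresponds to 0x300000000 with 4KB pages):

Node 0, zone   Normal
  pages free     0
        ...
        spanned  262144
        present  0
        managed  0
  start_pfn:           3145728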

Regards,
Doug

>
>> Signed-off-by: Doug Berger <[email protected]>
>> ---
>>   mm/vmstat.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 90af9a8572f5..e2f19f2b7615 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1717,6 +1717,11 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>>       /* If unpopulated, no other information is useful */
>>       if (!populated_zone(zone)) {
>> +        /* Show start_pfn for empty overlapped zones */
>> +        if (zone->spanned_pages)
>> +            seq_printf(m,
>> +                   "\n  start_pfn:           %lu",
>> +                   zone->zone_start_pfn);
>>           seq_putc(m, '\n');
>>           return;
>>       }

2022-10-05 18:19:05

by David Hildenbrand

Subject: Re: [PATCH v2 2/9] mm/vmstat: show start_pfn when zone spans pages

On 01.10.22 03:28, Doug Berger wrote:
> On 9/29/2022 1:15 AM, David Hildenbrand wrote:
>> On 29.09.22 00:32, Doug Berger wrote:
>>> A zone that overlaps with another zone may span a range of pages
>>> that are not present. In this case, displaying the start_pfn of
>>> the zone allows the zone page range to be identified.
>>>
>>
>> I don't understand the intention here.
>>
>> "/* If unpopulated, no other information is useful */"
>>
>> Why would the start pfn be of any use here?
>>
>> What is the user visible impact without that change?
> Yes, this is very subtle. I only caught it while testing some
> pathological cases.
>
> If you take the example system:
> The 7278 device has four ARMv8 CPU cores in an SMP cluster and two
> memory controllers (MEMCs). Each MEMC is capable of controlling up to
> 8GB of DRAM. An example 7278 system might have 1GB on each controller,
> so an arm64 kernel might see 1GB on MEMC0 at 0x40000000-0x7FFFFFFF and
> 1GB on MEMC1 at 0x300000000-0x33FFFFFFF.
>

Okay, thanks. You should make it clearer in the patch description --
especially how this relates to DMB. Having that said, I still have to
digest your examples:

> Placing a DMB on MEMC0 with 'movablecore=256M@0x70000000' will lead to
> the ZONE_MOVABLE zone spanning from 0x70000000-0x33fffffff and the
> ZONE_NORMAL zone spanning from 0x300000000-0x33fffffff.

Why is ZONE_MOVABLE spanning more than 256M? It should span

0x70000000-0x80000000

Or what am I missing?

>
> If instead you specified 'movablecore=256M@0x70000000,512M' you would
> get the same ZONE_MOVABLE span, but the ZONE_NORMAL would now span
> 0x300000000-0x32fffffff. The requested 512M of movablecore would be
> divided into a 256MB DMB at 0x70000000 and 256MB of "classic"
> movablecore, whose movable zone start would be displayed in the
> bootlog as:
> [ 0.000000] Movable zone start for each node
> [ 0.000000] Node 0: 0x000000330000000


Okay, so that's the movable zone range excluding DMB.

>
> Finally, if you specified the pathological
> 'movablecore=256M@0x70000000,1G@12G' you would still have the same
> ZONE_MOVABLE span, and the ZONE_NORMAL span would go back to
> 0x300000000-0x33fffffff. However, because the second DMB (1G@12G)
> completely overlaps the ZONE_NORMAL there would be no pages present in
> ZONE_NORMAL and /proc/zoneinfo would report ZONE_NORMAL 'spanned
> 262144', but not where those pages are. This commit adds the 'start_pfn'
> back to the /proc/zoneinfo for ZONE_NORMAL so the span has context.

... but why? If there are no pages present, there is no ZONE_NORMAL we
care about. The zone span should be 0. Does this maybe rather indicate
that there is a zone span processing issue in your DMB implementation?

Special-casing zones based on DMBs feels wrong. But most probably I am
missing something important :)

--
Thanks,

David / dhildenb

2022-10-13 00:07:45

by Doug Berger

Subject: Re: [PATCH v2 2/9] mm/vmstat: show start_pfn when zone spans pages

On 10/5/2022 11:09 AM, David Hildenbrand wrote:
> On 01.10.22 03:28, Doug Berger wrote:
>> On 9/29/2022 1:15 AM, David Hildenbrand wrote:
>>> On 29.09.22 00:32, Doug Berger wrote:
>>>> A zone that overlaps with another zone may span a range of pages
>>>> that are not present. In this case, displaying the start_pfn of
>>>> the zone allows the zone page range to be identified.
>>>>
>>>
>>> I don't understand the intention here.
>>>
>>> "/* If unpopulated, no other information is useful */"
>>>
>>> Why would the start pfn be of any use here?
>>>
>>> What is the user visible impact without that change?
>> Yes, this is very subtle. I only caught it while testing some
>> pathological cases.
>>
>> If you take the example system:
>> The 7278 device has four ARMv8 CPU cores in an SMP cluster and two
>> memory controllers (MEMCs). Each MEMC is capable of controlling up to
>> 8GB of DRAM. An example 7278 system might have 1GB on each controller,
>> so an arm64 kernel might see 1GB on MEMC0 at 0x40000000-0x7FFFFFFF and
>> 1GB on MEMC1 at 0x300000000-0x33FFFFFFF.
>>
>
> Okay, thanks. You should make it clearer in the patch description --
> especially how this relates to DMB. Having that said, I still have to
> digest your examples:
>
>> Placing a DMB on MEMC0 with 'movablecore=256M@0x70000000' will lead to
>> the ZONE_MOVABLE zone spanning from 0x70000000-0x33fffffff and the
>> ZONE_NORMAL zone spanning from 0x300000000-0x33fffffff.
>
> Why is ZONE_MOVABLE spanning more than 256M? It should span
>
> 0x70000000-0x80000000
>
> Or what am I missing?
I was working from the notion that the classic 'movablecore'
implementation keeps the ZONE_MOVABLE zone the last zone on System RAM
so it always spans the last page on the node (i.e. 0x33ffff000). My
implementation moves the start of ZONE_MOVABLE up to the lowest page of
any defined DMBs on the node.

I see that memory hotplug does not behave this way, which is probably
more intuitive (though less consistent with the classic zone layout). I
could attempt to change this in a v3 if desired.

>
>>
>> If instead you specified 'movablecore=256M@0x70000000,512M' you would
>> get the same ZONE_MOVABLE span, but the ZONE_NORMAL would now span
>> 0x300000000-0x32fffffff. The requested 512M of movablecore would be
>> divided into a 256MB DMB at 0x70000000 and 256MB of "classic"
>> movablecore, whose movable zone start would be displayed in the
>> bootlog as:
>> [    0.000000] Movable zone start for each node
>> [    0.000000]   Node 0: 0x000000330000000
>
>
> Okay, so that's the movable zone range excluding DMB.
>
>>
>> Finally, if you specified the pathological
>> 'movablecore=256M@0x70000000,1G@12G' you would still have the same
>> ZONE_MOVABLE span, and the ZONE_NORMAL span would go back to
>> 0x300000000-0x33fffffff. However, because the second DMB (1G@12G)
>> completely overlaps the ZONE_NORMAL there would be no pages present in
>> ZONE_NORMAL and /proc/zoneinfo would report ZONE_NORMAL 'spanned
>> 262144', but not where those pages are. This commit adds the 'start_pfn'
>> back to the /proc/zoneinfo for ZONE_NORMAL so the span has context.
>
> ... but why? If there are no pages present, there is no ZONE_NORMAL we
> care about. The zone span should be 0. Does this maybe rather indicate
> that there is a zone span processing issue in your DMB implementation?
My implementation uses the zones created by the classic 'movablecore'
behavior and relocates the pages within DMBs to ZONE_MOVABLE. In this
case the ZONE_NORMAL still has a span, which gets output, but it has
no present pages, so without this patch the output didn't show where
the zone was. This is a convenience to avoid adding zone resizing and
destruction logic outside of memory hotplug support, but I could
attempt to add that code in a v3 if desired.

>
> Special-casing zones based on DMBs feels wrong. But most probably I am
> missing something important :)
>

Thanks for making me aware of your confusion so I can attempt to make it
clearer.
-Doug

2022-10-13 12:03:44

by Michal Hocko

Subject: Re: [PATCH v2 2/9] mm/vmstat: show start_pfn when zone spans pages

On Wed 12-10-22 16:57:53, Doug Berger wrote:
[...]
> I was working from the notion that the classic 'movablecore' implementation
> keeps the ZONE_MOVABLE zone the last zone on System RAM so it always spans
> the last page on the node (i.e. 0x33ffff000). My implementation moves the
> start of ZONE_MOVABLE up to the lowest page of any defined DMBs on the node.

I wouldn't rely on the movablecore-specific implementation. ZONE_MOVABLE can
span any physical address range. ZONE_NORMAL usually covers any ranges
not covered by more specific zones like ZONE_DMA{32}. At least on most
architectures I am familiar with.
--
Michal Hocko
SUSE Labs