by Matthew Wilcox

Subject: Re: [PATCH 03/21] mm/hugetlb: correct demote page offset logic

On Tue, Sep 13, 2022 at 12:54:50PM -0700, Doug Berger wrote:
> With gigantic pages it may not be true that struct page structures
> are contiguous across the entire gigantic page. The mem_map_offset
> function is used here in place of direct pointer arithmetic to
> correct for this.

We're just eliminating mem_map_offset(). Please use nth_page()
instead.

> for (i = 0; i < pages_per_huge_page(h);
> i += pages_per_huge_page(target_hstate)) {
> + subpage = mem_map_offset(page, i);
> if (hstate_is_gigantic(target_hstate))
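
For readers unfamiliar with the distinction, here is a minimal user-space sketch of why plain pointer arithmetic on struct page can go wrong and what a pfn-based lookup does instead. It assumes a toy sparse layout in which each section has its own, separately allocated page array. All toy_* names are hypothetical and only mimic the idea behind nth_page(), which (as far as I know) goes through the pfn when CONFIG_SPARSEMEM is enabled without CONFIG_SPARSEMEM_VMEMMAP.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGES_PER_SECTION 4UL
#define TOY_NR_SECTIONS 2UL

/* Toy stand-in for struct page; storing the pfn is a shortcut for the sketch. */
struct toy_page {
	unsigned long pfn;
};

/* Each section has its own array, so the arrays need not be adjacent. */
static struct toy_page toy_section0[TOY_PAGES_PER_SECTION];
static struct toy_page toy_section1[TOY_PAGES_PER_SECTION];
static struct toy_page *toy_mem_map[TOY_NR_SECTIONS] = {
	toy_section0, toy_section1
};

/* Look the page up through its section instead of assuming one flat array. */
static struct toy_page *toy_pfn_to_page(unsigned long pfn)
{
	struct toy_page *page;

	assert(pfn < TOY_PAGES_PER_SECTION * TOY_NR_SECTIONS);
	page = &toy_mem_map[pfn / TOY_PAGES_PER_SECTION][pfn % TOY_PAGES_PER_SECTION];
	page->pfn = pfn;	/* toy shortcut: record the pfn on first lookup */
	return page;
}

static unsigned long toy_page_to_pfn(const struct toy_page *page)
{
	return page->pfn;
}

/*
 * pfn-based equivalent of "page + n". Computing "page + n" directly would
 * run off the end of one section's array when the range crosses a section
 * boundary, so it is deliberately never done here.
 */
static struct toy_page *toy_nth_page(struct toy_page *page, unsigned long n)
{
	return toy_pfn_to_page(toy_page_to_pfn(page) + n);
}
```

Starting from the page for pfn 2, toy_nth_page(page, 2) lands on the first entry of the second section's array, which is the case where "page + 2" would not be a valid pointer.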

2022-09-14 00:10:24

On 13 Sep 2022, at 20:59, Doug Berger wrote:

> On 9/13/2022 5:02 PM, Zi Yan wrote:
>> On 13 Sep 2022, at 15:54, Doug Berger wrote:
>>
>>> The function set_migratetype_isolate() has special handling for
>>> pageblocks of MIGRATE_CMA type that protects them from being
>>> isolated for MIGRATE_MOVABLE requests.
>>>
>>> Since isolate_single_pageblock() doesn't receive the migratetype
>>> argument of start_isolate_page_range() it used the migratetype
>>> of the pageblock instead of the requested migratetype which
>>> defeats this MIGRATE_CMA check.
>>>
>>> This allows an attempt to create a gigantic page within a CMA
>>> region to change the migratetype of the first and last pageblocks
>>> from MIGRATE_CMA to MIGRATE_MOVABLE when they are restored after
>>> failure, which corrupts the CMA region.
>>>
>>> The calls to (un)set_migratetype_isolate() for the first and last
>>> pageblocks of the start_isolate_page_range() are moved back into
>>> that function to allow access to its migratetype argument and make
>>> it easier to see how all of the pageblocks in the range are
>>> isolated.
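
The failure mode described in the commit message can be modelled in user space. The sketch below is a deliberately simplified model, not the kernel code: pageblocks are a small array, the MIGRATE_CMA protection is reduced to a single check, and every toy_* name is hypothetical. The edges_use_block_mt flag selects between passing the pageblock's own migratetype for the first and last blocks (the behaviour being fixed) and passing the requested migratetype.

```c
#include <assert.h>
#include <errno.h>

enum toy_mt { TOY_MOVABLE, TOY_CMA, TOY_ISOLATE };

#define TOY_NR_BLOCKS 4
static enum toy_mt toy_blocks[TOY_NR_BLOCKS];

static void toy_fill(enum toy_mt mt)
{
	int i;

	for (i = 0; i < TOY_NR_BLOCKS; i++)
		toy_blocks[i] = mt;
}

/* Refuse to isolate a CMA block unless the request itself is for CMA. */
static int toy_set_isolate(int block, enum toy_mt requested)
{
	assert(block >= 0 && block < TOY_NR_BLOCKS);
	if (toy_blocks[block] == TOY_CMA && requested != TOY_CMA)
		return -EBUSY;
	toy_blocks[block] = TOY_ISOLATE;
	return 0;
}

/* Only blocks that are currently isolated get their type rewritten. */
static void toy_unset_isolate(int block, enum toy_mt restore_to)
{
	if (toy_blocks[block] == TOY_ISOLATE)
		toy_blocks[block] = restore_to;
}

static int toy_start_isolate_range(int first, int last, enum toy_mt requested,
				   int edges_use_block_mt)
{
	int i, j;

	/* First and last blocks are handled before the interior blocks. */
	if (toy_set_isolate(first, edges_use_block_mt ? toy_blocks[first] : requested))
		return -EBUSY;
	if (last != first &&
	    toy_set_isolate(last, edges_use_block_mt ? toy_blocks[last] : requested)) {
		toy_unset_isolate(first, requested);
		return -EBUSY;
	}
	for (i = first + 1; i < last; i++) {
		if (toy_set_isolate(i, requested)) {
			/* Recovery restores isolated blocks to the requested type. */
			for (j = first; j <= last; j++)
				toy_unset_isolate(j, requested);
			return -EBUSY;
		}
	}
	return 0;
}
```

With all four blocks marked TOY_CMA and a TOY_MOVABLE request, the edges_use_block_mt variant isolates both edge blocks, fails on the first interior block, and then restores the edges as TOY_MOVABLE. With the requested type passed throughout, the first block is rejected immediately and nothing changes.
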
>>>
>>> Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity")
>>> Signed-off-by: Doug Berger <[email protected]>
>>> ---
>>> mm/page_isolation.c | 75 +++++++++++++++++++++------------------------
>>> 1 file changed, 35 insertions(+), 40 deletions(-)
>>
>> Thanks for the fix.
> Thanks for the review.
>
>>
>> Why not just pass migratetype into isolate_single_pageblock() and use
>> it when set_migratetype_isolate() is used? That would require far
>> fewer changes. What is the reason for pulling the skip isolation logic out?
> I found the skip_isolation logic confusing and thought that setting and restoring the migratetype within the same function and consolidating the error recovery paths also within that function was easier to understand and less prone to accidental breakage.
>
> In particular, setting MIGRATE_ISOLATE in isolate_single_pageblock() and having to remember to unset it in start_isolate_page_range() differently on different error paths was troublesome for me.

Wouldn't this work as well?

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index c1307d1bea81..a312cabd0d95 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -288,6 +288,7 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
* @isolate_before: isolate the pageblock before the boundary_pfn
* @skip_isolation: the flag to skip the pageblock isolation in second
* isolate_single_pageblock()
+ * @migratetype: Migrate type to set in error recovery.
*
* Free and in-use pages can be as big as MAX_ORDER and contain more than one
* pageblock. When not all pageblocks within a page are isolated at the same
@@ -302,9 +303,9 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
* the in-use page then splitting the free page.
*/
static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
- gfp_t gfp_flags, bool isolate_before, bool skip_isolation)
+ gfp_t gfp_flags, bool isolate_before, bool skip_isolation,
+ int migratetype)
{
- unsigned char saved_mt;
unsigned long start_pfn;
unsigned long isolate_pageblock;
unsigned long pfn;
@@ -328,12 +329,10 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
start_pfn = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
zone->zone_start_pfn);

- saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
-
if (skip_isolation)
- VM_BUG_ON(!is_migrate_isolate(saved_mt));
+ VM_BUG_ON(!is_migrate_isolate(get_pageblock_migratetype(pfn_to_page(isolate_pageblock))));
else {
- ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
+ ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), migratetype, flags,
isolate_pageblock, isolate_pageblock + pageblock_nr_pages);

if (ret)
@@ -475,7 +474,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
failed:
/* restore the original migratetype */
if (!skip_isolation)
- unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
+ unset_migratetype_isolate(pfn_to_page(isolate_pageblock), migratetype);
return -EBUSY;
}

@@ -537,7 +536,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
bool skip_isolation = false;

/* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
- ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false, skip_isolation);
+ ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false,
+ skip_isolation, migratetype);
if (ret)
return ret;

@@ -545,7 +545,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
skip_isolation = true;

/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
- ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true, skip_isolation);
+ ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
+ skip_isolation, migratetype);
if (ret) {
unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
return ret;

>
> It could certainly be done differently, but this was my preference.

A smaller patch can make review easier, right?

>>
>> Ultimately, I would like to make MIGRATE_ISOLATE a separate bit,
>> so that migratetype will not be overwritten during page isolation.
>> Then, set_migratetype_isolate() and start_isolate_page_range()
>> will not have migratetype to set in error recovery any more.
>> That is on my TODO.
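
A rough user-space sketch of the "separate bit" idea, under the assumption that a pageblock's state fits in one word with the migratetype in the low bits. The bit layout and all toy_* names are invented for illustration and do not reflect the kernel's actual pageblock flags.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_migratetype { TOY_MT_UNMOVABLE, TOY_MT_MOVABLE, TOY_MT_CMA };

#define TOY_MT_MASK     0x7u	/* low bits hold the migratetype */
#define TOY_ISOLATE_BIT 0x8u	/* hypothetical separate isolation flag */

static unsigned int toy_pb_make(enum toy_migratetype mt)
{
	assert(((unsigned int)mt & ~TOY_MT_MASK) == 0);
	return (unsigned int)mt;
}

/* Isolation only toggles its own bit; the migratetype bits are untouched. */
static unsigned int toy_pb_set_isolate(unsigned int flags)
{
	return flags | TOY_ISOLATE_BIT;
}

static unsigned int toy_pb_clear_isolate(unsigned int flags)
{
	return flags & ~TOY_ISOLATE_BIT;
}

static bool toy_pb_is_isolated(unsigned int flags)
{
	return (flags & TOY_ISOLATE_BIT) != 0;
}

static enum toy_migratetype toy_pb_migratetype(unsigned int flags)
{
	return (enum toy_migratetype)(flags & TOY_MT_MASK);
}
```

Because clearing the flag leaves the original type in place, the undo path in this model needs no migratetype argument, which is the property the paragraph above is after.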
>>
>>>
>>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>>> index 9d73dc38e3d7..8e16aa22cb61 100644
>>> --- a/mm/page_isolation.c
>>> +++ b/mm/page_isolation.c
>>> @@ -286,8 +286,6 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>> * @flags: isolation flags
>>> * @gfp_flags: GFP flags used for migrating pages
>>> * @isolate_before: isolate the pageblock before the boundary_pfn
>>> - * @skip_isolation: the flag to skip the pageblock isolation in second
>>> - * isolate_single_pageblock()
>>> *
>>> * Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one
>>> * pageblock. When not all pageblocks within a page are isolated at the same
>>> @@ -302,9 +300,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>> * the in-use page then splitting the free page.
>>> */
>>> static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>> - gfp_t gfp_flags, bool isolate_before, bool skip_isolation)
>>> + gfp_t gfp_flags, bool isolate_before)
>>> {
>>> - unsigned char saved_mt;
>>> unsigned long start_pfn;
>>> unsigned long isolate_pageblock;
>>> unsigned long pfn;
>>> @@ -328,18 +325,6 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>> start_pfn = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
>>> zone->zone_start_pfn);
>>>
>>> - saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
>>> -
>>> - if (skip_isolation)
>>> - VM_BUG_ON(!is_migrate_isolate(saved_mt));
>>> - else {
>>> - ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
>>> - isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
>>> -
>>> - if (ret)
>>> - return ret;
>>> - }
>>> -
>>> /*
>>> * Bail out early when the to-be-isolated pageblock does not form
>>> * a free or in-use page across boundary_pfn:
>>> @@ -428,7 +413,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>> ret = set_migratetype_isolate(page, page_mt,
>>> flags, head_pfn, head_pfn + nr_pages);
>>> if (ret)
>>> - goto failed;
>>> + return ret;
>>> }
>>>
>>> ret = __alloc_contig_migrate_range(&cc, head_pfn,
>>> @@ -443,7 +428,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>> unset_migratetype_isolate(page, page_mt);
>>>
>>> if (ret)
>>> - goto failed;
>>> + return -EBUSY;
>>> /*
>>> * reset pfn to the head of the free page, so
>>> * that the free page handling code above can split
>>> @@ -459,24 +444,19 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>> while (!PageBuddy(pfn_to_page(outer_pfn))) {
>>> /* stop if we cannot find the free page */
>>> if (++order >= MAX_ORDER)
>>> - goto failed;
>>> + return -EBUSY;
>>> outer_pfn &= ~0UL << order;
>>> }
>>> pfn = outer_pfn;
>>> continue;
>>> } else
>>> #endif
>>> - goto failed;
>>> + return -EBUSY;
>>> }
>>>
>>> pfn++;
>>> }
>>> return 0;
>>> -failed:
>>> - /* restore the original migratetype */
>>> - if (!skip_isolation)
>>> - unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
>>> - return -EBUSY;
>>> }
>>>
>>> /**
>>> @@ -534,21 +514,30 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>> unsigned long isolate_start = ALIGN_DOWN(start_pfn, pageblock_nr_pages);
>>> unsigned long isolate_end = ALIGN(end_pfn, pageblock_nr_pages);
>>> int ret;
>>> - bool skip_isolation = false;
>>>
>>> /* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
>>> - ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false, skip_isolation);
>>> + ret = set_migratetype_isolate(pfn_to_page(isolate_start), migratetype,
>>> + flags, isolate_start, isolate_start + pageblock_nr_pages);
>>> if (ret)
>>> return ret;
>>> -
>>> - if (isolate_start == isolate_end - pageblock_nr_pages)
>>> - skip_isolation = true;
>>> + ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false);
>>> + if (ret)
>>> + goto unset_start_block;
>>>
>>> /* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
>>> - ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true, skip_isolation);
>>> + pfn = isolate_end - pageblock_nr_pages;
>>> + if (isolate_start != pfn) {
>>> + ret = set_migratetype_isolate(pfn_to_page(pfn), migratetype,
>>> + flags, pfn, pfn + pageblock_nr_pages);
>>> + if (ret)
>>> + goto unset_start_block;
>>> + }
>>> + ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true);
>>> if (ret) {
>>> - unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
>>> - return ret;
>>> + if (isolate_start != pfn)
>>> + goto unset_end_block;
>>> + else
>>> + goto unset_start_block;
>>> }
>>>
>>> /* skip isolated pageblocks at the beginning and end */
>>> @@ -557,15 +546,21 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>> pfn += pageblock_nr_pages) {
>>> page = __first_valid_page(pfn, pageblock_nr_pages);
>>> if (page && set_migratetype_isolate(page, migratetype, flags,
>>> - start_pfn, end_pfn)) {
>>> - undo_isolate_page_range(isolate_start, pfn, migratetype);
>>> - unset_migratetype_isolate(
>>> - pfn_to_page(isolate_end - pageblock_nr_pages),
>>> - migratetype);
>>> - return -EBUSY;
>>> - }
>>> + start_pfn, end_pfn))
>>> + goto unset_isolated_blocks;
>>> }
>>> return 0;
>>> +
>>> +unset_isolated_blocks:
>>> + ret = -EBUSY;
>>> + undo_isolate_page_range(isolate_start + pageblock_nr_pages, pfn,
>>> + migratetype);
>>> +unset_end_block:
>>> + unset_migratetype_isolate(pfn_to_page(isolate_end - pageblock_nr_pages),
>>> + migratetype);
>>> +unset_start_block:
>>> + unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
>>> + return ret;
>>> }
>>>
>>> /*
>>> --
>>> 2.25.1
>>
>>
>> --
>> Best Regards,
>> Yan, Zi

--
Best Regards,
Yan, Zi

2022-09-14 01:48:28

by Doug Berger

Subject: Re: [PATCH 03/21] mm/hugetlb: correct demote page offset logic

On 9/13/2022 4:34 PM, Matthew Wilcox wrote:
> On Tue, Sep 13, 2022 at 12:54:50PM -0700, Doug Berger wrote:
>> With gigantic pages it may not be true that struct page structures
>> are contiguous across the entire gigantic page. The mem_map_offset
>> function is used here in place of direct pointer arithmetic to
>> correct for this.
>
> We're just eliminating mem_map_offset(). Please use nth_page()
> instead.
That's good to know. I will include that in v2.

>
>> for (i = 0; i < pages_per_huge_page(h);
>> i += pages_per_huge_page(target_hstate)) {
>> + subpage = mem_map_offset(page, i);
>> if (hstate_is_gigantic(target_hstate))

2022-09-14 02:10:37

On 9/14/2022 10:08 AM, Mike Kravetz wrote:
> On 09/13/22 18:07, Doug Berger wrote:
>> On 9/13/2022 4:34 PM, Matthew Wilcox wrote:
>>> On Tue, Sep 13, 2022 at 12:54:50PM -0700, Doug Berger wrote:
>>>> With gigantic pages it may not be true that struct page structures
>>>> are contiguous across the entire gigantic page. The mem_map_offset
>>>> function is used here in place of direct pointer arithmetic to
>>>> correct for this.
>>>
>>> We're just eliminating mem_map_offset(). Please use nth_page()
>>> instead.
>> That's good to know. I will include that in v2.
>
> Thanks Doug and Matthew. I will take a closer look at this series soon.
>
> It seems like this patch is a fix independent of the series. If so, I
> would suggest sending separate to make it easy for backports to stable.
Yes, as I noted in [PATCH 00/21] the first three patches fit that
description, but I included them here in case someone was brave enough
to attempt to use this patch set. They were in my branch for my own testing.

Full disclosure: An earlier version of this patch set had more complete
support for hugepage isolation, including migrating the isolation
state when demoting a hugepage. That code touched lines in
demote_free_huge_page() and depended on the subpage variable introduced
here.

At this point I will submit a patch for this on its own and will likely
remove the first three commits when submitting V2 of the set.

Thanks for your consideration.
-Doug

2022-09-14 18:36:28

Reordered to (hopefully) improve readability.

On 10/5/2022 11:39 AM, David Hildenbrand wrote:
> May I ask what the main purpose/use case of DMB is?
The concept of Designated Movable Blocks was conceived to provide a
common mechanism for different use cases, so identifying the "main" one
is not so easy. Broadly speaking I would say there are two different but
compatible objectives that could be used to categorize use cases.

The narrower objective is the ability to locate some "user space
friendly" memory on each memory controller to make more of the total
memory bandwidth available to user space processes. The ZONE_MOVABLE
zone is considered to be "user space friendly" so locating some of it on
each memory controller would meet this objective. The existing
'movablecore' kernel parameter allows the balance of kernel/movable
memory to be adjusted, but movable memory will always be located on the
highest addressed memory controller. The v2 patch set attempts to focus
explicitly on the use case of adding a base address to the 'movablecore'
kernel parameter to support this objective.

The other general objective is to facilitate better reuse/sharing of
memory. Broadcom Set-Top Box SoCs include video processing devices that
can require large amounts of memory to perform their functions.
Historically, memory carve-outs have been used to ensure guaranteed
availability of memory to meet the requirements of cable television
customers. The rise of Android TV and Google TV has made the
inefficiency of memory carve-outs unacceptable.

We have tried to meet the reusability objective with a CMA based
implementation, but Broadcom customers were unhappy with the
performance. Efforts to improve the CMA performance led me to Joonsoo's
efforts to do the same and to the "sticky" MIGRATE_MOVABLE proposal from
Mel Gorman that I cited. I began working on an implementation of
Designated Movable Blocks based on that proposal which could be
characterized as reserving a block of memory, assigning it a new
"sticky" movable migrate type, and modifying the fast and slow path page
allocators to handle the new migrate type such that requests for movable
memory could be satisfied by pages from the blocks and that the migrate
type of pages in the blocks could not be changed by "fallback" mechanisms.

Both of these objectives require the ability to specify the location of
a block of memory that can only be used by the Linux kernel page
allocator to satisfy requests for movable memory. The location is
relevant because it may need to be on a specific memory controller or it
may have to satisfy the DMA address range of a specific device. The
movability is relevant because it improves the availability to user
space allocations or it allows the data occupying the memory to be moved
away when the memory is required by the device. The Designated Movable
Block mechanism was designed to satisfy these requirements and was seen
as a common mechanism for both objectives.

While learning more about the page allocator implementation, I realized
that hotplug memory also has these same requirements. The location of
hotplug memory is determined by the system hardware independent of
Linux's zone concepts and the data stored on the memory must be movable
to support the ability to offline the memory before it is unplugged.
This led me to study the hotplug memory implementation to understand how
they satisfied these requirements.

I became aware that the "narrower objective" could conceivably be
satisfied by the hotplug memory capability with a few challenges. First
the size of hotplug memory sections is a bit coarse. The current 128MB
sections on arm64 are not too bad and are far better than the 1GB
sections that were in place when I first looked at it.

For systems that do not support ACPI there is no clear way to specify
hotplug memory regions at boot time. When Linux boots an arm64 kernel
with devicetree the OS attempts to initialize all available memory
described by the devicetree. Typically this boot memory cannot be
unplugged to allow it to be plugged into a different zone. A devicetree
specification of the hardware could intentionally leave holes in its
memory description to allow for runtime plugging of memory into the
holes, but this goes against the spirit of a devicetree description of
the system hardware as it is not representative of what hardware is
actually present. The 'mem=' kernel parameter can be used to prevent
Linux from initializing all of the available memory so that memory could
be hotplugged after boot, but this breaks devicetree mechanisms for
reserving memory from addresses that might only be populated by hotplug
after boot.

It also becomes difficult to manage the selection of zones where memory
is hotplugged. Referring again to the example system with 1GB on MEMC0
and 1GB on MEMC1 we could boot with 'mem=768M' to leave 256MB
unpopulated on MEMC0 and all of the memory (1GB) on MEMC1 unpopulated.
If we set the memory_hotplug module parameter online_policy to
"auto-movable" then adding 256MB at 0x70000000 will put the memory in
ZONE_MOVABLE as desired. However, we might want to hotplug 768MB at
0x300000000 into ZONE_NORMAL and 256MB at 0x330000000 into ZONE_MOVABLE.
The fact that the memory_hotplug parameters are not easily modifiable
from the kernel modules that are necessary to access the memory_hotplug
API makes this a difficult dance. I have experimented with a simple
module exposing hotplug capability to user space and have confirmed as a
proof of concept that user space can adjust the memory_hotplug
parameters and use the module to achieve the desired zone population
with hotplug. The /sys/devices/system/memory/probe control simplifies
this, but is not enabled on arm64 architectures.
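
The address arithmetic in this example can be checked with a few lines of C. The sketch below assumes MEMC0 starts at 0x40000000, which is inferred from 'mem=768M' ending where the 256MB hotplug range begins and is not stated in this mail. The 128MB section size is the arm64 figure mentioned earlier. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MB (1024ULL * 1024ULL)
#define TOY_SECTION_SIZE (128 * TOY_MB)	/* arm64 section size cited above */

struct toy_range {
	uint64_t base;
	uint64_t size;
};

static uint64_t toy_range_end(struct toy_range r)
{
	return r.base + r.size;
}

/* Hotplug works on whole sections, so base and size must both be aligned. */
static int toy_section_aligned(struct toy_range r)
{
	return r.base % TOY_SECTION_SIZE == 0 && r.size % TOY_SECTION_SIZE == 0;
}

/* Assumption: MEMC0 base is 0x40000000; boot memory is limited by mem=768M. */
static const struct toy_range toy_boot        = { 0x40000000ULL,  768 * TOY_MB };
/* Remaining 256MB of MEMC0, intended for ZONE_MOVABLE. */
static const struct toy_range toy_memc0_hot   = { 0x70000000ULL,  256 * TOY_MB };
/* MEMC1: 768MB intended for ZONE_NORMAL, then 256MB for ZONE_MOVABLE. */
static const struct toy_range toy_memc1_norm  = { 0x300000000ULL, 768 * TOY_MB };
static const struct toy_range toy_memc1_mov   = { 0x330000000ULL, 256 * TOY_MB };
```

Under that assumption the boot range ends exactly at 0x70000000, the two MEMC1 ranges are contiguous and together cover 1GB, and every hotplugged range is a multiple of the section size.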

In addition, keeping this memory unplugged until after boot means that
the memory cannot be used during boot. Kernel boot time reservations are
a mixed bag. On the one hand they won't land in ZONE_MOVABLE which is
nice, but in this example they land in ZONE_DMA which can be considered
a more valuable resource than ZONE_NORMAL. Both of these issues are not
likely to be of significant consequence, but neither is really desirable.

Finally, just like there are those that may not want to execute a NUMA
kernel (e.g. Android GKI arm64), there may also be those that don't want
to include memory hotplug support in their kernel. These things can
change, but are not always under our control.

If you are aware of solutions to these issues that would make memory
hotplug a more viable solution for us than DMB I would be happy to know
them.

These observations led me to design DMB more as an extension of
'movablecore' than an extension of memory hotplug. However, the
efficiency of using the ZONE_MOVABLE zone to collect and manage "sticky"
movable pages in an address independent way without "fallback" (as is
done by memory hotplug) won me over and I abandoned the idea of
modifying the fast and slow page allocator paths to support a "sticky"
movable migrate type. The implementation of DMB was re-conceived to
preserve the existing 'movablecore' mechanism of creating a dynamic
ZONE_MOVABLE zone that spans from zone_movable_pfn for each node to the
end of memory on the node, and adding the ability to designate blocks of
memory whose pages would be removed from their default zone and placed
in the ZONE_MOVABLE zone. The span of each ZONE_MOVABLE zone was
increased to start at the lowest pfn in the zone on the node and
continue to the end of memory on the node. I also neglected to destroy
zones that became empty after their pages were moved to ZONE_MOVABLE.
These last two decisions were a matter of convenience, but I can see
that they may have created some confusion (based on your questions) so I
am happy to reconsider them.

>
> Would it be sufficient to specify that hugetlb pages are allocated from a
> specific memory area, possibly managed by CMA? And then simply provide
> these hugetlb pages to the application that cares? Would you need something
> that is *not* hugetlb?
>
> But even then, how would an application be able to specify exactly that
> its allocation will get served from that part of ZONE_MOVABLE? Sure, if
> you don't reserve any other hugetlb pages, it's easy.
As noted before I actually have very limited visibility into how the
"narrower objective" is being used by Broadcom customers and how much
benefit it provides. I believe its current use is probably simply
opportunistic, but these kinds of improvements to hugetlb allocation
might be welcomed.

I'd say the hugetlb_cma is similar to what you are describing except
that it is consolidated rather than being distributed across multiple
memory areas. Such changes to add benefit to the "narrower objective"
need not be considered with respect to this patch set. On the other
hand, the reuse objective of Designated Movable Blocks could be very
relevant to hugetlb_cma.

>>
>> I agree with Mel Gorman that zones are meant to be about address induced
>> limitations, so using a zone for the purpose of breaking the fallback
>> mechanism of the page allocator is a misuse of the concept. A new
>> migratetype would be more appropriate for representing this change in
>> how fallback should apply to the pageblock because the desired behavior
>> has nothing to do with the address at which the memory is located. It is
>> entirely reasonable to desire "sticky" movable behavior for memory in
>> any zone. Such a solution would be directly applicable to our multiple
>> memory controller use case, and is really how Designated Movable Blocks
>> should be imagined.
>
> I usually agree with Mel, but not necessarily on that point that it's a
> misuse of a concept. It's an extension of an existing concept, that
> doesn't imply it's a misuse. Traditionally, it was about address
> limitations, yes. Now it's also about allocation types. Sure, there
> might be other ways to get it done as well.
Yes, I would also agree that address limitation was the concept when
zones were introduced, and that the extensions made for memory hotplug
have enough value to justify extending the initial concept. That is
exactly why I changed my approach.

>
> I'd compare it to the current use of NUMA nodes: traditionally, it
> really used to be actual NUMA nodes. Nowadays, it's a mechanism, for
> example, to expose performance-differentiated memory, let applications use
> it via mbind() or have the page allocator dynamically migrate hot/cold
> pages back and forth according to memory tiering strategies.
You are helping me gain an appreciation for the current extensions of
the node concept beyond the initial use for NUMA. It does sound useful
for applications that do want to have that finer control over the
resources they use.

However, I still believe there is value in the Designated Movable Block
concept that should be realizable when nodes are not available in the
kernel config. The implementation I am proposing should not incur a cost
for those that don't wish to use it.

>
>>
>> However, I also recognize the efficiency benefits of using a
>> ZONE_MOVABLE zone to manage the pages that have this "sticky" movable
>> behavior. Introducing a new sticky MIGRATE_MOVABLE migratetype adds a
>> new free_list to every free_area which increases the search space and
>> associated work when trying to allocate a page for all callers.
>> Introducing ZONE_MOVABLE reduces the search space by providing an early
>> separation between searches for movable and non-movable allocations. The
>> classic zone restrictions weren't a good fit for multiple memory
>> controllers, but those restrictions were lifted to overcome similar
>> issues with memory_hotplug. It is not that Designated Movable Blocks
>> want to be in ZONE_MOVABLE, but rather that ZONE_MOVABLE provides a
>> convenience for managing the page allocator's use of "sticky" movable
>> memory just like it does for memory hotplug. Dumping the memory in
>> Designated Movable Blocks into the ZONE_MOVABLE zone allows an existing
>> mechanism to be reused, reducing the risk of negatively impacting the
>> page allocator behavior.
>>
>> There are some subtle distinctions between Designated Movable Blocks and
>> the existing ZONE_MOVABLE zone. Because Designated Movable Blocks are
>> reserved when created they are protected against any early boot time
>> kernel reservations that might place unmovable allocations in them. The
>> implementation continues to track the zone_movable_pfn as the start of
>> the "classic" ZONE_MOVABLE zone on each node. A Designated Movable Block
>> can overlap any other zone including the "classic" ZONE_MOVABLE zone.
>
> What exactly do you mean by "overlay" -- I assume you mean that the zone
> span will overlay but it really "belongs" to ZONE_MOVABLE, as indicated
> by its struct page metadata.
Yes. If the pages of a DMB are within the span of a zone I am saying it
overlaps that zone. The pages will only be "present" in the ZONE_MOVABLE
zone.
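
The distinction between a zone's span and its present pages can be sketched with a toy structure. The field names mirror zone_start_pfn, spanned_pages, and present_pages, but toy_designate_block() is a hypothetical helper, and the way it widens the movable zone's span (just enough to cover the block) is a simplification of what the patch set describes.

```c
#include <assert.h>

struct toy_zone {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;	/* size of the pfn range the zone covers */
	unsigned long present_pages;	/* pages that actually belong to the zone */
};

static int toy_zone_spans_pfn(const struct toy_zone *z, unsigned long pfn)
{
	return pfn >= z->zone_start_pfn &&
	       pfn < z->zone_start_pfn + z->spanned_pages;
}

/*
 * Hand a block of pages from 'from' to 'movable'. The source zone keeps its
 * span and only loses present pages; the movable zone's span grows to cover
 * the block, so the two spans overlap afterwards.
 */
static void toy_designate_block(struct toy_zone *from, struct toy_zone *movable,
				unsigned long start_pfn, unsigned long nr_pages)
{
	unsigned long end_pfn = start_pfn + nr_pages;
	unsigned long mov_start = movable->zone_start_pfn;
	unsigned long mov_end = mov_start + movable->spanned_pages;

	assert(nr_pages > 0 && nr_pages <= from->present_pages);
	assert(toy_zone_spans_pfn(from, start_pfn));
	assert(toy_zone_spans_pfn(from, end_pfn - 1));

	from->present_pages -= nr_pages;
	movable->present_pages += nr_pages;

	if (movable->spanned_pages == 0) {
		mov_start = start_pfn;
		mov_end = end_pfn;
	} else {
		if (start_pfn < mov_start)
			mov_start = start_pfn;
		if (end_pfn > mov_end)
			mov_end = end_pfn;
	}
	movable->zone_start_pfn = mov_start;
	movable->spanned_pages = mov_end - mov_start;
}
```

After designating a block inside a "normal" zone, a pfn in that block is spanned by both zones, while the page counts show it as present only in the movable one.
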

>>>>
>>>>>
>>>>> Why do we have to start using ZONE_MOVABLE for them?
>>>> One of the "other opportunities" for Designated Movable Blocks is to
>>>> allow CMA to allocate from a DMB as an alternative. This would allow
>>>> current users to continue using CMA as they want, but would allow users
>>>> (e.g. hugetlb_cma) that are not sensitive to the allocation latency to
>>>> let the kernel page allocator make more complete use (i.e. waste less)
>>>> of the shared memory. ZONE_MOVABLE pageblocks are always
>>>> MIGRATE_MOVABLE
>>>> so the restrictions placed on MIGRATE_CMA pageblocks are lifted
>>>> within a
>>>> DMB.
>>>
>>> The whole purpose of ZONE_MOVABLE is that *no* unmovable allocations end
>>> up on it. The biggest difference to CMA is that the CMA *owner* is able
>>> to place unmovable allocations on it.
>> I'm not sure that is a wholly fair characterization (or maybe I just
>> hope that's the case :). I would agree that the Linux page allocator
>> can't place any unmovable allocations on it. I expect that people locate
>> memory in ZONE_MOVABLE for different purposes. For example, the memory
>> hotplug users ostensibly place memory there so that any data on the hot
>> plugged memory can be moved off of the memory prior to it being hot
>> unplugged. Unplugging the memory removes the memory from the
>> ZONE_MOVABLE zone, but it is not materially different from allocating
>> the memory for a different purpose (perhaps in a different machine).
>
> Well, memory offlining is the one operation that evacuates memory and
> makes sure it cannot be allocated anymore (possibly with the intention
> of removing that memory from the system). Sure, you can call it a fake
> allocation, but there is a more fundamental difference compared to
> random subsystems placing unmovable allocations there.
For the record, I am not offended by your use of the word "random" in
that statement. I was once informed I unintentionally offended someone
by using the term "arbitrary" in a similar way ;).

Any such unmovable allocation should be made with intent and with
authority to do so. The memory hotunplug is an example (perhaps a
singular one) of a subsystem that can do so with intent and authority.
Randomness plays no role.

"Ownership" of a DMB would imply authority and such an owner should be
presumed to be acting with intent. So the mechanics of ownership and
methods should be formalized before the general objective of reuse of
DMBs for non-movable purposes (e.g. hugetlb_cma, device driver, ...) is
allowed. This is why that objective has been deferred with the hope that
users that may have an interest in this objective can propose their
favored mechanism.

The "narrower objective" expressed in my v2 submission (i.e. movablecore
with base address) does not make any non-movable allocations so explicit
ownership is not necessary. Maybe whoever provided the 'movablecore'
parameter is the implied owner, but it doesn't much matter in this case.
Conceptually such a DMB could be hotunplugged, but that would be unexpected.

>
>>
>> Conceptually, allowing a CMA allocator to operate on a Designated
>> Movable Block of memory that it *owns* is also removing that memory from
>> the ZONE_MOVABLE zone. Issues of ownership should be addressed which is
>> why these "other opportunities" are being deferred for now, but I do not
>> believe such use is unreasonable. Again, Designated Movable Blocks are
>> only allowed in boot memory so there shouldn't be a conflict with memory
>> hotplug. I believe the same would apply for hugetlb_cma.
>>>
>>> Using ZONE_MOVABLE for unmovable allocations (hugetlb_cma) is not
>>> acceptable as is.
>>>
>>> Using ZONE_MOVABLE in different context and calling it DMB is very
>>> confusing TBH.
>> Perhaps it is more helpful to think of a Designated Movable Block as a
>> block of memory whose migratetype is not allowed to be changed from
>> MIGRATE_MOVABLE (i.e. "sticky" migrate movable). The fact that
>
> I think that such a description might make the feature easier to grasp.
> Although I am not sure yet if DMB as proposed is rather a hack to avoid
> introducing real sticky movable blocks (sorry, I'm just trying to
> connect the dots and there is a lot of complexity involved) or actually
> a clean design. Messing with zones and memblock always implies
> complexity :)
I very much appreciate your efforts to make sense of this. I am not
certain whether that OR is INCLUSIVE or EXCLUSIVE. I would say that the
implementation attempts to reuse the clean design of ZONE_MOVABLE (as
extended by memory hotplug) to provide the management of "sticky"
movable blocks that may overlap/overlay other zones. Doing so makes it
unnecessary to provide an otherwise redundant implementation of "sticky"
movable blocks that would likely degrade the performance of page
allocations from zones other than ZONE_MOVABLE, even when no "sticky"
movable blocks exist in the system.

>
>> ZONE_MOVABLE is being used to achieve that is an implementation detail
>> for this commit set. In the same way that memory hotplug is the concept
>> of adding System RAM during run time, but placing it in ZONE_MOVABLE is
>> an implementation detail to make it easier to unplug.
>
> Right, but there we don't play any tricks: it's just ZONE_MOVABLE
> without any other metadata pointing out ownership. Maybe that's what you
> are trying to describe here: A DMB inside ZONE_MOVABLE implies that
> there is another owner and that even memory offlining should fail.
Now why didn't I just say that in the first place :). The general
objective of reuse is inspired by CMA which has implied/explicit
ownership and as noted above DMB needs ownership to meet this objective
as well.

Thanks for your patience and helping me attempt to communicate this more
clearly.
-Doug