2023-09-12 19:39:06

by Zi Yan

Subject: [RFC PATCH 0/4] Enable >0 order folio memory compaction

From: Zi Yan <[email protected]>

Hi all,

This patchset enables >0 order folio memory compaction, which is one of
the prerequisites for large folio support[1]. It is on top of
mm-everything-2023-09-11-22-56.

Overview
===

To support >0 order folio compaction, the patchset changes how free pages used
for migration are kept during compaction. Free pages used to be split into
order-0 pages and post-allocation processed immediately (i.e., the PageBuddy
flag is cleared, the page order stored in page->private is zeroed, and the
page reference count is set to 1). Now all free pages are kept in a
MAX_ORDER+1 array of page lists based on their order, without post-allocation
processing. When migrate_pages() asks for a new page, one of the free pages,
chosen based on the requested page order, is then processed and given out.


Optimizations
===

1. Free page split is added to increase the migration success rate in case
a source page does not have a matching free page in the free page lists.
Free page merge is possible but not implemented, since the existing
PFN-based buddy page merge algorithm requires identifying buddy pages,
but free pages kept for memory compaction cannot have PageBuddy set
without confusing other PFN scanners.

2. Source pages are sorted in ascending order before migration to reduce
free page splits. Otherwise, high order free pages might be prematurely
split, causing undesired migration failures for high order folios.


TODOs
===

1. Refactor the free page post-allocation and free page preparation code so
that compaction_alloc() and compaction_free() can call common functions
instead of hard-coding the logic.

2. One possible optimization is to allow migrate_pages() to continue even
if get_new_folio() returns NULL. In general, a NULL return means there is
not enough memory, but in the >0 order folio compaction case it only means
there is no suitable free page at the source page's order. It might be
better to skip that page and migrate the rest to achieve a better
compaction result.

3. Another possible optimization is to enable free page merge. It is
possible that a to-be-migrated page causes a free page split and then
eventually fails to migrate. Without a free page merge function, we would
lose a high order free page. But reusing the existing PFN-based buddy page
merge requires a way of identifying free pages kept for memory compaction.

4. The implemented >0 order folio compaction algorithm is quite naive
and does not consider all possible situations. A better algorithm can
improve compaction success rate.


Feel free to give comments and ask questions.

Thanks.


[1] https://lore.kernel.org/linux-mm/[email protected]/

Zi Yan (4):
mm/compaction: add support for >0 order folio memory compaction.
mm/compaction: optimize >0 order folio compaction with free page
split.
mm/compaction: optimize >0 order folio compaction by sorting source
pages.
mm/compaction: enable compacting >0 order folios.

mm/compaction.c | 205 +++++++++++++++++++++++++++++++++++++++---------
mm/internal.h | 7 +-
2 files changed, 176 insertions(+), 36 deletions(-)

--
2.40.1


2023-09-12 21:10:47

by Zi Yan

Subject: [RFC PATCH 2/4] mm/compaction: optimize >0 order folio compaction with free page split.

From: Zi Yan <[email protected]>

During migration in memory compaction, free pages are placed in an array of
page lists based on their order. But the desired free page order (i.e., the
order of a source page) might not always be present, thus leading to
migration failures. Split a high order free page when the source migration
page has a lower order, to increase the migration success rate.

Note: merging free pages when a migration fails and a lower order free
page is returned via compaction_free() is possible, but it is a lot of
work. Since the free pages are not buddy pages, it is hard to identify
them using the existing PFN-based page merging algorithm.

Signed-off-by: Zi Yan <[email protected]>
---
mm/compaction.c | 40 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 868e92e55d27..45747ab5f380 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1801,9 +1801,46 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
struct compact_control *cc = (struct compact_control *)data;
struct folio *dst;
int order = folio_order(src);
+ bool has_isolated_pages = false;

+again:
if (!cc->freepages[order].nr_free) {
- isolate_freepages(cc);
+ int i;
+
+ for (i = order + 1; i <= MAX_ORDER; i++) {
+ if (cc->freepages[i].nr_free) {
+ struct page *freepage =
+ list_first_entry(&cc->freepages[i].pages,
+ struct page, lru);
+
+ int start_order = i;
+ unsigned long size = 1 << start_order;
+
+ list_del(&freepage->lru);
+ cc->freepages[i].nr_free--;
+
+ while (start_order > order) {
+ start_order--;
+ size >>= 1;
+
+ list_add(&freepage[size].lru,
+ &cc->freepages[start_order].pages);
+ cc->freepages[start_order].nr_free++;
+ set_page_private(&freepage[size], start_order);
+ }
+ post_alloc_hook(freepage, order, __GFP_MOVABLE);
+ if (order)
+ prep_compound_page(freepage, order);
+ dst = page_folio(freepage);
+ goto done;
+ }
+ }
+ if (!has_isolated_pages) {
+ isolate_freepages(cc);
+ has_isolated_pages = true;
+ goto again;
+ }
+
if (!cc->freepages[order].nr_free)
return NULL;
}
@@ -1814,6 +1851,7 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
post_alloc_hook(&dst->page, order, __GFP_MOVABLE);
if (order)
prep_compound_page(&dst->page, order);
+done:
cc->nr_freepages -= 1 << order;
return dst;
}
--
2.40.1

2023-09-13 00:31:21

by Zi Yan

Subject: [RFC PATCH 3/4] mm/compaction: optimize >0 order folio compaction by sorting source pages.

From: Zi Yan <[email protected]>

Sort source pages by their order before migration. This should maximize
high order free page use and minimize free page splits. It might be useful
until free page merging is implemented.

Signed-off-by: Zi Yan <[email protected]>
---
mm/compaction.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index 45747ab5f380..4300d877b824 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -145,6 +145,38 @@ static void sort_free_pages(struct list_head *src, struct free_list *dst)
}
}

+static void sort_folios_by_order(struct list_head *pages)
+{
+ struct free_list page_list[MAX_ORDER + 1];
+ int order;
+ struct folio *folio, *next;
+
+ for (order = 0; order <= MAX_ORDER; order++) {
+ INIT_LIST_HEAD(&page_list[order].pages);
+ page_list[order].nr_free = 0;
+ }
+
+ list_for_each_entry_safe(folio, next, pages, lru) {
+ order = folio_order(folio);
+
+ if (order > MAX_ORDER)
+ continue;
+
+ list_move(&folio->lru, &page_list[order].pages);
+ page_list[order].nr_free++;
+ }
+
+ for (order = MAX_ORDER; order >= 0; order--) {
+ if (page_list[order].nr_free) {
+
+ list_for_each_entry_safe(folio, next,
+ &page_list[order].pages, lru) {
+ list_move_tail(&folio->lru, pages);
+ }
+ }
+ }
+}
+
#ifdef CONFIG_COMPACTION
bool PageMovable(struct page *page)
{
@@ -2636,6 +2668,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
pageblock_start_pfn(cc->migrate_pfn - 1));
}

+ sort_folios_by_order(&cc->migratepages);
+
err = migrate_pages(&cc->migratepages, compaction_alloc,
compaction_free, (unsigned long)cc, cc->mode,
MR_COMPACTION, &nr_succeeded);
--
2.40.1

2023-09-13 14:13:19

by Zi Yan

Subject: [RFC PATCH 4/4] mm/compaction: enable compacting >0 order folios.

From: Zi Yan <[email protected]>

Since the compaction code can now compact >0 order folios, enable isolating
them during the process.

Signed-off-by: Zi Yan <[email protected]>
---
mm/compaction.c | 25 ++++++++++---------------
1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 4300d877b824..f72af74094de 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1087,11 +1087,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (PageCompound(page) && !cc->alloc_contig) {
const unsigned int order = compound_order(page);

- if (likely(order <= MAX_ORDER)) {
- low_pfn += (1UL << order) - 1;
- nr_scanned += (1UL << order) - 1;
+ /*
+ * Compacting > pageblock_order pages does not improve
+ * memory fragmentation. Also skip hugetlbfs pages.
+ */
+ if (likely(order >= pageblock_order) || PageHuge(page)) {
+ if (order <= MAX_ORDER) {
+ low_pfn += (1UL << order) - 1;
+ nr_scanned += (1UL << order) - 1;
+ }
+ goto isolate_fail;
}
- goto isolate_fail;
}

/*
@@ -1214,17 +1220,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
goto isolate_abort;
}
}
-
- /*
- * folio become large since the non-locked check,
- * and it's on LRU.
- */
- if (unlikely(folio_test_large(folio) && !cc->alloc_contig)) {
- low_pfn += folio_nr_pages(folio) - 1;
- nr_scanned += folio_nr_pages(folio) - 1;
- folio_set_lru(folio);
- goto isolate_fail_put;
- }
}

/* The folio is taken off the LRU */
--
2.40.1

2023-09-15 09:47:08

by Baolin Wang

Subject: Re: [RFC PATCH 4/4] mm/compaction: enable compacting >0 order folios.



On 9/13/2023 12:28 AM, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> Since compaction code can compact >0 order folios, enable it during the
> process.
>
> Signed-off-by: Zi Yan <[email protected]>
> ---
> mm/compaction.c | 25 ++++++++++---------------
> 1 file changed, 10 insertions(+), 15 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4300d877b824..f72af74094de 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1087,11 +1087,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> if (PageCompound(page) && !cc->alloc_contig) {
> const unsigned int order = compound_order(page);
>
> - if (likely(order <= MAX_ORDER)) {
> - low_pfn += (1UL << order) - 1;
> - nr_scanned += (1UL << order) - 1;
> + /*
> + * Compacting > pageblock_order pages does not improve
> + * memory fragmentation. Also skip hugetlbfs pages.
> + */
> + if (likely(order >= pageblock_order) || PageHuge(page)) {

IMO, if the compound page order is larger than the requested cc->order,
we should also fail the isolation, because it does not help reduce
fragmentation either, right?

> + if (order <= MAX_ORDER) {
> + low_pfn += (1UL << order) - 1;
> + nr_scanned += (1UL << order) - 1;
> + }
> + goto isolate_fail;
> }
> - goto isolate_fail;
> }
>
> /*
> @@ -1214,17 +1220,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> goto isolate_abort;
> }
> }
> -
> - /*
> - * folio become large since the non-locked check,
> - * and it's on LRU.
> - */
> - if (unlikely(folio_test_large(folio) && !cc->alloc_contig)) {
> - low_pfn += folio_nr_pages(folio) - 1;
> - nr_scanned += folio_nr_pages(folio) - 1;
> - folio_set_lru(folio);
> - goto isolate_fail_put;
> - }

I do not think you can remove this validation, since the previous validation
is lockless. So under the lock, we need to re-check whether the compound
page order is larger than pageblock_order or cc->order, and fail the
isolation if so.

> }
>
> /* The folio is taken off the LRU */

2023-09-18 07:58:51

by Baolin Wang

Subject: Re: [RFC PATCH 2/4] mm/compaction: optimize >0 order folio compaction with free page split.



On 9/13/2023 12:28 AM, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> During migration in a memory compaction, free pages are placed in an array
> of page lists based on their order. But the desired free page order (i.e.,
> the order of a source page) might not be always present, thus leading to
> migration failures. Split a high order free pages when source migration
> page has a lower order to increase migration successful rate.
>
> Note: merging free pages when a migration fails and a lower order free
> page is returned via compaction_free() is possible, but there is too much
> work. Since the free pages are not buddy pages, it is hard to identify
> these free pages using existing PFN-based page merging algorithm.
>
> Signed-off-by: Zi Yan <[email protected]>
> ---
> mm/compaction.c | 40 +++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 39 insertions(+), 1 deletion(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 868e92e55d27..45747ab5f380 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1801,9 +1801,46 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
> struct compact_control *cc = (struct compact_control *)data;
> struct folio *dst;
> int order = folio_order(src);
> + bool has_isolated_pages = false;
>
> +again:
> if (!cc->freepages[order].nr_free) {
> - isolate_freepages(cc);
> + int i;
> +
> + for (i = order + 1; i <= MAX_ORDER; i++) {
> + if (cc->freepages[i].nr_free) {
> + struct page *freepage =
> + list_first_entry(&cc->freepages[i].pages,
> + struct page, lru);
> +
> + int start_order = i;
> + unsigned long size = 1 << start_order;
> +
> + list_del(&freepage->lru);
> + cc->freepages[i].nr_free--;
> +
> + while (start_order > order) {
> + start_order--;
> + size >>= 1;
> +
> + list_add(&freepage[size].lru,
> + &cc->freepages[start_order].pages);
> + cc->freepages[start_order].nr_free++;
> + set_page_private(&freepage[size], start_order);

IIUC, these split pages should also be initialized by calling functions
such as prep_compound_page()?

> + }
> + post_alloc_hook(freepage, order, __GFP_MOVABLE);
> + if (order)
> + prep_compound_page(freepage, order);
> + dst = page_folio(freepage);
> + goto done;
> + }
> + }
> + if (!has_isolated_pages) {
> + isolate_freepages(cc);
> + has_isolated_pages = true;
> + goto again;
> + }
> +
> if (!cc->freepages[order].nr_free)
> return NULL;
> }
> @@ -1814,6 +1851,7 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
> post_alloc_hook(&dst->page, order, __GFP_MOVABLE);
> if (order)
> prep_compound_page(&dst->page, order);
> +done:
> cc->nr_freepages -= 1 << order;
> return dst;
> }

2023-09-18 20:52:00

by Zi Yan

Subject: Re: [RFC PATCH 2/4] mm/compaction: optimize >0 order folio compaction with free page split.

On 18 Sep 2023, at 3:34, Baolin Wang wrote:

> On 9/13/2023 12:28 AM, Zi Yan wrote:
>> From: Zi Yan <[email protected]>
>>
>> During migration in a memory compaction, free pages are placed in an array
>> of page lists based on their order. But the desired free page order (i.e.,
>> the order of a source page) might not be always present, thus leading to
>> migration failures. Split a high order free pages when source migration
>> page has a lower order to increase migration successful rate.
>>
>> Note: merging free pages when a migration fails and a lower order free
>> page is returned via compaction_free() is possible, but there is too much
>> work. Since the free pages are not buddy pages, it is hard to identify
>> these free pages using existing PFN-based page merging algorithm.
>>
>> Signed-off-by: Zi Yan <[email protected]>
>> ---
>> mm/compaction.c | 40 +++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 39 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 868e92e55d27..45747ab5f380 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -1801,9 +1801,46 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
>> struct compact_control *cc = (struct compact_control *)data;
>> struct folio *dst;
>> int order = folio_order(src);
>> + bool has_isolated_pages = false;
>> +again:
>> if (!cc->freepages[order].nr_free) {
>> - isolate_freepages(cc);
>> + int i;
>> +
>> + for (i = order + 1; i <= MAX_ORDER; i++) {
>> + if (cc->freepages[i].nr_free) {
>> + struct page *freepage =
>> + list_first_entry(&cc->freepages[i].pages,
>> + struct page, lru);
>> +
>> + int start_order = i;
>> + unsigned long size = 1 << start_order;
>> +
>> + list_del(&freepage->lru);
>> + cc->freepages[i].nr_free--;
>> +
>> + while (start_order > order) {
>> + start_order--;
>> + size >>= 1;
>> +
>> + list_add(&freepage[size].lru,
>> + &cc->freepages[start_order].pages);
>> + cc->freepages[start_order].nr_free++;
>> + set_page_private(&freepage[size], start_order);
>
> IIUC, these split pages should also call functions to initialize? e.g. prep_compound_page()?

Not at this place. It is done right below, above the "done" label. While free
pages are on cc->freepages, we want to keep them unprocessed by
post_alloc_hook() or prep_compound_page() to allow a possible future split. A
free page is only initialized when it is returned by compaction_alloc().

>
>> + }
>> + post_alloc_hook(freepage, order, __GFP_MOVABLE);
>> + if (order)
>> + prep_compound_page(freepage, order);
>> + dst = page_folio(freepage);
>> + goto done;
>> + }
>> + }
>> + if (!has_isolated_pages) {
>> + isolate_freepages(cc);
>> + has_isolated_pages = true;
>> + goto again;
>> + }
>> +
>> if (!cc->freepages[order].nr_free)
>> return NULL;
>> }
>> @@ -1814,6 +1851,7 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
>> post_alloc_hook(&dst->page, order, __GFP_MOVABLE);
>> if (order)
>> prep_compound_page(&dst->page, order);
>> +done:
>> cc->nr_freepages -= 1 << order;
>> return dst;
>> }


--
Best Regards,
Yan, Zi



2023-09-18 22:37:52

by Zi Yan

Subject: Re: [RFC PATCH 4/4] mm/compaction: enable compacting >0 order folios.

On 15 Sep 2023, at 5:41, Baolin Wang wrote:

> On 9/13/2023 12:28 AM, Zi Yan wrote:
>> From: Zi Yan <[email protected]>
>>
>> Since compaction code can compact >0 order folios, enable it during the
>> process.
>>
>> Signed-off-by: Zi Yan <[email protected]>
>> ---
>> mm/compaction.c | 25 ++++++++++---------------
>> 1 file changed, 10 insertions(+), 15 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 4300d877b824..f72af74094de 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -1087,11 +1087,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>> if (PageCompound(page) && !cc->alloc_contig) {
>> const unsigned int order = compound_order(page);
>> - if (likely(order <= MAX_ORDER)) {
>> - low_pfn += (1UL << order) - 1;
>> - nr_scanned += (1UL << order) - 1;
>> + /*
>> + * Compacting > pageblock_order pages does not improve
>> + * memory fragmentation. Also skip hugetlbfs pages.
>> + */
>> + if (likely(order >= pageblock_order) || PageHuge(page)) {
>
> IMO, if the compound page order is larger than the requested cc->order, we should also fail the isolation, cause it also does not improve fragmentation, right?
>

Probably yes. I think the reasoning should be: since compaction is asking for
cc->order, we should not compact folios with orders larger than or equal to
that, because cc->order tells us the max free page order is smaller than it;
otherwise the allocation would already have happened. I will add this
condition in the next version.

>> + if (order <= MAX_ORDER) {
>> + low_pfn += (1UL << order) - 1;
>> + nr_scanned += (1UL << order) - 1;
>> + }
>> + goto isolate_fail;
>> }
>> - goto isolate_fail;
>> }
>> /*
>> @@ -1214,17 +1220,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>> goto isolate_abort;
>> }
>> }
>> -
>> - /*
>> - * folio become large since the non-locked check,
>> - * and it's on LRU.
>> - */
>> - if (unlikely(folio_test_large(folio) && !cc->alloc_contig)) {
>> - low_pfn += folio_nr_pages(folio) - 1;
>> - nr_scanned += folio_nr_pages(folio) - 1;
>> - folio_set_lru(folio);
>> - goto isolate_fail_put;
>> - }
>
> I do not think you can remove this validation, since the previous validation is lockless. So under the lock, we need to re-check whether the compound page order is larger than pageblock_order or cc->order, and fail the isolation if so.

This check should go away, but a new order check for large folios should be
added. Will add it. Thanks.

--
Best Regards,
Yan, Zi



2023-09-20 08:19:34

by Baolin Wang

Subject: Re: [RFC PATCH 2/4] mm/compaction: optimize >0 order folio compaction with free page split.



On 9/19/2023 1:20 AM, Zi Yan wrote:
> On 18 Sep 2023, at 3:34, Baolin Wang wrote:
>
>> On 9/13/2023 12:28 AM, Zi Yan wrote:
>>> From: Zi Yan <[email protected]>
>>>
>>> During migration in a memory compaction, free pages are placed in an array
>>> of page lists based on their order. But the desired free page order (i.e.,
>>> the order of a source page) might not be always present, thus leading to
>>> migration failures. Split a high order free pages when source migration
>>> page has a lower order to increase migration successful rate.
>>>
>>> Note: merging free pages when a migration fails and a lower order free
>>> page is returned via compaction_free() is possible, but there is too much
>>> work. Since the free pages are not buddy pages, it is hard to identify
>>> these free pages using existing PFN-based page merging algorithm.
>>>
>>> Signed-off-by: Zi Yan <[email protected]>
>>> ---
>>> mm/compaction.c | 40 +++++++++++++++++++++++++++++++++++++++-
>>> 1 file changed, 39 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/compaction.c b/mm/compaction.c
>>> index 868e92e55d27..45747ab5f380 100644
>>> --- a/mm/compaction.c
>>> +++ b/mm/compaction.c
>>> @@ -1801,9 +1801,46 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
>>> struct compact_control *cc = (struct compact_control *)data;
>>> struct folio *dst;
>>> int order = folio_order(src);
>>> + bool has_isolated_pages = false;
>>> +again:
>>> if (!cc->freepages[order].nr_free) {
>>> - isolate_freepages(cc);
>>> + int i;
>>> +
>>> + for (i = order + 1; i <= MAX_ORDER; i++) {
>>> + if (cc->freepages[i].nr_free) {
>>> + struct page *freepage =
>>> + list_first_entry(&cc->freepages[i].pages,
>>> + struct page, lru);
>>> +
>>> + int start_order = i;
>>> + unsigned long size = 1 << start_order;
>>> +
>>> + list_del(&freepage->lru);
>>> + cc->freepages[i].nr_free--;
>>> +
>>> + while (start_order > order) {
>>> + start_order--;
>>> + size >>= 1;
>>> +
>>> + list_add(&freepage[size].lru,
>>> + &cc->freepages[start_order].pages);
>>> + cc->freepages[start_order].nr_free++;
>>> + set_page_private(&freepage[size], start_order);
>>
>> IIUC, these split pages should also call functions to initialize? e.g. prep_compound_page()?
>
> Not at this place. It is done right below and above "done" label. When free pages
> are on cc->freepages, we want to keep them without being post_alloc_hook() or
> prep_compound_page() processed for a possible future split. A free page is
> only initialized when it is returned by compaction_alloc().

Ah, I see. Thanks for explanation.

2023-09-20 19:57:58

by kernel test robot

Subject: Re: [RFC PATCH 4/4] mm/compaction: enable compacting >0 order folios.



Hello,

kernel test robot noticed "kernel_BUG_at_lib/list_debug.c" on:

commit: 810d9ce367799ba4fef1e894b342e5ab74d44681 ("[RFC PATCH 4/4] mm/compaction: enable compacting >0 order folios.")
url: https://github.com/intel-lab-lkp/linux/commits/Zi-Yan/mm-compaction-add-support-for-0-order-folio-memory-compaction/20230913-003027
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: [RFC PATCH 4/4] mm/compaction: enable compacting >0 order folios.

in testcase: vm-scalability
version: vm-scalability-x86_64-1.0-0_20220518
with following parameters:

runtime: 300s
test: lru-file-readtwice
cpufreq_governor: performance

test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/


compiler: gcc-12
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480L (Sapphire Rapids) with 512G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230920/[email protected]


[ 104.256019][ T1493] list_del corruption, ffd40001e3611490->prev is NULL
[ 104.264911][ T1493] ------------[ cut here ]------------
[ 104.272315][ T1493] kernel BUG at lib/list_debug.c:54!
[ 104.279501][ T1493] invalid opcode: 0000 [#1] SMP NOPTI
[ 104.286658][ T1493] CPU: 91 PID: 1493 Comm: kcompactd1 Not tainted 6.6.0-rc1-00153-g810d9ce36779 #1
[ 104.298169][ T1493] Hardware name: NULL NULL/NULL, BIOS 05.02.01 05/12/2023
[ 104.307252][ T1493] RIP: 0010:__list_del_entry_valid_or_report+0x6e/0xf0
[ 104.315987][ T1493] Code: b8 01 00 00 00 c3 cc cc cc cc 48 89 fe 48 c7 c7 80 c1 71 82 e8 e3 37 a3 ff 0f 0b 48 89 fe 48 c7 c7 b0 c1 71 82 e8 d2 37 a3 ff <0f> 0b 48 89 fe 48 c7 c7 e0 c1 71 82 e8 c1 37 a3 ff 0f 0b 48 89 fe
[ 104.339068][ T1493] RSP: 0018:ffa0000010a37910 EFLAGS: 00010046
[ 104.346919][ T1493] RAX: 0000000000000033 RBX: ff110080749b5ab8 RCX: 0000000000000000
[ 104.356938][ T1493] RDX: 0000000000000000 RSI: ff11007f416dc6c0 RDI: ff11007f416dc6c0
[ 104.366914][ T1493] RBP: ff110040b00af858 R08: 0000000000000000 R09: ffa0000010a377b8
[ 104.376873][ T1493] R10: 0000000000000003 R11: ff11007f40dfffe8 R12: ffd40001e3611400
[ 104.386808][ T1493] R13: 0000000000000000 R14: ffd40001e3611400 R15: ffa0000010a37938
[ 104.396739][ T1493] FS: 0000000000000000(0000) GS:ff11007f416c0000(0000) knlGS:0000000000000000
[ 104.407739][ T1493] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 104.416072][ T1493] CR2: 000055a550b5eb38 CR3: 0000008069078004 CR4: 0000000000f71ee0
[ 104.425986][ T1493] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 104.435870][ T1493] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 104.445790][ T1493] PKRU: 55555554
[ 104.450668][ T1493] Call Trace:
[ 104.455221][ T1493] <TASK>
[ 104.459360][ T1493] ? die+0x36/0xb0
[ 104.464363][ T1493] ? do_trap+0xda/0x130
[ 104.469839][ T1493] ? __list_del_entry_valid_or_report+0x6e/0xf0
[ 104.477666][ T1493] ? do_error_trap+0x65/0xb0
[ 104.483614][ T1493] ? __list_del_entry_valid_or_report+0x6e/0xf0
[ 104.491418][ T1493] ? exc_invalid_op+0x50/0x70
[ 104.497453][ T1493] ? __list_del_entry_valid_or_report+0x6e/0xf0
[ 104.505246][ T1493] ? asm_exc_invalid_op+0x1a/0x20
[ 104.511655][ T1493] ? __list_del_entry_valid_or_report+0x6e/0xf0
[ 104.519423][ T1493] split_huge_page_to_list+0x3ad/0x5b0
[ 104.526306][ T1493] migrate_pages_batch+0x1f6/0x970
[ 104.532797][ T1493] ? __pfx_compaction_alloc+0x10/0x10
[ 104.539564][ T1493] ? __pfx_compaction_free+0x10/0x10
[ 104.546219][ T1493] ? __pfx_compaction_alloc+0x10/0x10
[ 104.552955][ T1493] migrate_pages_sync+0x99/0x230
[ 104.559201][ T1493] ? __pfx_compaction_alloc+0x10/0x10
[ 104.565917][ T1493] ? __pfx_compaction_free+0x10/0x10
[ 104.572522][ T1493] migrate_pages+0x3d9/0x530
[ 104.578341][ T1493] ? __pfx_compaction_alloc+0x10/0x10
[ 104.585033][ T1493] ? __pfx_compaction_free+0x10/0x10
[ 104.591617][ T1493] compact_zone+0x286/0xa30
[ 104.597313][ T1493] kcompactd_do_work+0x103/0x2f0
[ 104.603487][ T1493] kcompactd+0x238/0x430
[ 104.608873][ T1493] ? __pfx_autoremove_wake_function+0x10/0x10
[ 104.616315][ T1493] ? __pfx_kcompactd+0x10/0x10
[ 104.622284][ T1493] kthread+0xcd/0x130
[ 104.627371][ T1493] ? __pfx_kthread+0x10/0x10
[ 104.633117][ T1493] ret_from_fork+0x31/0x70
[ 104.638664][ T1493] ? __pfx_kthread+0x10/0x10
[ 104.644390][ T1493] ret_from_fork_asm+0x1b/0x30
[ 104.650309][ T1493] </TASK>
[ 104.654264][ T1493] Modules linked in: xfs loop btrfs intel_rapl_msr blake2b_generic intel_rapl_common xor ses x86_pkg_temp_thermal enclosure raid6_pq sd_mod scsi_transport_sas intel_powerclamp libcrc32c sg coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel nvme sha512_ssse3 ahci nvme_core rapl ast ipmi_ssif libahci t10_pi mei_me intel_cstate drm_shmem_helper crc64_rocksoft_generic i2c_i801 crc64_rocksoft acpi_ipmi drm_kms_helper megaraid_sas joydev dax_hmem intel_uncore libata mei i2c_ismt crc64 i2c_smbus wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter drm fuse ip_tables
[ 104.717626][ T1493] ---[ end trace 0000000000000000 ]---
[ 104.807226][ T1493] RIP: 0010:__list_del_entry_valid_or_report+0x6e/0xf0
[ 104.815628][ T1493] Code: b8 01 00 00 00 c3 cc cc cc cc 48 89 fe 48 c7 c7 80 c1 71 82 e8 e3 37 a3 ff 0f 0b 48 89 fe 48 c7 c7 b0 c1 71 82 e8 d2 37 a3 ff <0f> 0b 48 89 fe 48 c7 c7 e0 c1 71 82 e8 c1 37 a3 ff 0f 0b 48 89 fe
[ 104.838334][ T1493] RSP: 0018:ffa0000010a37910 EFLAGS: 00010046
[ 104.845773][ T1493] RAX: 0000000000000033 RBX: ff110080749b5ab8 RCX: 0000000000000000
[ 104.855343][ T1493] RDX: 0000000000000000 RSI: ff11007f416dc6c0 RDI: ff11007f416dc6c0
[ 104.864898][ T1493] RBP: ff110040b00af858 R08: 0000000000000000 R09: ffa0000010a377b8
[ 104.874452][ T1493] R10: 0000000000000003 R11: ff11007f40dfffe8 R12: ffd40001e3611400
[ 104.884001][ T1493] R13: 0000000000000000 R14: ffd40001e3611400 R15: ffa0000010a37938
[ 104.893543][ T1493] FS: 0000000000000000(0000) GS:ff11007f416c0000(0000) knlGS:0000000000000000
[ 104.904152][ T1493] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 104.912119][ T1493] CR2: 000055a550b5eb38 CR3: 0000008069078004 CR4: 0000000000f71ee0
[ 104.921634][ T1493] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 104.931149][ T1493] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 104.940655][ T1493] PKRU: 55555554
[ 104.945174][ T1493] Kernel panic - not syncing: Fatal exception
[ 105.991260][ T1493] Shutting down cpus with NMI
[ 106.046902][ T1493] Kernel Offset: disabled

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-09-21 03:18:01

by Luis Chamberlain

Subject: Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction

On Tue, Sep 12, 2023 at 12:28:11PM -0400, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> Feel free to give comments and ask questions.

How about testing? I'm looking with an eye towards creating a
pathological situation for fragmentation which can be automated, to see
how things go.

Mel Gorman's original artificial fragmentation workload, taken from his
first patches to help with fragmentation avoidance from 2018, suggests he
tried [0]:

------ From 2018
a) Create an XFS filesystem

b) Start 4 fio threads that write a number of 64K files inefficiently.
Inefficiently means that files are created on first access and not
created in advance (fio parameter create_on_open=1) and fallocate is
not used (fallocate=none). With multiple IO issuers this creates a mix
of slab and page cache allocations over time. The total size of the
files is 150% of physical memory so that the slabs and page cache pages
get mixed

c) Warm up a number of fio read-only threads accessing the same files
created in step 2. This part runs for the same length of time it took to
create the files. It'll fault back in old data and further interleave
slab and page cache allocations. As it's now low on memory due to step
2, fragmentation occurs as pageblocks get stolen. While step 3 is still
running, start a process that tries to allocate 75% of memory as huge
pages with a number of threads. The number of threads is based on a
(NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP threads contending with
fio, any other threads or forcing cross-NUMA scheduling. Note that the
test has not been used on a machine with less than 8 cores. The
benchmark records whether huge pages were allocated and what the fault
latency was in microseconds

d) Measure the number of events potentially causing external fragmentation,
the fault latency and the huge page allocation success rate.
------- end of extract
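
For reference, step (b) above corresponds roughly to a fio job file like
the one below. This is a reconstruction from the description, not Mel's
actual job file; the directory, file count, and sizes are placeholders
you'd scale to your machine:

```ini
; Sketch of the "inefficient writer" phase from the 2018 recipe.
; Assumed values: scale nrfiles so total file size is ~150% of RAM.
[global]
directory=/mnt/xfs        ; the freshly created XFS filesystem
create_on_open=1          ; files created on first access, not in advance
fallocate=none            ; no preallocation
bs=64k
filesize=64k

[inefficient-writers]
rw=write
numjobs=4                 ; the 4 fio write threads
nrfiles=10000             ; placeholder: adjust to reach 150% of memory
```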

These days we can probably do a bit more damage. There have been concerns
that LBS support (block size > page size) could worsen fragmentation; one
of the reasons is that any file created, regardless of its size, will
require at least the block size, and if using a 64k block size that means
a 64k allocation for each new file on that 64k block size filesystem, so
clearly you may run out of lower order allocations pretty quickly. You
can also create different large block filesystems too, one for 64k and
another for 32k. Although LBS is new and we're still ironing out the
kinks, if you wanna give it a go we've rebased the patches onto Linus'
tree [1], and if you wanted to ramp up fast you could use kdevops [2],
which lets you pick that branch and also a series of NVMe drives (by
enabling CONFIG_LIBVIRT_EXTRA_STORAGE_DRIVE_NVME) for large IO
experimentation (by enabling CONFIG_VAGRANT_ENABLE_LARGEIO). Creating
different filesystems with large block sizes (64k, 32k, 16k) on a 4k
sector size drive (mkfs.xfs -f -b size=64k -s size=4k) should let you
easily do tons of crazy pathological things.

Are there other known recipes to help test this stuff?
How do we measure success against fragmentation in your patches, exactly?

[0] https://lwn.net/Articles/770235/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus-nobdev
[2] https://github.com/linux-kdevops/kdevops

Luis

2023-09-21 04:57:43

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction

On Wed, Sep 20, 2023 at 07:05:25PM -0700, John Hubbard wrote:
> On 9/20/23 18:16, Luis Chamberlain wrote:
> > On Wed, Sep 20, 2023 at 05:55:51PM -0700, Luis Chamberlain wrote:
> > > Are there other known recipes test help test this stuff?
> >
> > You know, it got me wondering... since how memory fragmented a system
> > might be by just running fstests, because, well, we already have
> > that automated in kdevops and it also has LBS support for all the
> > different large block sizes on 4k sector size. So if we just had a
> > way to "measure" or "quantify" memory fragmentation with a score,
> > we could just tally up how we did after 4 hours of testing for each
> > block size with a set of memory on the guest / target node / cloud
> > system.
> >
> > Luis
>
> I thought about it, and here is one possible way to quantify
> fragmentation with just a single number. Take this with some
> skepticism because it is a first draft sort of thing:
>
> a) Let BLOCKS be the number of 4KB pages (or more generally, then number
> of smallest sized objects allowed) in the area.
>
> b) Let FRAGS be the number of free *or* allocated chunks (no need to
> consider the size of each, as that is automatically taken into
> consideration).
>
> Then:
> fragmentation percentage = (FRAGS / BLOCKS) * 100%
>
> This has some nice properties. For one thing, it's easy to calculate.
> For another, it can discern between these cases:
>
> Assume a 12-page area:
>
> Case 1) 6 pages allocated allocated unevenly:
>
> 1 page allocated | 1 page free | 1 page allocated | 5 pages free | 4 pages allocated
>
> fragmentation = (5 FRAGS / 12 BLOCKS) * 100% = 41.7%
>
> Case 2) 6 pages allocated evenly: every other page is allocated:
>
> fragmentation = (12 FRAGS / 12 BLOCKS) * 100% = 100%

Thanks! Will try this!

BTW stress-ng might also be a nice way to do other pathological things here.

Luis

2023-09-21 06:26:08

by John Hubbard

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction

On 9/20/23 18:16, Luis Chamberlain wrote:
> On Wed, Sep 20, 2023 at 05:55:51PM -0700, Luis Chamberlain wrote:
>> Are there other known recipes test help test this stuff?
>
> You know, it got me wondering... since how memory fragmented a system
> might be by just running fstests, because, well, we already have
> that automated in kdevops and it also has LBS support for all the
> different large block sizes on 4k sector size. So if we just had a
> way to "measure" or "quantify" memory fragmentation with a score,
> we could just tally up how we did after 4 hours of testing for each
> block size with a set of memory on the guest / target node / cloud
> system.
>
> Luis

I thought about it, and here is one possible way to quantify
fragmentation with just a single number. Take this with some
skepticism because it is a first draft sort of thing:

a) Let BLOCKS be the number of 4KB pages (or more generally, the number
of smallest-sized objects allowed) in the area.

b) Let FRAGS be the number of free *or* allocated chunks (no need to
consider the size of each, as that is automatically taken into
consideration).

Then:
fragmentation percentage = (FRAGS / BLOCKS) * 100%

This has some nice properties. For one thing, it's easy to calculate.
For another, it can discern between these cases:

Assume a 12-page area:

Case 1) 6 pages allocated unevenly:

1 page allocated | 1 page free | 1 page allocated | 5 pages free | 4 pages allocated

fragmentation = (5 FRAGS / 12 BLOCKS) * 100% = 41.7%

Case 2) 6 pages allocated evenly: every other page is allocated:

fragmentation = (12 FRAGS / 12 BLOCKS) * 100% = 100%
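
As a quick sanity check, the metric is trivial to compute; here is a
small Python sketch (purely illustrative) that reproduces the two cases
above, modeling an area as a list of per-page allocated flags:

```python
from itertools import groupby

def fragmentation_pct(area):
    """FRAGS / BLOCKS * 100, per the definition above.

    `area` is a list of booleans, one per smallest-sized block
    (True = allocated, False = free). FRAGS is the number of maximal
    runs of same-state blocks, free *or* allocated.
    """
    blocks = len(area)
    frags = sum(1 for _state, _run in groupby(area))
    return frags / blocks * 100

# Case 1: 1 alloc | 1 free | 1 alloc | 5 free | 4 alloc -> 5 frags
case1 = [True] + [False] + [True] + [False] * 5 + [True] * 4
# Case 2: every other page allocated -> 12 frags
case2 = [i % 2 == 0 for i in range(12)]
```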



thanks,
--
John Hubbard
NVIDIA

2023-09-21 07:04:49

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction

On Wed, Sep 20, 2023 at 05:55:51PM -0700, Luis Chamberlain wrote:
> Are there other known recipes test help test this stuff?

You know, it got me wondering how fragmented a system's memory might
get by just running fstests, because, well, we already have
that automated in kdevops and it also has LBS support for all the
different large block sizes on a 4k sector size. So if we just had a
way to "measure" or "quantify" memory fragmentation with a score,
we could just tally up how we did after 4 hours of testing for each
block size, with a set amount of memory on the guest / target node / cloud
system.

Luis

2023-09-22 02:46:36

by Zi Yan

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction


On 20 Sep 2023, at 23:14, Luis Chamberlain wrote:

> On Wed, Sep 20, 2023 at 07:05:25PM -0700, John Hubbard wrote:
>> On 9/20/23 18:16, Luis Chamberlain wrote:
>>> On Wed, Sep 20, 2023 at 05:55:51PM -0700, Luis Chamberlain wrote:
>>>> Are there other known recipes test help test this stuff?
>>>
>>> You know, it got me wondering... since how memory fragmented a system
>>> might be by just running fstests, because, well, we already have
>>> that automated in kdevops and it also has LBS support for all the
>>> different large block sizes on 4k sector size. So if we just had a
>>> way to "measure" or "quantify" memory fragmentation with a score,
>>> we could just tally up how we did after 4 hours of testing for each
>>> block size with a set of memory on the guest / target node / cloud
>>> system.
>>>
>>> Luis
>>
>> I thought about it, and here is one possible way to quantify
>> fragmentation with just a single number. Take this with some
>> skepticism because it is a first draft sort of thing:
>>
>> a) Let BLOCKS be the number of 4KB pages (or more generally, then number
>> of smallest sized objects allowed) in the area.
>>
>> b) Let FRAGS be the number of free *or* allocated chunks (no need to
>> consider the size of each, as that is automatically taken into
>> consideration).
>>
>> Then:
>> fragmentation percentage = (FRAGS / BLOCKS) * 100%
>>
>> This has some nice properties. For one thing, it's easy to calculate.
>> For another, it can discern between these cases:
>>
>> Assume a 12-page area:
>>
>> Case 1) 6 pages allocated allocated unevenly:
>>
>> 1 page allocated | 1 page free | 1 page allocated | 5 pages free | 4 pages allocated
>>
>> fragmentation = (5 FRAGS / 12 BLOCKS) * 100% = 41.7%
>>
>> Case 2) 6 pages allocated evenly: every other page is allocated:
>>
>> fragmentation = (12 FRAGS / 12 BLOCKS) * 100% = 100%
>
> Thanks! Will try this!
>
> BTW stress-ng might also be a nice way to do other pathalogical things here.
>
> Luis

Thanks. These are all good performance tests and a good fragmentation metric.
I would like to get the patchset working properly first. As I mentioned in
another email, there will be tons of exploration to do to improve >0 order
folio memory compaction, taking into consideration:

1. the distribution of free pages,
2. the goal of compaction, e.g., to allocate a single order folio or reduce
the overall fragmentation level,
3. the runtime cost of compaction, and more.

My patchset aims to provide reasonably working compaction functionality.


In terms of correctness testing, what I have done locally is to:

1. have an XFS partition,
2. create files with various sizes from 4KB to 2MB,
3. mmap each of these files to use one folio at the file size,
4. get the physical addresses of these folios,
5. trigger global memory compaction via sysctl,
6. read the physical addresses of these folios again.
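
For anyone wanting to reproduce something similar, steps 4-6 can be
approximated from userspace via /proc/<pid>/pagemap and the
compact_memory sysctl. A rough Python sketch (my own illustration, not
part of the patchset; real PFNs require CAP_SYS_ADMIN, otherwise
pagemap reports zeros):

```python
import struct

PAGEMAP_ENTRY_SIZE = 8        # one 64-bit entry per virtual page
PFN_MASK = (1 << 55) - 1      # bits 0-54 of an entry hold the PFN

def pagemap_offset(vaddr, page_size=4096):
    """Byte offset of vaddr's entry in /proc/<pid>/pagemap."""
    return (vaddr // page_size) * PAGEMAP_ENTRY_SIZE

def read_pfn(pid, vaddr, page_size=4096):
    """Return the PFN backing vaddr, or None if the page is not present."""
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek(pagemap_offset(vaddr, page_size))
        (entry,) = struct.unpack("<Q", f.read(PAGEMAP_ENTRY_SIZE))
    return (entry & PFN_MASK) if entry & (1 << 63) else None

# Step 5 is then, as root:  echo 1 > /proc/sys/vm/compact_memory
# and step 6 re-reads the PFNs to see which folios moved.
```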

--
Best Regards,
Yan, Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature

2023-10-02 12:59:14

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction

Hi Zi,

On 12/09/2023 17:28, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> Hi all,
>
> This patchset enables >0 order folio memory compaction, which is one of
> the prerequisitions for large folio support[1]. It is on top of
> mm-everything-2023-09-11-22-56.

I've taken a quick look at these and realize I'm not well equipped to provide
much in the way of meaningful review comments; All I can say is thanks for
putting this together, and yes, I think it will become even more important for
my work on anonymous large folios.


>
> Overview
> ===
>
> To support >0 order folio compaction, the patchset changes how free pages used
> for migration are kept during compaction. Free pages used to be split into
> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
> page order stored in page->private is zeroed, and page reference is set to 1).
> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
> on their order without post allocation process. When migrate_pages() asks for
> a new page, one of the free pages, based on the requested page order, is
> then processed and given out.
>
>
> Optimizations
> ===
>
> 1. Free page split is added to increase migration success rate in case
> a source page does not have a matched free page in the free page lists.
> Free page merge is possible but not implemented, since existing
> PFN-based buddy page merge algorithm requires the identification of
> buddy pages, but free pages kept for memory compaction cannot have
> PageBuddy set to avoid confusing other PFN scanners.
>
> 2. Sort source pages in ascending order before migration is added to
> reduce free page split. Otherwise, high order free pages might be
> prematurely split, causing undesired high order folio migration failures.

Not knowing much about how compaction actually works, naively I would imagine
that if you are just trying to free up a known amount of contiguous physical
space, then working through the pages in PFN order is more likely to yield the
result quicker? Unless all of the pages in the set must be successfully migrated
in order to free up the required amount of space...

Thanks,
Ryan

2023-10-09 07:14:59

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction

Hi, Zi,

Thanks for your patch!

Zi Yan <[email protected]> writes:

> From: Zi Yan <[email protected]>
>
> Hi all,
>
> This patchset enables >0 order folio memory compaction, which is one of
> the prerequisitions for large folio support[1]. It is on top of
> mm-everything-2023-09-11-22-56.
>
> Overview
> ===
>
> To support >0 order folio compaction, the patchset changes how free pages used
> for migration are kept during compaction.

migrate_pages() can split a large folio on allocation failure. So
the minimal implementation could be:

- allow migrating large folios in compaction
- return -ENOMEM for order > 0 in compaction_alloc()

The performance may not be desirable. But that may be a baseline for
further optimization.

And, if we can measure the performance for each step of optimization,
that will be even better.

> Free pages used to be split into
> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
> page order stored in page->private is zeroed, and page reference is set to 1).
> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
> on their order without post allocation process. When migrate_pages() asks for
> a new page, one of the free pages, based on the requested page order, is
> then processed and given out.
>
>
> Optimizations
> ===
>
> 1. Free page split is added to increase migration success rate in case
> a source page does not have a matched free page in the free page lists.
> Free page merge is possible but not implemented, since existing
> PFN-based buddy page merge algorithm requires the identification of
> buddy pages, but free pages kept for memory compaction cannot have
> PageBuddy set to avoid confusing other PFN scanners.
>
> 2. Sort source pages in ascending order before migration is added to

Trivial.

s/ascending/descending/

> reduce free page split. Otherwise, high order free pages might be
> prematurely split, causing undesired high order folio migration failures.
>
>
> TODOs
> ===
>
> 1. Refactor free page post allocation and free page preparation code so
> that compaction_alloc() and compaction_free() can call functions instead
> of hard coding.
>
> 2. One possible optimization is to allow migrate_pages() to continue
> even if get_new_folio() returns a NULL. In general, that means there is
> not enough memory. But in >0 order folio compaction case, that means
> there is no suitable free page at source page order. It might be better
> to skip that page and finish the rest of migration to achieve a better
> compaction result.

We can split the source folio if get_new_folio() returns NULL. So, do
we really need this?

In general, we may reconsider all further optimizations given splitting
is available already.

> 3. Another possible optimization is to enable free page merge. It is
> possible that a to-be-migrated page causes free page split then fails to
> migrate eventually. We would lose a high order free page without free
> page merge function. But a way of identifying free pages for memory
> compaction is needed to reuse existing PFN-based buddy page merge.
>
> 4. The implemented >0 order folio compaction algorithm is quite naive
> and does not consider all possible situations. A better algorithm can
> improve compaction success rate.
>
>
> Feel free to give comments and ask questions.
>
> Thanks.
>
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>
> Zi Yan (4):
> mm/compaction: add support for >0 order folio memory compaction.
> mm/compaction: optimize >0 order folio compaction with free page
> split.
> mm/compaction: optimize >0 order folio compaction by sorting source
> pages.
> mm/compaction: enable compacting >0 order folios.
>
> mm/compaction.c | 205 +++++++++++++++++++++++++++++++++++++++---------
> mm/internal.h | 7 +-
> 2 files changed, 176 insertions(+), 36 deletions(-)

--
Best Regards,
Huang, Ying

2023-10-09 13:25:59

by Zi Yan

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction

On 2 Oct 2023, at 8:32, Ryan Roberts wrote:

> Hi Zi,
>
> On 12/09/2023 17:28, Zi Yan wrote:
>> From: Zi Yan <[email protected]>
>>
>> Hi all,
>>
>> This patchset enables >0 order folio memory compaction, which is one of
>> the prerequisitions for large folio support[1]. It is on top of
>> mm-everything-2023-09-11-22-56.
>
> I've taken a quick look at these and realize I'm not well equipped to provide
> much in the way of meaningful review comments; All I can say is thanks for
> putting this together, and yes, I think it will become even more important for
> my work on anonymous large folios.
>
>
>>
>> Overview
>> ===
>>
>> To support >0 order folio compaction, the patchset changes how free pages used
>> for migration are kept during compaction. Free pages used to be split into
>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>> page order stored in page->private is zeroed, and page reference is set to 1).
>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>> on their order without post allocation process. When migrate_pages() asks for
>> a new page, one of the free pages, based on the requested page order, is
>> then processed and given out.
>>
>>
>> Optimizations
>> ===
>>
>> 1. Free page split is added to increase migration success rate in case
>> a source page does not have a matched free page in the free page lists.
>> Free page merge is possible but not implemented, since existing
>> PFN-based buddy page merge algorithm requires the identification of
>> buddy pages, but free pages kept for memory compaction cannot have
>> PageBuddy set to avoid confusing other PFN scanners.
>>
>> 2. Sort source pages in ascending order before migration is added to
>> reduce free page split. Otherwise, high order free pages might be
>> prematurely split, causing undesired high order folio migration failures.
>
> Not knowing much about how compaction actually works, naively I would imagine
> that if you are just trying to free up a known amount of contiguous physical
> space, then working through the pages in PFN order is more likely to yield the
> result quicker? Unless all of the pages in the set must be successfully migrated
> in order to free up the required amount of space...

During compaction, pages are not freed, since that is the job of page reclaim.
The goal of compaction is to get a high order free page without freeing existing
pages, to avoid potentially high cost IO operations. If compaction does not work,
page reclaim would free pages to get us there (and potentially a follow-up
compaction). So pages are either migrated or stay where they are during compaction.

BTW compaction works by scanning in-use pages from lower PFN to higher PFN,
and free pages from higher PFN to lower PFN, until the two scanners meet in the
middle.
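
The two-scanner walk can be pictured with a toy model (purely
illustrative, ignoring pageblocks, skip bits, and all the real
bookkeeping; 'U' marks an in-use movable page, 'F' a free page):

```python
def scan(pages):
    """Toy compaction scanners: the migration scanner walks up from
    PFN 0 collecting in-use ('U') pages to migrate; the free scanner
    walks down from the top collecting free ('F') pages as targets.
    They stop once they cross, so only the lower-PFN in-use pages and
    higher-PFN free pages are ever paired up."""
    lo, hi = 0, len(pages) - 1
    movable, targets = [], []
    while lo < hi:
        if pages[lo] == 'U':
            movable.append(lo)
        lo += 1
        if pages[hi] == 'F':
            targets.append(hi)
        hi -= 1
    return movable, targets

# In-use pages at PFNs 0 and 2 get free targets at PFNs 6, 5, 4:
print(scan(list("UFUFFFFU")))
```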

--
Best Regards,
Yan, Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature

2023-10-09 13:44:27

by Zi Yan

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction

On 9 Oct 2023, at 3:12, Huang, Ying wrote:

> Hi, Zi,
>
> Thanks for your patch!
>
> Zi Yan <[email protected]> writes:
>
>> From: Zi Yan <[email protected]>
>>
>> Hi all,
>>
>> This patchset enables >0 order folio memory compaction, which is one of
>> the prerequisitions for large folio support[1]. It is on top of
>> mm-everything-2023-09-11-22-56.
>>
>> Overview
>> ===
>>
>> To support >0 order folio compaction, the patchset changes how free pages used
>> for migration are kept during compaction.
>
> migrate_pages() can split the large folio for allocation failure. So
> the minimal implementation could be
>
> - allow to migrate large folios in compaction
> - return -ENOMEM for order > 0 in compaction_alloc()
>
> The performance may be not desirable. But that may be a baseline for
> further optimization.

I would imagine it might cause a regression since compaction might gradually
split high order folios in the system. But I can move Patch 4 first to make this
the baseline and see how system performance changes.

>
> And, if we can measure the performance for each step of optimization,
> that will be even better.

Do you have any benchmark in mind for the performance tests? vm-scalability?

>
>> Free pages used to be split into
>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>> page order stored in page->private is zeroed, and page reference is set to 1).
>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>> on their order without post allocation process. When migrate_pages() asks for
>> a new page, one of the free pages, based on the requested page order, is
>> then processed and given out.
>>
>>
>> Optimizations
>> ===
>>
>> 1. Free page split is added to increase migration success rate in case
>> a source page does not have a matched free page in the free page lists.
>> Free page merge is possible but not implemented, since existing
>> PFN-based buddy page merge algorithm requires the identification of
>> buddy pages, but free pages kept for memory compaction cannot have
>> PageBuddy set to avoid confusing other PFN scanners.
>>
>> 2. Sort source pages in ascending order before migration is added to
>
> Trivial.
>
> s/ascending/descending/
>
>> reduce free page split. Otherwise, high order free pages might be
>> prematurely split, causing undesired high order folio migration failures.
>>
>>
>> TODOs
>> ===
>>
>> 1. Refactor free page post allocation and free page preparation code so
>> that compaction_alloc() and compaction_free() can call functions instead
>> of hard coding.
>>
>> 2. One possible optimization is to allow migrate_pages() to continue
>> even if get_new_folio() returns a NULL. In general, that means there is
>> not enough memory. But in >0 order folio compaction case, that means
>> there is no suitable free page at source page order. It might be better
>> to skip that page and finish the rest of migration to achieve a better
>> compaction result.
>
> We can split the source folio if get_new_folio() returns NULL. So, do
> we really need this?

It depends. The situation it can benefit is when the system is about to
allocate a high order free page and triggers a compaction: it is possible to
get the high order free page by migrating a bunch of base pages instead of
splitting an existing high order folio.

>
> In general, we may reconsider all further optimizations given splitting
> is available already.

In my mind, splits should be avoided as much as possible. But it really depends
on the actual situation, e.g., how much effort and cost the compaction wants
to pay to get memory defragmented. If the system really wants to get a high
order free page at any cost, split can be used without any issue. But applications
might lose performance because existing large folios are split just to create a
new one.

Like I said in the other email, there are tons of optimizations and policies for us
to explore. We can start with the bare minimum support (if no performance
regression is observed, we can even start with splitting all high order folios like
you suggested) and add optimizations one by one.

>
>> 3. Another possible optimization is to enable free page merge. It is
>> possible that a to-be-migrated page causes free page split then fails to
>> migrate eventually. We would lose a high order free page without free
>> page merge function. But a way of identifying free pages for memory
>> compaction is needed to reuse existing PFN-based buddy page merge.
>>
>> 4. The implemented >0 order folio compaction algorithm is quite naive
>> and does not consider all possible situations. A better algorithm can
>> improve compaction success rate.
>>
>>
>> Feel free to give comments and ask questions.
>>
>> Thanks.
>>
>>
>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>>
>> Zi Yan (4):
>> mm/compaction: add support for >0 order folio memory compaction.
>> mm/compaction: optimize >0 order folio compaction with free page
>> split.
>> mm/compaction: optimize >0 order folio compaction by sorting source
>> pages.
>> mm/compaction: enable compacting >0 order folios.
>>
>> mm/compaction.c | 205 +++++++++++++++++++++++++++++++++++++++---------
>> mm/internal.h | 7 +-
>> 2 files changed, 176 insertions(+), 36 deletions(-)
>
> --
> Best Regards,
> Huang, Ying


--
Best Regards,
Yan, Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature

2023-10-09 14:11:01

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction

On 09/10/2023 14:24, Zi Yan wrote:
> On 2 Oct 2023, at 8:32, Ryan Roberts wrote:
>
>> Hi Zi,
>>
>> On 12/09/2023 17:28, Zi Yan wrote:
>>> From: Zi Yan <[email protected]>
>>>
>>> Hi all,
>>>
>>> This patchset enables >0 order folio memory compaction, which is one of
>>> the prerequisitions for large folio support[1]. It is on top of
>>> mm-everything-2023-09-11-22-56.
>>
>> I've taken a quick look at these and realize I'm not well equipped to provide
>> much in the way of meaningful review comments; All I can say is thanks for
>> putting this together, and yes, I think it will become even more important for
>> my work on anonymous large folios.
>>
>>
>>>
>>> Overview
>>> ===
>>>
>>> To support >0 order folio compaction, the patchset changes how free pages used
>>> for migration are kept during compaction. Free pages used to be split into
>>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>>> page order stored in page->private is zeroed, and page reference is set to 1).
>>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>>> on their order without post allocation process. When migrate_pages() asks for
>>> a new page, one of the free pages, based on the requested page order, is
>>> then processed and given out.
>>>
>>>
>>> Optimizations
>>> ===
>>>
>>> 1. Free page split is added to increase migration success rate in case
>>> a source page does not have a matched free page in the free page lists.
>>> Free page merge is possible but not implemented, since existing
>>> PFN-based buddy page merge algorithm requires the identification of
>>> buddy pages, but free pages kept for memory compaction cannot have
>>> PageBuddy set to avoid confusing other PFN scanners.
>>>
>>> 2. Sort source pages in ascending order before migration is added to
>>> reduce free page split. Otherwise, high order free pages might be
>>> prematurely split, causing undesired high order folio migration failures.
>>
>> Not knowing much about how compaction actually works, naively I would imagine
>> that if you are just trying to free up a known amount of contiguous physical
>> space, then working through the pages in PFN order is more likely to yield the
>> result quicker? Unless all of the pages in the set must be successfully migrated
>> in order to free up the required amount of space...
>
> During compaction, pages are not freed, since that is the job of page reclaim.

Sorry yes - my fault for using sloppy language. When I said "free up a known
amount of contiguous physical space", I really meant "move pages in order to
recover an amount of contiguous physical space". But I still think the rest of
what I said applies; wouldn't you be more likely to reach your goal quicker if
you sort by PFN?

> The goal of compaction is to get a high order free page without freeing existing
> pages to avoid potential high cost IO operations. If compaction does not work,
> page reclaim would free pages to get us there (and potentially another follow-up
> compaction). So either pages are migrated or stay where they are during compaction.
>
> BTW compaction works by scanning in use pages from lower PFN to higher PFN,
> and free pages from higher PFN to lower PFN until two scanners meet in the middle.
>
> --
> Best Regards,
> Yan, Zi

2023-10-09 15:52:47

by Zi Yan

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction

(resent as plain text)
On 9 Oct 2023, at 10:10, Ryan Roberts wrote:

> On 09/10/2023 14:24, Zi Yan wrote:
>> On 2 Oct 2023, at 8:32, Ryan Roberts wrote:
>>
>>> Hi Zi,
>>>
>>> On 12/09/2023 17:28, Zi Yan wrote:
>>>> From: Zi Yan <[email protected]>
>>>>
>>>> Hi all,
>>>>
>>>> This patchset enables >0 order folio memory compaction, which is one of
>>>> the prerequisitions for large folio support[1]. It is on top of
>>>> mm-everything-2023-09-11-22-56.
>>>
>>> I've taken a quick look at these and realize I'm not well equipped to provide
>>> much in the way of meaningful review comments; All I can say is thanks for
>>> putting this together, and yes, I think it will become even more important for
>>> my work on anonymous large folios.
>>>
>>>
>>>>
>>>> Overview
>>>> ===
>>>>
>>>> To support >0 order folio compaction, the patchset changes how free pages used
>>>> for migration are kept during compaction. Free pages used to be split into
>>>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>>>> page order stored in page->private is zeroed, and page reference is set to 1).
>>>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>>>> on their order without post allocation process. When migrate_pages() asks for
>>>> a new page, one of the free pages, based on the requested page order, is
>>>> then processed and given out.
>>>>
>>>>
>>>> Optimizations
>>>> ===
>>>>
>>>> 1. Free page split is added to increase migration success rate in case
>>>> a source page does not have a matched free page in the free page lists.
>>>> Free page merge is possible but not implemented, since existing
>>>> PFN-based buddy page merge algorithm requires the identification of
>>>> buddy pages, but free pages kept for memory compaction cannot have
>>>> PageBuddy set to avoid confusing other PFN scanners.
>>>>
>>>> 2. Sort source pages in ascending order before migration is added to
>>>> reduce free page split. Otherwise, high order free pages might be
>>>> prematurely split, causing undesired high order folio migration failures.
>>>
>>> Not knowing much about how compaction actually works, naively I would imagine
>>> that if you are just trying to free up a known amount of contiguous physical
>>> space, then working through the pages in PFN order is more likely to yield the
>>> result quicker? Unless all of the pages in the set must be successfully migrated
>>> in order to free up the required amount of space...
>>
>> During compaction, pages are not freed, since that is the job of page reclaim.
>
> Sorry yes - my fault for using sloppy language. When I said "free up a known
> amount of contiguous physical space", I really meant "move pages in order to
> recover an amount of contiguous physical space". But I still think the rest of
> what I said applies; wouldn't you be more likely to reach your goal quicker if
> you sort by PFN?

Not always. Suppose the in-use folios on the left are order-2, order-2, order-4
(all contiguous in one pageblock) and the free pages on the right are order-4
(pageblock N) and order-2, order-2 (pageblock N-1); it is not a single order-8
free range, since there are in-use folios in the middle. Going in PFN order will
not get you an order-8 free page, since the first order-4 free page will be split
into two order-2 pieces for the first two order-2 in-use folios. But if you
migrate in descending order of in-use folio orders, you can get an order-8 free
page at the end.

The patchset minimizes free page splits to avoid the situation described above,
since once a high order free page is split, the opportunity to migrate a high
order in-use folio into it is gone and hardly recoverable.
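
To make that concrete, here is a toy simulation (my own sketch, not code
from the patchset): it tracks only free page orders, ignores PFN
geometry, and models the free scanner handing over pages lazily in scan
order, with buddy-style splitting when no exact-order page is on hand:

```python
def buddy_split(order, want):
    """Buddy-style split of a free page of `order` down to `want`:
    each split halves the block, leaving one buddy of each order from
    `want` up to `order` - 1 behind as smaller free pages."""
    return list(range(want, order))

def migrate(requests, scan_order):
    """Toy compaction: free pages appear incrementally in `scan_order`;
    each source folio takes the smallest sufficient free page,
    splitting it if needed. Returns (successes, splits)."""
    free, pending = [], list(scan_order)
    successes = splits = 0
    for want in requests:
        # keep scanning until a big-enough free page is on hand
        while not any(o >= want for o in free) and pending:
            free.append(pending.pop(0))
        fits = [o for o in free if o >= want]
        if not fits:
            continue              # migration of this folio fails
        pick = min(fits)          # best fit among what was scanned
        free.remove(pick)
        if pick > want:
            splits += 1
            free.extend(buddy_split(pick, want))
        successes += 1
    return successes, splits

# Source folios are order 2, 2, 4; the free scanner finds 4, 2, 2.
print(migrate([2, 2, 4], [4, 2, 2]))   # PFN order: the order-4 splits
print(migrate([4, 2, 2], [4, 2, 2]))   # descending order: no splits
```

In PFN order the order-4 free page is split for the first order-2 folio
and the order-4 folio then fails; in descending order everything
migrates with zero splits.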


>> The goal of compaction is to get a high order free page without freeing existing
>> pages to avoid potential high cost IO operations. If compaction does not work,
>> page reclaim would free pages to get us there (and potentially another follow-up
>> compaction). So either pages are migrated or stay where they are during compaction.
>>
>> BTW compaction works by scanning in use pages from lower PFN to higher PFN,
>> and free pages from higher PFN to lower PFN until two scanners meet in the middle.
>>
>> --
>> Best Regards,
>> Yan, Zi


Best Regards,
Yan, Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature

2023-10-10 06:10:38

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction


Something went wrong with my mailbox. Sorry if you received duplicated
mail.

Zi Yan <[email protected]> writes:

> On 9 Oct 2023, at 3:12, Huang, Ying wrote:
>
>> Hi, Zi,
>>
>> Thanks for your patch!
>>
>> Zi Yan <[email protected]> writes:
>>
>>> From: Zi Yan <[email protected]>
>>>
>>> Hi all,
>>>
>>> This patchset enables >0 order folio memory compaction, which is one of
>>> the prerequisites for large folio support[1]. It is on top of
>>> mm-everything-2023-09-11-22-56.
>>>
>>> Overview
>>> ===
>>>
>>> To support >0 order folio compaction, the patchset changes how free pages used
>>> for migration are kept during compaction.
>>
>> migrate_pages() can split the large folio on allocation failure. So
>> the minimal implementation could be
>>
>> - allow to migrate large folios in compaction
>> - return -ENOMEM for order > 0 in compaction_alloc()
>>
>> The performance may not be desirable. But that may be a baseline for
>> further optimization.
>
> I would imagine it might cause a regression since compaction might gradually
> split high order folios in the system.

I would not call it a pure regression, since large folios can still be
migrated during compaction with that approach, but it's possible that
this hurts performance.

Anyway, this can be a not-so-good minimal baseline.

> But I can move Patch 4 first to make this the baseline and see how
> system performance changes.

Thanks!

>>
>> And, if we can measure the performance for each step of optimization,
>> that will be even better.
>
> Do you have any benchmark in mind for the performance tests? vm-scalability?

I remember Mel Gorman has done some tests for defragmentation before.
But that's for order-0 pages.

>>> Free pages used to be split into
>>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>>> page order stored in page->private is zeroed, and page reference is set to 1).
>>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>>> on their order without post allocation process. When migrate_pages() asks for
>>> a new page, one of the free pages, based on the requested page order, is
>>> then processed and given out.
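As a rough sketch of that bookkeeping (a toy Python model, not the kernel implementation; the class and method names are made up for illustration):

```python
# Toy model of the reworked free-page bookkeeping (not kernel code):
# instead of splitting everything to order 0 up front, isolated free
# pages sit in a per-order array of lists and are only post-processed
# (PageBuddy cleared, order zeroed, refcount set) when migrate_pages()
# asks for a page of that order.

MAX_ORDER = 10

class FreePages:
    def __init__(self):
        # one list per order, indices 0 .. MAX_ORDER
        self.lists = [[] for _ in range(MAX_ORDER + 1)]

    def stash(self, pfn, order):
        """Keep an isolated free page without post-allocation processing."""
        self.lists[order].append(pfn)

    def take(self, order):
        """compaction_alloc()-like: hand out one page of exactly `order`."""
        if self.lists[order]:
            pfn = self.lists[order].pop()
            return pfn  # post-allocation processing would happen here
        return None     # caller may then try a split, or fail

fp = FreePages()
fp.stash(pfn=512, order=4)
fp.stash(pfn=528, order=2)
assert fp.take(2) == 528
assert fp.take(2) is None   # no order-2 left; the order-4 page is untouched
assert fp.take(4) == 512
```

The point of the per-order lists is visible in the second `take(2)`: with no exact match, the order-4 page is left intact rather than being pre-split, leaving the split-or-fail decision to the caller.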
>>>
>>>
>>> Optimizations
>>> ===
>>>
>>> 1. Free page split is added to increase migration success rate in case
>>> a source page does not have a matched free page in the free page lists.
>>> Free page merge is possible but not implemented, since existing
>>> PFN-based buddy page merge algorithm requires the identification of
>>> buddy pages, but free pages kept for memory compaction cannot have
>>> PageBuddy set to avoid confusing other PFN scanners.
>>>
>>> 2. Sort source pages in ascending order before migration is added to
>>
>> Trivial.
>>
>> s/ascending/descending/
>>
>>> reduce free page split. Otherwise, high order free pages might be
>>> prematurely split, causing undesired high order folio migration failures.
>>>
>>>
>>> TODOs
>>> ===
>>>
>>> 1. Refactor free page post allocation and free page preparation code so
>>> that compaction_alloc() and compaction_free() can call functions instead
>>> of hard coding.
>>>
>>> 2. One possible optimization is to allow migrate_pages() to continue
>> even if get_new_folio() returns NULL. In general, that means there is
>>> not enough memory. But in >0 order folio compaction case, that means
>>> there is no suitable free page at source page order. It might be better
>>> to skip that page and finish the rest of migration to achieve a better
>>> compaction result.
>>
>> We can split the source folio if get_new_folio() returns NULL. So, do
>> we really need this?
>
> It depends. The situation where it helps is when the system is about to
> allocate a high order free page and triggers compaction: it may be possible
> to get the high order free page by migrating a bunch of base pages instead
> of splitting an existing high order folio.
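That skip-instead-of-abort idea could be modeled roughly like this (plain Python, illustrative only; migrate_batch and the dict-based free counts are invented for the sketch):

```python
# Toy model of the TODO: let the migration loop skip a folio whose order
# has no suitable free page, instead of aborting the whole batch
# (not the real migrate_pages()).

def migrate_batch(src_orders, get_new_folio):
    """Return (migrated, skipped) lists of folio orders."""
    migrated, skipped = [], []
    for order in src_orders:
        if get_new_folio(order) is None:
            skipped.append(order)   # no free page of this order: skip it,
            continue                # keep compacting the rest
        migrated.append(order)
    return migrated, skipped

# Invented free-page counts: two order-0 pages and one order-2 page.
free = {0: 2, 2: 1}

def get_new_folio(order):
    if free.get(order, 0) > 0:
        free[order] -= 1
        return order
    return None

assert migrate_batch([2, 4, 0, 0], get_new_folio) == ([2, 0, 0], [4])
```

The order-4 folio is skipped for lack of a matching free page, while the rest of the batch still migrates, which is the better-compaction-result behavior the TODO describes.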
>
>>
>> In general, we may reconsider all further optimizations given splitting
>> is available already.
>
> In my mind, split should be avoided as much as possible.

If so, should we use the "nosplit" logic in migrate_pages_batch() in some
situations?

> But it really depends
> on the actual situation, e.g., how much effort and cost compaction is willing
> to pay to get memory defragmented. If the system really wants to get a high
> order free page at any cost, split can be used without any issue. But
> applications might lose performance because existing large folios are split
> just to form a new one.

Is it possible that splitting is desirable in some situation? For
example, allocate some large DMA buffers at the cost of large anonymous
folios?

> Like I said in the email, there are tons of optimizations and policies for us
> to explore. We can start with the bare minimum support (if no performance
> regression is observed, we can even start with split all high folios like you
> suggested) and add optimizations one by one.

Sounds good to me! Thanks!

>>
>>> 3. Another possible optimization is to enable free page merge. It is
>>> possible that a to-be-migrated page causes free page split then fails to
>>> migrate eventually. We would lose a high order free page without free
>>> page merge function. But a way of identifying free pages for memory
>>> compaction is needed to reuse existing PFN-based buddy page merge.
>>>
>>> 4. The implemented >0 order folio compaction algorithm is quite naive
>>> and does not consider all possible situations. A better algorithm can
>>> improve compaction success rate.
>>>
>>>
>>> Feel free to give comments and ask questions.
>>>
>>> Thanks.
>>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>>>
>>> Zi Yan (4):
>>> mm/compaction: add support for >0 order folio memory compaction.
>>> mm/compaction: optimize >0 order folio compaction with free page
>>> split.
>>> mm/compaction: optimize >0 order folio compaction by sorting source
>>> pages.
>>> mm/compaction: enable compacting >0 order folios.
>>>
>>> mm/compaction.c | 205 +++++++++++++++++++++++++++++++++++++++---------
>>> mm/internal.h | 7 +-
>>> 2 files changed, 176 insertions(+), 36 deletions(-)

--
Best Regards,
Huang, Ying

2023-10-10 10:00:52

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction

On 09/10/2023 16:52, Zi Yan wrote:
> (resent as plain text)
> On 9 Oct 2023, at 10:10, Ryan Roberts wrote:
>
>> On 09/10/2023 14:24, Zi Yan wrote:
>>> On 2 Oct 2023, at 8:32, Ryan Roberts wrote:
>>>
>>>> Hi Zi,
>>>>
>>>> On 12/09/2023 17:28, Zi Yan wrote:
>>>>> From: Zi Yan <[email protected]>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> This patchset enables >0 order folio memory compaction, which is one of
>>>>> the prerequisites for large folio support[1]. It is on top of
>>>>> mm-everything-2023-09-11-22-56.
>>>>
>>>> I've taken a quick look at these and realize I'm not well equipped to provide
>>>> much in the way of meaningful review comments; All I can say is thanks for
>>>> putting this together, and yes, I think it will become even more important for
>>>> my work on anonymous large folios.
>>>>
>>>>
>>>>>
>>>>> Overview
>>>>> ===
>>>>>
>>>>> To support >0 order folio compaction, the patchset changes how free pages used
>>>>> for migration are kept during compaction. Free pages used to be split into
>>>>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>>>>> page order stored in page->private is zeroed, and page reference is set to 1).
>>>>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>>>>> on their order without post allocation process. When migrate_pages() asks for
>>>>> a new page, one of the free pages, based on the requested page order, is
>>>>> then processed and given out.
>>>>>
>>>>>
>>>>> Optimizations
>>>>> ===
>>>>>
>>>>> 1. Free page split is added to increase migration success rate in case
>>>>> a source page does not have a matched free page in the free page lists.
>>>>> Free page merge is possible but not implemented, since existing
>>>>> PFN-based buddy page merge algorithm requires the identification of
>>>>> buddy pages, but free pages kept for memory compaction cannot have
>>>>> PageBuddy set to avoid confusing other PFN scanners.
>>>>>
>>>>> 2. Sort source pages in ascending order before migration is added to
>>>>> reduce free page split. Otherwise, high order free pages might be
>>>>> prematurely split, causing undesired high order folio migration failures.
>>>>
>>>> Not knowing much about how compaction actually works, naively I would imagine
>>>> that if you are just trying to free up a known amount of contiguous physical
>>>> space, then working through the pages in PFN order is more likely to yield the
>>>> result quicker? Unless all of the pages in the set must be successfully migrated
>>>> in order to free up the required amount of space...
>>>
>>> During compaction, pages are not freed, since that is the job of page reclaim.
>>
>> Sorry yes - my fault for using sloppy language. When I said "free up a known
>> amount of contiguous physical space", I really meant "move pages in order to
>> recover an amount of contiguous physical space". But I still think the rest of
>> what I said applies; wouldn't you be more likely to reach your goal quicker if
>> you sort by PFN?
>
> Not always. Suppose the in-use folios on the left are order-2, order-2,
> order-4 (all contiguous in one pageblock) and the free pages on the right are
> order-4 (pageblock N), order-2, order-2 (pageblock N-1); they do not form a
> single order-8 free page, since there are in-use folios in the middle. Going
> in PFN order will not get you an order-8 free page, because the first order-4
> free page will be split into two order-2 pages for the first two order-2
> in-use folios. But if you migrate in descending order of in-use folio orders,
> you can get an order-8 free page at the end.
>
> The patchset minimizes free page splits to avoid the situation described above,
> since once a high order free page is split, the opportunity of migrating a high order
> in-use folio into it is gone and hardly recoverable.

OK I get it now - thanks!

>
>
>>> The goal of compaction is to get a high order free page without freeing existing
>>> pages to avoid potential high cost IO operations. If compaction does not work,
>>> page reclaim would free pages to get us there (and potentially another follow-up
>>> compaction). So either pages are migrated or stay where they are during compaction.
>>>
>>> BTW compaction works by scanning in-use pages from lower PFN to higher PFN,
>>> and free pages from higher PFN to lower PFN until two scanners meet in the middle.
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
>
>
> Best Regards,
> Yan, Zi

2023-10-10 16:48:51

by Zi Yan

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Enable >0 order folio memory compaction

On 10 Oct 2023, at 2:08, Huang, Ying wrote:

> Something went wrong with my mailbox. Sorry if you received duplicated
> mail.
>
> Zi Yan <[email protected]> writes:
>
>> On 9 Oct 2023, at 3:12, Huang, Ying wrote:
>>
>>> Hi, Zi,
>>>
>>> Thanks for your patch!
>>>
>>> Zi Yan <[email protected]> writes:
>>>
>>>> From: Zi Yan <[email protected]>
>>>>
>>>> Hi all,
>>>>
>>>> This patchset enables >0 order folio memory compaction, which is one of
>>>> the prerequisites for large folio support[1]. It is on top of
>>>> mm-everything-2023-09-11-22-56.
>>>>
>>>> Overview
>>>> ===
>>>>
>>>> To support >0 order folio compaction, the patchset changes how free pages used
>>>> for migration are kept during compaction.
>>>
>>> migrate_pages() can split the large folio on allocation failure. So
>>> the minimal implementation could be
>>>
>>> - allow to migrate large folios in compaction
>>> - return -ENOMEM for order > 0 in compaction_alloc()
>>>
>>> The performance may not be desirable. But that may be a baseline for
>>> further optimization.
>>
>> I would imagine it might cause a regression since compaction might gradually
>> split high order folios in the system.
>
> I would not call it a pure regression, since large folios can still be
> migrated during compaction with that approach, but it's possible that
> this hurts performance.
>
> Anyway, this can be a not-so-good minimal baseline.
>
>> But I can move Patch 4 first to make this the baseline and see how
>> system performance changes.
>
> Thanks!
>
>>>
>>> And, if we can measure the performance for each step of optimization,
>>> that will be even better.
>>
>> Do you have any benchmark in mind for the performance tests? vm-scalability?
>
> I remember Mel Gorman has done some tests for defragmentation before.
> But that's for order-0 pages.

OK, I will try to find that.

>
>>>> Free pages used to be split into
>>>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>>>> page order stored in page->private is zeroed, and page reference is set to 1).
>>>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>>>> on their order without post allocation process. When migrate_pages() asks for
>>>> a new page, one of the free pages, based on the requested page order, is
>>>> then processed and given out.
>>>>
>>>>
>>>> Optimizations
>>>> ===
>>>>
>>>> 1. Free page split is added to increase migration success rate in case
>>>> a source page does not have a matched free page in the free page lists.
>>>> Free page merge is possible but not implemented, since existing
>>>> PFN-based buddy page merge algorithm requires the identification of
>>>> buddy pages, but free pages kept for memory compaction cannot have
>>>> PageBuddy set to avoid confusing other PFN scanners.
>>>>
>>>> 2. Sort source pages in ascending order before migration is added to
>>>
>>> Trivial.
>>>
>>> s/ascending/descending/
>>>
>>>> reduce free page split. Otherwise, high order free pages might be
>>>> prematurely split, causing undesired high order folio migration failures.
>>>>
>>>>
>>>> TODOs
>>>> ===
>>>>
>>>> 1. Refactor free page post allocation and free page preparation code so
>>>> that compaction_alloc() and compaction_free() can call functions instead
>>>> of hard coding.
>>>>
>>>> 2. One possible optimization is to allow migrate_pages() to continue
>>>> even if get_new_folio() returns NULL. In general, that means there is
>>>> not enough memory. But in >0 order folio compaction case, that means
>>>> there is no suitable free page at source page order. It might be better
>>>> to skip that page and finish the rest of migration to achieve a better
>>>> compaction result.
>>>
>>> We can split the source folio if get_new_folio() returns NULL. So, do
>>> we really need this?
>>
>> It depends. The situation where it helps is when the system is about to
>> allocate a high order free page and triggers compaction: it may be possible
>> to get the high order free page by migrating a bunch of base pages instead
>> of splitting an existing high order folio.
>>
>>>
>>> In general, we may reconsider all further optimizations given splitting
>>> is available already.
>>
>> In my mind, split should be avoided as much as possible.
>
> If so, should we use the "nosplit" logic in migrate_pages_batch() in some
> situations?

A possible future optimization.

>
>> But it really depends
>> on the actual situation, e.g., how much effort and cost compaction is willing
>> to pay to get memory defragmented. If the system really wants to get a high
>> order free page at any cost, split can be used without any issue. But
>> applications might lose performance because existing large folios are split
>> just to form a new one.
>
> Is it possible that splitting is desirable in some situation? For
> example, allocate some large DMA buffers at the cost of large anonymous
> folios?

Sure. There are definitely cases where splitting is better than not splitting.
But let's leave that until large anonymous folios are deployed.

>
>> Like I said in the email, there are tons of optimizations and policies for us
>> to explore. We can start with the bare minimum support (if no performance
>> regression is observed, we can even start with split all high folios like you
>> suggested) and add optimizations one by one.
>
> Sounds good to me! Thanks!
>
>>>
>>>> 3. Another possible optimization is to enable free page merge. It is
>>>> possible that a to-be-migrated page causes free page split then fails to
>>>> migrate eventually. We would lose a high order free page without free
>>>> page merge function. But a way of identifying free pages for memory
>>>> compaction is needed to reuse existing PFN-based buddy page merge.
>>>>
>>>> 4. The implemented >0 order folio compaction algorithm is quite naive
>>>> and does not consider all possible situations. A better algorithm can
>>>> improve compaction success rate.
>>>>
>>>>
>>>> Feel free to give comments and ask questions.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>>>>
>>>> Zi Yan (4):
>>>> mm/compaction: add support for >0 order folio memory compaction.
>>>> mm/compaction: optimize >0 order folio compaction with free page
>>>> split.
>>>> mm/compaction: optimize >0 order folio compaction by sorting source
>>>> pages.
>>>> mm/compaction: enable compacting >0 order folios.
>>>>
>>>> mm/compaction.c | 205 +++++++++++++++++++++++++++++++++++++++---------
>>>> mm/internal.h | 7 +-
>>>> 2 files changed, 176 insertions(+), 36 deletions(-)
>
> --
> Best Regards,
> Huang, Ying


--
Best Regards,
Yan, Zi

