2023-08-10 14:48:07

by Ryan Roberts

Subject: [PATCH v5 0/5] variable-order, large folios for anonymous memory

Hi All,

This is v5 of a series to implement variable-order, large folios for anonymous
memory (currently called "LARGE_ANON_FOLIO", previously called "FLEXIBLE_THP").
The objective is to improve performance by allocating larger chunks of
memory during anonymous page faults:

1) Since SW (the kernel) is dealing with larger chunks of memory than base
   pages, there are efficiency savings to be had: fewer page faults, batched PTE
   and RMAP manipulation, reduced LRU list management, etc. In short, we reduce
   kernel overhead. This should benefit all architectures.
2) Since we are now mapping physically contiguous chunks of memory, we can take
   advantage of HW TLB compression techniques. A reduction in TLB pressure
   speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
   TLB entries: "the contiguous bit" (architectural) and HPA (uarch).
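
As a rough illustration of point (1), the page-fault saving can be worked out
with simple arithmetic. The sketch below is not kernel code:
faults_to_populate() is a hypothetical helper, and it assumes that every fault
maps exactly one folio of the given order with no fallback to a smaller order.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of this series): number of page faults needed
 * to populate `region` bytes when each fault maps one folio of `order`.
 * Assumes `region` is a multiple of the folio size and no fallback happens.
 */
static size_t faults_to_populate(size_t region, size_t page_size,
				 unsigned int order)
{
	size_t folio_size = page_size << order;

	assert(folio_size != 0 && region % folio_size == 0);
	return region / folio_size;
}
```

Under those assumptions, with 4 KiB base pages a 2 MiB region takes 512 faults
at order-0 but only 64 faults at order-3 (32 KiB folios).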

This patch set deals with the SW side of things (1). (2) is being tackled in a
separate series. The new behaviour is hidden behind a new Kconfig switch,
LARGE_ANON_FOLIO, which is disabled by default, although the eventual aim is to
enable it by default.

My hope is that we are pretty much there with the changes at this point;
hopefully this is sufficient to get an initial version merged so that we can
scale up characterization efforts. However, the patches should not be merged
until the prerequisites are complete. These are in progress and tracked at [5].

This series is based on mm-unstable (ad3232df3e41).

I'm going to be out on holiday from the end of today, returning on 29th
August. So responses will likely be patchy, as I'm terrified of posting
to list from my phone!


Testing
-------

This version adds patches to mm selftests so that the cow tests explicitly test
large anon folios, in the same way that thp is tested. When enabled you should
see something similar at the start of the test suite:

# [INFO] detected large anon folio size: 32 KiB

Then the following results are expected. The fails and skips are due to existing
issues in mm-unstable:

# Totals: pass:207 fail:16 xfail:0 xpass:0 skip:85 error:0

Existing mm selftests reveal 1 regression in khugepaged tests when
LARGE_ANON_FOLIO is enabled:

Run test: collapse_max_ptes_none (khugepaged:anon)
Maybe collapse with max_ptes_none exceeded.... Fail
Unexpected huge page

I believe this is because khugepaged currently skips non-order-0 pages when
looking for collapse opportunities and should get fixed with the help of
DavidH's work to create a mechanism to precisely determine shared vs exclusive
pages.


Changes since v4 [4]
--------------------

- Removed "arm64: mm: Override arch_wants_pte_order()" patch; arm64
  now uses the default order-3 size. I have moved this patch over to
  the contpte series.
- Added "mm: Allow deferred splitting of arbitrary large anon folios" back
  into series. I originally removed this at v2 to add to a separate series,
  but that series has transformed significantly and it no longer fits, so
  bringing it back here.
- Reintroduced dependency on set_ptes(). Originally dropped this at v2, but
  set_ptes() is in mm-unstable now.
- Updated policy for when to allocate LAF; only fall back to order-0 if
  MADV_NOHUGEPAGE is present or if THP is disabled via prctl; no longer rely on
  sysfs's never/madvise/always knob.
- Fall back to order-0 whenever uffd is armed for the vma, not just when
  uffd-wp is set on the pte.
- alloc_anon_folio() now returns `struct folio *`, where errors are encoded
  with ERR_PTR().

The last 3 changes were proposed by Yu Zhao - thanks!
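
For readers unfamiliar with the ERR_PTR() convention mentioned in the last
item, here is a minimal user-space sketch of the idea. The sketch_* names are
made up for illustration and are not the kernel's helpers; the sketch assumes
that the top 4095 values of the address space are never valid pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: errno values fit in [1, 4095]. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *sketch_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* True if the pointer lies in the range reserved for encoded errors. */
static inline bool sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Recover the negative errno value from an error pointer. */
static inline long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}
```

This lets a single return value carry three outcomes: a valid folio, NULL
(treated as out-of-memory by the caller) or an encoded error such as -EAGAIN
(on which do_anonymous_page() returns 0 so that the fault is retried).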


Changes since v3 [3]
--------------------

- Renamed feature from FLEXIBLE_THP to LARGE_ANON_FOLIO.
- Removed `flexthp_unhinted_max` boot parameter. Discussion concluded that a
  sysctl is preferable but we will wait until a real workload needs it.
- Fixed uninitialized `addr` on read fault path in do_anonymous_page().
- Added mm selftests for large anon folios in cow test suite.


Changes since v2 [2]
--------------------

- Dropped commit "Allow deferred splitting of arbitrary large anon folios"
    - Huang, Ying suggested the "batch zap" work (which I dropped from this
      series after v1) is a prerequisite for merging FLEXIBLE_THP, so I've
      moved the deferred split patch to a separate series along with the batch
      zap changes. I plan to submit this series early next week.
- Changed folio order fallback policy
    - We no longer iterate from preferred to 0 looking for acceptable policy
    - Instead we iterate through preferred, PAGE_ALLOC_COSTLY_ORDER and 0 only
- Removed vma parameter from arch_wants_pte_order()
- Added command line parameter `flexthp_unhinted_max`
    - clamps preferred order when vma hasn't explicitly opted-in to THP
- Never allocate large folio for MADV_NOHUGEPAGE vma (or when THP is disabled
  for process or system).
- Simplified implementation and integration with do_anonymous_page()
- Removed dependency on set_ptes()


Changes since v1 [1]
--------------------

- removed changes to arch-dependent vma_alloc_zeroed_movable_folio()
- replaced with arch-independent alloc_anon_folio()
    - follows THP allocation approach
- no longer retry with intermediate orders if allocation fails
    - fallback directly to order-0
- remove folio_add_new_anon_rmap_range() patch
    - instead add its new functionality to folio_add_new_anon_rmap()
- remove batch-zap pte mappings optimization patch
    - remove enabler folio_remove_rmap_range() patch too
    - These offer real perf improvement so will submit separately
- simplify Kconfig
    - single FLEXIBLE_THP option, which is independent of arch
    - depends on TRANSPARENT_HUGEPAGE
    - when enabled default to max anon folio size of 64K unless arch
      explicitly overrides
- simplify changes to do_anonymous_page():
    - no more retry loop


[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/
[4] https://lore.kernel.org/linux-mm/[email protected]/
[5] https://lore.kernel.org/linux-mm/[email protected]/


Thanks,
Ryan

Ryan Roberts (5):
mm: Allow deferred splitting of arbitrary large anon folios
mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
mm: LARGE_ANON_FOLIO for improved performance
selftests/mm/cow: Generalize do_run_with_thp() helper
selftests/mm/cow: Add large anon folio tests

include/linux/pgtable.h | 13 ++
mm/Kconfig | 10 ++
mm/memory.c | 144 +++++++++++++++++--
mm/rmap.c | 31 +++--
tools/testing/selftests/mm/cow.c | 229 ++++++++++++++++++++++---------
5 files changed, 347 insertions(+), 80 deletions(-)

--
2.25.1



2023-08-10 15:11:38

by Ryan Roberts

Subject: [PATCH v5 1/5] mm: Allow deferred splitting of arbitrary large anon folios

In preparation for the introduction of large folios for anonymous
memory, we would like to be able to split them when they have unmapped
subpages, in order to free those unused pages under memory pressure. So
remove the artificial requirement that the large folio needed to be at
least PMD-sized.

Reviewed-by: Yu Zhao <[email protected]>
Reviewed-by: Yin Fengwei <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
mm/rmap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 1f04debdc87a..769fcabc6c56 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1446,11 +1446,11 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
__lruvec_stat_mod_folio(folio, idx, -nr);

/*
- * Queue anon THP for deferred split if at least one
+ * Queue anon large folio for deferred split if at least one
* page of the folio is unmapped and at least one page
* is still mapped.
*/
- if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
+ if (folio_test_large(folio) && folio_test_anon(folio))
if (!compound || nr < nr_pmdmapped)
deferred_split_folio(folio);
}
--
2.25.1
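
The effect of the one-line change above can be sketched as a pair of
predicates. This is a toy model, not kernel code: the toy_* functions and
their scalar parameters are stand-ins for the folio tests and local variables
used in page_remove_rmap(), and `pmd_order` is passed in rather than taken
from the architecture.

```c
#include <assert.h>
#include <stdbool.h>

/* Inner condition, which the patch leaves unchanged. */
static bool toy_inner_condition(bool compound, int nr, int nr_pmdmapped)
{
	return !compound || nr < nr_pmdmapped;
}

/* Before: only anon folios of at least PMD order were queued. */
static bool toy_queue_before(unsigned int order, bool anon,
			     unsigned int pmd_order, bool compound,
			     int nr, int nr_pmdmapped)
{
	return order >= pmd_order && anon &&
	       toy_inner_condition(compound, nr, nr_pmdmapped);
}

/* After: any large (order > 0) anon folio is queued. */
static bool toy_queue_after(unsigned int order, bool anon, bool compound,
			    int nr, int nr_pmdmapped)
{
	return order > 0 && anon &&
	       toy_inner_condition(compound, nr, nr_pmdmapped);
}
```

In this model an order-3 anon folio that loses one pte mapping is ignored by
the old check but queued by the new one, while order-0 folios are never
queued.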


2023-08-10 15:12:46

by Ryan Roberts

Subject: [PATCH v5 4/5] selftests/mm/cow: Generalize do_run_with_thp() helper

do_run_with_thp() prepares THP memory into different states before
running tests. We would like to reuse this logic to also test large anon
folios. So let's add a size parameter which tells the function what size
of memory it should operate on.

Remove references to THP and replace with LARGE, and fix up all existing
call sites to pass thpsize as the required size.

No functional change intended here, but a separate commit will add new
large anon folio tests that use this new capability.

Signed-off-by: Ryan Roberts <[email protected]>
---
tools/testing/selftests/mm/cow.c | 118 ++++++++++++++++---------------
1 file changed, 61 insertions(+), 57 deletions(-)

diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index 7324ce5363c0..304882bf2e5d 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -723,25 +723,25 @@ static void run_with_base_page_swap(test_fn fn, const char *desc)
do_run_with_base_page(fn, true);
}

-enum thp_run {
- THP_RUN_PMD,
- THP_RUN_PMD_SWAPOUT,
- THP_RUN_PTE,
- THP_RUN_PTE_SWAPOUT,
- THP_RUN_SINGLE_PTE,
- THP_RUN_SINGLE_PTE_SWAPOUT,
- THP_RUN_PARTIAL_MREMAP,
- THP_RUN_PARTIAL_SHARED,
+enum large_run {
+ LARGE_RUN_PMD,
+ LARGE_RUN_PMD_SWAPOUT,
+ LARGE_RUN_PTE,
+ LARGE_RUN_PTE_SWAPOUT,
+ LARGE_RUN_SINGLE_PTE,
+ LARGE_RUN_SINGLE_PTE_SWAPOUT,
+ LARGE_RUN_PARTIAL_MREMAP,
+ LARGE_RUN_PARTIAL_SHARED,
};

-static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
+static void do_run_with_large(test_fn fn, enum large_run large_run, size_t size)
{
char *mem, *mmap_mem, *tmp, *mremap_mem = MAP_FAILED;
- size_t size, mmap_size, mremap_size;
+ size_t mmap_size, mremap_size;
int ret;

- /* For alignment purposes, we need twice the thp size. */
- mmap_size = 2 * thpsize;
+ /* For alignment purposes, we need twice the requested size. */
+ mmap_size = 2 * size;
mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (mmap_mem == MAP_FAILED) {
@@ -749,36 +749,40 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
return;
}

- /* We need a THP-aligned memory area. */
- mem = (char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1));
+ /* We need to naturally align the memory area. */
+ mem = (char *)(((uintptr_t)mmap_mem + size) & ~(size - 1));

- ret = madvise(mem, thpsize, MADV_HUGEPAGE);
+ ret = madvise(mem, size, MADV_HUGEPAGE);
if (ret) {
ksft_test_result_fail("MADV_HUGEPAGE failed\n");
goto munmap;
}

/*
- * Try to populate a THP. Touch the first sub-page and test if we get
- * another sub-page populated automatically.
+ * Try to populate a large folio. Touch the first sub-page and test if
+ * we get the last sub-page populated automatically.
*/
mem[0] = 0;
- if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) {
- ksft_test_result_skip("Did not get a THP populated\n");
+ if (!pagemap_is_populated(pagemap_fd, mem + size - pagesize)) {
+ ksft_test_result_skip("Did not get fully populated\n");
goto munmap;
}
- memset(mem, 0, thpsize);
+ memset(mem, 0, size);

- size = thpsize;
- switch (thp_run) {
- case THP_RUN_PMD:
- case THP_RUN_PMD_SWAPOUT:
+ switch (large_run) {
+ case LARGE_RUN_PMD:
+ case LARGE_RUN_PMD_SWAPOUT:
+ if (size != thpsize) {
+ ksft_test_result_fail("test bug: can't PMD-map size\n");
+ goto munmap;
+ }
break;
- case THP_RUN_PTE:
- case THP_RUN_PTE_SWAPOUT:
+ case LARGE_RUN_PTE:
+ case LARGE_RUN_PTE_SWAPOUT:
/*
- * Trigger PTE-mapping the THP by temporarily mapping a single
- * subpage R/O.
+ * Trigger PTE-mapping the large folio by temporarily mapping a
+ * single subpage R/O. This is a noop if the large-folio is not
+ * thpsize (and therefore already PTE-mapped).
*/
ret = mprotect(mem + pagesize, pagesize, PROT_READ);
if (ret) {
@@ -791,25 +795,25 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
goto munmap;
}
break;
- case THP_RUN_SINGLE_PTE:
- case THP_RUN_SINGLE_PTE_SWAPOUT:
+ case LARGE_RUN_SINGLE_PTE:
+ case LARGE_RUN_SINGLE_PTE_SWAPOUT:
/*
- * Discard all but a single subpage of that PTE-mapped THP. What
- * remains is a single PTE mapping a single subpage.
+ * Discard all but a single subpage of that PTE-mapped large
+ * folio. What remains is a single PTE mapping a single subpage.
*/
- ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DONTNEED);
+ ret = madvise(mem + pagesize, size - pagesize, MADV_DONTNEED);
if (ret) {
ksft_test_result_fail("MADV_DONTNEED failed\n");
goto munmap;
}
size = pagesize;
break;
- case THP_RUN_PARTIAL_MREMAP:
+ case LARGE_RUN_PARTIAL_MREMAP:
/*
- * Remap half of the THP. We need some new memory location
- * for that.
+ * Remap half of the large folio. We need some new memory
+ * location for that.
*/
- mremap_size = thpsize / 2;
+ mremap_size = size / 2;
mremap_mem = mmap(NULL, mremap_size, PROT_NONE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (mem == MAP_FAILED) {
@@ -824,13 +828,13 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
}
size = mremap_size;
break;
- case THP_RUN_PARTIAL_SHARED:
+ case LARGE_RUN_PARTIAL_SHARED:
/*
- * Share the first page of the THP with a child and quit the
- * child. This will result in some parts of the THP never
- * have been shared.
+ * Share the first page of the large folio with a child and quit
+ * the child. This will result in some parts of the large folio
+ * never have been shared.
*/
- ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DONTFORK);
+ ret = madvise(mem + pagesize, size - pagesize, MADV_DONTFORK);
if (ret) {
ksft_test_result_fail("MADV_DONTFORK failed\n");
goto munmap;
@@ -844,7 +848,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
}
wait(&ret);
/* Allow for sharing all pages again. */
- ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DOFORK);
+ ret = madvise(mem + pagesize, size - pagesize, MADV_DOFORK);
if (ret) {
ksft_test_result_fail("MADV_DOFORK failed\n");
goto munmap;
@@ -854,10 +858,10 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
assert(false);
}

- switch (thp_run) {
- case THP_RUN_PMD_SWAPOUT:
- case THP_RUN_PTE_SWAPOUT:
- case THP_RUN_SINGLE_PTE_SWAPOUT:
+ switch (large_run) {
+ case LARGE_RUN_PMD_SWAPOUT:
+ case LARGE_RUN_PTE_SWAPOUT:
+ case LARGE_RUN_SINGLE_PTE_SWAPOUT:
madvise(mem, size, MADV_PAGEOUT);
if (!range_is_swapped(mem, size)) {
ksft_test_result_skip("MADV_PAGEOUT did not work, is swap enabled?\n");
@@ -878,49 +882,49 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
static void run_with_thp(test_fn fn, const char *desc)
{
ksft_print_msg("[RUN] %s ... with THP\n", desc);
- do_run_with_thp(fn, THP_RUN_PMD);
+ do_run_with_large(fn, LARGE_RUN_PMD, thpsize);
}

static void run_with_thp_swap(test_fn fn, const char *desc)
{
ksft_print_msg("[RUN] %s ... with swapped-out THP\n", desc);
- do_run_with_thp(fn, THP_RUN_PMD_SWAPOUT);
+ do_run_with_large(fn, LARGE_RUN_PMD_SWAPOUT, thpsize);
}

static void run_with_pte_mapped_thp(test_fn fn, const char *desc)
{
ksft_print_msg("[RUN] %s ... with PTE-mapped THP\n", desc);
- do_run_with_thp(fn, THP_RUN_PTE);
+ do_run_with_large(fn, LARGE_RUN_PTE, thpsize);
}

static void run_with_pte_mapped_thp_swap(test_fn fn, const char *desc)
{
ksft_print_msg("[RUN] %s ... with swapped-out, PTE-mapped THP\n", desc);
- do_run_with_thp(fn, THP_RUN_PTE_SWAPOUT);
+ do_run_with_large(fn, LARGE_RUN_PTE_SWAPOUT, thpsize);
}

static void run_with_single_pte_of_thp(test_fn fn, const char *desc)
{
ksft_print_msg("[RUN] %s ... with single PTE of THP\n", desc);
- do_run_with_thp(fn, THP_RUN_SINGLE_PTE);
+ do_run_with_large(fn, LARGE_RUN_SINGLE_PTE, thpsize);
}

static void run_with_single_pte_of_thp_swap(test_fn fn, const char *desc)
{
ksft_print_msg("[RUN] %s ... with single PTE of swapped-out THP\n", desc);
- do_run_with_thp(fn, THP_RUN_SINGLE_PTE_SWAPOUT);
+ do_run_with_large(fn, LARGE_RUN_SINGLE_PTE_SWAPOUT, thpsize);
}

static void run_with_partial_mremap_thp(test_fn fn, const char *desc)
{
ksft_print_msg("[RUN] %s ... with partially mremap()'ed THP\n", desc);
- do_run_with_thp(fn, THP_RUN_PARTIAL_MREMAP);
+ do_run_with_large(fn, LARGE_RUN_PARTIAL_MREMAP, thpsize);
}

static void run_with_partial_shared_thp(test_fn fn, const char *desc)
{
ksft_print_msg("[RUN] %s ... with partially shared THP\n", desc);
- do_run_with_thp(fn, THP_RUN_PARTIAL_SHARED);
+ do_run_with_large(fn, LARGE_RUN_PARTIAL_SHARED, thpsize);
}

static void run_with_hugetlb(test_fn fn, const char *desc, size_t hugetlbsize)
@@ -1338,7 +1342,7 @@ static void run_anon_thp_test_cases(void)
struct test_case const *test_case = &anon_thp_test_cases[i];

ksft_print_msg("[RUN] %s\n", test_case->desc);
- do_run_with_thp(test_case->fn, THP_RUN_PMD);
+ do_run_with_large(test_case->fn, LARGE_RUN_PMD, thpsize);
}
}

--
2.25.1
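
The "twice the requested size" mapping and the alignment expression used by
do_run_with_large() can be checked in isolation. The helper below is a
hypothetical stand-alone version of that expression, assuming `size` is a
power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the lowest `size`-aligned address strictly
 * above `base`. The result r satisfies base < r <= base + size, so an area
 * of `size` bytes starting at r always fits inside a mapping of 2 * size
 * bytes starting at `base`.
 */
static uintptr_t toy_align_within(uintptr_t base, size_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return (base + size) & ~((uintptr_t)size - 1);
}
```

Note that an already aligned `base` is skipped in favour of `base + size`;
this is harmless for the test because the mapping is twice as large as the
area needed.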


2023-08-10 15:13:36

by Ryan Roberts

Subject: [PATCH v5 2/5] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()

In preparation for LARGE_ANON_FOLIO support, improve
folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
passed to it. In this case, all contained pages are accounted using the
order-0 folio (or base page) scheme.

Reviewed-by: Yu Zhao <[email protected]>
Reviewed-by: Yin Fengwei <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
mm/rmap.c | 27 ++++++++++++++++++++-------
1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 769fcabc6c56..d1ff92b4bf6b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1266,31 +1266,44 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
* This means the inc-and-test can be bypassed.
* The folio does not have to be locked.
*
- * If the folio is large, it is accounted as a THP. As the folio
+ * If the folio is pmd-mappable, it is accounted as a THP. As the folio
* is new, it's assumed to be mapped exclusively by a single process.
*/
void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
unsigned long address)
{
- int nr;
+ int nr = folio_nr_pages(folio);

- VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+ VM_BUG_ON_VMA(address < vma->vm_start ||
+ address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
__folio_set_swapbacked(folio);

- if (likely(!folio_test_pmd_mappable(folio))) {
+ if (likely(!folio_test_large(folio))) {
/* increment count (starts at -1) */
atomic_set(&folio->_mapcount, 0);
- nr = 1;
+ __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
+ } else if (!folio_test_pmd_mappable(folio)) {
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ struct page *page = folio_page(folio, i);
+
+ /* increment count (starts at -1) */
+ atomic_set(&page->_mapcount, 0);
+ __page_set_anon_rmap(folio, page, vma,
+ address + (i << PAGE_SHIFT), 1);
+ }
+
+ atomic_set(&folio->_nr_pages_mapped, nr);
} else {
/* increment count (starts at -1) */
atomic_set(&folio->_entire_mapcount, 0);
atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
- nr = folio_nr_pages(folio);
+ __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
}

__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
- __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
}

/**
--
2.25.1
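
The three accounting cases in the patched folio_add_new_anon_rmap() can be
summarised with a toy model. Nothing here is kernel code: the struct, the
function and TOY_COMPOUND_MAPPED (a placeholder marker, not the kernel's
value) are made up for illustration, and `pmd_nr` stands for the number of
base pages in a PMD-sized folio.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_COMPOUND_MAPPED (-1)	/* placeholder, not the kernel constant */

struct toy_rmap_result {
	int pages_with_mapcount_set;	/* pages whose _mapcount is set to 0 */
	bool entire_mapcount_set;	/* _entire_mapcount set to 0 */
	int nr_pages_mapped;		/* value stored, 0 if left untouched */
	int anon_mapped_delta;		/* change to NR_ANON_MAPPED */
	int anon_thps_delta;		/* change to NR_ANON_THPS */
};

static struct toy_rmap_result toy_add_new_anon_rmap(int nr, int pmd_nr)
{
	struct toy_rmap_result r = { 0, false, 0, nr, 0 };

	if (nr == 1) {
		/* Order-0 folio: a single mapcount. */
		r.pages_with_mapcount_set = 1;
	} else if (nr < pmd_nr) {
		/* Large but not pmd-mappable: account every page. */
		r.pages_with_mapcount_set = nr;
		r.nr_pages_mapped = nr;
	} else {
		/* PMD-mappable: accounted as a THP. */
		r.entire_mapcount_set = true;
		r.nr_pages_mapped = TOY_COMPOUND_MAPPED;
		r.anon_thps_delta = nr;
	}
	return r;
}
```

For example, with 512 base pages per PMD, an 8-page folio takes the middle
branch: eight per-page mapcounts are set and NR_ANON_MAPPED grows by 8, with
no change to NR_ANON_THPS.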


2023-08-10 15:34:20

by Ryan Roberts

Subject: [PATCH v5 5/5] selftests/mm/cow: Add large anon folio tests

Add tests similar to the existing THP tests, but which operate on memory
backed by large anonymous folios, which are smaller than THP.

This reuses all the existing infrastructure. If the test suite detects
that large anonymous folios are not supported by the kernel, the new
tests are skipped.

Signed-off-by: Ryan Roberts <[email protected]>
---
tools/testing/selftests/mm/cow.c | 111 +++++++++++++++++++++++++++++--
1 file changed, 106 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index 304882bf2e5d..932242c965a4 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -33,6 +33,7 @@
static size_t pagesize;
static int pagemap_fd;
static size_t thpsize;
+static size_t lafsize;
static int nr_hugetlbsizes;
static size_t hugetlbsizes[10];
static int gup_fd;
@@ -927,6 +928,42 @@ static void run_with_partial_shared_thp(test_fn fn, const char *desc)
do_run_with_large(fn, LARGE_RUN_PARTIAL_SHARED, thpsize);
}

+static void run_with_laf(test_fn fn, const char *desc)
+{
+ ksft_print_msg("[RUN] %s ... with large anon folio\n", desc);
+ do_run_with_large(fn, LARGE_RUN_PTE, lafsize);
+}
+
+static void run_with_laf_swap(test_fn fn, const char *desc)
+{
+ ksft_print_msg("[RUN] %s ... with swapped-out large anon folio\n", desc);
+ do_run_with_large(fn, LARGE_RUN_PTE_SWAPOUT, lafsize);
+}
+
+static void run_with_single_pte_of_laf(test_fn fn, const char *desc)
+{
+ ksft_print_msg("[RUN] %s ... with single PTE of large anon folio\n", desc);
+ do_run_with_large(fn, LARGE_RUN_SINGLE_PTE, lafsize);
+}
+
+static void run_with_single_pte_of_laf_swap(test_fn fn, const char *desc)
+{
+ ksft_print_msg("[RUN] %s ... with single PTE of swapped-out large anon folio\n", desc);
+ do_run_with_large(fn, LARGE_RUN_SINGLE_PTE_SWAPOUT, lafsize);
+}
+
+static void run_with_partial_mremap_laf(test_fn fn, const char *desc)
+{
+ ksft_print_msg("[RUN] %s ... with partially mremap()'ed large anon folio\n", desc);
+ do_run_with_large(fn, LARGE_RUN_PARTIAL_MREMAP, lafsize);
+}
+
+static void run_with_partial_shared_laf(test_fn fn, const char *desc)
+{
+ ksft_print_msg("[RUN] %s ... with partially shared large anon folio\n", desc);
+ do_run_with_large(fn, LARGE_RUN_PARTIAL_SHARED, lafsize);
+}
+
static void run_with_hugetlb(test_fn fn, const char *desc, size_t hugetlbsize)
{
int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB;
@@ -1105,6 +1142,14 @@ static void run_anon_test_case(struct test_case const *test_case)
run_with_partial_mremap_thp(test_case->fn, test_case->desc);
run_with_partial_shared_thp(test_case->fn, test_case->desc);
}
+ if (lafsize) {
+ run_with_laf(test_case->fn, test_case->desc);
+ run_with_laf_swap(test_case->fn, test_case->desc);
+ run_with_single_pte_of_laf(test_case->fn, test_case->desc);
+ run_with_single_pte_of_laf_swap(test_case->fn, test_case->desc);
+ run_with_partial_mremap_laf(test_case->fn, test_case->desc);
+ run_with_partial_shared_laf(test_case->fn, test_case->desc);
+ }
for (i = 0; i < nr_hugetlbsizes; i++)
run_with_hugetlb(test_case->fn, test_case->desc,
hugetlbsizes[i]);
@@ -1126,6 +1171,8 @@ static int tests_per_anon_test_case(void)

if (thpsize)
tests += 8;
+ if (lafsize)
+ tests += 6;
return tests;
}

@@ -1680,15 +1727,74 @@ static int tests_per_non_anon_test_case(void)
return tests;
}

+static size_t large_anon_folio_size(void)
+{
+ /*
+ * There is no interface to query this. But we know that it must be less
+ * than thpsize. So we map a thpsize area, aligned to thpsize offset by
+ * thpsize/2 (to avoid a hugepage being allocated), then touch the first
+ * page and see how many pages get faulted in.
+ */
+
+ int max_order = __builtin_ctz(thpsize);
+ size_t mmap_size = thpsize * 3;
+ char *mmap_mem = NULL;
+ int order = 0;
+ char *mem;
+ size_t offset;
+ int ret;
+
+ /* For alignment purposes, we need 2.5x the requested size. */
+ mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (mmap_mem == MAP_FAILED)
+ goto out;
+
+ /* Align the memory area to thpsize then offset it by thpsize/2. */
+ mem = (char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1));
+ mem += thpsize / 2;
+
+ /* We might get a bigger large anon folio when MADV_HUGEPAGE is set. */
+ ret = madvise(mem, thpsize, MADV_HUGEPAGE);
+ if (ret)
+ goto out;
+
+ /* Probe the memory to see how much is populated. */
+ mem[0] = 0;
+ for (order = 0; order < max_order; order++) {
+ offset = (1 << order) * pagesize;
+ if (!pagemap_is_populated(pagemap_fd, mem + offset))
+ break;
+ }
+
+out:
+ if (mmap_mem)
+ munmap(mmap_mem, mmap_size);
+
+ if (order == 0)
+ return 0;
+
+ return offset;
+}
+
int main(int argc, char **argv)
{
int err;

+ gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
+ pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+ if (pagemap_fd < 0)
+ ksft_exit_fail_msg("opening pagemap failed\n");
+
pagesize = getpagesize();
thpsize = read_pmd_pagesize();
if (thpsize)
ksft_print_msg("[INFO] detected THP size: %zu KiB\n",
thpsize / 1024);
+ lafsize = large_anon_folio_size();
+ if (lafsize)
+ ksft_print_msg("[INFO] detected large anon folio size: %zu KiB\n",
+ lafsize / 1024);
nr_hugetlbsizes = detect_hugetlb_page_sizes(hugetlbsizes,
ARRAY_SIZE(hugetlbsizes));
detect_huge_zeropage();
@@ -1698,11 +1804,6 @@ int main(int argc, char **argv)
ARRAY_SIZE(anon_thp_test_cases) * tests_per_anon_thp_test_case() +
ARRAY_SIZE(non_anon_test_cases) * tests_per_non_anon_test_case());

- gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
- pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
- if (pagemap_fd < 0)
- ksft_exit_fail_msg("opening pagemap failed\n");
-
run_anon_test_cases();
run_anon_thp_test_cases();
run_non_anon_test_cases();
--
2.25.1
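
The probing loop in large_anon_folio_size() can be exercised without a kernel
by replacing the pagemap lookup with a toy predicate. In this sketch
toy_is_populated() is an assumed stand-in for pagemap_is_populated() that
treats bytes [0, populated) as present after the first page is touched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for pagemap_is_populated(): bytes [0, populated) are present. */
static bool toy_is_populated(size_t populated, size_t offset)
{
	return offset < populated;
}

/*
 * Same loop shape as the selftest: check the page at each power-of-two
 * offset and stop at the first one that was not faulted in.
 */
static size_t toy_probe_folio_size(size_t populated, size_t pagesize,
				   int max_order)
{
	size_t offset = 0;
	int order;

	for (order = 0; order < max_order; order++) {
		offset = ((size_t)1 << order) * pagesize;
		if (!toy_is_populated(populated, offset))
			break;
	}

	/* Only the touched page was populated: no large folio detected. */
	return order == 0 ? 0 : offset;
}
```

With 4 KiB pages, a fault that populates 32 KiB makes the loop stop at the
32 KiB offset and report 32768, while a fault that populates a single page
reports 0.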


2023-08-10 15:42:13

by Ryan Roberts

Subject: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance

Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
allocated in large folios of a determined order. All pages of the large
folio are pte-mapped during the same page fault, significantly reducing
the number of page faults. The number of per-page operations (e.g. ref
counting, rmap management, lru list management) is also significantly
reduced since those ops now become per-folio.

The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
which defaults to disabled for now; the long term aim is for this to
default to enabled, but there are some risks around internal
fragmentation that need to be better understood first.

Large anonymous folio (LAF) allocation is integrated with the existing
(PMD-order) THP and single (S) page allocation according to this policy,
where fallback (>) is performed for various reasons, such as the
proposed folio order not fitting within the bounds of the VMA, etc:

                | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
                | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
----------------|-----------|-------------|---------------|-------------
no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
MADV_NOHUGEPAGE | S         | S           | S             | S

This approach ensures that we don't violate existing hints to only
allocate single pages - this is required for QEMU's VM live migration
implementation to work correctly - while allowing us to use LAF
independently of THP (when sysfs=never). This makes wide scale
performance characterization simpler, while avoiding exposing any new
ABI to user space.
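
The policy table above can be restated as a small decision function. This is
a toy model for illustration only: the enums and toy_first_choice() are
hypothetical names, and the function returns just the first allocation type
attempted, leaving the subsequent fallback chain (e.g. THP>LAF>S) implicit.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_hint { TOY_HINT_NONE, TOY_HINT_HUGEPAGE, TOY_HINT_NOHUGEPAGE };
enum toy_sysfs { TOY_SYSFS_NEVER, TOY_SYSFS_MADVISE, TOY_SYSFS_ALWAYS };
enum toy_first { TOY_FIRST_S, TOY_FIRST_LAF, TOY_FIRST_THP };

/* Return the first allocation type tried for a given table cell. */
static enum toy_first toy_first_choice(bool prctl_disabled,
				       enum toy_sysfs sysfs,
				       enum toy_hint hint)
{
	/* Explicit opt-outs always get single pages. */
	if (prctl_disabled || hint == TOY_HINT_NOHUGEPAGE)
		return TOY_FIRST_S;

	/* THP is tried first only where the sysfs knob allows it. */
	if (sysfs == TOY_SYSFS_ALWAYS)
		return TOY_FIRST_THP;
	if (sysfs == TOY_SYSFS_MADVISE && hint == TOY_HINT_HUGEPAGE)
		return TOY_FIRST_THP;

	/* Everything else starts with a large anon folio. */
	return TOY_FIRST_LAF;
}
```

Note that sysfs=never still yields LAF in this model, which matches the
statement that LAF can be used independently of THP.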

When using LAF for allocation, the folio order is determined as follows:
The return value of arch_wants_pte_order() is used. For vmas that have
not explicitly opted-in to use transparent hugepages (e.g. where
sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never),
then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever
is bigger). This allows for a performance boost without requiring any
explicit opt-in from the workload while limiting internal
fragmentation.
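
As a worked example of the 64K cap, the order computation can be modelled in
user space. The toy_* names are hypothetical; the page shift and the costly
order are passed in as parameters rather than taken from kernel headers, and
the tests assume a costly order of 3.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SZ_64K (64UL * 1024)

/* Integer log2, assuming v is a power of two. */
static int toy_ilog2(unsigned long v)
{
	int n = 0;

	while (v >>= 1)
		n++;
	return n;
}

/* Order of max(64K, page size), mirroring ANON_FOLIO_MAX_ORDER_UNHINTED. */
static int toy_max_order_unhinted(int page_shift)
{
	unsigned long page_size = 1UL << page_shift;
	unsigned long cap = page_size > TOY_SZ_64K ? page_size : TOY_SZ_64K;

	return toy_ilog2(cap) - page_shift;
}

/* Model of anon_folio_order(): arch preference, floor, then unhinted cap. */
static int toy_anon_folio_order(int arch_order, int costly_order,
				int page_shift, bool thp_eligible)
{
	int order = arch_order > costly_order ? arch_order : costly_order;
	int cap = toy_max_order_unhinted(page_shift);

	if (!thp_eligible && order > cap)
		order = cap;
	return order;
}
```

With 4K pages the unhinted cap is order-4 (64K); with 16K pages it is order-2;
with 64K pages it is order-0, so unhinted vmas get single pages.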

If the preferred order can't be used (e.g. because the folio would
breach the bounds of the vma, or because ptes in the region are already
mapped) then we fall back to a suitable lower order; first
PAGE_ALLOC_COSTLY_ORDER, then order-0.
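
The fallback sequence can be sketched as follows. This toy model covers only
the vma-bounds check; the checks for already-populated ptes and for
allocation failure, which walk the same list of orders, are not modelled.
toy_pick_order() and its parameters are hypothetical names.

```c
#include <assert.h>

/*
 * Try the preferred order, then the costly order (if smaller than the
 * preferred one), and return the first whose naturally aligned range
 * fits inside [vm_start, vm_end). Returns 0 if none fits.
 */
static int toy_pick_order(unsigned long fault_addr, unsigned long vm_start,
			  unsigned long vm_end, int prefer, int costly_order,
			  int page_shift)
{
	int orders[3];
	int i;

	orders[0] = prefer;
	orders[1] = prefer > costly_order ? costly_order : 0;
	orders[2] = 0;	/* terminator: order-0 needs no bounds check */

	for (i = 0; orders[i]; i++) {
		unsigned long size = 1UL << (page_shift + orders[i]);
		unsigned long addr = fault_addr & ~(size - 1);

		if (addr >= vm_start && addr + size <= vm_end)
			return orders[i];
	}
	return 0;
}
```

For instance, with 4K pages and a preferred order of 4, a fault at 0x19000 in
a vma spanning [0x18000, 0x30000) cannot use a 64K folio (it would start at
0x10000, below the vma) but can use a 32K folio starting at 0x18000.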

arch_wants_pte_order() can be overridden by the architecture if desired.
Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
set of ptes map physically contiguous, naturally aligned memory, so this
mechanism allows the architecture to optimize as required.

Here we add the default implementation of arch_wants_pte_order(), used
when the architecture does not define it, which returns -1, implying
that the HW has no preference. In this case, mm will choose its own
default order.

Signed-off-by: Ryan Roberts <[email protected]>
---
include/linux/pgtable.h | 13 ++++
mm/Kconfig | 10 +++
mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++---
3 files changed, 158 insertions(+), 9 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 222a33b9600d..4b488cc66ddc 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void)
}
#endif

+#ifndef arch_wants_pte_order
+/*
+ * Returns preferred folio order for pte-mapped memory. Must be in range [0,
+ * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
+ * to be at least order-2. Negative value implies that the HW has no preference
+ * and mm will choose its own default order.
+ */
+static inline int arch_wants_pte_order(void)
+{
+ return -1;
+}
+#endif
+
#ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
unsigned long address,
diff --git a/mm/Kconfig b/mm/Kconfig
index 721dc88423c7..a1e28b8ddc24 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA

source "mm/damon/Kconfig"

+config LARGE_ANON_FOLIO
+ bool "Allocate large folios for anonymous memory"
+ depends on TRANSPARENT_HUGEPAGE
+ default n
+ help
+ Use large (bigger than order-0) folios to back anonymous memory where
+ possible, even for pte-mapped memory. This reduces the number of page
+ faults, as well as other per-page overheads to improve performance for
+ many workloads.
+
endmenu
diff --git a/mm/memory.c b/mm/memory.c
index d003076b218d..bbc7d4ce84f7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
return ret;
}

+static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
+{
+ int i;
+
+ if (nr_pages == 1)
+ return vmf_pte_changed(vmf);
+
+ for (i = 0; i < nr_pages; i++) {
+ if (!pte_none(ptep_get_lockless(vmf->pte + i)))
+ return true;
+ }
+
+ return false;
+}
+
+#ifdef CONFIG_LARGE_ANON_FOLIO
+#define ANON_FOLIO_MAX_ORDER_UNHINTED \
+ (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
+
+static int anon_folio_order(struct vm_area_struct *vma)
+{
+ int order;
+
+ /*
+ * If the vma is eligible for thp, allocate a large folio of the size
+ * preferred by the arch. Or if the arch requested a very small size or
+ * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still
+ * meets the arch's requirements but means we still take advantage of SW
+ * optimizations (e.g. fewer page faults).
+ *
+ * If the vma isn't eligible for thp, take the arch-preferred size and
+ * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads
+ * that have not explicitly opted-in take benefit while capping the
+ * potential for internal fragmentation.
+ */
+
+ order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
+
+ if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
+ order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
+
+ return order;
+}
+
+static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+{
+ int i;
+ gfp_t gfp;
+ pte_t *pte;
+ unsigned long addr;
+ struct folio *folio;
+ struct vm_area_struct *vma = vmf->vma;
+ int prefer = anon_folio_order(vma);
+ int orders[] = {
+ prefer,
+ prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
+ 0,
+ };
+
+ /*
+ * If uffd is active for the vma we need per-page fault fidelity to
+ * maintain the uffd semantics.
+ */
+ if (userfaultfd_armed(vma))
+ goto fallback;
+
+ /*
+ * If hugepages are explicitly disabled for the vma (either
+ * MADV_NOHUGEPAGE or prctl) fallback to order-0. Failure to do this
+ * breaks correctness for user space. We ignore the sysfs global knob.
+ */
+ if (!hugepage_vma_check(vma, vma->vm_flags, false, true, false))
+ goto fallback;
+
+ for (i = 0; orders[i]; i++) {
+ addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
+ if (addr >= vma->vm_start &&
+ addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
+ break;
+ }
+
+ if (!orders[i])
+ goto fallback;
+
+ pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
+ if (!pte)
+ return ERR_PTR(-EAGAIN);
+
+ for (; orders[i]; i++) {
+ addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
+ vmf->pte = pte + pte_index(addr);
+ if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
+ break;
+ }
+
+ vmf->pte = NULL;
+ pte_unmap(pte);
+
+ gfp = vma_thp_gfp_mask(vma);
+
+ for (; orders[i]; i++) {
+ addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
+ folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
+ if (folio) {
+ clear_huge_page(&folio->page, addr, 1 << orders[i]);
+ return folio;
+ }
+ }
+
+fallback:
+ return vma_alloc_zeroed_movable_folio(vma, vmf->address);
+}
+#else
+#define alloc_anon_folio(vmf) \
+ vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
+#endif
+
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -4080,6 +4197,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
{
+ int i;
+ int nr_pages = 1;
+ unsigned long addr = vmf->address;
bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
struct vm_area_struct *vma = vmf->vma;
struct folio *folio;
@@ -4124,10 +4244,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
/* Allocate our own private page. */
if (unlikely(anon_vma_prepare(vma)))
goto oom;
- folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+ folio = alloc_anon_folio(vmf);
+ if (IS_ERR(folio))
+ return 0;
if (!folio)
goto oom;

+ nr_pages = folio_nr_pages(folio);
+ addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+
if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
goto oom_free_page;
folio_throttle_swaprate(folio, GFP_KERNEL);
@@ -4144,12 +4269,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));

- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
if (!vmf->pte)
goto release;
- if (vmf_pte_changed(vmf)) {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
+ if (vmf_pte_range_changed(vmf, nr_pages)) {
+ for (i = 0; i < nr_pages; i++)
+ update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
goto release;
}

@@ -4164,16 +4289,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
return handle_userfault(vmf, VM_UFFD_MISSING);
}

- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
- folio_add_new_anon_rmap(folio, vma, vmf->address);
+ folio_ref_add(folio, nr_pages - 1);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+ folio_add_new_anon_rmap(folio, vma, addr);
folio_add_lru_vma(folio, vma);
setpte:
if (uffd_wp)
entry = pte_mkuffd_wp(entry);
- set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+ set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);

/* No need to invalidate - it was non-present before */
- update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+ update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
unlock:
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
--
2.25.1


2023-08-10 15:51:48

by Ryan Roberts

Subject: Re: [PATCH v5 0/5] variable-order, large folios for anonymous memory

On 10/08/2023 15:29, Ryan Roberts wrote:
> Hi All,
>
> This is v5 of a series to implement variable order, large folios for anonymous
> memory. (currently called "LARGE_ANON_FOLIO", previously called "FLEXIBLE_THP").
> The objective of this is to improve performance by allocating larger chunks of
> memory during anonymous page faults:
>
> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
> pages, there are efficiency savings to be had; fewer page faults, batched PTE
> and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
> overhead. This should benefit all architectures.
> 2) Since we are now mapping physically contiguous chunks of memory, we can take
> advantage of HW TLB compression techniques. A reduction in TLB pressure
> speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
> TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>
> This patch set deals with the SW side of things (1). (2) is being tackled in a
> separate series. The new behaviour is hidden behind a new Kconfig switch,
> LARGE_ANON_FOLIO, which is disabled by default. Although the eventual aim is to
> enable it by default.
>
> My hope is that we are pretty much there with the changes at this point;
> hopefully this is sufficient to get an initial version merged so that we can
> scale up characterization efforts. Although they should not be merged until the
> prerequisites are complete. These are in progress and tracked at [5].
>
> This series is based on mm-unstable (ad3232df3e41).
>
> I'm going to be out on holiday from the end of today, returning on 29th
> August. So responses will likely be patchy, as I'm terrified of posting
> to list from my phone!
>
>
> Testing
> -------
>
> This version adds patches to mm selftests so that the cow tests explicitly test
> large anon folios, in the same way that thp is tested. When enabled you should
> see something similar at the start of the test suite:
>
> # [INFO] detected large anon folio size: 32 KiB
>
> Then the following results are expected. The fails and skips are due to existing
> issues in mm-unstable:
>
> # Totals: pass:207 fail:16 xfail:0 xpass:0 skip:85 error:0

Oops, the above are the results when running with SWAP disabled. This is what
you would normally see when SWAP is enabled:

# Totals: pass:291 fail:16 xfail:0 xpass:0 skip:1 error:0

>
> Existing mm selftests reveal 1 regression in khugepaged tests when
> LARGE_ANON_FOLIO is enabled:
>
> Run test: collapse_max_ptes_none (khugepaged:anon)
> Maybe collapse with max_ptes_none exceeded.... Fail
> Unexpected huge page
>
> I believe this is because khugepaged currently skips non-order-0 pages when
> looking for collapse opportunities and should get fixed with the help of
> DavidH's work to create a mechanism to precisely determine shared vs exclusive
> pages.
>
>
> Changes since v4 [4]
> --------------------
>
> - Removed "arm64: mm: Override arch_wants_pte_order()" patch; arm64
> now uses the default order-3 size. I have moved this patch over to
> the contpte series.
> - Added "mm: Allow deferred splitting of arbitrary large anon folios" back
> into series. I originally removed this at v2 to add to a separate series,
> but that series has transformed significantly and it no longer fits, so
> bringing it back here.
> - Reintroduced dependency on set_ptes(); Originally dropped this at v2, but
> set_ptes() is in mm-unstable now.
> - Updated policy for when to allocate LAF; only fallback to order-0 if
> MADV_NOHUGEPAGE is present or if THP disabled via prctl; no longer rely on
> sysfs's never/madvise/always knob.
> - Fallback to order-0 whenever uffd is armed for the vma, not just when
> uffd-wp is set on the pte.
> - alloc_anon_folio() now returns `struct folio *`, where errors are encoded
> with ERR_PTR().
>
> The last 3 changes were proposed by Yu Zhao - thanks!
>
>
> Changes since v3 [3]
> --------------------
>
> - Renamed feature from FLEXIBLE_THP to LARGE_ANON_FOLIO.
> - Removed `flexthp_unhinted_max` boot parameter. Discussion concluded that a
> sysctl is preferable but we will wait until real workload needs it.
> - Fixed uninitialized `addr` on read fault path in do_anonymous_page().
> - Added mm selftests for large anon folios in cow test suite.
>
>
> Changes since v2 [2]
> --------------------
>
> - Dropped commit "Allow deferred splitting of arbitrary large anon folios"
> - Huang, Ying suggested the "batch zap" work (which I dropped from this
> series after v1) is a prerequisite for merging FLEXIBLE_THP, so I've
> moved the deferred split patch to a separate series along with the batch
> zap changes. I plan to submit this series early next week.
> - Changed folio order fallback policy
> - We no longer iterate from preferred to 0 looking for acceptable policy
> - Instead we iterate through preferred, PAGE_ALLOC_COSTLY_ORDER and 0 only
> - Removed vma parameter from arch_wants_pte_order()
> - Added command line parameter `flexthp_unhinted_max`
> - clamps preferred order when vma hasn't explicitly opted-in to THP
> - Never allocate large folio for MADV_NOHUGEPAGE vma (or when THP is disabled
> for process or system).
> - Simplified implementation and integration with do_anonymous_page()
> - Removed dependency on set_ptes()
>
>
> Changes since v1 [1]
> --------------------
>
> - removed changes to arch-dependent vma_alloc_zeroed_movable_folio()
> - replaced with arch-independent alloc_anon_folio()
> - follows THP allocation approach
> - no longer retry with intermediate orders if allocation fails
> - fallback directly to order-0
> - remove folio_add_new_anon_rmap_range() patch
> - instead add its new functionality to folio_add_new_anon_rmap()
> - remove batch-zap pte mappings optimization patch
> - remove enabler folio_remove_rmap_range() patch too
> - These offer real perf improvement so will submit separately
> - simplify Kconfig
> - single FLEXIBLE_THP option, which is independent of arch
> - depends on TRANSPARENT_HUGEPAGE
> - when enabled default to max anon folio size of 64K unless arch
> explicitly overrides
> - simplify changes to do_anonymous_page():
> - no more retry loop
>
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> [2] https://lore.kernel.org/linux-mm/[email protected]/
> [3] https://lore.kernel.org/linux-mm/[email protected]/
> [4] https://lore.kernel.org/linux-mm/[email protected]/
> [5] https://lore.kernel.org/linux-mm/[email protected]/
>
>
> Thanks,
> Ryan
>
> Ryan Roberts (5):
> mm: Allow deferred splitting of arbitrary large anon folios
> mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
> mm: LARGE_ANON_FOLIO for improved performance
> selftests/mm/cow: Generalize do_run_with_thp() helper
> selftests/mm/cow: Add large anon folio tests
>
> include/linux/pgtable.h | 13 ++
> mm/Kconfig | 10 ++
> mm/memory.c | 144 +++++++++++++++++--
> mm/rmap.c | 31 +++--
> tools/testing/selftests/mm/cow.c | 229 ++++++++++++++++++++++---------
> 5 files changed, 347 insertions(+), 80 deletions(-)
>
> --
> 2.25.1
>


2023-08-10 18:49:37

by Yu Zhao

Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance

On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <[email protected]> wrote:
>
> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
> allocated in large folios of a determined order. All pages of the large
> folio are pte-mapped during the same page fault, significantly reducing
> the number of page faults. The number of per-page operations (e.g. ref
> counting, rmap management, lru list management) is also significantly
> reduced since those ops now become per-folio.
>
> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
> which defaults to disabled for now; the long term aim is for this to
> default to enabled, but there are some risks around internal
> fragmentation that need to be better understood first.
>
> Large anonymous folio (LAF) allocation is integrated with the existing
> (PMD-order) THP and single (S) page allocation according to this policy,
> where fallback (>) is performed for various reasons, such as the
> proposed folio order not fitting within the bounds of the VMA, etc:
>
> | prctl=dis | prctl=ena | prctl=ena | prctl=ena
> | sysfs=X | sysfs=never | sysfs=madvise | sysfs=always
> ----------------|-----------|-------------|---------------|-------------
> no hint | S | LAF>S | LAF>S | THP>LAF>S
> MADV_HUGEPAGE | S | LAF>S | THP>LAF>S | THP>LAF>S
> MADV_NOHUGEPAGE | S | S | S | S
>
> This approach ensures that we don't violate existing hints to only
> allocate single pages - this is required for QEMU's VM live migration
> implementation to work correctly - while allowing us to use LAF
> independently of THP (when sysfs=never). This makes wide scale
> performance characterization simpler, while avoiding exposing any new
> ABI to user space.
>
> When using LAF for allocation, the folio order is determined as follows:
> The return value of arch_wants_pte_order() is used. For vmas that have
> not explicitly opted-in to use transparent hugepages (e.g. where
> sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never),
> then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever
> is bigger). This allows for a performance boost without requiring any
> explicit opt-in from the workload while limiting internal
> fragmentation.
>
> If the preferred order can't be used (e.g. because the folio would
> breach the bounds of the vma, or because ptes in the region are already
> mapped) then we fall back to a suitable lower order; first
> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>
> arch_wants_pte_order() can be overridden by the architecture if desired.
> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
> set of ptes map physically contiguous, naturally aligned memory, so this
> mechanism allows the architecture to optimize as required.
>
> Here we add the default implementation of arch_wants_pte_order(), used
> when the architecture does not define it, which returns -1, implying
> that the HW has no preference. In this case, mm will choose its own
> default order.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/pgtable.h | 13 ++++
> mm/Kconfig | 10 +++
> mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++---
> 3 files changed, 158 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 222a33b9600d..4b488cc66ddc 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void)
> }
> #endif
>
> +#ifndef arch_wants_pte_order
> +/*
> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> + * to be at least order-2. Negative value implies that the HW has no preference
> + * and mm will choose its own default order.
> + */
> +static inline int arch_wants_pte_order(void)
> +{
> + return -1;
> +}
> +#endif
> +
> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> unsigned long address,
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 721dc88423c7..a1e28b8ddc24 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA
>
> source "mm/damon/Kconfig"
>
> +config LARGE_ANON_FOLIO
> + bool "Allocate large folios for anonymous memory"
> + depends on TRANSPARENT_HUGEPAGE
> + default n
> + help
> + Use large (bigger than order-0) folios to back anonymous memory where
> + possible, even for pte-mapped memory. This reduces the number of page
> + faults, as well as other per-page overheads to improve performance for
> + many workloads.
> +
> endmenu
> diff --git a/mm/memory.c b/mm/memory.c
> index d003076b218d..bbc7d4ce84f7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> return ret;
> }
>
> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
> +{
> + int i;
> +
> + if (nr_pages == 1)
> + return vmf_pte_changed(vmf);
> +
> + for (i = 0; i < nr_pages; i++) {
> + if (!pte_none(ptep_get_lockless(vmf->pte + i)))
> + return true;
> + }
> +
> + return false;
> +}
> +
> +#ifdef CONFIG_LARGE_ANON_FOLIO
> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
> + (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
> +
> +static int anon_folio_order(struct vm_area_struct *vma)
> +{
> + int order;
> +
> + /*
> + * If the vma is eligible for thp, allocate a large folio of the size
> + * preferred by the arch. Or if the arch requested a very small size or
> + * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still
> + * meets the arch's requirements but means we still take advantage of SW
> + * optimizations (e.g. fewer page faults).
> + *
> + * If the vma isn't eligible for thp, take the arch-preferred size and
> + * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads
> + * that have not explicitly opted-in take benefit while capping the
> + * potential for internal fragmentation.
> + */
> +
> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> +
> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> + order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
> +
> + return order;
> +}

I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED.
1. It's not used, since no archs at the moment implement
arch_wants_pte_order() that returns >64KB.
2. As far as I know, there is no plan for any arch to do so.
3. Again, it seems to me the rationale behind
ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all.

Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please?

Also you made arch_wants_pte_order() return -1, and I acknowledged [1]:
Thanks: -1 actually is better than 0 (what I suggested) for the
obvious reason.

I thought we were on the same page, i.e., the "obvious reason" is that
h/w might prefer 0. But here you are not respecting 0. But then why
-1?

[1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@mail.gmail.com/

2023-08-10 19:25:38

by Ryan Roberts

Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance

On 10/08/2023 18:01, Yu Zhao wrote:
> On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <[email protected]> wrote:
>>
>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>> allocated in large folios of a determined order. All pages of the large
>> folio are pte-mapped during the same page fault, significantly reducing
>> the number of page faults. The number of per-page operations (e.g. ref
>> counting, rmap management, lru list management) is also significantly
>> reduced since those ops now become per-folio.
>>
>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>> which defaults to disabled for now; the long term aim is for this to
>> default to enabled, but there are some risks around internal
>> fragmentation that need to be better understood first.
>>
>> Large anonymous folio (LAF) allocation is integrated with the existing
>> (PMD-order) THP and single (S) page allocation according to this policy,
>> where fallback (>) is performed for various reasons, such as the
>> proposed folio order not fitting within the bounds of the VMA, etc:
>>
>> | prctl=dis | prctl=ena | prctl=ena | prctl=ena
>> | sysfs=X | sysfs=never | sysfs=madvise | sysfs=always
>> ----------------|-----------|-------------|---------------|-------------
>> no hint | S | LAF>S | LAF>S | THP>LAF>S
>> MADV_HUGEPAGE | S | LAF>S | THP>LAF>S | THP>LAF>S
>> MADV_NOHUGEPAGE | S | S | S | S
>>
>> This approach ensures that we don't violate existing hints to only
>> allocate single pages - this is required for QEMU's VM live migration
>> implementation to work correctly - while allowing us to use LAF
>> independently of THP (when sysfs=never). This makes wide scale
>> performance characterization simpler, while avoiding exposing any new
>> ABI to user space.
>>
>> When using LAF for allocation, the folio order is determined as follows:
>> The return value of arch_wants_pte_order() is used. For vmas that have
>> not explicitly opted-in to use transparent hugepages (e.g. where
>> sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never),
>> then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever
>> is bigger). This allows for a performance boost without requiring any
>> explicit opt-in from the workload while limiting internal
>> fragmentation.
>>
>> If the preferred order can't be used (e.g. because the folio would
>> breach the bounds of the vma, or because ptes in the region are already
>> mapped) then we fall back to a suitable lower order; first
>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>
>> arch_wants_pte_order() can be overridden by the architecture if desired.
>> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
>> set of ptes map physically contiguous, naturally aligned memory, so this
>> mechanism allows the architecture to optimize as required.
>>
>> Here we add the default implementation of arch_wants_pte_order(), used
>> when the architecture does not define it, which returns -1, implying
>> that the HW has no preference. In this case, mm will choose its own
>> default order.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> include/linux/pgtable.h | 13 ++++
>> mm/Kconfig | 10 +++
>> mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++---
>> 3 files changed, 158 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 222a33b9600d..4b488cc66ddc 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void)
>> }
>> #endif
>>
>> +#ifndef arch_wants_pte_order
>> +/*
>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>> + * to be at least order-2. Negative value implies that the HW has no preference
>> + * and mm will choose its own default order.
>> + */
>> +static inline int arch_wants_pte_order(void)
>> +{
>> + return -1;
>> +}
>> +#endif
>> +
>> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>> unsigned long address,
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 721dc88423c7..a1e28b8ddc24 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA
>>
>> source "mm/damon/Kconfig"
>>
>> +config LARGE_ANON_FOLIO
>> + bool "Allocate large folios for anonymous memory"
>> + depends on TRANSPARENT_HUGEPAGE
>> + default n
>> + help
>> + Use large (bigger than order-0) folios to back anonymous memory where
>> + possible, even for pte-mapped memory. This reduces the number of page
>> + faults, as well as other per-page overheads to improve performance for
>> + many workloads.
>> +
>> endmenu
>> diff --git a/mm/memory.c b/mm/memory.c
>> index d003076b218d..bbc7d4ce84f7 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> return ret;
>> }
>>
>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>> +{
>> + int i;
>> +
>> + if (nr_pages == 1)
>> + return vmf_pte_changed(vmf);
>> +
>> + for (i = 0; i < nr_pages; i++) {
>> + if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>> + return true;
>> + }
>> +
>> + return false;
>> +}
>> +
>> +#ifdef CONFIG_LARGE_ANON_FOLIO
>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>> + (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>> +
>> +static int anon_folio_order(struct vm_area_struct *vma)
>> +{
>> + int order;
>> +
>> + /*
>> + * If the vma is eligible for thp, allocate a large folio of the size
>> + * preferred by the arch. Or if the arch requested a very small size or
>> + * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still
>> + * meets the arch's requirements but means we still take advantage of SW
>> + * optimizations (e.g. fewer page faults).
>> + *
>> + * If the vma isn't eligible for thp, take the arch-preferred size and
>> + * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads
>> + * that have not explicitly opted-in take benefit while capping the
>> + * potential for internal fragmentation.
>> + */
>> +
>> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>> +
>> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>> + order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>> +
>> + return order;
>> +}
>
> I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED.
> 1. It's not used, since no archs at the moment implement
> arch_wants_pte_order() that returns >64KB.
> 2. As far as I know, there is no plan for any arch to do so.

My rationale is that arm64 is planning to use this for contpte mapping 2MB
blocks for 16K and 64K kernels. But I think we will all agree that allowing 2MB
blocks without the proper THP hinting is a bad plan.

As I see it, arches could add their own arch_wants_pte_order() at any time, and
just because the HW has a preference, doesn't mean the SW shouldn't get a say.
It's a negotiation between HW and SW for the LAF order, embodied in this policy.

> 3. Again, it seems to me the rationale behind
> ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all.
>
> Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please?
>
> Also you made arch_wants_pte_order() return -1, and I acknowledged [1]:
> Thanks: -1 actually is better than 0 (what I suggested) for the
> obvious reason.
>
> I thought we were on the same page, i.e., the "obvious reason" is that
> h/w might prefer 0. But here you are not respecting 0. But then why
> -1?

I agree that the "obvious reason" is that HW might prefer order-0. But the
performance wins don't come solely from the HW. Batching up page faults is a big
win for SW even if the HW doesn't benefit. So I think it is important that a HW
preference of order-0 is possible to express through this API. But that doesn't
mean that we don't listen to SW's preferences either.

I would really rather leave it in; as I've mentioned in the past, we have a
partner who is actively keen to take advantage of 2MB blocks with 64K kernel and
this is the mechanism that means we don't dole out those 2MB blocks unless
explicitly opted-in.

I'm going to be out on holiday for a couple of weeks, so we might have to wait
until I'm back to conclude on this, if you still take issue with the justification.

Thanks,
Ryan


>
> [1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@mail.gmail.com/


2023-08-10 20:14:59

by Zi Yan

Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance

On 10 Aug 2023, at 15:12, Ryan Roberts wrote:

> On 10/08/2023 18:01, Yu Zhao wrote:
>> On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <[email protected]> wrote:
>>>
>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>> allocated in large folios of a determined order. All pages of the large
>>> folio are pte-mapped during the same page fault, significantly reducing
>>> the number of page faults. The number of per-page operations (e.g. ref
>>> counting, rmap management, lru list management) is also significantly
>>> reduced since those ops now become per-folio.
>>>
>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>> which defaults to disabled for now; the long term aim is for this to
>>> default to enabled, but there are some risks around internal
>>> fragmentation that need to be better understood first.
>>>
>>> Large anonymous folio (LAF) allocation is integrated with the existing
>>> (PMD-order) THP and single (S) page allocation according to this policy,
>>> where fallback (>) is performed for various reasons, such as the
>>> proposed folio order not fitting within the bounds of the VMA, etc:
>>>
>>> | prctl=dis | prctl=ena | prctl=ena | prctl=ena
>>> | sysfs=X | sysfs=never | sysfs=madvise | sysfs=always
>>> ----------------|-----------|-------------|---------------|-------------
>>> no hint | S | LAF>S | LAF>S | THP>LAF>S
>>> MADV_HUGEPAGE | S | LAF>S | THP>LAF>S | THP>LAF>S
>>> MADV_NOHUGEPAGE | S | S | S | S
>>>
>>> This approach ensures that we don't violate existing hints to only
>>> allocate single pages - this is required for QEMU's VM live migration
>>> implementation to work correctly - while allowing us to use LAF
>>> independently of THP (when sysfs=never). This makes wide scale
>>> performance characterization simpler, while avoiding exposing any new
>>> ABI to user space.
>>>
>>> When using LAF for allocation, the folio order is determined as follows:
>>> The return value of arch_wants_pte_order() is used. For vmas that have
>>> not explicitly opted-in to use transparent hugepages (e.g. where
>>> sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never),
>>> then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever
>>> is bigger). This allows for a performance boost without requiring any
>>> explicit opt-in from the workload while limiting internal
>>> fragmentation.
>>>
>>> If the preferred order can't be used (e.g. because the folio would
>>> breach the bounds of the vma, or because ptes in the region are already
>>> mapped) then we fall back to a suitable lower order; first
>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>
>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
>>> set of ptes map physically contiguous, naturally aligned memory, so this
>>> mechanism allows the architecture to optimize as required.
>>>
>>> Here we add the default implementation of arch_wants_pte_order(), used
>>> when the architecture does not define it, which returns -1, implying
>>> that the HW has no preference. In this case, mm will choose its own
>>> default order.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>> include/linux/pgtable.h | 13 ++++
>>> mm/Kconfig | 10 +++
>>> mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++---
>>> 3 files changed, 158 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index 222a33b9600d..4b488cc66ddc 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void)
>>> }
>>> #endif
>>>
>>> +#ifndef arch_wants_pte_order
>>> +/*
>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>>> + * to be at least order-2. Negative value implies that the HW has no preference
>>> + * and mm will choose its own default order.
>>> + */
>>> +static inline int arch_wants_pte_order(void)
>>> +{
>>> + return -1;
>>> +}
>>> +#endif
>>> +
>>> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>> unsigned long address,
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index 721dc88423c7..a1e28b8ddc24 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA
>>>
>>> source "mm/damon/Kconfig"
>>>
>>> +config LARGE_ANON_FOLIO
>>> + bool "Allocate large folios for anonymous memory"
>>> + depends on TRANSPARENT_HUGEPAGE
>>> + default n
>>> + help
>>> + Use large (bigger than order-0) folios to back anonymous memory where
>>> + possible, even for pte-mapped memory. This reduces the number of page
>>> + faults, as well as other per-page overheads to improve performance for
>>> + many workloads.
>>> +
>>> endmenu
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index d003076b218d..bbc7d4ce84f7 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> return ret;
>>> }
>>>
>>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>>> +{
>>> + int i;
>>> +
>>> + if (nr_pages == 1)
>>> + return vmf_pte_changed(vmf);
>>> +
>>> + for (i = 0; i < nr_pages; i++) {
>>> + if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>>> + return true;
>>> + }
>>> +
>>> + return false;
>>> +}
>>> +
>>> +#ifdef CONFIG_LARGE_ANON_FOLIO
>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>> + (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>> +
>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>> +{
>>> + int order;
>>> +
>>> + /*
>>> + * If the vma is eligible for thp, allocate a large folio of the size
>>> + * preferred by the arch. Or if the arch requested a very small size or
>>> + * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still
>>> + * meets the arch's requirements but means we still take advantage of SW
>>> + * optimizations (e.g. fewer page faults).
>>> + *
>>> + * If the vma isn't eligible for thp, take the arch-preferred size and
>>> + * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads
>>> + * that have not explicitly opted-in take benefit while capping the
>>> + * potential for internal fragmentation.
>>> + */
>>> +
>>> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>> +
>>> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>> + order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>> +
>>> + return order;
>>> +}
>>
>> I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED.
>> 1. It's not used, since no archs at the moment implement
>> arch_wants_pte_order() that returns >64KB.
>> 2. As far as I know, there is no plan for any arch to do so.
>
> My rationale is that arm64 is planning to use this for contpte mapping 2MB
> blocks for 16K and 64K kernels. But I think we will all agree that allowing 2MB
> blocks without the proper THP hinting is a bad plan.
>
> As I see it, arches could add their own arch_wants_pte_order() at any time, and
> just because the HW has a preference, doesn't mean the SW shouldn't get a say.
> It's a negotiation between HW and SW for the LAF order, embodied in this policy.
>
>> 3. Again, it seems to me the rationale behind
>> ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all.
>>
>> Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please?
>>
>> Also you made arch_wants_pte_order() return -1, and I acknowledged [1]:
>> Thanks: -1 actually is better than 0 (what I suggested) for the
>> obvious reason.
>>
>> I thought we were on the same page, i.e., the "obvious reason" is that
>> h/w might prefer 0. But here you are not respecting 0. But then why
>> -1?
>
> I agree that the "obvious reason" is that HW might prefer order-0. But the
> performance wins don't come solely from the HW. Batching up page faults is a big
> win for SW even if the HW doesn't benefit. So I think it is important that a HW
> preference of order-0 is possible to express through this API. But that doesn't
> mean that we don't listen to SW's preferences either.
>
> I would really rather leave it in; As I've mentioned in the past, we have a
> partner who is actively keen to take advantage of 2MB blocks with 64K kernel and
> this is the mechanism that means we don't dole out those 2MB blocks unless
> explicitly opted-in.
>
> I'm going to be out on holiday for a couple of weeks, so we might have to wait
> until I'm back to conclude on this, if you still take issue with the justification.

From my understanding (correct me if I am wrong), Yu seems to want order-0 to be
the default order even if LAF is enabled. But that does not make sense to me: if
LAF is configured in (it is disabled by default now), users (and distros) must
think LAF gives a benefit. Otherwise, they would just disable LAF at compile time
or by using prctl. Enabling LAF and using order-0 as the default order leaves
most of the LAF code unused.
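
To make the disagreement concrete, here is a small userspace model of the quoted
anon_folio_order() policy. It is only a sketch: the model_* names are made up for
illustration, thp_eligible stands in for the hugepage_vma_check() result, and the
only constant taken from the kernel is PAGE_ALLOC_COSTLY_ORDER (3).

```c
#include <stdbool.h>

/* Value of PAGE_ALLOC_COSTLY_ORDER in the kernel. */
#define MODEL_COSTLY_ORDER 3

/* floor(log2(x)) for x > 0; stands in for the kernel's ilog2(). */
static int model_ilog2(unsigned long x)
{
	int r = -1;

	while (x) {
		x >>= 1;
		r++;
	}
	return r;
}

/* Mirrors ANON_FOLIO_MAX_ORDER_UNHINTED: the order of max(64K, PAGE_SIZE). */
static int model_max_order_unhinted(int page_shift)
{
	unsigned long page_size = 1UL << page_shift;
	unsigned long cap = page_size > 65536UL ? page_size : 65536UL;

	return model_ilog2(cap) - page_shift;
}

/*
 * Mirrors the quoted policy: take the larger of the arch preference and
 * PAGE_ALLOC_COSTLY_ORDER, then cap it for vmas that are not THP-eligible.
 * arch_order is what arch_wants_pte_order() would return (-1 = no preference).
 */
static int model_anon_folio_order(int arch_order, bool thp_eligible,
				  int page_shift)
{
	int order = arch_order > MODEL_COSTLY_ORDER ? arch_order
						    : MODEL_COSTLY_ORDER;

	if (!thp_eligible) {
		int cap = model_max_order_unhinted(page_shift);

		if (order > cap)
			order = cap;
	}
	return order;
}
```

With 4K pages (page_shift = 12), an arch preference of -1 and one of 0 both come
out as order-3 in this model, which is Yu's point that a HW preference of 0 is
not respected. With 64K pages (page_shift = 16) the unhinted cap works out to
order-0, so an arch preference of order-5 (2MB) only applies to THP-eligible vmas.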

Also, arch_wants_pte_order() might need a better name, like
arch_wants_large_folio_order(). The current name sounds like the specified order
is wanted by HW in a general setting, but it is not; it is the order HW wants
when LAF is enabled. That might cause some confusion.

>>
>> [1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@mail.gmail.com/


--
Best Regards,
Yan, Zi



2023-08-11 01:11:08

by Yin, Fengwei

Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance



On 8/11/2023 3:12 AM, Ryan Roberts wrote:
> On 10/08/2023 18:01, Yu Zhao wrote:
>> On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <[email protected]> wrote:
>>>
>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>> allocated in large folios of a determined order. All pages of the large
>>> folio are pte-mapped during the same page fault, significantly reducing
>>> the number of page faults. The number of per-page operations (e.g. ref
>>> counting, rmap management, lru list management) is also significantly
>>> reduced since those ops now become per-folio.
>>>
>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>> which defaults to disabled for now; the long term aim is for this to
>>> default to enabled, but there are some risks around internal
>>> fragmentation that need to be better understood first.
>>>
>>> Large anonymous folio (LAF) allocation is integrated with the existing
>>> (PMD-order) THP and single (S) page allocation according to this policy,
>>> where fallback (>) is performed for various reasons, such as the
>>> proposed folio order not fitting within the bounds of the VMA, etc:
>>>
>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>> ----------------|-----------|-------------|---------------|-------------
>>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
>>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>
>>> This approach ensures that we don't violate existing hints to only
>>> allocate single pages - this is required for QEMU's VM live migration
>>> implementation to work correctly - while allowing us to use LAF
>>> independently of THP (when sysfs=never). This makes wide scale
>>> performance characterization simpler, while avoiding exposing any new
>>> ABI to user space.
>>>
>>> When using LAF for allocation, the folio order is determined as follows:
>>> The return value of arch_wants_pte_order() is used. For vmas that have
>>> not explicitly opted-in to use transparent hugepages (e.g. where
>>> sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never),
>>> then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever
>>> is bigger). This allows for a performance boost without requiring any
>>> explicit opt-in from the workload while limiting internal
>>> fragmentation.
>>>
>>> If the preferred order can't be used (e.g. because the folio would
>>> breach the bounds of the vma, or because ptes in the region are already
>>> mapped) then we fall back to a suitable lower order; first
>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>
>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
>>> set of ptes map physically contiguous, naturally aligned memory, so this
>>> mechanism allows the architecture to optimize as required.
>>>
>>> Here we add the default implementation of arch_wants_pte_order(), used
>>> when the architecture does not define it, which returns -1, implying
>>> that the HW has no preference. In this case, mm will choose its own
>>> default order.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>> include/linux/pgtable.h | 13 ++++
>>> mm/Kconfig | 10 +++
>>> mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++---
>>> 3 files changed, 158 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index 222a33b9600d..4b488cc66ddc 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void)
>>> }
>>> #endif
>>>
>>> +#ifndef arch_wants_pte_order
>>> +/*
>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>>> + * to be at least order-2. Negative value implies that the HW has no preference
>>> + * and mm will choose its own default order.
>>> + */
>>> +static inline int arch_wants_pte_order(void)
>>> +{
>>> + return -1;
>>> +}
>>> +#endif
>>> +
>>> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>> unsigned long address,
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index 721dc88423c7..a1e28b8ddc24 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA
>>>
>>> source "mm/damon/Kconfig"
>>>
>>> +config LARGE_ANON_FOLIO
>>> + bool "Allocate large folios for anonymous memory"
>>> + depends on TRANSPARENT_HUGEPAGE
>>> + default n
>>> + help
>>> + Use large (bigger than order-0) folios to back anonymous memory where
>>> + possible, even for pte-mapped memory. This reduces the number of page
>>> + faults, as well as other per-page overheads to improve performance for
>>> + many workloads.
>>> +
>>> endmenu
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index d003076b218d..bbc7d4ce84f7 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> return ret;
>>> }
>>>
>>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>>> +{
>>> + int i;
>>> +
>>> + if (nr_pages == 1)
>>> + return vmf_pte_changed(vmf);
>>> +
>>> + for (i = 0; i < nr_pages; i++) {
>>> + if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>>> + return true;
>>> + }
>>> +
>>> + return false;
>>> +}
>>> +
>>> +#ifdef CONFIG_LARGE_ANON_FOLIO
>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>> + (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>> +
>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>> +{
>>> + int order;
>>> +
>>> + /*
>>> + * If the vma is eligible for thp, allocate a large folio of the size
>>> + * preferred by the arch. Or if the arch requested a very small size or
>>> + * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still
>>> + * meets the arch's requirements but means we still take advantage of SW
>>> + * optimizations (e.g. fewer page faults).
>>> + *
>>> + * If the vma isn't eligible for thp, take the arch-preferred size and
>>> + * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads
>>> + * that have not explicitly opted-in take benefit while capping the
>>> + * potential for internal fragmentation.
>>> + */
>>> +
>>> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>> +
>>> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>> + order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>> +
>>> + return order;
>>> +}
>>
>> I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED.
>> 1. It's not used, since no archs at the moment implement
>> arch_wants_pte_order() that returns >64KB.
>> 2. As far as I know, there is no plan for any arch to do so.
>
> My rationale is that arm64 is planning to use this for contpte mapping 2MB
> blocks for 16K and 64K kernels. But I think we will all agree that allowing 2MB
> blocks without the proper THP hinting is a bad plan.
>
> As I see it, arches could add their own arch_wants_pte_order() at any time, and
> just because the HW has a preference, doesn't mean the SW shouldn't get a say.
> It's a negotiation between HW and SW for the LAF order, embodied in this policy.
>
>> 3. Again, it seems to me the rationale behind
>> ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all.
>>
>> Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please?
>>
>> Also you made arch_wants_pte_order() return -1, and I acknowledged [1]:
>> Thanks: -1 actually is better than 0 (what I suggested) for the
>> obvious reason.
>>
>> I thought we were on the same page, i.e., the "obvious reason" is that
>> h/w might prefer 0. But here you are not respecting 0. But then why
>> -1?
>
> I agree that the "obvious reason" is that HW might prefer order-0. But the
> performance wins don't come solely from the HW. Batching up page faults is a big
> win for SW even if the HW doesn't benefit. So I think it is important that a HW
> preference of order-0 is possible to express through this API. But that doesn't
> mean that we don't listen to SW's preferences either.
>
> I would really rather leave it in; As I've mentioned in the past, we have a
> partner who is actively keen to take advantage of 2MB blocks with 64K kernel and
> this is the mechanism that means we don't dole out those 2MB blocks unless
> explicitly opted-in.
Even so, I don't think we want ANON_FOLIO_MAX_ORDER_UNHINTED hardcoded in common
mm code, as it's useless to other ARCHs.

Another drawback is that it makes performance testing harder. People need to
either change the code and recompile the kernel, or add another knob to configure it.

Considering we are still in the phase of doing more testing to understand the impact
of LAF, I agree with Yu on this. Thanks.


Regards
Yin, Fengwei

>
> I'm going to be out on holiday for a couple of weeks, so we might have to wait
> until I'm back to conclude on this, if you still take issue with the justification.
>
> Thanks,
> Ryan
>
>
>>
>> [1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@mail.gmail.com/
>

2023-08-11 01:52:08

by Zi Yan

Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance

On 10 Aug 2023, at 20:36, Yin, Fengwei wrote:

> On 8/11/2023 3:46 AM, Zi Yan wrote:
>> On 10 Aug 2023, at 15:12, Ryan Roberts wrote:
>>
>>> On 10/08/2023 18:01, Yu Zhao wrote:
>>>> On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <[email protected]> wrote:
>>>>>
>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>>> allocated in large folios of a determined order. All pages of the large
>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>> counting, rmap management, lru list management) is also significantly
>>>>> reduced since those ops now become per-folio.
>>>>>
>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>>> which defaults to disabled for now; the long term aim is for this to
>>>>> default to enabled, but there are some risks around internal
>>>>> fragmentation that need to be better understood first.
>>>>>
>>>>> Large anonymous folio (LAF) allocation is integrated with the existing
>>>>> (PMD-order) THP and single (S) page allocation according to this policy,
>>>>> where fallback (>) is performed for various reasons, such as the
>>>>> proposed folio order not fitting within the bounds of the VMA, etc:
>>>>>
>>>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>>> ----------------|-----------|-------------|---------------|-------------
>>>>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
>>>>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
>>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>>
>>>>> This approach ensures that we don't violate existing hints to only
>>>>> allocate single pages - this is required for QEMU's VM live migration
>>>>> implementation to work correctly - while allowing us to use LAF
>>>>> independently of THP (when sysfs=never). This makes wide scale
>>>>> performance characterization simpler, while avoiding exposing any new
>>>>> ABI to user space.
>>>>>
>>>>> When using LAF for allocation, the folio order is determined as follows:
>>>>> The return value of arch_wants_pte_order() is used. For vmas that have
>>>>> not explicitly opted-in to use transparent hugepages (e.g. where
>>>>> sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never),
>>>>> then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever
>>>>> is bigger). This allows for a performance boost without requiring any
>>>>> explicit opt-in from the workload while limiting internal
>>>>> fragmentation.
>>>>>
>>>>> If the preferred order can't be used (e.g. because the folio would
>>>>> breach the bounds of the vma, or because ptes in the region are already
>>>>> mapped) then we fall back to a suitable lower order; first
>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>>
>>>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>>>> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
>>>>> set of ptes map physically contiguous, naturally aligned memory, so this
>>>>> mechanism allows the architecture to optimize as required.
>>>>>
>>>>> Here we add the default implementation of arch_wants_pte_order(), used
>>>>> when the architecture does not define it, which returns -1, implying
>>>>> that the HW has no preference. In this case, mm will choose its own
>>>>> default order.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>> ---
>>>>> include/linux/pgtable.h | 13 ++++
>>>>> mm/Kconfig | 10 +++
>>>>> mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++---
>>>>> 3 files changed, 158 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>> index 222a33b9600d..4b488cc66ddc 100644
>>>>> --- a/include/linux/pgtable.h
>>>>> +++ b/include/linux/pgtable.h
>>>>> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void)
>>>>> }
>>>>> #endif
>>>>>
>>>>> +#ifndef arch_wants_pte_order
>>>>> +/*
>>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>>>>> + * to be at least order-2. Negative value implies that the HW has no preference
>>>>> + * and mm will choose its own default order.
>>>>> + */
>>>>> +static inline int arch_wants_pte_order(void)
>>>>> +{
>>>>> + return -1;
>>>>> +}
>>>>> +#endif
>>>>> +
>>>>> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>>>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>>>> unsigned long address,
>>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>>> index 721dc88423c7..a1e28b8ddc24 100644
>>>>> --- a/mm/Kconfig
>>>>> +++ b/mm/Kconfig
>>>>> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA
>>>>>
>>>>> source "mm/damon/Kconfig"
>>>>>
>>>>> +config LARGE_ANON_FOLIO
>>>>> + bool "Allocate large folios for anonymous memory"
>>>>> + depends on TRANSPARENT_HUGEPAGE
>>>>> + default n
>>>>> + help
>>>>> + Use large (bigger than order-0) folios to back anonymous memory where
>>>>> + possible, even for pte-mapped memory. This reduces the number of page
>>>>> + faults, as well as other per-page overheads to improve performance for
>>>>> + many workloads.
>>>>> +
>>>>> endmenu
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index d003076b218d..bbc7d4ce84f7 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>> return ret;
>>>>> }
>>>>>
>>>>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>>>>> +{
>>>>> + int i;
>>>>> +
>>>>> + if (nr_pages == 1)
>>>>> + return vmf_pte_changed(vmf);
>>>>> +
>>>>> + for (i = 0; i < nr_pages; i++) {
>>>>> + if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>>>>> + return true;
>>>>> + }
>>>>> +
>>>>> + return false;
>>>>> +}
>>>>> +
>>>>> +#ifdef CONFIG_LARGE_ANON_FOLIO
>>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>>>> + (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>>> +
>>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>>> +{
>>>>> + int order;
>>>>> +
>>>>> + /*
>>>>> + * If the vma is eligible for thp, allocate a large folio of the size
>>>>> + * preferred by the arch. Or if the arch requested a very small size or
>>>>> + * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still
>>>>> + * meets the arch's requirements but means we still take advantage of SW
>>>>> + * optimizations (e.g. fewer page faults).
>>>>> + *
>>>>> + * If the vma isn't eligible for thp, take the arch-preferred size and
>>>>> + * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads
>>>>> + * that have not explicitly opted-in take benefit while capping the
>>>>> + * potential for internal fragmentation.
>>>>> + */
>>>>> +
>>>>> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>>> +
>>>>> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>>> + order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>>> +
>>>>> + return order;
>>>>> +}
>>>>
>>>> I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>> 1. It's not used, since no archs at the moment implement
>>>> arch_wants_pte_order() that returns >64KB.
>>>> 2. As far as I know, there is no plan for any arch to do so.
>>>
>>> My rationale is that arm64 is planning to use this for contpte mapping 2MB
>>> blocks for 16K and 64K kernels. But I think we will all agree that allowing 2MB
>>> blocks without the proper THP hinting is a bad plan.
>>>
>>> As I see it, arches could add their own arch_wants_pte_order() at any time, and
>>> just because the HW has a preference, doesn't mean the SW shouldn't get a say.
>>> It's a negotiation between HW and SW for the LAF order, embodied in this policy.
>>>
>>>> 3. Again, it seems to me the rationale behind
>>>> ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all.
>>>>
>>>> Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please?
>>>>
>>>> Also you made arch_wants_pte_order() return -1, and I acknowledged [1]:
>>>> Thanks: -1 actually is better than 0 (what I suggested) for the
>>>> obvious reason.
>>>>
>>>> I thought we were on the same page, i.e., the "obvious reason" is that
>>>> h/w might prefer 0. But here you are not respecting 0. But then why
>>>> -1?
>>>
>>> I agree that the "obvious reason" is that HW might prefer order-0. But the
>>> performance wins don't come solely from the HW. Batching up page faults is a big
>>> win for SW even if the HW doesn't benefit. So I think it is important that a HW
>>> preference of order-0 is possible to express through this API. But that doesn't
>>> mean that we don't listen to SW's preferences either.
>>>
>>> I would really rather leave it in; As I've mentioned in the past, we have a
>>> partner who is actively keen to take advantage of 2MB blocks with 64K kernel and
>>> this is the mechanism that means we don't dole out those 2MB blocks unless
>>> explicitly opted-in.
>>>
>>> I'm going to be out on holiday for a couple of weeks, so we might have to wait
>>> until I'm back to conclude on this, if you still take issue with the justification.
>>
>> From my understanding (correct me if I am wrong), Yu seems to want order-0 to be
>> the default order even if LAF is enabled. But that does not make sense to me, since
>> if LAF is configured to be enabled (it is disabled by default now), user (and distros)
>> must think LAF is giving benefit. Otherwise, they will just disable LAF at compilation
>> time or by using prctl. Enabling LAF and using order-0 as the default order makes
>> most of LAF code not used.
> For a device with limited memory that still wants LAF enabled for some specific
> memory ranges, it's possible that LAF is enabled with order-0 as the default order
> and madvise is used to enable LAF for the specific memory ranges.

Do you have a use case? Or is it just a possible scenario?

IIUC, Ryan has a concrete use case for his choice. For ARM64 with 16KB/64KB
base pages, 2MB folios (LAF in this config) would be desirable since THP is
32MB/512MB and much harder to get.
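
The sizes above can be checked with a little arithmetic. The sketch below assumes
8-byte page-table entries (as on arm64), so one PMD entry covers PAGE_SIZE / 8
ptes; the model_* helpers are illustrative names, not kernel functions.

```c
/*
 * Size mapped by one PMD entry, assuming 8-byte page-table entries:
 * a page of ptes holds PAGE_SIZE / 8 entries, each mapping PAGE_SIZE bytes,
 * i.e. 2^(page_shift - 3) * 2^page_shift bytes in total.
 */
static unsigned long model_pmd_size(int page_shift)
{
	return 1UL << (page_shift + (page_shift - 3));
}

/* Smallest folio order whose size is at least 'size' bytes. */
static int model_order_for_size(unsigned long size, int page_shift)
{
	int order = 0;

	while ((1UL << (page_shift + order)) < size)
		order++;
	return order;
}
```

Under that assumption, model_pmd_size(14) is 32MB and model_pmd_size(16) is 512MB,
matching the THP sizes quoted above, while a 2MB folio is order-7 with 16KB base
pages and order-5 with 64KB base pages.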

>
> So my understanding is that it's a possible case. But it's another configuration matter
> and doesn't need to be finalized now.

Basically, we are deciding whether LAF should use order-0 by default once it is
compiled into the kernel. From your other email on ANON_FOLIO_MAX_ORDER_UNHINTED,
your argument is that a code change is needed to test the impact of LAF with
different orders. That seems to imply we actually need an extra knob (maybe sysctl)
to control the max LAF order. And with that extra knob, we can solve this default
order problem, since we can set it to 0 for devices that want to opt in to LAF and
set it to N (like 64KB) for other devices that want to opt out of LAF.

So maybe we need the extra knob both for testing and for serving different
device configurations.
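
As a rough illustration of what such a knob could look like, here is a userspace
sketch. Nothing in it exists in the series: laf_max_order_unhinted is a
hypothetical tunable replacing the compile-time ANON_FOLIO_MAX_ORDER_UNHINTED,
and thp_eligible again stands in for the hugepage_vma_check() result.

```c
/* Hypothetical runtime tunable: max LAF order for vmas without a THP hint. */
static int laf_max_order_unhinted = 4;

/* Value of PAGE_ALLOC_COSTLY_ORDER in the kernel. */
#define MODEL_KNOB_COSTLY_ORDER 3

/*
 * Same shape as the quoted anon_folio_order(), but the cap for unhinted
 * vmas comes from the hypothetical tunable instead of a constant.
 */
static int model_order_with_knob(int arch_order, int thp_eligible)
{
	int order = arch_order > MODEL_KNOB_COSTLY_ORDER ? arch_order
							 : MODEL_KNOB_COSTLY_ORDER;

	if (!thp_eligible && order > laf_max_order_unhinted)
		order = laf_max_order_unhinted;
	return order;
}
```

Setting the tunable to 0 in this sketch gives order-0 for every unhinted vma
while hinted vmas still get a large order, so LAF becomes opt-in; a larger value
keeps LAF on by default up to that order.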

>>
>> Also arch_wants_pte_order() might need a better name like
>> arch_wants_large_folio_order(). Since current name sounds like the specified order
>> is wanted by HW in a general setting, but it is not. It is an order HW wants
>> when LAF is enabled. That might cause some confusion.
>>
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@mail.gmail.com/
>>
>>
>> --
>> Best Regards,
>> Yan, Zi


--
Best Regards,
Yan, Zi



2023-08-11 02:13:16

by Yin, Fengwei

Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance



On 8/11/2023 3:46 AM, Zi Yan wrote:
> On 10 Aug 2023, at 15:12, Ryan Roberts wrote:
>
>> On 10/08/2023 18:01, Yu Zhao wrote:
>>> On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>> allocated in large folios of a determined order. All pages of the large
>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>> counting, rmap management, lru list management) is also significantly
>>>> reduced since those ops now become per-folio.
>>>>
>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>> which defaults to disabled for now; the long term aim is for this to
>>>> default to enabled, but there are some risks around internal
>>>> fragmentation that need to be better understood first.
>>>>
>>>> Large anonymous folio (LAF) allocation is integrated with the existing
>>>> (PMD-order) THP and single (S) page allocation according to this policy,
>>>> where fallback (>) is performed for various reasons, such as the
>>>> proposed folio order not fitting within the bounds of the VMA, etc:
>>>>
>>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>> ----------------|-----------|-------------|---------------|-------------
>>>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
>>>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>
>>>> This approach ensures that we don't violate existing hints to only
>>>> allocate single pages - this is required for QEMU's VM live migration
>>>> implementation to work correctly - while allowing us to use LAF
>>>> independently of THP (when sysfs=never). This makes wide scale
>>>> performance characterization simpler, while avoiding exposing any new
>>>> ABI to user space.
>>>>
>>>> When using LAF for allocation, the folio order is determined as follows:
>>>> The return value of arch_wants_pte_order() is used. For vmas that have
>>>> not explicitly opted-in to use transparent hugepages (e.g. where
>>>> sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never),
>>>> then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever
>>>> is bigger). This allows for a performance boost without requiring any
>>>> explicit opt-in from the workload while limiting internal
>>>> fragmentation.
>>>>
>>>> If the preferred order can't be used (e.g. because the folio would
>>>> breach the bounds of the vma, or because ptes in the region are already
>>>> mapped) then we fall back to a suitable lower order; first
>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>
>>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>>> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
>>>> set of ptes map physically contiguous, naturally aligned memory, so this
>>>> mechanism allows the architecture to optimize as required.
>>>>
>>>> Here we add the default implementation of arch_wants_pte_order(), used
>>>> when the architecture does not define it, which returns -1, implying
>>>> that the HW has no preference. In this case, mm will choose its own
>>>> default order.
>>>>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>> ---
>>>> include/linux/pgtable.h | 13 ++++
>>>> mm/Kconfig | 10 +++
>>>> mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++---
>>>> 3 files changed, 158 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index 222a33b9600d..4b488cc66ddc 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void)
>>>> }
>>>> #endif
>>>>
>>>> +#ifndef arch_wants_pte_order
>>>> +/*
>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>>>> + * to be at least order-2. Negative value implies that the HW has no preference
>>>> + * and mm will choose its own default order.
>>>> + */
>>>> +static inline int arch_wants_pte_order(void)
>>>> +{
>>>> + return -1;
>>>> +}
>>>> +#endif
>>>> +
>>>> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>>> unsigned long address,
>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>> index 721dc88423c7..a1e28b8ddc24 100644
>>>> --- a/mm/Kconfig
>>>> +++ b/mm/Kconfig
>>>> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA
>>>>
>>>> source "mm/damon/Kconfig"
>>>>
>>>> +config LARGE_ANON_FOLIO
>>>> + bool "Allocate large folios for anonymous memory"
>>>> + depends on TRANSPARENT_HUGEPAGE
>>>> + default n
>>>> + help
>>>> + Use large (bigger than order-0) folios to back anonymous memory where
>>>> + possible, even for pte-mapped memory. This reduces the number of page
>>>> + faults, as well as other per-page overheads to improve performance for
>>>> + many workloads.
>>>> +
>>>> endmenu
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index d003076b218d..bbc7d4ce84f7 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>> return ret;
>>>> }
>>>>
>>>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>>>> +{
>>>> + int i;
>>>> +
>>>> + if (nr_pages == 1)
>>>> + return vmf_pte_changed(vmf);
>>>> +
>>>> + for (i = 0; i < nr_pages; i++) {
>>>> + if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>>>> + return true;
>>>> + }
>>>> +
>>>> + return false;
>>>> +}
>>>> +
>>>> +#ifdef CONFIG_LARGE_ANON_FOLIO
>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>>> + (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>> +
>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>> +{
>>>> + int order;
>>>> +
>>>> + /*
>>>> + * If the vma is eligible for thp, allocate a large folio of the size
>>>> + * preferred by the arch. Or if the arch requested a very small size or
>>>> + * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still
>>>> + * meets the arch's requirements but means we still take advantage of SW
>>>> + * optimizations (e.g. fewer page faults).
>>>> + *
>>>> + * If the vma isn't eligible for thp, take the arch-preferred size and
>>>> + * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads
>>>> + * that have not explicitly opted-in take benefit while capping the
>>>> + * potential for internal fragmentation.
>>>> + */
>>>> +
>>>> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>> +
>>>> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>> + order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>> +
>>>> + return order;
>>>> +}
>>>
>>> I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED.
>>> 1. It's not used, since no archs at the moment implement
>>> arch_wants_pte_order() that returns >64KB.
>>> 2. As far as I know, there is no plan for any arch to do so.
>>
>> My rationale is that arm64 is planning to use this for contpte mapping 2MB
>> blocks for 16K and 64K kernels. But I think we will all agree that allowing 2MB
>> blocks without the proper THP hinting is a bad plan.
>>
>> As I see it, arches could add their own arch_wants_pte_order() at any time, and
>> just because the HW has a preference, doesn't mean the SW shouldn't get a say.
>> It's a negotiation between HW and SW for the LAF order, embodied in this policy.
>>
>>> 3. Again, it seems to me the rationale behind
>>> ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all.
>>>
>>> Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please?
>>>
>>> Also you made arch_wants_pte_order() return -1, and I acknowledged [1]:
>>> Thanks: -1 actually is better than 0 (what I suggested) for the
>>> obvious reason.
>>>
>>> I thought we were on the same page, i.e., the "obvious reason" is that
>>> h/w might prefer 0. But here you are not respecting 0. But then why
>>> -1?
>>
>> I agree that the "obvious reason" is that HW might prefer order-0. But the
>> performance wins don't come solely from the HW. Batching up page faults is a big
>> win for SW even if the HW doesn't benefit. So I think it is important that a HW
>> preference of order-0 is possible to express through this API. But that doesn't
>> mean that we don't listen to SW's preferences either.
>>
>> I would really rather leave it in; As I've mentioned in the past, we have a
>> partner who is actively keen to take advantage of 2MB blocks with 64K kernel and
>> this is the mechanism that means we don't dole out those 2MB blocks unless
>> explicitly opted-in.
>>
>> I'm going to be out on holiday for a couple of weeks, so we might have to wait
>> until I'm back to conclude on this, if you still take issue with the justification.
>
> From my understanding (correct me if I am wrong), Yu seems to want order-0 to be
> the default order even if LAF is enabled. But that does not make sense to me, since
> if LAF is configured to be enabled (it is disabled by default now), user (and distros)
> must think LAF is giving benefit. Otherwise, they will just disable LAF at compilation
> time or by using prctl. Enabling LAF and using order-0 as the default order makes
> most of LAF code not used.
For a device with limited memory that still wants LAF for some specific memory
ranges, it's possible to enable LAF, keep order-0 as the default order, and use
madvise to enable LAF for those ranges.
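
As a rough illustration of that opt-in pattern, a user-space program could map a
region and hint only that range. This is just a sketch: map_hinted() is a
hypothetical helper name, madvise() can fail with EINVAL on kernels built without
THP support, and whether a hinted range actually gets large folios depends on the
policy still being discussed in this thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map `len` bytes of anonymous memory and hint that
 * this range (and only this range) should be backed by large folios.
 *
 * Returns the mapping, or NULL if mmap() fails. *hinted is set to 1 if
 * the kernel accepted MADV_HUGEPAGE for the range, otherwise 0.
 */
static void *map_hinted(size_t len, int *hinted)
{
	void *p;

	assert(len > 0 && hinted != NULL);

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	/* A failure here is not fatal: the range simply stays unhinted. */
	*hinted = (madvise(p, len, MADV_HUGEPAGE) == 0);
	return p;
}
```

The rest of the process's anonymous memory stays unhinted, so it would keep
whatever default order the system is configured with.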

So my understanding is that it's a possible case. But that is a separate
configuration question and doesn't need to be finalized now.


Regards
Yin, Fengwei

>
> Also arch_wants_pte_order() might need a better name like
> arch_wants_large_folio_order(). Since current name sounds like the specified order
> is wanted by HW in a general setting, but it is not. It is an order HW wants
> when LAF is enabled. That might cause some confusion.
>
>>>
>>> [1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@mail.gmail.com/
>
>
> --
> Best Regards,
> Yan, Zi

2023-08-11 06:18:40

by Yin, Fengwei

Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance



On 8/11/2023 9:04 AM, Zi Yan wrote:
> On 10 Aug 2023, at 20:36, Yin, Fengwei wrote:
>
>> On 8/11/2023 3:46 AM, Zi Yan wrote:
>>> On 10 Aug 2023, at 15:12, Ryan Roberts wrote:
>>>
>>>> On 10/08/2023 18:01, Yu Zhao wrote:
>>>>> On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <[email protected]> wrote:
>>>>>>
>>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>>>> allocated in large folios of a determined order. All pages of the large
>>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>>> counting, rmap management lru list management) are also significantly
>>>>>> reduced since those ops now become per-folio.
>>>>>>
>>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>>>> which defaults to disabled for now; The long term aim is for this to
>>>>>> default to enabled, but there are some risks around internal
>>>>>> fragmentation that need to be better understood first.
>>>>>>
>>>>>> Large anonymous folio (LAF) allocation is integrated with the existing
>>>>>> (PMD-order) THP and single (S) page allocation according to this policy,
>>>>>> where fallback (>) is performed for various reasons, such as the
>>>>>> proposed folio order not fitting within the bounds of the VMA, etc:
>>>>>>
>>>>>> | prctl=dis | prctl=ena | prctl=ena | prctl=ena
>>>>>> | sysfs=X | sysfs=never | sysfs=madvise | sysfs=always
>>>>>> ----------------|-----------|-------------|---------------|-------------
>>>>>> no hint | S | LAF>S | LAF>S | THP>LAF>S
>>>>>> MADV_HUGEPAGE | S | LAF>S | THP>LAF>S | THP>LAF>S
>>>>>> MADV_NOHUGEPAGE | S | S | S | S
>>>>>>
>>>>>> This approach ensures that we don't violate existing hints to only
>>>>>> allocate single pages - this is required for QEMU's VM live migration
>>>>>> implementation to work correctly - while allowing us to use LAF
>>>>>> independently of THP (when sysfs=never). This makes wide scale
>>>>>> performance characterization simpler, while avoiding exposing any new
>>>>>> ABI to user space.
>>>>>>
>>>>>> When using LAF for allocation, the folio order is determined as follows:
>>>>>> The return value of arch_wants_pte_order() is used. For vmas that have
>>>>>> not explicitly opted-in to use transparent hugepages (e.g. where
>>>>>> sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never),
>>>>>> then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever
>>>>>> is bigger). This allows for a performance boost without requiring any
>>>>>> explicit opt-in from the workload while limiting internal
>>>>>> fragmentation.
>>>>>>
>>>>>> If the preferred order can't be used (e.g. because the folio would
>>>>>> breach the bounds of the vma, or because ptes in the region are already
>>>>>> mapped) then we fall back to a suitable lower order; first
>>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>>>
>>>>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>>>>> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
>>>>>> set of ptes map physically contiguous, naturally aligned memory, so this
>>>>>> mechanism allows the architecture to optimize as required.
>>>>>>
>>>>>> Here we add the default implementation of arch_wants_pte_order(), used
>>>>>> when the architecture does not define it, which returns -1, implying
>>>>>> that the HW has no preference. In this case, mm will choose its own
>>>>>> default order.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>> ---
>>>>>> include/linux/pgtable.h | 13 ++++
>>>>>> mm/Kconfig | 10 +++
>>>>>> mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++---
>>>>>> 3 files changed, 158 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>> index 222a33b9600d..4b488cc66ddc 100644
>>>>>> --- a/include/linux/pgtable.h
>>>>>> +++ b/include/linux/pgtable.h
>>>>>> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void)
>>>>>> }
>>>>>> #endif
>>>>>>
>>>>>> +#ifndef arch_wants_pte_order
>>>>>> +/*
>>>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>>>>>> + * to be at least order-2. Negative value implies that the HW has no preference
>>>>>> + * and mm will choose it's own default order.
>>>>>> + */
>>>>>> +static inline int arch_wants_pte_order(void)
>>>>>> +{
>>>>>> + return -1;
>>>>>> +}
>>>>>> +#endif
>>>>>> +
>>>>>> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>>>>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>>>>> unsigned long address,
>>>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>>>> index 721dc88423c7..a1e28b8ddc24 100644
>>>>>> --- a/mm/Kconfig
>>>>>> +++ b/mm/Kconfig
>>>>>> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA
>>>>>>
>>>>>> source "mm/damon/Kconfig"
>>>>>>
>>>>>> +config LARGE_ANON_FOLIO
>>>>>> + bool "Allocate large folios for anonymous memory"
>>>>>> + depends on TRANSPARENT_HUGEPAGE
>>>>>> + default n
>>>>>> + help
>>>>>> + Use large (bigger than order-0) folios to back anonymous memory where
>>>>>> + possible, even for pte-mapped memory. This reduces the number of page
>>>>>> + faults, as well as other per-page overheads to improve performance for
>>>>>> + many workloads.
>>>>>> +
>>>>>> endmenu
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index d003076b218d..bbc7d4ce84f7 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>> return ret;
>>>>>> }
>>>>>>
>>>>>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>>>>>> +{
>>>>>> + int i;
>>>>>> +
>>>>>> + if (nr_pages == 1)
>>>>>> + return vmf_pte_changed(vmf);
>>>>>> +
>>>>>> + for (i = 0; i < nr_pages; i++) {
>>>>>> + if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>>>>>> + return true;
>>>>>> + }
>>>>>> +
>>>>>> + return false;
>>>>>> +}
>>>>>> +
>>>>>> +#ifdef CONFIG_LARGE_ANON_FOLIO
>>>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>>>>> + (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>>>> +
>>>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>>>> +{
>>>>>> + int order;
>>>>>> +
>>>>>> + /*
>>>>>> + * If the vma is eligible for thp, allocate a large folio of the size
>>>>>> + * preferred by the arch. Or if the arch requested a very small size or
>>>>>> + * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still
>>>>>> + * meets the arch's requirements but means we still take advantage of SW
>>>>>> + * optimizations (e.g. fewer page faults).
>>>>>> + *
>>>>>> + * If the vma isn't eligible for thp, take the arch-preferred size and
>>>>>> + * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads
>>>>>> + * that have not explicitly opted-in take benefit while capping the
>>>>>> + * potential for internal fragmentation.
>>>>>> + */
>>>>>> +
>>>>>> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>>>> +
>>>>>> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>>>> + order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>>>> +
>>>>>> + return order;
>>>>>> +}
>>>>>
>>>>> I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>>> 1. It's not used, since no archs at the moment implement
>>>>> arch_wants_pte_order() that returns >64KB.
>>>>> 2. As far as I know, there is no plan for any arch to do so.
>>>>
>>>> My rationale is that arm64 is planning to use this for contpte mapping 2MB
>>>> blocks for 16K and 64K kernels. But I think we will all agree that allowing 2MB
>>>> blocks without the proper THP hinting is a bad plan.
>>>>
>>>> As I see it, arches could add their own arch_wants_pte_order() at any time, and
>>>> just because the HW has a preference, doesn't mean the SW shouldn't get a say.
>>>> It's a negotiation between HW and SW for the LAF order, embodied in this policy.
>>>>
>>>>> 3. Again, it seems to me the rationale behind
>>>>> ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all.
>>>>>
>>>>> Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please?
>>>>>
>>>>> Also you made arch_wants_pte_order() return -1, and I acknowledged [1]:
>>>>> Thanks: -1 actually is better than 0 (what I suggested) for the
>>>>> obvious reason.
>>>>>
>>>>> I thought we were on the same page, i.e., the "obvious reason" is that
>>>>> h/w might prefer 0. But here you are not respecting 0. But then why
>>>>> -1?
>>>>
>>>> I agree that the "obvious reason" is that HW might prefer order-0. But the
>>>> performance wins don't come solely from the HW. Batching up page faults is a big
>>>> win for SW even if the HW doesn't benefit. So I think it is important that a HW
>>>> preference of order-0 is possible to express through this API. But that doesn't
>>>> mean that we don't listen to SW's preferences either.
>>>>
>>>> I would really rather leave it in; As I've mentioned in the past, we have a
>>>> partner who is actively keen to take advantage of 2MB blocks with 64K kernel and
>>>> this is the mechanism that means we don't dole out those 2MB blocks unless
>>>> explicitly opted-in.
>>>>
>>>> I'm going to be out on holiday for a couple of weeks, so we might have to wait
>>>> until I'm back to conclude on this, if you still take issue with the justification.
>>>
>>> From my understanding (correct me if I am wrong), Yu seems to want order-0 to be
>>> the default order even if LAF is enabled. But that does not make sense to me, since
>>> if LAF is configured to be enabled (it is disabled by default now), user (and distros)
>>> must think LAF is giving benefit. Otherwise, they will just disable LAF at compilation
>>> time or by using prctl. Enabling LAF and using order-0 as the default order makes
>>> most of LAF code not used.
>> For the device with limited memory size and it still wants LAF enabled for some specific
>> memory ranges, it's possible the LAF is enabled, order-0 as default order and use madvise
>> to enable LAF for specific memory ranges.
>
> Do you have a use case? Or it is just a possible scenario?
It's a possible scenario. In my experience, it's a valid use case for embedded
systems or low-end Android phones.

>
> IIUC, Ryan has a concrete use case for his choice. For ARM64 with 16KB/64KB
> base pages, 2MB folios (LAF in this config) would be desirable since THP is
> 32MB/512MB and much harder to get.
>
>>
>> So my understanding is it's possible case. But it's another configuration thing and not
>> necessary to be finalized now.
>
> Basically, we are deciding whether LAF should use order-0 by default once it is
> compiled in to kernel. From your other email on ANON_FOLIO_MAX_ORDER_UNHINTED,
> your argument is that code change is needed to test the impact of LAF with
> different orders. That seems to imply we actually need an extra knob (maybe sysctl)
> to control the max LAF order. And with that extra knob, we can solve this default
> order problem, since we can set it to 0 for devices want to opt in LAF and set
> it N (like 64KB) for other devices want to opt out LAF.
From a performance-tuning perspective, it's necessary to have knobs to configure and
check the attributes of LAF. But we must be careful about adding knobs, as they need
to be maintained forever.


Regards
Yin, Fengwei

>
> So maybe we need the extra knob for both testing purpose and serving different
> device configuration purpose.
>
>>>
>>> Also arch_wants_pte_order() might need a better name like
>>> arch_wants_large_folio_order(). Since current name sounds like the specified order
>>> is wanted by HW in a general setting, but it is not. It is an order HW wants
>>> when LAF is enabled. That might cause some confusion.
>>>
>>>>>
>>>>> [1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@mail.gmail.com/
>>>
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
>
>
> --
> Best Regards,
> Yan, Zi

2023-08-11 15:02:29

by Zi Yan

Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance

On 11 Aug 2023, at 1:34, Yin, Fengwei wrote:

> On 8/11/2023 9:04 AM, Zi Yan wrote:
>> On 10 Aug 2023, at 20:36, Yin, Fengwei wrote:
>>
>>> On 8/11/2023 3:46 AM, Zi Yan wrote:
>>>> On 10 Aug 2023, at 15:12, Ryan Roberts wrote:
>>>>
>>>>> On 10/08/2023 18:01, Yu Zhao wrote:
>>>>>> On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <[email protected]> wrote:
>>>>>>>
>>>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>>>>> allocated in large folios of a determined order. All pages of the large
>>>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>>>> counting, rmap management lru list management) are also significantly
>>>>>>> reduced since those ops now become per-folio.
>>>>>>>
>>>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>>>>> which defaults to disabled for now; The long term aim is for this to
>>>>>>> default to enabled, but there are some risks around internal
>>>>>>> fragmentation that need to be better understood first.
>>>>>>>
>>>>>>> Large anonymous folio (LAF) allocation is integrated with the existing
>>>>>>> (PMD-order) THP and single (S) page allocation according to this policy,
>>>>>>> where fallback (>) is performed for various reasons, such as the
>>>>>>> proposed folio order not fitting within the bounds of the VMA, etc:
>>>>>>>
>>>>>>> | prctl=dis | prctl=ena | prctl=ena | prctl=ena
>>>>>>> | sysfs=X | sysfs=never | sysfs=madvise | sysfs=always
>>>>>>> ----------------|-----------|-------------|---------------|-------------
>>>>>>> no hint | S | LAF>S | LAF>S | THP>LAF>S
>>>>>>> MADV_HUGEPAGE | S | LAF>S | THP>LAF>S | THP>LAF>S
>>>>>>> MADV_NOHUGEPAGE | S | S | S | S
>>>>>>>
>>>>>>> This approach ensures that we don't violate existing hints to only
>>>>>>> allocate single pages - this is required for QEMU's VM live migration
>>>>>>> implementation to work correctly - while allowing us to use LAF
>>>>>>> independently of THP (when sysfs=never). This makes wide scale
>>>>>>> performance characterization simpler, while avoiding exposing any new
>>>>>>> ABI to user space.
>>>>>>>
>>>>>>> When using LAF for allocation, the folio order is determined as follows:
>>>>>>> The return value of arch_wants_pte_order() is used. For vmas that have
>>>>>>> not explicitly opted-in to use transparent hugepages (e.g. where
>>>>>>> sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never),
>>>>>>> then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever
>>>>>>> is bigger). This allows for a performance boost without requiring any
>>>>>>> explicit opt-in from the workload while limiting internal
>>>>>>> fragmentation.
>>>>>>>
>>>>>>> If the preferred order can't be used (e.g. because the folio would
>>>>>>> breach the bounds of the vma, or because ptes in the region are already
>>>>>>> mapped) then we fall back to a suitable lower order; first
>>>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>>>>
>>>>>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>>>>>> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
>>>>>>> set of ptes map physically contiguous, naturally aligned memory, so this
>>>>>>> mechanism allows the architecture to optimize as required.
>>>>>>>
>>>>>>> Here we add the default implementation of arch_wants_pte_order(), used
>>>>>>> when the architecture does not define it, which returns -1, implying
>>>>>>> that the HW has no preference. In this case, mm will choose its own
>>>>>>> default order.
>>>>>>>
>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>> ---
>>>>>>> include/linux/pgtable.h | 13 ++++
>>>>>>> mm/Kconfig | 10 +++
>>>>>>> mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++---
>>>>>>> 3 files changed, 158 insertions(+), 9 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>> index 222a33b9600d..4b488cc66ddc 100644
>>>>>>> --- a/include/linux/pgtable.h
>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void)
>>>>>>> }
>>>>>>> #endif
>>>>>>>
>>>>>>> +#ifndef arch_wants_pte_order
>>>>>>> +/*
>>>>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>>>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>>>>>>> + * to be at least order-2. Negative value implies that the HW has no preference
>>>>>>> + * and mm will choose it's own default order.
>>>>>>> + */
>>>>>>> +static inline int arch_wants_pte_order(void)
>>>>>>> +{
>>>>>>> + return -1;
>>>>>>> +}
>>>>>>> +#endif
>>>>>>> +
>>>>>>> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>>>>>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>>>>>> unsigned long address,
>>>>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>>>>> index 721dc88423c7..a1e28b8ddc24 100644
>>>>>>> --- a/mm/Kconfig
>>>>>>> +++ b/mm/Kconfig
>>>>>>> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA
>>>>>>>
>>>>>>> source "mm/damon/Kconfig"
>>>>>>>
>>>>>>> +config LARGE_ANON_FOLIO
>>>>>>> + bool "Allocate large folios for anonymous memory"
>>>>>>> + depends on TRANSPARENT_HUGEPAGE
>>>>>>> + default n
>>>>>>> + help
>>>>>>> + Use large (bigger than order-0) folios to back anonymous memory where
>>>>>>> + possible, even for pte-mapped memory. This reduces the number of page
>>>>>>> + faults, as well as other per-page overheads to improve performance for
>>>>>>> + many workloads.
>>>>>>> +
>>>>>>> endmenu
>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>> index d003076b218d..bbc7d4ce84f7 100644
>>>>>>> --- a/mm/memory.c
>>>>>>> +++ b/mm/memory.c
>>>>>>> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>> return ret;
>>>>>>> }
>>>>>>>
>>>>>>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>>>>>>> +{
>>>>>>> + int i;
>>>>>>> +
>>>>>>> + if (nr_pages == 1)
>>>>>>> + return vmf_pte_changed(vmf);
>>>>>>> +
>>>>>>> + for (i = 0; i < nr_pages; i++) {
>>>>>>> + if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>>>>>>> + return true;
>>>>>>> + }
>>>>>>> +
>>>>>>> + return false;
>>>>>>> +}
>>>>>>> +
>>>>>>> +#ifdef CONFIG_LARGE_ANON_FOLIO
>>>>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>>>>>> + (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>>>>> +
>>>>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>>>>> +{
>>>>>>> + int order;
>>>>>>> +
>>>>>>> + /*
>>>>>>> + * If the vma is eligible for thp, allocate a large folio of the size
>>>>>>> + * preferred by the arch. Or if the arch requested a very small size or
>>>>>>> + * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still
>>>>>>> + * meets the arch's requirements but means we still take advantage of SW
>>>>>>> + * optimizations (e.g. fewer page faults).
>>>>>>> + *
>>>>>>> + * If the vma isn't eligible for thp, take the arch-preferred size and
>>>>>>> + * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads
>>>>>>> + * that have not explicitly opted-in take benefit while capping the
>>>>>>> + * potential for internal fragmentation.
>>>>>>> + */
>>>>>>> +
>>>>>>> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>>>>> +
>>>>>>> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>>>>> + order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>>>>> +
>>>>>>> + return order;
>>>>>>> +}
>>>>>>
>>>>>> I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>>>> 1. It's not used, since no archs at the moment implement
>>>>>> arch_wants_pte_order() that returns >64KB.
>>>>>> 2. As far as I know, there is no plan for any arch to do so.
>>>>>
>>>>> My rationale is that arm64 is planning to use this for contpte mapping 2MB
>>>>> blocks for 16K and 64K kernels. But I think we will all agree that allowing 2MB
>>>>> blocks without the proper THP hinting is a bad plan.
>>>>>
>>>>> As I see it, arches could add their own arch_wants_pte_order() at any time, and
>>>>> just because the HW has a preference, doesn't mean the SW shouldn't get a say.
>>>>> It's a negotiation between HW and SW for the LAF order, embodied in this policy.
>>>>>
>>>>>> 3. Again, it seems to me the rationale behind
>>>>>> ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all.
>>>>>>
>>>>>> Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please?
>>>>>>
>>>>>> Also you made arch_wants_pte_order() return -1, and I acknowledged [1]:
>>>>>> Thanks: -1 actually is better than 0 (what I suggested) for the
>>>>>> obvious reason.
>>>>>>
>>>>>> I thought we were on the same page, i.e., the "obvious reason" is that
>>>>>> h/w might prefer 0. But here you are not respecting 0. But then why
>>>>>> -1?
>>>>>
>>>>> I agree that the "obvious reason" is that HW might prefer order-0. But the
>>>>> performance wins don't come solely from the HW. Batching up page faults is a big
>>>>> win for SW even if the HW doesn't benefit. So I think it is important that a HW
>>>>> preference of order-0 is possible to express through this API. But that doesn't
>>>>> mean that we don't listen to SW's preferences either.
>>>>>
>>>>> I would really rather leave it in; As I've mentioned in the past, we have a
>>>>> partner who is actively keen to take advantage of 2MB blocks with 64K kernel and
>>>>> this is the mechanism that means we don't dole out those 2MB blocks unless
>>>>> explicitly opted-in.
>>>>>
>>>>> I'm going to be out on holiday for a couple of weeks, so we might have to wait
>>>>> until I'm back to conclude on this, if you still take issue with the justification.
>>>>
>>>> From my understanding (correct me if I am wrong), Yu seems to want order-0 to be
>>>> the default order even if LAF is enabled. But that does not make sense to me, since
>>>> if LAF is configured to be enabled (it is disabled by default now), user (and distros)
>>>> must think LAF is giving benefit. Otherwise, they will just disable LAF at compilation
>>>> time or by using prctl. Enabling LAF and using order-0 as the default order makes
>>>> most of LAF code not used.
>>> For the device with limited memory size and it still wants LAF enabled for some specific
>>> memory ranges, it's possible the LAF is enabled, order-0 as default order and use madvise
>>> to enable LAF for specific memory ranges.
>>
>> Do you have a use case? Or it is just a possible scenario?
> It's a possible scenario. Per my experience, it's valid use case for embedded
> system or low end android phone.
>
>>
>> IIUC, Ryan has a concrete use case for his choice. For ARM64 with 16KB/64KB
>> base pages, 2MB folios (LAF in this config) would be desirable since THP is
>> 32MB/512MB and much harder to get.
>>
>>>
>>> So my understanding is it's possible case. But it's another configuration thing and not
>>> necessary to be finalized now.
>>
>> Basically, we are deciding whether LAF should use order-0 by default once it is
>> compiled in to kernel. From your other email on ANON_FOLIO_MAX_ORDER_UNHINTED,
>> your argument is that code change is needed to test the impact of LAF with
>> different orders. That seems to imply we actually need an extra knob (maybe sysctl)
>> to control the max LAF order. And with that extra knob, we can solve this default
>> order problem, since we can set it to 0 for devices want to opt in LAF and set
>> it N (like 64KB) for other devices want to opt out LAF.
> From performance tuning perspective, it's necessary to have knobs to configure and
> check the attribute of LAF. But we must be careful to add the knobs as they need
> be maintained for ever.

If we do not want to maintain such a knob (since it may take some time to finalize),
but tweaking the LAF order is important for exploring different LAF configurations
(Ryan thinks 64KB will perform well on ARM64, whereas Yu mentioned 16KB/32KB is
better in his use cases), we can probably just put the LAF order knob in debugfs,
as Ryan suggested before, to move forward.
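
To make the effect of such a knob concrete, here is a small user-space model of
the order selection in the quoted anon_folio_order(). It is only a sketch, not
the patch itself: page_shift, arch_order and thp_eligible are plain parameters
standing in for PAGE_SHIFT, arch_wants_pte_order() and hugepage_vma_check(), and
unhinted_cap is a hypothetical tunable playing the role of the debugfs knob
(defaulting to what ANON_FOLIO_MAX_ORDER_UNHINTED would compute).

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3	/* same value as in the kernel */

/* floor(log2(v)) for v > 0 */
static int ilog2_ul(unsigned long v)
{
	int r = -1;

	assert(v > 0);
	while (v) {
		v >>= 1;
		r++;
	}
	return r;
}

/* Models ANON_FOLIO_MAX_ORDER_UNHINTED: 64K, or PAGE_SIZE if that is larger. */
static int default_unhinted_cap(int page_shift)
{
	unsigned long page_size = 1UL << page_shift;
	unsigned long limit = page_size > 65536UL ? page_size : 65536UL;

	return ilog2_ul(limit) - page_shift;
}

/*
 * arch_order:   modelled arch preference (-1 means "no preference")
 * thp_eligible: modelled result of the THP eligibility check for the vma
 * unhinted_cap: hypothetical knob limiting the order for unhinted vmas
 */
static int model_anon_folio_order(int arch_order, bool thp_eligible,
				  int unhinted_cap)
{
	int order = arch_order > PAGE_ALLOC_COSTLY_ORDER ?
		    arch_order : PAGE_ALLOC_COSTLY_ORDER;

	if (!thp_eligible && order > unhinted_cap)
		order = unhinted_cap;

	return order;
}
```

With 64K base pages (page_shift = 16) the default cap works out to order-0, so an
arch preference of order-5 (2MB) yields order-5 for hinted vmas and order-0 for
unhinted ones. With 4K base pages the default cap is order-4, and an arch
preference of 0 or -1 both end up at order-3, which is the behaviour Yu
questioned. Setting the knob to 0 gives the "order-0 by default, madvise to opt
in" configuration.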


>>
>> So maybe we need the extra knob for both testing purpose and serving different
>> device configuration purpose.
>>
>>>>
>>>> Also arch_wants_pte_order() might need a better name like
>>>> arch_wants_large_folio_order(). Since current name sounds like the specified order
>>>> is wanted by HW in a general setting, but it is not. It is an order HW wants
>>>> when LAF is enabled. That might cause some confusion.
>>>>
>>>>>>
>>>>>> [1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@mail.gmail.com/
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Yan, Zi
>>
>>
>> --
>> Best Regards,
>> Yan, Zi


--
Best Regards,
Yan, Zi



2023-08-12 02:55:10

by Yin, Fengwei

Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance



On 8/11/2023 10:33 PM, Zi Yan wrote:
> On 11 Aug 2023, at 1:34, Yin, Fengwei wrote:
>
>> On 8/11/2023 9:04 AM, Zi Yan wrote:
>>> On 10 Aug 2023, at 20:36, Yin, Fengwei wrote:
>>>
>>>> On 8/11/2023 3:46 AM, Zi Yan wrote:
>>>>> On 10 Aug 2023, at 15:12, Ryan Roberts wrote:
>>>>>
>>>>>> On 10/08/2023 18:01, Yu Zhao wrote:
>>>>>>> On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>>>>>> allocated in large folios of a determined order. All pages of the large
>>>>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>>>>> counting, rmap management lru list management) are also significantly
>>>>>>>> reduced since those ops now become per-folio.
>>>>>>>>
>>>>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>>>>>> which defaults to disabled for now; The long term aim is for this to
>>>>>>>> default to enabled, but there are some risks around internal
>>>>>>>> fragmentation that need to be better understood first.
>>>>>>>>
>>>>>>>> Large anonymous folio (LAF) allocation is integrated with the existing
>>>>>>>> (PMD-order) THP and single (S) page allocation according to this policy,
>>>>>>>> where fallback (>) is performed for various reasons, such as the
>>>>>>>> proposed folio order not fitting within the bounds of the VMA, etc:
>>>>>>>>
>>>>>>>> | prctl=dis | prctl=ena | prctl=ena | prctl=ena
>>>>>>>> | sysfs=X | sysfs=never | sysfs=madvise | sysfs=always
>>>>>>>> ----------------|-----------|-------------|---------------|-------------
>>>>>>>> no hint | S | LAF>S | LAF>S | THP>LAF>S
>>>>>>>> MADV_HUGEPAGE | S | LAF>S | THP>LAF>S | THP>LAF>S
>>>>>>>> MADV_NOHUGEPAGE | S | S | S | S
>>>>>>>>
>>>>>>>> This approach ensures that we don't violate existing hints to only
>>>>>>>> allocate single pages - this is required for QEMU's VM live migration
>>>>>>>> implementation to work correctly - while allowing us to use LAF
>>>>>>>> independently of THP (when sysfs=never). This makes wide scale
>>>>>>>> performance characterization simpler, while avoiding exposing any new
>>>>>>>> ABI to user space.
>>>>>>>>
>>>>>>>> When using LAF for allocation, the folio order is determined as follows:
>>>>>>>> The return value of arch_wants_pte_order() is used. For vmas that have
>>>>>>>> not explicitly opted-in to use transparent hugepages (e.g. where
>>>>>>>> sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never),
>>>>>>>> then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever
>>>>>>>> is bigger). This allows for a performance boost without requiring any
>>>>>>>> explicit opt-in from the workload while limiting internal
>>>>>>>> fragmentation.
>>>>>>>>
>>>>>>>> If the preferred order can't be used (e.g. because the folio would
>>>>>>>> breach the bounds of the vma, or because ptes in the region are already
>>>>>>>> mapped) then we fall back to a suitable lower order; first
>>>>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>>>>>
>>>>>>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>>>>>>> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
>>>>>>>> set of ptes map physically contiguous, naturally aligned memory, so this
>>>>>>>> mechanism allows the architecture to optimize as required.
>>>>>>>>
>>>>>>>> Here we add the default implementation of arch_wants_pte_order(), used
>>>>>>>> when the architecture does not define it, which returns -1, implying
>>>>>>>> that the HW has no preference. In this case, mm will choose its own
>>>>>>>> default order.
>>>>>>>>
>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>> ---
>>>>>>>> include/linux/pgtable.h | 13 ++++
>>>>>>>> mm/Kconfig | 10 +++
>>>>>>>> mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++---
>>>>>>>> 3 files changed, 158 insertions(+), 9 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>> index 222a33b9600d..4b488cc66ddc 100644
>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void)
>>>>>>>> }
>>>>>>>> #endif
>>>>>>>>
>>>>>>>> +#ifndef arch_wants_pte_order
>>>>>>>> +/*
>>>>>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>>>>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>>>>>>>> + * to be at least order-2. Negative value implies that the HW has no preference
>>>>>>>> + * and mm will choose its own default order.
>>>>>>>> + */
>>>>>>>> +static inline int arch_wants_pte_order(void)
>>>>>>>> +{
>>>>>>>> + return -1;
>>>>>>>> +}
>>>>>>>> +#endif
>>>>>>>> +
>>>>>>>> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>>>>>>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>>>>>>> unsigned long address,
>>>>>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>>>>>> index 721dc88423c7..a1e28b8ddc24 100644
>>>>>>>> --- a/mm/Kconfig
>>>>>>>> +++ b/mm/Kconfig
>>>>>>>> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA
>>>>>>>>
>>>>>>>> source "mm/damon/Kconfig"
>>>>>>>>
>>>>>>>> +config LARGE_ANON_FOLIO
>>>>>>>> + bool "Allocate large folios for anonymous memory"
>>>>>>>> + depends on TRANSPARENT_HUGEPAGE
>>>>>>>> + default n
>>>>>>>> + help
>>>>>>>> + Use large (bigger than order-0) folios to back anonymous memory where
>>>>>>>> + possible, even for pte-mapped memory. This reduces the number of page
>>>>>>>> + faults, as well as other per-page overheads to improve performance for
>>>>>>>> + many workloads.
>>>>>>>> +
>>>>>>>> endmenu
>>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>>> index d003076b218d..bbc7d4ce84f7 100644
>>>>>>>> --- a/mm/memory.c
>>>>>>>> +++ b/mm/memory.c
>>>>>>>> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>>> return ret;
>>>>>>>> }
>>>>>>>>
>>>>>>>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>>>>>>>> +{
>>>>>>>> + int i;
>>>>>>>> +
>>>>>>>> + if (nr_pages == 1)
>>>>>>>> + return vmf_pte_changed(vmf);
>>>>>>>> +
>>>>>>>> + for (i = 0; i < nr_pages; i++) {
>>>>>>>> + if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>>>>>>>> + return true;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + return false;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +#ifdef CONFIG_LARGE_ANON_FOLIO
>>>>>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>>>>>>> + (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>>>>>> +
>>>>>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>>>>>> +{
>>>>>>>> + int order;
>>>>>>>> +
>>>>>>>> + /*
>>>>>>>> + * If the vma is eligible for thp, allocate a large folio of the size
>>>>>>>> + * preferred by the arch. Or if the arch requested a very small size or
>>>>>>>> + * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still
>>>>>>>> + * meets the arch's requirements but means we still take advantage of SW
>>>>>>>> + * optimizations (e.g. fewer page faults).
>>>>>>>> + *
>>>>>>>> + * If the vma isn't eligible for thp, take the arch-preferred size and
>>>>>>>> + * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads
>>>>>>>> + * that have not explicitly opted-in take benefit while capping the
>>>>>>>> + * potential for internal fragmentation.
>>>>>>>> + */
>>>>>>>> +
>>>>>>>> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>>>>>> +
>>>>>>>> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>>>>>> + order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>>>>>> +
>>>>>>>> + return order;
>>>>>>>> +}
>>>>>>>
>>>>>>> I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>>>>> 1. It's not used, since no archs at the moment implement
>>>>>>> arch_wants_pte_order() that returns >64KB.
>>>>>>> 2. As far as I know, there is no plan for any arch to do so.
>>>>>>
>>>>>> My rationale is that arm64 is planning to use this for contpte mapping 2MB
>>>>>> blocks for 16K and 64K kernels. But I think we will all agree that allowing 2MB
>>>>>> blocks without the proper THP hinting is a bad plan.
>>>>>>
>>>>>> As I see it, arches could add their own arch_wants_pte_order() at any time, and
>>>>>> just because the HW has a preference, doesn't mean the SW shouldn't get a say.
>>>>>> It's a negotiation between HW and SW for the LAF order, embodied in this policy.
>>>>>>
>>>>>>> 3. Again, it seems to me the rationale behind
>>>>>>> ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all.
>>>>>>>
>>>>>>> Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please?
>>>>>>>
>>>>>>> Also you made arch_wants_pte_order() return -1, and I acknowledged [1]:
>>>>>>> Thanks: -1 actually is better than 0 (what I suggested) for the
>>>>>>> obvious reason.
>>>>>>>
>>>>>>> I thought we were on the same page, i.e., the "obvious reason" is that
>>>>>>> h/w might prefer 0. But here you are not respecting 0. But then why
>>>>>>> -1?
>>>>>>
>>>>>> I agree that the "obvious reason" is that HW might prefer order-0. But the
>>>>>> performance wins don't come solely from the HW. Batching up page faults is a big
>>>>>> win for SW even if the HW doesn't benefit. So I think it is important that a HW
>>>>>> preference of order-0 is possible to express through this API. But that doesn't
>>>>>> mean that we don't listen to SW's preferences either.
>>>>>>
>>>>>> I would really rather leave it in; As I've mentioned in the past, we have a
>>>>>> partner who is actively keen to take advantage of 2MB blocks with 64K kernel and
>>>>>> this is the mechanism that means we don't dole out those 2MB blocks unless
>>>>>> explicitly opted-in.
>>>>>>
>>>>>> I'm going to be out on holiday for a couple of weeks, so we might have to wait
>>>>>> until I'm back to conclude on this, if you still take issue with the justification.
>>>>>
>>>>> From my understanding (correct me if I am wrong), Yu seems to want order-0 to be
>>>>> the default order even if LAF is enabled. But that does not make sense to me, since
>>>>> if LAF is configured to be enabled (it is disabled by default now), user (and distros)
>>>>> must think LAF is giving benefit. Otherwise, they will just disable LAF at compilation
>>>>> time or by using prctl. Enabling LAF and using order-0 as the default order makes
>>>>> most of LAF code not used.
>>>> For a device with limited memory that still wants LAF enabled for some specific
>>>> memory ranges, it's possible to enable LAF, keep order-0 as the default order and use
>>>> madvise to enable LAF for those ranges.
>>>
>>> Do you have a use case? Or it is just a possible scenario?
>> It's a possible scenario. In my experience, it's a valid use case for embedded
>> systems or low-end Android phones.
>>
>>>
>>> IIUC, Ryan has a concrete use case for his choice. For ARM64 with 16KB/64KB
>>> base pages, 2MB folios (LAF in this config) would be desirable since THP is
>>> 32MB/512MB and much harder to get.
>>>
>>>>
>>>> So my understanding is it's possible case. But it's another configuration thing and not
>>>> necessary to be finalized now.
>>>
>>> Basically, we are deciding whether LAF should use order-0 by default once it is
>>> compiled in to kernel. From your other email on ANON_FOLIO_MAX_ORDER_UNHINTED,
>>> your argument is that code change is needed to test the impact of LAF with
>>> different orders. That seems to imply we actually need an extra knob (maybe sysctl)
>>> to control the max LAF order. And with that extra knob, we can solve this default
>>> order problem, since we can set it to 0 for devices want to opt in LAF and set
>>> it N (like 64KB) for other devices want to opt out LAF.
>> From a performance tuning perspective, it's necessary to have knobs to configure and
>> check the attributes of LAF. But we must be careful about adding knobs, as they need
>> to be maintained forever.
>
> If we do not want to maintain such a knob (since it may take some time to finalize)
> and tweaking LAF order is important for us to explore different LAF configurations
> (Ryan thinks 64KB will perform well on ARM64, whereas Yu mentioned 16KB/32KB is
> better in his use cases), we probably just put the LAF order knob in debugfs
> like Ryan suggested before to move forward.
Works for me.

>
>
>>>
>>> So maybe we need the extra knob for both testing purpose and serving different
>>> device configuration purpose.
>>>
>>>>>
>>>>> Also arch_wants_pte_order() might need a better name like
>>>>> arch_wants_large_folio_order(). Since current name sounds like the specified order
>>>>> is wanted by HW in a general setting, but it is not. It is an order HW wants
>>>>> when LAF is enabled. That might cause some confusion.
>>>>>
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@mail.gmail.com/
>>>>>
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Yan, Zi
>>>
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
>
>
> --
> Best Regards,
> Yan, Zi

2023-08-19 12:33:01

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance

Hi, Ryan,

Ryan Roberts <[email protected]> writes:

> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
> allocated in large folios of a determined order. All pages of the large
> folio are pte-mapped during the same page fault, significantly reducing
> the number of page faults. The number of per-page operations (e.g. ref
> counting, rmap management, lru list management) are also significantly
> reduced since those ops now become per-folio.
>
> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
> which defaults to disabled for now; The long term aim is for this to
> default to enabled, but there are some risks around internal
> fragmentation that need to be better understood first.
>
> Large anonymous folio (LAF) allocation is integrated with the existing
> (PMD-order) THP and single (S) page allocation according to this policy,
> where fallback (>) is performed for various reasons, such as the
> proposed folio order not fitting within the bounds of the VMA, etc:
>
>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
> ----------------|-----------|-------------|---------------|-------------
> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
> MADV_NOHUGEPAGE | S         | S           | S             | S

IMHO, we should use the following semantics as you have suggested
before.

                | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
                | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
----------------|-----------|-------------|---------------|-------------
no hint         | S         | S           | LAF>S         | THP>LAF>S
MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
MADV_NOHUGEPAGE | S         | S           | S             | S

Or even,

                | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
                | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
----------------|-----------|-------------|---------------|-------------
no hint         | S         | S           | S             | THP>LAF>S
MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
MADV_NOHUGEPAGE | S         | S           | S             | S

From the implementation point of view, a PTE-mapped PMD-sized THP is
almost no different from LAF (which is just a smaller-sized THP). So it
would be confusing to distinguish them at the interface level.

So, IMHO, the real difference is the policy. For example, prefer
PMD-sized THP, prefer small sized THP, or fully auto. The sysfs
interface is used to specify system global policy. In the long term, it
can be something like below,

never:    S          # disable all THP
madvise:             # never by default, control via madvise()
always:   THP>LAF>S  # prefer PMD-sized THP in fact
small:    LAF>S      # prefer small sized THP
auto:                # use in-kernel heuristics for THP size

But it may be too early to add new policies now. So, until the new
policies are ready, we can add a debugfs interface to override the
original policy in /sys/kernel/mm/transparent_hugepage/enabled. After
we have tuned enough workloads and collected enough data, we can add new
policies to the sysfs interface.
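
As a rough illustration of the first table above, the proposed semantics
could be written as a small decision function. This is only a user-space
sketch: the enum names and alloc_policy() are hypothetical and are not
part of the patch or of any kernel interface.

```c
#include <stdbool.h>

/* Hypothetical names, used only for this illustration. */
enum sysfs_mode { SYSFS_NEVER, SYSFS_MADVISE, SYSFS_ALWAYS };
enum vma_hint { HINT_NONE, HINT_HUGEPAGE, HINT_NOHUGEPAGE };

/* Returns the fallback chain as text: "THP>LAF>S", "LAF>S" or "S". */
static const char *alloc_policy(bool prctl_enabled, enum sysfs_mode sysfs,
				enum vma_hint hint)
{
	/* prctl=dis, MADV_NOHUGEPAGE and sysfs=never all force single pages. */
	if (!prctl_enabled || hint == HINT_NOHUGEPAGE || sysfs == SYSFS_NEVER)
		return "S";

	/* sysfs=always, or an explicit MADV_HUGEPAGE: try PMD-sized THP first. */
	if (sysfs == SYSFS_ALWAYS || hint == HINT_HUGEPAGE)
		return "THP>LAF>S";

	/* sysfs=madvise with no hint: small-sized folios, then single pages. */
	return "LAF>S";
}
```

With this sketch, alloc_policy(true, SYSFS_MADVISE, HINT_NONE) gives
"LAF>S", and any call with SYSFS_NEVER gives "S". The second table would
differ only in returning "S" for that last case.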

--
Best Regards,
Huang, Ying

> This approach ensures that we don't violate existing hints to only
> allocate single pages - this is required for QEMU's VM live migration
> implementation to work correctly - while allowing us to use LAF
> independently of THP (when sysfs=never). This makes wide scale
> performance characterization simpler, while avoiding exposing any new
> ABI to user space.
>
> When using LAF for allocation, the folio order is determined as follows:
> The return value of arch_wants_pte_order() is used. For vmas that have
> not explicitly opted-in to use transparent hugepages (e.g. where
> sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never),
> then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever
> is bigger). This allows for a performance boost without requiring any
> explicit opt-in from the workload while limiting internal
> fragmentation.
>
> If the preferred order can't be used (e.g. because the folio would
> breach the bounds of the vma, or because ptes in the region are already
> mapped) then we fall back to a suitable lower order; first
> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>
> arch_wants_pte_order() can be overridden by the architecture if desired.
> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
> set of ptes map physically contiguous, naturally aligned memory, so this
> mechanism allows the architecture to optimize as required.
>
> Here we add the default implementation of arch_wants_pte_order(), used
> when the architecture does not define it, which returns -1, implying
> that the HW has no preference. In this case, mm will choose its own
> default order.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/pgtable.h | 13 ++++
> mm/Kconfig | 10 +++
> mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++---
> 3 files changed, 158 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 222a33b9600d..4b488cc66ddc 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void)
> }
> #endif
>
> +#ifndef arch_wants_pte_order
> +/*
> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> + * to be at least order-2. Negative value implies that the HW has no preference
> + * and mm will choose its own default order.
> + */
> +static inline int arch_wants_pte_order(void)
> +{
> +	return -1;
> +}
> +#endif
> +
> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> unsigned long address,
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 721dc88423c7..a1e28b8ddc24 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA
>
> source "mm/damon/Kconfig"
>
> +config LARGE_ANON_FOLIO
> +	bool "Allocate large folios for anonymous memory"
> +	depends on TRANSPARENT_HUGEPAGE
> +	default n
> +	help
> +	  Use large (bigger than order-0) folios to back anonymous memory where
> +	  possible, even for pte-mapped memory. This reduces the number of page
> +	  faults, as well as other per-page overheads to improve performance for
> +	  many workloads.
> +
> endmenu
> diff --git a/mm/memory.c b/mm/memory.c
> index d003076b218d..bbc7d4ce84f7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> return ret;
> }
>
> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
> +{
> +	int i;
> +
> +	if (nr_pages == 1)
> +		return vmf_pte_changed(vmf);
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +#ifdef CONFIG_LARGE_ANON_FOLIO
> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
> +		(ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
> +
> +static int anon_folio_order(struct vm_area_struct *vma)
> +{
> +	int order;
> +
> +	/*
> +	 * If the vma is eligible for thp, allocate a large folio of the size
> +	 * preferred by the arch. Or if the arch requested a very small size or
> +	 * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still
> +	 * meets the arch's requirements but means we still take advantage of SW
> +	 * optimizations (e.g. fewer page faults).
> +	 *
> +	 * If the vma isn't eligible for thp, take the arch-preferred size and
> +	 * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads
> +	 * that have not explicitly opted-in take benefit while capping the
> +	 * potential for internal fragmentation.
> +	 */
> +
> +	order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> +
> +	if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> +		order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
> +
> +	return order;
> +}
> +
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> +{
> +	int i;
> +	gfp_t gfp;
> +	pte_t *pte;
> +	unsigned long addr;
> +	struct folio *folio;
> +	struct vm_area_struct *vma = vmf->vma;
> +	int prefer = anon_folio_order(vma);
> +	int orders[] = {
> +		prefer,
> +		prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
> +		0,
> +	};
> +
> +	/*
> +	 * If uffd is active for the vma we need per-page fault fidelity to
> +	 * maintain the uffd semantics.
> +	 */
> +	if (userfaultfd_armed(vma))
> +		goto fallback;
> +
> +	/*
> +	 * If hugepages are explicitly disabled for the vma (either
> +	 * MADV_NOHUGEPAGE or prctl) fallback to order-0. Failure to do this
> +	 * breaks correctness for user space. We ignore the sysfs global knob.
> +	 */
> +	if (!hugepage_vma_check(vma, vma->vm_flags, false, true, false))
> +		goto fallback;
> +
> +	for (i = 0; orders[i]; i++) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> +		if (addr >= vma->vm_start &&
> +		    addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
> +			break;
> +	}
> +
> +	if (!orders[i])
> +		goto fallback;
> +
> +	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> +	if (!pte)
> +		return ERR_PTR(-EAGAIN);
> +
> +	for (; orders[i]; i++) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> +		vmf->pte = pte + pte_index(addr);
> +		if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
> +			break;
> +	}
> +
> +	vmf->pte = NULL;
> +	pte_unmap(pte);
> +
> +	gfp = vma_thp_gfp_mask(vma);
> +
> +	for (; orders[i]; i++) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> +		folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
> +		if (folio) {
> +			clear_huge_page(&folio->page, addr, 1 << orders[i]);
> +			return folio;
> +		}
> +	}
> +
> +fallback:
> +	return vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +}
> +#else
> +#define alloc_anon_folio(vmf) \
> +		vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
> +#endif
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -4080,6 +4197,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   */
>  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  {
> +	int i;
> +	int nr_pages = 1;
> +	unsigned long addr = vmf->address;
>  	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct folio *folio;
> @@ -4124,10 +4244,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	/* Allocate our own private page. */
>  	if (unlikely(anon_vma_prepare(vma)))
>  		goto oom;
> -	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +	folio = alloc_anon_folio(vmf);
> +	if (IS_ERR(folio))
> +		return 0;
>  	if (!folio)
>  		goto oom;
>
> +	nr_pages = folio_nr_pages(folio);
> +	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> +
>  	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>  		goto oom_free_page;
>  	folio_throttle_swaprate(folio, GFP_KERNEL);
> @@ -4144,12 +4269,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	if (vma->vm_flags & VM_WRITE)
>  		entry = pte_mkwrite(pte_mkdirty(entry));
>
> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> -			&vmf->ptl);
> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>  	if (!vmf->pte)
>  		goto release;
> -	if (vmf_pte_changed(vmf)) {
> -		update_mmu_tlb(vma, vmf->address, vmf->pte);
> +	if (vmf_pte_range_changed(vmf, nr_pages)) {
> +		for (i = 0; i < nr_pages; i++)
> +			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>  		goto release;
>  	}
>
> @@ -4164,16 +4289,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  		return handle_userfault(vmf, VM_UFFD_MISSING);
>  	}
>
> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -	folio_add_new_anon_rmap(folio, vma, vmf->address);
> +	folio_ref_add(folio, nr_pages - 1);
> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> +	folio_add_new_anon_rmap(folio, vma, addr);
>  	folio_add_lru_vma(folio, vma);
>  setpte:
>  	if (uffd_wp)
>  		entry = pte_mkuffd_wp(entry);
> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
>
>  	/* No need to invalidate - it was non-present before */
> -	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> +	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
>  unlock:
>  	if (vmf->pte)
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
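
For readers following the fallback logic, the first loop of the quoted
alloc_anon_folio() (the one that checks the vma bounds) can be modelled
in user space roughly as below. This is a sketch under assumptions, not
kernel code: it assumes 4K base pages, and pick_order(), align_down()
and MODEL_PAGE_SIZE are hypothetical names. The later steps in the patch
(checking that the ptes are still none, and retrying on allocation
failure) are not modelled.

```c
#define MODEL_PAGE_SHIFT 12			/* assumption: 4K base pages */
#define MODEL_PAGE_SIZE (1UL << MODEL_PAGE_SHIFT)
#define PAGE_ALLOC_COSTLY_ORDER 3		/* same value as in the kernel */

/* Round addr down to a multiple of align, which must be a power of two. */
static unsigned long align_down(unsigned long addr, unsigned long align)
{
	return addr & ~(align - 1);
}

/*
 * Walk the candidate orders {prefer, COSTLY-or-0, 0} and return the first
 * one whose naturally aligned block around fault_addr lies entirely inside
 * [vm_start, vm_end). Returns 0 (single page) if no large order fits.
 */
static int pick_order(unsigned long fault_addr, unsigned long vm_start,
		      unsigned long vm_end, int prefer)
{
	int orders[] = {
		prefer,
		prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
		0,	/* terminator: the loop stops at the first zero */
	};
	int i;

	for (i = 0; orders[i]; i++) {
		unsigned long size = MODEL_PAGE_SIZE << orders[i];
		unsigned long addr = align_down(fault_addr, size);

		if (addr >= vm_start && addr + size <= vm_end)
			break;
	}

	return orders[i];
}
```

For a vma covering [0x14000, 0x30000) and a preferred order of 4 (64K),
a fault at 0x19000 cannot use order 4 because the aligned block would
start at 0x10000, below vm_start, so the model falls back to order 3
(the 32K block at 0x18000). A fault at 0x15000 fits neither and ends up
at order-0.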

2023-08-30 22:13:45

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance

Sorry for the delay in responding (I've been out on holiday). Questions for Yu,
Zi and Yin below...


On 12/08/2023 01:23, Yin, Fengwei wrote:
>
>
> On 8/11/2023 10:33 PM, Zi Yan wrote:
>> On 11 Aug 2023, at 1:34, Yin, Fengwei wrote:
>>
>>> On 8/11/2023 9:04 AM, Zi Yan wrote:
>>>> On 10 Aug 2023, at 20:36, Yin, Fengwei wrote:
>>>>
>>>>> On 8/11/2023 3:46 AM, Zi Yan wrote:
>>>>>> On 10 Aug 2023, at 15:12, Ryan Roberts wrote:
>>>>>>
>>>>>>> On 10/08/2023 18:01, Yu Zhao wrote:
>>>>>>>> On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>>>>>>> allocated in large folios of a determined order. All pages of the large
>>>>>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>>>>>> counting, rmap management, lru list management) are also significantly
>>>>>>>>> reduced since those ops now become per-folio.
>>>>>>>>>
>>>>>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>>>>>>> which defaults to disabled for now; The long term aim is for this to
>>>>>>>>> default to enabled, but there are some risks around internal
>>>>>>>>> fragmentation that need to be better understood first.
>>>>>>>>>
>>>>>>>>> Large anonymous folio (LAF) allocation is integrated with the existing
>>>>>>>>> (PMD-order) THP and single (S) page allocation according to this policy,
>>>>>>>>> where fallback (>) is performed for various reasons, such as the
>>>>>>>>> proposed folio order not fitting within the bounds of the VMA, etc:
>>>>>>>>>
>>>>>>>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>>>>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>>>>>>> ----------------|-----------|-------------|---------------|-------------
>>>>>>>>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
>>>>>>>>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
>>>>>>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>>>>>>
>>>>>>>>> This approach ensures that we don't violate existing hints to only
>>>>>>>>> allocate single pages - this is required for QEMU's VM live migration
>>>>>>>>> implementation to work correctly - while allowing us to use LAF
>>>>>>>>> independently of THP (when sysfs=never). This makes wide scale
>>>>>>>>> performance characterization simpler, while avoiding exposing any new
>>>>>>>>> ABI to user space.
>>>>>>>>>
>>>>>>>>> When using LAF for allocation, the folio order is determined as follows:
>>>>>>>>> The return value of arch_wants_pte_order() is used. For vmas that have
>>>>>>>>> not explicitly opted-in to use transparent hugepages (e.g. where
>>>>>>>>> sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never),
>>>>>>>>> then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever
>>>>>>>>> is bigger). This allows for a performance boost without requiring any
>>>>>>>>> explicit opt-in from the workload while limiting internal
>>>>>>>>> fragmentation.
>>>>>>>>>
>>>>>>>>> If the preferred order can't be used (e.g. because the folio would
>>>>>>>>> breach the bounds of the vma, or because ptes in the region are already
>>>>>>>>> mapped) then we fall back to a suitable lower order; first
>>>>>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>>>>>>
>>>>>>>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>>>>>>>> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
>>>>>>>>> set of ptes map physically contiguous, naturally aligned memory, so this
>>>>>>>>> mechanism allows the architecture to optimize as required.
>>>>>>>>>
>>>>>>>>> Here we add the default implementation of arch_wants_pte_order(), used
>>>>>>>>> when the architecture does not define it, which returns -1, implying
>>>>>>>>> that the HW has no preference. In this case, mm will choose its own
>>>>>>>>> default order.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>> ---
>>>>>>>>> include/linux/pgtable.h | 13 ++++
>>>>>>>>> mm/Kconfig | 10 +++
>>>>>>>>> mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++---
>>>>>>>>> 3 files changed, 158 insertions(+), 9 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>> index 222a33b9600d..4b488cc66ddc 100644
>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void)
>>>>>>>>> }
>>>>>>>>> #endif
>>>>>>>>>
>>>>>>>>> +#ifndef arch_wants_pte_order
>>>>>>>>> +/*
>>>>>>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>>>>>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>>>>>>>>> + * to be at least order-2. Negative value implies that the HW has no preference
>>>>>>>>> + * and mm will choose its own default order.
>>>>>>>>> + */
>>>>>>>>> +static inline int arch_wants_pte_order(void)
>>>>>>>>> +{
>>>>>>>>> + return -1;
>>>>>>>>> +}
>>>>>>>>> +#endif
>>>>>>>>> +
>>>>>>>>> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>>>>>>>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>>>>>>>> unsigned long address,
>>>>>>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>>>>>>> index 721dc88423c7..a1e28b8ddc24 100644
>>>>>>>>> --- a/mm/Kconfig
>>>>>>>>> +++ b/mm/Kconfig
>>>>>>>>> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA
>>>>>>>>>
>>>>>>>>> source "mm/damon/Kconfig"
>>>>>>>>>
>>>>>>>>> +config LARGE_ANON_FOLIO
>>>>>>>>> + bool "Allocate large folios for anonymous memory"
>>>>>>>>> + depends on TRANSPARENT_HUGEPAGE
>>>>>>>>> + default n
>>>>>>>>> + help
>>>>>>>>> + Use large (bigger than order-0) folios to back anonymous memory where
>>>>>>>>> + possible, even for pte-mapped memory. This reduces the number of page
>>>>>>>>> + faults, as well as other per-page overheads to improve performance for
>>>>>>>>> + many workloads.
>>>>>>>>> +
>>>>>>>>> endmenu
>>>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>>>> index d003076b218d..bbc7d4ce84f7 100644
>>>>>>>>> --- a/mm/memory.c
>>>>>>>>> +++ b/mm/memory.c
>>>>>>>>> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>>>> return ret;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>>>>>>>>> +{
>>>>>>>>> + int i;
>>>>>>>>> +
>>>>>>>>> + if (nr_pages == 1)
>>>>>>>>> + return vmf_pte_changed(vmf);
>>>>>>>>> +
>>>>>>>>> + for (i = 0; i < nr_pages; i++) {
>>>>>>>>> + if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>>>>>>>>> + return true;
>>>>>>>>> + }
>>>>>>>>> +
>>>>>>>>> + return false;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +#ifdef CONFIG_LARGE_ANON_FOLIO
>>>>>>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>>>>>>>> + (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>>>>>>> +
>>>>>>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>>>>>>> +{
>>>>>>>>> + int order;
>>>>>>>>> +
>>>>>>>>> + /*
>>>>>>>>> + * If the vma is eligible for thp, allocate a large folio of the size
>>>>>>>>> + * preferred by the arch. Or if the arch requested a very small size or
>>>>>>>>> + * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still
>>>>>>>>> + * meets the arch's requirements but means we still take advantage of SW
>>>>>>>>> + * optimizations (e.g. fewer page faults).
>>>>>>>>> + *
>>>>>>>>> + * If the vma isn't eligible for thp, take the arch-preferred size and
>>>>>>>>> + * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads
>>>>>>>>> + * that have not explicitly opted-in take benefit while capping the
>>>>>>>>> + * potential for internal fragmentation.
>>>>>>>>> + */
>>>>>>>>> +
>>>>>>>>> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>>>>>>> +
>>>>>>>>> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>>>>>>> + order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>>>>>>> +
>>>>>>>>> + return order;
>>>>>>>>> +}
>>>>>>>>
>>>>>>>> I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>>>>>> 1. It's not used, since no archs at the moment implement
>>>>>>>> arch_wants_pte_order() that returns >64KB.
>>>>>>>> 2. As far as I know, there is no plan for any arch to do so.
>>>>>>>
>>>>>>> My rationale is that arm64 is planning to use this for contpte mapping 2MB
>>>>>>> blocks for 16K and 64K kernels. But I think we will all agree that allowing 2MB
>>>>>>> blocks without the proper THP hinting is a bad plan.
>>>>>>>
>>>>>>> As I see it, arches could add their own arch_wants_pte_order() at any time, and
>>>>>>> just because the HW has a preference, doesn't mean the SW shouldn't get a say.
>>>>>>> It's a negotiation between HW and SW for the LAF order, embodied in this policy.

Yu, I never saw a reply to this. Have I managed to convince you? I'm willing to
put the vma param back into arch_wants_pte_order() and handle the policy in the
arch, if you consider that a less bad solution.
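
To make the negotiation concrete, here is a user-space sketch of what
anon_folio_order() in this version computes. It is a model under
assumptions rather than the kernel code: page_shift and thp_eligible are
passed in as plain parameters (the patch uses PAGE_SHIFT and
hugepage_vma_check()), and ilog2_ul(), max_order_unhinted() and
model_anon_folio_order() are hypothetical helper names.

```c
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3	/* same value as in the kernel */
#define SZ_64K 0x10000UL

/* floor(log2(v)) for v > 0; stands in for the kernel's ilog2(). */
static int ilog2_ul(unsigned long v)
{
	int r = -1;

	while (v) {
		v >>= 1;
		r++;
	}
	return r;
}

/* Models ANON_FOLIO_MAX_ORDER_UNHINTED for a given base page shift. */
static int max_order_unhinted(int page_shift)
{
	unsigned long page_size = 1UL << page_shift;
	unsigned long cap = page_size > SZ_64K ? page_size : SZ_64K;

	return ilog2_ul(cap) - page_shift;
}

/* Models anon_folio_order(): HW preference first, then the SW cap. */
static int model_anon_folio_order(int arch_order, bool thp_eligible,
				  int page_shift)
{
	int order = arch_order > PAGE_ALLOC_COSTLY_ORDER ?
		    arch_order : PAGE_ALLOC_COSTLY_ORDER;
	int cap = max_order_unhinted(page_shift);

	/* vmas that have not opted in to THP are limited to 64K. */
	if (!thp_eligible && order > cap)
		order = cap;

	return order;
}
```

As an example, assume an arch asked for 2MB blocks, i.e. order 7 with
16K pages or order 5 with 64K pages (these arch values are assumptions
for illustration; no arch returns them today). A THP-eligible vma would
get order 7 or 5 respectively, while an unhinted vma would be capped at
order 2 (64K) with 16K pages and at order 0 with 64K pages. With 4K
pages and the default -1, both cases end up at order 3 (32K).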

>>>>>>>
>>>>>>>> 3. Again, it seems to me the rationale behind
>>>>>>>> ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all.
>>>>>>>>
>>>>>>>> Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please?
>>>>>>>>
>>>>>>>> Also you made arch_wants_pte_order() return -1, and I acknowledged [1]:
>>>>>>>> Thanks: -1 actually is better than 0 (what I suggested) for the
>>>>>>>> obvious reason.
>>>>>>>>
>>>>>>>> I thought we were on the same page, i.e., the "obvious reason" is that
>>>>>>>> h/w might prefer 0. But here you are not respecting 0. But then why
>>>>>>>> -1?
>>>>>>>
>>>>>>> I agree that the "obvious reason" is that HW might prefer order-0. But the
>>>>>>> performance wins don't come solely from the HW. Batching up page faults is a big
>>>>>>> win for SW even if the HW doesn't benefit. So I think it is important that a HW
>>>>>>> preference of order-0 is possible to express through this API. But that doesn't
>>>>>>> mean that we don't listen to SW's preferences either.
>>>>>>>
>>>>>>> I would really rather leave it in; As I've mentioned in the past, we have a
>>>>>>> partner who is actively keen to take advantage of 2MB blocks with 64K kernel and
>>>>>>> this is the mechanism that means we don't dole out those 2MB blocks unless
>>>>>>> explicitly opted-in.

Yu, would appreciate any comments here.

>>>>>>>
>>>>>>> I'm going to be out on holiday for a couple of weeks, so we might have to wait
>>>>>>> until I'm back to conclude on this, if you still take issue with the justification.
>>>>>>
>>>>>> From my understanding (correct me if I am wrong), Yu seems to want order-0 to be
>>>>>> the default order even if LAF is enabled.

Zi, I think you are incorrect; Yu does not want order-0 to be the default. He's
just pointing out that the original "default return value that actually means
PAGE_ALLOC_COSTLY_ORDER" was 0, and that was not an ideal choice because 0
_could_ be a legitimate preference from the HW. So -1 is preferred for this
purpose. Yu - correct me if I'm wrong!
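
To make the -1 versus 0 distinction concrete, here is a minimal userspace
sketch of the negotiation as described in this thread. It is not the code from
the series: laf_order() and its arch_order parameter are hypothetical stand-ins
for the caller of arch_wants_pte_order(), and the value given to
ANON_FOLIO_MAX_ORDER_UNHINTED assumes a 64KB cap with 4KB base pages.

```c
#include <stdbool.h>

/* Matches the kernel's definition in include/linux/mmzone.h. */
#define PAGE_ALLOC_COSTLY_ORDER 3

/* Assumption for illustration: 64KB cap with 4KB base pages => order 4. */
#define ANON_FOLIO_MAX_ORDER_UNHINTED 4

/*
 * Hypothetical helper: arch_order is what the arch hook would return.
 *   -1 : the HW has no preference, so fall back to PAGE_ALLOC_COSTLY_ORDER
 *    0 : the HW genuinely prefers single pages, which must be expressible
 * thp_hinted says whether the VMA is eligible for THP (e.g. MADV_HUGEPAGE).
 */
static int laf_order(int arch_order, bool thp_hinted)
{
	int order = arch_order;

	if (order < 0)
		order = PAGE_ALLOC_COSTLY_ORDER;

	/* Without a hint, SW caps whatever the HW asked for. */
	if (!thp_hinted && order > ANON_FOLIO_MAX_ORDER_UNHINTED)
		order = ANON_FOLIO_MAX_ORDER_UNHINTED;

	return order;
}
```

With this shape, a HW preference of 0 is respected as-is, while "no
preference" and "prefers order-0" remain distinguishable.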

>>>>>> But that does not make sense to me, since
>>>>>> if LAF is configured to be enabled (it is disabled by default now), user (and distros)
>>>>>> must think LAF is giving benefit. Otherwise, they will just disable LAF at compilation
>>>>>> time or by using prctl. Enabling LAF and using order-0 as the default order makes
>>>>>> most of LAF code not used.
>>>>> For the device with limited memory size and it still wants LAF enabled for some specific
>>>>> memory ranges, it's possible the LAF is enabled, order-0 as default order and use madvise
>>>>> to enable LAF for specific memory ranges.
>>>>
>>>> Do you have a use case? Or it is just a possible scenario?
>>> It's a possible scenario. Per my experience, it's valid use case for embedded
>>> system or low end android phone.
>>>
>>>>
>>>> IIUC, Ryan has a concrete use case for his choice. For ARM64 with 16KB/64KB
>>>> base pages, 2MB folios (LAF in this config) would be desirable since THP is
>>>> 32MB/512MB and much harder to get.

Yes, I have a real use case for my choice. But as I said above, I'm willing to
move that policy into the arch impl of arch_wants_pte_order() if it's acceptable
to pass the vma in (this is how I was doing it in the original version, but the
preference was to remove the parameter).
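
For anyone checking the sizes quoted above, the arithmetic can be sketched as
follows. This is only an illustration; it assumes 8-byte page table descriptors
(so one page-sized table holds page_size / 8 entries), and both helper names
are made up for the example.

```c
#include <stdint.h>

/*
 * Bytes mapped by one PMD entry for a given base page shift, assuming
 * 8-byte descriptors: a table has 2^(page_shift - 3) entries, each
 * mapping one base page.
 */
static uint64_t pmd_size_bytes(unsigned int page_shift)
{
	return (uint64_t)1 << (page_shift + (page_shift - 3));
}

/*
 * Order of a folio of folio_bytes bytes for a given base page shift.
 * Both sizes are assumed to be powers of two.
 */
static unsigned int folio_order(uint64_t folio_bytes, unsigned int page_shift)
{
	unsigned int order = 0;

	while (((uint64_t)1 << (page_shift + order)) < folio_bytes)
		order++;
	return order;
}
```

So PMD-sized THP is 2MB with 4KB pages, 32MB with 16KB pages and 512MB with
64KB pages, whereas a 2MB folio is only order-7 on a 16KB kernel and order-5
on a 64KB kernel.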

>>>>
>>>>>
>>>>> So my understanding is it's possible case. But it's another configuration thing and not
>>>>> necessary to be finalized now.
>>>>
>>>> Basically, we are deciding whether LAF should use order-0 by default once it is
>>>> compiled in to kernel. From your other email on ANON_FOLIO_MAX_ORDER_UNHINTED,
>>>> your argument is that code change is needed to test the impact of LAF with
>>>> different orders. That seems to imply we actually need an extra knob (maybe sysctl)
>>>> to control the max LAF order. And with that extra knob, we can solve this default
>>>> order problem, since we can set it to 0 for devices want to opt in LAF and set
>>>> it N (like 64KB) for other devices want to opt out LAF.
>>> From performance tuning perspective, it's necessary to have knobs to configure and
>>> check the attribute of LAF. But we must be careful to add the knobs as they need
>>> be maintained for ever.
>>
>> If we do not want to maintain such a knob (since it may take some time to finalize)
>> and tweaking LAF order is important for us to explore different LAF configurations
>> (Ryan thinks 64KB will perform well on ARM64, whereas Yu mentioned 16KB/32KB is
>> better in his use cases), we probably just put the LAF order knob in debugfs
>> like Ryan suggested before to move forward.
> Works for me.

I would really rather avoid adding any knob for now if we possibly can. We have
discussed this in the past and concluded we should avoid one. It was also raised
that if we do add a knob, then debugfs is not sufficient because you can't
access it in some environments.

>
>>
>>
>>>>
>>>> So maybe we need the extra knob for both testing purpose and serving different
>>>> device configuration purpose.
>>>>
>>>>>>
>>>>>> Also arch_wants_pte_order() might need a better name like
>>>>>> arch_wants_large_folio_order(). Since current name sounds like the specified order
>>>>>> is wanted by HW in a general setting, but it is not. It is an order HW wants
>>>>>> when LAF is enabled. That might cause some confusion.

Personally I don't think it makes much difference. "large folio" does not make
it clear that it's for pte-mapped memory only. How about
arch_prefers_pte_order(), if it really must be changed?

>>>>>>
>>>>>>>>
>>>>>>>> [1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@mail.gmail.com/
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Yan, Zi
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Yan, Zi
>>
>>
>> --
>> Best Regards,
>> Yan, Zi


2023-08-30 22:16:58

by Ryan Roberts

Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance

On 15/08/2023 22:32, Huang, Ying wrote:
> Hi, Ryan,
>
> Ryan Roberts <[email protected]> writes:
>
>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>> allocated in large folios of a determined order. All pages of the large
>> folio are pte-mapped during the same page fault, significantly reducing
>> the number of page faults. The number of per-page operations (e.g. ref
>> counting, rmap management lru list management) are also significantly
>> reduced since those ops now become per-folio.
>>
>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>> which defaults to disabled for now; The long term aim is for this to
>> defaut to enabled, but there are some risks around internal
>> fragmentation that need to be better understood first.
>>
>> Large anonymous folio (LAF) allocation is integrated with the existing
>> (PMD-order) THP and single (S) page allocation according to this policy,
>> where fallback (>) is performed for various reasons, such as the
>> proposed folio order not fitting within the bounds of the VMA, etc:
>>
>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>> ----------------|-----------|-------------|---------------|-------------
>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
>> MADV_NOHUGEPAGE | S         | S           | S             | S
>
> IMHO, we should use the following semantics as you have suggested
> before.
>
>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
> ----------------|-----------|-------------|---------------|-------------
> no hint         | S         | S           | LAF>S         | THP>LAF>S
> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
> MADV_NOHUGEPAGE | S         | S           | S             | S
>
> Or even,
>
>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
> ----------------|-----------|-------------|---------------|-------------
> no hint         | S         | S           | S             | THP>LAF>S
> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
> MADV_NOHUGEPAGE | S         | S           | S             | S
>
> From the implementation point of view, PTE mapped PMD-sized THP has
> almost no difference with LAF (just some small sized THP). It will be
> confusing to distinguish them from the interface point of view.
>
> So, IMHO, the real difference is the policy. For example, prefer
> PMD-sized THP, prefer small sized THP, or fully auto. The sysfs
> interface is used to specify system global policy. In the long term, it
> can be something like below,
>
> never:      S               # disable all THP
> madvise:                    # never by default, control via madvise()
> always:     THP>LAF>S       # prefer PMD-sized THP in fact
> small:      LAF>S           # prefer small sized THP
> auto:                       # use in-kernel heuristics for THP size
>
> But it may be not ready to add new policies now. So, before the new
> policies are ready, we can add a debugfs interface to override the
> original policy in /sys/kernel/mm/transparent_hugepage/enabled. After
> we have tuned enough workloads, collected enough data, we can add new
> policies to the sysfs interface.

I think we can all imagine many policy options. But we don't really have much
evidence yet for what is best. The policy I'm currently using is intended to
give some flexibility for testing (use LAF without THP by setting sysfs=never,
use THP without LAF by compiling without LAF) without adding any new knobs at
all. Given that, surely we can defer these decisions until we have more data?
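
To spell out the current policy table from the commit message, here is a
minimal sketch of the lookup it encodes. It is an illustration only, not the
implementation in the series; the enum and function names are invented for the
example, and prctl_enabled simply stands for the "prctl=ena" columns.

```c
#include <stdbool.h>

/* Value selected in /sys/kernel/mm/transparent_hugepage/enabled. */
enum thp_sysfs { SYSFS_NEVER, SYSFS_MADVISE, SYSFS_ALWAYS };

/* Per-VMA madvise() hint. */
enum vma_hint { HINT_NONE, HINT_HUGEPAGE, HINT_NOHUGEPAGE };

/* Fallback chains from the table: S, LAF>S, THP>LAF>S. */
enum fault_policy { POLICY_S, POLICY_LAF_S, POLICY_THP_LAF_S };

/* Hypothetical helper returning the table cell for one combination. */
static enum fault_policy anon_fault_policy(bool prctl_enabled,
					   enum thp_sysfs sysfs,
					   enum vma_hint hint)
{
	/* prctl=dis column and MADV_NOHUGEPAGE row: single pages only. */
	if (!prctl_enabled || hint == HINT_NOHUGEPAGE)
		return POLICY_S;

	/* PMD-sized THP is tried first when sysfs or the hint allows it. */
	if (sysfs == SYSFS_ALWAYS)
		return POLICY_THP_LAF_S;
	if (sysfs == SYSFS_MADVISE && hint == HINT_HUGEPAGE)
		return POLICY_THP_LAF_S;

	/* Everything else gets LAF with fallback to single pages. */
	return POLICY_LAF_S;
}
```

Note that sysfs=never still yields LAF>S here, which is what allows testing
LAF without THP and without a new knob.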

In the absence of data, your proposed solution sounds very sensible to me. But
for the purposes of scaling up perf testing, I don't think it's essential given
the current policy will also produce the same options.

If we were going to add a debugfs knob, I think the higher priority would be a
knob to specify the folio order. (but again, I would rather avoid if possible).

Thanks,
Ryan



2023-08-31 13:45:50

by David Hildenbrand

Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance

On 31.08.23 10:02, Yin, Fengwei wrote:
>
>
> On 8/31/2023 3:57 PM, David Hildenbrand wrote:
>> On 31.08.23 03:40, Huang, Ying wrote:
>>> Ryan Roberts <[email protected]> writes:
>>>
>>>> On 15/08/2023 22:32, Huang, Ying wrote:
>>>>> Hi, Ryan,
>>>>>
>>>>> Ryan Roberts <[email protected]> writes:
>>>>>
>>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>>>> allocated in large folios of a determined order. All pages of the large
>>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>>> counting, rmap management lru list management) are also significantly
>>>>>> reduced since those ops now become per-folio.
>>>>>>
>>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>>>> which defaults to disabled for now; The long term aim is for this to
>>>>>> defaut to enabled, but there are some risks around internal
>>>>>> fragmentation that need to be better understood first.
>>>>>>
>>>>>> Large anonymous folio (LAF) allocation is integrated with the existing
>>>>>> (PMD-order) THP and single (S) page allocation according to this policy,
>>>>>> where fallback (>) is performed for various reasons, such as the
>>>>>> proposed folio order not fitting within the bounds of the VMA, etc:
>>>>>>
>>>>>>                  | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>>>                  | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>>>> ----------------|-----------|-------------|---------------|-------------
>>>>>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
>>>>>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
>>>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>>
>>>>> IMHO, we should use the following semantics as you have suggested
>>>>> before.
>>>>>
>>>>>                  | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>>                  | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>>> ----------------|-----------|-------------|---------------|-------------
>>>>> no hint         | S         | S           | LAF>S         | THP>LAF>S
>>>>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
>>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>>
>>>>> Or even,
>>>>>
>>>>>                  | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>>                  | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>>> ----------------|-----------|-------------|---------------|-------------
>>>>> no hint         | S         | S           | S             | THP>LAF>S
>>>>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
>>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>>
>>>>>  From the implementation point of view, PTE mapped PMD-sized THP has
>>>>> almost no difference with LAF (just some small sized THP).  It will be
>>>>> confusing to distinguish them from the interface point of view.
>>>>>
>>>>> So, IMHO, the real difference is the policy.  For example, prefer
>>>>> PMD-sized THP, prefer small sized THP, or fully auto.  The sysfs
>>>>> interface is used to specify system global policy.  In the long term, it
>>>>> can be something like below,
>>>>>
>>>>> never:      S               # disable all THP
>>>>> madvise:                    # never by default, control via madvise()
>>>>> always:     THP>LAF>S       # prefer PMD-sized THP in fact
>>>>> small:      LAF>S           # prefer small sized THP
>>>>> auto:                       # use in-kernel heuristics for THP size
>>>>>
>>>>> But it may be not ready to add new policies now.  So, before the new
>>>>> policies are ready, we can add a debugfs interface to override the
>>>>> original policy in /sys/kernel/mm/transparent_hugepage/enabled.  After
>>>>> we have tuned enough workloads, collected enough data, we can add new
>>>>> policies to the sysfs interface.
>>>>
>>>> I think we can all imagine many policy options. But we don't really have much
>>>> evidence yet for what it best. The policy I'm currently using is intended to
>>>> give some flexibility for testing (use LAF without THP by setting sysfs=never,
>>>> use THP without LAF by compiling without LAF) without adding any new knobs at
>>>> all. Given that, surely we can defer these decisions until we have more data?
>>>>
>>>> In the absence of data, your proposed solution sounds very sensible to me. But
>>>> for the purposes of scaling up perf testing, I don't think its essential given
>>>> the current policy will also produce the same options.
>>>>
>>>> If we were going to add a debugfs knob, I think the higher priority would be a
>>>> knob to specify the folio order. (but again, I would rather avoid if possible).
>>>
>>> I totally understand we need some way to control PMD-sized THP and LAF
>>> to tune the workload, and nobody likes debugfs knob.
>>>
>>> My concern about interface is that we have no way to disable LAF
>>> system-wise without rebuilding the kernel.  In the future, should we add
>>> a new policy to /sys/kernel/mm/transparent_hugepage/enabled to be
>>> stricter than "never"?  "really_never"?
>>
>> Let's talk about that in a bi-weekly MM session. (I proposed it as a topic for next week).
>
> The time slot of the meeting is not friendly to our timezone. Like
> it's 1 or 2 AM. Yes. I know it's very hard to find a good time slot
> for US, EU and Asia. :(.

:/

Yeah, even for me in Germany it's usually already around 6-7pm.

>
> So maybe we still need to discuss it through mail?
I don't think we'll be done discussing that in one session. One of the
main goals is to get some input from the wider MM community.

--
Cheers,

David / dhildenb


2023-09-01 15:24:21

by Yin, Fengwei

Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance



On 8/31/2023 3:57 PM, David Hildenbrand wrote:
> On 31.08.23 03:40, Huang, Ying wrote:
>> Ryan Roberts <[email protected]> writes:
>>
>>> On 15/08/2023 22:32, Huang, Ying wrote:
>>>> Hi, Ryan,
>>>>
>>>> Ryan Roberts <[email protected]> writes:
>>>>
>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>>> allocated in large folios of a determined order. All pages of the large
>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>> counting, rmap management lru list management) are also significantly
>>>>> reduced since those ops now become per-folio.
>>>>>
>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>>> which defaults to disabled for now; The long term aim is for this to
>>>>> defaut to enabled, but there are some risks around internal
>>>>> fragmentation that need to be better understood first.
>>>>>
>>>>> Large anonymous folio (LAF) allocation is integrated with the existing
>>>>> (PMD-order) THP and single (S) page allocation according to this policy,
>>>>> where fallback (>) is performed for various reasons, such as the
>>>>> proposed folio order not fitting within the bounds of the VMA, etc:
>>>>>
>>>>>                  | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>>                  | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>>> ----------------|-----------|-------------|---------------|-------------
>>>>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
>>>>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
>>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>
>>>> IMHO, we should use the following semantics as you have suggested
>>>> before.
>>>>
>>>>                  | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>                  | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>> ----------------|-----------|-------------|---------------|-------------
>>>> no hint         | S         | S           | LAF>S         | THP>LAF>S
>>>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>
>>>> Or even,
>>>>
>>>>                  | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>                  | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>> ----------------|-----------|-------------|---------------|-------------
>>>> no hint         | S         | S           | S             | THP>LAF>S
>>>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>
>>>>  From the implementation point of view, PTE mapped PMD-sized THP has
>>>> almost no difference with LAF (just some small sized THP).  It will be
>>>> confusing to distinguish them from the interface point of view.
>>>>
>>>> So, IMHO, the real difference is the policy.  For example, prefer
>>>> PMD-sized THP, prefer small sized THP, or fully auto.  The sysfs
>>>> interface is used to specify system global policy.  In the long term, it
>>>> can be something like below,
>>>>
>>>> never:      S               # disable all THP
>>>> madvise:                    # never by default, control via madvise()
>>>> always:     THP>LAF>S       # prefer PMD-sized THP in fact
>>>> small:      LAF>S           # prefer small sized THP
>>>> auto:                       # use in-kernel heuristics for THP size
>>>>
>>>> But it may be not ready to add new policies now.  So, before the new
>>>> policies are ready, we can add a debugfs interface to override the
>>>> original policy in /sys/kernel/mm/transparent_hugepage/enabled.  After
>>>> we have tuned enough workloads, collected enough data, we can add new
>>>> policies to the sysfs interface.
>>>
>>> I think we can all imagine many policy options. But we don't really have much
>>> evidence yet for what it best. The policy I'm currently using is intended to
>>> give some flexibility for testing (use LAF without THP by setting sysfs=never,
>>> use THP without LAF by compiling without LAF) without adding any new knobs at
>>> all. Given that, surely we can defer these decisions until we have more data?
>>>
>>> In the absence of data, your proposed solution sounds very sensible to me. But
>>> for the purposes of scaling up perf testing, I don't think its essential given
>>> the current policy will also produce the same options.
>>>
>>> If we were going to add a debugfs knob, I think the higher priority would be a
>>> knob to specify the folio order. (but again, I would rather avoid if possible).
>>
>> I totally understand we need some way to control PMD-sized THP and LAF
>> to tune the workload, and nobody likes debugfs knob.
>>
>> My concern about interface is that we have no way to disable LAF
>> system-wise without rebuilding the kernel.  In the future, should we add
>> a new policy to /sys/kernel/mm/transparent_hugepage/enabled to be
>> stricter than "never"?  "really_never"?
>
> Let's talk about that in a bi-weekly MM session. (I proposed it as a topic for next week).

The time slot of the meeting is not friendly to our timezone. Like
it's 1 or 2 AM. Yes. I know it's very hard to find a good time slot
for US, EU and Asia. :(.

So maybe we still need to discuss it through mail?


Regards
Yin, Fengwei

>
> As raised in another mail, we can then discuss
> * how we want to call this feature (transparent large pages? there is
>   the concern that "THP" might confuse users. Maybe we can consider
>   "large" the more generic version and "huge" only PMD-size, TBD)
> * how to expose it in stats towards the user (e.g., /proc/meminfo)
> * which minimal toggles we want
>
> I think there *really* has to be a way to disable it for a running system, otherwise no distro will dare pulling it in, even after we figured out the other stuff.
>
> Note that for the pagecache, large folios can be disabled and distributions are actively making use of that.
>

2023-09-03 13:46:18

by Yang Shi

Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance

On Thu, Aug 31, 2023 at 12:57 AM David Hildenbrand <[email protected]> wrote:
>
> On 31.08.23 03:40, Huang, Ying wrote:
> > Ryan Roberts <[email protected]> writes:
> >
> >> On 15/08/2023 22:32, Huang, Ying wrote:
> >>> Hi, Ryan,
> >>>
> >>> Ryan Roberts <[email protected]> writes:
> >>>
> >>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
> >>>> allocated in large folios of a determined order. All pages of the large
> >>>> folio are pte-mapped during the same page fault, significantly reducing
> >>>> the number of page faults. The number of per-page operations (e.g. ref
> >>>> counting, rmap management lru list management) are also significantly
> >>>> reduced since those ops now become per-folio.
> >>>>
> >>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
> >>>> which defaults to disabled for now; The long term aim is for this to
> >>>> defaut to enabled, but there are some risks around internal
> >>>> fragmentation that need to be better understood first.
> >>>>
> >>>> Large anonymous folio (LAF) allocation is integrated with the existing
> >>>> (PMD-order) THP and single (S) page allocation according to this policy,
> >>>> where fallback (>) is performed for various reasons, such as the
> >>>> proposed folio order not fitting within the bounds of the VMA, etc:
> >>>>
> >>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
> >>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
> >>>> ----------------|-----------|-------------|---------------|-------------
> >>>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
> >>>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
> >>>> MADV_NOHUGEPAGE | S         | S           | S             | S
> >>>
> >>> IMHO, we should use the following semantics as you have suggested
> >>> before.
> >>>
> >>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
> >>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
> >>> ----------------|-----------|-------------|---------------|-------------
> >>> no hint         | S         | S           | LAF>S         | THP>LAF>S
> >>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
> >>> MADV_NOHUGEPAGE | S         | S           | S             | S
> >>>
> >>> Or even,
> >>>
> >>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
> >>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
> >>> ----------------|-----------|-------------|---------------|-------------
> >>> no hint         | S         | S           | S             | THP>LAF>S
> >>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
> >>> MADV_NOHUGEPAGE | S         | S           | S             | S
> >>>
> >>> From the implementation point of view, PTE mapped PMD-sized THP has
> >>> almost no difference with LAF (just some small sized THP). It will be
> >>> confusing to distinguish them from the interface point of view.
> >>>
> >>> So, IMHO, the real difference is the policy. For example, prefer
> >>> PMD-sized THP, prefer small sized THP, or fully auto. The sysfs
> >>> interface is used to specify system global policy. In the long term, it
> >>> can be something like below,
> >>>
> >>> never:      S               # disable all THP
> >>> madvise:                    # never by default, control via madvise()
> >>> always:     THP>LAF>S       # prefer PMD-sized THP in fact
> >>> small:      LAF>S           # prefer small sized THP
> >>> auto:                       # use in-kernel heuristics for THP size
> >>>
> >>> But it may be not ready to add new policies now. So, before the new
> >>> policies are ready, we can add a debugfs interface to override the
> >>> original policy in /sys/kernel/mm/transparent_hugepage/enabled. After
> >>> we have tuned enough workloads, collected enough data, we can add new
> >>> policies to the sysfs interface.
> >>
> >> I think we can all imagine many policy options. But we don't really have much
> >> evidence yet for what it best. The policy I'm currently using is intended to
> >> give some flexibility for testing (use LAF without THP by setting sysfs=never,
> >> use THP without LAF by compiling without LAF) without adding any new knobs at
> >> all. Given that, surely we can defer these decisions until we have more data?
> >>
> >> In the absence of data, your proposed solution sounds very sensible to me. But
> >> for the purposes of scaling up perf testing, I don't think its essential given
> >> the current policy will also produce the same options.
> >>
> >> If we were going to add a debugfs knob, I think the higher priority would be a
> >> knob to specify the folio order. (but again, I would rather avoid if possible).
> >
> > I totally understand we need some way to control PMD-sized THP and LAF
> > to tune the workload, and nobody likes debugfs knob.
> >
> > My concern about interface is that we have no way to disable LAF
> > system-wise without rebuilding the kernel. In the future, should we add
> > a new policy to /sys/kernel/mm/transparent_hugepage/enabled to be
> > stricter than "never"? "really_never"?
>
> Let's talk about that in a bi-weekly MM session. (I proposed it as a
> topic for next week).
>
> As raised in another mail, we can then discuss
> * how we want to call this feature (transparent large pages? there is
> the concern that "THP" might confuse users. Maybe we can consider
> "large" the more generic version and "huge" only PMD-size, TBD)

I tend to agree. "Huge" means PMD-mappable (transparent or HugeTLB),
"Large" means any order but less than PMD-mappable order, "Gigantic"
means PUD mappable. This should incur the least confusion IMHO.

> * how to expose it in stats towards the user (e.g., /proc/meminfo)

I recall suggesting new statistics for each order, but that was NAK'ed.

> * which minimal toggles we want
>
> I think there *really* has to be a way to disable it for a running
> system, otherwise no distro will dare pulling it in, even after we
> figured out the other stuff.

TBH I really don't like to tie large folio to THP toggles. THP
(PMD-mappable) is just a special case of LAF. The large folio should
be tried whenever it is possible ideally. But I do agree we may not be
able to achieve the ideal case at the time being, and also understand
the concern about regression in early adoption, so a knob that can
disable large folio may be needed for now. But it should be just a
simple binary knob (on/off), and should not be a part of kernel ABI
(temporary and debugging only) IMHO.

One more thing we may discuss is whether huge page madvise APIs should
take effect for large folio or not.

>
> Note that for the pagecache, large folios can be disabled and
> distributions are actively making use of that.
>
> --
> Cheers,
>
> David / dhildenb
>

2023-09-04 18:20:08

by Ryan Roberts

Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance

On 01/09/2023 18:18, Yang Shi wrote:
> On Fri, Sep 1, 2023 at 9:13 AM Matthew Wilcox <[email protected]> wrote:
>>
>> On Thu, Aug 31, 2023 at 10:15:09AM -0700, Yang Shi wrote:
>>> On Thu, Aug 31, 2023 at 12:57 AM David Hildenbrand <[email protected]> wrote:
>>>> Let's talk about that in a bi-weekly MM session. (I proposed it as a
>>>> topic for next week).
>>>>
>>>> As raised in another mail, we can then discuss
>>>> * how we want to call this feature (transparent large pages? there is
>>>> the concern that "THP" might confuse users. Maybe we can consider
>>>> "large" the more generic version and "huge" only PMD-size, TBD)
>>>
>>> I tend to agree. "Huge" means PMD-mappable (transparent or HugeTLB),
>>> "Large" means any order but less than PMD-mappable order, "Gigantic"
>>> means PUD mappable. This should incur the least confusion IMHO.
>>
>> "Large" means any order > 0. The limitation to <= PMD_ORDER is simply
>> because I don't want to go through the whole VM and fix all the places
>> that assume that pmd_page() returns a head page. The benefit to doing so
>> is quite small, and the work to achieve it is quite large. The amount of
>> work needed should decrease over time as we convert more code to folios,
>> so deferring it is the right decision today.
>
> Yeah, I agree. And we are on the same page.
>
>>
>> But nobody should have the impression that large folios are smaller
>> than PMD size, nor even less than or equal. Just like they shouldn't
>> think that large folios depend on CONFIG_TRANSPARENT_HUGEPAGE. They do
>> today, but that's purely an implementation detail that will be removed
>> eventually.
>
> Yes, THP should be just a special case of large folio from page table
> point of view (for example, PMD-mappable vs non-PMD-mappable).
>
>>
>>>> I think there *really* has to be a way to disable it for a running
>>>> system, otherwise no distro will dare pulling it in, even after we
>>>> figured out the other stuff.
>>>
>>> TBH I really don't like to tie large folio to THP toggles. THP
>>> (PMD-mappable) is just a special case of LAF. The large folio should
>>> be tried whenever it is possible ideally. But I do agree we may not be
>>> able to achieve the ideal case at the time being, and also understand
>>> the concern about regression in early adoption, so a knob that can
>>> disable large folio may be needed for now. But it should be just a
>>> simple binary knob (on/off), and should not be a part of kernel ABI
>>> (temporary and debugging only) IMHO.
>>
>> Best of luck trying to remove it after you've shipped it ... we've
>> never been able to remove any of the THP toggles, only make them more
>> complicated.
>
> Fingers crossed... and my point is we should try to avoid making
> things more complicated. It may be hard...
>
>>
>>> One more thing we may discuss is whether huge page madvise APIs should
>>> take effect for large folio or not.
>>
>> They already do for file large folios; we listen to MADV_HUGEPAGE and
>> attempt to allocate PMD_ORDER folios for faults.
>
> OK, file folio may be simpler than anonymous. For anonymous folio,
> there may be two potential cases depending on our choice:
>
> Tie large folio to THP knobs:
> MADV_HUGEPAGE - large folio if THP is on/no large folio if THP is off
> MADV_NOHUGEPAGE - no large folio
>
> Not tie large folio to THP knob:
> MADV_HUGEPAGE - always large folio
> MADV_NOHUGEPAGE - shall create large folio?
>

In my mind, the debate on how LAF and MADV_NOHUGEPAGE should interact is
concluded; David has explained a QEMU live migration use case, which would break
if a LAF was allocated for a VMA with MADV_NOHUGEPAGE (see [1]).
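
For reference, this is roughly what the user space side of that hint looks
like: a minimal sketch that maps anonymous memory and marks it
MADV_NOHUGEPAGE. It uses only the standard mmap()/madvise() calls;
nohugepage_demo() is a made-up name, and the sketch says nothing about how
QEMU itself applies the hint.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of private anonymous memory and ask the kernel not to
 * back it with huge pages. Returns 0 on success or a negative errno.
 * The mapping is released again before returning.
 */
static int nohugepage_demo(size_t len)
{
	int rc = 0;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return -errno;

	/* May fail with EINVAL if the kernel was built without THP support. */
	if (madvise(p, len, MADV_NOHUGEPAGE) != 0)
		rc = -errno;

	munmap(p, len);
	return rc;
}
```

Under the policy argued for here, faults in such a range would only ever be
served with single pages, whatever the LAF and THP settings are.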

Given LAF and THP controls must be tied together at MADV_NOHUGEPAGE as a
minimum, then for me it makes most sense to expose LAF to user space as a
generalization of THP rather than a separate, independent feature. And if taking
such a route, Huang Ying's suggestion at [2] sounds like a good starting point.

Anyway, let's discuss in the mm meeting as David requested.


[1]
https://lore.kernel.org/linux-mm/[email protected]/
[2]
https://lore.kernel.org/linux-mm/[email protected]/

2023-09-05 16:19:39

by Matthew Wilcox

Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance

On Thu, Aug 31, 2023 at 10:15:09AM -0700, Yang Shi wrote:
> On Thu, Aug 31, 2023 at 12:57 AM David Hildenbrand <[email protected]> wrote:
> > Let's talk about that in a bi-weekly MM session. (I proposed it as a
> > topic for next week).
> >
> > As raised in another mail, we can then discuss
> > * how we want to call this feature (transparent large pages? there is
> > the concern that "THP" might confuse users. Maybe we can consider
> > "large" the more generic version and "huge" only PMD-size, TBD)
>
> I tend to agree. "Huge" means PMD-mappable (transparent or HugeTLB),
> "Large" means any order but less than PMD-mappable order, "Gigantic"
> means PUD mappable. This should incur the least confusion IMHO.

"Large" means any order > 0. The limitation to <= PMD_ORDER is simply
because I don't want to go through the whole VM and fix all the places
that assume that pmd_page() returns a head page. The benefit to doing so
is quite small, and the work to achieve it is quite large. The amount of
work needed should decrease over time as we convert more code to folios,
so deferring it is the right decision today.

But nobody should have the impression that large folios are smaller
than PMD size, nor even that they are at most PMD size. Just like they shouldn't
think that large folios depend on CONFIG_TRANSPARENT_HUGEPAGE. They do
today, but that's purely an implementation detail that will be removed
eventually.
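
As a rough illustration of the naming above (a userspace sketch, not kernel
code): the helper names are made up for this example, and the PMD order of 9
assumes 4 KiB base pages with 2 MiB PMD mappings.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed for illustration: 4 KiB base pages, 2 MiB PMD mappings. */
#define EXAMPLE_PMD_ORDER 9u

/* A folio is "large" as soon as it spans more than one base page. */
static bool example_is_large(unsigned int order)
{
	return order > 0;
}

/* "Huge" in the proposed naming: big enough to be mapped by one PMD entry. */
static bool example_is_pmd_mappable(unsigned int order)
{
	return order >= EXAMPLE_PMD_ORDER;
}

/* Size in bytes of an order-N folio for a given base page size. */
static unsigned long example_folio_bytes(unsigned int order,
					 unsigned long page_size)
{
	/* Guard against shifting past the width of the result type. */
	assert(order < 8 * sizeof(unsigned long));
	return page_size << order;
}
```

With these definitions an order-3 folio (32 KiB with 4 KiB pages) is large but
not PMD-mappable, while an order-9 folio is both: "huge" is a subset of
"large", not a separate size class.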

> > I think there *really* has to be a way to disable it for a running
> > system, otherwise no distro will dare pulling it in, even after we
> > figured out the other stuff.
>
> TBH I really don't like to tie large folio to THP toggles. THP
> (PMD-mappable) is just a special case of LAF. The large folio should
> be tried whenever it is possible ideally. But I do agree we may not be
> able to achieve the ideal case at the time being, and also understand
> the concern about regression in early adoption, so a knob that can
> disable large folio may be needed for now. But it should be just a
> simple binary knob (on/off), and should not be a part of kernel ABI
> (temporary and debugging only) IMHO.

Best of luck trying to remove it after you've shipped it ... we've
never been able to remove any of the THP toggles, only make them more
complicated.

> One more thing we may discuss is whether huge page madvise APIs should
> take effect for large folio or not.

They already do for file large folios; we listen to MADV_HUGEPAGE and
attempt to allocate PMD_ORDER folios for faults.

2023-09-05 16:21:57

by Matthew Wilcox

Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance

On Thu, Aug 31, 2023 at 09:57:46AM +0200, David Hildenbrand wrote:
> As raised in another mail, we can then discuss
> * how we want to call this feature (transparent large pages? there is
> the concern that "THP" might confuse users. Maybe we can consider
> "large" the more generic version and "huge" only PMD-size, TBD)
> * how to expose it in stats towards the user (e.g., /proc/meminfo)
> * which minimal toggles we want
>
> I think there *really* has to be a way to disable it for a running system,
> otherwise no distro will dare pulling it in, even after we figured out the
> other stuff.
>
> Note that for the pagecache, large folios can be disabled and distributions
> are actively making use of that.

You can't. Well, you can for shmem/tmpfs, but you have to edit the
source code or disable CONFIG_TRANSPARENT_HUGEPAGE to disable it for XFS.