2023-09-29 11:53:53

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v6 0/9] variable-order, large folios for anonymous memory

Hi All,

This is v6 of a series to implement variable order, large folios for anonymous
memory. (previously called "ANON_LARGE_FOLIO", "LARGE_ANON_FOLIO",
"FLEXIBLE_THP", but now exposed as an extension to THP; "small-order THP"). The
objective of this is to improve performance by allocating larger chunks of
memory during anonymous page faults:

1) Since SW (the kernel) is dealing with larger chunks of memory than base
pages, there are efficiency savings to be had; fewer page faults, batched PTE
and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
overhead. This should benefit all architectures.
2) Since we are now mapping physically contiguous chunks of memory, we can take
advantage of HW TLB compression techniques. A reduction in TLB pressure
speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
TLB entries; "the contiguous bit" (architectural) and HPA (uarch).

The major change in this revision is the addition of sysfs controls to allow
this "small-order THP" to be enabled/disabled/configured independently of
PMD-order THP. The approach I've taken differs a bit from previous discussions;
instead of creating a whole new interface ("large_folio"), I'm extending THP. I
personally think this makes things clearer and more extensible. See [6] for
detailed rationale.

Because we now have runtime enable/disable control, I've removed the compile
time Kconfig switch. It still defaults to runtime-disabled.

NOTE: These changes should not be merged until the prerequisites are complete.
These are in progress and tracked at [7].

This series is based on mm-hotfixes-unstable (f9911db48293).


Testing
=======

This version adds patches to mm selftests so that the cow tests explicitly test
small-order THP, in the same way that PMD-order THP is tested. The new tests all
pass, and no regressions are observed in the mm selftest suite.


Performance
===========

The below tables show performance and memory data for selected workloads, with
different small-order THPs enabled. All configs are compared to a 4k page kernel
with small-order THP disabled. 16k and 64k (with small-order THP disabled)
kernels are included to aid the comparison. All kernels built from the same
source; mm-hotfixes-unstable (f9911db48293) + this series.

4k-page-16k-folio: 16k (order-2) THP enabled
4k-page-32k-folio: 32k+16k (order-3, order-2) THP enabled
4k-page-64k-folio: 64k+32k+16k (order-4, order-3, order-2) THP enabled

Running on Ampere Altra with 1 NUMA node enabled, Ubuntu 22.04, XFS filesystem
20 repeats across 5 reboots (with 1 warmup run after each reboot)
Run in its own cgroup and read memory.peak after completion


Kernel Compilation with 8 jobs: (make defconfig && make -s -j8 Image)
(smaller is better):

| kernel | real-time | kern-time | user-time | memory |
|:------------------|------------:|------------:|------------:|-------:|
| baseline-4k-page | 0.0% | 0.0% | 0.0% | 0.0% |
| 16k-page | -9.0% | -49.7% | -4.0% | 6.2% |
| 64k-page | -11.9% | -66.5% | -5.0% | 28.3% |
| 4k-page-16k-folio | -2.8% | -23.0% | -0.3% | 0.0% |
| 4k-page-32k-folio | -4.0% | -32.0% | -0.6% | 0.1% |
| 4k-page-64k-folio | -4.6% | -37.9% | -0.5% | 0.1% |


Kernel Compilation with 80 jobs: (make defconfig && make -s -j80 Image)
(smaller is better):

| kernel | real-time | kern-time | user-time | memory |
|:------------------|------------:|------------:|------------:|:-------|
| baseline-4k-page | 0.0% | 0.0% | 0.0% | 0.0% |
| 16k-page | -9.2% | -52.1% | -3.6% | 4.6% |
| 64k-page | -11.4% | -66.4% | -3.0% | 12.6% |
| 4k-page-16k-folio | -3.2% | -22.8% | -0.3% | 2.7% |
| 4k-page-32k-folio | -4.8% | -37.1% | -0.5% | 2.9% |
| 4k-page-64k-folio | -5.0% | -42.1% | -0.3% | 3.4% |


Speedometer 2.0: Running on Chromium automated with Selenium
(bigger is better for runs_per_min, smaller is better for memory):

| kernel | runs_per_min | memory |
|:------------------|---------------:|-------:|
| baseline-4k-page | 0.0% | 0.0% |
| 16k-page | 5.9% | 10.6% |
| 4k-page-16k-folio | 1.0% | -0.6% |
| 4k-page-32k-folio | 1.3% | 3.5% |
| 4k-page-64k-folio | 1.3% | 6.4% |


Changes since v5 [5]
====================

- Added accounting for PTE-mapped THPs (patch 3)
- Added runtime control mechanism via sysfs as extension to THP (patch 4)
- Minor refactoring of alloc_anon_folio() to integrate with runtime controls
- Stripped out hardcoded policy for allocation order; its now all user space
controlled (although user space can request "recommend" which will configure
the HW-preferred order)


Changes since v4 [4]
====================

- Removed "arm64: mm: Override arch_wants_pte_order()" patch; arm64
now uses the default order-3 size. I have moved this patch over to
the contpte series.
- Added "mm: Allow deferred splitting of arbitrary large anon folios" back
into series. I originally removed this at v2 to add to a separate series,
but that series has transformed significantly and it no longer fits, so
bringing it back here.
- Reintroduced dependency on set_ptes(); Originally dropped this at v2, but
set_ptes() is in mm-unstable now.
- Updated policy for when to allocate LAF; only fallback to order-0 if
MADV_NOHUGEPAGE is present or if THP disabled via prctl; no longer rely on
sysfs's never/madvise/always knob.
- Fallback to order-0 whenever uffd is armed for the vma, not just when
uffd-wp is set on the pte.
- alloc_anon_folio() now returns `struct folio *`, where errors are encoded
with ERR_PTR().

The last 3 changes were proposed by Yu Zhao - thanks!


Changes since v3 [3]
====================

- Renamed feature from FLEXIBLE_THP to LARGE_ANON_FOLIO.
- Removed `flexthp_unhinted_max` boot parameter. Discussion concluded that a
sysctl is preferable but we will wait until real workload needs it.
- Fixed uninitialized `addr` on read fault path in do_anonymous_page().
- Added mm selftests for large anon folios in cow test suite.


Changes since v2 [2]
====================

- Dropped commit "Allow deferred splitting of arbitrary large anon folios"
- Huang, Ying suggested the "batch zap" work (which I dropped from this
series after v1) is a prerequisite for merging FLXEIBLE_THP, so I've
moved the deferred split patch to a separate series along with the batch
zap changes. I plan to submit this series early next week.
- Changed folio order fallback policy
- We no longer iterate from preferred to 0 looking for acceptable policy
- Instead we iterate through preferred, PAGE_ALLOC_COSTLY_ORDER and 0 only
- Removed vma parameter from arch_wants_pte_order()
- Added command line parameter `flexthp_unhinted_max`
- clamps preferred order when vma hasn't explicitly opted-in to THP
- Never allocate large folio for MADV_NOHUGEPAGE vma (or when THP is disabled
for process or system).
- Simplified implementation and integration with do_anonymous_page()
- Removed dependency on set_ptes()


Changes since v1 [1]
====================

- removed changes to arch-dependent vma_alloc_zeroed_movable_folio()
- replaced with arch-independent alloc_anon_folio()
- follows THP allocation approach
- no longer retry with intermediate orders if allocation fails
- fallback directly to order-0
- remove folio_add_new_anon_rmap_range() patch
- instead add its new functionality to folio_add_new_anon_rmap()
- remove batch-zap pte mappings optimization patch
- remove enabler folio_remove_rmap_range() patch too
- These offer real perf improvement so will submit separately
- simplify Kconfig
- single FLEXIBLE_THP option, which is independent of arch
- depends on TRANSPARENT_HUGEPAGE
- when enabled default to max anon folio size of 64K unless arch
explicitly overrides
- simplify changes to do_anonymous_page():
- no more retry loop


[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/
[4] https://lore.kernel.org/linux-mm/[email protected]/
[5] https://lore.kernel.org/linux-mm/[email protected]/
[6] https://lore.kernel.org/linux-mm/[email protected]/
[7] https://lore.kernel.org/linux-mm/[email protected]/


Thanks,
Ryan


Ryan Roberts (9):
mm: Allow deferred splitting of arbitrary anon large folios
mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
mm: thp: Account pte-mapped anonymous THP usage
mm: thp: Introduce anon_orders and anon_always_mask sysfs files
mm: thp: Extend THP to allocate anonymous large folios
mm: thp: Add "recommend" option for anon_orders
arm64/mm: Override arch_wants_pte_order()
selftests/mm/cow: Generalize do_run_with_thp() helper
selftests/mm/cow: Add tests for small-order anon THP

Documentation/ABI/testing/procfs-smaps_rollup | 1 +
.../admin-guide/cgroup-v1/memory.rst | 5 +-
Documentation/admin-guide/cgroup-v2.rst | 6 +-
Documentation/admin-guide/mm/transhuge.rst | 96 ++++++-
Documentation/filesystems/proc.rst | 20 +-
arch/arm64/include/asm/pgtable.h | 10 +
drivers/base/node.c | 2 +
fs/proc/meminfo.c | 2 +
fs/proc/task_mmu.c | 7 +-
include/linux/huge_mm.h | 95 +++++--
include/linux/mmzone.h | 1 +
include/linux/pgtable.h | 13 +
mm/huge_memory.c | 172 ++++++++++--
mm/khugepaged.c | 18 +-
mm/memcontrol.c | 8 +
mm/memory.c | 114 +++++++-
mm/page_vma_mapped.c | 3 +-
mm/rmap.c | 42 ++-
mm/show_mem.c | 2 +
mm/vmstat.c | 1 +
tools/testing/selftests/mm/cow.c | 244 +++++++++++++-----
21 files changed, 696 insertions(+), 166 deletions(-)

--
2.25.1


2023-09-29 11:55:53

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v6 7/9] arm64/mm: Override arch_wants_pte_order()

Define an arch-specific override of arch_wants_pte_order() so that when
anon_orders=recommend is set, large folios will be allocated for
anonymous memory with an order that is compatible with arm64's HPA uarch
feature.

Reviewed-by: Yu Zhao <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7f7d9b1df4e5..e3d2449dec5c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1110,6 +1110,16 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t old_pte, pte_t new_pte);
+
+#define arch_wants_pte_order arch_wants_pte_order
+static inline int arch_wants_pte_order(void)
+{
+ /*
+ * Many arm64 CPUs support hardware page aggregation (HPA), which can
+ * coalesce 4 contiguous pages into a single TLB entry.
+ */
+ return 2;
+}
#endif /* !__ASSEMBLY__ */

#endif /* __ASM_PGTABLE_H */
--
2.25.1

2023-09-29 11:55:57

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v6 9/9] selftests/mm/cow: Add tests for small-order anon THP

Add tests similar to the existing THP tests, but which operate on memory
backed by smaller-order, PTE-mapped THP. This reuses all the existing
infrastructure. If the test suite detects that small-order THP is not
supported by the kernel, the new tests are skipped.

Signed-off-by: Ryan Roberts <[email protected]>
---
tools/testing/selftests/mm/cow.c | 93 ++++++++++++++++++++++++++++++++
1 file changed, 93 insertions(+)

diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index d887ce454e34..6c5e37d8bb69 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -33,10 +33,13 @@
static size_t pagesize;
static int pagemap_fd;
static size_t pmdsize;
+static size_t ptesize;
static int nr_hugetlbsizes;
static size_t hugetlbsizes[10];
static int gup_fd;
static bool has_huge_zeropage;
+static unsigned int orig_anon_orders;
+static bool orig_anon_orders_valid;

static void detect_huge_zeropage(void)
{
@@ -1118,6 +1121,14 @@ static void run_anon_test_case(struct test_case const *test_case)
run_with_partial_mremap_thp(test_case->fn, test_case->desc, pmdsize);
run_with_partial_shared_thp(test_case->fn, test_case->desc, pmdsize);
}
+ if (ptesize) {
+ run_with_pte_mapped_thp(test_case->fn, test_case->desc, ptesize);
+ run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc, ptesize);
+ run_with_single_pte_of_thp(test_case->fn, test_case->desc, ptesize);
+ run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc, ptesize);
+ run_with_partial_mremap_thp(test_case->fn, test_case->desc, ptesize);
+ run_with_partial_shared_thp(test_case->fn, test_case->desc, ptesize);
+ }
for (i = 0; i < nr_hugetlbsizes; i++)
run_with_hugetlb(test_case->fn, test_case->desc,
hugetlbsizes[i]);
@@ -1139,6 +1150,8 @@ static int tests_per_anon_test_case(void)

if (pmdsize)
tests += 8;
+ if (ptesize)
+ tests += 6;
return tests;
}

@@ -1693,6 +1706,80 @@ static int tests_per_non_anon_test_case(void)
return tests;
}

+#define ANON_ORDERS_FILE "/sys/kernel/mm/transparent_hugepage/anon_orders"
+
+static int read_anon_orders(unsigned int *orders)
+{
+ ssize_t buflen = 80;
+ char buf[buflen];
+ int fd;
+
+ fd = open(ANON_ORDERS_FILE, O_RDONLY);
+ if (fd == -1)
+ return -1;
+
+ buflen = read(fd, buf, buflen);
+ close(fd);
+
+ if (buflen < 1)
+ return -1;
+
+ *orders = strtoul(buf, NULL, 16);
+
+ return 0;
+}
+
+static int write_anon_orders(unsigned int orders)
+{
+ ssize_t buflen = 80;
+ char buf[buflen];
+ int fd;
+
+ fd = open(ANON_ORDERS_FILE, O_WRONLY);
+ if (fd == -1)
+ return -1;
+
+ buflen = snprintf(buf, buflen, "0x%08x\n", orders);
+ buflen = write(fd, buf, buflen);
+ close(fd);
+
+ if (buflen < 1)
+ return -1;
+
+ return 0;
+}
+
+static size_t save_thp_anon_orders(void)
+{
+ /*
+ * If the kernel supports multiple orders for anon THP (indicated by the
+ * presence of anon_orders file), configure it for the PMD-order and the
+ * PMD-order - 1, which we will report back and use as the PTE-order THP
+ * size. Save the original value so that it can be restored on exit. If
+ * the kernel does not support multiple orders, report back 0 for the
+ * PTE-size so those tests are skipped.
+ */
+
+ int pteorder = sz2ord(pmdsize) - 1;
+ unsigned int orders = (1UL << sz2ord(pmdsize)) | (1UL << pteorder);
+
+ if (read_anon_orders(&orig_anon_orders))
+ return 0;
+
+ orig_anon_orders_valid = true;
+
+ if (write_anon_orders(orders))
+ return 0;
+
+ return pagesize << pteorder;
+}
+
+static void restore_thp_anon_orders(void)
+{
+ if (orig_anon_orders_valid)
+ write_anon_orders(orig_anon_orders);
+}
+
int main(int argc, char **argv)
{
int err;
@@ -1702,6 +1789,10 @@ int main(int argc, char **argv)
if (pmdsize)
ksft_print_msg("[INFO] detected PMD-mapped THP size: %zu KiB\n",
pmdsize / 1024);
+ ptesize = save_thp_anon_orders();
+ if (ptesize)
+ ksft_print_msg("[INFO] configured PTE-mapped THP size: %zu KiB\n",
+ ptesize / 1024);
nr_hugetlbsizes = detect_hugetlb_page_sizes(hugetlbsizes,
ARRAY_SIZE(hugetlbsizes));
detect_huge_zeropage();
@@ -1720,6 +1811,8 @@ int main(int argc, char **argv)
run_anon_thp_test_cases();
run_non_anon_test_cases();

+ restore_thp_anon_orders();
+
err = ksft_get_fail_cnt();
if (err)
ksft_exit_fail_msg("%d out of %d tests failed\n",
--
2.25.1

2023-09-29 12:11:10

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v6 8/9] selftests/mm/cow: Generalize do_run_with_thp() helper

do_run_with_thp() prepares (PMD-sized) THP memory into different states
before running tests. With the introduction of THP orders that are
smaller than PMD_ORDER, we would like to reuse this logic to also test
those smaller orders. So let's add a size parameter which tells the
function what size THP it should operate on.

No functional change intended here, but a separate commit will add new
tests for smaller order THP, where available.

Signed-off-by: Ryan Roberts <[email protected]>
---
tools/testing/selftests/mm/cow.c | 151 +++++++++++++++++--------------
1 file changed, 84 insertions(+), 67 deletions(-)

diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index 7324ce5363c0..d887ce454e34 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -32,7 +32,7 @@

static size_t pagesize;
static int pagemap_fd;
-static size_t thpsize;
+static size_t pmdsize;
static int nr_hugetlbsizes;
static size_t hugetlbsizes[10];
static int gup_fd;
@@ -734,14 +734,14 @@ enum thp_run {
THP_RUN_PARTIAL_SHARED,
};

-static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
+static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t size)
{
char *mem, *mmap_mem, *tmp, *mremap_mem = MAP_FAILED;
- size_t size, mmap_size, mremap_size;
+ size_t mmap_size, mremap_size;
int ret;

- /* For alignment purposes, we need twice the thp size. */
- mmap_size = 2 * thpsize;
+ /* For alignment purposes, we need twice the requested size. */
+ mmap_size = 2 * size;
mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (mmap_mem == MAP_FAILED) {
@@ -749,36 +749,40 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
return;
}

- /* We need a THP-aligned memory area. */
- mem = (char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1));
+ /* We need to naturally align the memory area. */
+ mem = (char *)(((uintptr_t)mmap_mem + size) & ~(size - 1));

- ret = madvise(mem, thpsize, MADV_HUGEPAGE);
+ ret = madvise(mem, size, MADV_HUGEPAGE);
if (ret) {
ksft_test_result_fail("MADV_HUGEPAGE failed\n");
goto munmap;
}

/*
- * Try to populate a THP. Touch the first sub-page and test if we get
- * another sub-page populated automatically.
+ * Try to populate a THP. Touch the first sub-page and test if
+ * we get the last sub-page populated automatically.
*/
mem[0] = 0;
- if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) {
+ if (!pagemap_is_populated(pagemap_fd, mem + size - pagesize)) {
ksft_test_result_skip("Did not get a THP populated\n");
goto munmap;
}
- memset(mem, 0, thpsize);
+ memset(mem, 0, size);

- size = thpsize;
switch (thp_run) {
case THP_RUN_PMD:
case THP_RUN_PMD_SWAPOUT:
+ if (size != pmdsize) {
+ ksft_test_result_fail("test bug: can't PMD-map size\n");
+ goto munmap;
+ }
break;
case THP_RUN_PTE:
case THP_RUN_PTE_SWAPOUT:
/*
* Trigger PTE-mapping the THP by temporarily mapping a single
- * subpage R/O.
+ * subpage R/O. This is a noop if the THP is not pmdsize (and
+ * therefore already PTE-mapped).
*/
ret = mprotect(mem + pagesize, pagesize, PROT_READ);
if (ret) {
@@ -797,7 +801,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
* Discard all but a single subpage of that PTE-mapped THP. What
* remains is a single PTE mapping a single subpage.
*/
- ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DONTNEED);
+ ret = madvise(mem + pagesize, size - pagesize, MADV_DONTNEED);
if (ret) {
ksft_test_result_fail("MADV_DONTNEED failed\n");
goto munmap;
@@ -809,7 +813,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
* Remap half of the THP. We need some new memory location
* for that.
*/
- mremap_size = thpsize / 2;
+ mremap_size = size / 2;
mremap_mem = mmap(NULL, mremap_size, PROT_NONE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (mem == MAP_FAILED) {
@@ -830,7 +834,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
* child. This will result in some parts of the THP never
* have been shared.
*/
- ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DONTFORK);
+ ret = madvise(mem + pagesize, size - pagesize, MADV_DONTFORK);
if (ret) {
ksft_test_result_fail("MADV_DONTFORK failed\n");
goto munmap;
@@ -844,7 +848,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
}
wait(&ret);
/* Allow for sharing all pages again. */
- ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DOFORK);
+ ret = madvise(mem + pagesize, size - pagesize, MADV_DOFORK);
if (ret) {
ksft_test_result_fail("MADV_DOFORK failed\n");
goto munmap;
@@ -875,52 +879,65 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
munmap(mremap_mem, mremap_size);
}

-static void run_with_thp(test_fn fn, const char *desc)
+static int sz2ord(size_t size)
+{
+ return __builtin_ctzll(size / pagesize);
+}
+
+static void run_with_thp(test_fn fn, const char *desc, size_t size)
{
- ksft_print_msg("[RUN] %s ... with THP\n", desc);
- do_run_with_thp(fn, THP_RUN_PMD);
+ ksft_print_msg("[RUN] %s ... with order-%d THP\n",
+ desc, sz2ord(size));
+ do_run_with_thp(fn, THP_RUN_PMD, size);
}

-static void run_with_thp_swap(test_fn fn, const char *desc)
+static void run_with_thp_swap(test_fn fn, const char *desc, size_t size)
{
- ksft_print_msg("[RUN] %s ... with swapped-out THP\n", desc);
- do_run_with_thp(fn, THP_RUN_PMD_SWAPOUT);
+ ksft_print_msg("[RUN] %s ... with swapped-out order-%d THP\n",
+ desc, sz2ord(size));
+ do_run_with_thp(fn, THP_RUN_PMD_SWAPOUT, size);
}

-static void run_with_pte_mapped_thp(test_fn fn, const char *desc)
+static void run_with_pte_mapped_thp(test_fn fn, const char *desc, size_t size)
{
- ksft_print_msg("[RUN] %s ... with PTE-mapped THP\n", desc);
- do_run_with_thp(fn, THP_RUN_PTE);
+ ksft_print_msg("[RUN] %s ... with PTE-mapped order-%d THP\n",
+ desc, sz2ord(size));
+ do_run_with_thp(fn, THP_RUN_PTE, size);
}

-static void run_with_pte_mapped_thp_swap(test_fn fn, const char *desc)
+static void run_with_pte_mapped_thp_swap(test_fn fn, const char *desc, size_t size)
{
- ksft_print_msg("[RUN] %s ... with swapped-out, PTE-mapped THP\n", desc);
- do_run_with_thp(fn, THP_RUN_PTE_SWAPOUT);
+ ksft_print_msg("[RUN] %s ... with swapped-out, PTE-mapped order-%d THP\n",
+ desc, sz2ord(size));
+ do_run_with_thp(fn, THP_RUN_PTE_SWAPOUT, size);
}

-static void run_with_single_pte_of_thp(test_fn fn, const char *desc)
+static void run_with_single_pte_of_thp(test_fn fn, const char *desc, size_t size)
{
- ksft_print_msg("[RUN] %s ... with single PTE of THP\n", desc);
- do_run_with_thp(fn, THP_RUN_SINGLE_PTE);
+ ksft_print_msg("[RUN] %s ... with single PTE of order-%d THP\n",
+ desc, sz2ord(size));
+ do_run_with_thp(fn, THP_RUN_SINGLE_PTE, size);
}

-static void run_with_single_pte_of_thp_swap(test_fn fn, const char *desc)
+static void run_with_single_pte_of_thp_swap(test_fn fn, const char *desc, size_t size)
{
- ksft_print_msg("[RUN] %s ... with single PTE of swapped-out THP\n", desc);
- do_run_with_thp(fn, THP_RUN_SINGLE_PTE_SWAPOUT);
+ ksft_print_msg("[RUN] %s ... with single PTE of swapped-out order-%d THP\n",
+ desc, sz2ord(size));
+ do_run_with_thp(fn, THP_RUN_SINGLE_PTE_SWAPOUT, size);
}

-static void run_with_partial_mremap_thp(test_fn fn, const char *desc)
+static void run_with_partial_mremap_thp(test_fn fn, const char *desc, size_t size)
{
- ksft_print_msg("[RUN] %s ... with partially mremap()'ed THP\n", desc);
- do_run_with_thp(fn, THP_RUN_PARTIAL_MREMAP);
+ ksft_print_msg("[RUN] %s ... with partially mremap()'ed order-%d THP\n",
+ desc, sz2ord(size));
+ do_run_with_thp(fn, THP_RUN_PARTIAL_MREMAP, size);
}

-static void run_with_partial_shared_thp(test_fn fn, const char *desc)
+static void run_with_partial_shared_thp(test_fn fn, const char *desc, size_t size)
{
- ksft_print_msg("[RUN] %s ... with partially shared THP\n", desc);
- do_run_with_thp(fn, THP_RUN_PARTIAL_SHARED);
+ ksft_print_msg("[RUN] %s ... with partially shared order-%d THP\n",
+ desc, sz2ord(size));
+ do_run_with_thp(fn, THP_RUN_PARTIAL_SHARED, size);
}

static void run_with_hugetlb(test_fn fn, const char *desc, size_t hugetlbsize)
@@ -1091,15 +1108,15 @@ static void run_anon_test_case(struct test_case const *test_case)

run_with_base_page(test_case->fn, test_case->desc);
run_with_base_page_swap(test_case->fn, test_case->desc);
- if (thpsize) {
- run_with_thp(test_case->fn, test_case->desc);
- run_with_thp_swap(test_case->fn, test_case->desc);
- run_with_pte_mapped_thp(test_case->fn, test_case->desc);
- run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc);
- run_with_single_pte_of_thp(test_case->fn, test_case->desc);
- run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc);
- run_with_partial_mremap_thp(test_case->fn, test_case->desc);
- run_with_partial_shared_thp(test_case->fn, test_case->desc);
+ if (pmdsize) {
+ run_with_thp(test_case->fn, test_case->desc, pmdsize);
+ run_with_thp_swap(test_case->fn, test_case->desc, pmdsize);
+ run_with_pte_mapped_thp(test_case->fn, test_case->desc, pmdsize);
+ run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc, pmdsize);
+ run_with_single_pte_of_thp(test_case->fn, test_case->desc, pmdsize);
+ run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc, pmdsize);
+ run_with_partial_mremap_thp(test_case->fn, test_case->desc, pmdsize);
+ run_with_partial_shared_thp(test_case->fn, test_case->desc, pmdsize);
}
for (i = 0; i < nr_hugetlbsizes; i++)
run_with_hugetlb(test_case->fn, test_case->desc,
@@ -1120,7 +1137,7 @@ static int tests_per_anon_test_case(void)
{
int tests = 2 + nr_hugetlbsizes;

- if (thpsize)
+ if (pmdsize)
tests += 8;
return tests;
}
@@ -1329,7 +1346,7 @@ static void run_anon_thp_test_cases(void)
{
int i;

- if (!thpsize)
+ if (!pmdsize)
return;

ksft_print_msg("[INFO] Anonymous THP tests\n");
@@ -1338,13 +1355,13 @@ static void run_anon_thp_test_cases(void)
struct test_case const *test_case = &anon_thp_test_cases[i];

ksft_print_msg("[RUN] %s\n", test_case->desc);
- do_run_with_thp(test_case->fn, THP_RUN_PMD);
+ do_run_with_thp(test_case->fn, THP_RUN_PMD, pmdsize);
}
}

static int tests_per_anon_thp_test_case(void)
{
- return thpsize ? 1 : 0;
+ return pmdsize ? 1 : 0;
}

typedef void (*non_anon_test_fn)(char *mem, const char *smem, size_t size);
@@ -1419,7 +1436,7 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc)
}

/* For alignment purposes, we need twice the thp size. */
- mmap_size = 2 * thpsize;
+ mmap_size = 2 * pmdsize;
mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (mmap_mem == MAP_FAILED) {
@@ -1434,11 +1451,11 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc)
}

/* We need a THP-aligned memory area. */
- mem = (char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1));
- smem = (char *)(((uintptr_t)mmap_smem + thpsize) & ~(thpsize - 1));
+ mem = (char *)(((uintptr_t)mmap_mem + pmdsize) & ~(pmdsize - 1));
+ smem = (char *)(((uintptr_t)mmap_smem + pmdsize) & ~(pmdsize - 1));

- ret = madvise(mem, thpsize, MADV_HUGEPAGE);
- ret |= madvise(smem, thpsize, MADV_HUGEPAGE);
+ ret = madvise(mem, pmdsize, MADV_HUGEPAGE);
+ ret |= madvise(smem, pmdsize, MADV_HUGEPAGE);
if (ret) {
ksft_test_result_fail("MADV_HUGEPAGE failed\n");
goto munmap;
@@ -1457,7 +1474,7 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc)
goto munmap;
}

- fn(mem, smem, thpsize);
+ fn(mem, smem, pmdsize);
munmap:
munmap(mmap_mem, mmap_size);
if (mmap_smem != MAP_FAILED)
@@ -1650,7 +1667,7 @@ static void run_non_anon_test_case(struct non_anon_test_case const *test_case)
run_with_zeropage(test_case->fn, test_case->desc);
run_with_memfd(test_case->fn, test_case->desc);
run_with_tmpfile(test_case->fn, test_case->desc);
- if (thpsize)
+ if (pmdsize)
run_with_huge_zeropage(test_case->fn, test_case->desc);
for (i = 0; i < nr_hugetlbsizes; i++)
run_with_memfd_hugetlb(test_case->fn, test_case->desc,
@@ -1671,7 +1688,7 @@ static int tests_per_non_anon_test_case(void)
{
int tests = 3 + nr_hugetlbsizes;

- if (thpsize)
+ if (pmdsize)
tests += 1;
return tests;
}
@@ -1681,10 +1698,10 @@ int main(int argc, char **argv)
int err;

pagesize = getpagesize();
- thpsize = read_pmd_pagesize();
- if (thpsize)
- ksft_print_msg("[INFO] detected THP size: %zu KiB\n",
- thpsize / 1024);
+ pmdsize = read_pmd_pagesize();
+ if (pmdsize)
+ ksft_print_msg("[INFO] detected PMD-mapped THP size: %zu KiB\n",
+ pmdsize / 1024);
nr_hugetlbsizes = detect_hugetlb_page_sizes(hugetlbsizes,
ARRAY_SIZE(hugetlbsizes));
detect_huge_zeropage();
--
2.25.1

2023-10-02 15:36:41

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v6 7/9] arm64/mm: Override arch_wants_pte_order()

On Fri, Sep 29, 2023 at 12:44:18PM +0100, Ryan Roberts wrote:
> Define an arch-specific override of arch_wants_pte_order() so that when
> anon_orders=recommend is set, large folios will be allocated for
> anonymous memory with an order that is compatible with arm64's HPA uarch
> feature.
>
> Reviewed-by: Yu Zhao <[email protected]>
> Signed-off-by: Ryan Roberts <[email protected]>

Acked-by: Catalin Marinas <[email protected]>

> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 7f7d9b1df4e5..e3d2449dec5c 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1110,6 +1110,16 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
> extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep,
> pte_t old_pte, pte_t new_pte);
> +
> +#define arch_wants_pte_order arch_wants_pte_order
> +static inline int arch_wants_pte_order(void)
> +{
> + /*
> + * Many arm64 CPUs support hardware page aggregation (HPA), which can
> + * coalesce 4 contiguous pages into a single TLB entry.
> + */
> + return 2;
> +}

I haven't followed the discussions on previous revisions of this series
but I wonder why not return a bitmap from arch_wants_pte_order(). For
arm64 we may want an order 6 at some point (contiguous ptes) with a
fallback to order 2 as the next best.

--
Catalin

2023-10-03 07:32:42

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v6 7/9] arm64/mm: Override arch_wants_pte_order()

On 02/10/2023 16:21, Catalin Marinas wrote:
> On Fri, Sep 29, 2023 at 12:44:18PM +0100, Ryan Roberts wrote:
>> Define an arch-specific override of arch_wants_pte_order() so that when
>> anon_orders=recommend is set, large folios will be allocated for
>> anonymous memory with an order that is compatible with arm64's HPA uarch
>> feature.
>>
>> Reviewed-by: Yu Zhao <[email protected]>
>> Signed-off-by: Ryan Roberts <[email protected]>
>
> Acked-by: Catalin Marinas <[email protected]>
>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 7f7d9b1df4e5..e3d2449dec5c 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1110,6 +1110,16 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
>> extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>> unsigned long addr, pte_t *ptep,
>> pte_t old_pte, pte_t new_pte);
>> +
>> +#define arch_wants_pte_order arch_wants_pte_order
>> +static inline int arch_wants_pte_order(void)
>> +{
>> + /*
>> + * Many arm64 CPUs support hardware page aggregation (HPA), which can
>> + * coalesce 4 contiguous pages into a single TLB entry.
>> + */
>> + return 2;
>> +}
>
> I haven't followed the discussions on previous revisions of this series
> but I wonder why not return a bitmap from arch_wants_pte_order(). For
> arm64 we may want an order 6 at some point (contiguous ptes) with a
> fallback to order 2 as the next best.
>

This sounds like good idea to me - I'll implement it, assuming there is a next
rev. (Or in the unlikely event that this is the only pending change, I'd rather
defer it to when we actually need it with the contpte series).

This is just a hangover from the "MVP" approach that I was persuing in v5, where
we didn't want to configure too many orders for fear of fragmentation. But in v6
I've introduced UABI to configure the set of orders, and this function feeds
into the special "recommend" set. So I think it is appropriate that this API
allows expression of multiple orders as you suggest.

Side note: I don't think order-6 is ever a contpte size? Its order-4 for 4K,
order-7 for 16k and order-5 for 64k.

2023-10-03 12:06:10

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v6 7/9] arm64/mm: Override arch_wants_pte_order()

On Tue, Oct 03, 2023 at 08:32:29AM +0100, Ryan Roberts wrote:
> On 02/10/2023 16:21, Catalin Marinas wrote:
> > On Fri, Sep 29, 2023 at 12:44:18PM +0100, Ryan Roberts wrote:
> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> index 7f7d9b1df4e5..e3d2449dec5c 100644
> >> --- a/arch/arm64/include/asm/pgtable.h
> >> +++ b/arch/arm64/include/asm/pgtable.h
> >> @@ -1110,6 +1110,16 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
> >> extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
> >> unsigned long addr, pte_t *ptep,
> >> pte_t old_pte, pte_t new_pte);
> >> +
> >> +#define arch_wants_pte_order arch_wants_pte_order
> >> +static inline int arch_wants_pte_order(void)
> >> +{
> >> + /*
> >> + * Many arm64 CPUs support hardware page aggregation (HPA), which can
> >> + * coalesce 4 contiguous pages into a single TLB entry.
> >> + */
> >> + return 2;
> >> +}
> >
> > I haven't followed the discussions on previous revisions of this series
> > but I wonder why not return a bitmap from arch_wants_pte_order(). For
> > arm64 we may want an order 6 at some point (contiguous ptes) with a
> > fallback to order 2 as the next best.
>
> This sounds like good idea to me - I'll implement it, assuming there is a next
> rev. (Or in the unlikely event that this is the only pending change, I'd rather
> defer it to when we actually need it with the contpte series).

Fine by me, at the moment there wouldn't be any user, so a patch on top
later would do.

> Side note: I don't think order-6 is ever a contpte size? Its order-4 for 4K,
> order-7 for 16k and order-5 for 64k.

Yes, it's order-4 for 4K pages (I was thinking too much of the "64" in 64KB).

--
Catalin

2023-10-06 20:07:18

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On 29.09.23 13:44, Ryan Roberts wrote:
> Hi All,

Let me highlight some core decisions on the things discussed in the
previous alignment meetings, and comment on them.

>
> This is v6 of a series to implement variable order, large folios for anonymous
> memory. (previously called "ANON_LARGE_FOLIO", "LARGE_ANON_FOLIO",
> "FLEXIBLE_THP", but now exposed as an extension to THP; "small-order THP"). The
> objective of this is to improve performance by allocating larger chunks of
> memory during anonymous page faults:

Change number 1: Let's call these things THP.

Fine with me; I previously rooted for that but was told that end users
could be confused. I think the important bit is that we don't mess up
the stats, and when we talk about THP we default to "PMD-sized THP",
unless we explicitly include the other ones.


I dislike exposing "orders" to the users, I'm happy to be convinced why
I am wrong and it is a good idea.

So maybe "Small THP"/"Small-sized THP" is better. Or "Medium-sized THP"
-- as said, I think FreeBSD tends to call it "Medium-sized superpages".
But what's small/medium/large is debatable. "Small" implies at least
that it's smaller than what we used to know, which is a fact.

Can we also now use the terminology consistently? (e.g.,
"variable-order, large folios for anonymous memory" -> "Small-sized
anonymous THP", you can just point at the previous patch set name in the
cover letter)

>
> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
> pages, there are efficiency savings to be had; fewer page faults, batched PTE
> and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
> overhead. This should benefit all architectures.
> 2) Since we are now mapping physically contiguous chunks of memory, we can take
> advantage of HW TLB compression techniques. A reduction in TLB pressure
> speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
> TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>
> The major change in this revision is the addition of sysfs controls to allow
> this "small-order THP" to be enabled/disabled/configured independently of
> PMD-order THP. The approach I've taken differs a bit from previous discussions;
> instead of creating a whole new interface ("large_folio"), I'm extending THP. I
> personally think this makes things clearer and more extensible. See [6] for
> detailed rationale.

Change 2: sysfs interface.

If we call it THP, it shall go under
"/sys/kernel/mm/transparent_hugepage/", I agree.

What we expose there and how, is TBD. Again, not a friend of "orders"
and bitmaps at all. We can do better if we want to go down that path.

Maybe we should take a look at hugetlb, and how they added support for
multiple sizes. What *might* make sense could be (depending on which
values we actually support!)


/sys/kernel/mm/transparent_hugepage/hugepages-64kB/
/sys/kernel/mm/transparent_hugepage/hugepages-128kB/
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/
/sys/kernel/mm/transparent_hugepage/hugepages-512kB/
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/

Each one would contain an "enabled" and "defrag" file. We want something
minimal first? Start with the "enabled" option.


enabled: always [global] madvise never

Initially, we would set it for PMD-sized THP to "global" and for
everything else to "never".



That sounds reasonable at least to me, and we would be using what we
learned from THP (as John suggested). That still gives reasonable
flexibility without going too wild, and a better IMHO interface.

I understand Yu's point about ABI discussions and "0 knobs". I'm happy
as long as we can have something that won't hurt us later and still be
able to use this in distributions within a reasonable timeframe.
Enabling/disabling individual sizes does not sound too restrictive to
me. And we could always add an "auto" setting later and default to that
with a new kconfig knob.

If someone wants to configure it, why not. Let's just prepare a way to
to handle this "better" automatically in the future (if ever ...).


Change 3: Stats

> /proc/meminfo:
> Introduce new "AnonHugePteMap" field, which reports the amount of
> memory (in KiB) mapped from large folios globally (similar to
> AnonHugePages field).

AnonHugePages is and remains "PMD-sized THP that is mapped using a PMD",
I think we all agree on that. It should have been named "AnonPmdMapped"
or "AnonHugePmdMapped", too bad, we can't change that.

"AnonHugePteMap" better be "AnonHugePteMapped".

But, I wonder if we want to expose this "PteMapped" to user space *at
all*. Why should they care if it's PTE mapped? For PMD-sized THP it
makes a bit of sense, because !PMD implied !performance, and one might
have been able to troubleshoot that somehow. For PTE-mapped, it doesn't
make much sense really, they are always PTE-mapped.

That also raises the question how you would account a PTE-mapped THP.
The hole thing? Only the parts that are mapped? Let's better not go down
that path.

That leaves the question why we would want to include them here at all
in a special PTE-mapped way?


Again, let's look at hugetlb: I prepared 1 GiB and one 2 MiB page.

HugePages_Total: 1
HugePages_Free: 1
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 1050624 kB

-> Only the last one gives the sum, the other stats don't even mention
the other ones. [how do we get their stats, if at all?]

So maybe, we only want a summary of how many anon huge pages of any size
are allocated (independent of the PTE vs. PMD mapping), and some other
source to eventually inspect how the different sizes behave.

But note that for non-PMD-sized file THP we don't even have special
counters! ... so maybe we should also defer any such stats and come up
with something uniform for all types of non-PMD-sized THP.


Sane discussion applies to all other stats.


>
> Because we now have runtime enable/disable control, I've removed the compile
> time Kconfig switch. It still defaults to runtime-disabled.
>
> NOTE: These changes should not be merged until the prerequisites are complete.
> These are in progress and tracked at [7].

We should probably list them here, and classify which one we see as
strict a requirement, which ones might be an optimization.


Now, these are just my thoughts, and I'm happy about other thoughts.

--
Cheers,

David / dhildenb

2023-10-09 11:30:28

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On 06/10/2023 21:06, David Hildenbrand wrote:
> On 29.09.23 13:44, Ryan Roberts wrote:
>> Hi All,
>
> Let me highlight some core decisions on the things discussed in the previous
> alignment meetings, and comment on them.
>
>>
>> This is v6 of a series to implement variable order, large folios for anonymous
>> memory. (previously called "ANON_LARGE_FOLIO", "LARGE_ANON_FOLIO",
>> "FLEXIBLE_THP", but now exposed as an extension to THP; "small-order THP"). The
>> objective of this is to improve performance by allocating larger chunks of
>> memory during anonymous page faults:
>
> Change number 1: Let's call these things THP.
>
> Fine with me; I previously rooted for that but was told that end users could be
> confused. I think the important bit is that we don't mess up the stats, and when
> we talk about THP we default to "PMD-sized THP", unless we explicitly include
> the other ones.
>
>
> I dislike exposing "orders" to the users, I'm happy to be convinced why I am
> wrong and it is a good idea.
>
> So maybe "Small THP"/"Small-sized THP" is better. Or "Medium-sized THP" -- as
> said, I think FreeBSD tends to call it "Medium-sized superpages". But what's
> small/medium/large is debatable. "Small" implies at least that it's smaller than
> what we used to know, which is a fact.
>
> Can we also now use the terminology consistently? (e.g., "variable-order, large
> folios for anonymous memory" -> "Small-sized anonymous THP", you can just point
> at the previous patch set name in the cover letter)

Yes absolutely. FWIW, I was deliberately not changing the title of the patchset
so people could easily see it was an evolution of something posted before. But
if it's the norm to change the title as the patchset evolves, I'm very happy to
do that. And there are other places too, in commit logs that I can tidy up. I
will assume "PMD-sized THP", "small-sized THP" and "anonymous small-sized THP"
(that last one slightly different from what David suggested above - it means
"small-sized THP" can still be grepped) unless others object.

>
>>
>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>     and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>>     overhead. This should benefit all architectures.
>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>
>> The major change in this revision is the addition of sysfs controls to allow
>> this "small-order THP" to be enabled/disabled/configured independently of
>> PMD-order THP. The approach I've taken differs a bit from previous discussions;
>> instead of creating a whole new interface ("large_folio"), I'm extending THP. I
>> personally think this makes things clearer and more extensible. See [6] for
>> detailed rationale.
>
> Change 2: sysfs interface.
>
> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
> agree.
>
> What we expose there and how, is TBD. Again, not a friend of "orders" and
> bitmaps at all. We can do better if we want to go down that path.
>
> Maybe we should take a look at hugetlb, and how they added support for multiple
> sizes. What *might* make sense could be (depending on which values we actually
> support!)
>
>
> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>
> Each one would contain an "enabled" and "defrag" file. We want something minimal
> first? Start with the "enabled" option.
>
>
> enabled: always [global] madvise never
>
> Initially, we would set it for PMD-sized THP to "global" and for everything else
> to "never".

My only reservation about this approach is the potential for a future need for a
"one setting applied across all sizes" class of control (e.g. "auto"). I think
we agreed in the previous meetings that chasing a solution for "auto" was a good
aspiration to have, so it would be good to have a place we we can insert that in
future. The main reason why I chose to expose the "anon_orders" control is
because it is possible to both enable/disable the various sizes as well as
specificy (e.g.) "auto", without creating redundancy. But I agree that ideally
we wouldn't expose orders to the user; I was attempting a compromise to simplify
the "auto" case.

A potential (though feels quite complex) solution to make auto work with your
proposal: Add "auto" as an option to the existing global enabled file, and to
all of your proposed new enabled files. But its only possible to *set* auto
through the global file. And when it is set, all of the size-specific enabled
files read-back "auto" too. Any any writes to the size-specific enabled files
are ignored (or remembered but not enacted) until the global enabled file is
changed away from auto.

But I'm not sure if adding a new option to the global enabled file might break
compat?

>
>
>
> That sounds reasonable at least to me, and we would be using what we learned
> from THP (as John suggested).  That still gives reasonable flexibility without
> going too wild, and a better IMHO interface.
>
> I understand Yu's point about ABI discussions and "0 knobs". I'm happy as long
> as we can have something that won't hurt us later and still be able to use this
> in distributions within a reasonable timeframe. Enabling/disabling individual
> sizes does not sound too restrictive to me. And we could always add an "auto"
> setting later and default to that with a new kconfig knob.
>
> If someone wants to configure it, why not. Let's just prepare a way to to handle
> this "better" automatically in the future (if ever ...).
>
>
> Change 3: Stats
>
>> /proc/meminfo:
>>   Introduce new "AnonHugePteMap" field, which reports the amount of
>>   memory (in KiB) mapped from large folios globally (similar to
>>   AnonHugePages field).
>
> AnonHugePages is and remains "PMD-sized THP that is mapped using a PMD", I think
> we all agree on that. It should have been named "AnonPmdMapped" or
> "AnonHugePmdMapped", too bad, we can't change that.

Yes agreed. I did consider redefining "AnonHugePages" to cover PMD- and
PTE-mapped memory, then introduce both an "AnonHugePmdMapped" and
"AnonHugePteMapped", but I think that would likely break things. Its further
complicated because vmstats prints it in PMD-size units, so can't represent
PTE-mapped memory in that counter.

>
> "AnonHugePteMap" better be "AnonHugePteMapped".

I agree, but I went with the shorter one because any longer and it would unalign
the value e.g:

AnonHugePages: 0 kB
AnonHugePteMapped: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB

So would need to decide which is preferable, or come up with a shorter name.

>
> But, I wonder if we want to expose this "PteMapped" to user space *at all*. Why
> should they care if it's PTE mapped? For PMD-sized THP it makes a bit of sense,
> because !PMD implied !performance, and one might have been able to troubleshoot
> that somehow. For PTE-mapped, it doesn't make much sense really, they are always
> PTE-mapped.

I disagree; I've been using it a lot to debug performance issues. It tells you
how much of your anon memory is allocated with large folios. And making that
percentage bigger improves performance; fewer page faults, and with a separate
contpte series on arm64, better use of the TLB. Reasons might include; poorly
aligned/too small VMAs, memory fragmentation preventing allocation, CoW, etc.

I would actually argue for adding similar counters for file-backed memory too
for the same reasons. (I actually posted an independent patch a while back that
did this for file- and anon- memory, bucketted by size. But I think the idea of
the bucketting was NAKed.

>
> That also raises the question how you would account a PTE-mapped THP. The hole
> thing? Only the parts that are mapped? Let's better not go down that path.

The approach I've taken in this series is the simple one - account every page
that belongs to a large folio from when it is first mapped to last unmapped.
Yes, in this case, you might not actually be mapping the full thing
contigiously. But it gives a good indication.

I also considered accounting the whole folio only when all of its pages become
mapped (although not worrying about them all being contiguous). That's still
simple to implement for all counters except smaps. So went with the simplest
approach with the view that its "good enough".

>
> That leaves the question why we would want to include them here at all in a
> special PTE-mapped way?
>
>
> Again, let's look at hugetlb: I prepared 1 GiB and one 2 MiB page.
>
> HugePages_Total:       1
> HugePages_Free:        1
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> Hugetlb:         1050624 kB
>
> -> Only the last one gives the sum, the other stats don't even mention the other
> ones. [how do we get their stats, if at all?]

There are some files in /sys/kernel/mm/hugepages/hugepages-XXkB and
/sys/devices/system/node/node*/hugepages/; nr_hugepages, free_hugepages,
surplus_hugepages. But this interface also constitutes the allocator, not just
stats, I think.

>
> So maybe, we only want a summary of how many anon huge pages of any size are
> allocated (independent of the PTE vs. PMD mapping),

Are you proposing (AnonHugePages + AnonHugePteMapped) here or something else? If
the former, then I don't really see the difference. We have to continue to
expose PMD-size (AnonHugePages). So either add PTE-only counter, and derive the
total, or add a total counter and derive PTE-only. I suspect I've misunderstood
your point.

> and some other source to
> eventually inspect how the different sizes behave.
>
> But note that for non-PMD-sized file THP we don't even have special counters!
> ... so maybe we should also defer any such stats and come up with something
> uniform for all types of non-PMD-sized THP.

Indeed, I can see benefit in adding these for file THP - in fact I have a patch
that does exactly that to help my development work. I had envisaged that we
could add something like FileHugePteMapped, ShmemHugePteMapped that would follow
the same semantics as AnonHugePteMapped.

>
>
> Sane discussion applies to all other stats.
>
>
>>
>> Because we now have runtime enable/disable control, I've removed the compile
>> time Kconfig switch. It still defaults to runtime-disabled.
>>
>> NOTE: These changes should not be merged until the prerequisites are complete.
>> These are in progress and tracked at [7].
>
> We should probably list them here, and classify which one we see as strict a
> requirement, which ones might be an optimization.


I'll need some help with clasifying them, so showing my working. Final list that
I would propose as strict requirements at bottom.

This is my list with status, as per response to Yu in other thread:

- David is working on "shared vs exclusive mappings"
- Zi Yan has posted an RFC for compaction
- Yin Fengwei's mlock series is now in mm-stable
- Yin Fengwei's madvise series is in 6.6
- I've reworked and posted a series for deferred_split_folio; although I've
deprioritied it because Yu said it wasn't really a pre-requisite.
- numa balancing depends on David's "shared vs exclusive mappings" work
- I've started looking at "large folios in swap cache" in the background,
because I'm seeing some slow down with large folios, but we also agreed that
wasn't a prerequisite

Although, since sending that, I've determined that when running kernel
compilation across high number of cores on arm64, the cost of splitting the
folios gets large due to needing to broadcast the extra TLBIs. So I think the
last point on that list may be a prerequisite after all. (I've been able to fix
this by adding support for allocating large folios in the swap file, and
avoiding the split - planning to send RFC this week).

There is also this set of things that you mentioned against "shared vs exclusive
mappings", which I'm not sure if you are planning to cover as part of your work
or if they are follow on things that will need to be done:

(1) Detecting shared folios, to not mess with them while they are shared.
MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
replace cases where folio_estimated_sharers() == 1 would currently be the
best we can do (and in some cases, page_mapcount() == 1).

And I recently discovered that khugepaged doesn't collapse file-backed pages to
a PMD-size THP if they belong to a large folio, so I'm guessing it may also
suffer the same behaviour for anon memory. I'm not sure if that's what your
"khugepaged ..." comment refers to?

So taking all that and trying to put together a complete outstanding list for
strict requirements:

- Shared vs Exclusive Mappings (DavidH)
- user-triggered page migration
- NUMA hinting/balancing
- Enhance khugepaged to collapse to PMD-size from PTE-mapped large folios
- Compaction of Large Folios (Zi Yan)
- Swap out small-size THP without Split (Ryan Roberts)


>
>
> Now, these are just my thoughts, and I'm happy about other thoughts.

As always, thanks for taking the time - I really appreciate it.

Thanks,
Ryan


2023-10-09 16:23:58

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

[...]

>>
>> I dislike exposing "orders" to the users, I'm happy to be convinced why I am
>> wrong and it is a good idea.
>>
>> So maybe "Small THP"/"Small-sized THP" is better. Or "Medium-sized THP" -- as
>> said, I think FreeBSD tends to call it "Medium-sized superpages". But what's
>> small/medium/large is debatable. "Small" implies at least that it's smaller than
>> what we used to know, which is a fact.
>>
>> Can we also now use the terminology consistently? (e.g., "variable-order, large
>> folios for anonymous memory" -> "Small-sized anonymous THP", you can just point
>> at the previous patch set name in the cover letter)
>
> Yes absolutely. FWIW, I was deliberately not changing the title of the patchset
> so people could easily see it was an evolution of something posted before. But
> if it's the norm to change the title as the patchset evolves, I'm very happy to
> do that. And there are other places too, in commit logs that I can tidy up. I
> will assume "PMD-sized THP", "small-sized THP" and "anonymous small-sized THP"
> (that last one slightly different from what David suggested above - it means
> "small-sized THP" can still be grepped) unless others object.

Absolutely fine with me. Hoping other people will object when I talk
nonsense or my suggestions don't make any sense.

Or even better, propose something better :)

>
>>
>>>
>>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>>>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>>     and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>>>     overhead. This should benefit all architectures.
>>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>>>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>>>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>>
>>> The major change in this revision is the addition of sysfs controls to allow
>>> this "small-order THP" to be enabled/disabled/configured independently of
>>> PMD-order THP. The approach I've taken differs a bit from previous discussions;
>>> instead of creating a whole new interface ("large_folio"), I'm extending THP. I
>>> personally think this makes things clearer and more extensible. See [6] for
>>> detailed rationale.
>>
>> Change 2: sysfs interface.
>>
>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>> agree.
>>
>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>> bitmaps at all. We can do better if we want to go down that path.
>>
>> Maybe we should take a look at hugetlb, and how they added support for multiple
>> sizes. What *might* make sense could be (depending on which values we actually
>> support!)
>>
>>
>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>
>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>> first? Start with the "enabled" option.
>>
>>
>> enabled: always [global] madvise never
>>
>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>> to "never".
>
> My only reservation about this approach is the potential for a future need for a
> "one setting applied across all sizes" class of control (e.g. "auto"). I think
> we agreed in the previous meetings that chasing a solution for "auto" was a good
> aspiration to have, so it would be good to have a place we we can insert that in
> future. The main reason why I chose to expose the "anon_orders" control is
> because it is possible to both enable/disable the various sizes as well as
> specificy (e.g.) "auto", without creating redundancy. But I agree that ideally
> we wouldn't expose orders to the user; I was attempting a compromise to simplify
> the "auto" case.
>
> A potential (though feels quite complex) solution to make auto work with your
> proposal: Add "auto" as an option to the existing global enabled file, and to
> all of your proposed new enabled files. But its only possible to *set* auto
> through the global file. And when it is set, all of the size-specific enabled
> files read-back "auto" too. Any any writes to the size-specific enabled files
> are ignored (or remembered but not enacted) until the global enabled file is
> changed away from auto.

Yes, I think there are various ways forward regarding that. Or to enable
"auto" mode only once all are "auto", and as soon as one is not "auto",
just disable it. A simple

echo "auto" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled

Would do to enable it. Or, have them all be "global" and have a global
"auto" mode as you raised.

echo "global" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
echo "auto" > /sys/kernel/mm/transparent_hugepage/enabled

>
> But I'm not sure if adding a new option to the global enabled file might break
> compat?

I think we used to extend the "defrag" option, see

commit 21440d7eb9044001b7fdb71d0163689f60a0f2a1
Author: David Rientjes <[email protected]>
Date: Wed Feb 22 15:45:49 2017 -0800

mm, thp: add new defer+madvise defrag option


So I suspect we could extend that one in a similar way.

But again, this is just the thing that came to mind when thinking about
how to:
a) avoid orders
b) make it configurable and future-proof
c) make it look a bit consistent with other interfaces (hugetlb and
existing thp)
d) still prepare for an auto mode that we want in the future

I'm happy to hear other ideas.

>
>>
>>
>>
>> That sounds reasonable at least to me, and we would be using what we learned
>> from THP (as John suggested).  That still gives reasonable flexibility without
>> going too wild, and a better IMHO interface.
>>
>> I understand Yu's point about ABI discussions and "0 knobs". I'm happy as long
>> as we can have something that won't hurt us later and still be able to use this
>> in distributions within a reasonable timeframe. Enabling/disabling individual
>> sizes does not sound too restrictive to me. And we could always add an "auto"
>> setting later and default to that with a new kconfig knob.
>>
>> If someone wants to configure it, why not. Let's just prepare a way to to handle
>> this "better" automatically in the future (if ever ...).
>>
>>
>> Change 3: Stats
>>
>>> /proc/meminfo:
>>>    Introduce new "AnonHugePteMap" field, which reports the amount of
>>>    memory (in KiB) mapped from large folios globally (similar to
>>>    AnonHugePages field).
>>
>> AnonHugePages is and remains "PMD-sized THP that is mapped using a PMD", I think
>> we all agree on that. It should have been named "AnonPmdMapped" or
>> "AnonHugePmdMapped", too bad, we can't change that.
>
> Yes agreed. I did consider redefining "AnonHugePages" to cover PMD- and
> PTE-mapped memory, then introduce both an "AnonHugePmdMapped" and
> "AnonHugePteMapped", but I think that would likely break things. Its further
> complicated because vmstats prints it in PMD-size units, so can't represent
> PTE-mapped memory in that counter.

:/

>
>>
>> "AnonHugePteMap" better be "AnonHugePteMapped".
>
> I agree, but I went with the shorter one because any longer and it would unalign
> the value e.g:
>
> AnonHugePages: 0 kB
> AnonHugePteMapped: 0 kB
> ShmemPmdMapped: 0 kB
> Shared_Hugetlb: 0 kB
>

Can't that be handled? We surely have long stuff in there:

HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 1081344 kB
ShmemPmdMapped: 0 kB

HardwareCorrupted has same length as AnonHugePteMapped

But I'm not convinced about "AnonHugePteMapped" yet :)

> So would need to decide which is preferable, or come up with a shorter name.
>
>>
>> But, I wonder if we want to expose this "PteMapped" to user space *at all*. Why
>> should they care if it's PTE mapped? For PMD-sized THP it makes a bit of sense,
>> because !PMD implied !performance, and one might have been able to troubleshoot
>> that somehow. For PTE-mapped, it doesn't make much sense really, they are always
>> PTE-mapped.
>
> I disagree; I've been using it a lot to debug performance issues. It tells you
> how much of your anon memory is allocated with large folios. And making that
> percentage bigger improves performance; fewer page faults, and with a separate
> contpte series on arm64, better use of the TLB. Reasons might include; poorly
> aligned/too small VMAs, memory fragmentation preventing allocation, CoW, etc.

Just because a small-sized THP is PTE-mapped doesn't tell you anything,
really. What you want to know is if it is "completely" and
"consecutively" mapped such that the HW can actually benefit from it --
if HW even supports it. So "PTE-mapped THP" is just part of the story.
And that's where it gets tricky I think.

I agree that it's good for debugging, but then maybe it should a) live
somewhere else (debugfs, bucketing below) and b) be consistent with
other THPs, meaning we also want similar stats somewhere.

One idea would be to expose such stats in a R/O fashion like
"nr_allocated" or "nr_hugepages" in
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/ and friends. Of
course, maybe tagging them with "anon" prefix.

>
> I would actually argue for adding similar counters for file-backed memory too
> for the same reasons. (I actually posted an independent patch a while back that
> did this for file- and anon- memory, bucketted by size. But I think the idea of
> the bucketting was NAKed.
For debugging, I *think* it might be valuable to see how many THP of
each size are allocated. Tracking exactly "how is it mapped" is not easy
to achieve as we learned. PMD-mapped was easy, but also requires us to
keep doing that tracking for all eternity ...

Do you have a pointer to the patch set? Did it try to squeeze it into
/proc/meminfo?

>
>>
>> That also raises the question how you would account a PTE-mapped THP. The hole
>> thing? Only the parts that are mapped? Let's better not go down that path.
>
> The approach I've taken in this series is the simple one - account every page
> that belongs to a large folio from when it is first mapped to last unmapped.
> Yes, in this case, you might not actually be mapping the full thing
> contigiously. But it gives a good indication.
>
> I also considered accounting the whole folio only when all of its pages become
> mapped (although not worrying about them all being contiguous). That's still
> simple to implement for all counters except smaps. So went with the simplest
> approach with the view that its "good enough".

If you take a look at "ShmemHugePages" and "FileHugePages", there we
actually track them when they get allocated+freed, which is much easier
than tracking when/how they are (un)mapped. But it's only done for
PMD-sized THP for now.

>
>>
>> That leaves the question why we would want to include them here at all in a
>> special PTE-mapped way?
>>
>>
>> Again, let's look at hugetlb: I prepared 1 GiB and one 2 MiB page.
>>
>> HugePages_Total:       1
>> HugePages_Free:        1
>> HugePages_Rsvd:        0
>> HugePages_Surp:        0
>> Hugepagesize:       2048 kB
>> Hugetlb:         1050624 kB
>>
>> -> Only the last one gives the sum, the other stats don't even mention the other
>> ones. [how do we get their stats, if at all?]
>
> There are some files in /sys/kernel/mm/hugepages/hugepages-XXkB and
> /sys/devices/system/node/node*/hugepages/; nr_hugepages, free_hugepages,
> surplus_hugepages. But this interface also constitutes the allocator, not just
> stats, I think.

Ah, I missed that we expose free vs. reserved vs. surpluse ... there as
well; I thought we would only have "nr_hugepages".

>
>>
>> So maybe, we only want a summary of how many anon huge pages of any size are
>> allocated (independent of the PTE vs. PMD mapping),
>
> Are you proposing (AnonHugePages + AnonHugePteMapped) here or something else? If
> the former, then I don't really see the difference. We have to continue to
> expose PMD-size (AnonHugePages). So either add PTE-only counter, and derive the
> total, or add a total counter and derive PTE-only. I suspect I've misunderstood
> your point.

I don't think we should go down the "PteMapped" path. Probably we want
"bucketing" stats as you said, and maybe a global one that just combines
everything (any THP). But naming will be difficult.

>
>> and some other source to
>> eventually inspect how the different sizes behave.
>>
>> But note that for non-PMD-sized file THP we don't even have special counters!
>> ... so maybe we should also defer any such stats and come up with something
>> uniform for all types of non-PMD-sized THP.
>
> Indeed, I can see benefit in adding these for file THP - in fact I have a patch
> that does exactly that to help my development work. I had envisaged that we
> could add something like FileHugePteMapped, ShmemHugePteMapped that would follow
> the same semantics as AnonHugePteMapped.

Again, maybe we can find something that does not involve the "PteMapped"
terminology and just gives us a big total of "allocated" THP. For
detailed stats for debugging, maybe we can just use a different
interface then.

>
>>
>>
>> Sane discussion applies to all other stats.
>>
>>
>>>
>>> Because we now have runtime enable/disable control, I've removed the compile
>>> time Kconfig switch. It still defaults to runtime-disabled.
>>>
>>> NOTE: These changes should not be merged until the prerequisites are complete.
>>> These are in progress and tracked at [7].
>>
>> We should probably list them here, and classify which one we see as strict a
>> requirement, which ones might be an optimization.
>
>
> I'll need some help with clasifying them, so showing my working. Final list that
> I would propose as strict requirements at bottom.
>
> This is my list with status, as per response to Yu in other thread:
>
> - David is working on "shared vs exclusive mappings"

Probably "COW reuse support" is a separate item, although my approach
would cover that.

The question is, if the estimate we're using in most code for now would
at least be sufficient to merge it. The estimate is easily wrong, but we
do have that issue with PTE-mapped THP already.

But that argument probably applies to most things here: the difference
is that PTE-mapped THP are not the default, that's why nobody really cared.

[I'm playing with an almost-lockless scheme right now and hope I have
something running soonish -- as you know, I got distracted]

> - Zi Yan has posted an RFC for compaction
> - Yin Fengwei's mlock series is now in mm-stable
> - Yin Fengwei's madvise series is in 6.6
> - I've reworked and posted a series for deferred_split_folio; although I've
> deprioritied it because Yu said it wasn't really a pre-requisite.
> - numa balancing depends on David's "shared vs exclusive mappings" work
> - I've started looking at "large folios in swap cache" in the background,
> because I'm seeing some slow down with large folios, but we also agreed that
> wasn't a prerequisite
>

Probably it would be good to talk about the items and how we would
classify them in a meeting.


> Although, since sending that, I've determined that when running kernel
> compilation across high number of cores on arm64, the cost of splitting the
> folios gets large due to needing to broadcast the extra TLBIs. So I think the
> last point on that list may be a prerequisite after all. (I've been able to fix
> this by adding support for allocating large folios in the swap file, and
> avoiding the split - planning to send RFC this week).
>
> There is also this set of things that you mentioned against "shared vs exclusive
> mappings", which I'm not sure if you are planning to cover as part of your work
> or if they are follow on things that will need to be done:
>
> (1) Detecting shared folios, to not mess with them while they are shared.
> MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
> replace cases where folio_estimated_sharers() == 1 would currently be the
> best we can do (and in some cases, page_mapcount() == 1).
>
> And I recently discovered that khugepaged doesn't collapse file-backed pages to
> a PMD-size THP if they belong to a large folio, so I'm guessing it may also
> suffer the same behaviour for anon memory. I'm not sure if that's what your
> "khugepaged ..." comment refers to?

Yes. But I did not look into all the details yet.

"kuhepaged" collapse support to small-sized THP is probably also a very
imporant item, although it might be less relevant than for PMD -- and I
consider it future work. See below.

>
> So taking all that and trying to put together a complete outstanding list for
> strict requirements:
>
> - Shared vs Exclusive Mappings (DavidH)
> - user-triggered page migration
> - NUMA hinting/balancing
> - Enhance khugepaged to collapse to PMD-size from PTE-mapped large folios
> - Compaction of Large Folios (Zi Yan)
> - Swap out small-size THP without Split (Ryan Roberts)

^ that's going to be tough, I can promise. And the only way to live
without that would be khugepaged support. (because that's how it's all
working for PMD-sized THP after all!)

Once a PMD-sized THP was swapped out and evicted, it will always come
back in order-0 folios. khugeged will re-collapse into PMD-sized chunks.
If we could do that for PTE-sized THP as well ...

>
>
>>
>>
>> Now, these are just my thoughts, and I'm happy about other thoughts.
>
> As always, thanks for taking the time - I really appreciate it.

Sure. Hoping others can comment.

My gut feeling is that it's best to focus on getting the sysfs interface
right+future proof and handling the stats independently. While being a
good debug mechanism, I wouldn't consider these stats a requirement: we
don't have them for file/shmem small-sized thp so far as well.

So maybe really better to handle the stats in meminfo and friends
separately.

--
Cheers,

David / dhildenb

2023-10-10 10:47:51

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On 09/10/2023 17:22, David Hildenbrand wrote:
> [...]
>
>>>
>>> I dislike exposing "orders" to the users, I'm happy to be convinced why I am
>>> wrong and it is a good idea.
>>>
>>> So maybe "Small THP"/"Small-sized THP" is better. Or "Medium-sized THP" -- as
>>> said, I think FreeBSD tends to call it "Medium-sized superpages". But what's
>>> small/medium/large is debatable. "Small" implies at least that it's smaller than
>>> what we used to know, which is a fact.
>>>
>>> Can we also now use the terminology consistently? (e.g., "variable-order, large
>>> folios for anonymous memory" -> "Small-sized anonymous THP", you can just point
>>> at the previous patch set name in the cover letter)
>>
>> Yes absolutely. FWIW, I was deliberately not changing the title of the patchset
>> so people could easily see it was an evolution of something posted before. But
>> if it's the norm to change the title as the patchset evolves, I'm very happy to
>> do that. And there are other places too, in commit logs that I can tidy up. I
>> will assume "PMD-sized THP", "small-sized THP" and "anonymous small-sized THP"
>> (that last one slightly different from what David suggested above - it means
>> "small-sized THP" can still be grepped) unless others object.
>
> Absolutely fine with me. Hoping other people will object when I talk nonsense or
> my suggestions don't make any sense.
>
> Or even better, propose something better :)
>
>>
>>>
>>>>
>>>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>>>>      pages, there are efficiency savings to be had; fewer page faults,
>>>> batched PTE
>>>>      and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>>>>      overhead. This should benefit all architectures.
>>>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>>>>      advantage of HW TLB compression techniques. A reduction in TLB pressure
>>>>      speeds up kernel and user space. arm64 systems have 2 mechanisms to
>>>> coalesce
>>>>      TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>>>
>>>> The major change in this revision is the addition of sysfs controls to allow
>>>> this "small-order THP" to be enabled/disabled/configured independently of
>>>> PMD-order THP. The approach I've taken differs a bit from previous discussions;
>>>> instead of creating a whole new interface ("large_folio"), I'm extending THP. I
>>>> personally think this makes things clearer and more extensible. See [6] for
>>>> detailed rationale.
>>>
>>> Change 2: sysfs interface.
>>>
>>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>>> agree.
>>>
>>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>>> bitmaps at all. We can do better if we want to go down that path.
>>>
>>> Maybe we should take a look at hugetlb, and how they added support for multiple
>>> sizes. What *might* make sense could be (depending on which values we actually
>>> support!)
>>>
>>>
>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>>
>>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>>> first? Start with the "enabled" option.
>>>
>>>
>>> enabled: always [global] madvise never
>>>
>>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>>> to "never".
>>
>> My only reservation about this approach is the potential for a future need for a
>> "one setting applied across all sizes" class of control (e.g. "auto"). I think
>> we agreed in the previous meetings that chasing a solution for "auto" was a good
>> aspiration to have, so it would be good to have a place we we can insert that in
>> future. The main reason why I chose to expose the "anon_orders" control is
>> because it is possible to both enable/disable the various sizes as well as
>> specificy (e.g.) "auto", without creating redundancy. But I agree that ideally
>> we wouldn't expose orders to the user; I was attempting a compromise to simplify
>> the "auto" case.
>>
>> A potential (though feels quite complex) solution to make auto work with your
>> proposal: Add "auto" as an option to the existing global enabled file, and to
>> all of your proposed new enabled files. But its only possible to *set* auto
>> through the global file. And when it is set, all of the size-specific enabled
>> files read-back "auto" too. Any any writes to the size-specific enabled files
>> are ignored (or remembered but not enacted) until the global enabled file is
>> changed away from auto.
>
> Yes, I think there are various ways forward regarding that. Or to enable "auto"
> mode only once all are "auto", and as soon as one is not "auto", just disable
> it. A simple
>
> echo "auto" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled

I'm not really a fan, because this implies that you have a period where "auto"
is reported for a size, but its not really in "auto" mode yet.

>
> Would do to enable it. Or, have them all be "global" and have a global "auto"
> mode as you raised.
>
> echo "global" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
> echo "auto" > /sys/kernel/mm/transparent_hugepage/enabled
>

Again, this isn't atomic either. I tend to prefer my proposal because it
switches atomically - there are no weird intermediate states. Anyway, I guess
the important point is we have demonstrated that your proposed interface could
be extended to support "auto" in future, should we need it.

>>
>> But I'm not sure if adding a new option to the global enabled file might break
>> compat?
>
> I think we used to extend the "defrag" option, see
>
> commit 21440d7eb9044001b7fdb71d0163689f60a0f2a1
> Author: David Rientjes <[email protected]>
> Date:   Wed Feb 22 15:45:49 2017 -0800
>
>     mm, thp: add new defer+madvise defrag option
>
>
> So I suspect we could extend that one in a similar way.
>
> But again, this is just the thing that came to mind when thinking about how to:
> a) avoid orders
> b) make it configurable and future-proof
> c) make it look a bit consistent with other interfaces (hugetlb and
>    existing thp)
> d) still prepare for an auto mode that we want in the future
>
> I'm happy to hear other ideas.
>
>>
>>>
>>>
>>>
>>> That sounds reasonable at least to me, and we would be using what we learned
>>> from THP (as John suggested).  That still gives reasonable flexibility without
>>> going too wild, and a better IMHO interface.
>>>
>>> I understand Yu's point about ABI discussions and "0 knobs". I'm happy as long
>>> as we can have something that won't hurt us later and still be able to use this
>>> in distributions within a reasonable timeframe. Enabling/disabling individual
>>> sizes does not sound too restrictive to me. And we could always add an "auto"
>>> setting later and default to that with a new kconfig knob.
>>>
>>> If someone wants to configure it, why not. Let's just prepare a way to to handle
>>> this "better" automatically in the future (if ever ...).
>>>
>>>
>>> Change 3: Stats
>>>
>>>> /proc/meminfo:
>>>>     Introduce new "AnonHugePteMap" field, which reports the amount of
>>>>     memory (in KiB) mapped from large folios globally (similar to
>>>>     AnonHugePages field).
>>>
>>> AnonHugePages is and remains "PMD-sized THP that is mapped using a PMD", I think
>>> we all agree on that. It should have been named "AnonPmdMapped" or
>>> "AnonHugePmdMapped", too bad, we can't change that.
>>
>> Yes agreed. I did consider redefining "AnonHugePages" to cover PMD- and
>> PTE-mapped memory, then introduce both an "AnonHugePmdMapped" and
>> "AnonHugePteMapped", but I think that would likely break things. Its further
>> complicated because vmstats prints it in PMD-size units, so can't represent
>> PTE-mapped memory in that counter.
>
> :/
>
>>
>>>
>>> "AnonHugePteMap" better be "AnonHugePteMapped".
>>
>> I agree, but I went with the shorter one because any longer and it would unalign
>> the value e.g:
>>
>>      AnonHugePages:         0 kB
>>      AnonHugePteMapped:        0 kB
>>      ShmemPmdMapped:        0 kB
>>      Shared_Hugetlb:        0 kB
>>
>
> Can't that be handled? We surely have long stuff in there:
>
> HardwareCorrupted:     0 kB
> AnonHugePages:         0 kB
> ShmemHugePages:  1081344 kB
> ShmemPmdMapped:        0 kB
>
> HardwareCorrupted has same length as AnonHugePteMapped

HardwareCorrupted is special cased and has a field length of 5, so the largest
value you can represent before it gets pushed out is ~97MB. I imagine that's
plenty for HardwareCorrupted, but unlikely enough for AnonHugePteMapped. The
standard field size is 8, which provides for ~95GB before becoming unaligned.

>
> But I'm not convinced about "AnonHugePteMapped" yet :)

OK

>
>> So would need to decide which is preferable, or come up with a shorter name.
>>
>>>
>>> But, I wonder if we want to expose this "PteMapped" to user space *at all*. Why
>>> should they care if it's PTE mapped? For PMD-sized THP it makes a bit of sense,
>>> because !PMD implied !performance, and one might have been able to troubleshoot
>>> that somehow. For PTE-mapped, it doesn't make much sense really, they are always
>>> PTE-mapped.
>>
>> I disagree; I've been using it a lot to debug performance issues. It tells you
>> how much of your anon memory is allocated with large folios. And making that
>> percentage bigger improves performance; fewer page faults, and with a separate
>> contpte series on arm64, better use of the TLB. Reasons might include; poorly
>> aligned/too small VMAs, memory fragmentation preventing allocation, CoW, etc.
>
> Just because a small-sized THP is PTE-mapped doesn't tell you anything, really.
> What you want to know is if it is "completely" and "consecutively" mapped such
> that the HW can actually benefit from it -- if HW even supports it. So
> "PTE-mapped THP" is just part of the story. And that's where it gets tricky I
> think.
>
> I agree that it's good for debugging, but then maybe it should a) live somewhere
> else (debugfs, bucketing below) and b) be consistent with other THPs, meaning we
> also want similar stats somewhere.
>
> One idea would be to expose such stats in a R/O fashion like "nr_allocated" or
> "nr_hugepages" in /sys/kernel/mm/transparent_hugepage/hugepages-64kB/ and
> friends. Of course, maybe tagging them with "anon" prefix.

I see your point, but I don't completely agree with it all. That said, given
your conclusion at the bottom, perhaps we should park the discussion about the
accounting for a separate series in future? Then we can focus on the ABI?

>
>>
>> I would actually argue for adding similar counters for file-backed memory too
>> for the same reasons. (I actually posted an independent patch a while back that
>> did this for file- and anon- memory, bucketted by size. But I think the idea of
>> the bucketting was NAKed.
> For debugging, I *think* it might be valuable to see how many THP of each size
> are allocated. Tracking exactly "how is it mapped" is not easy to achieve as we
> learned. PMD-mapped was easy, but also requires us to keep doing that tracking
> for all eternity ...
>
> Do you have a pointer to the patch set? Did it try to squeeze it into
> /proc/meminfo?

I was actually only working on smaps/smaps_rollup, which has been my main tool
for debugging. patches at [1].

[1] https://lore.kernel.org/linux-mm/[email protected]/

>
>>
>>>
>>> That also raises the question how you would account a PTE-mapped THP. The hole
>>> thing? Only the parts that are mapped? Let's better not go down that path.
>>
>> The approach I've taken in this series is the simple one - account every page
>> that belongs to a large folio from when it is first mapped to last unmapped.
>> Yes, in this case, you might not actually be mapping the full thing
>> contigiously. But it gives a good indication.
>>
>> I also considered accounting the whole folio only when all of its pages become
>> mapped (although not worrying about them all being contiguous). That's still
>> simple to implement for all counters except smaps. So went with the simplest
>> approach with the view that its "good enough".
>
> If you take a look at "ShmemHugePages" and "FileHugePages", there we actually
> track them when they get allocated+freed, which is much easier than tracking
> when/how they are (un)mapped. But it's only done for PMD-sized THP for now.
>
>>
>>>
>>> That leaves the question why we would want to include them here at all in a
>>> special PTE-mapped way?
>>>
>>>
>>> Again, let's look at hugetlb: I prepared 1 GiB and one 2 MiB page.
>>>
>>> HugePages_Total:       1
>>> HugePages_Free:        1
>>> HugePages_Rsvd:        0
>>> HugePages_Surp:        0
>>> Hugepagesize:       2048 kB
>>> Hugetlb:         1050624 kB
>>>
>>> -> Only the last one gives the sum, the other stats don't even mention the other
>>> ones. [how do we get their stats, if at all?]
>>
>> There are some files in /sys/kernel/mm/hugepages/hugepages-XXkB and
>> /sys/devices/system/node/node*/hugepages/; nr_hugepages, free_hugepages,
>> surplus_hugepages. But this interface also constitutes the allocator, not just
>> stats, I think.
>
> Ah, I missed that we expose free vs. reserved vs. surpluse ... there as well; I
> thought we would only have "nr_hugepages".
>
>>
>>>
>>> So maybe, we only want a summary of how many anon huge pages of any size are
>>> allocated (independent of the PTE vs. PMD mapping),
>>
>> Are you proposing (AnonHugePages + AnonHugePteMapped) here or something else? If
>> the former, then I don't really see the difference. We have to continue to
>> expose PMD-size (AnonHugePages). So either add PTE-only counter, and derive the
>> total, or add a total counter and derive PTE-only. I suspect I've misunderstood
>> your point.
>
> I don't think we should go down the "PteMapped" path. Probably we want
> "bucketing" stats as you said, and maybe a global one that just combines
> everything (any THP). But naming will be difficult.
>
>>
>>> and some other source to
>>> eventually inspect how the different sizes behave.
>>>
>>> But note that for non-PMD-sized file THP we don't even have special counters!
>>> ... so maybe we should also defer any such stats and come up with something
>>> uniform for all types of non-PMD-sized THP.
>>
>> Indeed, I can see benefit in adding these for file THP - in fact I have a patch
>> that does exactly that to help my development work. I had envisaged that we
>> could add something like FileHugePteMapped, ShmemHugePteMapped that would follow
>> the same semantics as AnonHugePteMapped.
>
> Again, maybe we can find something that does not involve the "PteMapped"
> terminology and just gives us a big total of "allocated" THP. For detailed stats
> for debugging, maybe we can just use a different interface then.
>
>>
>>>
>>>
>>> Sane discussion applies to all other stats.
>>>
>>>
>>>>
>>>> Because we now have runtime enable/disable control, I've removed the compile
>>>> time Kconfig switch. It still defaults to runtime-disabled.
>>>>
>>>> NOTE: These changes should not be merged until the prerequisites are complete.
>>>> These are in progress and tracked at [7].
>>>
>>> We should probably list them here, and classify which one we see as strict a
>>> requirement, which ones might be an optimization.
>>
>>
>> I'll need some help with clasifying them, so showing my working. Final list that
>> I would propose as strict requirements at bottom.
>>
>> This is my list with status, as per response to Yu in other thread:
>>
>>    - David is working on "shared vs exclusive mappings"
>
> Probably "COW reuse support" is a separate item, although my approach would
> cover that.

Yeah that's the in the original thread as (2), but I thought we were all agreed
that is not a prerequisite so didn't bring it over here.

>
> The question is, if the estimate we're using in most code for now would at least
> be sufficient to merge it. The estimate is easily wrong, but we do have that
> issue with PTE-mapped THP already.

Well as I understand it, at least the NUMA balancing code and khugepaged are
ignoring all folios > order-0. That's probably ok for the occasional PTE-mapped
THP, but I assume it becomes untenable if large folios are the norm. Perhaps we
can modify those paths to work with the current estimators in order to remove
your work from the critical path - is that what you are getting at?

>
> But that argument probably applies to most things here: the difference is that
> PTE-mapped THP are not the default, that's why nobody really cared.
>
> [I'm playing with an almost-lockless scheme right now and hope I have something
> running soonish -- as you know, I got distracted]
>
>>    - Zi Yan has posted an RFC for compaction
>>    - Yin Fengwei's mlock series is now in mm-stable
>>    - Yin Fengwei's madvise series is in 6.6
>>    - I've reworked and posted a series for deferred_split_folio; although I've
>>      deprioritied it because Yu said it wasn't really a pre-requisite.
>>    - numa balancing depends on David's "shared vs exclusive mappings" work
>>    - I've started looking at "large folios in swap cache" in the background,
>>      because I'm seeing some slow down with large folios, but we also agreed that
>>      wasn't a prerequisite
>>
>
> Probably it would be good to talk about the items and how we would classify them
> in a meeting.

Perhaps we can get a slot in Matthew's meeting tomorrow?

>
>
>> Although, since sending that, I've determined that when running kernel
>> compilation across high number of cores on arm64, the cost of splitting the
>> folios gets large due to needing to broadcast the extra TLBIs. So I think the
>> last point on that list may be a prerequisite after all. (I've been able to fix
>> this by adding support for allocating large folios in the swap file, and
>> avoiding the split - planning to send RFC this week).
>>
>> There is also this set of things that you mentioned against "shared vs exclusive
>> mappings", which I'm not sure if you are planning to cover as part of your work
>> or if they are follow on things that will need to be done:
>>
>> (1) Detecting shared folios, to not mess with them while they are shared.
>>      MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
>>      replace cases where folio_estimated_sharers() == 1 would currently be the
>>      best we can do (and in some cases, page_mapcount() == 1).
>>
>> And I recently discovered that khugepaged doesn't collapse file-backed pages to
>> a PMD-size THP if they belong to a large folio, so I'm guessing it may also
>> suffer the same behaviour for anon memory. I'm not sure if that's what your
>> "khugepaged ..." comment refers to?
>
> Yes. But I did not look into all the details yet.
>
> "kuhepaged" collapse support to small-sized THP is probably also a very imporant
> item, although it might be less relevant than for PMD -- and I consider it
> future work. See below.

Yes I agree that it's definitely future work. Nothing regresses from today's
performance if you don't have that.

>
>>
>> So taking all that and trying to put together a complete outstanding list for
>> strict requirements:
>>
>>    - Shared vs Exclusive Mappings (DavidH)
>>        - user-triggered page migration
>>        - NUMA hinting/balancing
>>        - Enhance khugepaged to collapse to PMD-size from PTE-mapped large folios
>>    - Compaction of Large Folios (Zi Yan)
>>    - Swap out small-size THP without Split (Ryan Roberts)
>
> ^ that's going to be tough, I can promise. And the only way to live without that
> would be khugepaged support. (because that's how it's all working for PMD-sized
> THP after all!)

Are you referring specifically to the "swap out" line there? If so, it wasn't my
plan that we would *swap in* large folios - only swap them *out* as large folios
to avoid the cost of splitting. Then when they come back in, the come in as
single pages, just like PMD-sized THP, if I've understood things correctly. I
have a patch working and showing the perf improvement as a result. I'm planning
to post an RFC today, hopefully.

I don't see the swap-in side as a problem for the initial patch set. OK, they
come in as single pages, so you lost the potential TLB benefits. But that's no
worse than today's performance so not a regression. And the ratio of SW savings
on THP allocation to HW savings from the TLB is very different for small-sized
THP; much more of the benefit comes from the SW and that's still there.

>
> Once a PMD-sized THP was swapped out and evicted, it will always come back in
> order-0 folios. khugeged will re-collapse into PMD-sized chunks. If we could do
> that for PTE-sized THP as well ...

Yes, sure, but that's a future improvement, not a requirement to prevent
regression vs today, right?

>
>>
>>
>>>
>>>
>>> Now, these are just my thoughts, and I'm happy about other thoughts.
>>
>> As always, thanks for taking the time - I really appreciate it.
>
> Sure. Hoping others can comment.
>
> My gut feeling is that it's best to focus on getting the sysfs interface
> right+future proof and handling the stats independently. While being a good
> debug mechanism, I wouldn't consider these stats a requirement: we don't have
> them for file/shmem small-sized thp so far as well.
>
> So maybe really better to handle the stats in meminfo and friends separately.
>

I'd be very happy with that approach if others are bought in.

Thanks,
Ryan



2023-10-13 20:16:08

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory


>>
>> Yes, I think there are various ways forward regarding that. Or to enable "auto"
>> mode only once all are "auto", and as soon as one is not "auto", just disable
>> it. A simple
>>
>> echo "auto" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
>
> I'm not really a fan, because this implies that you have a period where "auto"
> is reported for a size, but its not really in "auto" mode yet.

I think there are various alternatives that are feasible.

For most systems later, you'd want to just "auto" via compile-time
CONFIG option as default or via some cmdline option like
"transparent_hugepage=auto".

>
>>
>> Would do to enable it. Or, have them all be "global" and have a global "auto"
>> mode as you raised.
>>
>> echo "global" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
>> echo "auto" > /sys/kernel/mm/transparent_hugepage/enabled
>>
>
> Again, this isn't atomic either. I tend to prefer my proposal because it
> switches atomically - there are no weird intermediate states. Anyway, I guess
> the important point is we have demonstrated that your proposed interface could
> be extended to support "auto" in future, should we need it.

I don't think the atomic switch is really relevant. But that's probably
a separate discussion.

[...]

>>
>> Just because a small-sized THP is PTE-mapped doesn't tell you anything, really.
>> What you want to know is if it is "completely" and "consecutively" mapped such
>> that the HW can actually benefit from it -- if HW even supports it. So
>> "PTE-mapped THP" is just part of the story. And that's where it gets tricky I
>> think.
>>
>> I agree that it's good for debugging, but then maybe it should a) live somewhere
>> else (debugfs, bucketing below) and b) be consistent with other THPs, meaning we
>> also want similar stats somewhere.
>>
>> One idea would be to expose such stats in a R/O fashion like "nr_allocated" or
>> "nr_hugepages" in /sys/kernel/mm/transparent_hugepage/hugepages-64kB/ and
>> friends. Of course, maybe tagging them with "anon" prefix.
>
> I see your point, but I don't completely agree with it all. That said, given
> your conclusion at the bottom, perhaps we should park the discussion about the
> accounting for a separate series in future? Then we can focus on the ABI?

Yes!

>
>>
>>>
>>> I would actually argue for adding similar counters for file-backed memory too
>>> for the same reasons. (I actually posted an independent patch a while back that
>>> did this for file- and anon- memory, bucketted by size. But I think the idea of
>>> the bucketting was NAKed.
>> For debugging, I *think* it might be valuable to see how many THP of each size
>> are allocated. Tracking exactly "how is it mapped" is not easy to achieve as we
>> learned. PMD-mapped was easy, but also requires us to keep doing that tracking
>> for all eternity ...
>>
>> Do you have a pointer to the patch set? Did it try to squeeze it into
>> /proc/meminfo?
>
> I was actually only working on smaps/smaps_rollup, which has been my main tool
> for debugging. patches at [1].
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>

Thanks for the pointer!

[...]

>>>
>>> I'll need some help with clasifying them, so showing my working. Final list that
>>> I would propose as strict requirements at bottom.
>>>
>>> This is my list with status, as per response to Yu in other thread:
>>>
>>>    - David is working on "shared vs exclusive mappings"
>>
>> Probably "COW reuse support" is a separate item, although my approach would
>> cover that.
>
> Yeah that's the in the original thread as (2), but I thought we were all agreed
> that is not a prerequisite so didn't bring it over here.

Agreed. Having a full list of todo items might be reasonable.

>
>>
>> The question is, if the estimate we're using in most code for now would at least
>> be sufficient to merge it. The estimate is easily wrong, but we do have that
>> issue with PTE-mapped THP already.
>
> Well as I understand it, at least the NUMA balancing code and khugepaged are
> ignoring all folios > order-0. That's probably ok for the occasional PTE-mapped
> THP, but I assume it becomes untenable if large folios are the norm. Perhaps we
> can modify those paths to work with the current estimators in order to remove
> your work from the critical path - is that what you are getting at?

IMHO most of the code that now uses the estimate-logic is really
suboptimal, but it's all we have. It's probably interesting to see where
the false negative/positives are tolerable for now ... I hate to be at
the critical path ;) But I'm getting somewhere slowly but steadily
(slowly, because I'm constantly distracted -- and apparently sick).

[...]

>>
>>
>>> Although, since sending that, I've determined that when running kernel
>>> compilation across high number of cores on arm64, the cost of splitting the
>>> folios gets large due to needing to broadcast the extra TLBIs. So I think the
>>> last point on that list may be a prerequisite after all. (I've been able to fix
>>> this by adding support for allocating large folios in the swap file, and
>>> avoiding the split - planning to send RFC this week).
>>>
>>> There is also this set of things that you mentioned against "shared vs exclusive
>>> mappings", which I'm not sure if you are planning to cover as part of your work
>>> or if they are follow on things that will need to be done:
>>>
>>> (1) Detecting shared folios, to not mess with them while they are shared.
>>>      MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
>>>      replace cases where folio_estimated_sharers() == 1 would currently be the
>>>      best we can do (and in some cases, page_mapcount() == 1).
>>>
>>> And I recently discovered that khugepaged doesn't collapse file-backed pages to
>>> a PMD-size THP if they belong to a large folio, so I'm guessing it may also
>>> suffer the same behaviour for anon memory. I'm not sure if that's what your
>>> "khugepaged ..." comment refers to?
>>
>> Yes. But I did not look into all the details yet.
>>
>> "kuhepaged" collapse support to small-sized THP is probably also a very imporant
>> item, although it might be less relevant than for PMD -- and I consider it
>> future work. See below.
>
> Yes I agree that it's definitely future work. Nothing regresses from today's
> performance if you don't have that.
>
>>
>>>
>>> So taking all that and trying to put together a complete outstanding list for
>>> strict requirements:
>>>
>>>    - Shared vs Exclusive Mappings (DavidH)
>>>        - user-triggered page migration
>>>        - NUMA hinting/balancing
>>>        - Enhance khugepaged to collapse to PMD-size from PTE-mapped large folios
>>>    - Compaction of Large Folios (Zi Yan)
>>>    - Swap out small-size THP without Split (Ryan Roberts)
>>
>> ^ that's going to be tough, I can promise. And the only way to live without that
>> would be khugepaged support. (because that's how it's all working for PMD-sized
>> THP after all!)
>
> Are you referring specifically to the "swap out" line there? If so, it wasn't my
> plan that we would *swap in* large folios - only swap them *out* as large folios

Ah!

> to avoid the cost of splitting. Then when they come back in, the come in as
> single pages, just like PMD-sized THP, if I've understood things correctly. I
> have a patch working and showing the perf improvement as a result. I'm planning
> to post an RFC today, hopefully.
>
> I don't see the swap-in side as a problem for the initial patch set. OK, they
> come in as single pages, so you lost the potential TLB benefits. But that's no
> worse than today's performance so not a regression. And the ratio of SW savings
> on THP allocation to HW savings from the TLB is very different for small-sized
> THP; much more of the benefit comes from the SW and that's still there.
>
>>
>> Once a PMD-sized THP was swapped out and evicted, it will always come back in
>> order-0 folios. khugeged will re-collapse into PMD-sized chunks. If we could do
>> that for PTE-sized THP as well ...
>
> Yes, sure, but that's a future improvement, not a requirement to prevent
> regression vs today, right?

Yes. You can't just currently assume: as soon as you swap, the whole
benefit is gone because you end up will all order-0 pages.

These are certainly not limiting "merge" factors IMHO, but it's
certainly one of the things users of distributions will heavily complain
about ;)

PMD-sized THP are mostly self-healing in that sense.

>
>>
>>>
>>>
>>>>
>>>>
>>>> Now, these are just my thoughts, and I'm happy about other thoughts.
>>>
>>> As always, thanks for taking the time - I really appreciate it.
>>
>> Sure. Hoping others can comment.
>>
>> My gut feeling is that it's best to focus on getting the sysfs interface
>> right+future proof and handling the stats independently. While being a good
>> debug mechanism, I wouldn't consider these stats a requirement: we don't have
>> them for file/shmem small-sized thp so far as well.
>>
>> So maybe really better to handle the stats in meminfo and friends separately.
>>
>
> I'd be very happy with that approach if others are bought in.

Yeah. I'm expecting there still to be discussions, but then we shall
also here other proposals. memcg controls are IMHO certainly future work.

--
Cheers,

David / dhildenb

2023-10-20 12:34:03

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On 06/10/2023 21:06, David Hildenbrand wrote:
> On 29.09.23 13:44, Ryan Roberts wrote:
>> Hi All,
>

[...]

>> NOTE: These changes should not be merged until the prerequisites are complete.
>> These are in progress and tracked at [7].
>
> We should probably list them here, and classify which one we see as strict a
> requirement, which ones might be an optimization.
>

Bringing back the discussion of prerequistes to this thread following the
discussion at the mm-alignment meeting on Wednesday.

Slides, updated following discussion to reflect all the agreed items that are
prerequisites and enhancements, are at [1].

I've taken a closer look at the situation with khugepaged, and can confirm that
it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice
though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when
small-sized THP is enabled+always. So I've fixed that test up and will add the
patch to the next version of my series.

So I believe the khugepaged prerequisite can be marked as done.

[1]
https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA

Thanks,
Ryan

2023-10-25 16:24:38

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On 20/10/2023 13:33, Ryan Roberts wrote:
> On 06/10/2023 21:06, David Hildenbrand wrote:
>> On 29.09.23 13:44, Ryan Roberts wrote:
>>> Hi All,
>>
>
> [...]
>
>>> NOTE: These changes should not be merged until the prerequisites are complete.
>>> These are in progress and tracked at [7].
>>
>> We should probably list them here, and classify which one we see as strict a
>> requirement, which ones might be an optimization.
>>
>
> Bringing back the discussion of prerequistes to this thread following the
> discussion at the mm-alignment meeting on Wednesday.
>
> Slides, updated following discussion to reflect all the agreed items that are
> prerequisites and enhancements, are at [1].
>
> I've taken a closer look at the situation with khugepaged, and can confirm that
> it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice
> though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when
> small-sized THP is enabled+always. So I've fixed that test up and will add the
> patch to the next version of my series.
>
> So I believe the khugepaged prerequisite can be marked as done.
>
> [1]
> https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA

Hi All,

It's been a week since the mm alignment meeting discussion we had around
prerequisites and the ABI. I haven't heard any further feedback on the ABI
proposal, so I'm going to be optimistic and assume that nobody has found any
fatal flaws in it :).

Certainly, I think it held up to the potential future policies that Yu Zhou
cited on the call - the possibility of preferring a smaller size over a bigger
one, if the smaller size can be allocated without splitting a contiguous block.
I think the suggestion of adding a per-size priority file would solve it. And in
general because we have a per-size directory, that gives us lots of flexibility
for growth.

Anyway, given the lack of feedback, I'm proposing to spin a new version. I'm
planning to do the following:

- Drop the accounting patch (#3); we will continue to only account PMD-sized
THP for now. We can add more counters in future if needed. page cache large
folios haven't needed any new counters yet.

- Pivot to the ABI proposed by DavidH; per-size directories in a similar shape
to that used by hugetlb

- Drop the "recommend" keyword patch (#6); For now, users will need to
understand implicitly which sizes are beneficial to their HW perf

- Drop patch (#7); arch_wants_pte_order() is no longer needed due to dropping
patch #6

- Add patch for khugepaged selftest improvement (described in previous email
above).

- Ensure that PMD_ORDER is not assumed to be compile-time constant (current
code is broken on powerpc)

Please shout if you think this is the wrong approach.

On the prerequisites front, we have 2 items still to land:

- compaction; Zi Yan is working on a v2

- numa balancing; A developer has signed up and is working on it (I'll leave
them to reveal themself as preferred).

Thanks,
Ryan

2023-10-25 18:48:48

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On 25.10.23 18:24, Ryan Roberts wrote:
> On 20/10/2023 13:33, Ryan Roberts wrote:
>> On 06/10/2023 21:06, David Hildenbrand wrote:
>>> On 29.09.23 13:44, Ryan Roberts wrote:
>>>> Hi All,
>>>
>>
>> [...]
>>
>>>> NOTE: These changes should not be merged until the prerequisites are complete.
>>>> These are in progress and tracked at [7].
>>>
>>> We should probably list them here, and classify which one we see as strict a
>>> requirement, which ones might be an optimization.
>>>
>>
>> Bringing back the discussion of prerequistes to this thread following the
>> discussion at the mm-alignment meeting on Wednesday.
>>
>> Slides, updated following discussion to reflect all the agreed items that are
>> prerequisites and enhancements, are at [1].
>>
>> I've taken a closer look at the situation with khugepaged, and can confirm that
>> it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice
>> though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when
>> small-sized THP is enabled+always. So I've fixed that test up and will add the
>> patch to the next version of my series.
>>
>> So I believe the khugepaged prerequisite can be marked as done.
>>
>> [1]
>> https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA
>
> Hi All,

Hi,

I wanted to remind people in the THP cabal meeting, but that either
didn't happen or zoomed decided to not let me join :)

>
> It's been a week since the mm alignment meeting discussion we had around
> prerequisites and the ABI. I haven't heard any further feedback on the ABI
> proposal, so I'm going to be optimistic and assume that nobody has found any
> fatal flaws in it :).

After saying in the call probably 10 times that people should comment
here if there are reasonable alternatives worth discussing, call me
"optimistic" as well; but, it's only been a week and people might still
be thinking about this/

There were two things discussed in the call:

* Yu brought up "lists" so we can have priorities. As briefly discussed
in the call, this (a) might not be needed right now in an initial
version; (b) the kernel might be able to handle that (or many cases)
automatically, TBD. Adding lists now would kind-of set the semantics
of that interface in stone. As you describe below, the approach
discussed here could easily be extended to cover priorities, if need
be.

* Hugh raised the point that "bitmap of orders" could be replaced by
"added THP sizes", which really is "bitmap of orders" shifted to the
left. To configure 2 MiB + 64Kib, one would get "2097152 + 65536" =
"2162688" or in KiB "2112". Hm.

Both approaches would require single-option files like "enable_always",
"enable_madvise" ... which I don't particularly like, but who am I to judge.


>
> Certainly, I think it held up to the potential future policies that Yu Zhou
> cited on the call - the possibility of preferring a smaller size over a bigger
> one, if the smaller size can be allocated without splitting a contiguous block.
> I think the suggestion of adding a per-size priority file would solve it. And in
> general because we have a per-size directory, that gives us lots of flexibility
> for growth.

Jup, same opinion here. But again, I'm very happy to hear other
alternatives and why they are better.

>
> Anyway, given the lack of feedback, I'm proposing to spin a new version. I'm
> planning to do the following:
>
> - Drop the accounting patch (#3); we will continue to only account PMD-sized
> THP for now. We can add more counters in future if needed. page cache large
> folios haven't needed any new counters yet.
>
> - Pivot to the ABI proposed by DavidH; per-size directories in a similar shape
> to that used by hugetlb
>
> - Drop the "recommend" keyword patch (#6); For now, users will need to
> understand implicitly which sizes are beneficial to their HW perf
>
> - Drop patch (#7); arch_wants_pte_order() is no longer needed due to dropping
> patch #6
>
> - Add patch for khugepaged selftest improvement (described in previous email
> above).
>
> - Ensure that PMD_ORDER is not assumed to be compile-time constant (current
> code is broken on powerpc)
>
> Please shout if you think this is the wrong approach.

I'll shout that this sounds good to me; rather wait a bit more for more
opinions. It probably makes sense to post something after the (upcoming)
merge window, if there are no further discussions here.

--
Cheers,

David / dhildenb

2023-10-25 19:11:04

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On 10/25/23 09:24, Ryan Roberts wrote:
> On the prerequisites front, we have 2 items still to land:
>
> - compaction; Zi Yan is working on a v2
>
> - numa balancing; A developer has signed up and is working on it (I'll leave
> them to reveal themself as preferred).
>

Oh yes, that's me, thanks for providing an easy place to reply-to, for that.


thanks,
--
John Hubbard
NVIDIA

2023-10-25 19:12:32

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On Wed, Oct 25, 2023 at 12:47 PM David Hildenbrand <[email protected]> wrote:
>
> On 25.10.23 18:24, Ryan Roberts wrote:
> > On 20/10/2023 13:33, Ryan Roberts wrote:
> >> On 06/10/2023 21:06, David Hildenbrand wrote:
> >>> On 29.09.23 13:44, Ryan Roberts wrote:
> >>>> Hi All,
> >>>
> >>
> >> [...]
> >>
> >>>> NOTE: These changes should not be merged until the prerequisites are complete.
> >>>> These are in progress and tracked at [7].
> >>>
> >>> We should probably list them here, and classify which one we see as strict a
> >>> requirement, which ones might be an optimization.
> >>>
> >>
> >> Bringing back the discussion of prerequistes to this thread following the
> >> discussion at the mm-alignment meeting on Wednesday.
> >>
> >> Slides, updated following discussion to reflect all the agreed items that are
> >> prerequisites and enhancements, are at [1].
> >>
> >> I've taken a closer look at the situation with khugepaged, and can confirm that
> >> it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice
> >> though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when
> >> small-sized THP is enabled+always. So I've fixed that test up and will add the
> >> patch to the next version of my series.
> >>
> >> So I believe the khugepaged prerequisite can be marked as done.
> >>
> >> [1]
> >> https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA
> >
> > Hi All,
>
> Hi,
>
> I wanted to remind people in the THP cabal meeting, but that either
> didn't happen or zoomed decided to not let me join :)
>
> >
> > It's been a week since the mm alignment meeting discussion we had around
> > prerequisites and the ABI. I haven't heard any further feedback on the ABI
> > proposal, so I'm going to be optimistic and assume that nobody has found any
> > fatal flaws in it :).
>
> After saying in the call probably 10 times that people should comment
> here if there are reasonable alternatives worth discussing, call me
> "optimistic" as well; but, it's only been a week and people might still
> be thinking about this/
>
> There were two things discussed in the call:
>
> * Yu brought up "lists" so we can have priorities. As briefly discussed
> in the call, this (a) might not be needed right now in an initial
> version; (b) the kernel might be able to handle that (or many cases)
> automatically, TBD. Adding lists now would kind-of set the semantics
> of that interface in stone. As you describe below, the approach
> discussed here could easily be extended to cover priorities, if need
> be.

I want to expand on this: the argument that "if you could allocate a
higher order you should use it" is too simplistic. There are many
reasons in addition to the one above that we want to "fall back" to
higher orders, e.g., those higher orders are not on PCP or from the
local node. When we consider the sequence of orders to try, user
preference is just one of the parameters to the cost function. The
bottom line is that I think we should all agree that there needs to be
a cost function down the road, whatever it looks like. Otherwise I
don't know how we can make "auto" happen.

> * Hugh raised the point that "bitmap of orders" could be replaced by
> "added THP sizes", which really is "bitmap of orders" shifted to the
> left. To configure 2 MiB + 64Kib, one would get "2097152 + 65536" =
> "2162688" or in KiB "2112". Hm.

I'm not a big fan of the "bitmap of orders" approach, because it
doesn't address my concern above.

> Both approaches would require single-option files like "enable_always",
> "enable_madvise" ... which I don't particularly like, but who am I to judge.
>
>
> >
> > Certainly, I think it held up to the potential future policies that Yu Zhou
> > cited on the call - the possibility of preferring a smaller size over a bigger
> > one, if the smaller size can be allocated without splitting a contiguous block.
> > I think the suggestion of adding a per-size priority file would solve it. And in
> > general because we have a per-size directory, that gives us lots of flexibility
> > for growth.
>
> Jup, same opinion here. But again, I'm very happy to hear other
> alternatives and why they are better.

I'm not against David's proposal but I want to hear a lot more about
"lots of flexibility for growth" before I'm fully convinced. Why do I
need more convincing? When I brought up that we need to consider the
priority of each order and the potential need to fall back to higher
orders during the meeting, I got the impression people were surprised
why we want to fall back to higher orders. TBH, I was surprised too
that this possibility was never considered. I missed today's THP
meeting too but I'll join next time and if anyone has more ideas on
this, we can spend some time discussing it, especially on how LAF
should cooperate with the page allocator to make better decisions.

2023-10-26 09:53:50

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On 25/10/2023 20:11, Yu Zhao wrote:
> On Wed, Oct 25, 2023 at 12:47 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 25.10.23 18:24, Ryan Roberts wrote:
>>> On 20/10/2023 13:33, Ryan Roberts wrote:
>>>> On 06/10/2023 21:06, David Hildenbrand wrote:
>>>>> On 29.09.23 13:44, Ryan Roberts wrote:
>>>>>> Hi All,
>>>>>
>>>>
>>>> [...]
>>>>
>>>>>> NOTE: These changes should not be merged until the prerequisites are complete.
>>>>>> These are in progress and tracked at [7].
>>>>>
>>>>> We should probably list them here, and classify which one we see as strict a
>>>>> requirement, which ones might be an optimization.
>>>>>
>>>>
>>>> Bringing back the discussion of prerequistes to this thread following the
>>>> discussion at the mm-alignment meeting on Wednesday.
>>>>
>>>> Slides, updated following discussion to reflect all the agreed items that are
>>>> prerequisites and enhancements, are at [1].
>>>>
>>>> I've taken a closer look at the situation with khugepaged, and can confirm that
>>>> it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice
>>>> though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when
>>>> small-sized THP is enabled+always. So I've fixed that test up and will add the
>>>> patch to the next version of my series.
>>>>
>>>> So I believe the khugepaged prerequisite can be marked as done.
>>>>
>>>> [1]
>>>> https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA
>>>
>>> Hi All,
>>
>> Hi,
>>
>> I wanted to remind people in the THP cabal meeting, but that either
>> didn't happen or zoomed decided to not let me join :)

I didn't make it yesterday either - was having to juggle child care.

>>
>>>
>>> It's been a week since the mm alignment meeting discussion we had around
>>> prerequisites and the ABI. I haven't heard any further feedback on the ABI
>>> proposal, so I'm going to be optimistic and assume that nobody has found any
>>> fatal flaws in it :).
>>
>> After saying in the call probably 10 times that people should comment
>> here if there are reasonable alternatives worth discussing, call me
>> "optimistic" as well; but, it's only been a week and people might still
>> be thinking about this/
>>
>> There were two things discussed in the call:
>>
>> * Yu brought up "lists" so we can have priorities. As briefly discussed
>> in the call, this (a) might not be needed right now in an initial
>> version; (b) the kernel might be able to handle that (or many cases)
>> automatically, TBD. Adding lists now would kind-of set the semantics
>> of that interface in stone. As you describe below, the approach
>> discussed here could easily be extended to cover priorities, if need
>> be.
>
> I want to expand on this: the argument that "if you could allocate a
> higher order you should use it" is too simplistic. There are many
> reasons in addition to the one above that we want to "fall back" to
> higher orders, e.g., those higher orders are not on PCP or from the
> local node. When we consider the sequence of orders to try, user
> preference is just one of the parameters to the cost function. The
> bottom line is that I think we should all agree that there needs to be
> a cost function down the road, whatever it looks like. Otherwise I
> don't know how we can make "auto" happen.

I don't dispute that this sounds like it could be beneficial, but I see it as
research to happen further down the road (as you say), and we don't know what
that research might conclude. Also, I think the scope of this is bigger than
anonymous memory - you would also likely want to look at the policy for page
cache folio order too, since today that's based solely on readahead. So I see it
as an optimization that is somewhat orthogonal to small-sized THP.

The proposed interface does not imply any preference order - it only states
which sizes the user wants the kernel to select from, so I think there is lots
of freedom to change this down the track if the kernel wants to start using the
buddy allocator's state as a signal to make its decisions.

>
>> * Hugh raised the point that "bitmap of orders" could be replaced by
>> "added THP sizes", which really is "bitmap of orders" shifted to the
>> left. To configure 2 MiB + 64Kib, one would get "2097152 + 65536" =
>> "2162688" or in KiB "2112". Hm.
>
> I'm not a big fan of the "bitmap of orders" approach, because it
> doesn't address my concern above.
>
>> Both approaches would require single-option files like "enable_always",
>> "enable_madvise" ... which I don't particularly like, but who am I to judge.
>>
>>
>>>
>>> Certainly, I think it held up to the potential future policies that Yu Zhou
>>> cited on the call - the possibility of preferring a smaller size over a bigger
>>> one, if the smaller size can be allocated without splitting a contiguous block.
>>> I think the suggestion of adding a per-size priority file would solve it. And in
>>> general because we have a per-size directory, that gives us lots of flexibility
>>> for growth.
>>
>> Jup, same opinion here. But again, I'm very happy to hear other
>> alternatives and why they are better.
>
> I'm not against David's proposal but I want to hear a lot more about
> "lots of flexibility for growth" before I'm fully convinced.

My point was that in an abstract sense, there are properties a user may wish to
apply individually to a size, which is catered for by having a per-size
directory into which we can add more files if/when requirements for new per-size
properties arise. There are also properties that may be applied globally, for
which we have the top-level transparent_hugepage directory where properties can
be extended or added.

For your case around tighter integration with the buddy allocator, I could
imagine a per-size file allowing the user to specify if the kernel should allow
splitting a higher order to make a THP of that size (I'm not suggesting that's a
good idea, I'm just pointing out that this sort of thing is possible with the
interface). And we have discussed how the global enabled prpoerty could be
extended to support "auto" [1].

But perhaps what we really need are lots more ideas for future directions for
small-sized THP to allow us to evaluate this interface more widely.

> Why do I
> need more convincing? When I brought up that we need to consider the
> priority of each order and the potential need to fall back to higher
> orders during the meeting, I got the impression people were surprised
> why we want to fall back to higher orders. TBH, I was surprised too
> that this possibility was never considered. I missed today's THP
> meeting too but I'll join next time and if anyone has more ideas on
> this, we can spend some time discussing it, especially on how LAF
> should cooperate with the page allocator to make better decisions.

[1]
https://lore.kernel.org/linux-mm/[email protected]/

Thanks,
Ryan

2023-10-26 15:20:17

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

[...]

>>> Hi,
>>>
>>> I wanted to remind people in the THP cabal meeting, but that either
>>> didn't happen or zoomed decided to not let me join :)
>
> I didn't make it yesterday either - was having to juggle child care.

I think it didn't happen, or started quite late (>20 min).

>
>>>
>>>>
>>>> It's been a week since the mm alignment meeting discussion we had around
>>>> prerequisites and the ABI. I haven't heard any further feedback on the ABI
>>>> proposal, so I'm going to be optimistic and assume that nobody has found any
>>>> fatal flaws in it :).
>>>
>>> After saying in the call probably 10 times that people should comment
>>> here if there are reasonable alternatives worth discussing, call me
>>> "optimistic" as well; but, it's only been a week and people might still
>>> be thinking about this/
>>>
>>> There were two things discussed in the call:
>>>
>>> * Yu brought up "lists" so we can have priorities. As briefly discussed
>>> in the call, this (a) might not be needed right now in an initial
>>> version; (b) the kernel might be able to handle that (or many cases)
>>> automatically, TBD. Adding lists now would kind-of set the semantics
>>> of that interface in stone. As you describe below, the approach
>>> discussed here could easily be extended to cover priorities, if need
>>> be.
>>
>> I want to expand on this: the argument that "if you could allocate a
>> higher order you should use it" is too simplistic. There are many
>> reasons in addition to the one above that we want to "fall back" to
>> higher orders, e.g., those higher orders are not on PCP or from the
>> local node. When we consider the sequence of orders to try, user
>> preference is just one of the parameters to the cost function. The
>> bottom line is that I think we should all agree that there needs to be
>> a cost function down the road, whatever it looks like. Otherwise I
>> don't know how we can make "auto" happen.

I agree that there needs to be a cost function, and as pagecache showed
that's independent of initial enablement.

>
> I don't dispute that this sounds like it could be beneficial, but I see it as
> research to happen further down the road (as you say), and we don't know what
> that research might conclude. Also, I think the scope of this is bigger than
> anonymous memory - you would also likely want to look at the policy for page
> cache folio order too, since today that's based solely on readahead. So I see it
> as an optimization that is somewhat orthogonal to small-sized THP.

Exactly my thoughts.

The important thing is that we should plan ahead that we still have the
option to let the admin configure if we cannot make this work
automatically in the kernel.

What we'll need, nobody knows. Maybe it's a per-size priority, maybe
it's a single global toggle.

>
> The proposed interface does not imply any preference order - it only states
> which sizes the user wants the kernel to select from, so I think there is lots
> of freedom to change this down the track if the kernel wants to start using the
> buddy allocator's state as a signal to make its decisions.

Yes.

[..]

>>> Jup, same opinion here. But again, I'm very happy to hear other
>>> alternatives and why they are better.
>>
>> I'm not against David's proposal but I want to hear a lot more about
>> "lots of flexibility for growth" before I'm fully convinced.
>
> My point was that in an abstract sense, there are properties a user may wish to
> apply individually to a size, which is catered for by having a per-size
> directory into which we can add more files if/when requirements for new per-size
> properties arise. There are also properties that may be applied globally, for
> which we have the top-level transparent_hugepage directory where properties can
> be extended or added.

Exactly, well said.

>
> For your case around tighter integration with the buddy allocator, I could
> imagine a per-size file allowing the user to specify if the kernel should allow
> splitting a higher order to make a THP of that size (I'm not suggesting that's a
> good idea, I'm just pointing out that this sort of thing is possible with the
> interface). And we have discussed how the global enabled prpoerty could be
> extended to support "auto" [1].
>
> But perhaps what we really need are lots more ideas for future directions for
> small-sized THP to allow us to evaluate this interface more widely.

David R. motivated a future size-aware setting of the defrag option. As
discussed we might want something similar to shmem_enable. What will
happen with khugepaged, nobody knows yet :)

I could imagine exposing per-size boolean read-only properties like
"native-hw-size" (PMD, cont-pte). But these things require much more
thought.

--
Cheers,

David / dhildenb

2023-10-27 12:28:07

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On 26/10/2023 16:19, David Hildenbrand wrote:
> [...]
>
>>>> Hi,
>>>>
>>>> I wanted to remind people in the THP cabal meeting, but that either
>>>> didn't happen or zoomed decided to not let me join :)
>>
>> I didn't make it yesterday either - was having to juggle child care.
>
> I think it didn't happen, or started quite late (>20 min).
>
>>
>>>>
>>>>>
>>>>> It's been a week since the mm alignment meeting discussion we had around
>>>>> prerequisites and the ABI. I haven't heard any further feedback on the ABI
>>>>> proposal, so I'm going to be optimistic and assume that nobody has found any
>>>>> fatal flaws in it :).
>>>>
>>>> After saying in the call probably 10 times that people should comment
>>>> here if there are reasonable alternatives worth discussing, call me
>>>> "optimistic" as well; but, it's only been a week and people might still
>>>> be thinking about this/
>>>>
>>>> There were two things discussed in the call:
>>>>
>>>> * Yu brought up "lists" so we can have priorities. As briefly discussed
>>>>     in the  call, this (a) might not be needed right now in an initial
>>>>     version;  (b) the kernel might be able to handle that (or many cases)
>>>>     automatically, TBD. Adding lists now would kind-of set the semantics
>>>>     of that interface in stone. As you describe below, the approach
>>>>     discussed here could easily be extended to cover priorities, if need
>>>>     be.
>>>
>>> I want to expand on this: the argument that "if you could allocate a
>>> higher order you should use it" is too simplistic. There are many
>>> reasons in addition to the one above that we want to "fall back" to
>>> higher orders, e.g., those higher orders are not on PCP or from the
>>> local node. When we consider the sequence of orders to try, user
>>> preference is just one of the parameters to the cost function. The
>>> bottom line is that I think we should all agree that there needs to be
>>> a cost function down the road, whatever it looks like. Otherwise I
>>> don't know how we can make "auto" happen.
>
> I agree that there needs to be a cost function, and as pagecache showed that's
> independent of initial enablement.
>
>>
>> I don't dispute that this sounds like it could be beneficial, but I see it as
>> research to happen further down the road (as you say), and we don't know what
>> that research might conclude. Also, I think the scope of this is bigger than
>> anonymous memory - you would also likely want to look at the policy for page
>> cache folio order too, since today that's based solely on readahead. So I see it
>> as an optimization that is somewhat orthogonal to small-sized THP.
>
> Exactly my thoughts.
>
> The important thing is that we should plan ahead that we still have the option
> to let the admin configure if we cannot make this work automatically in the kernel.
>
> What we'll need, nobody knows. Maybe it's a per-size priority, maybe it's a
> single global toggle.
>
>>
>> The proposed interface does not imply any preference order - it only states
>> which sizes the user wants the kernel to select from, so I think there is lots
>> of freedom to change this down the track if the kernel wants to start using the
>> buddy allocator's state as a signal to make its decisions.
>
> Yes.
>
> [..]
>
>>>> Jup, same opinion here. But again, I'm very happy to hear other
>>>> alternatives and why they are better.
>>>
>>> I'm not against David's proposal but I want to hear a lot more about
>>> "lots of flexibility for growth" before I'm fully convinced.
>>
>> My point was that in an abstract sense, there are properties a user may wish to
>> apply individually to a size, which is catered for by having a per-size
>> directory into which we can add more files if/when requirements for new per-size
>> properties arise. There are also properties that may be applied globally, for
>> which we have the top-level transparent_hugepage directory where properties can
>> be extended or added.
>
> Exactly, well said.
>
>>
>> For your case around tighter integration with the buddy allocator, I could
>> imagine a per-size file allowing the user to specify if the kernel should allow
>> splitting a higher order to make a THP of that size (I'm not suggesting that's a
>> good idea, I'm just pointing out that this sort of thing is possible with the
>> interface). And we have discussed how the global enabled prpoerty could be
>> extended to support "auto" [1].
>>
>> But perhaps what we really need are lots more ideas for future directions for
>> small-sized THP to allow us to evaluate this interface more widely.
>
> David R. motivated a future size-aware setting of the defrag option. As
> discussed we might want something similar to shmem_enable. What will happen with
> khugepaged, nobody knows yet :)
>
> I could imagine exposing per-size boolean read-only properties like
> "native-hw-size" (PMD, cont-pte). But these things require much more thought.

FWIW, the reason I opted for the "recommend" special case in the v5 posting was
because that felt like an easy thing to also add to the command line in future.
Having a separate file, native-hw-size, that the user has to read then enable
through another file is not very command-line friendly, if you want the
hw-preferred size(s) enabled from boot.

Maybe the wider observation is "how does the proposed interface translate to the
kernel command line if needed in future?".



2023-10-27 12:32:10

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On 27.10.23 14:27, Ryan Roberts wrote:
> On 26/10/2023 16:19, David Hildenbrand wrote:
>> [...]
>>
>>>>> Hi,
>>>>>
>>>>> I wanted to remind people in the THP cabal meeting, but that either
>>>>> didn't happen or zoomed decided to not let me join :)
>>>
>>> I didn't make it yesterday either - was having to juggle child care.
>>
>> I think it didn't happen, or started quite late (>20 min).
>>
>>>
>>>>>
>>>>>>
>>>>>> It's been a week since the mm alignment meeting discussion we had around
>>>>>> prerequisites and the ABI. I haven't heard any further feedback on the ABI
>>>>>> proposal, so I'm going to be optimistic and assume that nobody has found any
>>>>>> fatal flaws in it :).
>>>>>
>>>>> After saying in the call probably 10 times that people should comment
>>>>> here if there are reasonable alternatives worth discussing, call me
>>>>> "optimistic" as well; but, it's only been a week and people might still
>>>>> be thinking about this/
>>>>>
>>>>> There were two things discussed in the call:
>>>>>
>>>>> * Yu brought up "lists" so we can have priorities. As briefly discussed
>>>>>     in the  call, this (a) might not be needed right now in an initial
>>>>>     version;  (b) the kernel might be able to handle that (or many cases)
>>>>>     automatically, TBD. Adding lists now would kind-of set the semantics
>>>>>     of that interface in stone. As you describe below, the approach
>>>>>     discussed here could easily be extended to cover priorities, if need
>>>>>     be.
>>>>
>>>> I want to expand on this: the argument that "if you could allocate a
>>>> higher order you should use it" is too simplistic. There are many
>>>> reasons in addition to the one above that we want to "fall back" to
>>>> higher orders, e.g., those higher orders are not on PCP or from the
>>>> local node. When we consider the sequence of orders to try, user
>>>> preference is just one of the parameters to the cost function. The
>>>> bottom line is that I think we should all agree that there needs to be
>>>> a cost function down the road, whatever it looks like. Otherwise I
>>>> don't know how we can make "auto" happen.
>>
>> I agree that there needs to be a cost function, and as pagecache showed that's
>> independent of initial enablement.
>>
>>>
>>> I don't dispute that this sounds like it could be beneficial, but I see it as
>>> research to happen further down the road (as you say), and we don't know what
>>> that research might conclude. Also, I think the scope of this is bigger than
>>> anonymous memory - you would also likely want to look at the policy for page
>>> cache folio order too, since today that's based solely on readahead. So I see it
>>> as an optimization that is somewhat orthogonal to small-sized THP.
>>
>> Exactly my thoughts.
>>
>> The important thing is that we should plan ahead that we still have the option
>> to let the admin configure if we cannot make this work automatically in the kernel.
>>
>> What we'll need, nobody knows. Maybe it's a per-size priority, maybe it's a
>> single global toggle.
>>
>>>
>>> The proposed interface does not imply any preference order - it only states
>>> which sizes the user wants the kernel to select from, so I think there is lots
>>> of freedom to change this down the track if the kernel wants to start using the
>>> buddy allocator's state as a signal to make its decisions.
>>
>> Yes.
>>
>> [..]
>>
>>>>> Jup, same opinion here. But again, I'm very happy to hear other
>>>>> alternatives and why they are better.
>>>>
>>>> I'm not against David's proposal but I want to hear a lot more about
>>>> "lots of flexibility for growth" before I'm fully convinced.
>>>
>>> My point was that in an abstract sense, there are properties a user may wish to
>>> apply individually to a size, which is catered for by having a per-size
>>> directory into which we can add more files if/when requirements for new per-size
>>> properties arise. There are also properties that may be applied globally, for
>>> which we have the top-level transparent_hugepage directory where properties can
>>> be extended or added.
>>
>> Exactly, well said.
>>
>>>
>>> For your case around tighter integration with the buddy allocator, I could
>>> imagine a per-size file allowing the user to specify if the kernel should allow
>>> splitting a higher order to make a THP of that size (I'm not suggesting that's a
>>> good idea, I'm just pointing out that this sort of thing is possible with the
>>> interface). And we have discussed how the global enabled prpoerty could be
>>> extended to support "auto" [1].
>>>
>>> But perhaps what we really need are lots more ideas for future directions for
>>> small-sized THP to allow us to evaluate this interface more widely.
>>
>> David R. motivated a future size-aware setting of the defrag option. As
>> discussed we might want something similar to shmem_enable. What will happen with
>> khugepaged, nobody knows yet :)
>>
>> I could imagine exposing per-size boolean read-only properties like
>> "native-hw-size" (PMD, cont-pte). But these things require much more thought.
>
> FWIW, the reason I opted for the "recommend" special case in the v5 posting was
> because that felt like an easy thing to also add to the command line in future.
> Having a separate file, native-hw-size, that the user has to read then enable
> through another file is not very command-line friendly, if you want the
> hw-preferred size(s) enabled from boot.

Jup. I strongly suspect distributions will just have their setup script
to handle such things, though.

>
> Maybe the wider observation is "how does the proposed interface translate to the
> kernel command line if needed in future?".

I guess in the distant future, "auto" is what we want.

--
Cheers,

David / dhildenb

2023-10-27 12:48:14

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On 27/10/2023 13:29, David Hildenbrand wrote:
> On 27.10.23 14:27, Ryan Roberts wrote:
>> On 26/10/2023 16:19, David Hildenbrand wrote:
>>> [...]
>>>
>>>>>> Hi,
>>>>>>
>>>>>> I wanted to remind people in the THP cabal meeting, but that either
>>>>>> didn't happen or zoomed decided to not let me join :)
>>>>
>>>> I didn't make it yesterday either - was having to juggle child care.
>>>
>>> I think it didn't happen, or started quite late (>20 min).
>>>
>>>>
>>>>>>
>>>>>>>
>>>>>>> It's been a week since the mm alignment meeting discussion we had around
>>>>>>> prerequisites and the ABI. I haven't heard any further feedback on the ABI
>>>>>>> proposal, so I'm going to be optimistic and assume that nobody has found any
>>>>>>> fatal flaws in it :).
>>>>>>
>>>>>> After saying in the call probably 10 times that people should comment
>>>>>> here if there are reasonable alternatives worth discussing, call me
>>>>>> "optimistic" as well; but, it's only been a week and people might still
>>>>>> be thinking about this/
>>>>>>
>>>>>> There were two things discussed in the call:
>>>>>>
>>>>>> * Yu brought up "lists" so we can have priorities. As briefly discussed
>>>>>>      in the  call, this (a) might not be needed right now in an initial
>>>>>>      version;  (b) the kernel might be able to handle that (or many cases)
>>>>>>      automatically, TBD. Adding lists now would kind-of set the semantics
>>>>>>      of that interface in stone. As you describe below, the approach
>>>>>>      discussed here could easily be extended to cover priorities, if need
>>>>>>      be.
>>>>>
>>>>> I want to expand on this: the argument that "if you could allocate a
>>>>> higher order you should use it" is too simplistic. There are many
>>>>> reasons in addition to the one above that we want to "fall back" to
>>>>> higher orders, e.g., those higher orders are not on PCP or from the
>>>>> local node. When we consider the sequence of orders to try, user
>>>>> preference is just one of the parameters to the cost function. The
>>>>> bottom line is that I think we should all agree that there needs to be
>>>>> a cost function down the road, whatever it looks like. Otherwise I
>>>>> don't know how we can make "auto" happen.
>>>
>>> I agree that there needs to be a cost function, and as pagecache showed that's
>>> independent of initial enablement.
>>>
>>>>
>>>> I don't dispute that this sounds like it could be beneficial, but I see it as
>>>> research to happen further down the road (as you say), and we don't know what
>>>> that research might conclude. Also, I think the scope of this is bigger than
>>>> anonymous memory - you would also likely want to look at the policy for page
>>>> cache folio order too, since today that's based solely on readahead. So I
>>>> see it
>>>> as an optimization that is somewhat orthogonal to small-sized THP.
>>>
>>> Exactly my thoughts.
>>>
>>> The important thing is that we should plan ahead that we still have the option
>>> to let the admin configure if we cannot make this work automatically in the
>>> kernel.
>>>
>>> What we'll need, nobody knows. Maybe it's a per-size priority, maybe it's a
>>> single global toggle.
>>>
>>>>
>>>> The proposed interface does not imply any preference order - it only states
>>>> which sizes the user wants the kernel to select from, so I think there is lots
>>>> of freedom to change this down the track if the kernel wants to start using the
>>>> buddy allocator's state as a signal to make its decisions.
>>>
>>> Yes.
>>>
>>> [..]
>>>
>>>>>> Jup, same opinion here. But again, I'm very happy to hear other
>>>>>> alternatives and why they are better.
>>>>>
>>>>> I'm not against David's proposal but I want to hear a lot more about
>>>>> "lots of flexibility for growth" before I'm fully convinced.
>>>>
>>>> My point was that in an abstract sense, there are properties a user may wish to
>>>> apply individually to a size, which is catered for by having a per-size
>>>> directory into which we can add more files if/when requirements for new
>>>> per-size
>>>> properties arise. There are also properties that may be applied globally, for
>>>> which we have the top-level transparent_hugepage directory where properties can
>>>> be extended or added.
>>>
>>> Exactly, well said.
>>>
>>>>
>>>> For your case around tighter integration with the buddy allocator, I could
>>>> imagine a per-size file allowing the user to specify if the kernel should allow
>>>> splitting a higher order to make a THP of that size (I'm not suggesting
>>>> that's a
>>>> good idea, I'm just pointing out that this sort of thing is possible with the
>>>> interface). And we have discussed how the global enabled prpoerty could be
>>>> extended to support "auto" [1].
>>>>
>>>> But perhaps what we really need are lots more ideas for future directions for
>>>> small-sized THP to allow us to evaluate this interface more widely.
>>>
>>> David R. motivated a future size-aware setting of the defrag option. As
>>> discussed we might want something similar to shmem_enable. What will happen with
>>> khugepaged, nobody knows yet :)
>>>
>>> I could imagine exposing per-size boolean read-only properties like
>>> "native-hw-size" (PMD, cont-pte). But these things require much more thought.
>>
>> FWIW, the reason I opted for the "recommend" special case in the v5 posting was
>> because that felt like an easy thing to also add to the command line in future.
>> Having a separate file, native-hw-size, that the user has to read then enable
>> through another file is not very command-line friendly, if you want the
>> hw-preferred size(s) enabled from boot.
>
> Jup. I strongly suspect distributions will just have their setup script to
> handle such things, though.

OK fair enough.

>
>>
>> Maybe the wider observation is "how does the proposed interface translate to the
>> kernel command line if needed in future?".
>
> I guess in the distant future, "auto" is what we want.

Looks like hugetlb solves this with a magic tuple, where hugepagesz sets the
"directory" for the following properties. So if we did need to support per-size
properties on the command line, we have president to follow:

hugepagesz=X hugepages=Y

>

2023-10-31 11:51:00

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On 06/10/2023 21:06, David Hildenbrand wrote:
[...]
>
> Change 2: sysfs interface.
>
> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
> agree.
>
> What we expose there and how, is TBD. Again, not a friend of "orders" and
> bitmaps at all. We can do better if we want to go down that path.
>
> Maybe we should take a look at hugetlb, and how they added support for multiple
> sizes. What *might* make sense could be (depending on which values we actually
> support!)
>
>
> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>
> Each one would contain an "enabled" and "defrag" file. We want something minimal
> first? Start with the "enabled" option.
>
>
> enabled: always [global] madvise never
>
> Initially, we would set it for PMD-sized THP to "global" and for everything else
> to "never".

Hi David,

I've just started coding this, and it occurs to me that I might need a small
clarification here; the existing global "enabled" control is used to drive
decisions for both anonymous memory and (non-shmem) file-backed memory. But the
proposed new per-size "enabled" is implicitly only controlling anon memory (for
now).

1) Is this potentially confusing for the user? Should we rename the per-size
controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
we can reuse the same control for file-backed memory in future?

2) The global control will continue to drive the file-backed memory decision
(for now), even when hugepages-2048kB/enabled != "global"; agreed?

Thanks,
Ryan

2023-10-31 11:57:17

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On 31/10/2023 11:50, Ryan Roberts wrote:
> On 06/10/2023 21:06, David Hildenbrand wrote:
> [...]
>>
>> Change 2: sysfs interface.
>>
>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>> agree.
>>
>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>> bitmaps at all. We can do better if we want to go down that path.
>>
>> Maybe we should take a look at hugetlb, and how they added support for multiple
>> sizes. What *might* make sense could be (depending on which values we actually
>> support!)
>>
>>
>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>
>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>> first? Start with the "enabled" option.
>>
>>
>> enabled: always [global] madvise never
>>
>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>> to "never".
>
> Hi David,
>
> I've just started coding this, and it occurs to me that I might need a small
> clarification here; the existing global "enabled" control is used to drive
> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
> proposed new per-size "enabled" is implicitly only controlling anon memory (for
> now).
>
> 1) Is this potentially confusing for the user? Should we rename the per-size
> controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
> we can reuse the same control for file-backed memory in future?
>
> 2) The global control will continue to drive the file-backed memory decision
> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
>
> Thanks,
> Ryan
>

Also, an implementation question:

hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
(although it does honour MADV_NOHUGEPAGE and the prctl); It will return true
regardless. Is that by design? It couldn't fathom any reasoning from the commit log:

bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
bool smaps, bool in_pf, bool enforce_sysfs)
{
if (!vma->vm_mm) /* vdso */
return false;

/*
* Explicitly disabled through madvise or prctl, or some
* architectures may disable THP for some mappings, for
* example, s390 kvm.
* */
if ((vm_flags & VM_NOHUGEPAGE) ||
test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
return false;
/*
* If the hardware/firmware marked hugepage support disabled.
*/
if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
return false;

/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
if (vma_is_dax(vma))
return in_pf; <<<<<<<<

...
}


2023-10-31 12:05:22

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On 31.10.23 12:55, Ryan Roberts wrote:
> On 31/10/2023 11:50, Ryan Roberts wrote:
>> On 06/10/2023 21:06, David Hildenbrand wrote:
>> [...]
>>>
>>> Change 2: sysfs interface.
>>>
>>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>>> agree.
>>>
>>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>>> bitmaps at all. We can do better if we want to go down that path.
>>>
>>> Maybe we should take a look at hugetlb, and how they added support for multiple
>>> sizes. What *might* make sense could be (depending on which values we actually
>>> support!)
>>>
>>>
>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>>
>>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>>> first? Start with the "enabled" option.
>>>
>>>
>>> enabled: always [global] madvise never
>>>
>>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>>> to "never".
>>
>> Hi David,
>>
>> I've just started coding this, and it occurs to me that I might need a small
>> clarification here; the existing global "enabled" control is used to drive
>> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
>> proposed new per-size "enabled" is implicitly only controlling anon memory (for
>> now).
>>
>> 1) Is this potentially confusing for the user? Should we rename the per-size
>> controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
>> we can reuse the same control for file-backed memory in future?
>>
>> 2) The global control will continue to drive the file-backed memory decision
>> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
>>
>> Thanks,
>> Ryan
>>
>
> Also, an implementation question:
>
> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
> (although it does honour MADV_NOHUGEPAGE and the prctl); It will return true
> regardless. Is that by design? It couldn't fathom any reasoning from the commit log:

The whole DAX "hugepage" and THP mixup is just plain confusing. We're
simply using PUD/PMD mappings of DAX memory, and PMD/PTE- remap when
required (VMA split I assume, COW).

It doesn't result in any memory waste, so who really cares how it's
mapped? Apparently we want individual processes to just disable PMD/PUD
mappings of DAX using the prctl and madvise. Maybe there are good reasons.

Looks like a design decision, probably some legacy leftovers.

--
Cheers,

David / dhildenb

2023-10-31 12:08:27

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On 31.10.23 12:50, Ryan Roberts wrote:
> On 06/10/2023 21:06, David Hildenbrand wrote:
> [...]
>>
>> Change 2: sysfs interface.
>>
>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>> agree.
>>
>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>> bitmaps at all. We can do better if we want to go down that path.
>>
>> Maybe we should take a look at hugetlb, and how they added support for multiple
>> sizes. What *might* make sense could be (depending on which values we actually
>> support!)
>>
>>
>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>
>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>> first? Start with the "enabled" option.
>>
>>
>> enabled: always [global] madvise never
>>
>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>> to "never".
>
> Hi David,

Hi!

>
> I've just started coding this, and it occurs to me that I might need a small
> clarification here; the existing global "enabled" control is used to drive
> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
> proposed new per-size "enabled" is implicitly only controlling anon memory (for
> now).

Anon was (way) first, and pagecache later decided to reuse that one as
an indication whether larger folios are desired.

For the pagecache, it's just a way to enable/disable it globally. As
there is no memory waste, nobody currently really cares about the exact
sized the pagecache is allocating (maybe that will change at some point,
maybe not, who knows).

>
> 1) Is this potentially confusing for the user? Should we rename the per-size
> controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
> we can reuse the same control for file-backed memory in future?

The latter would be my take. Just like we did with the global toggle.

>
> 2) The global control will continue to drive the file-backed memory decision
> (for now), even when hugepages-2048kB/enabled != "global"; agreed?

That would be my take; it will allocate other sizes already, so just
glue it to the global toggle and document for the other toggles that
they only control anonymous THP for now.

--
Cheers,

David / dhildenb

2023-10-31 13:12:30

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On 31/10/2023 11:58, David Hildenbrand wrote:
> On 31.10.23 12:50, Ryan Roberts wrote:
>> On 06/10/2023 21:06, David Hildenbrand wrote:
>> [...]
>>>
>>> Change 2: sysfs interface.
>>>
>>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>>> agree.
>>>
>>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>>> bitmaps at all. We can do better if we want to go down that path.
>>>
>>> Maybe we should take a look at hugetlb, and how they added support for multiple
>>> sizes. What *might* make sense could be (depending on which values we actually
>>> support!)
>>>
>>>
>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>>
>>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>>> first? Start with the "enabled" option.
>>>
>>>
>>> enabled: always [global] madvise never
>>>
>>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>>> to "never".
>>
>> Hi David,
>
> Hi!
>
>>
>> I've just started coding this, and it occurs to me that I might need a small
>> clarification here; the existing global "enabled" control is used to drive
>> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
>> proposed new per-size "enabled" is implicitly only controlling anon memory (for
>> now).
>
> Anon was (way) first, and pagecache later decided to reuse that one as an
> indication whether larger folios are desired.
>
> For the pagecache, it's just a way to enable/disable it globally. As there is no
> memory waste, nobody currently really cares about the exact sized the pagecache
> is allocating (maybe that will change at some point, maybe not, who knows).

Yup. Its not _just_ about allocation though; its also about collapse
(MADV_COLLAPSE, khugepaged) which is supported for pagecache pages. I can
imagine value in collapsing to various sizes that are beneficial for HW...
anyway that's for another day.

>
>>
>> 1) Is this potentially confusing for the user? Should we rename the per-size
>> controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
>> we can reuse the same control for file-backed memory in future?
>
> The latter would be my take. Just like we did with the global toggle.

ACK

>
>>
>> 2) The global control will continue to drive the file-backed memory decision
>> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
>
> That would be my take; it will allocate other sizes already, so just glue it to
> the global toggle and document for the other toggles that they only control
> anonymous THP for now.

ACK

>

2023-10-31 13:13:26

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On 31/10/2023 12:03, David Hildenbrand wrote:
> On 31.10.23 12:55, Ryan Roberts wrote:
>> On 31/10/2023 11:50, Ryan Roberts wrote:
>>> On 06/10/2023 21:06, David Hildenbrand wrote:
>>> [...]
>>>>
>>>> Change 2: sysfs interface.
>>>>
>>>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>>>> agree.
>>>>
>>>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>>>> bitmaps at all. We can do better if we want to go down that path.
>>>>
>>>> Maybe we should take a look at hugetlb, and how they added support for multiple
>>>> sizes. What *might* make sense could be (depending on which values we actually
>>>> support!)
>>>>
>>>>
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>>>
>>>> Each one would contain an "enabled" and "defrag" file. We want something
>>>> minimal
>>>> first? Start with the "enabled" option.
>>>>
>>>>
>>>> enabled: always [global] madvise never
>>>>
>>>> Initially, we would set it for PMD-sized THP to "global" and for everything
>>>> else
>>>> to "never".
>>>
>>> Hi David,
>>>
>>> I've just started coding this, and it occurs to me that I might need a small
>>> clarification here; the existing global "enabled" control is used to drive
>>> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
>>> proposed new per-size "enabled" is implicitly only controlling anon memory (for
>>> now).
>>>
>>> 1) Is this potentially confusing for the user? Should we rename the per-size
>>> controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
>>> we can reuse the same control for file-backed memory in future?
>>>
>>> 2) The global control will continue to drive the file-backed memory decision
>>> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
>>>
>>> Thanks,
>>> Ryan
>>>
>>
>> Also, an implementation question:
>>
>> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
>> (although it does honour MADV_NOHUGEPAGE and the prctl); It will return true
>> regardless. Is that by design? It couldn't fathom any reasoning from the
>> commit log:
>
> The whole DAX "hugepage" and THP mixup is just plain confusing. We're simply
> using PUD/PMD mappings of DAX memory, and PMD/PTE- remap when required (VMA
> split I assume, COW).
>
> It doesn't result in any memory waste, so who really cares how it's mapped?
> Apparently we want individual processes to just disable PMD/PUD mappings of DAX
> using the prctl and madvise. Maybe there are good reasons.
>
> Looks like a design decision, probably some legacy leftovers.

OK, I'll ensure I keep this behaviour.

Thanks!

>

2023-10-31 18:30:00

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On Tue, Oct 31, 2023 at 4:55 AM Ryan Roberts <[email protected]> wrote:
>
> On 31/10/2023 11:50, Ryan Roberts wrote:
> > On 06/10/2023 21:06, David Hildenbrand wrote:
> > [...]
> >>
> >> Change 2: sysfs interface.
> >>
> >> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
> >> agree.
> >>
> >> What we expose there and how, is TBD. Again, not a friend of "orders" and
> >> bitmaps at all. We can do better if we want to go down that path.
> >>
> >> Maybe we should take a look at hugetlb, and how they added support for multiple
> >> sizes. What *might* make sense could be (depending on which values we actually
> >> support!)
> >>
> >>
> >> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
> >> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
> >> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
> >> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
> >> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
> >> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
> >>
> >> Each one would contain an "enabled" and "defrag" file. We want something minimal
> >> first? Start with the "enabled" option.
> >>
> >>
> >> enabled: always [global] madvise never
> >>
> >> Initially, we would set it for PMD-sized THP to "global" and for everything else
> >> to "never".
> >
> > Hi David,
> >
> > I've just started coding this, and it occurs to me that I might need a small
> > clarification here; the existing global "enabled" control is used to drive
> > decisions for both anonymous memory and (non-shmem) file-backed memory. But the
> > proposed new per-size "enabled" is implicitly only controlling anon memory (for
> > now).
> >
> > 1) Is this potentially confusing for the user? Should we rename the per-size
> > controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
> > we can reuse the same control for file-backed memory in future?
> >
> > 2) The global control will continue to drive the file-backed memory decision
> > (for now), even when hugepages-2048kB/enabled != "global"; agreed?
> >
> > Thanks,
> > Ryan
> >
>
> Also, an implementation question:
>
> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
> (although it does honour MADV_NOHUGEPAGE and the prctl); It will return true
> regardless. Is that by design? It couldn't fathom any reasoning from the commit log:

The enabled="never" is for anonymous VMAs, DAX VMAs are typically file VMAs.

>
> bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
> bool smaps, bool in_pf, bool enforce_sysfs)
> {
> if (!vma->vm_mm) /* vdso */
> return false;
>
> /*
> * Explicitly disabled through madvise or prctl, or some
> * architectures may disable THP for some mappings, for
> * example, s390 kvm.
> * */
> if ((vm_flags & VM_NOHUGEPAGE) ||
> test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> return false;
> /*
> * If the hardware/firmware marked hugepage support disabled.
> */
> if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
> return false;
>
> /* khugepaged doesn't collapse DAX vma, but page fault is fine. */
> if (vma_is_dax(vma))
> return in_pf; <<<<<<<<
>
> ...
> }
>
>

2023-11-01 14:03:01

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On 31/10/2023 18:29, Yang Shi wrote:
> On Tue, Oct 31, 2023 at 4:55 AM Ryan Roberts <[email protected]> wrote:
>>
>> On 31/10/2023 11:50, Ryan Roberts wrote:
>>> On 06/10/2023 21:06, David Hildenbrand wrote:
>>> [...]
>>>>
>>>> Change 2: sysfs interface.
>>>>
>>>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>>>> agree.
>>>>
>>>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>>>> bitmaps at all. We can do better if we want to go down that path.
>>>>
>>>> Maybe we should take a look at hugetlb, and how they added support for multiple
>>>> sizes. What *might* make sense could be (depending on which values we actually
>>>> support!)
>>>>
>>>>
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>>>
>>>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>>>> first? Start with the "enabled" option.
>>>>
>>>>
>>>> enabled: always [global] madvise never
>>>>
>>>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>>>> to "never".
>>>
>>> Hi David,
>>>
>>> I've just started coding this, and it occurs to me that I might need a small
>>> clarification here; the existing global "enabled" control is used to drive
>>> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
>>> proposed new per-size "enabled" is implicitly only controlling anon memory (for
>>> now).
>>>
>>> 1) Is this potentially confusing for the user? Should we rename the per-size
>>> controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
>>> we can reuse the same control for file-backed memory in future?
>>>
>>> 2) The global control will continue to drive the file-backed memory decision
>>> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
>>>
>>> Thanks,
>>> Ryan
>>>
>>
>> Also, an implementation question:
>>
>> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
>> (although it does honour MADV_NOHUGEPAGE and the prctl); It will return true
>> regardless. Is that by design? It couldn't fathom any reasoning from the commit log:
>
> The enabled="never" is for anonymous VMAs, DAX VMAs are typically file VMAs.

That's not quite true; enabled="never" is honoured for non-DAX/non-shmem file
VMAs (for collapse via CONFIG_READ_ONLY_THP_FOR_FS and more recently for
anything that implements huge_fault() - see
7a81751fcdeb833acc858e59082688e3020bfe12).

2023-11-01 18:13:26

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On Wed, Nov 1, 2023 at 7:02 AM Ryan Roberts <[email protected]> wrote:
>
> On 31/10/2023 18:29, Yang Shi wrote:
> > On Tue, Oct 31, 2023 at 4:55 AM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 31/10/2023 11:50, Ryan Roberts wrote:
> >>> On 06/10/2023 21:06, David Hildenbrand wrote:
> >>> [...]
> >>>>
> >>>> Change 2: sysfs interface.
> >>>>
> >>>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
> >>>> agree.
> >>>>
> >>>> What we expose there and how, is TBD. Again, not a friend of "orders" and
> >>>> bitmaps at all. We can do better if we want to go down that path.
> >>>>
> >>>> Maybe we should take a look at hugetlb, and how they added support for multiple
> >>>> sizes. What *might* make sense could be (depending on which values we actually
> >>>> support!)
> >>>>
> >>>>
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
> >>>>
> >>>> Each one would contain an "enabled" and "defrag" file. We want something minimal
> >>>> first? Start with the "enabled" option.
> >>>>
> >>>>
> >>>> enabled: always [global] madvise never
> >>>>
> >>>> Initially, we would set it for PMD-sized THP to "global" and for everything else
> >>>> to "never".
> >>>
> >>> Hi David,
> >>>
> >>> I've just started coding this, and it occurs to me that I might need a small
> >>> clarification here; the existing global "enabled" control is used to drive
> >>> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
> >>> proposed new per-size "enabled" is implicitly only controlling anon memory (for
> >>> now).
> >>>
> >>> 1) Is this potentially confusing for the user? Should we rename the per-size
> >>> controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
> >>> we can reuse the same control for file-backed memory in future?
> >>>
> >>> 2) The global control will continue to drive the file-backed memory decision
> >>> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
> >>>
> >>> Thanks,
> >>> Ryan
> >>>
> >>
> >> Also, an implementation question:
> >>
> >> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
> >> (although it does honour MADV_NOHUGEPAGE and the prctl); It will return true
> >> regardless. Is that by design? It couldn't fathom any reasoning from the commit log:
> >
> > The enabled="never" is for anonymous VMAs, DAX VMAs are typically file VMAs.
>
> That's not quite true; enabled="never" is honoured for non-DAX/non-shmem file
> VMAs (for collapse via CONFIG_READ_ONLY_THP_FOR_FS and more recently for

When implementing READ_ONLY_THP_FOR_FS the file THP just can be
collapsed by khugepaged, but khugepaged is started iff enabled !=
"never". So READ_ONLY_THP_FOR_FS has to honor it. Unfortunately there
are some confusing exceptions... But anyway DAX is not the same class.

> anything that implements huge_fault() - see
> 7a81751fcdeb833acc858e59082688e3020bfe12).

IIUC this commit just gives the vmas which implement huge_fault() a
chance to handle the fault. Currently just DAX vmas implement
huge_fault() in vanilla kernel AFAICT.

>

2023-12-05 16:07:11

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

On Tue, Nov 14, 2023 at 10:57:07AM +0000, Ryan Roberts wrote:
> On 13/11/2023 15:04, Matthew Wilcox wrote:
> > On Mon, Nov 13, 2023 at 10:19:48AM +0000, Ryan Roberts wrote:
> >> On 13/11/2023 05:18, Matthew Wilcox wrote:
> >>> My hope is to abolish the 64kB page size configuration. ie instead of
> >>> using the mixture of page sizes that you currently are -- 64k and
> >>> 1M (right? Order-0, and order-4)
> >>
> >> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
> >> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
> >> intuitively you would expect the order to remain constant, but it doesn't).
> >>
> >> The "recommend" setting above will actually enable order-3 as well even though
> >> there is no HW benefit to this. So the full set of available memory sizes here is:
> >>
> >> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
> >>
> >>> , that 4k, 64k and 2MB (order-0,
> >>> order-4 and order-9) will provide better performance.
> >>>
> >>> Have you run any experiements with a 4kB page size?
> >>
> >> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
> >> to get to a world were we universally deal in variable sized chunks of memory,
> >> aligned on 4K boundaries.
> >>
> >> In my experience though, there are still some performance benefits to 64K base
> >> page vs 4K+contpte; the page tables are more cache efficient for the former case
> >> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
> >> latter. In practice the HW will still only read 8 bytes in the latter but that's
> >> taking up a full cache line vs the former where a single cache line stores 8x
> >> 64K entries.
> >
> > This is going to depend on your workload though -- if you're using more
> > 2MB than 64kB, you get to elide a layer of page table with 4k base,
> > rather than taking up 4 cache lines with a 64k base.
>
> True, but again depending on workload/config, you may have few levels of lookup
> for the 64K native case in the first place because you consume more VA bits at
> each level.

Sorry, missed this email ... let's work it through.

With 4k, and a 48-bit VA space, we get 12 bits at the lowest level, then
9 bits each layer, so 4 * 9 + 12 = 48. With a 2MB allocation, we
eliminate the bottom layer and examine three cachelines to find it (PGD
entry, PUD entry, PMD entry)

With 64k, we get 16 bits at the lowest level, then 13 bits each layer,
so 3 * 13 + 16 = 55. With a 2MB allocation, we can't eliminate the
bottom layer, so we still have to examine three cachelines to find it
(PGD, PMD, PTE). If you can fit into a 42-bit address space, you can
reduce it by one cache miss, but my impression is that applications
which use 21 bits of address space for a single allocation want more
address space than your average application.