2023-07-14 16:17:36

by Ryan Roberts

Subject: [PATCH v3 0/4] variable-order, large folios for anonymous memory

Hi All,

This is v3 of a series to implement variable-order, large folios for anonymous
memory (currently called "FLEXIBLE_THP"). The objective is to improve
performance by allocating larger chunks of memory during anonymous page faults.
See [1] and [2] for background.

There has been quite a bit more rework and simplification, mainly based on
feedback from Yu Zhao. Additionally, I've added a command line parameter,
flexthp_unhinted_max, the idea for which came from discussion with David
Hildenbrand (thanks for all your feedback!).

The last patch is for arm64 to explicitly override the default
arch_wants_pte_order() and is intended as an example. If this series is
accepted, I suggest taking the first 3 patches through the mm tree and handling
the arm64 change separately through the arm64 tree. Neither has any build
dependency on the other.

The patches are based on top of v6.5-rc1. I have a branch at [3].


Changes since v2 [2]
--------------------

- Dropped commit "Allow deferred splitting of arbitrary large anon folios"
  - Huang, Ying suggested that the "batch zap" work (which I dropped from this
    series after v1) is a prerequisite for merging FLEXIBLE_THP, so I've
    moved the deferred split patch to a separate series along with the batch
    zap changes. I plan to submit this series early next week.
- Changed folio order fallback policy (see the sketch after this list)
  - We no longer iterate from the preferred order down to 0 looking for an
    acceptable order
  - Instead we iterate through the preferred order, PAGE_ALLOC_COSTLY_ORDER
    and 0 only
- Removed vma parameter from arch_wants_pte_order()
- Added command line parameter `flexthp_unhinted_max`
  - Clamps the preferred order when the vma hasn't explicitly opted in to THP
- Never allocate a large folio for a MADV_NOHUGEPAGE vma (or when THP is
  disabled for the process or system).
- Simplified implementation and integration with do_anonymous_page()
- Removed dependency on set_ptes()
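
For reference, the new fallback policy reduces to something like the following
sketch (illustrative only; the real logic lives in alloc_anon_folio() in patch
3):

	/* Orders tried in turn; 0 terminates the search, and order-0 is then
	 * allocated via the existing fallback path.
	 */
	int orders[] = {
		preferred,
		preferred > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
		0,
	};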


Performance
-----------

Performance is still similar to v2; see cover letter at [2].


Opens
-----

- Feature name: FLEXIBLE_THP or LARGE_ANON_FOLIO?
  - Given the closer policy ties to THP, I prefer FLEXIBLE_THP
- Prerequisites for merging
  - Sounds like there is a consensus that we should wait until existing
    features are improved to play nicely with large folios.


[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anonfolio-lkml_v3


Thanks,
Ryan


Ryan Roberts (4):
mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
mm: Default implementation of arch_wants_pte_order()
mm: FLEXIBLE_THP for improved performance
arm64: mm: Override arch_wants_pte_order()

.../admin-guide/kernel-parameters.txt | 10 +
arch/arm64/include/asm/pgtable.h | 6 +
include/linux/pgtable.h | 13 ++
mm/Kconfig | 10 +
mm/memory.c | 187 ++++++++++++++++--
mm/rmap.c | 28 ++-
6 files changed, 230 insertions(+), 24 deletions(-)

--
2.25.1



2023-07-14 16:43:36

by Ryan Roberts

Subject: [PATCH v3 4/4] arm64: mm: Override arch_wants_pte_order()

Define an arch-specific override of arch_wants_pte_order() so that when
FLEXIBLE_THP is enabled, large folios will be allocated for anonymous
memory with an order that is compatible with arm64's contpte mappings.
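
As a worked example: with 4K pages, arm64's contiguous bit spans 16 ptes, so
CONT_PTE_SHIFT - PAGE_SHIFT evaluates to 4 and anonymous faults would prefer
order-4 (64K) folios that exactly cover one contpte block. The numbers differ
for the 16K and 64K page size configurations.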

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0bd18de9fd97..d00bb26fe28f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1106,6 +1106,12 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t old_pte, pte_t new_pte);
+
+#define arch_wants_pte_order arch_wants_pte_order
+static inline int arch_wants_pte_order(void)
+{
+ return CONT_PTE_SHIFT - PAGE_SHIFT;
+}
#endif /* !__ASSEMBLY__ */

#endif /* __ASM_PGTABLE_H */
--
2.25.1


2023-07-14 16:45:29

by Ryan Roberts

Subject: [PATCH v3 1/4] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()

In preparation for FLEXIBLE_THP support, improve
folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
passed to it. In this case, all contained pages are accounted using the
order-0 folio (or base page) scheme.

Signed-off-by: Ryan Roberts <[email protected]>
Reviewed-by: Yu Zhao <[email protected]>
Reviewed-by: Yin Fengwei <[email protected]>
---
mm/rmap.c | 28 +++++++++++++++++++++-------
1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 0c0d8857dfce..f293d072368a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1278,31 +1278,45 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
* This means the inc-and-test can be bypassed.
* The folio does not have to be locked.
*
- * If the folio is large, it is accounted as a THP. As the folio
+ * If the folio is pmd-mappable, it is accounted as a THP. As the folio
* is new, it's assumed to be mapped exclusively by a single process.
*/
void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
unsigned long address)
{
- int nr;
+ int nr = folio_nr_pages(folio);

- VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+ VM_BUG_ON_VMA(address < vma->vm_start ||
+ address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
__folio_set_swapbacked(folio);

- if (likely(!folio_test_pmd_mappable(folio))) {
+ if (!folio_test_large(folio)) {
/* increment count (starts at -1) */
atomic_set(&folio->_mapcount, 0);
- nr = 1;
+ __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
+ } else if (!folio_test_pmd_mappable(folio)) {
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ struct page *page = folio_page(folio, i);
+
+ /* increment count (starts at -1) */
+ atomic_set(&page->_mapcount, 0);
+ __page_set_anon_rmap(folio, page, vma,
+ address + (i << PAGE_SHIFT), 1);
+ }
+
+ /* increment count (starts at 0) */
+ atomic_set(&folio->_nr_pages_mapped, nr);
} else {
/* increment count (starts at -1) */
atomic_set(&folio->_entire_mapcount, 0);
atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
- nr = folio_nr_pages(folio);
+ __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
}

__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
- __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
}

/**
--
2.25.1


2023-07-14 16:55:15

by Ryan Roberts

Subject: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
allocated in large folios of a determined order. All pages of the large
folio are pte-mapped during the same page fault, significantly reducing
the number of page faults. The number of per-page operations (e.g. ref
counting, rmap management, lru list management) is also significantly
reduced, since those ops now become per-folio.

The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
defaults to disabled for now; the long-term aim is for this to default to
enabled, but there are some risks around internal fragmentation that
need to be better understood first.

When enabled, the folio order is determined as follows: for a vma, process
or system that has explicitly disabled THP, we continue to allocate
order-0. THP is most likely disabled to avoid any possible internal
fragmentation so we honour that request.

Otherwise, the return value of arch_wants_pte_order() is used. For vmas
that have not explicitly opted in to using transparent hugepages (e.g.
where thp=madvise and the vma does not have MADV_HUGEPAGE),
arch_wants_pte_order() is limited by the new cmdline parameter,
`flexthp_unhinted_max`. This allows for a performance boost without
requiring any explicit opt-in from the workload while allowing the
sysadmin to tune between performance and internal fragmentation.
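
As an illustrative example (hypothetical values): with thp=madvise on a
4K-page system where arch_wants_pte_order() returns 4, booting with

	flexthp_unhinted_max=16K

means a vma with MADV_HUGEPAGE still gets order-4 (64K) folios, a vma without
the hint is clamped to order-2 (16K), and a MADV_NOHUGEPAGE vma remains at
order-0.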

arch_wants_pte_order() can be overridden by the architecture if desired.
Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
set of ptes maps physically contiguous, naturally aligned memory, so this
mechanism allows the architecture to optimize as required.

If the preferred order can't be used (e.g. because the folio would
breach the bounds of the vma, or because ptes in the region are already
mapped) then we fall back to a suitable lower order; first
PAGE_ALLOC_COSTLY_ORDER, then order-0.

Signed-off-by: Ryan Roberts <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 10 +
mm/Kconfig | 10 +
mm/memory.c | 187 ++++++++++++++++--
3 files changed, 190 insertions(+), 17 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a1457995fd41..405d624e2191 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1497,6 +1497,16 @@
See Documentation/admin-guide/sysctl/net.rst for
fb_tunnels_only_for_init_ns

+ flexthp_unhinted_max=
+ [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
+ folio size that will be allocated for an anonymous vma
+ that has neither explicitly opted in nor out of using
+ transparent hugepages. The size must be a power-of-2 in
+ the range [PAGE_SIZE, PMD_SIZE). A larger size improves
+ performance by reducing page faults, while a smaller
+ size reduces internal fragmentation. Default: max(64K,
+ PAGE_SIZE). Format: size[KMG].
+
floppy= [HW]
See Documentation/admin-guide/blockdev/floppy.rst.

diff --git a/mm/Kconfig b/mm/Kconfig
index 09130434e30d..26c5e51ef11d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -848,6 +848,16 @@ config READ_ONLY_THP_FOR_FS
support of file THPs will be developed in the next few release
cycles.

+config FLEXIBLE_THP
+ bool "Flexible order THP"
+ depends on TRANSPARENT_HUGEPAGE
+ default n
+ help
+ Use large (bigger than order-0) folios to back anonymous memory where
+ possible, even for pte-mapped memory. This reduces the number of page
+ faults, as well as other per-page overheads to improve performance for
+ many workloads.
+
endif # TRANSPARENT_HUGEPAGE

#
diff --git a/mm/memory.c b/mm/memory.c
index 01f39e8144ef..e8bc729efb9d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4050,6 +4050,148 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
return ret;
}

+static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
+{
+ int i;
+
+ if (nr_pages == 1)
+ return vmf_pte_changed(vmf);
+
+ for (i = 0; i < nr_pages; i++) {
+ if (!pte_none(ptep_get_lockless(vmf->pte + i)))
+ return true;
+ }
+
+ return false;
+}
+
+#ifdef CONFIG_FLEXIBLE_THP
+static int flexthp_unhinted_max_order =
+ ilog2(SZ_64K > PAGE_SIZE ? SZ_64K : PAGE_SIZE) - PAGE_SHIFT;
+
+static int __init parse_flexthp_unhinted_max(char *s)
+{
+ unsigned long long size = memparse(s, NULL);
+
+ if (!is_power_of_2(size) || size < PAGE_SIZE || size > PMD_SIZE) {
+ pr_warn("flexthp: flexthp_unhinted_max=%s must be power-of-2 between PAGE_SIZE (%lu) and PMD_SIZE (%lu), ignoring\n",
+ s, PAGE_SIZE, PMD_SIZE);
+ return 1;
+ }
+
+ flexthp_unhinted_max_order = ilog2(size) - PAGE_SHIFT;
+
+ /* THP machinery requires at least 3 struct pages for meta data. */
+ if (flexthp_unhinted_max_order == 1)
+ flexthp_unhinted_max_order--;
+
+ return 1;
+}
+
+__setup("flexthp_unhinted_max=", parse_flexthp_unhinted_max);
+
+static int anon_folio_order(struct vm_area_struct *vma)
+{
+ int order;
+
+ /*
+ * If THP is explicitly disabled for either the vma, the process or the
+ * system, then this is very likely intended to limit internal
+ * fragmentation; in this case, don't attempt to allocate a large
+ * anonymous folio.
+ *
+ * Else, if the vma is eligible for thp, allocate a large folio of the
+ * size preferred by the arch. Or if the arch requested a very small
+ * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
+ * which still meets the arch's requirements but means we still take
+ * advantage of SW optimizations (e.g. fewer page faults).
+ *
+ * Finally if thp is enabled but the vma isn't eligible, take the
+ * arch-preferred size and limit it to the flexthp_unhinted_max cmdline
+ * parameter. This allows a sysadmin to tune performance vs internal
+ * fragmentation.
+ */
+
+ if ((vma->vm_flags & VM_NOHUGEPAGE) ||
+ test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
+ !hugepage_flags_enabled())
+ order = 0;
+ else {
+ order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
+
+ if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
+ order = min(order, flexthp_unhinted_max_order);
+ }
+
+ return order;
+}
+
+static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
+{
+ int i;
+ gfp_t gfp;
+ pte_t *pte;
+ unsigned long addr;
+ struct vm_area_struct *vma = vmf->vma;
+ int prefer = anon_folio_order(vma);
+ int orders[] = {
+ prefer,
+ prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
+ 0,
+ };
+
+ *folio = NULL;
+
+ if (vmf_orig_pte_uffd_wp(vmf))
+ goto fallback;
+
+ for (i = 0; orders[i]; i++) {
+ addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
+ if (addr >= vma->vm_start &&
+ addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
+ break;
+ }
+
+ if (!orders[i])
+ goto fallback;
+
+ pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
+ if (!pte)
+ return -EAGAIN;
+
+ for (; orders[i]; i++) {
+ addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
+ vmf->pte = pte + pte_index(addr);
+ if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
+ break;
+ }
+
+ vmf->pte = NULL;
+ pte_unmap(pte);
+
+ gfp = vma_thp_gfp_mask(vma);
+
+ for (; orders[i]; i++) {
+ addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
+ *folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
+ if (*folio) {
+ clear_huge_page(&(*folio)->page, addr, 1 << orders[i]);
+ return 0;
+ }
+ }
+
+fallback:
+ *folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+ return *folio ? 0 : -ENOMEM;
+}
+#else
+static inline int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
+{
+ *folio = vma_alloc_zeroed_movable_folio(vmf->vma, vmf->address);
+ return *folio ? 0 : -ENOMEM;
+}
+#endif
+
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -4057,11 +4199,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
{
+ int i = 0;
+ int nr_pages = 1;
bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
struct vm_area_struct *vma = vmf->vma;
struct folio *folio;
vm_fault_t ret = 0;
pte_t entry;
+ unsigned long addr;

/* File mapping without ->vm_ops ? */
if (vma->vm_flags & VM_SHARED)
@@ -4101,10 +4246,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
/* Allocate our own private page. */
if (unlikely(anon_vma_prepare(vma)))
goto oom;
- folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+ ret = alloc_anon_folio(vmf, &folio);
+ if (unlikely(ret == -EAGAIN))
+ return 0;
if (!folio)
goto oom;

+ nr_pages = folio_nr_pages(folio);
+ addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+
if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
goto oom_free_page;
folio_throttle_swaprate(folio, GFP_KERNEL);
@@ -4116,17 +4266,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
*/
__folio_mark_uptodate(folio);

- entry = mk_pte(&folio->page, vma->vm_page_prot);
- entry = pte_sw_mkyoung(entry);
- if (vma->vm_flags & VM_WRITE)
- entry = pte_mkwrite(pte_mkdirty(entry));
-
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
if (!vmf->pte)
goto release;
- if (vmf_pte_changed(vmf)) {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
+ if (vmf_pte_range_changed(vmf, nr_pages)) {
+ for (i = 0; i < nr_pages; i++)
+ update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
goto release;
}

@@ -4141,16 +4286,24 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
return handle_userfault(vmf, VM_UFFD_MISSING);
}

- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
- folio_add_new_anon_rmap(folio, vma, vmf->address);
+ folio_ref_add(folio, nr_pages - 1);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+ folio_add_new_anon_rmap(folio, vma, addr);
folio_add_lru_vma(folio, vma);
+
+ for (i = 0; i < nr_pages; i++) {
+ entry = mk_pte(folio_page(folio, i), vma->vm_page_prot);
+ entry = pte_sw_mkyoung(entry);
+ if (vma->vm_flags & VM_WRITE)
+ entry = pte_mkwrite(pte_mkdirty(entry));
setpte:
- if (uffd_wp)
- entry = pte_mkuffd_wp(entry);
- set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+ if (uffd_wp)
+ entry = pte_mkuffd_wp(entry);
+ set_pte_at(vma->vm_mm, addr + PAGE_SIZE * i, vmf->pte + i, entry);

- /* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, vmf->address, vmf->pte);
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, addr + PAGE_SIZE * i, vmf->pte + i);
+ }
unlock:
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
--
2.25.1


2023-07-14 17:00:38

by Ryan Roberts

Subject: [PATCH v3 2/4] mm: Default implementation of arch_wants_pte_order()

arch_wants_pte_order() can be overridden by the arch to return the
preferred folio order for pte-mapped memory. This is useful as some
architectures (e.g. arm64) can coalesce TLB entries when the physical
memory is suitably contiguous.

The first user for this hint will be FLEXIBLE_THP, which aims to
allocate large folios for anonymous memory to reduce page faults and
other per-page operation costs.

Here we add the default implementation of the function, used when the
architecture does not define it, which returns -1, implying that the HW
has no preference. In this case, mm will choose its own default order.
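
A minimal sketch of how a caller might consume the "no preference" value,
mirroring the usage in the FLEXIBLE_THP patch (clamping to
PAGE_ALLOC_COSTLY_ORDER is that patch's choice, not something mandated here):

	/* A negative (no preference) value falls back to mm's own default. */
	order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);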

Signed-off-by: Ryan Roberts <[email protected]>
---
include/linux/pgtable.h | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5063b482e34f..2a1d83775837 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -313,6 +313,19 @@ static inline bool arch_has_hw_pte_young(void)
}
#endif

+#ifndef arch_wants_pte_order
+/*
+ * Returns preferred folio order for pte-mapped memory. Must be in range [0,
+ * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
+ * to be at least order-2. Negative value implies that the HW has no preference
+ * and mm will choose it's own default order.
+ */
+static inline int arch_wants_pte_order(void)
+{
+ return -1;
+}
+#endif
+
#ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
unsigned long address,
--
2.25.1


2023-07-14 17:03:04

by Yu Zhao

Subject: Re: [PATCH v3 4/4] arm64: mm: Override arch_wants_pte_order()

On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <[email protected]> wrote:
>
> Define an arch-specific override of arch_wants_pte_order() so that when
> FLEXIBLE_THP is enabled, large folios will be allocated for anonymous
> memory with an order that is compatible with arm64's contpte mappings.
>
> Signed-off-by: Ryan Roberts <[email protected]>

Reviewed-by: Yu Zhao <[email protected]>

2023-07-14 17:04:05

by Yu Zhao

Subject: Re: [PATCH v3 2/4] mm: Default implementation of arch_wants_pte_order()

On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <[email protected]> wrote:
>
> arch_wants_pte_order() can be overridden by the arch to return the
> preferred folio order for pte-mapped memory. This is useful as some
> architectures (e.g. arm64) can coalesce TLB entries when the physical
> memory is suitably contiguous.
>
> The first user for this hint will be FLEXIBLE_THP, which aims to
> allocate large folios for anonymous memory to reduce page faults and
> other per-page operation costs.
>
> Here we add the default implementation of the function, used when the
> architecture does not define it, which returns -1, implying that the HW
> has no preference. In this case, mm will choose it's own default order.
>
> Signed-off-by: Ryan Roberts <[email protected]>

Reviewed-by: Yu Zhao <[email protected]>

Thanks: -1 actually is better than 0 (what I suggested) for the obvious reason.

2023-07-14 17:11:04

by Yu Zhao

Subject: Re: [PATCH v3 1/4] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()

On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <[email protected]> wrote:
>
> In preparation for FLEXIBLE_THP support, improve
> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
> passed to it. In this case, all contained pages are accounted using the
> order-0 folio (or base page) scheme.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> Reviewed-by: Yu Zhao <[email protected]>
> Reviewed-by: Yin Fengwei <[email protected]>

This patch doesn't depend on the rest of the series and therefore can
be merged separately in case the rest needs more discussion.

Ryan, please feel free to post other code paths (those from v1) you've
optimized for large anon folios at any time, since each code path can
be reviewed and merged individually as well.

2023-07-14 17:31:56

by Yu Zhao

Subject: Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <[email protected]> wrote:
>
> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> allocated in large folios of a determined order. All pages of the large
> folio are pte-mapped during the same page fault, significantly reducing
> the number of page faults. The number of per-page operations (e.g. ref
> counting, rmap management lru list management) are also significantly
> reduced since those ops now become per-folio.
>
> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
> defaults to disabled for now; The long term aim is for this to defaut to
> enabled, but there are some risks around internal fragmentation that
> need to be better understood first.
>
> When enabled, the folio order is determined as such: For a vma, process
> or system that has explicitly disabled THP, we continue to allocate
> order-0. THP is most likely disabled to avoid any possible internal
> fragmentation so we honour that request.
>
> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
> that have not explicitly opted-in to use transparent hugepages (e.g.
> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
> arch_wants_pte_order() is limited by the new cmdline parameter,
> `flexthp_unhinted_max`. This allows for a performance boost without
> requiring any explicit opt-in from the workload while allowing the
> sysadmin to tune between performance and internal fragmentation.
>
> arch_wants_pte_order() can be overridden by the architecture if desired.
> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous
> set of ptes map physically contigious, naturally aligned memory, so this
> mechanism allows the architecture to optimize as required.
>
> If the preferred order can't be used (e.g. because the folio would
> breach the bounds of the vma, or because ptes in the region are already
> mapped) then we fall back to a suitable lower order; first
> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> .../admin-guide/kernel-parameters.txt | 10 +
> mm/Kconfig | 10 +
> mm/memory.c | 187 ++++++++++++++++--
> 3 files changed, 190 insertions(+), 17 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index a1457995fd41..405d624e2191 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1497,6 +1497,16 @@
> See Documentation/admin-guide/sysctl/net.rst for
> fb_tunnels_only_for_init_ns
>
> + flexthp_unhinted_max=
> + [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
> + folio size that will be allocated for an anonymous vma
> + that has neither explicitly opted in nor out of using
> + transparent hugepages. The size must be a power-of-2 in
> + the range [PAGE_SIZE, PMD_SIZE). A larger size improves
> + performance by reducing page faults, while a smaller
> + size reduces internal fragmentation. Default: max(64K,
> + PAGE_SIZE). Format: size[KMG].
> +

Let's split this parameter into a separate patch.

And I'm going to ask many questions about it (I can live with a sysctl
parameter but this boot parameter is unacceptable to me).

> diff --git a/mm/memory.c b/mm/memory.c
> index 01f39e8144ef..e8bc729efb9d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4050,6 +4050,148 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> return ret;
> }
>
> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
> +{
> + int i;
> +
> + if (nr_pages == 1)
> + return vmf_pte_changed(vmf);
> +
> + for (i = 0; i < nr_pages; i++) {
> + if (!pte_none(ptep_get_lockless(vmf->pte + i)))
> + return true;
> + }
> +
> + return false;
> +}
> +
> +#ifdef CONFIG_FLEXIBLE_THP
> +static int flexthp_unhinted_max_order =
> + ilog2(SZ_64K > PAGE_SIZE ? SZ_64K : PAGE_SIZE) - PAGE_SHIFT;
> +
> +static int __init parse_flexthp_unhinted_max(char *s)
> +{
> + unsigned long long size = memparse(s, NULL);
> +
> + if (!is_power_of_2(size) || size < PAGE_SIZE || size > PMD_SIZE) {
> + pr_warn("flexthp: flexthp_unhinted_max=%s must be power-of-2 between PAGE_SIZE (%lu) and PMD_SIZE (%lu), ignoring\n",
> + s, PAGE_SIZE, PMD_SIZE);
> + return 1;
> + }
> +
> + flexthp_unhinted_max_order = ilog2(size) - PAGE_SHIFT;
> +
> + /* THP machinery requires at least 3 struct pages for meta data. */
> + if (flexthp_unhinted_max_order == 1)
> + flexthp_unhinted_max_order--;
> +
> + return 1;
> +}
> +
> +__setup("flexthp_unhinted_max=", parse_flexthp_unhinted_max);
> +
> +static int anon_folio_order(struct vm_area_struct *vma)
> +{
> + int order;
> +
> + /*
> + * If THP is explicitly disabled for either the vma, the process or the
> + * system, then this is very likely intended to limit internal
> + * fragmentation; in this case, don't attempt to allocate a large
> + * anonymous folio.
> + *
> + * Else, if the vma is eligible for thp, allocate a large folio of the
> + * size preferred by the arch. Or if the arch requested a very small
> + * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
> + * which still meets the arch's requirements but means we still take
> + * advantage of SW optimizations (e.g. fewer page faults).
> + *
> + * Finally if thp is enabled but the vma isn't eligible, take the
> + * arch-preferred size and limit it to the flexthp_unhinted_max cmdline
> + * parameter. This allows a sysadmin to tune performance vs internal
> + * fragmentation.
> + */
> +
> + if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
> + !hugepage_flags_enabled())
> + order = 0;
> + else {
> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> +
> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> + order = min(order, flexthp_unhinted_max_order);
> + }
> +
> + return order;
> +}
> +
> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> +{
> + int i;
> + gfp_t gfp;
> + pte_t *pte;
> + unsigned long addr;
> + struct vm_area_struct *vma = vmf->vma;
> + int prefer = anon_folio_order(vma);
> + int orders[] = {
> + prefer,
> + prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
> + 0,
> + };
> +
> + *folio = NULL;
> +
> + if (vmf_orig_pte_uffd_wp(vmf))
> + goto fallback;
> +
> + for (i = 0; orders[i]; i++) {
> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> + if (addr >= vma->vm_start &&
> + addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
> + break;
> + }
> +
> + if (!orders[i])
> + goto fallback;
> +
> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> + if (!pte)
> + return -EAGAIN;

It would be a bug if this happens. So probably -EINVAL?

> +
> + for (; orders[i]; i++) {
> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> + vmf->pte = pte + pte_index(addr);
> + if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
> + break;
> + }
> +
> + vmf->pte = NULL;
> + pte_unmap(pte);
> +
> + gfp = vma_thp_gfp_mask(vma);
> +
> + for (; orders[i]; i++) {
> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> + *folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
> + if (*folio) {
> + clear_huge_page(&(*folio)->page, addr, 1 << orders[i]);
> + return 0;
> + }
> + }
> +
> +fallback:
> + *folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> + return *folio ? 0 : -ENOMEM;
> +}
> +#else
> +static inline int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)

Drop "inline" (it doesn't do anything in .c).

The rest looks good to me.

2023-07-14 18:13:57

by Ryan Roberts

Subject: Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

On 14/07/2023 18:17, Yu Zhao wrote:
> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <[email protected]> wrote:
>>
>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>> allocated in large folios of a determined order. All pages of the large
>> folio are pte-mapped during the same page fault, significantly reducing
>> the number of page faults. The number of per-page operations (e.g. ref
>> counting, rmap management lru list management) are also significantly
>> reduced since those ops now become per-folio.
>>
>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>> defaults to disabled for now; The long term aim is for this to defaut to
>> enabled, but there are some risks around internal fragmentation that
>> need to be better understood first.
>>
>> When enabled, the folio order is determined as such: For a vma, process
>> or system that has explicitly disabled THP, we continue to allocate
>> order-0. THP is most likely disabled to avoid any possible internal
>> fragmentation so we honour that request.
>>
>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>> that have not explicitly opted-in to use transparent hugepages (e.g.
>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>> arch_wants_pte_order() is limited by the new cmdline parameter,
>> `flexthp_unhinted_max`. This allows for a performance boost without
>> requiring any explicit opt-in from the workload while allowing the
>> sysadmin to tune between performance and internal fragmentation.
>>
>> arch_wants_pte_order() can be overridden by the architecture if desired.
>> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous
>> set of ptes map physically contigious, naturally aligned memory, so this
>> mechanism allows the architecture to optimize as required.
>>
>> If the preferred order can't be used (e.g. because the folio would
>> breach the bounds of the vma, or because ptes in the region are already
>> mapped) then we fall back to a suitable lower order; first
>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> .../admin-guide/kernel-parameters.txt | 10 +
>> mm/Kconfig | 10 +
>> mm/memory.c | 187 ++++++++++++++++--
>> 3 files changed, 190 insertions(+), 17 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index a1457995fd41..405d624e2191 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -1497,6 +1497,16 @@
>> See Documentation/admin-guide/sysctl/net.rst for
>> fb_tunnels_only_for_init_ns
>>
>> + flexthp_unhinted_max=
>> + [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
>> + folio size that will be allocated for an anonymous vma
>> + that has neither explicitly opted in nor out of using
>> + transparent hugepages. The size must be a power-of-2 in
>> + the range [PAGE_SIZE, PMD_SIZE). A larger size improves
>> + performance by reducing page faults, while a smaller
>> + size reduces internal fragmentation. Default: max(64K,
>> + PAGE_SIZE). Format: size[KMG].
>> +
>
> Let's split this parameter into a separate patch.

Ha - I had it as a separate patch originally, but thought you'd ask for it to be
a single patch so squashed it ;-). I can separate it again, no problem.

>
> And I'm going to ask many questions about it (I can live with a sysctl
> parameter but this boot parameter is unacceptable to me).

Please do. Hopefully the thread with DavidH against v2 gives the rationale. Care
to elaborate on why it's unacceptable?

>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 01f39e8144ef..e8bc729efb9d 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4050,6 +4050,148 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> return ret;
>> }
>>
>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>> +{
>> + int i;
>> +
>> + if (nr_pages == 1)
>> + return vmf_pte_changed(vmf);
>> +
>> + for (i = 0; i < nr_pages; i++) {
>> + if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>> + return true;
>> + }
>> +
>> + return false;
>> +}
>> +
>> +#ifdef CONFIG_FLEXIBLE_THP
>> +static int flexthp_unhinted_max_order =
>> + ilog2(SZ_64K > PAGE_SIZE ? SZ_64K : PAGE_SIZE) - PAGE_SHIFT;
>> +
>> +static int __init parse_flexthp_unhinted_max(char *s)
>> +{
>> + unsigned long long size = memparse(s, NULL);
>> +
>> + if (!is_power_of_2(size) || size < PAGE_SIZE || size > PMD_SIZE) {
>> + pr_warn("flexthp: flexthp_unhinted_max=%s must be power-of-2 between PAGE_SIZE (%lu) and PMD_SIZE (%lu), ignoring\n",
>> + s, PAGE_SIZE, PMD_SIZE);
>> + return 1;
>> + }
>> +
>> + flexthp_unhinted_max_order = ilog2(size) - PAGE_SHIFT;
>> +
>> + /* THP machinery requires at least 3 struct pages for meta data. */
>> + if (flexthp_unhinted_max_order == 1)
>> + flexthp_unhinted_max_order--;
>> +
>> + return 1;
>> +}
>> +
>> +__setup("flexthp_unhinted_max=", parse_flexthp_unhinted_max);
>> +
>> +static int anon_folio_order(struct vm_area_struct *vma)
>> +{
>> + int order;
>> +
>> + /*
>> + * If THP is explicitly disabled for either the vma, the process or the
>> + * system, then this is very likely intended to limit internal
>> + * fragmentation; in this case, don't attempt to allocate a large
>> + * anonymous folio.
>> + *
>> + * Else, if the vma is eligible for thp, allocate a large folio of the
>> + * size preferred by the arch. Or if the arch requested a very small
>> + * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
>> + * which still meets the arch's requirements but means we still take
>> + * advantage of SW optimizations (e.g. fewer page faults).
>> + *
>> + * Finally if thp is enabled but the vma isn't eligible, take the
>> + * arch-preferred size and limit it to the flexthp_unhinted_max cmdline
>> + * parameter. This allows a sysadmin to tune performance vs internal
>> + * fragmentation.
>> + */
>> +
>> + if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>> + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
>> + !hugepage_flags_enabled())
>> + order = 0;
>> + else {
>> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>> +
>> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>> + order = min(order, flexthp_unhinted_max_order);
>> + }
>> +
>> + return order;
>> +}
>> +
>> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
>> +{
>> + int i;
>> + gfp_t gfp;
>> + pte_t *pte;
>> + unsigned long addr;
>> + struct vm_area_struct *vma = vmf->vma;
>> + int prefer = anon_folio_order(vma);
>> + int orders[] = {
>> + prefer,
>> + prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
>> + 0,
>> + };
>> +
>> + *folio = NULL;
>> +
>> + if (vmf_orig_pte_uffd_wp(vmf))
>> + goto fallback;
>> +
>> + for (i = 0; orders[i]; i++) {
>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
>> + if (addr >= vma->vm_start &&
>> + addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
>> + break;
>> + }
>> +
>> + if (!orders[i])
>> + goto fallback;
>> +
>> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>> + if (!pte)
>> + return -EAGAIN;
>
> It would be a bug if this happens. So probably -EINVAL?

Not sure what you mean? Hugh Dickins' series that went into v6.5-rc1 makes it
possible for pte_offset_map() to fail (if I understood correctly) and we have to
handle this. The intent is that we will return from the fault without making any
change, then we will refault and try again.

>
>> +
>> + for (; orders[i]; i++) {
>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
>> + vmf->pte = pte + pte_index(addr);
>> + if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
>> + break;
>> + }
>> +
>> + vmf->pte = NULL;
>> + pte_unmap(pte);
>> +
>> + gfp = vma_thp_gfp_mask(vma);
>> +
>> + for (; orders[i]; i++) {
>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
>> + *folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
>> + if (*folio) {
>> + clear_huge_page(&(*folio)->page, addr, 1 << orders[i]);
>> + return 0;
>> + }
>> + }
>> +
>> +fallback:
>> + *folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> + return *folio ? 0 : -ENOMEM;
>> +}
>> +#else
>> +static inline int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
>
> Drop "inline" (it doesn't do anything in .c).

There are 38 instances of inline in memory.c alone, so it looks like a
well-used convention, even if the compiler may choose to ignore it. Perhaps you
can educate me; what's the benefit of dropping it?

>
> The rest looks good to me.

Great - just in case it wasn't obvious, I decided not to overwrite vmf->address
with the aligned version, as you suggested, for 2 reasons: 1) address is const
in the struct, so I would have had to change that. 2) there is a uffd path that
can be taken after the vmf->address fixup would have occurred, and that path
consumes the member, so it would have had to be un-fixed-up, making it more
messy than the way I opted for.

Thanks for the quick review as always!


2023-07-14 18:35:10

by Ryan Roberts

Subject: Re: [PATCH v3 1/4] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()

On 14/07/2023 17:52, Yu Zhao wrote:
> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <[email protected]> wrote:
>>
>> In preparation for FLEXIBLE_THP support, improve
>> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
>> passed to it. In this case, all contained pages are accounted using the
>> order-0 folio (or base page) scheme.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> Reviewed-by: Yu Zhao <[email protected]>
>> Reviewed-by: Yin Fengwei <[email protected]>
>
> This patch doesn't depend on the rest of the series and therefore can
> be merged separately in case the rest needs more discussion.
>
> Ryan, please feel free to post other code paths (those from v1) you've
> optimized for large anon folios at any time, since each code path can
> be reviewed and merged individually as well.

Will do. Hoping to get the "batch zap" series out on Monday. Others need a bit
more time to bake as they will need to be aligned to the changes from the review
of this series first.

2023-07-14 22:29:56

by Yu Zhao

Subject: Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

On Fri, Jul 14, 2023 at 11:59 AM Ryan Roberts <[email protected]> wrote:
>
> On 14/07/2023 18:17, Yu Zhao wrote:
> > On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <[email protected]> wrote:
> >>
> >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> >> allocated in large folios of a determined order. All pages of the large
> >> folio are pte-mapped during the same page fault, significantly reducing
> >> the number of page faults. The number of per-page operations (e.g. ref
> >> counting, rmap management lru list management) are also significantly
> >> reduced since those ops now become per-folio.
> >>
> >> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
> >> defaults to disabled for now; The long term aim is for this to defaut to
> >> enabled, but there are some risks around internal fragmentation that
> >> need to be better understood first.
> >>
> >> When enabled, the folio order is determined as such: For a vma, process
> >> or system that has explicitly disabled THP, we continue to allocate
> >> order-0. THP is most likely disabled to avoid any possible internal
> >> fragmentation so we honour that request.
> >>
> >> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
> >> that have not explicitly opted-in to use transparent hugepages (e.g.
> >> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
> >> arch_wants_pte_order() is limited by the new cmdline parameter,
> >> `flexthp_unhinted_max`. This allows for a performance boost without
> >> requiring any explicit opt-in from the workload while allowing the
> >> sysadmin to tune between performance and internal fragmentation.
> >>
> >> arch_wants_pte_order() can be overridden by the architecture if desired.
> >> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous
> >> set of ptes map physically contigious, naturally aligned memory, so this
> >> mechanism allows the architecture to optimize as required.
> >>
> >> If the preferred order can't be used (e.g. because the folio would
> >> breach the bounds of the vma, or because ptes in the region are already
> >> mapped) then we fall back to a suitable lower order; first
> >> PAGE_ALLOC_COSTLY_ORDER, then order-0.
> >>
> >> Signed-off-by: Ryan Roberts <[email protected]>
> >> ---
> >> .../admin-guide/kernel-parameters.txt | 10 +
> >> mm/Kconfig | 10 +
> >> mm/memory.c | 187 ++++++++++++++++--
> >> 3 files changed, 190 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> >> index a1457995fd41..405d624e2191 100644
> >> --- a/Documentation/admin-guide/kernel-parameters.txt
> >> +++ b/Documentation/admin-guide/kernel-parameters.txt
> >> @@ -1497,6 +1497,16 @@
> >> See Documentation/admin-guide/sysctl/net.rst for
> >> fb_tunnels_only_for_init_ns
> >>
> >> + flexthp_unhinted_max=
> >> + [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
> >> + folio size that will be allocated for an anonymous vma
> >> + that has neither explicitly opted in nor out of using
> >> + transparent hugepages. The size must be a power-of-2 in
> >> + the range [PAGE_SIZE, PMD_SIZE). A larger size improves
> >> + performance by reducing page faults, while a smaller
> >> + size reduces internal fragmentation. Default: max(64K,
> >> + PAGE_SIZE). Format: size[KMG].
> >> +
> >
> > Let's split this parameter into a separate patch.
>
> Ha - I had it as a separate patch originally, but thought you'd ask for it to be
> a single patch so squashed it ;-). I can separate it again, no problem.
>
> >
> > And I'm going to ask many questions about it (I can live with a sysctl
> > parameter but this boot parameter is unacceptable to me).
>
> Please do. Hopefully the thread with DavidH against v2 gives the rationale. Care
> to elaborate on why its unacceptable?

For starters, it's bad practice not to start by supporting the first one of
the following that works, and then add the later ones if/when new use cases
arise:
1. per vma/process policies, e.g., madvise
2. per memcg policy, e.g., cgroup/memory.*
3. systemwide runtime knobs, e.g., sysctl
4. boot parameter, e.g., this one
5. kconfig option

Assuming the optimal state has one systemwide value, we would have to
find it by trial and error. And each trial would require a reboot,
which could be avoided if it was a sysctl parameter.

And why can we assume the optimal state has only one value?

(CONFIG_FLEXIBLE_THP is better than sysctl only because we plan to get
rid of it once we are ready -- using sysctl would cause an ABI
breakage, which might be acceptable but why do so if it can be
avoided.)

> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index 01f39e8144ef..e8bc729efb9d 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -4050,6 +4050,148 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> return ret;
> >> }
> >>
> >> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
> >> +{
> >> + int i;
> >> +
> >> + if (nr_pages == 1)
> >> + return vmf_pte_changed(vmf);
> >> +
> >> + for (i = 0; i < nr_pages; i++) {
> >> + if (!pte_none(ptep_get_lockless(vmf->pte + i)))
> >> + return true;
> >> + }
> >> +
> >> + return false;
> >> +}
> >> +
> >> +#ifdef CONFIG_FLEXIBLE_THP
> >> +static int flexthp_unhinted_max_order =
> >> + ilog2(SZ_64K > PAGE_SIZE ? SZ_64K : PAGE_SIZE) - PAGE_SHIFT;
> >> +
> >> +static int __init parse_flexthp_unhinted_max(char *s)
> >> +{
> >> + unsigned long long size = memparse(s, NULL);
> >> +
> >> + if (!is_power_of_2(size) || size < PAGE_SIZE || size > PMD_SIZE) {
> >> + pr_warn("flexthp: flexthp_unhinted_max=%s must be power-of-2 between PAGE_SIZE (%lu) and PMD_SIZE (%lu), ignoring\n",
> >> + s, PAGE_SIZE, PMD_SIZE);
> >> + return 1;
> >> + }
> >> +
> >> + flexthp_unhinted_max_order = ilog2(size) - PAGE_SHIFT;
> >> +
> >> + /* THP machinery requires at least 3 struct pages for meta data. */
> >> + if (flexthp_unhinted_max_order == 1)
> >> + flexthp_unhinted_max_order--;
> >> +
> >> + return 1;
> >> +}
> >> +
> >> +__setup("flexthp_unhinted_max=", parse_flexthp_unhinted_max);
> >> +
> >> +static int anon_folio_order(struct vm_area_struct *vma)
> >> +{
> >> + int order;
> >> +
> >> + /*
> >> + * If THP is explicitly disabled for either the vma, the process or the
> >> + * system, then this is very likely intended to limit internal
> >> + * fragmentation; in this case, don't attempt to allocate a large
> >> + * anonymous folio.
> >> + *
> >> + * Else, if the vma is eligible for thp, allocate a large folio of the
> >> + * size preferred by the arch. Or if the arch requested a very small
> >> + * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
> >> + * which still meets the arch's requirements but means we still take
> >> + * advantage of SW optimizations (e.g. fewer page faults).
> >> + *
> >> + * Finally if thp is enabled but the vma isn't eligible, take the
> >> + * arch-preferred size and limit it to the flexthp_unhinted_max cmdline
> >> + * parameter. This allows a sysadmin to tune performance vs internal
> >> + * fragmentation.
> >> + */
> >> +
> >> + if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> >> + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
> >> + !hugepage_flags_enabled())
> >> + order = 0;
> >> + else {
> >> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> >> +
> >> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> >> + order = min(order, flexthp_unhinted_max_order);
> >> + }
> >> +
> >> + return order;
> >> +}
> >> +
> >> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> >> +{
> >> + int i;
> >> + gfp_t gfp;
> >> + pte_t *pte;
> >> + unsigned long addr;
> >> + struct vm_area_struct *vma = vmf->vma;
> >> + int prefer = anon_folio_order(vma);
> >> + int orders[] = {
> >> + prefer,
> >> + prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
> >> + 0,
> >> + };
> >> +
> >> + *folio = NULL;
> >> +
> >> + if (vmf_orig_pte_uffd_wp(vmf))
> >> + goto fallback;
> >> +
> >> + for (i = 0; orders[i]; i++) {
> >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> >> + if (addr >= vma->vm_start &&
> >> + addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
> >> + break;
> >> + }
> >> +
> >> + if (!orders[i])
> >> + goto fallback;
> >> +
> >> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> >> + if (!pte)
> >> + return -EAGAIN;
> >
> > It would be a bug if this happens. So probably -EINVAL?
>
> Not sure what you mean? Hugh Dickins' series that went into v6.5-rc1 makes it
> possible for pte_offset_map() to fail (if I understood correctly) and we have to
> handle this. The intent is that we will return from the fault without making any
> change, then we will refault and try again.

Thanks for checking that -- it's very relevant. One detail is that
that series doesn't affect anon. IOW, collapsing PTEs into a PMD can't
happen while we are holding mmap_lock for read here, and therefore,
the race that could cause pte_offset_map() on shmem/file PTEs to fail
doesn't apply here.

+Hugh Dickins for further consultation if you need it.

> >> +
> >> + for (; orders[i]; i++) {
> >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> >> + vmf->pte = pte + pte_index(addr);
> >> + if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
> >> + break;
> >> + }
> >> +
> >> + vmf->pte = NULL;
> >> + pte_unmap(pte);
> >> +
> >> + gfp = vma_thp_gfp_mask(vma);
> >> +
> >> + for (; orders[i]; i++) {
> >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> >> + *folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
> >> + if (*folio) {
> >> + clear_huge_page(&(*folio)->page, addr, 1 << orders[i]);
> >> + return 0;
> >> + }
> >> + }
> >> +
> >> +fallback:
> >> + *folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> >> + return *folio ? 0 : -ENOMEM;
> >> +}
> >> +#else
> >> +static inline int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> >
> > Drop "inline" (it doesn't do anything in .c).
>
> There are 38 instances of inline in memory.c alone, so looks like a well used
> convention, even if the compiler may choose to ignore. Perhaps you can educate
> me; what's the benefit of dropping it?

I'll let Willy and Andrew educate both of us :)

+Matthew Wilcox +Andrew Morton please. Thank you.

> > The rest looks good to me.
>
> Great - just incase it wasn't obvious, I decided not to overwrite vmf->address
> with the aligned version, as you suggested

Yes, I've noticed. Not overwriting has its own merits for sure.

> for 2 reasons; 1) address is const
> in the struct, so would have had to change that. 2) there is a uffd path that
> can be taken after the vmf->address fixup would have occured and the path
> consumes that member, so it would have had to be un-fixed-up making it more
> messy than the way I opted for.
>
> Thanks for the quick review as always!

2023-07-17 11:22:12

by Yin, Fengwei

Subject: Re: [PATCH v3 2/4] mm: Default implementation of arch_wants_pte_order()



On 7/15/23 00:17, Ryan Roberts wrote:
> arch_wants_pte_order() can be overridden by the arch to return the
> preferred folio order for pte-mapped memory. This is useful as some
> architectures (e.g. arm64) can coalesce TLB entries when the physical
> memory is suitably contiguous.
>
> The first user for this hint will be FLEXIBLE_THP, which aims to
> allocate large folios for anonymous memory to reduce page faults and
> other per-page operation costs.
>
> Here we add the default implementation of the function, used when the
> architecture does not define it, which returns -1, implying that the HW
> has no preference. In this case, mm will choose it's own default order.
>
> Signed-off-by: Ryan Roberts <[email protected]>
Reviewed-by: Yin Fengwei <[email protected]>


Regards
Yin, Fengwei

> ---
> include/linux/pgtable.h | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 5063b482e34f..2a1d83775837 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -313,6 +313,19 @@ static inline bool arch_has_hw_pte_young(void)
> }
> #endif
>
> +#ifndef arch_wants_pte_order
> +/*
> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> + * to be at least order-2. Negative value implies that the HW has no preference
> + * and mm will choose it's own default order.
> + */
> +static inline int arch_wants_pte_order(void)
> +{
> + return -1;
> +}
> +#endif
> +
> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> unsigned long address,

2023-07-17 13:11:14

by David Hildenbrand

Subject: Re: [PATCH v3 2/4] mm: Default implementation of arch_wants_pte_order()

On 14.07.23 18:17, Ryan Roberts wrote:
> arch_wants_pte_order() can be overridden by the arch to return the
> preferred folio order for pte-mapped memory. This is useful as some
> architectures (e.g. arm64) can coalesce TLB entries when the physical
> memory is suitably contiguous.
>
> The first user for this hint will be FLEXIBLE_THP, which aims to
> allocate large folios for anonymous memory to reduce page faults and
> other per-page operation costs.
>
> Here we add the default implementation of the function, used when the
> architecture does not define it, which returns -1, implying that the HW
> has no preference. In this case, mm will choose it's own default order.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/pgtable.h | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 5063b482e34f..2a1d83775837 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -313,6 +313,19 @@ static inline bool arch_has_hw_pte_young(void)
> }
> #endif
>
> +#ifndef arch_wants_pte_order
> +/*
> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> + * to be at least order-2. Negative value implies that the HW has no preference
> + * and mm will choose it's own default order.
> + */
> +static inline int arch_wants_pte_order(void)
> +{
> + return -1;
> +}
> +#endif
> +
> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> unsigned long address,

What is the reason to have this into a separate patch? That should
simply be squashed into the actual user -- patch #3.

--
Cheers,

David / dhildenb


2023-07-17 13:16:22

by David Hildenbrand

Subject: Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

On 14.07.23 19:17, Yu Zhao wrote:
> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <[email protected]> wrote:
>>
>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>> allocated in large folios of a determined order. All pages of the large
>> folio are pte-mapped during the same page fault, significantly reducing
>> the number of page faults. The number of per-page operations (e.g. ref
>> counting, rmap management lru list management) are also significantly
>> reduced since those ops now become per-folio.
>>
>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>> defaults to disabled for now; The long term aim is for this to defaut to
>> enabled, but there are some risks around internal fragmentation that
>> need to be better understood first.
>>
>> When enabled, the folio order is determined as such: For a vma, process
>> or system that has explicitly disabled THP, we continue to allocate
>> order-0. THP is most likely disabled to avoid any possible internal
>> fragmentation so we honour that request.
>>
>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>> that have not explicitly opted-in to use transparent hugepages (e.g.
>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>> arch_wants_pte_order() is limited by the new cmdline parameter,
>> `flexthp_unhinted_max`. This allows for a performance boost without
>> requiring any explicit opt-in from the workload while allowing the
>> sysadmin to tune between performance and internal fragmentation.
>>
>> arch_wants_pte_order() can be overridden by the architecture if desired.
>> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous
>> set of ptes map physically contigious, naturally aligned memory, so this
>> mechanism allows the architecture to optimize as required.
>>
>> If the preferred order can't be used (e.g. because the folio would
>> breach the bounds of the vma, or because ptes in the region are already
>> mapped) then we fall back to a suitable lower order; first
>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> .../admin-guide/kernel-parameters.txt | 10 +
>> mm/Kconfig | 10 +
>> mm/memory.c | 187 ++++++++++++++++--
>> 3 files changed, 190 insertions(+), 17 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index a1457995fd41..405d624e2191 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -1497,6 +1497,16 @@
>> See Documentation/admin-guide/sysctl/net.rst for
>> fb_tunnels_only_for_init_ns
>>
>> + flexthp_unhinted_max=
>> + [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
>> + folio size that will be allocated for an anonymous vma
>> + that has neither explicitly opted in nor out of using
>> + transparent hugepages. The size must be a power-of-2 in
>> + the range [PAGE_SIZE, PMD_SIZE). A larger size improves
>> + performance by reducing page faults, while a smaller
>> + size reduces internal fragmentation. Default: max(64K,
>> + PAGE_SIZE). Format: size[KMG].
>> +
>
> Let's split this parameter into a separate patch.
>

Just a general comment after stumbling over patch #2, let's not start
splitting patches into things that don't make any sense on their own;
that just makes review a lot harder.

For this case here, I'd suggest first adding the general infrastructure
and then adding tunables we want to have on top.

I agree that toggling that at runtime (for example via sysfs as raised
by me previously) would be nicer.

--
Cheers,

David / dhildenb


2023-07-17 13:16:51

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()

On 14.07.23 18:17, Ryan Roberts wrote:
> In preparation for FLEXIBLE_THP support, improve
> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
> passed to it. In this case, all contained pages are accounted using the
> order-0 folio (or base page) scheme.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> Reviewed-by: Yu Zhao <[email protected]>
> Reviewed-by: Yin Fengwei <[email protected]>
> ---
> mm/rmap.c | 28 +++++++++++++++++++++-------
> 1 file changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0c0d8857dfce..f293d072368a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1278,31 +1278,45 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
> * This means the inc-and-test can be bypassed.
> * The folio does not have to be locked.
> *
> - * If the folio is large, it is accounted as a THP. As the folio
> + * If the folio is pmd-mappable, it is accounted as a THP. As the folio
> * is new, it's assumed to be mapped exclusively by a single process.
> */
> void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
> unsigned long address)
> {
> - int nr;
> + int nr = folio_nr_pages(folio);
>
> - VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> + VM_BUG_ON_VMA(address < vma->vm_start ||
> + address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
> __folio_set_swapbacked(folio);
>
> - if (likely(!folio_test_pmd_mappable(folio))) {
> + if (!folio_test_large(folio)) {

Why remove the "likely" here? The patch itself does not change anything
about that condition.

> /* increment count (starts at -1) */
> atomic_set(&folio->_mapcount, 0);
> - nr = 1;
> + __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
> + } else if (!folio_test_pmd_mappable(folio)) {
> + int i;
> +
> + for (i = 0; i < nr; i++) {
> + struct page *page = folio_page(folio, i);
> +
> + /* increment count (starts at -1) */
> + atomic_set(&page->_mapcount, 0);
> + __page_set_anon_rmap(folio, page, vma,
> + address + (i << PAGE_SHIFT), 1);
> + }
> +
> + /* increment count (starts at 0) */

That comment is a bit misleading. We're not talking about a mapcount as
in the other cases here.

> + atomic_set(&folio->_nr_pages_mapped, nr);
> } else {
> /* increment count (starts at -1) */
> atomic_set(&folio->_entire_mapcount, 0);
> atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
> - nr = folio_nr_pages(folio);
> + __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
> __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
> }
>

Apart from that, LGTM.

--
Cheers,

David / dhildenb


2023-07-17 13:19:01

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()

On 17/07/2023 14:00, David Hildenbrand wrote:
> On 14.07.23 18:17, Ryan Roberts wrote:
>> In preparation for FLEXIBLE_THP support, improve
>> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
>> passed to it. In this case, all contained pages are accounted using the
>> order-0 folio (or base page) scheme.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> Reviewed-by: Yu Zhao <[email protected]>
>> Reviewed-by: Yin Fengwei <[email protected]>
>> ---
>>   mm/rmap.c | 28 +++++++++++++++++++++-------
>>   1 file changed, 21 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 0c0d8857dfce..f293d072368a 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1278,31 +1278,45 @@ void page_add_anon_rmap(struct page *page, struct
>> vm_area_struct *vma,
>>    * This means the inc-and-test can be bypassed.
>>    * The folio does not have to be locked.
>>    *
>> - * If the folio is large, it is accounted as a THP.  As the folio
>> + * If the folio is pmd-mappable, it is accounted as a THP.  As the folio
>>    * is new, it's assumed to be mapped exclusively by a single process.
>>    */
>>   void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>>           unsigned long address)
>>   {
>> -    int nr;
>> +    int nr = folio_nr_pages(folio);
>>
>> -    VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
>> +    VM_BUG_ON_VMA(address < vma->vm_start ||
>> +            address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>>       __folio_set_swapbacked(folio);
>>
>> -    if (likely(!folio_test_pmd_mappable(folio))) {
>> +    if (!folio_test_large(folio)) {
>
> Why remove the "likely" here? The patch itself does not change anything about
> that condition.

Good question; I'm not sure why. Will have to put it down to bad copy/paste
fixup. Will put it back in the next version.

>
>>           /* increment count (starts at -1) */
>>           atomic_set(&folio->_mapcount, 0);
>> -        nr = 1;
>> +        __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>> +    } else if (!folio_test_pmd_mappable(folio)) {
>> +        int i;
>> +
>> +        for (i = 0; i < nr; i++) {
>> +            struct page *page = folio_page(folio, i);
>> +
>> +            /* increment count (starts at -1) */
>> +            atomic_set(&page->_mapcount, 0);
>> +            __page_set_anon_rmap(folio, page, vma,
>> +                    address + (i << PAGE_SHIFT), 1);
>> +        }
>> +
>> +        /* increment count (starts at 0) */
>
> That comment is a bit misleading. We're not talking about a mapcount as in the
> other cases here.

Correct, I'm talking about _nr_pages_mapped, which starts at 0, not -1 like
_mapcount. The comment was intended to be in the style used in other similar
places in rmap.c. I could change it to: "_nr_pages_mapped is 0-based, so set it
to the number of pages in the folio" or remove it entirely? What do you prefer?

>
>> +        atomic_set(&folio->_nr_pages_mapped, nr);
>>       } else {
>>           /* increment count (starts at -1) */
>>           atomic_set(&folio->_entire_mapcount, 0);
>>           atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
>> -        nr = folio_nr_pages(folio);
>> +        __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>>           __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
>>       }
>>
>
> Apart from that, LGTM.
>


2023-07-17 13:27:50

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v3 2/4] mm: Default implementation of arch_wants_pte_order()

On 17/07/2023 14:01, David Hildenbrand wrote:
> On 14.07.23 18:17, Ryan Roberts wrote:
>> arch_wants_pte_order() can be overridden by the arch to return the
>> preferred folio order for pte-mapped memory. This is useful as some
>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>> memory is suitably contiguous.
>>
>> The first user for this hint will be FLEXIBLE_THP, which aims to
>> allocate large folios for anonymous memory to reduce page faults and
>> other per-page operation costs.
>>
>> Here we add the default implementation of the function, used when the
>> architecture does not define it, which returns -1, implying that the HW
>> has no preference. In this case, mm will choose it's own default order.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>>   include/linux/pgtable.h | 13 +++++++++++++
>>   1 file changed, 13 insertions(+)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 5063b482e34f..2a1d83775837 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -313,6 +313,19 @@ static inline bool arch_has_hw_pte_young(void)
>>   }
>>   #endif
>>   +#ifndef arch_wants_pte_order
>> +/*
>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>> + * to be at least order-2. Negative value implies that the HW has no preference
>> + * and mm will choose it's own default order.
>> + */
>> +static inline int arch_wants_pte_order(void)
>> +{
>> +    return -1;
>> +}
>> +#endif
>> +
>>   #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>   static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>                          unsigned long address,
>
> What is the reason to have this into a separate patch? That should simply be
> squashed into the actual user -- patch #3.

There was a lot more in this at v1 IIRC, so it made more sense as a standalone
patch. I agree it can be squashed into the next patch now. Will do for next version.

>
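
For comparison, an arch override might look something like the below. This is
only an illustrative sketch based on arm64's contiguous-PTE block size, not
necessarily the exact code in patch 4:

/* arch/arm64/include/asm/pgtable.h (sketch) */
#define arch_wants_pte_order arch_wants_pte_order
static inline int arch_wants_pte_order(void)
{
	/* Prefer folios that line up with a contpte block (64K for 4K pages). */
	return CONT_PTE_SHIFT - PAGE_SHIFT;
}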


2023-07-17 13:35:04

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()

On 17/07/2023 14:19, David Hildenbrand wrote:
> On 17.07.23 15:13, Ryan Roberts wrote:
>> On 17/07/2023 14:00, David Hildenbrand wrote:
>>> On 14.07.23 18:17, Ryan Roberts wrote:
>>>> In preparation for FLEXIBLE_THP support, improve
>>>> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
>>>> passed to it. In this case, all contained pages are accounted using the
>>>> order-0 folio (or base page) scheme.
>>>>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>> Reviewed-by: Yu Zhao <[email protected]>
>>>> Reviewed-by: Yin Fengwei <[email protected]>
>>>> ---
>>>>    mm/rmap.c | 28 +++++++++++++++++++++-------
>>>>    1 file changed, 21 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 0c0d8857dfce..f293d072368a 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1278,31 +1278,45 @@ void page_add_anon_rmap(struct page *page, struct
>>>> vm_area_struct *vma,
>>>>     * This means the inc-and-test can be bypassed.
>>>>     * The folio does not have to be locked.
>>>>     *
>>>> - * If the folio is large, it is accounted as a THP.  As the folio
>>>> + * If the folio is pmd-mappable, it is accounted as a THP.  As the folio
>>>>     * is new, it's assumed to be mapped exclusively by a single process.
>>>>     */
>>>>    void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct
>>>> *vma,
>>>>            unsigned long address)
>>>>    {
>>>> -    int nr;
>>>> +    int nr = folio_nr_pages(folio);
>>>>
>>>> -    VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
>>>> +    VM_BUG_ON_VMA(address < vma->vm_start ||
>>>> +            address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>>>>        __folio_set_swapbacked(folio);
>>>>
>>>> -    if (likely(!folio_test_pmd_mappable(folio))) {
>>>> +    if (!folio_test_large(folio)) {
>>>
>>> Why remove the "likely" here? The patch itself does not change anything about
>>> that condition.
>>
>> Good question; I'm not sure why. Will have to put it down to bad copy/paste
>> fixup. Will put it back in the next version.
>>
>>>
>>>>            /* increment count (starts at -1) */
>>>>            atomic_set(&folio->_mapcount, 0);
>>>> -        nr = 1;
>>>> +        __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>>>> +    } else if (!folio_test_pmd_mappable(folio)) {
>>>> +        int i;
>>>> +
>>>> +        for (i = 0; i < nr; i++) {
>>>> +            struct page *page = folio_page(folio, i);
>>>> +
>>>> +            /* increment count (starts at -1) */
>>>> +            atomic_set(&page->_mapcount, 0);
>>>> +            __page_set_anon_rmap(folio, page, vma,
>>>> +                    address + (i << PAGE_SHIFT), 1);
>>>> +        }
>>>> +
>>>> +        /* increment count (starts at 0) */
>>>
>>> That comment is a bit misleading. We're not talking about a mapcount as in the
>>> other cases here.
>>
>> Correct, I'm talking about _nr_pages_mapped, which starts 0, not -1 like
>> _mapcount. The comment was intended to be in the style used in other similar
>> places in rmap.c. I could change it to: "_nr_pages_mapped is 0-based, so set it
>> to the number of pages in the folio" or remove it entirely? What do you prefer?
>>
>
> We only have to comment what's weird, not what's normal.
>
> IOW, we also didn't have such a comment in the existing code when doing
> atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
>
>
> What might make sense here is a simple
>
> "All pages of the folio are PTE-mapped."
>

ACK - thanks.
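
For reference, a minimal sketch (not the posted v4) of how that branch might
read once the likely() is restored and the comment is reworded as suggested:

	if (likely(!folio_test_large(folio))) {
		/* increment count (starts at -1) */
		atomic_set(&folio->_mapcount, 0);
		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
	} else if (!folio_test_pmd_mappable(folio)) {
		int i;

		for (i = 0; i < nr; i++) {
			struct page *page = folio_page(folio, i);

			/* increment count (starts at -1) */
			atomic_set(&page->_mapcount, 0);
			__page_set_anon_rmap(folio, page, vma,
					address + (i << PAGE_SHIFT), 1);
		}

		/* All pages of the folio are PTE-mapped. */
		atomic_set(&folio->_nr_pages_mapped, nr);
	} else {
		/* pmd-mappable path unchanged from the patch above */
	}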

2023-07-17 13:49:11

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

On 17/07/2023 14:06, David Hildenbrand wrote:
> On 14.07.23 19:17, Yu Zhao wrote:
>> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <[email protected]> wrote:
>>>
>>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>>> allocated in large folios of a determined order. All pages of the large
>>> folio are pte-mapped during the same page fault, significantly reducing
>>> the number of page faults. The number of per-page operations (e.g. ref
>>> counting, rmap management lru list management) are also significantly
>>> reduced since those ops now become per-folio.
>>>
>>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>>> defaults to disabled for now; The long term aim is for this to defaut to
>>> enabled, but there are some risks around internal fragmentation that
>>> need to be better understood first.
>>>
>>> When enabled, the folio order is determined as such: For a vma, process
>>> or system that has explicitly disabled THP, we continue to allocate
>>> order-0. THP is most likely disabled to avoid any possible internal
>>> fragmentation so we honour that request.
>>>
>>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>>> that have not explicitly opted-in to use transparent hugepages (e.g.
>>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>>> arch_wants_pte_order() is limited by the new cmdline parameter,
>>> `flexthp_unhinted_max`. This allows for a performance boost without
>>> requiring any explicit opt-in from the workload while allowing the
>>> sysadmin to tune between performance and internal fragmentation.
>>>
>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous
>>> set of ptes map physically contigious, naturally aligned memory, so this
>>> mechanism allows the architecture to optimize as required.
>>>
>>> If the preferred order can't be used (e.g. because the folio would
>>> breach the bounds of the vma, or because ptes in the region are already
>>> mapped) then we fall back to a suitable lower order; first
>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>>   .../admin-guide/kernel-parameters.txt         |  10 +
>>>   mm/Kconfig                                    |  10 +
>>>   mm/memory.c                                   | 187 ++++++++++++++++--
>>>   3 files changed, 190 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt
>>> b/Documentation/admin-guide/kernel-parameters.txt
>>> index a1457995fd41..405d624e2191 100644
>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>> @@ -1497,6 +1497,16 @@
>>>                          See Documentation/admin-guide/sysctl/net.rst for
>>>                          fb_tunnels_only_for_init_ns
>>>
>>> +       flexthp_unhinted_max=
>>> +                       [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
>>> +                       folio size that will be allocated for an anonymous vma
>>> +                       that has neither explicitly opted in nor out of using
>>> +                       transparent hugepages. The size must be a power-of-2 in
>>> +                       the range [PAGE_SIZE, PMD_SIZE). A larger size improves
>>> +                       performance by reducing page faults, while a smaller
>>> +                       size reduces internal fragmentation. Default: max(64K,
>>> +                       PAGE_SIZE). Format: size[KMG].
>>> +
>>
>> Let's split this parameter into a separate patch.
>>
>
> Just a general comment after stumbling over patch #2, let's not start splitting
> patches into things that don't make any sense on their own; that just makes
> review a lot harder.

ACK

>
> For this case here, I'd suggest first adding the general infrastructure and then
> adding tunables we want to have on top.

OK, so 1 patch for the main infrastructure, then a patch to disable for
MADV_NOHUGEPAGE and friends, then a further patch to set flexthp_unhinted_max
via a sysctl?

>
> I agree that toggling that at runtime (for example via sysfs as raised by me
> previously) would be nicer.

OK, I clearly misunderstood; I thought you were requesting a boot parameter.
What's the ABI compat guarantee for sysctls? I assumed that for a boot
parameter it would be easier to remove in future if we wanted, but for sysctl,
it's there forever?

Also, how do you feel about the naming and behavior of the parameter?

>


2023-07-17 14:23:27

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

On 17.07.23 15:20, Ryan Roberts wrote:
> On 17/07/2023 14:06, David Hildenbrand wrote:
>> On 14.07.23 19:17, Yu Zhao wrote:
>>> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>>>> allocated in large folios of a determined order. All pages of the large
>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>> counting, rmap management lru list management) are also significantly
>>>> reduced since those ops now become per-folio.
>>>>
>>>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>>>> defaults to disabled for now; The long term aim is for this to defaut to
>>>> enabled, but there are some risks around internal fragmentation that
>>>> need to be better understood first.
>>>>
>>>> When enabled, the folio order is determined as such: For a vma, process
>>>> or system that has explicitly disabled THP, we continue to allocate
>>>> order-0. THP is most likely disabled to avoid any possible internal
>>>> fragmentation so we honour that request.
>>>>
>>>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>>>> that have not explicitly opted-in to use transparent hugepages (e.g.
>>>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>>>> arch_wants_pte_order() is limited by the new cmdline parameter,
>>>> `flexthp_unhinted_max`. This allows for a performance boost without
>>>> requiring any explicit opt-in from the workload while allowing the
>>>> sysadmin to tune between performance and internal fragmentation.
>>>>
>>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>>> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous
>>>> set of ptes map physically contigious, naturally aligned memory, so this
>>>> mechanism allows the architecture to optimize as required.
>>>>
>>>> If the preferred order can't be used (e.g. because the folio would
>>>> breach the bounds of the vma, or because ptes in the region are already
>>>> mapped) then we fall back to a suitable lower order; first
>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>> ---
>>>>   .../admin-guide/kernel-parameters.txt         |  10 +
>>>>   mm/Kconfig                                    |  10 +
>>>>   mm/memory.c                                   | 187 ++++++++++++++++--
>>>>   3 files changed, 190 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt
>>>> b/Documentation/admin-guide/kernel-parameters.txt
>>>> index a1457995fd41..405d624e2191 100644
>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>> @@ -1497,6 +1497,16 @@
>>>>                          See Documentation/admin-guide/sysctl/net.rst for
>>>>                          fb_tunnels_only_for_init_ns
>>>>
>>>> +       flexthp_unhinted_max=
>>>> +                       [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
>>>> +                       folio size that will be allocated for an anonymous vma
>>>> +                       that has neither explicitly opted in nor out of using
>>>> +                       transparent hugepages. The size must be a power-of-2 in
>>>> +                       the range [PAGE_SIZE, PMD_SIZE). A larger size improves
>>>> +                       performance by reducing page faults, while a smaller
>>>> +                       size reduces internal fragmentation. Default: max(64K,
>>>> +                       PAGE_SIZE). Format: size[KMG].
>>>> +
>>>
>>> Let's split this parameter into a separate patch.
>>>
>>
>> Just a general comment after stumbling over patch #2, let's not start splitting
>> patches into things that don't make any sense on their own; that just makes
>> review a lot harder.
>
> ACK
>
>>
>> For this case here, I'd suggest first adding the general infrastructure and then
>> adding tunables we want to have on top.
>
> OK, so 1 patch for the main infrastructure, then a patch to disable for
> MADV_NOHUGEPAGE and friends, then a further patch to set flexthp_unhinted_max
> via a sysctl?

MADV_NOHUGEPAGE handling for me falls under the category "required for
correctness to not break existing workloads" and has to be there initially.

Anything that is rather a performance tunable (e.g., a sysctl to
optimize) can be added on top and discussed separately.

At least IMHO :)

>
>>
>> I agree that toggling that at runtime (for example via sysfs as raised by me
>> previously) would be nicer.
>
> OK, I clearly misunderstood, I thought you were requesting a boot parameter.

Oh, sorry about that. I wanted to actually express
"/sys/kernel/mm/transparent_hugepage/" sysctls where we can toggle that
later at runtime as well.

> What's the ABI compat guarrantee for sysctls? I assumed that for a boot
> parameter it would be easier to remove in future if we wanted, but for sysctl,
> its there forever?

sysctls are hard/impossible to remove, yes. So we better make sure what
we add has clear semantics.

If we ever want some real auto-tunable mode (and can actually implement
it without harming performance; and I am skeptical), we might want to
allow for setting such a parameter to "auto", for example.

>
> Also, how do you feel about the naming and behavior of the parameter?

Very good question. "flexthp_unhinted_max" naming is a bit suboptimal.

For example, I'm not so sure if we should expose the feature to user
space as "flexthp" at all. I think we should find a clearer feature name
to begin with.

... maybe we can initially get away with dropping that parameter and
default to something reasonably small (i.e., 64k as you have above)?

/sys/kernel/mm/transparent_hugepage/enabled=never and simply not get any
thp.

--
Cheers,

David / dhildenb


2023-07-17 14:27:24

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

>>>> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
>>>> +{
>>>> + int i;
>>>> + gfp_t gfp;
>>>> + pte_t *pte;
>>>> + unsigned long addr;
>>>> + struct vm_area_struct *vma = vmf->vma;
>>>> + int prefer = anon_folio_order(vma);
>>>> + int orders[] = {
>>>> + prefer,
>>>> + prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
>>>> + 0,
>>>> + };
>>>> +
>>>> + *folio = NULL;
>>>> +
>>>> + if (vmf_orig_pte_uffd_wp(vmf))
>>>> + goto fallback;
>>>> +
>>>> + for (i = 0; orders[i]; i++) {
>>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
>>>> + if (addr >= vma->vm_start &&
>>>> + addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
>>>> + break;
>>>> + }
>>>> +
>>>> + if (!orders[i])
>>>> + goto fallback;
>>>> +
>>>> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>>> + if (!pte)
>>>> + return -EAGAIN;
>>>
>>> It would be a bug if this happens. So probably -EINVAL?
>>
>> Not sure what you mean? Hugh Dickins' series that went into v6.5-rc1 makes it
>> possible for pte_offset_map() to fail (if I understood correctly) and we have to
>> handle this. The intent is that we will return from the fault without making any
>> change, then we will refault and try again.
>
> Thanks for checking that -- it's very relevant. One detail is that
> that series doesn't affect anon. IOW, collapsing PTEs into a PMD can't
> happen while we are holding mmap_lock for read here, and therefore,
> the race that could cause pte_offset_map() on shmem/file PTEs to fail
> doesn't apply here.

But Hugh's patches have changed do_anonymous_page() to handle failure from
pte_offset_map_lock(). So I was just following that pattern. If this really
can't happen, then I'd rather WARN/BUG on it, and simplify alloc_anon_folio()'s
prototype to just return a `struct folio *` (and if it's null that means ENOMEM).

Hugh, perhaps you can comment?

As an aside, it was my understanding from LWN that we are now using a per-VMA
lock, so presumably we don't hold mmap_lock for read here? Or perhaps that only
applies to file-backed memory?

>
> +Hugh Dickins for further consultation if you need it.
>
>>>> +
>>>> + for (; orders[i]; i++) {
>>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
>>>> + vmf->pte = pte + pte_index(addr);
>>>> + if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
>>>> + break;
>>>> + }
>>>> +
>>>> + vmf->pte = NULL;
>>>> + pte_unmap(pte);
>>>> +
>>>> + gfp = vma_thp_gfp_mask(vma);
>>>> +
>>>> + for (; orders[i]; i++) {
>>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
>>>> + *folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
>>>> + if (*folio) {
>>>> + clear_huge_page(&(*folio)->page, addr, 1 << orders[i]);
>>>> + return 0;
>>>> + }
>>>> + }
>>>> +
>>>> +fallback:
>>>> + *folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>>>> + return *folio ? 0 : -ENOMEM;
>>>> +}
>>>> +#else
>>>> +static inline int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
>>>
>>> Drop "inline" (it doesn't do anything in .c).
>>
>> There are 38 instances of inline in memory.c alone, so looks like a well used
>> convention, even if the compiler may choose to ignore. Perhaps you can educate
>> me; what's the benefit of dropping it?
>
> I'll let Willy and Andrew educate both of us :)
>
> +Matthew Wilcox +Andrew Morton please. Thank you.
>
>>> The rest looks good to me.
>>
>> Great - just incase it wasn't obvious, I decided not to overwrite vmf->address
>> with the aligned version, as you suggested
>
> Yes, I've noticed. Not overwriting has its own merits for sure.
>
>> for 2 reasons; 1) address is const
>> in the struct, so would have had to change that. 2) there is a uffd path that
>> can be taken after the vmf->address fixup would have occured and the path
>> consumes that member, so it would have had to be un-fixed-up making it more
>> messy than the way I opted for.
>>
>> Thanks for the quick review as always!


2023-07-17 14:27:50

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()

On 17.07.23 15:13, Ryan Roberts wrote:
> On 17/07/2023 14:00, David Hildenbrand wrote:
>> On 14.07.23 18:17, Ryan Roberts wrote:
>>> In preparation for FLEXIBLE_THP support, improve
>>> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
>>> passed to it. In this case, all contained pages are accounted using the
>>> order-0 folio (or base page) scheme.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> Reviewed-by: Yu Zhao <[email protected]>
>>> Reviewed-by: Yin Fengwei <[email protected]>
>>> ---
>>>   mm/rmap.c | 28 +++++++++++++++++++++-------
>>>   1 file changed, 21 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 0c0d8857dfce..f293d072368a 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1278,31 +1278,45 @@ void page_add_anon_rmap(struct page *page, struct
>>> vm_area_struct *vma,
>>>    * This means the inc-and-test can be bypassed.
>>>    * The folio does not have to be locked.
>>>    *
>>> - * If the folio is large, it is accounted as a THP.  As the folio
>>> + * If the folio is pmd-mappable, it is accounted as a THP.  As the folio
>>>    * is new, it's assumed to be mapped exclusively by a single process.
>>>    */
>>>   void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>>>           unsigned long address)
>>>   {
>>> -    int nr;
>>> +    int nr = folio_nr_pages(folio);
>>>
>>> -    VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
>>> +    VM_BUG_ON_VMA(address < vma->vm_start ||
>>> +            address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>>>       __folio_set_swapbacked(folio);
>>>
>>> -    if (likely(!folio_test_pmd_mappable(folio))) {
>>> +    if (!folio_test_large(folio)) {
>>
>> Why remove the "likely" here? The patch itself does not change anything about
>> that condition.
>
> Good question; I'm not sure why. Will have to put it down to bad copy/paste
> fixup. Will put it back in the next version.
>
>>
>>>           /* increment count (starts at -1) */
>>>           atomic_set(&folio->_mapcount, 0);
>>> -        nr = 1;
>>> +        __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>>> +    } else if (!folio_test_pmd_mappable(folio)) {
>>> +        int i;
>>> +
>>> +        for (i = 0; i < nr; i++) {
>>> +            struct page *page = folio_page(folio, i);
>>> +
>>> +            /* increment count (starts at -1) */
>>> +            atomic_set(&page->_mapcount, 0);
>>> +            __page_set_anon_rmap(folio, page, vma,
>>> +                    address + (i << PAGE_SHIFT), 1);
>>> +        }
>>> +
>>> +        /* increment count (starts at 0) */
>>
>> That comment is a bit misleading. We're not talking about a mapcount as in the
>> other cases here.
>
> Correct, I'm talking about _nr_pages_mapped, which starts 0, not -1 like
> _mapcount. The comment was intended to be in the style used in other similar
> places in rmap.c. I could change it to: "_nr_pages_mapped is 0-based, so set it
> to the number of pages in the folio" or remove it entirely? What do you prefer?
>

We only have to comment what's weird, not what's normal.

IOW, we also didn't have such a comment in the existing code when doing
atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);


What might make sense here is a simple

"All pages of the folio are PTE-mapped."

--
Cheers,

David / dhildenb


2023-07-17 15:24:01

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

On 17/07/2023 14:56, David Hildenbrand wrote:
> On 17.07.23 15:20, Ryan Roberts wrote:
>> On 17/07/2023 14:06, David Hildenbrand wrote:
>>> On 14.07.23 19:17, Yu Zhao wrote:
>>>> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <[email protected]> wrote:
>>>>>
>>>>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>>>>> allocated in large folios of a determined order. All pages of the large
>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>> counting, rmap management lru list management) are also significantly
>>>>> reduced since those ops now become per-folio.
>>>>>
>>>>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>>>>> defaults to disabled for now; The long term aim is for this to defaut to
>>>>> enabled, but there are some risks around internal fragmentation that
>>>>> need to be better understood first.
>>>>>
>>>>> When enabled, the folio order is determined as such: For a vma, process
>>>>> or system that has explicitly disabled THP, we continue to allocate
>>>>> order-0. THP is most likely disabled to avoid any possible internal
>>>>> fragmentation so we honour that request.
>>>>>
>>>>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>>>>> that have not explicitly opted-in to use transparent hugepages (e.g.
>>>>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>>>>> arch_wants_pte_order() is limited by the new cmdline parameter,
>>>>> `flexthp_unhinted_max`. This allows for a performance boost without
>>>>> requiring any explicit opt-in from the workload while allowing the
>>>>> sysadmin to tune between performance and internal fragmentation.
>>>>>
>>>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>>>> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous
>>>>> set of ptes map physically contigious, naturally aligned memory, so this
>>>>> mechanism allows the architecture to optimize as required.
>>>>>
>>>>> If the preferred order can't be used (e.g. because the folio would
>>>>> breach the bounds of the vma, or because ptes in the region are already
>>>>> mapped) then we fall back to a suitable lower order; first
>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>> ---
>>>>>    .../admin-guide/kernel-parameters.txt         |  10 +
>>>>>    mm/Kconfig                                    |  10 +
>>>>>    mm/memory.c                                   | 187 ++++++++++++++++--
>>>>>    3 files changed, 190 insertions(+), 17 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt
>>>>> b/Documentation/admin-guide/kernel-parameters.txt
>>>>> index a1457995fd41..405d624e2191 100644
>>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>>> @@ -1497,6 +1497,16 @@
>>>>>                           See Documentation/admin-guide/sysctl/net.rst for
>>>>>                           fb_tunnels_only_for_init_ns
>>>>>
>>>>> +       flexthp_unhinted_max=
>>>>> +                       [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The
>>>>> maximum
>>>>> +                       folio size that will be allocated for an anonymous vma
>>>>> +                       that has neither explicitly opted in nor out of using
>>>>> +                       transparent hugepages. The size must be a
>>>>> power-of-2 in
>>>>> +                       the range [PAGE_SIZE, PMD_SIZE). A larger size
>>>>> improves
>>>>> +                       performance by reducing page faults, while a smaller
>>>>> +                       size reduces internal fragmentation. Default: max(64K,
>>>>> +                       PAGE_SIZE). Format: size[KMG].
>>>>> +
>>>>
>>>> Let's split this parameter into a separate patch.
>>>>
>>>
>>> Just a general comment after stumbling over patch #2, let's not start splitting
>>> patches into things that don't make any sense on their own; that just makes
>>> review a lot harder.
>>
>> ACK
>>
>>>
>>> For this case here, I'd suggest first adding the general infrastructure and then
>>> adding tunables we want to have on top.
>>
>> OK, so 1 patch for the main infrastructure, then a patch to disable for
>> MADV_NOHUGEPAGE and friends, then a further patch to set flexthp_unhinted_max
>> via a sysctl?
>
> MADV_NOHUGEPAGE handling for me falls under the category "required for
> correctness to not break existing workloads" and has to be there initially.
>
> Anything that is rather a performance tunable (e.g., a sysctl to optimize) can
> be added on top and discussed separately.>
> At least IMHO :)
>
>>
>>>
>>> I agree that toggling that at runtime (for example via sysfs as raised by me
>>> previously) would be nicer.
>>
>> OK, I clearly misunderstood, I thought you were requesting a boot parameter.
>
> Oh, sorry about that. I wanted to actually express
> "/sys/kernel/mm/transparent_hugepage/" sysctls where we can toggle that later at
> runtime as well.
>
>> What's the ABI compat guarrantee for sysctls? I assumed that for a boot
>> parameter it would be easier to remove in future if we wanted, but for sysctl,
>> its there forever?
>
> sysctl are hard/impossible to remove, yes. So we better make sure what we add
> has clear semantics.
>
> If we ever want some real auto-tunable mode (and can actually implement it
> without harming performance; and I am skeptical), we might want to allow for
> setting such a parameter to "auto", for example.
>
>>
>> Also, how do you feel about the naming and behavior of the parameter?
>
> Very good question. "flexthp_unhinted_max" naming is a bit suboptimal.
>
> For example, I'm not so sure if we should expose the feature to user space as
> "flexthp" at all. I think we should find a clearer feature name to begin with.
>
> ... maybe we can initially get away with dropping that parameter and default to
> something reasonably small (i.e., 64k as you have above)?

That would certainly get my vote. But it was you who was arguing for a tunable
previously ;-). I propose we use the following as the "unhinted ceiling" for
now, then we can add a tunable if/when we find a use case that doesn't work
optimally with this value:

static int flexthp_unhinted_max_order =
	ilog2(SZ_64K > PAGE_SIZE ? SZ_64K : PAGE_SIZE) - PAGE_SHIFT;

(Using PAGE_SIZE when it's greater than 64K to cover the ppc case, which looks
like it can support 256K pages. Open-coding the max because max() can't be used
outside a function.)
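
To make that concrete, a minimal sketch of how such a ceiling could clamp the
arch preference. This is not the code from patch 3: the vm_flags checks are a
simplification of the real THP policy (which also consults prctl and the
system-wide mode), and falling back to PAGE_ALLOC_COSTLY_ORDER when the arch
has no preference is an assumption:

static int anon_folio_order(struct vm_area_struct *vma)
{
	int order = arch_wants_pte_order();

	/* Negative means the HW has no preference; pick mm's own default. */
	if (order < 0)
		order = PAGE_ALLOC_COSTLY_ORDER;

	/* Explicit opt-out (MADV_NOHUGEPAGE) always gets order-0. */
	if (vma->vm_flags & VM_NOHUGEPAGE)
		return 0;

	/* Without an explicit MADV_HUGEPAGE hint, apply the smaller cap. */
	if (!(vma->vm_flags & VM_HUGEPAGE))
		order = min(order, flexthp_unhinted_max_order);

	return order;
}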

>
> /sys/kernel/mm/transparent_hugepage/enabled=never and simply not get any thp.

Yes, that should work with the patch as it is today.

>


2023-07-17 15:35:07

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

On 17.07.23 16:47, Ryan Roberts wrote:
> On 17/07/2023 14:56, David Hildenbrand wrote:
>> On 17.07.23 15:20, Ryan Roberts wrote:
>>> On 17/07/2023 14:06, David Hildenbrand wrote:
>>>> On 14.07.23 19:17, Yu Zhao wrote:
>>>>> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <[email protected]> wrote:
>>>>>>
>>>>>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>>>>>> allocated in large folios of a determined order. All pages of the large
>>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>>> counting, rmap management lru list management) are also significantly
>>>>>> reduced since those ops now become per-folio.
>>>>>>
>>>>>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>>>>>> defaults to disabled for now; The long term aim is for this to defaut to
>>>>>> enabled, but there are some risks around internal fragmentation that
>>>>>> need to be better understood first.
>>>>>>
>>>>>> When enabled, the folio order is determined as such: For a vma, process
>>>>>> or system that has explicitly disabled THP, we continue to allocate
>>>>>> order-0. THP is most likely disabled to avoid any possible internal
>>>>>> fragmentation so we honour that request.
>>>>>>
>>>>>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>>>>>> that have not explicitly opted-in to use transparent hugepages (e.g.
>>>>>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>>>>>> arch_wants_pte_order() is limited by the new cmdline parameter,
>>>>>> `flexthp_unhinted_max`. This allows for a performance boost without
>>>>>> requiring any explicit opt-in from the workload while allowing the
>>>>>> sysadmin to tune between performance and internal fragmentation.
>>>>>>
>>>>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>>>>> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous
>>>>>> set of ptes map physically contigious, naturally aligned memory, so this
>>>>>> mechanism allows the architecture to optimize as required.
>>>>>>
>>>>>> If the preferred order can't be used (e.g. because the folio would
>>>>>> breach the bounds of the vma, or because ptes in the region are already
>>>>>> mapped) then we fall back to a suitable lower order; first
>>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>> ---
>>>>>>    .../admin-guide/kernel-parameters.txt         |  10 +
>>>>>>    mm/Kconfig                                    |  10 +
>>>>>>    mm/memory.c                                   | 187 ++++++++++++++++--
>>>>>>    3 files changed, 190 insertions(+), 17 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt
>>>>>> b/Documentation/admin-guide/kernel-parameters.txt
>>>>>> index a1457995fd41..405d624e2191 100644
>>>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>>>> @@ -1497,6 +1497,16 @@
>>>>>>                           See Documentation/admin-guide/sysctl/net.rst for
>>>>>>                           fb_tunnels_only_for_init_ns
>>>>>>
>>>>>> +       flexthp_unhinted_max=
>>>>>> +                       [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The
>>>>>> maximum
>>>>>> +                       folio size that will be allocated for an anonymous vma
>>>>>> +                       that has neither explicitly opted in nor out of using
>>>>>> +                       transparent hugepages. The size must be a
>>>>>> power-of-2 in
>>>>>> +                       the range [PAGE_SIZE, PMD_SIZE). A larger size
>>>>>> improves
>>>>>> +                       performance by reducing page faults, while a smaller
>>>>>> +                       size reduces internal fragmentation. Default: max(64K,
>>>>>> +                       PAGE_SIZE). Format: size[KMG].
>>>>>> +
>>>>>
>>>>> Let's split this parameter into a separate patch.
>>>>>
>>>>
>>>> Just a general comment after stumbling over patch #2, let's not start splitting
>>>> patches into things that don't make any sense on their own; that just makes
>>>> review a lot harder.
>>>
>>> ACK
>>>
>>>>
>>>> For this case here, I'd suggest first adding the general infrastructure and then
>>>> adding tunables we want to have on top.
>>>
>>> OK, so 1 patch for the main infrastructure, then a patch to disable for
>>> MADV_NOHUGEPAGE and friends, then a further patch to set flexthp_unhinted_max
>>> via a sysctl?
>>
>> MADV_NOHUGEPAGE handling for me falls under the category "required for
>> correctness to not break existing workloads" and has to be there initially.
>>
>> Anything that is rather a performance tunable (e.g., a sysctl to optimize) can
>> be added on top and discussed separately.>
>> At least IMHO :)
>>
>>>
>>>>
>>>> I agree that toggling that at runtime (for example via sysfs as raised by me
>>>> previously) would be nicer.
>>>
>>> OK, I clearly misunderstood, I thought you were requesting a boot parameter.
>>
>> Oh, sorry about that. I wanted to actually express
>> "/sys/kernel/mm/transparent_hugepage/" sysctls where we can toggle that later at
>> runtime as well.
>>
>>> What's the ABI compat guarrantee for sysctls? I assumed that for a boot
>>> parameter it would be easier to remove in future if we wanted, but for sysctl,
>>> its there forever?
>>
>> sysctl are hard/impossible to remove, yes. So we better make sure what we add
>> has clear semantics.
>>
>> If we ever want some real auto-tunable mode (and can actually implement it
>> without harming performance; and I am skeptical), we might want to allow for
>> setting such a parameter to "auto", for example.
>>
>>>
>>> Also, how do you feel about the naming and behavior of the parameter?
>>
>> Very good question. "flexthp_unhinted_max" naming is a bit suboptimal.
>>
>> For example, I'm not so sure if we should expose the feature to user space as
>> "flexthp" at all. I think we should find a clearer feature name to begin with.
>>
>> ... maybe we can initially get away with dropping that parameter and default to
>> something reasonably small (i.e., 64k as you have above)?
>
> That would certainly get my vote. But it was you who was arguing for a tunable
> previously ;-). I propose we use the following as the "unhinted ceiling" for

Yes, I still think having tunables makes sense.


But it's certainly something to add separately, especially if it makes
your work here easier.

As long as it can be disabled, good enough for me for the initial version.

--
Cheers,

David / dhildenb


2023-07-17 17:20:56

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

On Mon, Jul 17, 2023 at 7:06 AM David Hildenbrand <[email protected]> wrote:
>
> On 14.07.23 19:17, Yu Zhao wrote:
> > On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <[email protected]> wrote:
> >>
> >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> >> allocated in large folios of a determined order. All pages of the large
> >> folio are pte-mapped during the same page fault, significantly reducing
> >> the number of page faults. The number of per-page operations (e.g. ref
> >> counting, rmap management lru list management) are also significantly
> >> reduced since those ops now become per-folio.
> >>
> >> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
> >> defaults to disabled for now; The long term aim is for this to defaut to
> >> enabled, but there are some risks around internal fragmentation that
> >> need to be better understood first.
> >>
> >> When enabled, the folio order is determined as such: For a vma, process
> >> or system that has explicitly disabled THP, we continue to allocate
> >> order-0. THP is most likely disabled to avoid any possible internal
> >> fragmentation so we honour that request.
> >>
> >> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
> >> that have not explicitly opted-in to use transparent hugepages (e.g.
> >> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
> >> arch_wants_pte_order() is limited by the new cmdline parameter,
> >> `flexthp_unhinted_max`. This allows for a performance boost without
> >> requiring any explicit opt-in from the workload while allowing the
> >> sysadmin to tune between performance and internal fragmentation.
> >>
> >> arch_wants_pte_order() can be overridden by the architecture if desired.
> >> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous
> >> set of ptes map physically contigious, naturally aligned memory, so this
> >> mechanism allows the architecture to optimize as required.
> >>
> >> If the preferred order can't be used (e.g. because the folio would
> >> breach the bounds of the vma, or because ptes in the region are already
> >> mapped) then we fall back to a suitable lower order; first
> >> PAGE_ALLOC_COSTLY_ORDER, then order-0.
> >>
> >> Signed-off-by: Ryan Roberts <[email protected]>
> >> ---
> >> .../admin-guide/kernel-parameters.txt | 10 +
> >> mm/Kconfig | 10 +
> >> mm/memory.c | 187 ++++++++++++++++--
> >> 3 files changed, 190 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> >> index a1457995fd41..405d624e2191 100644
> >> --- a/Documentation/admin-guide/kernel-parameters.txt
> >> +++ b/Documentation/admin-guide/kernel-parameters.txt
> >> @@ -1497,6 +1497,16 @@
> >> See Documentation/admin-guide/sysctl/net.rst for
> >> fb_tunnels_only_for_init_ns
> >>
> >> + flexthp_unhinted_max=
> >> + [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
> >> + folio size that will be allocated for an anonymous vma
> >> + that has neither explicitly opted in nor out of using
> >> + transparent hugepages. The size must be a power-of-2 in
> >> + the range [PAGE_SIZE, PMD_SIZE). A larger size improves
> >> + performance by reducing page faults, while a smaller
> >> + size reduces internal fragmentation. Default: max(64K,
> >> + PAGE_SIZE). Format: size[KMG].
> >> +
> >
> > Let's split this parameter into a separate patch.
> >
>
> Just a general comment after stumbling over patch #2, let's not start
> splitting patches into things that don't make any sense on their own;
> that just makes review a lot harder.

Sorry to hear this -- but there are also non-subjective reasons we
split patches this way.

Initially we had minimal to no common ground, so we had to divide and
conquer in the smallest possible steps.

If you look at previous discussions: there was a disagreement on patch
2 in v2 -- that's the patch you asked to be squashed into the main
patch 3. Fortunately we've resolved that. If that disagreement had
persisted, we would leave patch 2 out rather than let it bog down
patch 3, which would work the same for all arches except arm and
could be merged separately.

> For this case here, I'd suggest first adding the general infrastructure
> and then adding tunables we want to have on top.
>
> I agree that toggling that at runtime (for example via sysfs as raised
> by me previously) would be nicer.

2023-07-17 17:36:13

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

On 17.07.23 19:07, Yu Zhao wrote:
> On Mon, Jul 17, 2023 at 7:06 AM David Hildenbrand <[email protected]> wrote:
>>
>> On 14.07.23 19:17, Yu Zhao wrote:
>>> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>>>> allocated in large folios of a determined order. All pages of the large
>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>> counting, rmap management lru list management) are also significantly
>>>> reduced since those ops now become per-folio.
>>>>
>>>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>>>> defaults to disabled for now; The long term aim is for this to defaut to
>>>> enabled, but there are some risks around internal fragmentation that
>>>> need to be better understood first.
>>>>
>>>> When enabled, the folio order is determined as such: For a vma, process
>>>> or system that has explicitly disabled THP, we continue to allocate
>>>> order-0. THP is most likely disabled to avoid any possible internal
>>>> fragmentation so we honour that request.
>>>>
>>>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>>>> that have not explicitly opted-in to use transparent hugepages (e.g.
>>>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>>>> arch_wants_pte_order() is limited by the new cmdline parameter,
>>>> `flexthp_unhinted_max`. This allows for a performance boost without
>>>> requiring any explicit opt-in from the workload while allowing the
>>>> sysadmin to tune between performance and internal fragmentation.
>>>>
>>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>>> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous
>>>> set of ptes map physically contigious, naturally aligned memory, so this
>>>> mechanism allows the architecture to optimize as required.
>>>>
>>>> If the preferred order can't be used (e.g. because the folio would
>>>> breach the bounds of the vma, or because ptes in the region are already
>>>> mapped) then we fall back to a suitable lower order; first
>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>> ---
>>>> .../admin-guide/kernel-parameters.txt | 10 +
>>>> mm/Kconfig | 10 +
>>>> mm/memory.c | 187 ++++++++++++++++--
>>>> 3 files changed, 190 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>>> index a1457995fd41..405d624e2191 100644
>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>> @@ -1497,6 +1497,16 @@
>>>> See Documentation/admin-guide/sysctl/net.rst for
>>>> fb_tunnels_only_for_init_ns
>>>>
>>>> + flexthp_unhinted_max=
>>>> + [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
>>>> + folio size that will be allocated for an anonymous vma
>>>> + that has neither explicitly opted in nor out of using
>>>> + transparent hugepages. The size must be a power-of-2 in
>>>> + the range [PAGE_SIZE, PMD_SIZE). A larger size improves
>>>> + performance by reducing page faults, while a smaller
>>>> + size reduces internal fragmentation. Default: max(64K,
>>>> + PAGE_SIZE). Format: size[KMG].
>>>> +
>>>
>>> Let's split this parameter into a separate patch.
>>>
>>
>> Just a general comment after stumbling over patch #2, let's not start
>> splitting patches into things that don't make any sense on their own;
>> that just makes review a lot harder.
>
> Sorry to hear this -- but there are also non-subjective reasons we
> split patches this way.
>
> Initially we had minimum to no common ground, so we had to divide and
> conquer by smallest steps.
>
> if you look at previous discussions: there was a disagreement on patch
> 2 in v2 -- that's the patch you asked to be squashed into the main
> patch 3. Fortunately we've resolved that. If that disagreement had
> persisted, we would leave patch 2 out rather than let it bog down
> patch 3, which would work indifferently for all arches except arm and
> could be merged separately.

All makes sense to me, and squashing it now is most probably the logical
step; the situation was different before.

As I said, just a general comment when we talk about splitting stuff out.

--
Cheers,

David / dhildenb


2023-07-17 19:54:31

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

On Mon, Jul 17, 2023 at 7:36 AM Ryan Roberts <[email protected]> wrote:
>
> >>>> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> >>>> +{
> >>>> + int i;
> >>>> + gfp_t gfp;
> >>>> + pte_t *pte;
> >>>> + unsigned long addr;
> >>>> + struct vm_area_struct *vma = vmf->vma;
> >>>> + int prefer = anon_folio_order(vma);
> >>>> + int orders[] = {
> >>>> + prefer,
> >>>> + prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
> >>>> + 0,
> >>>> + };
> >>>> +
> >>>> + *folio = NULL;
> >>>> +
> >>>> + if (vmf_orig_pte_uffd_wp(vmf))
> >>>> + goto fallback;
> >>>> +
> >>>> + for (i = 0; orders[i]; i++) {
> >>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> >>>> + if (addr >= vma->vm_start &&
> >>>> + addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
> >>>> + break;
> >>>> + }
> >>>> +
> >>>> + if (!orders[i])
> >>>> + goto fallback;
> >>>> +
> >>>> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> >>>> + if (!pte)
> >>>> + return -EAGAIN;
> >>>
> >>> It would be a bug if this happens. So probably -EINVAL?
> >>
> >> Not sure what you mean? Hugh Dickins' series that went into v6.5-rc1 makes it
> >> possible for pte_offset_map() to fail (if I understood correctly) and we have to
> >> handle this. The intent is that we will return from the fault without making any
> >> change, then we will refault and try again.
> >
> > Thanks for checking that -- it's very relevant. One detail is that
> > that series doesn't affect anon. IOW, collapsing PTEs into a PMD can't
> > happen while we are holding mmap_lock for read here, and therefore,
> > the race that could cause pte_offset_map() on shmem/file PTEs to fail
> > doesn't apply here.
>
> But Hugh's patches have changed do_anonymous_page() to handle failure from
> pte_offset_map_lock(). So I was just following that pattern. If this really
> can't happen, then I'd rather WARN/BUG on it, and simplify alloc_anon_folio()'s
> prototype to just return a `struct folio *` (and if it's null that means ENOMEM).
>
> Hugh, perhaps you can comment?
>
> As an aside, it was my understanding from LWN, that we are now using a per-VMA
> lock so presumably we don't hold mmap_lock for read here? Or perhaps that only
> applies to file-backed memory?

For anon under mmap_lock for read:
1. pte_offset_map[_lock]() fails when a parallel PF changes PMD from
none to leaf.
2. changing PMD from non-leaf to leaf is a bug. See the comments in
the "else" branch in handle_pte_fault().

So for do_anonymous_page(), there is only one case
pte_offset_map[_lock]() can fail. For the code above, this case was
ruled out by vmf_orig_pte_uffd_wp().

Checking the return value from pte_offset_map[_lock]() is a good
practice. What I'm saying is that -EAGAIN would mislead people to
think, in our case, !pte is legitimate, and hence the suggestion of
replacing it with -EINVAL.

No BUG_ON() please. As I've previously mentioned, it's against
Documentation/process/coding-style.rst.

> > +Hugh Dickins for further consultation if you need it.
> >
> >>>> +
> >>>> + for (; orders[i]; i++) {
> >>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> >>>> + vmf->pte = pte + pte_index(addr);
> >>>> + if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
> >>>> + break;
> >>>> + }
> >>>> +
> >>>> + vmf->pte = NULL;
> >>>> + pte_unmap(pte);
> >>>> +
> >>>> + gfp = vma_thp_gfp_mask(vma);
> >>>> +
> >>>> + for (; orders[i]; i++) {
> >>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> >>>> + *folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
> >>>> + if (*folio) {
> >>>> + clear_huge_page(&(*folio)->page, addr, 1 << orders[i]);
> >>>> + return 0;
> >>>> + }
> >>>> + }
> >>>> +
> >>>> +fallback:
> >>>> + *folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> >>>> + return *folio ? 0 : -ENOMEM;
> >>>> +}
> >>>> +#else
> >>>> +static inline int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> >>>
> >>> Drop "inline" (it doesn't do anything in .c).
> >>
> >> There are 38 instances of inline in memory.c alone, so it looks like a well-used
> >> convention, even if the compiler may choose to ignore it. Perhaps you can educate
> >> me; what's the benefit of dropping it?
> >
> > I'll let Willy and Andrew educate both of us :)
> >
> > +Matthew Wilcox +Andrew Morton please. Thank you.
> >
> >>> The rest looks good to me.
> >>
> >> Great - just in case it wasn't obvious, I decided not to overwrite vmf->address
> >> with the aligned version, as you suggested
> >
> > Yes, I've noticed. Not overwriting has its own merits for sure.
> >
> >> for 2 reasons; 1) address is const
> >> in the struct, so would have had to change that. 2) there is a uffd path that
> >> can be taken after the vmf->address fixup would have occurred and the path
> >> consumes that member, so it would have had to be un-fixed-up making it more
> >> messy than the way I opted for.
> >>
> >> Thanks for the quick review as always!
>
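To make the conclusion easier to follow, here is the check in question once more, annotated with the failure case described above (the two code lines are as in the quoted patch; the comment is only a summary of this discussion, not part of the submission):

        /*
         * For anon memory with mmap_lock held for read, pte_offset_map() can
         * only fail here if a parallel fault has just changed the PMD from
         * none to a leaf entry; changing a non-leaf PMD to a leaf would be a
         * bug (see the comments in the "else" branch of handle_pte_fault()).
         * Returning -EAGAIN lets do_anonymous_page() back out without making
         * any change, so the access simply refaults and is retried.
         */
        pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
        if (!pte)
                return -EAGAIN;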

2023-07-17 20:56:37

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

On Mon, Jul 17, 2023 at 1:31 PM Yu Zhao <[email protected]> wrote:
>
> On Mon, Jul 17, 2023 at 7:36 AM Ryan Roberts <[email protected]> wrote:
> >
> > >>>> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> > >>>> +{
> > >>>> + int i;
> > >>>> + gfp_t gfp;
> > >>>> + pte_t *pte;
> > >>>> + unsigned long addr;
> > >>>> + struct vm_area_struct *vma = vmf->vma;
> > >>>> + int prefer = anon_folio_order(vma);
> > >>>> + int orders[] = {
> > >>>> + prefer,
> > >>>> + prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
> > >>>> + 0,
> > >>>> + };
> > >>>> +
> > >>>> + *folio = NULL;
> > >>>> +
> > >>>> + if (vmf_orig_pte_uffd_wp(vmf))
> > >>>> + goto fallback;
> > >>>> +
> > >>>> + for (i = 0; orders[i]; i++) {
> > >>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> > >>>> + if (addr >= vma->vm_start &&
> > >>>> + addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
> > >>>> + break;
> > >>>> + }
> > >>>> +
> > >>>> + if (!orders[i])
> > >>>> + goto fallback;
> > >>>> +
> > >>>> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> > >>>> + if (!pte)
> > >>>> + return -EAGAIN;
> > >>>
> > >>> It would be a bug if this happens. So probably -EINVAL?
> > >>
> > >> Not sure what you mean? Hugh Dickins' series that went into v6.5-rc1 makes it
> > >> possible for pte_offset_map() to fail (if I understood correctly) and we have to
> > >> handle this. The intent is that we will return from the fault without making any
> > >> change, then we will refault and try again.
> > >
> > > Thanks for checking that -- it's very relevant. One detail is that
> > > that series doesn't affect anon. IOW, collapsing PTEs into a PMD can't
> > > happen while we are holding mmap_lock for read here, and therefore,
> > > the race that could cause pte_offset_map() on shmem/file PTEs to fail
> > > doesn't apply here.
> >
> > But Hugh's patches have changed do_anonymous_page() to handle failure from
> > pte_offset_map_lock(). So I was just following that pattern. If this really
> > can't happen, then I'd rather WARN/BUG on it, and simplify alloc_anon_folio()'s
> > prototype to just return a `struct folio *` (and if it's null that means ENOMEM).
> >
> > Hugh, perhaps you can comment?
> >
> > As an aside, it was my understanding from LWN, that we are now using a per-VMA
> > lock so presumably we don't hold mmap_lock for read here? Or perhaps that only
> > applies to file-backed memory?
>
> For anon under mmap_lock for read:
> 1. pte_offset_map[_lock]() fails when a parallel PF changes PMD from
> none to leaf.
> 2. changing PMD from non-leaf to leaf is a bug. See the comments in
> the "else" branch in handle_pte_fault().
>
> So for do_anonymous_page(), there is only one case
> pte_offset_map[_lock]() can fail.

===
> For the code above, this case was
> ruled out by vmf_orig_pte_uffd_wp().

Actually I was wrong about this part.
===

> Checking the return value from pte_offset_map[_lock]() is a good
> practice. What I'm saying is that -EAGAIN would mislead people to
> think, in our case, !pte is legitimate, and hence the suggestion of
> replacing it with -EINVAL.

Yes, -EAGAIN is suitable.

> No BUG_ON() please. As I've previously mentioned, it's against
> Documentation/process/coding-style.rst.
>
> > > +Hugh Dickins for further consultation if you need it.
> > >
> > >>>> +
> > >>>> + for (; orders[i]; i++) {
> > >>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> > >>>> + vmf->pte = pte + pte_index(addr);
> > >>>> + if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
> > >>>> + break;
> > >>>> + }
> > >>>> +
> > >>>> + vmf->pte = NULL;
> > >>>> + pte_unmap(pte);
> > >>>> +
> > >>>> + gfp = vma_thp_gfp_mask(vma);
> > >>>> +
> > >>>> + for (; orders[i]; i++) {
> > >>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> > >>>> + *folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
> > >>>> + if (*folio) {
> > >>>> + clear_huge_page(&(*folio)->page, addr, 1 << orders[i]);
> > >>>> + return 0;
> > >>>> + }
> > >>>> + }
> > >>>> +
> > >>>> +fallback:
> > >>>> + *folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> > >>>> + return *folio ? 0 : -ENOMEM;
> > >>>> +}
> > >>>> +#else
> > >>>> +static inline int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> > >>>
> > >>> Drop "inline" (it doesn't do anything in .c).
> > >>
> > >> There are 38 instances of inline in memory.c alone, so it looks like a well-used
> > >> convention, even if the compiler may choose to ignore it. Perhaps you can educate
> > >> me; what's the benefit of dropping it?
> > >
> > > I'll let Willy and Andrew educate both of us :)
> > >
> > > +Matthew Wilcox +Andrew Morton please. Thank you.
> > >
> > >>> The rest looks good to me.
> > >>
> > >> Great - just in case it wasn't obvious, I decided not to overwrite vmf->address
> > >> with the aligned version, as you suggested
> > >
> > > Yes, I've noticed. Not overwriting has its own merits for sure.
> > >
> > >> for 2 reasons; 1) address is const
> > >> in the struct, so would have had to change that. 2) there is a uffd path that
> > >> can be taken after the vmf->address fixup would have occurred and the path
> > >> consumes that member, so it would have had to be un-fixed-up making it more
> > >> messy than the way I opted for.
> > >>
> > >> Thanks for the quick review as always!

2023-07-17 23:53:43

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

On Mon, 17 Jul 2023, Ryan Roberts wrote:

> >>>> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> >>>> +{
> >>>> + int i;
> >>>> + gfp_t gfp;
> >>>> + pte_t *pte;
> >>>> + unsigned long addr;
> >>>> + struct vm_area_struct *vma = vmf->vma;
> >>>> + int prefer = anon_folio_order(vma);
> >>>> + int orders[] = {
> >>>> + prefer,
> >>>> + prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
> >>>> + 0,
> >>>> + };
> >>>> +
> >>>> + *folio = NULL;
> >>>> +
> >>>> + if (vmf_orig_pte_uffd_wp(vmf))
> >>>> + goto fallback;
> >>>> +
> >>>> + for (i = 0; orders[i]; i++) {
> >>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> >>>> + if (addr >= vma->vm_start &&
> >>>> + addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
> >>>> + break;
> >>>> + }
> >>>> +
> >>>> + if (!orders[i])
> >>>> + goto fallback;
> >>>> +
> >>>> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> >>>> + if (!pte)
> >>>> + return -EAGAIN;
> >>>
> >>> It would be a bug if this happens. So probably -EINVAL?
> >>
> >> Not sure what you mean? Hugh Dickins' series that went into v6.5-rc1 makes it
> >> possible for pte_offset_map() to fail (if I understood correctly) and we have to
> >> handle this. The intent is that we will return from the fault without making any
> >> change, then we will refault and try again.
> >
> > Thanks for checking that -- it's very relevant. One detail is that
> > that series doesn't affect anon. IOW, collapsing PTEs into a PMD can't
> > happen while we are holding mmap_lock for read here, and therefore,
> > the race that could cause pte_offset_map() on shmem/file PTEs to fail
> > doesn't apply here.
>
> But Hugh's patches have changed do_anonymous_page() to handle failure from
> pte_offset_map_lock(). So I was just following that pattern. If this really
> can't happen, then I'd rather WARN/BUG on it, and simplify alloc_anon_folio()'s
> prototype to just return a `struct folio *` (and if it's null that means ENOMEM).
>
> Hugh, perhaps you can comment?

I agree with your use of -EAGAIN there: I find it better to allow for the
possibility, than to go to great effort persuading that it's impossible;
especially because what's possible tomorrow may differ from today.

And notice that, before my changes, there used to be a pmd_trans_unstable()
check above, implying that it is possible for it to fail (for more reasons
than corruption causing pmd_bad()) - one scenario would be that the
pte_alloc() above succeeded *because* someone else had managed to insert
a huge pmd there already (maybe we have MMF_DISABLE_THP but they did not).

But I see from later mail that Yu Zhao now agrees with your -EAGAIN too,
so we are all on the same folio.

Hugh

p.s. while giving opinions, I'm one of those against using "THP" for
large but not pmd-mappable folios; and was glad to see Matthew arguing
the same way when considering THP_SWPOUT in another thread today.

2023-07-18 10:53:24

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

On 18/07/2023 00:37, Hugh Dickins wrote:
> On Mon, 17 Jul 2023, Ryan Roberts wrote:
>
>>>>>> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
>>>>>> +{
>>>>>> + int i;
>>>>>> + gfp_t gfp;
>>>>>> + pte_t *pte;
>>>>>> + unsigned long addr;
>>>>>> + struct vm_area_struct *vma = vmf->vma;
>>>>>> + int prefer = anon_folio_order(vma);
>>>>>> + int orders[] = {
>>>>>> + prefer,
>>>>>> + prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
>>>>>> + 0,
>>>>>> + };
>>>>>> +
>>>>>> + *folio = NULL;
>>>>>> +
>>>>>> + if (vmf_orig_pte_uffd_wp(vmf))
>>>>>> + goto fallback;
>>>>>> +
>>>>>> + for (i = 0; orders[i]; i++) {
>>>>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
>>>>>> + if (addr >= vma->vm_start &&
>>>>>> + addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
>>>>>> + break;
>>>>>> + }
>>>>>> +
>>>>>> + if (!orders[i])
>>>>>> + goto fallback;
>>>>>> +
>>>>>> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>>>>> + if (!pte)
>>>>>> + return -EAGAIN;
>>>>>
>>>>> It would be a bug if this happens. So probably -EINVAL?
>>>>
>>>> Not sure what you mean? Hugh Dickins' series that went into v6.5-rc1 makes it
>>>> possible for pte_offset_map() to fail (if I understood correctly) and we have to
>>>> handle this. The intent is that we will return from the fault without making any
>>>> change, then we will refault and try again.
>>>
>>> Thanks for checking that -- it's very relevant. One detail is that
>>> that series doesn't affect anon. IOW, collapsing PTEs into a PMD can't
>>> happen while we are holding mmap_lock for read here, and therefore,
>>> the race that could cause pte_offset_map() on shmem/file PTEs to fail
>>> doesn't apply here.
>>
>> But Hugh's patches have changed do_anonymous_page() to handle failure from
>> pte_offset_map_lock(). So I was just following that pattern. If this really
>> can't happen, then I'd rather WARN/BUG on it, and simplify alloc_anon_folio()'s
>> prototype to just return a `struct folio *` (and if it's null that means ENOMEM).
>>
>> Hugh, perhaps you can comment?
>
> I agree with your use of -EAGAIN there: I find it better to allow for the
> possibility, than to go to great effort persuading that it's impossible;
> especially because what's possible tomorrow may differ from today.
>
> And notice that, before my changes, there used to be a pmd_trans_unstable()
> check above, implying that it is possible for it to fail (for more reasons
> than corruption causing pmd_bad()) - one scenario would be that the
> pte_alloc() above succeeded *because* someone else had managed to insert
> a huge pmd there already (maybe we have MMF_DISABLE_THP but they did not).
>
> But I see from later mail that Yu Zhao now agrees with your -EAGAIN too,
> so we are all on the same folio.

Thanks for the explanation. I think we are all now agreed that the error case
needs handling and -EAGAIN is the correct code.

>
> Hugh
>
> p.s. while giving opinions, I'm one of those against using "THP" for
> large but not pmd-mappable folios; and was glad to see Matthew arguing
> the same way when considering THP_SWPOUT in another thread today.

Honestly, I don't have an opinion either way on this (probably because I don't
have the full context and history of THP like many of you do). So given there is
a fair bit of opposition to FLEXIBLE_THP, I'll change it back to
LARGE_ANON_FOLIO (and move it out of the THP sub-menu) in the next revision.


2023-07-21 11:13:06

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

On 14/07/2023 17:17, Ryan Roberts wrote:
> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> allocated in large folios of a determined order. All pages of the large
> folio are pte-mapped during the same page fault, significantly reducing
> the number of page faults. The number of per-page operations (e.g. ref
> counting, rmap management, lru list management) is also significantly
> reduced since those ops now become per-folio.
>
> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
> defaults to disabled for now; the long-term aim is for this to default to
> enabled, but there are some risks around internal fragmentation that
> need to be better understood first.
>
> When enabled, the folio order is determined as such: For a vma, process
> or system that has explicitly disabled THP, we continue to allocate
> order-0. THP is most likely disabled to avoid any possible internal
> fragmentation so we honour that request.
>
> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
> that have not explicitly opted in to use transparent hugepages (e.g.
> where thp=madvise and the vma does not have MADV_HUGEPAGE),
> arch_wants_pte_order() is limited by the new cmdline parameter,
> `flexthp_unhinted_max`. This allows for a performance boost without
> requiring any explicit opt-in from the workload while allowing the
> sysadmin to tune between performance and internal fragmentation.
>
> arch_wants_pte_order() can be overridden by the architecture if desired.
> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
> set of ptes map physically contiguous, naturally aligned memory, so this
> mechanism allows the architecture to optimize as required.
>
> If the preferred order can't be used (e.g. because the folio would
> breach the bounds of the vma, or because ptes in the region are already
> mapped) then we fall back to a suitable lower order; first
> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>
> Signed-off-by: Ryan Roberts <[email protected]>

...

> +
> /*
> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -4057,11 +4199,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> {
> + int i = 0;
> + int nr_pages = 1;
> bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
> struct vm_area_struct *vma = vmf->vma;
> struct folio *folio;
> vm_fault_t ret = 0;
> pte_t entry;
> + unsigned long addr;
>
> /* File mapping without ->vm_ops ? */
> if (vma->vm_flags & VM_SHARED)
> @@ -4101,10 +4246,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> /* Allocate our own private page. */
> if (unlikely(anon_vma_prepare(vma)))
> goto oom;
> - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> + ret = alloc_anon_folio(vmf, &folio);
> + if (unlikely(ret == -EAGAIN))
> + return 0;
> if (!folio)
> goto oom;
>
> + nr_pages = folio_nr_pages(folio);
> + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> +
> if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
> goto oom_free_page;
> folio_throttle_swaprate(folio, GFP_KERNEL);
> @@ -4116,17 +4266,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> */
> __folio_mark_uptodate(folio);
>
> - entry = mk_pte(&folio->page, vma->vm_page_prot);
> - entry = pte_sw_mkyoung(entry);
> - if (vma->vm_flags & VM_WRITE)
> - entry = pte_mkwrite(pte_mkdirty(entry));
> -
> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> - &vmf->ptl);
> + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
> if (!vmf->pte)
> goto release;
> - if (vmf_pte_changed(vmf)) {
> - update_mmu_tlb(vma, vmf->address, vmf->pte);
> + if (vmf_pte_range_changed(vmf, nr_pages)) {
> + for (i = 0; i < nr_pages; i++)
> + update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
> goto release;
> }
>
> @@ -4141,16 +4286,24 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> return handle_userfault(vmf, VM_UFFD_MISSING);
> }
>
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - folio_add_new_anon_rmap(folio, vma, vmf->address);
> + folio_ref_add(folio, nr_pages - 1);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> + folio_add_new_anon_rmap(folio, vma, addr);
> folio_add_lru_vma(folio, vma);
> +
> + for (i = 0; i < nr_pages; i++) {
> + entry = mk_pte(folio_page(folio, i), vma->vm_page_prot);
> + entry = pte_sw_mkyoung(entry);
> + if (vma->vm_flags & VM_WRITE)
> + entry = pte_mkwrite(pte_mkdirty(entry));
> setpte:
> - if (uffd_wp)
> - entry = pte_mkuffd_wp(entry);
> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> + if (uffd_wp)
> + entry = pte_mkuffd_wp(entry);
> + set_pte_at(vma->vm_mm, addr + PAGE_SIZE * i, vmf->pte + i, entry);

I've just spotted a bug here for the case where we arrive via goto setpte; in
this case, addr is not initialized. This crept in during the refactoring and I
have no idea how this could possibly have not fallen over in a heap when
executed. Sorry about that. I'm fixing it in v4.

>
> - /* No need to invalidate - it was non-present before */
> - update_mmu_cache(vma, vmf->address, vmf->pte);
> + /* No need to invalidate - it was non-present before */
> + update_mmu_cache(vma, addr + PAGE_SIZE * i, vmf->pte + i);
> + }
> unlock:
> if (vmf->pte)
> pte_unmap_unlock(vmf->pte, vmf->ptl);
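One possible shape for the fix (a hypothetical sketch, not the actual v4 change): initialise addr up front so that the goto setpte path, where i stays 0 and nr_pages stays 1, writes the single pte at the original faulting address, and rebase addr only once a large folio has actually been allocated.

        int i = 0;
        int nr_pages = 1;
        /* Hypothetical fix: addr is valid even when setpte is reached via goto. */
        unsigned long addr = vmf->address;
        ...
        ret = alloc_anon_folio(vmf, &folio);
        if (unlikely(ret == -EAGAIN))
                return 0;
        if (!folio)
                goto oom;

        nr_pages = folio_nr_pages(folio);
        /* Rebase to the folio-aligned base only after a folio exists. */
        addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);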


2023-07-24 12:14:39

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v3 0/4] variable-order, large folios for anonymous memory

On 14/07/2023 17:04, Ryan Roberts wrote:
> Hi All,
>
> This is v3 of a series to implement variable order, large folios for anonymous
> memory. (currently called "FLEXIBLE_THP") The objective of this is to improve
> performance by allocating larger chunks of memory during anonymous page faults.
> See [1] and [2] for background.

A question for anyone that can help; I'm preparing v4 and as part of that am
running the mm selftests, now that I've fixed them up to run reliably for
arm64. This is showing 2 regressions vs the v6.5-rc3 baseline:

1) khugepaged test fails here:
# Run test: collapse_max_ptes_none (khugepaged:anon)
# Maybe collapse with max_ptes_none exceeded.... Fail
# Unexpected huge page

2) split_huge_page_test fails with:
# Still AnonHugePages not split

I *think* (but haven't yet verified) that (1) is due to khugepaged ignoring
non-order-0 folios when looking for candidates to collapse. Now that we have
large anon folios, the memory allocated by the test is in large folios and
therefore does not get collapsed. We understand this issue, and I believe
DavidH's new scheme for determining exclusive vs shared should give us the tools
to solve this.

But (2) is weird. If I run this test on its own immediately after booting, it
passes. If I then run the khugepaged test, then re-run this test, it fails.

The test is allocating 4 hugepages, then requesting they are split using the
debugfs interface. Then the test looks at /proc/self/smaps to check that
AnonHugePages is back to 0.

In both the passing and failing cases, the kernel thinks that it has
successfully split the pages; the debug logs in split_huge_pages_pid() confirm
this. In the failing case, I wonder if somehow khugepaged could be immediately
re-collapsing the pages before user space can observe the split? Perhaps the
failed khugepaged test has left khugepaged in an "awake" state and it
immediately pounces?

Thanks,
Ryan
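For reference, the smaps check the test performs amounts to roughly the following sketch (a hypothetical helper, not the selftest's actual code): sum the AnonHugePages fields in /proc/self/smaps and expect the total to be back to 0 kB after the split.

#include <stdio.h>

/* Hypothetical helper: total AnonHugePages reported in /proc/self/smaps, in kB. */
static long anon_huge_kb(void)
{
        char line[256];
        long total = 0, kb;
        FILE *f = fopen("/proc/self/smaps", "r");

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
                        total += kb;
        }
        fclose(f);
        return total;
}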



2023-07-24 15:39:46

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH v3 0/4] variable-order, large folios for anonymous memory

On 24 Jul 2023, at 7:59, Ryan Roberts wrote:

> On 14/07/2023 17:04, Ryan Roberts wrote:
>> Hi All,
>>
>> This is v3 of a series to implement variable order, large folios for anonymous
>> memory. (currently called "FLEXIBLE_THP") The objective of this is to improve
>> performance by allocating larger chunks of memory during anonymous page faults.
>> See [1] and [2] for background.
>
> A question for anyone that can help; I'm preparing v4 and as part of that am
> running the mm selftests, now that I've fixed them up to run reliably for
> arm64. This is showing 2 regressions vs the v6.5-rc3 baseline:
>
> 1) khugepaged test fails here:
> # Run test: collapse_max_ptes_none (khugepaged:anon)
> # Maybe collapse with max_ptes_none exceeded.... Fail
> # Unexpected huge page
>
> 2) split_huge_page_test fails with:
> # Still AnonHugePages not split
>
> I *think* (but haven't yet verified) that (1) is due to khugepaged ignoring
> non-order-0 folios when looking for candidates to collapse. Now that we have
> large anon folios, the memory allocated by the test is in large folios and
> therefore does not get collapsed. We understand this issue, and I believe
> DavidH's new scheme for determining exclusive vs shared should give us the tools
> to solve this.
>
> But (2) is weird. If I run this test on its own immediately after booting, it
> passes. If I then run the khugepaged test, then re-run this test, it fails.
>
> The test is allocating 4 hugepages, then requesting they are split using the
> debugfs interface. Then the test looks at /proc/self/smaps to check that
> AnonHugePages is back to 0.
>
> In both the passing and failing cases, the kernel thinks that it has
> successfully split the pages; the debug logs in split_huge_pages_pid() confirm
> this. In the failing case, I wonder if somehow khugepaged could be immediately
> re-collapsing the pages before user space can observe the split? Perhaps the
> failed khugepaged test has left khugepaged in an "awake" state and it
> immediately pounces?

This is more likely to be a stats issue. Have you checked smaps to see if
AnonHugePages is 0 KB by placing a getchar() before the exit(EXIT_FAILURE)?
Since split_huge_page_test checks those stats to make sure the split indeed
happened.

--
Best Regards,
Yan, Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature
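The suggested pause might look something like this sketch (hypothetical code; the real failure path lives in split_huge_page_test.c): block on stdin before the failing exit so /proc/<pid>/smaps can be inspected by hand.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical debugging aid: pause before exiting so smaps can be checked. */
static void fail_after_inspection(long anon_huge_kb)
{
        fprintf(stderr, "pid %d: AnonHugePages still %ld kB; check smaps, then press enter\n",
                getpid(), anon_huge_kb);
        getchar();
        exit(EXIT_FAILURE);
}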

2023-07-24 15:57:47

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v3 0/4] variable-order, large folios for anonymous memory

On 24/07/2023 15:58, Zi Yan wrote:
> On 24 Jul 2023, at 7:59, Ryan Roberts wrote:
>
>> On 14/07/2023 17:04, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> This is v3 of a series to implement variable order, large folios for anonymous
>>> memory. (currently called "FLEXIBLE_THP") The objective of this is to improve
>>> performance by allocating larger chunks of memory during anonymous page faults.
>>> See [1] and [2] for background.
>>
>> A question for anyone that can help; I'm preparing v4 and as part of that am
>> running the mm selftests, now that I've fixed them up to run reliably for
>> arm64. This is showing 2 regressions vs the v6.5-rc3 baseline:
>>
>> 1) khugepaged test fails here:
>> # Run test: collapse_max_ptes_none (khugepaged:anon)
>> # Maybe collapse with max_ptes_none exceeded.... Fail
>> # Unexpected huge page
>>
>> 2) split_huge_page_test fails with:
>> # Still AnonHugePages not split
>>
>> I *think* (but haven't yet verified) that (1) is due to khugepaged ignoring
>> non-order-0 folios when looking for candidates to collapse. Now that we have
>> large anon folios, the memory allocated by the test is in large folios and
>> therefore does not get collapsed. We understand this issue, and I believe
>> DavidH's new scheme for determining exclusive vs shared should give us the tools
>> to solve this.
>>
>> But (2) is weird. If I run this test on its own immediately after booting, it
>> passes. If I then run the khugepaged test, then re-run this test, it fails.
>>
>> The test is allocating 4 hugepages, then requesting they are split using the
>> debugfs interface. Then the test looks at /proc/self/smaps to check that
>> AnonHugePages is back to 0.
>>
>> In both the passing and failing cases, the kernel thinks that it has
>> successfully split the pages; the debug logs in split_huge_pages_pid() confirm
>> this. In the failing case, I wonder if somehow khugepaged could be immediately
>> re-collapsing the pages before user space can observe the split? Perhaps the
>> failed khugepaged test has left khugepaged in an "awake" state and it
>> immediately pounces?
>
> This is more likely to be a stats issue. Have you checked smaps to see if
> AnonHugePages is 0 KB by placing a getchar() before the exit(EXIT_FAILURE)?

Yes - it's still 8192K. But looking at the code, that value is determined from the
fact that there is a PMD block mapping present. And the split definitely
succeeded so something must have re-collapsed it.

Looking into the khugepaged test suite, it saves the thp and khugepaged settings
out of sysfs, modifies them for the tests, then restores them when finished. But
it doesn't restore them if exiting early (due to failure). It changes the settings
for alloc_sleep_millisecs and scan_sleep_millisecs from a large number of
seconds to 10 ms, for example. So I'm pretty sure this is the culprit.


> Since split_huge_page_test checks those stats to make sure the split indeed
> happened.
>
> --
> Best Regards,
> Yan, Zi
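A minimal sketch of how that gap could be closed (hypothetical; save_settings()/restore_settings() stand in for whatever helpers the khugepaged selftest actually uses): register the restore step with atexit() so every exit path, including early test failures, puts the sysfs settings back.

#include <stdlib.h>

/* Stand-ins for the selftest's save/restore of the thp/khugepaged sysfs knobs. */
static void save_settings(void) { /* read and remember the sysfs values */ }
static void restore_settings(void) { /* write the remembered values back */ }

int main(void)
{
        save_settings();
        /* Restored on every exit path, not just on success. */
        if (atexit(restore_settings))
                return EXIT_FAILURE;

        /*
         * ... run the tests; an early exit(EXIT_FAILURE) no longer leaves
         * alloc_sleep_millisecs/scan_sleep_millisecs at the short test values ...
         */
        return EXIT_SUCCESS;
}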


2023-07-26 09:18:43

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v3 0/4] variable-order, large folios for anonymous memory

On 26/07/2023 08:36, Itaru Kitayama wrote:
> Ryan,
> Do you have kselftest code for this new feature?
> I’d like to test it out on FVP when I have the chance.

A very timely question! I have modified the mm/cow tests to additionally test
large anon folios. That patch is part of v4, which I am about to start writing
the cover letter for. So look out for that in around an hour.