2023-06-26 17:34:40

by Ryan Roberts

Subject: [PATCH v1 00/10] variable-order, large folios for anonymous memory

Hi All,

Following on from the previous RFCv2 [1], this series implements variable order,
large folios for anonymous memory. The objective of this is to improve
performance by allocating larger chunks of memory during anonymous page faults:

- Since SW (the kernel) is dealing with larger chunks of memory than base
pages, there are efficiency savings to be had; fewer page faults, batched PTE
and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
overhead. This should benefit all architectures.
- Since we are now mapping physically contiguous chunks of memory, we can take
advantage of HW TLB compression techniques. A reduction in TLB pressure
speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
TLB entries; "the contiguous bit" (architectural) and HPA (uarch).

This patch set deals with the SW side of things only and, based on feedback from
the RFC, aims to be the most minimal initial change, upon which future
incremental changes can be added. For this reason, the new behaviour is hidden
behind a new Kconfig switch, CONFIG_LARGE_ANON_FOLIO, which is disabled by
default. Although the code has been refactored to parameterize the desired order
of the allocation, when the feature is disabled (by forcing the order to be
always 0) my performance tests measure no regression. So I'm hoping this will be
a suitable mechanism to allow incremental submissions to the kernel without
affecting the rest of the world.
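
As a rough sketch of the shape of this approach (illustrative only, not code
lifted from the series; the helper name and gfp details are assumptions), the
fault path can try the preferred order first and fall back towards order-0, so
that forcing the order to 0 reproduces today's single-page behaviour:

static struct folio *alloc_anon_folio_sketch(struct vm_area_struct *vma,
					     unsigned long addr, int order)
{
	struct folio *folio;

	/* Try progressively smaller orders until an allocation succeeds. */
	for (; order > 0; order--) {
		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_COMP,
					order, vma, addr, true);
		if (folio)
			return folio;	/* caller clears the large folio */
	}

	/* Order-0 fallback: identical to the pre-series behaviour. */
	return vma_alloc_zeroed_movable_folio(vma, addr);
}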

The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
[2], which is a hard dependency. I'm not sure of Matthew's exact plans for
getting that series into the kernel, but I'm hoping we can start the review
process on this patch set independently. I have a branch at [3].

I've posted a separate series concerning the HW part (contpte mapping) for arm64
at [4].


Performance
-----------

The results below show 2 benchmarks: kernel compilation and Speedometer 2.0 (a
JavaScript benchmark running in Chromium). Both cases are run on an Ampere
Altra with 1 NUMA node enabled, Ubuntu 22.04 and an XFS filesystem. Each
benchmark is repeated 15 times over 5 reboots and averaged.

All improvements are relative to baseline-4k. 'anonfolio-basic' is this series.
'anonfolio' is the full patch set similar to the RFC with the additional changes
to the extra 3 fault paths. The rest of the configs are described at [4].

Kernel Compilation (smaller is better):

| kernel | real-time | kern-time | user-time |
|:----------------|------------:|------------:|------------:|
| baseline-4k | 0.0% | 0.0% | 0.0% |
| anonfolio-basic | -5.3% | -42.9% | -0.6% |
| anonfolio | -5.4% | -46.0% | -0.3% |
| contpte | -6.8% | -45.7% | -2.1% |
| exefolio | -8.4% | -46.4% | -3.7% |
| baseline-16k | -8.7% | -49.2% | -3.7% |
| baseline-64k | -10.5% | -66.0% | -3.5% |

Speedometer 2.0 (bigger is better):

| kernel | runs_per_min |
|:----------------|---------------:|
| baseline-4k | 0.0% |
| anonfolio-basic | 0.7% |
| anonfolio | 1.2% |
| contpte | 3.1% |
| exefolio | 4.2% |
| baseline-16k | 5.3% |


Changes since RFCv2
-------------------

- Simplified series to bare minimum (on David Hildenbrand's advice)
- Removed changes to 3 fault paths:
  - write fault on zero page: wp_page_copy()
  - write fault on non-exclusive CoW page: wp_page_copy()
  - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
- Only 1 fault path change remains:
  - write fault on unallocated address: do_anonymous_page()
- Removed support patches that are no longer needed
- Added Kconfig CONFIG_LARGE_ANON_FOLIO and friends
  - Whole feature defaults to off
  - Arch opts-in to allowing feature and provides max allocation order


Future Work
-----------

Once this series is in, there are some more incremental changes I plan to follow
up with:

- Add the other 3 fault path changes back in
- Properly support pte-mapped folios for:
  - numa balancing (do_numa_page())
  - madvise() (fix assumptions about exclusivity for large folios)
  - compaction (although I think this is already a problem for large folios
    in the file cache, so perhaps someone is working on it?)


[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anonfolio-lkml_v1
[4] https://lore.kernel.org/linux-arm-kernel/[email protected]/

Thanks,
Ryan


Ryan Roberts (10):
mm: Expose clear_huge_page() unconditionally
mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
mm: Introduce try_vma_alloc_movable_folio()
mm: Implement folio_add_new_anon_rmap_range()
mm: Implement folio_remove_rmap_range()
mm: Allow deferred splitting of arbitrary large anon folios
mm: Batch-zap large anonymous folio PTE mappings
mm: Kconfig hooks to determine max anon folio allocation order
arm64: mm: Declare support for large anonymous folios
mm: Allocate large folios for anonymous memory

 arch/alpha/include/asm/page.h   |   5 +-
 arch/arm64/Kconfig              |  13 ++
 arch/arm64/include/asm/page.h   |   3 +-
 arch/arm64/mm/fault.c           |   7 +-
 arch/ia64/include/asm/page.h    |   5 +-
 arch/m68k/include/asm/page_no.h |   7 +-
 arch/s390/include/asm/page.h    |   5 +-
 arch/x86/include/asm/page.h     |   5 +-
 include/linux/highmem.h         |  23 ++-
 include/linux/mm.h              |   3 +-
 include/linux/rmap.h            |   4 +
 mm/Kconfig                      |  39 ++++
 mm/memory.c                     | 324 ++++++++++++++++++++++++++++++--
 mm/rmap.c                       | 107 ++++++++++-
 14 files changed, 506 insertions(+), 44 deletions(-)

--
2.25.1



2023-06-26 17:35:13

by Ryan Roberts

Subject: [PATCH v1 05/10] mm: Implement folio_remove_rmap_range()

Like page_remove_rmap() but batch-removes the rmap for a range of pages
belonging to a folio, for efficiency savings. All pages are accounted as
small pages.

Signed-off-by: Ryan Roberts <[email protected]>
---
include/linux/rmap.h | 2 ++
mm/rmap.c | 62 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 64 insertions(+)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 15433a3d0cbf..50f50e4cb0f8 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -204,6 +204,8 @@ void folio_add_file_rmap_range(struct folio *, struct page *, unsigned int nr,
 		struct vm_area_struct *, bool compound);
 void page_remove_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
+void folio_remove_rmap_range(struct folio *folio, struct page *page,
+		int nr, struct vm_area_struct *vma);
 
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address, rmap_t flags);
diff --git a/mm/rmap.c b/mm/rmap.c
index 4050bcea7ae7..ac1d93d43f2b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1434,6 +1434,68 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
 	folio_add_file_rmap_range(folio, page, nr_pages, vma, compound);
 }
 
+/*
+ * folio_remove_rmap_range - take down pte mappings from a range of pages
+ * belonging to a folio. All pages are accounted as small pages.
+ * @folio: folio that all pages belong to
+ * @page: first page in range to remove mapping from
+ * @nr: number of pages in range to remove mapping from
+ * @vma: the vm area from which the mapping is removed
+ *
+ * The caller needs to hold the pte lock.
+ */
+void folio_remove_rmap_range(struct folio *folio, struct page *page,
+		int nr, struct vm_area_struct *vma)
+{
+	atomic_t *mapped = &folio->_nr_pages_mapped;
+	int nr_unmapped = 0;
+	int nr_mapped;
+	bool last;
+	enum node_stat_item idx;
+
+	VM_BUG_ON_FOLIO(folio_test_hugetlb(folio), folio);
+
+	if (!folio_test_large(folio)) {
+		/* Is this the page's last map to be removed? */
+		last = atomic_add_negative(-1, &page->_mapcount);
+		nr_unmapped = last;
+	} else {
+		for (; nr != 0; nr--, page++) {
+			/* Is this the page's last map to be removed? */
+			last = atomic_add_negative(-1, &page->_mapcount);
+			if (last) {
+				/* Page still mapped if folio mapped entirely */
+				nr_mapped = atomic_dec_return_relaxed(mapped);
+				if (nr_mapped < COMPOUND_MAPPED)
+					nr_unmapped++;
+			}
+		}
+	}
+
+	if (nr_unmapped) {
+		idx = folio_test_anon(folio) ? NR_ANON_MAPPED : NR_FILE_MAPPED;
+		__lruvec_stat_mod_folio(folio, idx, -nr_unmapped);
+
+		/*
+		 * Queue anon THP for deferred split if we have just unmapped at
+		 * least 1 page, while at least 1 page remains mapped.
+		 */
+		if (folio_test_large(folio) && folio_test_anon(folio))
+			if (nr_mapped)
+				deferred_split_folio(folio);
+	}
+
+	/*
+	 * It would be tidy to reset folio_test_anon mapping when fully
+	 * unmapped, but that might overwrite a racing page_add_anon_rmap
+	 * which increments mapcount after us but sets mapping before us:
+	 * so leave the reset to free_pages_prepare, and remember that
+	 * it's only reliable while mapped.
+	 */
+
+	munlock_vma_folio(folio, vma, false);
+}
+
 /**
  * page_remove_rmap - take down pte mapping from a page
  * @page:	page to remove mapping from
--
2.25.1
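
To illustrate the intended benefit (a sketch only; the surrounding zap logic
below is an assumption and is not the actual code of patch 07/10, which does
the batch-zapping in this series), a caller that has just cleared nr
contiguous PTEs all mapping the same large folio can make one rmap call
instead of nr calls to page_remove_rmap():

static void zap_anon_folio_range_sketch(struct mmu_gather *tlb,
		struct vm_area_struct *vma, struct folio *folio,
		struct page *page, pte_t *pte, unsigned long addr, int nr)
{
	int i;

	for (i = 0; i < nr; i++, addr += PAGE_SIZE) {
		/* Clear each PTE and queue its TLB invalidation. */
		ptep_get_and_clear_full(vma->vm_mm, addr, pte + i,
					tlb->fullmm);
		tlb_remove_tlb_entry(tlb, pte + i, addr);
	}

	/* One batched rmap update for the whole range. */
	folio_remove_rmap_range(folio, page, nr, vma);
}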


2023-06-26 17:35:13

by Ryan Roberts

Subject: [PATCH v1 08/10] mm: Kconfig hooks to determine max anon folio allocation order

For variable-order anonymous folios, we need to determine the order that
we will allocate. From a SW perspective, the higher the order we
allocate, the less overhead we will have; fewer faults, fewer folios in
lists, etc. But of course there will also be more memory wastage as the
order increases.

From a HW perspective, there are memory block sizes that can be
beneficial to reducing TLB pressure. arm64, for example, has the ability
to map "contpte" sized chunks (64K for a 4K base page, 2M for 16K and
64K base pages) such that one of these chunks only uses a single TLB
entry.

So we let the architecture specify the order of the maximally beneficial
mapping unit when PTE-mapped. Furthermore, because in some cases, this
order may be quite big (and therefore potentially wasteful of memory),
allow the arch to specify 2 values: one is the max order for a mapping
that _would not_ use THP if all size and alignment constraints were met,
and the other is the max order for a mapping that _would_ use THP if all
those constraints were met.

Implement this with Kconfig by introducing some new options to allow the
architecture to declare that it supports large anonymous folios along
with these 2 preferred max order values. Then introduce a user-facing
option, LARGE_ANON_FOLIO, which defaults to disabled and can only be
enabled if the architecture has declared its support. When disabled, it
forces the max order values, LARGE_ANON_FOLIO_NOTHP_ORDER_MAX and
LARGE_ANON_FOLIO_THP_ORDER_MAX, to 0, meaning only a single page is ever
allocated.

Signed-off-by: Ryan Roberts <[email protected]>
---
mm/Kconfig | 39 +++++++++++++++++++++++++++++++++++++++
mm/memory.c | 8 ++++++++
2 files changed, 47 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 7672a22647b4..f4ba48c37b75 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1208,4 +1208,43 @@ config PER_VMA_LOCK
 
 source "mm/damon/Kconfig"
 
+config ARCH_SUPPORTS_LARGE_ANON_FOLIO
+	def_bool n
+	help
+	  An arch should select this symbol if it wants to allow LARGE_ANON_FOLIO
+	  to be enabled. It must also set the following integer values:
+	  - ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
+	  - ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
+
+config ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
+	int
+	help
+	  The maximum size of folio to allocate for an anonymous VMA PTE-mapping
+	  that does not have the MADV_HUGEPAGE hint set.
+
+config ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
+	int
+	help
+	  The maximum size of folio to allocate for an anonymous VMA PTE-mapping
+	  that has the MADV_HUGEPAGE hint set.
+
+config LARGE_ANON_FOLIO
+	bool "Allocate large folios for anonymous memory"
+	depends on ARCH_SUPPORTS_LARGE_ANON_FOLIO
+	default n
+	help
+	  Use large (bigger than order-0) folios to back anonymous memory where
+	  possible. This reduces the number of page faults, as well as other
+	  per-page overheads to improve performance for many workloads.
+
+config LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
+	int
+	default 0 if !LARGE_ANON_FOLIO
+	default ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
+
+config LARGE_ANON_FOLIO_THP_ORDER_MAX
+	int
+	default 0 if !LARGE_ANON_FOLIO
+	default ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
+
 endmenu
diff --git a/mm/memory.c b/mm/memory.c
index 9165ed1b9fc2..a8f7e2b28d7a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3153,6 +3153,14 @@ static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
 	return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
 }
 
+static inline int max_anon_folio_order(struct vm_area_struct *vma)
+{
+	if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
+		return CONFIG_LARGE_ANON_FOLIO_THP_ORDER_MAX;
+	else
+		return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
+}
+
 /*
  * Handle write page faults for pages that can be reused in the current vma
  *
--
2.25.1
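
For context, a sketch of how the value from max_anon_folio_order() might then
be consumed at fault time (patch 10/10 is not reproduced in this excerpt; the
clamping logic below is an assumption, not the series' code): the order would
typically be reduced until a naturally aligned block of that order fits within
the faulting VMA.

static int anon_folio_order_sketch(struct vm_area_struct *vma,
				   unsigned long addr)
{
	int order = max_anon_folio_order(vma);

	/* Shrink the order until the aligned block lies inside the VMA. */
	while (order > 0) {
		unsigned long size = PAGE_SIZE << order;
		unsigned long start = ALIGN_DOWN(addr, size);

		if (start >= vma->vm_start && start + size <= vma->vm_end)
			break;
		order--;
	}

	return order;
}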


2023-06-26 17:36:12

by Ryan Roberts

Subject: [PATCH v1 04/10] mm: Implement folio_add_new_anon_rmap_range()

Like folio_add_new_anon_rmap() but batch-rmaps a range of pages
belonging to a folio, for efficiency savings. All pages are accounted as
small pages.

Signed-off-by: Ryan Roberts <[email protected]>
---
include/linux/rmap.h | 2 ++
mm/rmap.c | 43 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 45 insertions(+)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a3825ce81102..15433a3d0cbf 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address);
 void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
 		unsigned long address);
+void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
+		int nr, struct vm_area_struct *vma, unsigned long address);
 void page_add_file_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
 void folio_add_file_rmap_range(struct folio *, struct page *, unsigned int nr,
diff --git a/mm/rmap.c b/mm/rmap.c
index 1d8369549424..4050bcea7ae7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1305,6 +1305,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
 }
 
+/**
+ * folio_add_new_anon_rmap_range - Add mapping to a set of pages within a new
+ * anonymous potentially large folio.
+ * @folio: The folio containing the pages to be mapped
+ * @page: First page in the folio to be mapped
+ * @nr: Number of pages to be mapped
+ * @vma: the vm area in which the mapping is added
+ * @address: the user virtual address of the first page to be mapped
+ *
+ * Like folio_add_new_anon_rmap() but batch-maps a range of pages within a folio
+ * using non-THP accounting. Like folio_add_new_anon_rmap(), the inc-and-test is
+ * bypassed and the folio does not have to be locked. All pages in the folio are
+ * individually accounted.
+ *
+ * As the folio is new, it's assumed to be mapped exclusively by a single
+ * process.
+ */
+void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
+		int nr, struct vm_area_struct *vma, unsigned long address)
+{
+	int i;
+
+	VM_BUG_ON_VMA(address < vma->vm_start ||
+		      address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
+	__folio_set_swapbacked(folio);
+
+	if (folio_test_large(folio)) {
+		/* increment count (starts at 0) */
+		atomic_set(&folio->_nr_pages_mapped, nr);
+	}
+
+	for (i = 0; i < nr; i++) {
+		/* increment count (starts at -1) */
+		atomic_set(&page->_mapcount, 0);
+		__page_set_anon_rmap(folio, page, vma, address, 1);
+		page++;
+		address += PAGE_SIZE;
+	}
+
+	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
+
+}
+
 /**
  * folio_add_file_rmap_range - add pte mapping to page range of a folio
  * @folio:	The folio to add the mapping to
--
2.25.1


2023-06-27 02:59:08

by Yu Zhao

Subject: Re: [PATCH v1 08/10] mm: Kconfig hooks to determine max anon folio allocation order

On Mon, Jun 26, 2023 at 11:15 AM Ryan Roberts <[email protected]> wrote:
>
> For variable-order anonymous folios, we need to determine the order that
> we will allocate. From a SW perspective, the higher the order we
> allocate, the less overhead we will have; fewer faults, fewer folios in
> lists, etc. But of course there will also be more memory wastage as the
> order increases.
>
> From a HW perspective, there are memory block sizes that can be
> beneficial to reducing TLB pressure. arm64, for example, has the ability
> to map "contpte" sized chunks (64K for a 4K base page, 2M for 16K and
> 64K base pages) such that one of these chunks only uses a single TLB
> entry.
>
> So we let the architecture specify the order of the maximally beneficial
> mapping unit when PTE-mapped. Furthermore, because in some cases, this
> order may be quite big (and therefore potentially wasteful of memory),
> allow the arch to specify 2 values; One is the max order for a mapping
> that _would not_ use THP if all size and alignment constraints were met,
> and the other is the max order for a mapping that _would_ use THP if all
> those constraints were met.
>
> Implement this with Kconfig by introducing some new options to allow the
> architecture to declare that it supports large anonymous folios along
> with these 2 preferred max order values. Then introduce a user-facing
> option, LARGE_ANON_FOLIO, which defaults to disabled and can only be
> enabled if the architecture has declared its support. When disabled, it
> forces the max order values, LARGE_ANON_FOLIO_NOTHP_ORDER_MAX and
> LARGE_ANON_FOLIO_THP_ORDER_MAX to 0, meaning only a single page is ever
> allocated.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> mm/Kconfig | 39 +++++++++++++++++++++++++++++++++++++++
> mm/memory.c | 8 ++++++++
> 2 files changed, 47 insertions(+)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 7672a22647b4..f4ba48c37b75 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1208,4 +1208,43 @@ config PER_VMA_LOCK
>
> source "mm/damon/Kconfig"
>
> +config ARCH_SUPPORTS_LARGE_ANON_FOLIO
> + def_bool n
> + help
> + An arch should select this symbol if wants to allow LARGE_ANON_FOLIO
> + to be enabled. It must also set the following integer values:
> + - ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
> + - ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
> +
> +config ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
> + int
> + help
> + The maximum size of folio to allocate for an anonymous VMA PTE-mapping
> + that does not have the MADV_HUGEPAGE hint set.
> +
> +config ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
> + int
> + help
> + The maximum size of folio to allocate for an anonymous VMA PTE-mapping
> + that has the MADV_HUGEPAGE hint set.
> +
> +config LARGE_ANON_FOLIO
> + bool "Allocate large folios for anonymous memory"
> + depends on ARCH_SUPPORTS_LARGE_ANON_FOLIO
> + default n
> + help
> + Use large (bigger than order-0) folios to back anonymous memory where
> + possible. This reduces the number of page faults, as well as other
> + per-page overheads to improve performance for many workloads.
> +
> +config LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
> + int
> + default 0 if !LARGE_ANON_FOLIO
> + default ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
> +
> +config LARGE_ANON_FOLIO_THP_ORDER_MAX
> + int
> + default 0 if !LARGE_ANON_FOLIO
> + default ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
> +
> endmenu

I don't think an MVP should add this many Kconfigs. One Kconfig sounds
reasonable to me for now.

2023-06-27 03:15:42

by Yu Zhao

Subject: Re: [PATCH v1 05/10] mm: Implement folio_remove_rmap_range()

On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <[email protected]> wrote:
>
> Like page_remove_rmap() but batch-removes the rmap for a range of pages
> belonging to a folio, for effciency savings. All pages are accounted as
> small pages.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/rmap.h | 2 ++
> mm/rmap.c | 62 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 64 insertions(+)

Sorry for nagging: this can be included in a followup series.

2023-06-27 04:24:04

by Yu Zhao

Subject: Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory

On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <[email protected]> wrote:
>
> Hi All,
>
> Following on from the previous RFCv2 [1], this series implements variable order,
> large folios for anonymous memory. The objective of this is to improve
> performance by allocating larger chunks of memory during anonymous page faults:
>
> - Since SW (the kernel) is dealing with larger chunks of memory than base
> pages, there are efficiency savings to be had; fewer page faults, batched PTE
> and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
> overhead. This should benefit all architectures.
> - Since we are now mapping physically contiguous chunks of memory, we can take
> advantage of HW TLB compression techniques. A reduction in TLB pressure
> speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
> TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>
> This patch set deals with the SW side of things only and based on feedback from
> the RFC, aims to be the most minimal initial change, upon which future
> incremental changes can be added. For this reason, the new behaviour is hidden
> behind a new Kconfig switch, CONFIG_LARGE_ANON_FOLIO, which is disabled by
> default. Although the code has been refactored to parameterize the desired order
> of the allocation, when the feature is disabled (by forcing the order to be
> always 0) my performance tests measure no regression. So I'm hoping this will be
> a suitable mechanism to allow incremental submissions to the kernel without
> affecting the rest of the world.
>
> The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
> [2], which is a hard dependency. I'm not sure of Matthew's exact plans for
> getting that series into the kernel, but I'm hoping we can start the review
> process on this patch set independently. I have a branch at [3].
>
> I've posted a separate series concerning the HW part (contpte mapping) for arm64
> at [4].
>
>
> Performance
> -----------
>
> Below results show 2 benchmarks; kernel compilation and speedometer 2.0 (a
> javascript benchmark running in Chromium). Both cases are running on Ampere
> Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each benchmark
> is repeated 15 times over 5 reboots and averaged.
>
> All improvements are relative to baseline-4k. 'anonfolio-basic' is this series.
> 'anonfolio' is the full patch set similar to the RFC with the additional changes
> to the extra 3 fault paths. The rest of the configs are described at [4].
>
> Kernel Compilation (smaller is better):
>
> | kernel | real-time | kern-time | user-time |
> |:----------------|------------:|------------:|------------:|
> | baseline-4k | 0.0% | 0.0% | 0.0% |
> | anonfolio-basic | -5.3% | -42.9% | -0.6% |
> | anonfolio | -5.4% | -46.0% | -0.3% |
> | contpte | -6.8% | -45.7% | -2.1% |
> | exefolio | -8.4% | -46.4% | -3.7% |
> | baseline-16k | -8.7% | -49.2% | -3.7% |
> | baseline-64k | -10.5% | -66.0% | -3.5% |
>
> Speedometer 2.0 (bigger is better):
>
> | kernel | runs_per_min |
> |:----------------|---------------:|
> | baseline-4k | 0.0% |
> | anonfolio-basic | 0.7% |
> | anonfolio | 1.2% |
> | contpte | 3.1% |
> | exefolio | 4.2% |
> | baseline-16k | 5.3% |

Thanks for pushing this forward!

> Changes since RFCv2
> -------------------
>
> - Simplified series to bare minimum (on David Hildenbrand's advice)

My impression is that this series still includes many pieces that can
be split out and discussed separately with followup series.

(I skipped 04/10 and will look at it tomorrow.)

2023-06-27 07:18:22

by Yu Zhao

Subject: Re: [PATCH v1 04/10] mm: Implement folio_add_new_anon_rmap_range()

On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <[email protected]> wrote:
>
> Like folio_add_new_anon_rmap() but batch-rmaps a range of pages
> belonging to a folio, for effciency savings. All pages are accounted as
> small pages.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/rmap.h | 2 ++
> mm/rmap.c | 43 +++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 45 insertions(+)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index a3825ce81102..15433a3d0cbf 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
> unsigned long address);
> void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
> unsigned long address);
> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
> + int nr, struct vm_area_struct *vma, unsigned long address);

We should update folio_add_new_anon_rmap() to support large() &&
!folio_test_pmd_mappable() folios instead.

I double checked all places currently using folio_add_new_anon_rmap(),
and as expected, none actually allocates large() &&
!folio_test_pmd_mappable() and maps it one by one, which makes the
cases simpler, i.e.,
	if (!large())
		// the existing basepage case
	else if (!folio_test_pmd_mappable())
		// our new case
	else
		// the existing THP case
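
For reference, a minimal sketch of the structure being suggested here
(illustrative only; this is neither the series' code nor the mainline
function, and debug checks and per-page exclusivity handling are omitted):

void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
		unsigned long address)
{
	int nr = folio_nr_pages(folio);

	__folio_set_swapbacked(folio);

	if (!folio_test_large(folio)) {
		/* The existing base-page case: one mapcount to initialise. */
		atomic_set(&folio->_mapcount, 0);
	} else if (!folio_test_pmd_mappable(folio)) {
		/* The new case: large folio mapped by PTEs, per-page accounting. */
		int i;

		atomic_set(&folio->_nr_pages_mapped, nr);
		for (i = 0; i < nr; i++)
			atomic_set(&folio_page(folio, i)->_mapcount, 0);
	} else {
		/* The existing THP case: mapped by a single PMD entry. */
		atomic_set(&folio->_entire_mapcount, 0);
		atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
	}

	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
}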

> void page_add_file_rmap(struct page *, struct vm_area_struct *,
> bool compound);
> void folio_add_file_rmap_range(struct folio *, struct page *, unsigned int nr,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1d8369549424..4050bcea7ae7 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1305,6 +1305,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
> __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
> }
>
> +/**
> + * folio_add_new_anon_rmap_range - Add mapping to a set of pages within a new
> + * anonymous potentially large folio.
> + * @folio: The folio containing the pages to be mapped
> + * @page: First page in the folio to be mapped
> + * @nr: Number of pages to be mapped
> + * @vma: the vm area in which the mapping is added
> + * @address: the user virtual address of the first page to be mapped
> + *
> + * Like folio_add_new_anon_rmap() but batch-maps a range of pages within a folio
> + * using non-THP accounting. Like folio_add_new_anon_rmap(), the inc-and-test is
> + * bypassed and the folio does not have to be locked. All pages in the folio are
> + * individually accounted.
> + *
> + * As the folio is new, it's assumed to be mapped exclusively by a single
> + * process.
> + */
> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
> + int nr, struct vm_area_struct *vma, unsigned long address)
> +{
> + int i;
> +
> + VM_BUG_ON_VMA(address < vma->vm_start ||
> + address + (nr << PAGE_SHIFT) > vma->vm_end, vma);

BTW, VM_BUG_ON* shouldn't be used in new code:
Documentation/process/coding-style.rst
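
As a small illustration of the alternative (the exact macro choice here is an
assumption, not something prescribed by the coding-style document), the range
check could warn rather than BUG:

	/* Warn (once) instead of crashing the kernel on a bad range. */
	VM_WARN_ON_ONCE(address < vma->vm_start ||
			address + (nr << PAGE_SHIFT) > vma->vm_end);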

2023-06-27 08:17:23

by Ryan Roberts

Subject: Re: [PATCH v1 04/10] mm: Implement folio_add_new_anon_rmap_range()

On 27/06/2023 08:08, Yu Zhao wrote:
> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <[email protected]> wrote:
>>
>> Like folio_add_new_anon_rmap() but batch-rmaps a range of pages
>> belonging to a folio, for effciency savings. All pages are accounted as
>> small pages.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> include/linux/rmap.h | 2 ++
>> mm/rmap.c | 43 +++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 45 insertions(+)
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index a3825ce81102..15433a3d0cbf 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>> unsigned long address);
>> void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>> unsigned long address);
>> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
>> + int nr, struct vm_area_struct *vma, unsigned long address);
>
> We should update folio_add_new_anon_rmap() to support large() &&
> !folio_test_pmd_mappable() folios instead.
>
> I double checked all places currently using folio_add_new_anon_rmap(),
> and as expected, none actually allocates large() &&
> !folio_test_pmd_mappable() and maps it one by one, which makes the
> cases simpler, i.e.,
> if (!large())
> // the existing basepage case
> else if (!folio_test_pmd_mappable())
> // our new case
> else
> // the existing THP case

I don't have a strong opinion either way. Happy to go with this suggestion. But
the reason I did it as a new function was that I was following the pattern in
[1], which adds a new folio_add_file_rmap_range() function.

[1] https://lore.kernel.org/linux-mm/[email protected]/


>
>> void page_add_file_rmap(struct page *, struct vm_area_struct *,
>> bool compound);
>> void folio_add_file_rmap_range(struct folio *, struct page *, unsigned int nr,
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 1d8369549424..4050bcea7ae7 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1305,6 +1305,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>> __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>> }
>>
>> +/**
>> + * folio_add_new_anon_rmap_range - Add mapping to a set of pages within a new
>> + * anonymous potentially large folio.
>> + * @folio: The folio containing the pages to be mapped
>> + * @page: First page in the folio to be mapped
>> + * @nr: Number of pages to be mapped
>> + * @vma: the vm area in which the mapping is added
>> + * @address: the user virtual address of the first page to be mapped
>> + *
>> + * Like folio_add_new_anon_rmap() but batch-maps a range of pages within a folio
>> + * using non-THP accounting. Like folio_add_new_anon_rmap(), the inc-and-test is
>> + * bypassed and the folio does not have to be locked. All pages in the folio are
>> + * individually accounted.
>> + *
>> + * As the folio is new, it's assumed to be mapped exclusively by a single
>> + * process.
>> + */
>> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
>> + int nr, struct vm_area_struct *vma, unsigned long address)
>> +{
>> + int i;
>> +
>> + VM_BUG_ON_VMA(address < vma->vm_start ||
>> + address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>
> BTW, VM_BUG_ON* shouldn't be used in new code:
> Documentation/process/coding-style.rst

Thanks, sorry about that. Was copy-pasting from folio_add_new_anon_rmap().


2023-06-27 08:31:50

by Yu Zhao

Subject: Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory

On Mon, Jun 26, 2023 at 9:30 PM Yu Zhao <[email protected]> wrote:
>
> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <[email protected]> wrote:
> >
> > Hi All,
> >
> > Following on from the previous RFCv2 [1], this series implements variable order,
> > large folios for anonymous memory. The objective of this is to improve
> > performance by allocating larger chunks of memory during anonymous page faults:
> >
> > - Since SW (the kernel) is dealing with larger chunks of memory than base
> > pages, there are efficiency savings to be had; fewer page faults, batched PTE
> > and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
> > overhead. This should benefit all architectures.
> > - Since we are now mapping physically contiguous chunks of memory, we can take
> > advantage of HW TLB compression techniques. A reduction in TLB pressure
> > speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
> > TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
> >
> > This patch set deals with the SW side of things only and based on feedback from
> > the RFC, aims to be the most minimal initial change, upon which future
> > incremental changes can be added. For this reason, the new behaviour is hidden
> > behind a new Kconfig switch, CONFIG_LARGE_ANON_FOLIO, which is disabled by
> > default. Although the code has been refactored to parameterize the desired order
> > of the allocation, when the feature is disabled (by forcing the order to be
> > always 0) my performance tests measure no regression. So I'm hoping this will be
> > a suitable mechanism to allow incremental submissions to the kernel without
> > affecting the rest of the world.
> >
> > The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
> > [2], which is a hard dependency. I'm not sure of Matthew's exact plans for
> > getting that series into the kernel, but I'm hoping we can start the review
> > process on this patch set independently. I have a branch at [3].
> >
> > I've posted a separate series concerning the HW part (contpte mapping) for arm64
> > at [4].
> >
> >
> > Performance
> > -----------
> >
> > Below results show 2 benchmarks; kernel compilation and speedometer 2.0 (a
> > javascript benchmark running in Chromium). Both cases are running on Ampere
> > Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each benchmark
> > is repeated 15 times over 5 reboots and averaged.
> >
> > All improvements are relative to baseline-4k. 'anonfolio-basic' is this series.
> > 'anonfolio' is the full patch set similar to the RFC with the additional changes
> > to the extra 3 fault paths. The rest of the configs are described at [4].
> >
> > Kernel Compilation (smaller is better):
> >
> > | kernel | real-time | kern-time | user-time |
> > |:----------------|------------:|------------:|------------:|
> > | baseline-4k | 0.0% | 0.0% | 0.0% |
> > | anonfolio-basic | -5.3% | -42.9% | -0.6% |
> > | anonfolio | -5.4% | -46.0% | -0.3% |
> > | contpte | -6.8% | -45.7% | -2.1% |
> > | exefolio | -8.4% | -46.4% | -3.7% |
> > | baseline-16k | -8.7% | -49.2% | -3.7% |
> > | baseline-64k | -10.5% | -66.0% | -3.5% |
> >
> > Speedometer 2.0 (bigger is better):
> >
> > | kernel | runs_per_min |
> > |:----------------|---------------:|
> > | baseline-4k | 0.0% |
> > | anonfolio-basic | 0.7% |
> > | anonfolio | 1.2% |
> > | contpte | 3.1% |
> > | exefolio | 4.2% |
> > | baseline-16k | 5.3% |
>
> Thanks for pushing this forward!
>
> > Changes since RFCv2
> > -------------------
> >
> > - Simplified series to bare minimum (on David Hildenbrand's advice)
>
> My impression is that this series still includes many pieces that can
> be split out and discussed separately with followup series.
>
> (I skipped 04/10 and will look at it tomorrow.)

I went through the series twice. Here is what I think a bare minimum
series (easier to review/debug/land) would look like:
1. a new arch-specific function providing a preferred order within (0,
PMD_ORDER).
2. an extended anon folio alloc API taking that order (02/10, partially).
3. an updated folio_add_new_anon_rmap() covering the large() &&
!pmd_mappable() case (similar to 04/10).
4. s/folio_test_pmd_mappable/folio_test_large/ in page_remove_rmap()
(06/10, reviewed-by provided).
5. finally, use the extended anon folio alloc API with the arch
preferred order in do_anonymous_page() (10/10, partially).

The rest can be split out into separate series and move forward in
parallel with probably a long list of things we need/want to do.

2023-06-27 10:16:59

by Ryan Roberts

Subject: Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory

On 27/06/2023 08:49, Yu Zhao wrote:
> On Mon, Jun 26, 2023 at 9:30 PM Yu Zhao <[email protected]> wrote:
>>
>> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <[email protected]> wrote:
>>>
>>> Hi All,
>>>
>>> Following on from the previous RFCv2 [1], this series implements variable order,
>>> large folios for anonymous memory. The objective of this is to improve
>>> performance by allocating larger chunks of memory during anonymous page faults:
>>>
>>> - Since SW (the kernel) is dealing with larger chunks of memory than base
>>> pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>> and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>>> overhead. This should benefit all architectures.
>>> - Since we are now mapping physically contiguous chunks of memory, we can take
>>> advantage of HW TLB compression techniques. A reduction in TLB pressure
>>> speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>> TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>>
>>> This patch set deals with the SW side of things only and based on feedback from
>>> the RFC, aims to be the most minimal initial change, upon which future
>>> incremental changes can be added. For this reason, the new behaviour is hidden
>>> behind a new Kconfig switch, CONFIG_LARGE_ANON_FOLIO, which is disabled by
>>> default. Although the code has been refactored to parameterize the desired order
>>> of the allocation, when the feature is disabled (by forcing the order to be
>>> always 0) my performance tests measure no regression. So I'm hoping this will be
>>> a suitable mechanism to allow incremental submissions to the kernel without
>>> affecting the rest of the world.
>>>
>>> The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
>>> [2], which is a hard dependency. I'm not sure of Matthew's exact plans for
>>> getting that series into the kernel, but I'm hoping we can start the review
>>> process on this patch set independently. I have a branch at [3].
>>>
>>> I've posted a separate series concerning the HW part (contpte mapping) for arm64
>>> at [4].
>>>
>>>
>>> Performance
>>> -----------
>>>
>>> Below results show 2 benchmarks; kernel compilation and speedometer 2.0 (a
>>> javascript benchmark running in Chromium). Both cases are running on Ampere
>>> Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each benchmark
>>> is repeated 15 times over 5 reboots and averaged.
>>>
>>> All improvements are relative to baseline-4k. 'anonfolio-basic' is this series.
>>> 'anonfolio' is the full patch set similar to the RFC with the additional changes
>>> to the extra 3 fault paths. The rest of the configs are described at [4].
>>>
>>> Kernel Compilation (smaller is better):
>>>
>>> | kernel | real-time | kern-time | user-time |
>>> |:----------------|------------:|------------:|------------:|
>>> | baseline-4k | 0.0% | 0.0% | 0.0% |
>>> | anonfolio-basic | -5.3% | -42.9% | -0.6% |
>>> | anonfolio | -5.4% | -46.0% | -0.3% |
>>> | contpte | -6.8% | -45.7% | -2.1% |
>>> | exefolio | -8.4% | -46.4% | -3.7% |
>>> | baseline-16k | -8.7% | -49.2% | -3.7% |
>>> | baseline-64k | -10.5% | -66.0% | -3.5% |
>>>
>>> Speedometer 2.0 (bigger is better):
>>>
>>> | kernel | runs_per_min |
>>> |:----------------|---------------:|
>>> | baseline-4k | 0.0% |
>>> | anonfolio-basic | 0.7% |
>>> | anonfolio | 1.2% |
>>> | contpte | 3.1% |
>>> | exefolio | 4.2% |
>>> | baseline-16k | 5.3% |
>>
>> Thanks for pushing this forward!
>>
>>> Changes since RFCv2
>>> -------------------
>>>
>>> - Simplified series to bare minimum (on David Hildenbrand's advice)
>>
>> My impression is that this series still includes many pieces that can
>> be split out and discussed separately with followup series.
>>
>> (I skipped 04/10 and will look at it tomorrow.)
>
> I went through the series twice. Here what I think a bare minimum
> series (easier to review/debug/land) would look like:
> 1. a new arch specific function providing a prefered order within (0,
> PMD_ORDER).
> 2. an extended anon folio alloc API taking that order (02/10, partially).
> 3. an updated folio_add_new_anon_rmap() covering the large() &&
> !pmd_mappable() case (similar to 04/10).
> 4. s/folio_test_pmd_mappable/folio_test_large/ in page_remove_rmap()
> (06/10, reviewed-by provided).
> 5. finally, use the extended anon folio alloc API with the arch
> preferred order in do_anonymous_page() (10/10, partially).
>
> The rest can be split out into separate series and move forward in
> parallel with probably a long list of things we need/want to do.

Thanks for the fast review - I really appreciate it!

I've responded to many of your comments. I'd appreciate it if we can close those
points, then I will work up a v2.

Thanks,
Ryan



2023-06-27 10:18:22

by Ryan Roberts

Subject: Re: [PATCH v1 08/10] mm: Kconfig hooks to determine max anon folio allocation order

On 27/06/2023 03:47, Yu Zhao wrote:
> On Mon, Jun 26, 2023 at 11:15 AM Ryan Roberts <[email protected]> wrote:
>>
>> For variable-order anonymous folios, we need to determine the order that
>> we will allocate. From a SW perspective, the higher the order we
>> allocate, the less overhead we will have; fewer faults, fewer folios in
>> lists, etc. But of course there will also be more memory wastage as the
>> order increases.
>>
>> From a HW perspective, there are memory block sizes that can be
>> beneficial to reducing TLB pressure. arm64, for example, has the ability
>> to map "contpte" sized chunks (64K for a 4K base page, 2M for 16K and
>> 64K base pages) such that one of these chunks only uses a single TLB
>> entry.
>>
>> So we let the architecture specify the order of the maximally beneficial
>> mapping unit when PTE-mapped. Furthermore, because in some cases, this
>> order may be quite big (and therefore potentially wasteful of memory),
>> allow the arch to specify 2 values; One is the max order for a mapping
>> that _would not_ use THP if all size and alignment constraints were met,
>> and the other is the max order for a mapping that _would_ use THP if all
>> those constraints were met.
>>
>> Implement this with Kconfig by introducing some new options to allow the
>> architecture to declare that it supports large anonymous folios along
>> with these 2 preferred max order values. Then introduce a user-facing
>> option, LARGE_ANON_FOLIO, which defaults to disabled and can only be
>> enabled if the architecture has declared its support. When disabled, it
>> forces the max order values, LARGE_ANON_FOLIO_NOTHP_ORDER_MAX and
>> LARGE_ANON_FOLIO_THP_ORDER_MAX to 0, meaning only a single page is ever
>> allocated.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> mm/Kconfig | 39 +++++++++++++++++++++++++++++++++++++++
>> mm/memory.c | 8 ++++++++
>> 2 files changed, 47 insertions(+)
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 7672a22647b4..f4ba48c37b75 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1208,4 +1208,43 @@ config PER_VMA_LOCK
>>
>> source "mm/damon/Kconfig"
>>
>> +config ARCH_SUPPORTS_LARGE_ANON_FOLIO
>> + def_bool n
>> + help
>> + An arch should select this symbol if wants to allow LARGE_ANON_FOLIO
>> + to be enabled. It must also set the following integer values:
>> + - ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
>> + - ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
>> +
>> +config ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
>> + int
>> + help
>> + The maximum size of folio to allocate for an anonymous VMA PTE-mapping
>> + that does not have the MADV_HUGEPAGE hint set.
>> +
>> +config ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
>> + int
>> + help
>> + The maximum size of folio to allocate for an anonymous VMA PTE-mapping
>> + that has the MADV_HUGEPAGE hint set.
>> +
>> +config LARGE_ANON_FOLIO
>> + bool "Allocate large folios for anonymous memory"
>> + depends on ARCH_SUPPORTS_LARGE_ANON_FOLIO
>> + default n
>> + help
>> + Use large (bigger than order-0) folios to back anonymous memory where
>> + possible. This reduces the number of page faults, as well as other
>> + per-page overheads to improve performance for many workloads.
>> +
>> +config LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
>> + int
>> + default 0 if !LARGE_ANON_FOLIO
>> + default ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
>> +
>> +config LARGE_ANON_FOLIO_THP_ORDER_MAX
>> + int
>> + default 0 if !LARGE_ANON_FOLIO
>> + default ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
>> +
>> endmenu
>
> I don't think an MVP should add this many Kconfigs. One Kconfig sounds
> reasonable to me for now.

If we move to arch_wants_pte_order() as you suggested (in your response to patch
3) then I agree we can remove most of these. I still think we might want 2
though. For an arch that does not implement arch_wants_pte_order() we wouldn't
want LARGE_ANON_FOLIO to show up in menuconfig so we would still need
ARCH_SUPPORTS_LARGE_ANON_FOLIO:


config ARCH_SUPPORTS_LARGE_ANON_FOLIO
	def_bool n
	help
	  An arch should select this symbol if it wants to allow LARGE_ANON_FOLIO
	  to be enabled. In this case, it must also define arch_wants_pte_order()

config LARGE_ANON_FOLIO
	bool "Allocate large folios for anonymous memory"
	depends on ARCH_SUPPORTS_LARGE_ANON_FOLIO
	default n
	help
	  Use large (bigger than order-0) folios to back anonymous memory where
	  possible. This reduces the number of page faults, as well as other
	  per-page overheads to improve performance for many workloads.

What do you think?
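
For illustration, a minimal sketch of the arch_wants_pte_order() hook being
discussed (the generic fallback below and the arm64 value mentioned after it
are assumptions, not code from this series):

/*
 * Generic fallback, e.g. in include/linux/pgtable.h: order-0 unless the
 * architecture overrides it, so the feature stays inert elsewhere.
 */
#ifndef arch_wants_pte_order
static inline int arch_wants_pte_order(struct vm_area_struct *vma)
{
	return 0;
}
#endif

An arm64 override could then return its contpte block order, e.g.
CONT_PTE_SHIFT - PAGE_SHIFT (order 4, i.e. 64K, with 4K base pages).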


2023-06-28 02:26:10

by Yin, Fengwei

Subject: Re: [PATCH v1 04/10] mm: Implement folio_add_new_anon_rmap_range()



On 6/27/23 16:09, Ryan Roberts wrote:
> On 27/06/2023 08:08, Yu Zhao wrote:
>> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <[email protected]> wrote:
>>>
>>> Like folio_add_new_anon_rmap() but batch-rmaps a range of pages
>>> belonging to a folio, for effciency savings. All pages are accounted as
>>> small pages.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>> include/linux/rmap.h | 2 ++
>>> mm/rmap.c | 43 +++++++++++++++++++++++++++++++++++++++++++
>>> 2 files changed, 45 insertions(+)
>>>
>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>> index a3825ce81102..15433a3d0cbf 100644
>>> --- a/include/linux/rmap.h
>>> +++ b/include/linux/rmap.h
>>> @@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>>> unsigned long address);
>>> void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>>> unsigned long address);
>>> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
>>> + int nr, struct vm_area_struct *vma, unsigned long address);
>>
>> We should update folio_add_new_anon_rmap() to support large() &&
>> !folio_test_pmd_mappable() folios instead.
>>
>> I double checked all places currently using folio_add_new_anon_rmap(),
>> and as expected, none actually allocates large() &&
>> !folio_test_pmd_mappable() and maps it one by one, which makes the
>> cases simpler, i.e.,
>> if (!large())
>> // the existing basepage case
>> else if (!folio_test_pmd_mappable())
>> // our new case
>> else
>> // the existing THP case
>
> I don't have a strong opinion either way. Happy to go with this suggestion. But
> the reason I did it as a new function was because I was following the pattern in
> [1] which adds a new folio_add_file_rmap_range() function.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
Oh. There is a difference here:
For the page cache, a large folio could have been created by a previous file
access, but a later file access by another process may only need to map part of
that large folio. In that case, we need the _range variant for the filemap.

But for anonymous memory, I suppose we always map the whole folio in, so I agree
with Yu. We don't need a _range variant for folio_add_new_anon_rmap(). Thanks.


Regards
Yin, Fengwei

>
>
>>
>>> void page_add_file_rmap(struct page *, struct vm_area_struct *,
>>> bool compound);
>>> void folio_add_file_rmap_range(struct folio *, struct page *, unsigned int nr,
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 1d8369549424..4050bcea7ae7 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1305,6 +1305,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>>> __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>>> }
>>>
>>> +/**
>>> + * folio_add_new_anon_rmap_range - Add mapping to a set of pages within a new
>>> + * anonymous potentially large folio.
>>> + * @folio: The folio containing the pages to be mapped
>>> + * @page: First page in the folio to be mapped
>>> + * @nr: Number of pages to be mapped
>>> + * @vma: the vm area in which the mapping is added
>>> + * @address: the user virtual address of the first page to be mapped
>>> + *
>>> + * Like folio_add_new_anon_rmap() but batch-maps a range of pages within a folio
>>> + * using non-THP accounting. Like folio_add_new_anon_rmap(), the inc-and-test is
>>> + * bypassed and the folio does not have to be locked. All pages in the folio are
>>> + * individually accounted.
>>> + *
>>> + * As the folio is new, it's assumed to be mapped exclusively by a single
>>> + * process.
>>> + */
>>> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
>>> + int nr, struct vm_area_struct *vma, unsigned long address)
>>> +{
>>> + int i;
>>> +
>>> + VM_BUG_ON_VMA(address < vma->vm_start ||
>>> + address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>>
>> BTW, VM_BUG_ON* shouldn't be used in new code:
>> Documentation/process/coding-style.rst
>
> Thanks, sorry about that. Was copy-pasting from folio_add_new_anon_rmap().
>

2023-06-28 02:26:58

by Yin, Fengwei

Subject: Re: [PATCH v1 04/10] mm: Implement folio_add_new_anon_rmap_range()



On 6/27/23 15:08, Yu Zhao wrote:
> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <[email protected]> wrote:
>>
>> Like folio_add_new_anon_rmap() but batch-rmaps a range of pages
>> belonging to a folio, for effciency savings. All pages are accounted as
>> small pages.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> include/linux/rmap.h | 2 ++
>> mm/rmap.c | 43 +++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 45 insertions(+)
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index a3825ce81102..15433a3d0cbf 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>> unsigned long address);
>> void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>> unsigned long address);
>> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
>> + int nr, struct vm_area_struct *vma, unsigned long address);
>
> We should update folio_add_new_anon_rmap() to support large() &&
> !folio_test_pmd_mappable() folios instead.
>
> I double checked all places currently using folio_add_new_anon_rmap(),
> and as expected, none actually allocates large() &&
> !folio_test_pmd_mappable() and maps it one by one, which makes the
> cases simpler, i.e.,
> if (!large())
> // the existing basepage case
> else if (!folio_test_pmd_mappable())
> // our new case
> else
> // the existing THP case
I suppose we can merge the new case and the existing THP case.


Regards
Yin, Fengwei

>
>> void page_add_file_rmap(struct page *, struct vm_area_struct *,
>> bool compound);
>> void folio_add_file_rmap_range(struct folio *, struct page *, unsigned int nr,
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 1d8369549424..4050bcea7ae7 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1305,6 +1305,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>> __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>> }
>>
>> +/**
>> + * folio_add_new_anon_rmap_range - Add mapping to a set of pages within a new
>> + * anonymous potentially large folio.
>> + * @folio: The folio containing the pages to be mapped
>> + * @page: First page in the folio to be mapped
>> + * @nr: Number of pages to be mapped
>> + * @vma: the vm area in which the mapping is added
>> + * @address: the user virtual address of the first page to be mapped
>> + *
>> + * Like folio_add_new_anon_rmap() but batch-maps a range of pages within a folio
>> + * using non-THP accounting. Like folio_add_new_anon_rmap(), the inc-and-test is
>> + * bypassed and the folio does not have to be locked. All pages in the folio are
>> + * individually accounted.
>> + *
>> + * As the folio is new, it's assumed to be mapped exclusively by a single
>> + * process.
>> + */
>> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
>> + int nr, struct vm_area_struct *vma, unsigned long address)
>> +{
>> + int i;
>> +
>> + VM_BUG_ON_VMA(address < vma->vm_start ||
>> + address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>
> BTW, VM_BUG_ON* shouldn't be used in new code:
> Documentation/process/coding-style.rst

2023-06-28 11:36:47

by Ryan Roberts

Subject: Re: [PATCH v1 04/10] mm: Implement folio_add_new_anon_rmap_range()

On 28/06/2023 03:20, Yin Fengwei wrote:
>
>
> On 6/27/23 16:09, Ryan Roberts wrote:
>> On 27/06/2023 08:08, Yu Zhao wrote:
>>> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> Like folio_add_new_anon_rmap() but batch-rmaps a range of pages
>>>> belonging to a folio, for effciency savings. All pages are accounted as
>>>> small pages.
>>>>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>> ---
>>>> include/linux/rmap.h | 2 ++
>>>> mm/rmap.c | 43 +++++++++++++++++++++++++++++++++++++++++++
>>>> 2 files changed, 45 insertions(+)
>>>>
>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>>> index a3825ce81102..15433a3d0cbf 100644
>>>> --- a/include/linux/rmap.h
>>>> +++ b/include/linux/rmap.h
>>>> @@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>>>> unsigned long address);
>>>> void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>>>> unsigned long address);
>>>> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
>>>> + int nr, struct vm_area_struct *vma, unsigned long address);
>>>
>>> We should update folio_add_new_anon_rmap() to support large() &&
>>> !folio_test_pmd_mappable() folios instead.
>>>
>>> I double checked all places currently using folio_add_new_anon_rmap(),
>>> and as expected, none actually allocates large() &&
>>> !folio_test_pmd_mappable() and maps it one by one, which makes the
>>> cases simpler, i.e.,
>>> if (!large())
>>> // the existing basepage case
>>> else if (!folio_test_pmd_mappable())
>>> // our new case
>>> else
>>> // the existing THP case
>>
>> I don't have a strong opinion either way. Happy to go with this suggestion. But
>> the reason I did it as a new function was because I was following the pattern in
>> [1] which adds a new folio_add_file_rmap_range() function.
>>
>> [1] https://lore.kernel.org/linux-mm/[email protected]/
> Oh. There is different here:
> For page cache, large folio could be created by previous file access. But later
> file access by other process just need map partial large folio. In this case, we need
> _range for filemap.
>
> But for anonymous, I suppose we always map whole folio in. So I agree with Yu. We
> don't need _range for folio_add_new_anon_rmap(). Thanks.

Yes that makes sense - thanks. I'll merge the new case into
folio_add_new_anon_rmap() for v2.

>
>
> Regards
> Yin, Fengwei
>
>>
>>
>>>
>>>> void page_add_file_rmap(struct page *, struct vm_area_struct *,
>>>> bool compound);
>>>> void folio_add_file_rmap_range(struct folio *, struct page *, unsigned int nr,
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 1d8369549424..4050bcea7ae7 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1305,6 +1305,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>>>> __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>>>> }
>>>>
>>>> +/**
>>>> + * folio_add_new_anon_rmap_range - Add mapping to a set of pages within a new
>>>> + * anonymous potentially large folio.
>>>> + * @folio: The folio containing the pages to be mapped
>>>> + * @page: First page in the folio to be mapped
>>>> + * @nr: Number of pages to be mapped
>>>> + * @vma: the vm area in which the mapping is added
>>>> + * @address: the user virtual address of the first page to be mapped
>>>> + *
>>>> + * Like folio_add_new_anon_rmap() but batch-maps a range of pages within a folio
>>>> + * using non-THP accounting. Like folio_add_new_anon_rmap(), the inc-and-test is
>>>> + * bypassed and the folio does not have to be locked. All pages in the folio are
>>>> + * individually accounted.
>>>> + *
>>>> + * As the folio is new, it's assumed to be mapped exclusively by a single
>>>> + * process.
>>>> + */
>>>> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
>>>> + int nr, struct vm_area_struct *vma, unsigned long address)
>>>> +{
>>>> + int i;
>>>> +
>>>> + VM_BUG_ON_VMA(address < vma->vm_start ||
>>>> + address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>>>
>>> BTW, VM_BUG_ON* shouldn't be used in new code:
>>> Documentation/process/coding-style.rst
>>
>> Thanks, sorry about that. Was copy-pasting from folio_add_new_anon_rmap().
>>


2023-06-28 18:47:39

by Yu Zhao

Subject: Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory

On Tue, Jun 27, 2023 at 3:59 AM Ryan Roberts <[email protected]> wrote:
>
> On 27/06/2023 08:49, Yu Zhao wrote:
> > On Mon, Jun 26, 2023 at 9:30 PM Yu Zhao <[email protected]> wrote:
> >>
> >> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <[email protected]> wrote:
> >>>
> >>> [... cover letter and performance results snipped ...]
> >>
> >> Thanks for pushing this forward!
> >>
> >>> Changes since RFCv2
> >>> -------------------
> >>>
> >>> - Simplified series to bare minimum (on David Hildenbrand's advice)
> >>
> >> My impression is that this series still includes many pieces that can
> >> be split out and discussed separately with followup series.
> >>
> >> (I skipped 04/10 and will look at it tomorrow.)
> >
> > I went through the series twice. Here is what I think a bare minimum
> > series (easier to review/debug/land) would look like:

===

> > 1. a new arch specific function providing a preferred order within (0,
> > PMD_ORDER).
> > 2. an extended anon folio alloc API taking that order (02/10, partially).
> > 3. an updated folio_add_new_anon_rmap() covering the large() &&
> > !pmd_mappable() case (similar to 04/10).
> > 4. s/folio_test_pmd_mappable/folio_test_large/ in page_remove_rmap()
> > (06/10, reviewed-by provided).
> > 5. finally, use the extended anon folio alloc API with the arch
> > preferred order in do_anonymous_page() (10/10, partially).

===

> > The rest can be split out into separate series and move forward in
> > parallel with probably a long list of things we need/want to do.
>
> Thanks for the fast review - I really appreciate it!
>
> I've responded to many of your comments. I'd appreciate if we can close those
> points then I will work up a v2.

Thanks!

Based on the latest discussion here [1], my original list above can be
optionally reduced to 4 patches: item 2 can be squashed into item 5.

Also please make sure we have only one global (apply to all archs)
Kconfig option, and it should be added in item 5:

if TRANSPARENT_HUGEPAGE
config FLEXIBLE/VARIABLE_THP # or whatever name you see fit
endif

(How many new Kconfig options added within arch/arm64/ is not a concern of MM.)

And please make sure it's disabled by default, because we are still
missing many important functions, e.g., I don't think we can mlock()
when large() && !pmd_mappable(), see mlock_pte_range() and
mlock_vma_folio(). We can fix it along with many things later, but we
need to present a plan and a schedule now. Otherwise, there would be
pushback if we try to land the series without supporting mlock().

Do you or Fengwei plan to take on it? (I personally don't.) If not,
I'll try to find someone from our team to look at it. (It'd be more
scalable if we have a coordinated group of people individually solving
different problems.)

[1] https://lore.kernel.org/r/[email protected]/

2023-06-29 00:01:17

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory

Hi Yu,

On 6/29/23 02:22, Yu Zhao wrote:
> And please make sure it's disabled by default, because we are still
> missing many important functions, e.g., I don't think we can mlock()
> when large() && !pmd_mappable(), see mlock_pte_range() and
> mlock_vma_folio(). We can fix it along with many things later, but we
> need to present a plan and a schedule now. Otherwise, there would be
> pushback if we try to land the series without supporting mlock().
>
> Do you or Fengwei plan to take on it? (I personally don't.) If not,
Do you mean the mlock() with large folio? Yes. I can work on it. Thanks.


Regards
Yin, Fengwei

> I'll try to find someone from our team to look at it. (It'd be more
> scalable if we have a coordinated group of people individually solving
> different problems.)

2023-06-29 00:34:57

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory

On Wed, Jun 28, 2023 at 5:59 PM Yin Fengwei <[email protected]> wrote:
>
> Hi Yu,
>
> On 6/29/23 02:22, Yu Zhao wrote:
> > And please make sure it's disabled by default, because we are still
> > missing many important functions, e.g., I don't think we can mlock()
> > when large() && !pmd_mappable(), see mlock_pte_range() and
> > mlock_vma_folio(). We can fix it along with many things later, but we
> > need to present a plan and a schedule now. Otherwise, there would be
> > pushback if we try to land the series without supporting mlock().
> >
> > Do you or Fengwei plan to take on it? (I personally don't.) If not,
> Do you mean the mlock() with large folio? Yes. I can work on it. Thanks.

Great. Thanks!

Other places that have a similar problem but are probably easier to
fix than the mlock() case:
* madvise_cold_or_pageout_pte_range()
* shrink_folio_list()

2023-06-29 00:50:39

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory



On 6/29/23 08:27, Yu Zhao wrote:
> On Wed, Jun 28, 2023 at 5:59 PM Yin Fengwei <[email protected]> wrote:
>>
>> Hi Yu,
>>
>> On 6/29/23 02:22, Yu Zhao wrote:
>>> And please make sure it's disabled by default, because we are still
>>> missing many important functions, e.g., I don't think we can mlock()
>>> when large() && !pmd_mappable(), see mlock_pte_range() and
>>> mlock_vma_folio(). We can fix it along with many things later, but we
>>> need to present a plan and a schedule now. Otherwise, there would be
>>> pushback if we try to land the series without supporting mlock().
>>>
>>> Do you or Fengwei plan to take on it? (I personally don't.) If not,
>> Do you mean the mlock() with large folio? Yes. I can work on it. Thanks.
>
> Great. Thanks!
>
> Other places that have a similar problem but are probably easier to
> fix than the mlock() case:
> * madvise_cold_or_pageout_pte_range()
This one was on my radar. :).

Regards
Yin, Fengwei

> * shrink_folio_list()

2023-06-29 01:59:30

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH v1 08/10] mm: Kconfig hooks to determine max anon folio allocation order

On Mon, Jun 26, 2023 at 10:15 AM Ryan Roberts <[email protected]> wrote:
>
> For variable-order anonymous folios, we need to determine the order that
> we will allocate. From a SW perspective, the higher the order we
> allocate, the less overhead we will have; fewer faults, fewer folios in
> lists, etc. But of course there will also be more memory wastage as the
> order increases.
>
> From a HW perspective, there are memory block sizes that can be
> beneficial to reducing TLB pressure. arm64, for example, has the ability
> to map "contpte" sized chunks (64K for a 4K base page, 2M for 16K and
> 64K base pages) such that one of these chunks only uses a single TLB
> entry.
>
> So we let the architecture specify the order of the maximally beneficial
> mapping unit when PTE-mapped. Furthermore, because in some cases, this
> order may be quite big (and therefore potentially wasteful of memory),
> allow the arch to specify 2 values: one is the max order for a mapping
> that _would not_ use THP if all size and alignment constraints were met,
> and the other is the max order for a mapping that _would_ use THP if all
> those constraints were met.
>
> Implement this with Kconfig by introducing some new options to allow the
> architecture to declare that it supports large anonymous folios along
> with these 2 preferred max order values. Then introduce a user-facing
> option, LARGE_ANON_FOLIO, which defaults to disabled and can only be
> enabled if the architecture has declared its support. When disabled, it
> forces the max order values, LARGE_ANON_FOLIO_NOTHP_ORDER_MAX and
> LARGE_ANON_FOLIO_THP_ORDER_MAX to 0, meaning only a single page is ever
> allocated.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> mm/Kconfig | 39 +++++++++++++++++++++++++++++++++++++++
> mm/memory.c | 8 ++++++++
> 2 files changed, 47 insertions(+)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 7672a22647b4..f4ba48c37b75 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1208,4 +1208,43 @@ config PER_VMA_LOCK
>
> source "mm/damon/Kconfig"
>
> +config ARCH_SUPPORTS_LARGE_ANON_FOLIO
> + def_bool n
> + help
> + An arch should select this symbol if it wants to allow LARGE_ANON_FOLIO
> + to be enabled. It must also set the following integer values:
> + - ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
> + - ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
> +
> +config ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
> + int
> + help
> + The maximum size of folio to allocate for an anonymous VMA PTE-mapping
> + that does not have the MADV_HUGEPAGE hint set.
> +
> +config ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
> + int
> + help
> + The maximum size of folio to allocate for an anonymous VMA PTE-mapping
> + that has the MADV_HUGEPAGE hint set.
> +
> +config LARGE_ANON_FOLIO
> + bool "Allocate large folios for anonymous memory"
> + depends on ARCH_SUPPORTS_LARGE_ANON_FOLIO
> + default n
> + help
> + Use large (bigger than order-0) folios to back anonymous memory where
> + possible. This reduces the number of page faults, as well as other
> + per-page overheads to improve performance for many workloads.
> +
> +config LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
> + int
> + default 0 if !LARGE_ANON_FOLIO
> + default ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
> +
> +config LARGE_ANON_FOLIO_THP_ORDER_MAX
> + int
> + default 0 if !LARGE_ANON_FOLIO
> + default ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
> +

IMHO I don't think we need all of the new Kconfigs. Ideally the large
anon folios could be supported by all arches, although some of them
may not benefit from larger TLB entries due to lack of hardware
support.

For now with a minimum implementation, I think you could define a
macro or a function that returns the hardware preferred order.
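
For example (names made up purely for illustration), a generic fallback could
default to order-0 and an arch like arm64 could override it with its contpte
order:

/* common header (sketch): default when the arch expresses no preference */
#ifndef arch_max_anon_folio_order
#define arch_max_anon_folio_order()	0
#endif

/* arch/arm64 (sketch): prefer the contpte block size */
#define arch_max_anon_folio_order()	(CONT_PTE_SHIFT - PAGE_SHIFT)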

> endmenu
> diff --git a/mm/memory.c b/mm/memory.c
> index 9165ed1b9fc2..a8f7e2b28d7a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3153,6 +3153,14 @@ static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
> return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
> }
>
> +static inline int max_anon_folio_order(struct vm_area_struct *vma)
> +{
> + if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> + return CONFIG_LARGE_ANON_FOLIO_THP_ORDER_MAX;
> + else
> + return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
> +}
> +
> /*
> * Handle write page faults for pages that can be reused in the current vma
> *
> --
> 2.25.1
>
>

2023-06-29 02:42:47

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory

On Tue, Jun 27, 2023 at 12:49 AM Yu Zhao <[email protected]> wrote:
>
> On Mon, Jun 26, 2023 at 9:30 PM Yu Zhao <[email protected]> wrote:
> >
> > On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <[email protected]> wrote:
> > >
> > > [... cover letter and performance results snipped ...]
> >
> > Thanks for pushing this forward!
> >
> > > Changes since RFCv2
> > > -------------------
> > >
> > > - Simplified series to bare minimum (on David Hildenbrand's advice)
> >
> > My impression is that this series still includes many pieces that can
> > be split out and discussed separately with followup series.
> >
> > (I skipped 04/10 and will look at it tomorrow.)
>
> I went through the series twice. Here is what I think a bare minimum
> series (easier to review/debug/land) would look like:
> 1. a new arch specific function providing a preferred order within (0,
> PMD_ORDER).
> 2. an extended anon folio alloc API taking that order (02/10, partially).
> 3. an updated folio_add_new_anon_rmap() covering the large() &&
> !pmd_mappable() case (similar to 04/10).
> 4. s/folio_test_pmd_mappable/folio_test_large/ in page_remove_rmap()
> (06/10, reviewed-by provided).
> 5. finally, use the extended anon folio alloc API with the arch
> preferred order in do_anonymous_page() (10/10, partially).
>
> The rest can be split out into separate series and move forward in
> parallel with probably a long list of things we need/want to do.

Yeah, the suggestion makes sense to me. For the time being I'd like to go with
the simplest approach unless there is strong justification for extra
optimization, IMHO.

>

2023-06-29 12:06:50

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 08/10] mm: Kconfig hooks to determine max anon folio allocation order

On 29/06/2023 02:38, Yang Shi wrote:
> On Mon, Jun 26, 2023 at 10:15 AM Ryan Roberts <[email protected]> wrote:
>>
>> [... commit message and patch snipped; quoted in full in the previous mail ...]
>
> IMHO I don't think we need all of the new Kconfigs. Ideally the large
> anon folios could be supported by all arches, although some of them
> may not benefit from larger TLB entries due to lack of hardware
> support.
>
> For now with a minimum implementation, I think you could define a
> macro or a function that returns the hardware preferred order.

Thanks for the feedback - that aligns with what Yu Zhao suggested. I'm
implementing it for v2.

Thanks,
Ryan




2023-06-29 15:59:18

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory

On 28/06/2023 19:22, Yu Zhao wrote:
> On Tue, Jun 27, 2023 at 3:59 AM Ryan Roberts <[email protected]> wrote:
>>
>> On 27/06/2023 08:49, Yu Zhao wrote:
>>> On Mon, Jun 26, 2023 at 9:30 PM Yu Zhao <[email protected]> wrote:
>>>>
>>>> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <[email protected]> wrote:
>>>>>
>>>>> [... cover letter and performance results snipped ...]
>>>>
>>>> Thanks for pushing this forward!
>>>>
>>>>> Changes since RFCv2
>>>>> -------------------
>>>>>
>>>>> - Simplified series to bare minimum (on David Hildenbrand's advice)
>>>>
>>>> My impression is that this series still includes many pieces that can
>>>> be split out and discussed separately with followup series.
>>>>
>>>> (I skipped 04/10 and will look at it tomorrow.)
>>>
>>> I went through the series twice. Here is what I think a bare minimum
>>> series (easier to review/debug/land) would look like:
>
> ===
>
>>> 1. a new arch specific function providing a preferred order within (0,
>>> PMD_ORDER).
>>> 2. an extended anon folio alloc API taking that order (02/10, partially).
>>> 3. an updated folio_add_new_anon_rmap() covering the large() &&
>>> !pmd_mappable() case (similar to 04/10).
>>> 4. s/folio_test_pmd_mappable/folio_test_large/ in page_remove_rmap()
>>> (06/10, reviewed-by provided).
>>> 5. finally, use the extended anon folio alloc API with the arch
>>> preferred order in do_anonymous_page() (10/10, partially).
>
> ===
>
>>> The rest can be split out into separate series and move forward in
>>> parallel with probably a long list of things we need/want to do.
>>
>> Thanks for the fast review - I really appreciate it!
>>
>> I've responded to many of your comments. I'd appreciate if we can close those
>> points then I will work up a v2.
>
> Thanks!
>
> Based on the latest discussion here [1], my original list above can be
> optionally reduced to 4 patches: item 2 can be squashed into item 5.
>
> Also please make sure we have only one global (apply to all archs)
> Kconfig option, and it should be added in item 5:
>
> if TRANSPARENT_HUGEPAGE
> config FLEXIBLE/VARIABLE_THP # or whatever name you see fit
> endif

Naming is always the hardest part. I've been calling it LARGE_ANON_FOLIO up
until now. But I think you are right that we should show that it is related to
THP, so I'll go with FLEXIBLE_THP for v2, and let people shout if they hate it.
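
Concretely, something like this is what I have in mind (a sketch; the prompt
and help text are placeholders):

if TRANSPARENT_HUGEPAGE

config FLEXIBLE_THP
	bool "Flexible order THP"
	default n
	help
	  Use large folios of a flexible order (larger than order-0 but smaller
	  than PMD-order) to back anonymous memory where possible. This reduces
	  page faults and other per-page overheads.

endif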

If we are not letting the arch declare that it supports FLEXIBLE_THP, then I
think the default version of arch_wants_pte_order() needs to return a value
higher than 0 (which is what I have it returning at the moment). Otherwise,
for an arch that hasn't defined its own version of arch_wants_pte_order(),
FLEXIBLE_THP on vs off would give the same result. So I propose to set the
default to ilog2(SZ_64K >> PAGE_SHIFT). Shout if you have any concerns.
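
i.e. the generic fallback would look something like this (sketch only; whether
it ends up taking the vma or any other arguments is still open):

/* The arch overrides this by providing its own definition, as usual. */
#ifndef arch_wants_pte_order
static inline int arch_wants_pte_order(struct vm_area_struct *vma)
{
	/* No arch preference expressed: target 64K folios. */
	return ilog2(SZ_64K >> PAGE_SHIFT);
}
#endif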

>
> (How many new Kconfig options added within arch/arm64/ is not a concern of MM.)
>
> And please make sure it's disabled by default,

Done

> because we are still
> missing many important functions, e.g., I don't think we can mlock()
> when large() && !pmd_mappable(), see mlock_pte_range() and
> mlock_vma_folio(). We can fix it along with many things later, but we
> need to present a plan and a schedule now. Otherwise, there would be
> pushback if we try to land the series without supporting mlock().

There are other areas that I'm aware of. I'll put together a table and send it
out once I have v2 out the door (hopefully tomorrow or Monday). We can then
work together to fill it in and figure out who can do what. I'm certainly
planning to continue pushing this work forwards beyond this initial patch set.

Thanks,
Ryan

>
> Do you or Fengwei plan to take on it? (I personally don't.) If not,
> I'll try to find someone from our team to look at it. (It'd be more
> scalable if we have a coordinated group of people individually solving
> different problems.)
>
> [1] https://lore.kernel.org/r/[email protected]/